Use ASCII KCODE to prevent problems like missing characters or matching failures...

author Giuseppe Bilotta <giuseppe.bilotta@gmail.com>

Tue, 20 Feb 2007 23:02:35 +0000 (23:02 +0000)

committer Giuseppe Bilotta <giuseppe.bilotta@gmail.com>

Tue, 20 Feb 2007 23:02:35 +0000 (23:02 +0000)
diff --git a/ChangeLog b/ChangeLog
index 358aab5fe233d2af11155c7e30400aa46b59b892..403e8c4129156cb5b8f8e83ff40189f179e93b4a 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -6,6 +6,16 @@
         <yaohan.chen@gmail.com>. People take turns to continue a chain of
         words by saying words that begin with the final letter(s) of the
         previous word.
+       * IRC messages are not UTF-8: Most of the string processing across
+       rbot is done against IRC messages, which do not have a well-defined
+       encoding. Although many clients are now using UTF-8, there is no
+       guarantee that an arbitrary string received from IRC will be UTF-8
+       encoded. We have to force ASCII (byte-wise/charset agnostic) matching
+       because otherwise some strings can give problems: in particular, for
+       example, the bytesequence "\340\350\354\362\371" (that is the aeiou
+       vowels, each with a grave accent) will cause the string to be
+       considered up to the "\354" (i with grave accent) only: so either the
+       rest of the message is ignored, or the matching fails.
  
  2007-02-18  Giuseppe Bilotta <giuseppe.bilotta@gmail.com>
  
diff --git a/bin/rbot b/bin/rbot
index 5872e3e1c7bd43683debecfa417294e6369ce987..b7a6207f7e09368189aae6d4ce484704eb63654e 100755
--- a/bin/rbot
+++ b/bin/rbot
@@ -21,7 +21,16 @@
  # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
  # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  
-$KCODE = 'u'
+# Most of the string processing across rbot is done against IRC messages, which
+# do not have a well-defined encoding. Although many clients are now using
+# UTF-8, there is no guarantee that an arbitrary string received from IRC will
+# be UTF-8 encoded. We have to force ASCII (byte-wise/charset agnostic)
+# matching because otherwise some strings can give problems: in particular, for
+# example, the bytesequence "\340\350\354\362\371" (that is the aeiou vowels,
+# each with a grave accent) will cause the string to be considered up to the
+# "\354" (i with grave accent) only: so either the rest of the message is
+# ignored, or the matching fails.
+$KCODE = 'a'
  
  $VERBOSE=true
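
The comment in this hunk concerns Ruby 1.8's global `$KCODE`. As a rough illustration only, the sketch below checks the same byte sequence in current Ruby, where encodings are attached to each string instead. Treating the bytes as ISO-8859-1 is an assumption made for the example, based on the "vowels with a grave accent" description; IRC itself gives no such guarantee.

```ruby
# The five bytes quoted in the comment, kept as raw binary data.
raw = "\340\350\354\362\371".b

# Read as UTF-8, the sequence is malformed, so UTF-8-aware matching
# cannot be relied on for arbitrary IRC input.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
utf8_ok = as_utf8.valid_encoding?

# Byte-wise matching (the /n flag) treats every byte as one character,
# so the whole string is seen regardless of what the bytes mean.
bytewise_match = !(raw =~ /\A.{5}\z/n).nil?

# Assumption for illustration: the sender used ISO-8859-1.
decoded = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts "valid UTF-8:     #{utf8_ok}"
puts "byte-wise match: #{bytewise_match}"
puts "as ISO-8859-1:   #{decoded}"
```

The binary copy is never modified; each interpretation works on a `dup`, which mirrors the charset-agnostic handling the commit aims for.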
  
diff --git a/lib/rbot/rfc2812.rb b/lib/rbot/rfc2812.rb
index 5dec464c1525013d0908f19619c31ed28ecaeeb2..97181b039a67c8f3b8498b78eae52f6425f0d39f 100644
--- a/lib/rbot/rfc2812.rb
+++ b/lib/rbot/rfc2812.rb
@@ -888,8 +888,8 @@ module Irc
        data = Hash.new
        data[:serverstring] = serverstring
  
-      unless serverstring =~ /^(:(\S+)\s)?(\S+)(\s(.*))?/
-        raise "Unparseable Server Message!!!: #{serverstring}"
+      unless serverstring.chomp =~ /^(:(\S+)\s)?(\S+)(\s(.*))?$/
+        raise "Unparseable Server Message!!!: #{serverstring.inspect}"
        end
  
        prefix, command, params = $2, $3, $5
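
The revised pattern in this hunk can be tried on its own with a minimal sketch like the one below. `split_server_message` and `SERVER_LINE` are hypothetical names introduced for the example, not part of rbot, and the sample lines are made up. The trailing `$` after `chomp` means the pattern has to account for the whole line, so a match that stops partway is reported as a failure instead of silently dropping the rest of the message.

```ruby
# Same pattern as in the patched rfc2812.rb, anchored at both ends.
SERVER_LINE = /^(:(\S+)\s)?(\S+)(\s(.*))?$/

# Hypothetical helper: returns [prefix, command, params] for one raw line.
def split_server_message(serverstring)
  match = SERVER_LINE.match(serverstring.chomp)
  # inspect escapes unprintable bytes, so the offending input stays readable
  raise "Unparseable Server Message!!!: #{serverstring.inspect}" unless match
  [match[2], match[3], match[5]]
end

prefix, command, params =
  split_server_message(":irc.example.net PRIVMSG #rbot :hello\r\n")
puts "prefix=#{prefix} command=#{command} params=#{params}"
```

A line without a prefix, such as `"PING :irc.example.net\n"`, yields `nil` for the prefix and `"PING"` as the command.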