Use ASCII KCODE to prevent problems such as missing characters or matching failures when clients send messages in an encoding other than UTF-8

author: Giuseppe Bilotta <giuseppe.bilotta@gmail.com> 2007-02-20 23:02:35 +0000
committer: Giuseppe Bilotta <giuseppe.bilotta@gmail.com> 2007-02-20 23:02:35 +0000
commit: 397b61df257f72a8ce90792985f76497ba735da4 (patch)
tree: 7b8321eab08498376d537178ebe7ed57dfc23713
parent: 1572836f8c2888742b4f65da7dc6f66735f94bc1 (diff)
3 files changed, 22 insertions, 3 deletions
diff --git a/ChangeLog b/ChangeLog
index 358aab5f..403e8c41 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -6,6 +6,16 @@
 	<yaohan.chen@gmail.com>. People take turns to continue a chain of
 	words by saying words that begin with the final letter(s) of the
 	previous word.
+	* IRC messages are not UTF-8: Most of the string processing across
+	rbot is done against IRC messages, which do not have a well-defined
+	encoding. Although many clients now use UTF-8, there is no
+	guarantee that an arbitrary string received from IRC will be UTF-8
+	encoded. We have to force ASCII (byte-wise, charset-agnostic) matching
+	because otherwise some strings cause problems. For example, the
+	byte sequence "\340\350\354\362\371" (the vowels a, e, i, o, u,
+	each with a grave accent, in Latin-1) causes the string to be
+	considered only up to the "\354" (i with grave accent), so either the
+	rest of the message is ignored or the matching fails.
 
 2007-02-18  Giuseppe Bilotta <giuseppe.bilotta@gmail.com>
 
diff --git a/bin/rbot b/bin/rbot
index 5872e3e1..b7a6207f 100755
--- a/bin/rbot
+++ b/bin/rbot
@@ -21,7 +21,16 @@
 # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 
-$KCODE = 'u'
+# Most of the string processing across rbot is done against IRC messages, which
+# do not have a well-defined encoding. Although many clients now use UTF-8,
+# there is no guarantee that an arbitrary string received from IRC will be
+# UTF-8 encoded. We have to force ASCII (byte-wise, charset-agnostic) matching
+# because otherwise some strings cause problems. For example, the byte
+# sequence "\340\350\354\362\371" (the vowels a, e, i, o, u, each with a grave
+# accent, in Latin-1) causes the string to be considered only up to the
+# "\354" (i with grave accent), so either the rest of the message is ignored
+# or the matching fails.
+$KCODE = 'a'
 
 $VERBOSE=true
 
diff --git a/lib/rbot/rfc2812.rb b/lib/rbot/rfc2812.rb
index 5dec464c..97181b03 100644
--- a/lib/rbot/rfc2812.rb
+++ b/lib/rbot/rfc2812.rb
@@ -888,8 +888,8 @@ module Irc
       data = Hash.new
       data[:serverstring] = serverstring
 
-      unless serverstring =~ /^(:(\S+)\s)?(\S+)(\s(.*))?/
-        raise "Unparseable Server Message!!!: #{serverstring}"
+      unless serverstring.chomp =~ /^(:(\S+)\s)?(\S+)(\s(.*))?$/
+        raise "Unparseable Server Message!!!: #{serverstring.inspect}"
       end
 
       prefix, command, params = $2, $3, $5
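
Note: a minimal sketch of the byte-wise matching this commit aims for, written for current Ruby rather than the Ruby of 2007. $KCODE has had no effect since Ruby 1.9, so the sketch assumes that String#b (a binary-tagged copy of the string) is a reasonable stand-in for $KCODE = 'a'. The helper parse_server_line, the SERVER_LINE constant, and the sample message are hypothetical and not part of rbot; only the regular expression and the error message are taken from the patched rfc2812.rb hunk.

```ruby
# Regex copied from the patched lib/rbot/rfc2812.rb hunk above.
SERVER_LINE = /^(:(\S+)\s)?(\S+)(\s(.*))?$/

# Hypothetical helper (not rbot code): split a raw server line into
# prefix, command and params, treating the input as plain bytes.
def parse_server_line(serverstring)
  # String#b returns a copy tagged as ASCII-8BIT, so every byte counts
  # as one character and no byte sequence can be "invalid".
  line = serverstring.b.chomp
  unless line =~ SERVER_LINE
    raise "Unparseable Server Message!!!: #{serverstring.inspect}"
  end
  [$2, $3, $5]
end

# Sample line carrying the Latin-1 bytes from the ChangeLog entry.
raw = ":nick!user@host PRIVMSG #chan :\340\350\354\362\371 rest\r\n"

# The same bytes do not form a valid UTF-8 string.
valid_as_utf8 = raw.dup.force_encoding(Encoding::UTF_8).valid_encoding?

prefix, command, params = parse_server_line(raw)
puts "valid as UTF-8: #{valid_as_utf8}"
puts "prefix:  #{prefix}"
puts "command: #{command}"
puts "params:  #{params.inspect}"
```

Because the line is matched as bytes, the text after the accented vowels is kept in params instead of being cut off at "\354", which is the failure the ChangeLog entry describes.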