On Sat, Jun 06, 2009 at 06:36:43AM +0000, corvid wrote:
I wrote:
furaisanjin wrote:
I just wonder why this condition exists in the patch.
if (*s&0xe2 == 0xe2) {
There are Japanese characters whose encoding starts with 0xe5.
Ah yes, that is an error. Thank you.
Any problems with "if (*s&0xe0 == 0xe0) {" instead? I believe that would force everything above U+0800 through the decoding, but oh well.
Wait, why don't I just use "if (*s >= 0xe2)"?
(My excuse now is that I'm tired and it's time for bed, but I don't know what my excuse is for the other day :)
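For comparison, here is a minimal C sketch of how the conditions discussed above behave on a single byte. One detail worth noting: in C, == binds more tightly than &, so the condition as written in the patch is parsed as *s & (0xe2 == 0xe2), which only tests the low bit. The function names are hypothetical and not taken from the patch. The sketch assumes s points to unsigned char; with a signed 8-bit char, *s >= 0xe2 would never be true.

```c
/* Hypothetical helpers for illustration only; none of these names
 * come from the patch. All assume s points to unsigned char. */

/* How the compiler parses "*s&0xe2 == 0xe2": == is evaluated first,
 * giving 1, so only the lowest bit of *s is tested. */
static int parsed_as_written(const unsigned char *s)
{
    return (*s & (0xe2 == 0xe2)) != 0;
}

/* Probably the intended reading of the first condition. It requires
 * bits 0xe2 to all be set, so a lead byte such as 0xe5 is rejected. */
static int masked_e2(const unsigned char *s)
{
    return (*s & 0xe2) == 0xe2;
}

/* Second proposal: true for every byte of the form 111xxxxx,
 * i.e. the lead bytes of three- and four-byte sequences. */
static int masked_e0(const unsigned char *s)
{
    return (*s & 0xe0) == 0xe0;
}

/* Third proposal: a plain comparison against the lead byte 0xe2. */
static int at_least_e2(const unsigned char *s)
{
    return *s >= 0xe2;
}
```

For the byte 0xe5, masked_e2 returns 0 while masked_e0 and at_least_e2 return 1. For 0xe0, only masked_e0 returns 1.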
How UTF-8 works, if anyone cares:

0x000000-00007F is 0xxxxxxx
0x000080-0007FF is 110xxxxx 10xxxxxx
0x000800-00FFFF is 1110xxxx 10xxxxxx 10xxxxxx
0x010000-1FFFFF is 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

so (*s >= 0xe2) is at least

1110xxxx 10xxxxxx 10xxxxxx
    0010   000000   000000 == U+2000
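The three-byte row of the table can be sketched in C as follows. This is only an illustration of the bit layout, not the code from the patch: decode_utf8_3byte is a hypothetical name, and the sketch does not reject overlong encodings or surrogate code points.

```c
/* Hypothetical helper, not from the patch: decode one three-byte UTF-8
 * sequence (1110xxxx 10xxxxxx 10xxxxxx) starting at s.
 * Returns the code point, or -1 if the bytes do not match that pattern.
 * Checks short-circuit, so a NUL terminator is never read past. */
static long decode_utf8_3byte(const unsigned char *s)
{
    if ((s[0] & 0xf0) != 0xe0)          /* lead byte must be 1110xxxx */
        return -1;
    if ((s[1] & 0xc0) != 0x80 ||        /* continuation bytes must be */
        (s[2] & 0xc0) != 0x80)          /* of the form 10xxxxxx       */
        return -1;

    /* 4 payload bits from the lead byte, 6 from each continuation byte. */
    return ((long)(s[0] & 0x0f) << 12) |
           ((long)(s[1] & 0x3f) << 6)  |
            (long)(s[2] & 0x3f);
}
```

With the bytes 0xe2 0x80 0x80 this returns 0x2000, matching the lower bound worked out above. A sequence starting with 0xe5, such as 0xe5 0x80 0x80, decodes to 0x5000.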
Whatever the C code finally is, please be sure to comment it well (it has happened to me that I later forgot why my own code was doing something... ;). Sometimes I've "cleaned it up" only to rediscover days or months later (through bug reports) why it was there.

--
Cheers
Jorge.-