experimental patch: Re: dillo2 wrapping Chinese
dillo people:
It turns out that dillo handles Chinese quite badly because the text doesn't contain whitespace (BUG 827). It looks like Japanese is that way as well.
1) People using East Asian languages: Does this do roughly the right thing? I've read that there are picky rules about certain individual characters, but I just mean in general here.
2) Thoughts on optimization are welcomed.
littlebat: For some reason, I was thinking that whitespace was required between words in dillo, but I misremembered.
littlebat wrote:
Hi, Thanks for the work on Dillo.
I think it isn't a good idea to solve the long Chinese line wrapping problem by inserting a space after punctuation marks ("?" or "?" or "?", etc.).
Here is a "Dillo2 Wrap Long Chinese Line Test" page: http://www.learndiary.com/en/dillo2-wrap-long-chinese-line-test.htm . Open it with Firefox or IE, then open it with Dillo2, and you will see the difference in how they wrap long Chinese lines.
Note that, unlike English, there is no space between Chinese characters. So a line should be wrapped after any Chinese character that has reached the end of the line. I don't know the technical details, but I think letting Chinese characters function like (English) commas or periods is a good idea, so that a new line can start after any Chinese character that has reached the end of a line.
littlebat
corvid wrote:
Would it make things better or worse than it is now if I inserted a space when encountering a "?" or "?" so that dillo would know that it can wrap there? If that doesn't show up properly, I mean the characters that function as commas and periods.
(I'm trying to think of something simple that will at least help you somewhat without having to make big changes right now :) I realize that it _should_ be able to wrap after most CJK characters, but...)
It could be worse because 1) Those spaces would still be there if you were copying and pasting the text to something else. 2) If you wanted to search a page for "text?text", you would have to enter "text? text" with a space in there.
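The alternative discussed here, wrapping after CJK characters without inserting literal spaces, comes down to classifying code points as break opportunities while leaving the text itself untouched, so copy/paste and search still see the original string. A minimal sketch in C, assuming the text has already been decoded to code points; `is_cjk_breakable` and the set of ranges are illustrative choices, not dillo code or part of the patch, and real line-breaking rules are pickier about individual characters:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from dillo): returns true if a line break
 * may be placed after code point cp. Only a few well-known Unicode
 * blocks are covered; this is a rough approximation. */
bool is_cjk_breakable(uint32_t cp)
{
    return (cp >= 0x3000 && cp <= 0x303F)   /* CJK Symbols and Punctuation */
        || (cp >= 0x3040 && cp <= 0x30FF)   /* Hiragana and Katakana */
        || (cp >= 0x4E00 && cp <= 0x9FFF)   /* CJK Unified Ideographs */
        || (cp >= 0xFF00 && cp <= 0xFFEF);  /* Halfwidth and Fullwidth Forms */
}
```

A layout routine could consult such a predicate when deciding where a line may end, rather than modifying the page text.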
Hello corvid, 2008/11/12 corvid <corvid@lavabit.com>:
dillo people: It turns out that dillo handles Chinese quite badly because the text doesn't contain whitespace (BUG 827). It looks like Japanese is that way as well.
Yes, dillo doesn't handle Japanese either when a line is long and has no break in it.
I modified html_process_word so that each Japanese word is added by addText. This way the appearance is better, but tables sometimes look bad.
Regards,
furaisanjin
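The modification itself is not shown in the thread. As a rough illustration of the idea only, the sketch below splits a UTF-8 buffer so that every character encoded in three or more bytes becomes its own word, while runs of shorter characters stay together. Everything here is an assumption: `split_words`, `utf8_seq_len` and the `add_word_fn` callback are hypothetical names, the callback merely stands in for whatever adds a word to the layout, and splitting on every 3-byte character is cruder than restricting the split to CJK ranges.

```c
#include <stddef.h>

/* Byte length of the UTF-8 sequence introduced by lead byte c.
 * Returns 1 for ASCII and for stray continuation bytes; bytes >= 0xF0
 * are treated as 4-byte leads without further validation. */
size_t utf8_seq_len(unsigned char c)
{
    if (c >= 0xF0) return 4;
    if (c >= 0xE0) return 3;
    if (c >= 0xC0) return 2;
    return 1;
}

/* Hypothetical callback standing in for the code that adds one word. */
typedef void (*add_word_fn)(const char *start, size_t len, void *ctx);

/* Split buf[0..len) into words and return how many were emitted.
 * add may be NULL, in which case the words are only counted. */
size_t split_words(const char *buf, size_t len, add_word_fn add, void *ctx)
{
    size_t i = 0, run = 0, words = 0;

    while (i < len) {
        size_t n = utf8_seq_len((unsigned char)buf[i]);
        if (n > len - i)
            n = len - i;                 /* truncated sequence at the end */
        if (n >= 3) {
            if (i > run) {               /* flush the pending short run */
                if (add) add(buf + run, i - run, ctx);
                words++;
            }
            if (add) add(buf + i, n, ctx);   /* the character on its own */
            words++;
            run = i + n;
        }
        i += n;
    }
    if (i > run) {                       /* trailing run of short characters */
        if (add) add(buf + run, i - run, ctx);
        words++;
    }
    return words;
}
```

Emitting every character as a separate word gives the layout many more places to break, which may be related to the narrow table rows reported later in the thread, although the thread does not confirm the cause.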
furaisanjin wrote:
Yes, dillo doesn't handle Japanese either when a line is long and has no break in it.
I modified html_process_word so that each Japanese word is added by addText. This way the appearance is better, but tables sometimes look bad.
What happens to the tables? (It's hard for me to tell when something is bad since the characters are just shapes to me.)
2008/11/12 corvid <corvid@lavabit.com>:
What happens to the tables? (It's hard for me to tell when something is bad since the characters are just shapes to me.)
Here are 2 PNG pictures taken when I visit http://ja.wikipedia.org
This is from the original dillo2.0: http://furaisanjin.blog.so-net.ne.jp/_images/blog/_9f7/furaisanjin/1-02b9e.p...
The part outlined in red is a long line without a line break, and it doesn't fit within the page width.
This is from the modified dillo2.0: http://furaisanjin.blog.so-net.ne.jp/_images/blog/_9f7/furaisanjin/2.png
The part outlined in blue is a table, and its rows are really narrow.
Regards,
furaisanjin
Back in November, furaisanjin wrote:
Yes, dillo doesn't handle Japanese either when a line is long and has no break in it.
I modified html_process_word so that each Japanese word is added by addText. This way the appearance is better, but tables sometimes look bad.
I made a new version that goes in html_process_word (handles numeric character references).
Hi.
I just wonder why this condition exists in the patch.
if (*s&0xe2 == 0xe2) {
There are Japanese characters whose encoding starts with 0xe5.
Regards,
furaisanjin
furaisanjin wrote:
I just wonder why this condition exists in the patch.
if (*s&0xe2 == 0xe2) {
There are characters which start from 0xe5 in Japanese.
Ah yes, that is an error. Thank you.
Any problems with "if (*s&0xe0 == 0xe0) {" instead? I believe that would force everything above U+0800 through the decoding, but oh well.
(unicode blocks for the curious: http://unicode.org/Public/UNIDATA/Blocks.txt)
Oh, by the way, I noticed in the line breaking document that they also listed
20000..2A6D6 CJK UNIFIED IDEOGRAPHS EXTENSION B
2F800..2FA1D CJK COMPATIBILITY IDEOGRAPHS SUPPLEMENT
but I wasn't sure whether anyone makes any real use of characters above U+FFFF.
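One thing worth checking when comparing these conditions is C operator precedence: `==` binds more tightly than `&`, so the unparenthesized forms quoted in this thread do not apply the mask first. The sketch below spells out how each variant evaluates. The function names are hypothetical, and it assumes the byte is examined as an unsigned char, which the thread does not state.

```c
#include <stdbool.h>

/* How the compiler parses `c & 0xe2 == 0xe2`: the comparison is done
 * first and yields 1, so the whole test is just `c & 1` (lowest bit). */
bool cond_as_parsed(unsigned char c)
{
    return c & (0xe2 == 0xe2);
}

/* Mask applied first with 0xe2: true only if all bits of 0xe2 are set,
 * which rejects lead bytes such as 0xe5. */
bool cond_mask_e2(unsigned char c)
{
    return (c & 0xe2) == 0xe2;
}

/* Mask applied first with 0xe0: true for any byte of the form 111xxxxx,
 * that is, lead bytes of 3-byte and longer sequences. */
bool cond_mask_e0(unsigned char c)
{
    return (c & 0xe0) == 0xe0;
}
```

For the lead byte 0xe5 that furaisanjin mentions, `cond_mask_e2` is false while `cond_mask_e0` is true; `cond_as_parsed` merely reports whether the byte is odd, so it also accepts plain ASCII bytes such as 'a'.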
I wrote:
Ah yes, that is an error. Thank you.
Any problems with "if (*s&0xe0 == 0xe0) {" instead? I believe that would force everything above U+0800 through the decoding, but oh well.
Wait, why don't I just use if (*s >= 0xe2)
(My excuse now is that I'm tired and it's time for bed, but I don't know what my excuse is for the other day :)
How UTF-8 works, if anyone cares:
0x000000-00007F is 0xxxxxxx
0x000080-0007FF is 110xxxxx 10xxxxxx
0x000800-00FFFF is 1110xxxx 10xxxxxx 10xxxxxx
0x010000-1FFFFF is 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
so (*s >= 0xe2) is at least
1110xxxx 10xxxxxx 10xxxxxx
    0010   000000   000000 == U+2000
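The arithmetic in the table above can be checked with a small decoder. This is a sketch, not the patch's code: `utf8_decode3` and `needs_decoding` are hypothetical names, the decoder does no validation, and the cast to unsigned char is an assumption about how the byte should be read (the thread does not say what type `s` points to).

```c
#include <stdint.h>

/* Decode one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
 * No validation: the caller must pass a well-formed sequence. */
uint32_t utf8_decode3(const unsigned char *s)
{
    return ((uint32_t)(s[0] & 0x0F) << 12)   /* 4 payload bits */
         | ((uint32_t)(s[1] & 0x3F) << 6)    /* 6 payload bits */
         |  (uint32_t)(s[2] & 0x3F);         /* 6 payload bits */
}

/* The quick test from the thread. The cast matters: where plain char
 * is signed, bytes >= 0x80 are negative and `*s >= 0xe2` is never true. */
int needs_decoding(const char *s)
{
    return (unsigned char)*s >= 0xe2;
}
```

The smallest sequence that passes the test, 0xE2 0x80 0x80, decodes to U+2000, matching the table. Note that the test also lets through 4-byte lead bytes (0xF0 and up), which encode characters above U+FFFF.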
On Sat, Jun 06, 2009 at 06:36:43AM +0000, corvid wrote:
Wait, why don't I just use if (*s >= 0xe2)
Whatever the C code finally is, please be sure to comment it well (it has happened to me that I later forgot why my own code was doing something... ;). Sometimes I've "cleaned it up" only to rediscover some days or months later (through bug reports) why it was there.
--
Cheers
Jorge.-
On Sat, Jun 06, 2009 at 06:36:43AM +0000, corvid wrote:
Wait, why don't I just use if (*s >= 0xe2)
FWIW, please consider this work "in progress". Not to be committed into dillo-2.1 (to be released soon). -- Cheers Jorge.-
Jorge wrote:
FWIW, please consider this work "in progress". Not to be committed into dillo-2.1 (to be released soon).
Right, I'm not regarding it as anything that would go into 2.1. It's just something available for those who have a use for it.
participants (3)
- corvid@lavabit.com
- furaisanjin@gmail.com
- jcid@dillo.org