On Fri, Dec 14, Jorge Arellano Cid wrote:
On Fri, Dec 14, 2012 at 04:48:23PM +0100, Sebastian Geerken wrote:
The HTML parser passes invalid UTF-8 to dw::Textblock. I will make nextUtf8Char more robust (of course, dillo should not crash), but Jorge's page is HTML, encoded in ISO-8859-1, not UTF-8, as seen here:
000009b0 34 38 22 3e 2d 20 4b 65 79 73 74 72 6f 6b 65 20 |48">- Keystroke |
000009c0 4c 6f 67 67 69 6e 67 20 77 69 74 68 20 42 65 61 |Logging with Bea|
000009d0 63 6f 6e 20 ab 20 53 74 72 61 74 65 67 69 63 20 |con . Strategic |
                     ^^
It seems that the FLTK functions do some checks and sometimes decode bytes as ISO-8859-1.
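To illustrate the kind of check described above, here is a minimal sketch of a lenient decoder. It is not dillo's nextUtf8Char and not FLTK's code; the name decodeLenient and its signature are hypothetical. It assumes pos < s.size() on entry, decodes one well-formed UTF-8 sequence if there is one, and otherwise treats the single byte as ISO-8859-1.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not dillo's nextUtf8Char, not an FLTK function).
// Decodes one character starting at s[pos], advances pos past it and
// returns the code point. Precondition: pos < s.size().
// If the bytes at pos are not a well-formed UTF-8 sequence, only one byte
// is consumed and it is interpreted as ISO-8859-1, where the code point
// equals the byte value.
std::uint32_t decodeLenient(const std::string &s, std::size_t &pos)
{
    const unsigned char b0 = static_cast<unsigned char>(s[pos]);
    if (b0 < 0x80) {            // plain ASCII
        pos++;
        return b0;
    }

    std::size_t len = 0;        // 0 means "not a valid lead byte"
    std::uint32_t cp = 0;       // code point being assembled
    std::uint32_t min = 0;      // smallest value allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }

    // The sequence must fit into the buffer ...
    bool ok = len != 0 && pos + len <= s.size();
    // ... and every following byte must be a continuation byte (10xxxxxx).
    for (std::size_t i = 1; ok && i < len; i++) {
        const unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80)
            ok = false;
        else
            cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates and values beyond U+10FFFF.
    if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
        ok = false;

    if (!ok) {                  // fallback: one byte, read as ISO-8859-1
        pos++;
        return b0;
    }
    pos += len;
    return cp;
}
```

With the bytes from the dump above, the lone 0xab after "con " is not a valid lead byte, so it comes back as U+00AB and pos advances by one. The same character encoded properly as c2 ab also yields U+00AB, with pos advancing by two.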
AFAIR from comments in FLTK, some of its UTF-8 functions dealt with a mix of Latin-1, UTF-8 and some Windows codec.
They got into it because the mix was inevitable for them.
I've modified my code so that it works in a similar way, but I have not yet dealt with the differences between ISO-8859-1, ISO-8859-15, and Windows-1252. In any case, these differences are marginal.

However, IMO there should be a conversion to clean UTF-8, so that only a small part of dillo has to bother with such problems, while most parts can rely on clean UTF-8. (Something to consider after the release.)

Sebastian
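As a rough sketch of what such a "conversion to clean UTF-8" at the boundary could look like: the function below copies well-formed UTF-8 sequences unchanged and re-encodes every other byte as if it were ISO-8859-1. The names validUtf8Length and toCleanUtf8 are made up for this example and do not exist in dillo. ISO-8859-15 and Windows-1252 are deliberately ignored here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length (1..4) of the well-formed UTF-8 sequence
// starting at s[i], or 0 if there is none. Precondition: i < s.size().
static std::size_t validUtf8Length(const std::string &s, std::size_t i)
{
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80)
        return 1;

    std::size_t len;
    unsigned long cp, min;
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else
        return 0;               // stray continuation byte or 0xF8..0xFF

    if (i + len > s.size())
        return 0;               // truncated sequence
    for (std::size_t k = 1; k < len; k++) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80)
            return 0;           // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    // Overlong forms, surrogates and values above U+10FFFF are not valid.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

// Hypothetical boundary function: returns a string that is always valid
// UTF-8. Bytes that are not part of a well-formed sequence are assumed
// to be ISO-8859-1 and are re-encoded as two-byte UTF-8.
std::string toCleanUtf8(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ) {
        const std::size_t len = validUtf8Length(in, i);
        if (len > 0) {
            out.append(in, i, len);     // already clean, copy as-is
            i += len;
        } else {
            // An invalid byte is always >= 0x80, so U+0080..U+00FF
            // needs exactly two bytes: 110000xx 10xxxxxx.
            const unsigned char b = static_cast<unsigned char>(in[i]);
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
            i++;
        }
    }
    return out;
}
```

Applied to "Beacon \xab Strategic", this turns the single byte ab into c2 ab and leaves the rest untouched. Running it again on its own output changes nothing. One known limitation of this heuristic: Latin-1 text that happens to form a valid UTF-8 sequence (for example the bytes c3 ab, "Ã«" in Latin-1) is taken as UTF-8.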