On Fri, Dec 14, 2012 at 04:48:23PM +0100, Sebastian Geerken wrote:
Somehow this post got lost ...
Date: Fri, 14 Dec 2012 15:43:41 +0100
From: Sebastian Geerken <sgeerken at dillo.org>
Subject: Re: [Dillo-dev] Dillo early exit
To: Dillo mailing list <dillo-dev at dillo.org>
Mail-Followup-To: Sebastian Geerken <sgeerken at dillo.org>, Dillo mailing list <dillo-dev at dillo.org>
On Fri, Dec 14, Jorge Arellano Cid wrote:
Just noticed this:
Nav_open_url: new url='http://news.bress.net/search.php?feed=149'
Dns_server [0]: news.bress.net is 67.205.59.213
Connecting to 67.205.59.213
NumPendingStyleSheets=1
*** [dillo/3.0.2] This should not happen! ***
Aborted
This is new, as dillo from Nov 14 doesn't exit. Any clues?
Debugging shows the same issue as Alexander's problem:
On Fri, Dec 14, Alexander Voigt wrote:
With the current Dillo development version 2672:4d0bdcf10ee7 (Fri Dec 14 12:24:54 2012 +0100), I get a segfault when I try to access the Dillo bug database.
[...]
#3 _nextUtf8Char (s=<value optimized out>) at unicode.cc:92
#4 0x000000000047163c in lout::unicode::nextUtf8Char (s=0x98d10c "\267", len=1) at unicode.cc:114
#5 0x000000000044e8d0 in dw::Textblock::addText (this=<value optimized out>, text=0x98d10c "\267", len=<value optimized out>, style=<value optimized out>) at textblock.cc:1430
The HTML parser passes invalid UTF-8 to dw::Textblock. I will make nextUtf8Char more robust (of course, dillo should not crash), but Jorge's page is HTML, encoded in ISO-8859-1, not UTF-8, as seen here:
000009b0 34 38 22 3e 2d 20 4b 65 79 73 74 72 6f 6b 65 20 |48">- Keystroke |
000009c0 4c 6f 67 67 69 6e 67 20 77 69 74 68 20 42 65 61 |Logging with Bea|
000009d0 63 6f 6e 20 ab 20 53 74 72 61 74 65 67 69 63 20 |con . Strategic |
                     ^^
It seems that the Fltk functions do some checks, and sometimes decode as ISO-8859-1.
AFAIR from comments in fltk, some utf8 functions dealt with mixed latin1, utf8 and some windows codec. They got into it because the mix was inevitable for them.
--
Cheers
Jorge.-
On Fri, Dec 14, Jorge Arellano Cid wrote:
On Fri, Dec 14, 2012 at 04:48:23PM +0100, Sebastian Geerken wrote:
The HTML parser passes invalid UTF-8 to dw::Textblock. I will make nextUtf8Char more robust (of course, dillo should not crash), but Jorge's page is HTML, encoded in ISO-8859-1, not UTF-8, as seen here:
000009b0 34 38 22 3e 2d 20 4b 65 79 73 74 72 6f 6b 65 20 |48">- Keystroke |
000009c0 4c 6f 67 67 69 6e 67 20 77 69 74 68 20 42 65 61 |Logging with Bea|
000009d0 63 6f 6e 20 ab 20 53 74 72 61 74 65 67 69 63 20 |con . Strategic |
                     ^^
It seems that the Fltk functions do some checks, and sometimes decode as ISO-8859-1.
AFAIR from comments in fltk, some utf8 functions dealt with mixed latin1, utf8 and some windows codec.
They got into it because the mix was inevitable for them.
I've modified my code so that it works in a similar way, but I have not yet dealt with the differences between ISO-8859-1, ISO-8859-15, and Windows-1252. Anyway, these differences are marginal. However, IMO there should be a conversion to clean UTF-8, so that only a small part of dillo has to bother with such problems while most parts can rely on clean UTF-8. (Something to consider after the release.)
Sebastian
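For readers following the thread, the fallback described above can be sketched roughly as follows. This is not the code committed to dillo: `Decoded` and `decodeLenient` are hypothetical names, and the sketch assumes that any byte which does not start a well-formed UTF-8 sequence is consumed on its own and read as ISO-8859-1 (so the byte value is the code point). The ISO-8859-15 and Windows-1252 differences are ignored, as in the message.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type; not part of dillo's lout::unicode interface.
struct Decoded {
    std::uint32_t codePoint; // decoded Unicode code point
    std::size_t length;      // input bytes consumed (0 only when len == 0)
};

// Decode one character from s[0..len). Well-formed UTF-8 is decoded as
// usual; any other byte is consumed alone and interpreted as ISO-8859-1.
inline Decoded decodeLenient(const unsigned char *s, std::size_t len)
{
    if (len == 0)
        return {0, 0};

    const unsigned char b0 = s[0];
    if (b0 < 0x80)
        return {b0, 1}; // plain ASCII

    std::size_t n = 0;
    std::uint32_t cp = 0, minCp = 0;
    if ((b0 & 0xE0) == 0xC0)      { n = 2; cp = b0 & 0x1F; minCp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; minCp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; minCp = 0x10000; }
    else
        return {b0, 1}; // stray continuation byte or invalid lead byte

    if (n > len)
        return {b0, 1}; // sequence truncated by the end of the buffer

    for (std::size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return {b0, 1}; // expected a continuation byte
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return {b0, 1};

    return {cp, n};
}
```

With this approach the byte `ab` from the hex dump and the lone `\267` from the backtrace both decode as one-byte characters instead of reading past the end of the buffer. The trade-off is that a run of Latin-1 bytes which happens to form valid UTF-8 is taken as UTF-8.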
Sebastian wrote:
However, IMO there should be a conversion to clean UTF-8 so that only a small part of dillo should have to bother about such problems, while most parts can rely on clean UTF-8. (Something to consider after the release.)
I just checked whether we could get free stripping of non-utf-8 by asking iconv to convert from utf-8 to utf-8 when that's the claimed charset, but unsurprisingly it didn't do anything.
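As an illustration of the "conversion to clean UTF-8" idea without relying on iconv, here is a minimal sketch of a one-pass sanitizer. `utf8SeqLen` and `toCleanUtf8` are hypothetical names, not dillo functions, and the sketch assumes that every byte outside a well-formed UTF-8 sequence is ISO-8859-1 and re-encodes it as a two-byte UTF-8 sequence. Windows-1252 characters in the range 0x80 to 0x9F would need an extra mapping table that is not shown here.

```cpp
#include <cstddef>
#include <string>

// Length of the well-formed UTF-8 sequence starting at p, or 0 if the bytes
// at p do not start one (invalid lead, truncation, overlong, surrogate).
static std::size_t utf8SeqLen(const unsigned char *p, std::size_t avail)
{
    if (avail == 0)
        return 0;

    const unsigned char b0 = p[0];
    if (b0 < 0x80)
        return 1;

    std::size_t n = 0;
    unsigned long cp = 0, minCp = 0;
    if ((b0 & 0xE0) == 0xC0)      { n = 2; cp = b0 & 0x1F; minCp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; minCp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; minCp = 0x10000; }
    else
        return 0;

    if (n > avail)
        return 0;
    for (std::size_t i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return n;
}

// Copy well-formed UTF-8 through unchanged; re-encode every other byte as
// the UTF-8 form of the ISO-8859-1 character with the same value.
std::string toCleanUtf8(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    const unsigned char *p =
        reinterpret_cast<const unsigned char *>(in.data());

    std::size_t i = 0;
    while (i < in.size()) {
        const std::size_t n = utf8SeqLen(p + i, in.size() - i);
        if (n > 0) {
            out.append(in, i, n);
            i += n;
        } else {
            const unsigned char b = p[i]; // always >= 0x80 on this path
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
            ++i;
        }
    }
    return out;
}
```

If something like this ran once where page data enters the parser, later stages such as dw::Textblock could assume well-formed input. The function is idempotent, so applying it to data that is already clean UTF-8 changes nothing.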
participants (3): corvid@lavabit.com, jcid@dillo.org, sgeerken@dillo.org