[patch] HTML parser bugfix

newer
[patch] VOIDP2INT, INT2VOIDP and...

older
[customercare@dotster.com: Case...

p37sitdu＠lavabit.com

June 3, 2012

6:29 p.m.

The following HTML fragment is shown as BUG<">abc by Dillo. It is valid (checked with validator.w3.org) HTML. In Firefox it is shown as abc with '>BUG<' tooltip. <div title=">BUG<">abc</div> What happens is that Dillo see < character inside quotes, shows bug "attribute lacks closing quote" and breaks out of tag parsing loop. In fact it is not possible to say if there is a bug in HTML at that point. In WWW you can see this bug on http://reddit.com. Each entry on this page contains votehash', null, event)" > and bug meter shows bugs for them. Attached patch removes logic that tries to detect unterminated quoted attributes and fixes this bug.

Show replies by date

onepoint＠starurchin.org

June 2012

8 p.m.

123 wrote:

...

Attached patch removes logic that tries to detect unterminated quoted attributes and fixes this bug.

I would be wary of applying this patch without further thought. I run Dillo with a very similar patch and it breaks *many* more pages than it fixes. It is unfortunately necessary to detect and recover from quoting errors in attributes. For instance, the web is *full* of pages with tags containing this: foo="bar"" If you don't detect and ignore that extra double quote you will break many pages that every other browser renders perfectly well. It's not even that uncommon to see monstrosities like this: foo=""bar""" It is also common to see short attributes (often CSS lengths) delimited with a mixture of single and double quotes, eg: size="12pt' For these reasons, sadly, it is IMHO not realistic to just drop the misquote-detection logic. Regards, Jeremy Henty

p37sitdu＠lavabit.com

8:45 p.m.

On Sun, Jun 03, 2012 at 09:00:55PM +0100, Jeremy Henty wrote:

...

I would be wary of applying this patch without further thought. I run Dillo with a very similar patch and it breaks *many* more pages than it fixes. It is unfortunately necessary to detect and recover from quoting errors in attributes. For instance, the web is *full* of pages with tags containing this:

foo="bar""

If you don't detect and ignore that extra double quote you will break many pages that every other browser renders perfectly well.

Then there should be some logic for detecting double quotes. Searching for < inside quotes is not the right way as it breaks valid pages like reddit main page. Can you give examples of real web pages with double quotes?

...

It's not even that uncommon to see monstrosities like this:

foo=""bar"""

<div title=""abc""">def</div> does not show tip "abc" for def in Firefox. It silently displays nothing in Dillo after patch, however. I will try to fix it.

...

It is also common to see short attributes (often CSS lengths) delimited with a mixture of single and double quotes, eg:

size="12pt'

<div title="abc'>def</div> displays nothing in Firefox. What browser can render it?

p37sitdu＠lavabit.com

12:19 p.m.

According to WHATWG [1] fragment <div title=""abc""">def</div> should be parsed without errors until 'a' character. abc""" should be parsed as attribute name with parse errors because of lack of space character and quotes inside attribute name. [1] http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.htm...

onepoint＠starurchin.org

11:14 p.m.

123 wrote:

...

On Sun, Jun 03, 2012 at 09:00:55PM +0100, Jeremy Henty wrote:

...

...
If you don't detect and ignore that extra double quote you will break many pages that every other browser renders perfectly well.

Then there should be some logic for detecting double quotes.

I agree. The hard question is: what logic?

...

Searching for < inside quotes is not the right way as it breaks valid pages like reddit main page.

Again, I agree. In fact, I decided to experiment with removing Dillo's misquote-detection just because it broke reddit. Unfortunately it is very hard to come up with an algorithm that correctly handles the various quoting horrors that you see all the time, yet also correctly parses embedded javascript fragments like: onclick='alert("clicked")' (IIRC Dillo mis-renders reddit because of embedded javascript like this.)

...

Can you give examples of real web pages with double quotes?

I have attached a bunch of links that I collected after spending a lot of time debugging pages that broke my locally-patched Dillo. (Ignore the file:///... URLs as they obviously won't work for you.) I have also attached the two patches I use. I think the first does the same thing that you propose, although I haven't checked that in detail. The second attempts to copy Firefox's misquote-detection algorithm. I agree that Dillo's misquote-detection isn't good enough, but just taking it out is a step backwards, and it is a mistake to propose changes to it based solely on particular pages that Dillo mis-renders, because any such change needs to be tested against the many other broken pages that Dillo currently renders more or less correctly. Regards, Jeremy Henty

p37sitdu＠lavabit.com

12:45 p.m.

On Thu, Jun 07, 2012 at 12:14:06AM +0100, Jeremy Henty wrote:

...

123 wrote:

...
On Sun, Jun 03, 2012 at 09:00:55PM +0100, Jeremy Henty wrote:

...
...
If you don't detect and ignore that extra double quote you will break many pages that every other browser renders perfectly well.

Then there should be some logic for detecting double quotes.

I agree. The hard question is: what logic?

IMO the right way to do it is to implement HTML Standard [1]. When Html_write_raw is called, parser is in the Data state. When it returns to Data state again, it is the end of token. I have implemented standard comment/DOCTYPE parsing. It is incomplete, EOF is not handled and DOCTYPE parsing is not changed. Patch is attached. Next step is to rewrite tag parsing in standard way. [1] http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.htm...

p37sitdu＠lavabit.com

3:18 p.m.

Forgot to attach.

Johannes.Hofmann＠gmx.de

8:39 p.m.

On Thu, Jun 07, 2012 at 04:45:42PM +0400, 123 wrote:

...

On Thu, Jun 07, 2012 at 12:14:06AM +0100, Jeremy Henty wrote:

...
123 wrote:

...
On Sun, Jun 03, 2012 at 09:00:55PM +0100, Jeremy Henty wrote:

...
...
If you don't detect and ignore that extra double quote you will break many pages that every other browser renders perfectly well.

Then there should be some logic for detecting double quotes.

I agree. The hard question is: what logic?

IMO the right way to do it is to implement HTML Standard [1]. When Html_write_raw is called, parser is in the Data state. When it returns to Data state again, it is the end of token.

I have implemented standard comment/DOCTYPE parsing. It is incomplete, EOF is not handled and DOCTYPE parsing is not changed. Patch is attached. Next step is to rewrite tag parsing in standard way.

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.htm...

With your patch text disappears e.g. on http://www.cnas.org/blogs/abumuqawama/2011/04/quote-day.html-0 which was mentioned by Jeremy while it renders ok with current dillo and firefox. I would rather put together a test page that includes all the cases Jeremy brought up plus the reddit one. Also looking into firefox or other browser sources might be a good start. Cheers, Johannes

jcid＠dillo.org

3:35 p.m.

On Sun, Jun 03, 2012 at 09:00:55PM +0100, Jeremy Henty wrote:

...

123 wrote:

...
Attached patch removes logic that tries to detect unterminated quoted attributes and fixes this bug.

I would be wary of applying this patch without further thought. I run Dillo with a very similar patch and it breaks *many* more pages than it fixes. It is unfortunately necessary to detect and recover from quoting errors in attributes. For instance, the web is *full* of pages with tags containing this:

foo="bar""

If you don't detect and ignore that extra double quote you will break many pages that every other browser renders perfectly well.

It's not even that uncommon to see monstrosities like this:

foo=""bar"""

It is also common to see short attributes (often CSS lengths) delimited with a mixture of single and double quotes, eg:

size="12pt'

For these reasons, sadly, it is IMHO not realistic to just drop the misquote-detection logic.

Ditto. (FWIW, maybe this is the fourth time this patch is received; In the past the submitter has always desisted after some more testing). -- Cheers Jorge.-

4942

Age (days ago)

4951

Last active (days ago)

List overview

Download

8 comments

4 participants

participants (4)

jcid＠dillo.org
Johannes.Hofmann＠gmx.de
onepoint＠starurchin.org
p37sitdu＠lavabit.com