Quoted attribute parsing: summary

onepoint＠starurchin.org

Aug. 16, 2010

8:21 a.m.

Prompted by some private conversation with corvid I've been digging through specs and source code to see what the state of play is.

The HTML5 specification[1] states that the user agent should consume text, converting character references until it finds the matching close quote. If there is no matching close quote (ie. it sees an EOF first) then it terminates (strictly speaking, it switches to the data state and reconsumes the EOF, which makes it emit an EOF token).

Taking out Dillo's bogus attribute value detection as I proposed would make Dillo parse quoted attribute values as per the HTML5 spec.

The Hubbub HTML parser library[2] parses quoted attribute values as per the HTML5 spec.

Firefox parses quoted attribute values as per the HTML5 spec *except* that if it sees an EOF then it backs up to the open quote, discards it, then reparses as though it was expecting an unquoted attribute value. Otherwise (ie. if it finds the matching close quote) it makes no attempt to detect a broken attribute value, no matter what content the attribute value has swallowed up.

So it seems that the world at large has given up on trying to detect and correct broken attribute values.

Jeremy Henty

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/
[2] http://www.netsurf-browser.org/projects/hubbub/
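To make the HTML5 rule described in this message concrete, here is a minimal Python sketch. It is an illustration only and not code from Dillo, Hubbub, or Firefox; the function name `consume_quoted_value` and its return shape are assumptions made for this example, and the character-reference conversion step is left out to keep it short.

```python
def consume_quoted_value(text, pos):
    """Consume a quoted attribute value whose opening quote is text[pos].

    Hypothetical helper for illustration. Returns (value, next_pos, hit_eof).
    Character references are NOT converted in this sketch.
    """
    quote = text[pos]
    if quote not in "\"'":
        raise ValueError("expected an opening quote")
    end = text.find(quote, pos + 1)
    if end == -1:
        # No matching close quote: the rest of the input is swallowed and
        # the caller should treat this as the EOF parse error.
        return text[pos + 1:], len(text), True
    # Everything up to the matching quote is the value, including any '>'.
    return text[pos + 1:end], end + 1, False


if __name__ == "__main__":
    src = '<a title="x > y">'
    print(consume_quoted_value(src, src.index('"')))
```

Note that a `>` inside the quotes is just part of the value under this rule; only the matching quote or EOF ends it.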


Johannes.Hofmann＠gmx.de

August 2010

10:10 a.m.

On Mon, Aug 16, 2010 at 07:21:02AM +0100, Jeremy Henty wrote:

...

Prompted by some private conversation with corvid I've been digging through specs and source code to see what the state of play is.

The HTML5 specification[1] states that the user agent should consume text, converting character references until it finds the matching close quote. If there is no matching close quote (ie. it sees an EOF first) then it terminates (strictly speaking, it switches to the data state and reconsumes the EOF, which makes it emit an EOF token).

Taking out Dillo's bogus attribute value detection as I proposed would make Dillo parse quoted attribute values as per the HTML5 spec.

The Hubbub HTML parser library[2] parses quoted attribute values as per the HTML5 spec.

Firefox parses quoted attribute values as per the HTML5 spec *except* that if it sees an EOF then it backs up to the open quote, discards it, then reparses as though it was expecting an unquoted attribute value. Otherwise (ie. if it finds the matching close quote) it makes no attempt to detect a broken attribute value, no matter what content the attribute value has swallowed up.

So it seems that the world at large has given up on trying to detect and correct broken attribute values.

I'd agree that we should not make compromises displaying correct HTML when trying to deal with buggy HTML. But are the '>' characters in the attribute value in the reddit page actually valid? The HTML validators at least warn about them.

Cheers,
Johannes

onepoint＠starurchin.org

10:14 p.m.

Johannes Hofmann wrote:

...

I'd agree that we should not make compromises displaying correct HTML when trying to deal with buggy HTML. But are the '>' characters in the attribute value in the reddit page actually valid?

Yes, at least according to the HTML5 specification. Indeed, according to that specification, the only possible parse errors while parsing a quoted attribute value are (i) EOF, and (ii) a malformed entity reference. Anything else is valid! I doubt that those '>' characters are valid according to SGML, but the HTML5 specification explicitly states that HTML5 is not an SGML instance. No popular client has ever parsed HTML as an SGML instance and servers have been sending non-SGML-compliant "HTML" since forever. No matter what earlier HTML specifications may have claimed, the practical reality is that HTML has never been a kind of SGML.

...

The HTML validators at least warn about them.

Warning about them is probably a good idea, but that's a different issue from how to handle them. Whatever Dillo should do, its current behaviour (a) does not conform to HTML5, and (b) breaks Reddit.

Of course, there's no reason that Dillo *must* conform to HTML5. Indeed the HTML5 specification is peppered with the lovely phrase "willful violation", meaning "yes we know this breaks someone else's specification but we think it's for the best". So it's fine in principle for Dillo to say "we're going to violate HTML5 because we think it's for the best", but I think that this particular behaviour is a bad idea. It violates the HTML5 standard, it deviates (as far as I can tell) from standard practice, and it breaks an otherwise perfectly compliant website. If we can think of a useful way to willfully violate standards so as to better handle broken HTML then let's do it, but I think Dillo is better off without this particular workaround because it does more harm than good.

NB: HTML5 is still a work in progress. These bug reports show some of the discussion of parsing attribute values:

Bug 9872: "trigger a conformance error when javascript is included in href attribute" (rejected because there are legitimate use cases and even if it's sometimes abused it's not the HTML5 specification's role to police its use)

http://www.w3.org/Bugs/Public/show_bug.cgi?id=9872

Bug 9987: "attribute values should be allowed to contain ambiguous ampersands ..." (still new)

http://www.w3.org/Bugs/Public/show_bug.cgi?id=9987

Regards,
Jeremy Henty

Johannes.Hofmann＠gmx.de

10:41 p.m.

On Tue, Aug 17, 2010 at 09:13:11PM +0100, Jeremy Henty wrote:

...

Johannes Hofmann wrote:

...
I'd agree that we should not make compromises displaying correct HTML when trying to deal with buggy HTML. But are the '>' characters in the attribute value in the reddit page actually valid?

Yes, at least according to the HTML5 specification. Indeed, according to that specification, the only possible parse errors while parsing a quoted attribute value are (i) EOF, and (ii) a malformed entity reference. Anything else is valid!

I doubt that those '>' characters are valid according to SGML, but the HTML5 specification explicitly states that HTML5 is not an SGML instance. No popular client has ever parsed HTML as an SGML instance and servers have been sending non-SGML-compliant "HTML" since forever. No matter what earlier HTML specifications may have claimed, the practical reality is that HTML has never been a kind of SGML.

...
The HTML validators at least warn about them.

Warning about them is probably a good idea, but that's a different issue from how to handle them. Whatever Dillo should do, its current behaviour (a) does not conform to HTML5, and (b) breaks Reddit.

Of course, there's no reason that Dillo *must* conform to HTML5. Indeed the HTML5 specification is peppered with the lovely phrase "willful violation", meaning "yes we know this breaks someone else's specification but we think it's for the best". So it's fine in principle for Dillo to say "we're going to violate HTML5 because we think it's for the best", but I think that this particular behaviour is a bad idea. It violates the HTML5 standard, it deviates (as far as I can tell) from standard practice, and it breaks an otherwise perfectly compliant website. If we can think of a useful way to willfully violate standards so as to better handle broken HTML then let's do it, but I think Dillo is better off without this particular workaround because it does more harm than good.

NB: HTML5 is still a work in progress. These bug reports show some of the discussion of parsing attribute values:

Bug 9872: "trigger a conformance error when javascript is included in href attribute" (rejected because there are legitimate use cases and even if it's sometimes abused it's not the HTML5 specification's role to police its use)

http://www.w3.org/Bugs/Public/show_bug.cgi?id=9872

Bug 9987: "attribute values should be allowed to contain ambiguous ampersands ..." (still new)

http://www.w3.org/Bugs/Public/show_bug.cgi?id=9987

I also played with this some more... If all standard browsers would insist on correctly closed quotes, we could expect web-developers to immediately fix those errors. But for me firefox 3.6.3 shows something given the following HTML (as does current dillo):

<div title="foo >hello world</div>dillo is great

When we remove the unquoted attribute detection as in your patch, nothing is shown. Are newer firefox versions treating this differently?

Regards,
Johannes

onepoint＠starurchin.org

1:06 a.m.

Johannes Hofmann wrote:

...

But for me firefox 3.6.3 shows something given the following HTML (as does current dillo):

<div title="foo >hello world</div>dillo is great

That's Firefox's workaround that I described in my original post: if it sees EOF while parsing a quoted attribute value (ie. if it *never* sees a matching quote) then it goes back to the opening quote, discards it, and parses an unquoted attribute value. So it ends up parsing your example exactly as it would parse

<div title=foo >hello world</div>dillo is great

which gives the same result as vanilla Dillo, but for entirely different reasons.

But Firefox only does that if it can't find the matching quote at all; if you feed it

<div title="foo >hello world</div>dillo is great [... repeat 'dillo is great' 10000 times ...]</div><div title="bar">

then it matches the second double quote with the first and *all* the text disappears. Which is exactly what HTML5 says it should do. Of course vanilla Dillo does *better* than Firefox for this example, but in the real world I think it does *worse*. JavaScript fragments that confound Dillo's algorithm are far more common than examples such as the above that it handles well.

OK, here's a new proposal: when parsing quoted attribute values, let's copy Firefox! That would: (a) sensibly handle the missing quotes examples that people have suggested (which my proposed patch does not do), (b) handle well-formed JavaScript fragments correctly (which vanilla Dillo does not do), (c) parse well-formed HTML5 as per the HTML5 specification, (d) conform to Firefox's established practice, and (e) not break Reddit! That's 5 wins!

It's true that we can't expect people to fix their HTML just because the HTML5 specification says it's broken. And it's even less likely that they will fix it just because it breaks in Dillo. But it is very likely that they will fix it if it breaks in Firefox, so copying Firefox is a good idea, even if you don't care about the HTML5 specification.

And, why should we care about edge cases that vanilla Dillo handles better than Firefox, since those are precisely the cases that people will fix to keep their Firefox users happy and that we can therefore expect *not* to see! There's no point in having an algorithm that in theory is better than Firefox's, because in practice it's not.

So, why not just copy Firefox? I can't see any downside.

Regards,
Jeremy Henty
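The "copy Firefox" behaviour proposed in this message can be sketched in a few lines of Python. This is a hedged illustration of the behaviour as described in the thread, not Firefox's or Dillo's actual code; the name `parse_attr_value` is hypothetical, and the unquoted-value rule used here (stop at whitespace or `>`) is a simplifying assumption.

```python
def parse_attr_value(text, pos):
    """Return (value, next_pos) for the attribute value starting at text[pos].

    Hypothetical sketch: a quoted value runs to the matching quote; if no
    matching quote exists before EOF, the opening quote is discarded and
    the value is reparsed as an unquoted one.
    """
    if pos < len(text) and text[pos] in "\"'":
        end = text.find(text[pos], pos + 1)
        if end != -1:
            return text[pos + 1:end], end + 1
        pos += 1  # fallback: drop the unmatched opening quote
    # Unquoted value (simplified): stop at whitespace or '>'.
    end = pos
    while end < len(text) and not text[end].isspace() and text[end] != ">":
        end += 1
    return text[pos:end], end


if __name__ == "__main__":
    broken = '<div title="foo >hello world</div>dillo is great'
    start = broken.index("=") + 1
    print(parse_attr_value(broken, start))  # fallback path: value is 'foo'

    # A later quote anywhere in the document is taken as the match, so the
    # intervening text is swallowed into the attribute value.
    swallowed = broken + '</div><div title="bar">'
    print(parse_attr_value(swallowed, start))
```

The two inputs correspond to the examples quoted above: with no closing quote at all the fallback yields `title=foo` and the text stays visible, while a stray later quote makes everything in between part of the attribute value.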

Johannes.Hofmann＠gmx.de

9:58 a.m.

On Wed, Aug 18, 2010 at 12:05:30AM +0100, Jeremy Henty wrote:

...

Johannes Hofmann wrote:

...
But for me firefox 3.6.3 shows something given the following HTML (as does current dillo):

<div title="foo >hello world</div>dillo is great

That's Firefox's workaround that I described in my original post: if it sees EOF while parsing a quoted attribute value (ie. if it *never* sees a matching quote) then it goes back to the opening quote, discards it, and parses an unquoted attribute value. So it ends up parsing your example exactly as it would parse

<div title=foo >hello world</div>dillo is great

which gives the same result as vanilla Dillo, but for entirely different reasons.

But Firefox only does that if it can't find the matching quote at all; if you feed it

<div title="foo >hello world</div>dillo is great [... repeat 'dillo is great' 10000 times ...]</div><div title="bar">

then it matches the second double quote with the first and *all* the text disappears. Which is exactly what HTML5 says it should do. Of course vanilla Dillo does *better* than Firefox for this example, but in the real world I think it does *worse*. JavaScript fragments that confound Dillo's algorithm are far more common than examples such as the above that it handles well.

OK, here's a new proposal: when parsing quoted attribute values, let's copy Firefox! That would: (a) sensibly handle the missing quotes examples that people have suggested (which my proposed patch does not do), (b) handle well-formed JavaScript fragments correctly (which vanilla Dillo does not do), (c) parse well-formed HTML5 as per the HTML5 specification, (d) conform to Firefox's established practice, and (e) not break Reddit! That's 5 wins!

It's true that we can't expect people to fix their HTML just because the HTML5 specification says it's broken. And it's even less likely that they will fix it just because it breaks in Dillo. But it is very likely that they will fix it if it breaks in Firefox, so copying Firefox is a good idea, even if you don't care about the HTML5 specification.

And, why should we care about edge cases that vanilla Dillo handles better than Firefox, since those are precisely the cases that people will fix to keep their Firefox users happy and that we can therefore expect *not* to see! There's no point in having an algorithm that in theory is better than Firefox's, because in practice it's not.

So, why not just copy Firefox? I can't see any downside.

I agree. Can you adjust the patch? Then I'd like to wait for what Jorge and corvid say, but I think it's best to mimic Firefox.

Regards,
Johannes

onepoint＠starurchin.org

12:15 p.m.

Johannes Hofmann wrote:

...

On Wed, Aug 18, 2010 at 12:05:30AM +0100, Jeremy Henty wrote:

...
So, why not just copy Firefox? I can't see any downside.

I agree. Can you adjust the patch? Then I'd like to wait for what Jorge and corvid say, but I think it's best to mimic Firefox.

I'll work on a new patch next week. (I'd like to do it sooner but real life has already alloc()ed the rest of this week.)

Thanks for your comments,
Jeremy Henty

jcid＠dillo.org

7:45 p.m.

Hi Jeremy,

It's a relief to see how things have evolved in ten years. I'm happily surprised!

In the beginning Mozilla/FF did crazy stuff to make sense out of the most obnoxious tag soup. At that time we only parsed correct content (as from the SPEC) and everybody ended up saying "dillo is broken" (despite the nice warning messages :-). Then we ended up correcting tag soup as much as we could (but kept the nice warnings, which just a few souls cared for). At that moment we followed the "When in doubt, follow FF" motto, which served us well, for the reasons you describe well.

Now it's great to see a sensible turn in a better direction. I've considered this patch at least three times during these ten years, and I know FF behaved differently back then. I even committed the patch once, only to see lots of content disappear from the page, and had to backpedal.

Looking at the attached examples with FF is quite telling. (Please read the comments below.)

On Wed, Aug 18, 2010 at 12:05:30AM +0100, Jeremy Henty wrote:

...

Johannes Hofmann wrote:

...
But for me firefox 3.6.3 shows something given the following HTML (as does current dillo):

<div title="foo >hello world</div>dillo is great

That's Firefox's workaround that I described in my original post: if it sees EOF while parsing a quoted attribute value (ie. if it *never* sees a matching quote) then it goes back to the opening quote, discards it, and parses an unquoted attribute value. So it ends up parsing your example exactly as it would parse

<div title=foo >hello world</div>dillo is great

which gives the same result as vanilla Dillo, but for entirely different reasons.

But Firefox only does that if it can't find the matching quote at all; if you feed it

<div title="foo >hello world</div>dillo is great [... repeat 'dillo is great' 10000 times ...]</div><div title="bar">

then it matches the second double quote with the first and *all* the text disappears. Which is exactly what HTML5 says it should do. Of course vanilla Dillo does *better* than Firefox for this example, but in the real world I think it does *worse*. JavaScript fragments that confound Dillo's algorithm are far more common than examples such as the above that it handles well.

OK, here's a new proposal: when parsing quoted attribute values, let's copy Firefox! That would: (a) sensibly handle the missing quotes examples that people have suggested (which my proposed patch does not do), (b) handle well-formed JavaScript fragments correctly (which vanilla Dillo does not do), (c) parse well-formed HTML5 as per the HTML5 specification, (d) conform to Firefox's established practice, and (e) not break Reddit! That's 5 wins!

Agreed. I'll be looking forward to the patch.

...

It's true that we can't expect people to fix their HTML just because the HTML5 specification says it's broken. And it's even less likely that they will fix it just because it breaks in Dillo. But it is very likely that they will fix it if it breaks in Firefox, so copying Firefox is a good idea, even if you don't care about the HTML5 specification.

And, why should we care about edge cases that vanilla Dillo handles better than Firefox, since those are precisely the cases that people will fix to keep their Firefox users happy and that we can therefore expect *not* to see! There's no point in having an algorithm that in theory is better than Firefox's, because in practice it's not.

So, why not just copy Firefox? I can't see any downside.

Agreed, and that's what we've done for some time now. In the beginning we followed the SPECS, which caused more harm than good to the project in the long term.

FWIW, when HTML was defined as XML, only a few servers provided it with the xhtml MIME type, because it risked not being rendered if incorrect, which didn't happen when served as "tag soup". The W3C and WDG, among others, built nice validators which were not used to correct the state of the web. Business logic prevailed: why waste resources on 1-2% of the market.

As you see, we've had to sail with the tides.

--
Cheers
Jorge.-
