[Dillo-dev] Quoted attribute parsing: summary

Aug. 16, 2010


On Mon, Aug 16, 2010 at 07:21:02AM +0100, Jeremy Henty wrote:
...
Prompted by some private conversation with corvid, I've been digging
through specs and source code to see what the state of play is.
The HTML5 specification[1] states that the user agent should consume
text, converting character references, until it finds the matching
close quote. If there is no matching close quote (i.e. it sees an EOF
first) then it terminates (strictly speaking, it switches to the data
state and reconsumes the EOF, which makes it emit an EOF token).
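As a rough illustration of the rule described above, here is a minimal
Python sketch. The function name parse_quoted_value is hypothetical and
is taken from neither Dillo nor the specification, and html.unescape
only approximates the spec's character reference handling inside
attribute values.

```python
import html

def parse_quoted_value(text, start):
    """Consume a quoted attribute value whose open quote is text[start].

    Returns (value, next_index). If EOF is reached before the matching
    close quote, returns (None, len(text)): no attribute value is
    produced and the input is exhausted.
    """
    quote = text[start]
    assert quote in ('"', "'")
    # Everything up to the matching close quote belongs to the value,
    # including any '>' characters.
    end = text.find(quote, start + 1)
    if end == -1:
        # EOF first: parsing terminates here (assumption: we model the
        # spec's "emit EOF" outcome simply as "no value").
        return None, len(text)
    # Approximation: decode character references with html.unescape.
    return html.unescape(text[start + 1:end]), end + 1
```

Note that the value is never inspected for "suspicious" content: a '>'
inside the quotes is simply part of the value.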
Taking out Dillo's bogus attribute value detection as I proposed would
make Dillo parse quoted attribute values as per the HTML5 spec.
The Hubbub HTML parser library[2] parses quoted attribute values as
per the HTML5 spec.
Firefox parses quoted attribute values as per the HTML5 spec *except*
that if it sees an EOF then it backs up to the open quote, discards
it, then reparses as though it were expecting an unquoted attribute
value. Otherwise (i.e. if it finds the matching close quote) it makes
no attempt to detect a broken attribute value, no matter what content
the attribute value has swallowed up.
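The fallback described above could be sketched as follows. This is
only a Python model of the behaviour as summarised in this message,
not Firefox's actual code; the function name is hypothetical, and the
unquoted value is assumed to end at whitespace or '>'.

```python
import html

def parse_value_with_fallback(text, start):
    """Parse the attribute value whose open quote is text[start].

    If a matching close quote exists, everything between the quotes is
    the value. Otherwise back up to the open quote, discard it, and
    reparse the rest as an unquoted value.
    Returns (value, next_index).
    """
    quote = text[start]
    end = text.find(quote, start + 1)
    if end != -1:
        # Close quote found: no attempt to detect a broken value.
        return html.unescape(text[start + 1:end]), end + 1
    # EOF seen first: skip the open quote and read an unquoted value
    # (assumption: it stops at whitespace or '>').
    stop = start + 1
    while stop < len(text) and not text[stop].isspace() and text[stop] != '>':
        stop += 1
    return html.unescape(text[start + 1:stop]), stop
```

For example, the unterminated input "foo>bar yields the value foo, with
parsing resuming at the '>'.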
So it seems that the world at large has given up on trying to detect
and correct broken attribute values.

I'd agree that we should not compromise the display of correct HTML
in order to cope with buggy HTML.
But are the '>' characters in the attribute value in the reddit page
actually valid?
The HTML validators at least warn about them.

Cheers,
Johannes

Johannes.Hofmann＠gmx.de