On Mon, Aug 16, 2010 at 07:21:02AM +0100, Jeremy Henty wrote:
Prompted by a private conversation with corvid, I've been digging through specs and source code to see what the state of play is.
The HTML5 specification[1] states that the user agent should consume text, converting character references as it goes, until it finds the matching close quote. If there is no matching close quote (i.e. it sees an EOF first) then it terminates (strictly speaking, it switches to the data state and reconsumes the EOF, which makes it emit an EOF token).
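To make the rule above concrete, here is a rough Python sketch of it. It is not code from Dillo, Hubbub or the spec: the function name and return shape are made up for illustration, and it uses html.unescape as a stand-in for the spec's character reference handling, which has extra rules inside attributes that this ignores.

```python
import html


def parse_quoted_value(text, pos):
    """Parse a quoted attribute value whose opening quote is at text[pos].

    Hypothetical helper. Returns (value, next_pos, hit_eof). On EOF the
    value is None, because tokenizing simply ends at that point.
    """
    quote = text[pos]
    if quote not in ('"', "'"):
        raise ValueError("expected an opening quote")

    # Everything up to the matching close quote belongs to the value,
    # including '<', '>' and the other kind of quote.
    end = text.find(quote, pos + 1)
    if end == -1:
        # No matching close quote before EOF: give up.
        return None, len(text), True

    # Simplification: decode character references in one pass.
    return html.unescape(text[pos + 1:end]), end + 1, False


print(parse_quoted_value('"a &amp; b > c" rest', 0))
print(parse_quoted_value('"never closed > <p>', 0))
```

Note that nothing in the loop inspects what the value contains, which is the point: a '>' inside the quotes is just another character.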
Taking out Dillo's bogus attribute value detection as I proposed would make Dillo parse quoted attribute values as per the HTML5 spec.
The Hubbub HTML parser library[2] parses quoted attribute values as per the HTML5 spec.
Firefox parses quoted attribute values as per the HTML5 spec *except* that if it sees an EOF then it backs up to the open quote, discards it, then reparses as though it were expecting an unquoted attribute value. Otherwise (i.e. if it finds the matching close quote) it makes no attempt to detect a broken attribute value, no matter what content the attribute value has swallowed up.
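The Firefox behaviour described above could be sketched roughly like this in Python. This is only an illustration of the description, not a transcription of Firefox's source: the function names are hypothetical, character references are left undecoded for brevity, and the unquoted-value rule is reduced to "stop at whitespace or '>'".

```python
WHITESPACE = " \t\n\f\r"


def parse_unquoted_value(text, pos):
    """Consume an unquoted value: stop at whitespace, '>' or EOF."""
    end = pos
    while end < len(text) and text[end] not in WHITESPACE + ">":
        end += 1
    return text[pos:end], end


def parse_value_with_fallback(text, pos):
    """Parse a quoted value at text[pos], falling back on EOF.

    Hypothetical helper. Returns (value, next_pos).
    """
    quote = text[pos]
    end = text.find(quote, pos + 1)
    if end != -1:
        # Close quote found: accept whatever was swallowed, no sanity checks.
        return text[pos + 1:end], end + 1
    # EOF before the close quote: drop the open quote and reparse the
    # same text as an unquoted value.
    return parse_unquoted_value(text, pos + 1)


print(parse_value_with_fallback('"a>b" c', 0))
print(parse_value_with_fallback('"foo>bar', 0))
```

The fallback only ever triggers at EOF, so a stray quote that happens to find a partner much later in the page still swallows everything in between.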
So it seems that the world at large has given up on trying to detect and correct broken attribute values.
I'd agree that we should not compromise the display of correct HTML in order to cope with buggy HTML. But are the '>' characters in the attribute value in the reddit page actually valid? The HTML validators at least warn about them.

Cheers,
Johannes