Johannes Hofmann wrote:
I'd agree that we should not make compromises displaying correct HTML when trying to deal with buggy HTML. But are the '>' characters in the attribute value in the reddit page actually valid?
Yes, at least according to the HTML5 specification. Indeed, according to that specification, the only possible parse errors while parsing a quoted attribute value are (i) EOF, and (ii) a malformed entity reference. Anything else is valid! I doubt that those '>' characters are valid according to SGML, but the HTML5 specification explicitly states that HTML5 is not an SGML instance. No popular client has ever parsed HTML as an SGML instance and servers have been sending non-SGML-compliant "HTML" since forever. No matter what earlier HTML specifications may have claimed, the practical reality is that HTML has never been a kind of SGML.
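To make the point concrete, here is a minimal sketch of how a quoted attribute value is scanned under the HTML5 rules as I read them: everything up to the matching quote belongs to the value, so '>' is just another character, and running out of input is the only error this sketch detects. The function name `read_quoted_attr_value` is hypothetical and this is not Dillo's code. It also leaves character references such as `&amp;` undecoded, so it does not cover the malformed-entity case.

```python
def read_quoted_attr_value(html, start):
    """Scan a quoted attribute value whose opening quote is at html[start].

    Returns (value, index_just_after_closing_quote).
    Raises ValueError if the input ends before the closing quote.
    Simplification: character references are left undecoded.
    """
    quote = html[start]
    if quote not in ('"', "'"):
        raise ValueError("expected a quote character at the start index")

    # Only the matching quote ends the value; '>' has no special meaning here.
    end = html.find(quote, start + 1)
    if end == -1:
        raise ValueError("EOF inside quoted attribute value")

    return html[start + 1:end], end + 1


html = '<a title="a > b" href="/x">'
value, next_index = read_quoted_attr_value(html, html.index('"'))
print(repr(value))              # 'a > b'
print(repr(html[next_index:]))  # ' href="/x">'
```

A parser that instead treats the first '>' as the end of the tag would cut the `title` value short and misread the rest of the tag as text.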
The HTML validators at least warn about them.
Warning about them is probably a good idea, but that's a different issue from how to handle them. Whatever Dillo should do, its current behaviour (a) does not conform to HTML5, and (b) breaks Reddit.

Of course, there's no reason that Dillo *must* conform to HTML5. Indeed the HTML5 specification is peppered with the lovely phrase "willful violation", meaning "yes we know this breaks someone else's specification but we think it's for the best". So it's fine in principle for Dillo to say "we're going to violate HTML5 because we think it's for the best", but I think that this particular behaviour is a bad idea. It violates the HTML5 standard, it deviates (as far as I can tell) from standard practice, and it breaks an otherwise perfectly compliant website. If we can think of a useful way to willfully violate standards so as to better handle broken HTML then let's do it, but I think Dillo is better off without this particular workaround because it does more harm than good.

NB: HTML5 is still a work in progress. These bug reports show some of the discussion of parsing attribute values:

Bug 9872: "trigger a conformance error when javascript is included in href attribute" (rejected because there are legitimate use cases and, even if it's sometimes abused, it's not the HTML5 specification's role to police its use)
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9872

Bug 9987: "attribute values should be allowed to contain ambiguous ampersands ..." (still new)
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9987

Regards,

Jeremy Henty