Hi Jeremy,

It's a relief to see how things have evolved in ten years. I'm happily surprised!

In the beginning Mozilla/FF did crazy stuff to make sense of the most obnoxious tag soup. At that time we only parsed correct content (as per the SPEC) and everybody ended up saying "dillo is broken" (despite the nice warning messages :-). Then we ended up correcting tag soup as much as we could (but kept the nice warnings, which just a few souls cared for). At that point we followed the "When in doubt, follow FF" motto, which served us well, for the reasons you describe.

Now it's great to see a sensible turn in a better direction.

I've considered this patch at least three times during these ten years, and I know FF behaved differently back then. I even committed the patch once, only to see lots of content disappear from the page, and had to backpedal.

Looking at the attached examples with FF is quite telling.

(Please read the comments below.)

On Wed, Aug 18, 2010 at 12:05:30AM +0100, Jeremy Henty wrote:
Johannes Hofmann wrote:
But for me firefox 3.6.3 shows something given the following HTML (as does current dillo):
<div title="foo >hello world</div>dillo is great
That's Firefox's workaround that I described in my original post: if it sees EOF while parsing a quoted attribute value (i.e. if it *never* sees a matching quote) then it goes back to the opening quote, discards it, and parses an unquoted attribute value. So it ends up parsing your example exactly as it would parse
<div title=foo >hello world</div>dillo is great
which gives the same result as vanilla Dillo, but for entirely different reasons.
But Firefox only does that if it can't find the matching quote at all; if you feed it
<div title="foo >hello world</div>dillo is great [... repeat 'dillo is great' 10000 times ...]</div><div title="bar">
then it matches the second double quote with the first and *all* the text disappears. Which is exactly what HTML5 says it should do. Of course vanilla Dillo does *better* than Firefox for this example, but in the real world I think it does *worse*. JavaScript fragments that confound Dillo's algorithm are far more common than examples such as the above that it handles well.
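The behaviour described above can be sketched roughly as follows. This is only an illustration of the rule as stated in this message, not Firefox's or Dillo's actual code; the function name `parse_attr_value` and its signature are made up for the example, and the unquoted-value rule is simplified to "stop at whitespace or `>`".

```python
def parse_attr_value(text, pos):
    """Parse an attribute value starting at text[pos] (just after '=').

    Hypothetical helper: returns (value, index just past the value).
    """
    if pos < len(text) and text[pos] in "\"'":
        quote = text[pos]
        end = text.find(quote, pos + 1)
        if end != -1:
            # A matching quote exists somewhere before EOF: everything
            # between the two quotes is the value, markup included.
            return text[pos + 1:end], end + 1
        # EOF reached without a matching quote: discard the opening
        # quote and fall through to parse the value as unquoted.
        pos += 1
    # Unquoted value (simplified): runs until whitespace or '>'.
    end = pos
    while end < len(text) and not text[end].isspace() and text[end] != ">":
        end += 1
    return text[pos:end], end


# First example from the thread: no closing quote anywhere in the input.
doc1 = '<div title="foo >hello world</div>dillo is great'
value1, next1 = parse_attr_value(doc1, doc1.index("=") + 1)

# Second example: a later quote exists, so it swallows the text in between.
doc2 = ('<div title="foo >hello world</div>'
        + 'dillo is great ' * 3
        + '</div><div title="bar">')
value2, next2 = parse_attr_value(doc2, doc2.index("=") + 1)
```

With the first input the value comes out as `foo`, as if the quote had never been there; with the second, the value runs all the way to the quote in front of `bar`, which is why the text disappears.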
OK, here's a new proposal: when parsing quoted attribute values, let's copy Firefox! That would: (a) sensibly handle the missing quotes examples that people have suggested (which my proposed patch does not do), (b) handle well-formed JavaScript fragments correctly (which vanilla Dillo does not do), (c) parse well-formed HTML5 as per the HTML5 specification, (d) conform to Firefox's established practice, and (e) not break Reddit! That's 5 wins!
Agreed. I'll be looking forward to the patch.
It's true that we can't expect people to fix their HTML just because the HTML5 specification says it's broken. And it's even less likely that they will fix it just because it breaks in Dillo. But it is very likely that they will fix it if it breaks in Firefox, so copying Firefox is a good idea, even if you don't care about the HTML5 specification.
And why should we care about edge cases that vanilla Dillo handles better than Firefox, since those are precisely the cases that people will fix to keep their Firefox users happy and that we can therefore expect *not* to see? There's no point in having an algorithm that in theory is better than Firefox's, because in practice it's not.
So, why not just copy Firefox? I can't see any downside.
Agreed, and that's what we've done for some time now. In the beginning we followed the SPECS, which caused more harm than good to the project in the long term.

FWIW, when HTML was defined as XML, only a few servers provided it with the xhtml MIME type, because it risked not being rendered if incorrect, which didn't happen when served as "tag soup". The W3C and WDG, among others, built nice validators, which were not used to correct the state of the web. Business logic prevailed: why waste resources on 1-2% of the market?

As you see, we've had to sail with the tides.

--
Cheers
Jorge.-