[Dillo-dev] Quoted attribute parsing: summary

Aug. 18, 2010

      Hi Jeremy,

  It's  a relief to see how things have evolved in ten years. I'm
happily surprised!

  In the beginning Mozilla/FF did crazy stuff to make sense out
of the most obnoxious tag soup. At that time we only parsed
correct content (as per the SPEC) and everybody ended up saying
"dillo is broken" (despite the nice warning messages :-).
  Then we ended up correcting tag soup as much as we could (but
kept the nice warnings, which just a few souls cared for).
  At that point we followed the "When in doubt, follow FF" motto,
which served us well, for the reasons you describe.
  Now it's great to see a sensible turn in a better direction.

  I've considered this patch at least three times during these
ten years, and I know FF behaved differently back then. I even
committed the patch once, only to see lots of content disappear
from the page, and had to backpedal.

  Looking at the attached examples with FF is quite telling.

  (please read the comments below).

On Wed, Aug 18, 2010 at 12:05:30AM +0100, Jeremy Henty wrote:
...
Johannes Hofmann wrote:
...
But for  me firefox 3.6.3  shows something given the  following HTML
(as does current dillo):
<div title="foo >hello world</div>dillo is great
That's Firefox's workaround  that I described in my  original post: if
it sees EOF while parsing a quoted attribute value (ie.  if it *never*
sees  a  matching quote)  then  it goes  back  to  the opening  quote,
discards it,  and parses an unquoted  attribute value.  So  it ends up
parsing your example exactly as it would parse
<div title=foo >hello world</div>dillo is great
which  gives  the same  result  as  vanilla  Dillo, but  for  entirely
different reasons.
But Firefox only does that if it can't find the matching quote at all;
if you feed it
<div title="foo >hello world</div>dillo is great [... repeat
    'dillo is great' 10000 times ...]</div><div title="bar">
then it matches  the second double quote with the  first and *all* the
text disappears.  Which  is exactly what HTML5 says  it should do.  Of
course vanilla Dillo does *better*  than Firefox for this example, but
in the real world I  think it does *worse*.  JavaScript fragments that
confound Dillo's algorithm  are far more common than  examples such as
the above that it handles well.
OK, here's a new proposal: when parsing quoted attribute values, let's
copy  Firefox!  That  would: (a)  sensibly handle  the  missing quotes
examples that people have suggested  (which my proposed patch does not
do),  (b)  handle well-formed  JavaScript  fragments correctly  (which
vanilla Dillo  does not  do), (c) parse  well-formed HTML5 as  per the
HTML5  specification, (d) conform  to Firefox's  established practice,
and (e) not break Reddit!  That's 5 wins!
Agreed.
  I'll be looking forward to the patch.
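
  For reference, the Firefox behaviour described above can be
sketched roughly as follows. This is only an illustration in Python,
not the Dillo patch and not Firefox's actual code: the function name
parse_attr_value and its (value, next_pos) return shape are
hypothetical, and the unquoted-value rule is simplified to "stop at
whitespace or '>'".

```python
def parse_attr_value(html, pos):
    """Parse an attribute value starting at html[pos] (just after '=').

    Hypothetical helper for illustration; returns (value, next_pos).
    """
    def parse_unquoted(start):
        # Simplified rule: an unquoted value ends at whitespace or '>'.
        end = start
        while end < len(html) and not html[end].isspace() and html[end] != ">":
            end += 1
        return html[start:end], end

    if pos < len(html) and html[pos] in "\"'":
        quote = html[pos]
        close = html.find(quote, pos + 1)
        if close != -1:
            # A matching quote exists somewhere before EOF: everything
            # in between is the value, even if it contains '>' or tags.
            return html[pos + 1:close], close + 1
        # EOF reached without a matching quote: discard the opening
        # quote and reparse the rest as an unquoted value.
        return parse_unquoted(pos + 1)
    return parse_unquoted(pos)


# The two cases from the thread (the repeat count is shortened here).
no_close = '<div title="foo >hello world</div>dillo is great'
late_close = ('<div title="foo >hello world</div>'
              + 'dillo is great' * 3
              + '</div><div title="bar">')

value_a, _ = parse_attr_value(no_close, no_close.index("=") + 1)
value_b, _ = parse_attr_value(late_close, late_close.index("=") + 1)
```

  In the first case the value falls back to the unquoted parse, so
the text after the tag is still rendered. In the second case the
value runs up to the quote that opens "bar", so all the text in
between is swallowed into the attribute.
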
...
It's true that  we can't expect people to fix  their HTML just because
the HTML5 specification  says it's broken.  And it's  even less likely
that they will fix it just because it breaks in Dillo.  But it is very
likely  that they  will fix  it if  it breaks  in Firefox,  so copying
Firefox  is a  good  idea, even  if  you don't  care  about the  HTML5
specification.
And, why  should we care about  edge cases that  vanilla Dillo handles
better than Firefox,  since those are precisely the  cases that people
will fix to  keep their Firefox users happy and  that we can therefore
expect *not* to see!  There's no  point in having an algorithm that in
theory is better than Firefox's, because in practice it's not.
So, why not just copy Firefox?  I can't see any downside.
Agreed, and that's what we've done for some time now.

  In  the beginning we followed the SPECS, which caused more harm
than good to the project in the long term.

  FWIW, when HTML was defined as XML (XHTML), only a few servers
served it with the XHTML MIME type (application/xhtml+xml), because
it risked not being rendered at all if incorrect, which didn't
happen when served as "tag soup".

  The W3C and WDG, among others, built nice validators, which
were not used to correct the state of the web. Business logic
prevailed: why waste resources on 1-2% of the market?

  As you see, we've had to sail with the tides.

-- 
  Cheers
  Jorge.-

jcid＠dillo.org