[Dillo-dev] character references, trailing ';', urls

May 4, 2014

      Hi,

On Sun, May 04, 2014 at 12:34:24AM +0000, eocene wrote:
...
I was looking at how badly dillo handles something like:
<a href="http://www.dillo.org?asdf&copy=3&micro=zxcv">link</a>
It becomes a much more common problem with html5, which has a
_lot_ more character references.
I could perhaps stick an argument on the Html_parse_entity() in
Html_get_attr2(), telling it to insist upon finding a ';'.
If we still had cvs.auriga, I could dig through prehistory and
try to see whether not demanding ';' termination was initially
done with the strong belief that it was for the best overall
(or maybe it was even inherited from gzilla), but we don't have
cvs.auriga, and we don't have mailing list search working (not
that that's generally very fun to dig through in any case).
After all, maybe we should always insist upon proper termination.
These heuristics are not simple.

  AFAIR the original routine was written to require the trailing ';'
and it worked well for some time. Then more pages started to show
unterminated entities inside, and it got so annoying we decided to
make it more flexible and not to require the ';' when the entity
name was found (IIRC).
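  To make the two behaviours concrete, here is a minimal sketch in C.
It is not Dillo's actual code: decode_entities(), lookup_entity(), the
require_semicolon flag and the three-entry table are all hypothetical
names made up for illustration. For simplicity it looks up the whole
alphanumeric run after '&' as the entity name, which is an assumption
and may well differ from what Html_parse_entity() really does.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Tiny illustrative table; a real parser carries the full entity list. */
struct entity { const char *name; const char *utf8; };

static const struct entity ENTITIES[] = {
    { "amp",   "&" },
    { "copy",  "\xC2\xA9" },   /* U+00A9 COPYRIGHT SIGN, UTF-8 encoded */
    { "micro", "\xC2\xB5" },   /* U+00B5 MICRO SIGN, UTF-8 encoded */
};

/* Look up a name of exactly `len` characters; NULL if unknown. */
static const char *lookup_entity(const char *name, size_t len)
{
    for (size_t i = 0; i < sizeof ENTITIES / sizeof ENTITIES[0]; i++) {
        if (strlen(ENTITIES[i].name) == len &&
            strncmp(ENTITIES[i].name, name, len) == 0)
            return ENTITIES[i].utf8;
    }
    return NULL;
}

/*
 * Copy `in` to `out`, expanding named character references.
 * If require_semicolon is nonzero, "&name" without a trailing ';'
 * is copied through literally (the strict behaviour).
 * Returns 0 on success, -1 if `out` is too small.
 */
int decode_entities(const char *in, char *out, size_t outsz,
                    int require_semicolon)
{
    size_t o = 0;

    if (outsz == 0)
        return -1;

    while (*in) {
        const char *repl = NULL;
        size_t n = 0;

        if (*in == '&') {
            /* Hypothetical rule: the name is the alphanumeric run. */
            while (isalnum((unsigned char)in[1 + n]))
                n++;
            if (n > 0 && (in[1 + n] == ';' || !require_semicolon))
                repl = lookup_entity(in + 1, n);
        }

        if (repl) {
            size_t rl = strlen(repl);
            if (o + rl >= outsz)
                return -1;
            memcpy(out + o, repl, rl);
            o += rl;
            in += 1 + n;          /* skip '&' and the name */
            if (*in == ';')
                in++;             /* skip the terminator when present */
        } else {
            if (o + 1 >= outsz)
                return -1;
            out[o++] = *in++;     /* ordinary character, or literal '&' */
        }
    }
    out[o] = '\0';
    return 0;
}
```

  With require_semicolon set to 0, a query string such as
"?asdf&copy=3&micro=zxcv" has its "&copy" and "&micro" replaced by the
copyright and micro signs, which breaks the URL. With the flag set
to 1 the same string passes through unchanged, while "&copy;" is still
expanded.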

  It'd be good to find the reason for the change before reverting it.
I don't remember it now, but I do remember it was because the other way
started to be perceived as worse in some sense.

  Maybe GMANE has the mailing list archives...

  (A similar situation arises with the question of, e.g., allowing H1
inside the A element.)

  A  bit  of  history:  in  the  very  beginning Dillo had strict
parsing.  The  motto  was not to try to fix bad HTML. After a few
years  dillo  became  more  and  more  annoying (tag soup or HTML
violations  were  not  fixed),  and  the  "Tag soup" pages looked
really  bad  in it (hence the bug meter). At some point we had to
change  the  policy  because  it  was  a  lost  war and dillo was
becoming  more  and  more  unusable/irrelevant. At this point our
policy  is  more  or  less:  we  try  to  render tag soup and use
heuristics  to  do  a  good job on correcting usual problems, but
haven't  given  up  on  informing  the user/author of all the HTML
errors we found in the page.

-- 
  Cheers
  Jorge.-

jcid＠dillo.org