Hi,

On Sun, May 04, 2014 at 12:34:24AM +0000, eocene wrote:
I was looking at how badly dillo handles something like:
<a href="http://www.dillo.org?asdf&copy=3&micro=zxcv">link</a>
It becomes a much more common problem with html5, which has a _lot_ more character references.
I could perhaps stick an argument on the Html_parse_entity() in Html_get_attr2(), telling it to insist upon finding a ';'.
If we still had cvs.auriga, I could dig through prehistory and try to see whether not demanding ';' termination was initially done with the strong belief that it was for the best overall (or maybe it was even inherited from gzilla), but we don't have cvs.auriga, and we don't have mailing list search working (not that that's generally very fun to dig through in any case). After all, maybe we should always insist upon proper termination.
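To make the proposal above concrete, here is a minimal sketch in C of an entity lookup that takes a flag for insisting on the ';'. This is not dillo's actual Html_parse_entity(); the function name parse_entity, the require_semicolon parameter, and the three-entry table are all hypothetical and only illustrate the idea.

```c
#include <stddef.h>
#include <string.h>
#include <ctype.h>

/* Hypothetical, tiny entity table; a real parser would carry the full list. */
struct entity { const char *name; int code; };
static const struct entity entities[] = {
   { "amp", 38 }, { "copy", 169 }, { "micro", 181 },
};

/*
 * Try to parse a named character reference at s (s must point at '&').
 * Returns the code point and stores the number of consumed bytes in *len,
 * or returns -1 if no reference is recognized (*len is left untouched).
 * If require_semicolon is nonzero (an assumed flag, not dillo's API),
 * a known name that is not followed by ';' is rejected.
 */
static int parse_entity(const char *s, int require_semicolon, size_t *len)
{
   size_t n = 0, i;

   if (s[0] != '&')
      return -1;
   /* Collect the candidate name: a run of alphanumerics after '&'. */
   while (isalnum((unsigned char)s[1 + n]))
      n++;
   for (i = 0; i < sizeof(entities) / sizeof(entities[0]); i++) {
      if (strlen(entities[i].name) == n &&
          strncmp(s + 1, entities[i].name, n) == 0) {
         int terminated = (s[1 + n] == ';');
         if (require_semicolon && !terminated)
            return -1;
         *len = 1 + n + (terminated ? 1 : 0);
         return entities[i].code;
      }
   }
   return -1;
}
```

With the flag off, "&copy=3" inside an href is decoded as the copyright sign; with the flag on, the same text is left alone and only "&copy;" is decoded. An attribute parser could pass the flag as nonzero while text content keeps the lenient behaviour.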
These heuristics are not simple. AFAIR the original routine was written to require the trailing ';', and it worked well for some time. Then more pages started to show unterminated entities, and it got so annoying that we decided to make it more flexible and not require the ';' when the entity name was found (IIRC).

It'd be good to find the reason for the change before reverting it. I don't remember it now, but I do remember it was because the other way started to be perceived as worse in some sense. Maybe GMANE has the mailing list archives... (A similar situation arises with the question of e.g. allowing H1 inside the A element.)

A bit of history: in the very beginning Dillo had strict parsing. The motto was not to try to fix bad HTML. After a few years dillo became more and more annoying to use (tag soup and HTML violations were not fixed), and "tag soup" pages looked really bad in it (hence the bug meter). At some point we had to change the policy, because it was a lost war and dillo was becoming more and more unusable/irrelevant.

At this point our policy is more or less: we try to render tag soup and use heuristics to do a good job of correcting common problems, but we haven't given up on informing the user/author of all the HTML errors we find in the page.

-- 
Cheers
Jorge.-