Jorge wrote:
On Sun, May 04, 2014 at 12:34:24AM +0000, eocene wrote:
I was looking at how badly dillo handles something like:
<a href="http://www.dillo.org?asdf&copy=3&micro=zxcv">link</a>
It becomes a much more common problem with html5, which has a _lot_ more character references.
I could perhaps stick an argument on the Html_parse_entity() in Html_get_attr2(), telling it to insist upon finding a ';'.
If we still had cvs.auriga, I could dig through prehistory and try to see whether not demanding ';' termination was initially done with the strong belief that it was for the best overall (or maybe it was even inherited from gzilla), but we don't have cvs.auriga, and we don't have mailing list search working (not that that's generally very fun to dig through in any case). After all, maybe we should always insist upon proper termination.
These heuristics are not simple.
AFAIR the original routine was written to require the trailing ';' and it worked well for some time. Then more pages started to show unterminated entities inside, and it got so annoying we decided to make it more flexible and not to require the ';' when the entity name was found (IIRC).
Yeah, this is why I was considering just changing the get_attr case, but of course I don't want to make the code messy and complicated unless I need to.
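For illustration only, here is a minimal, self-contained sketch of the idea of passing a "require ';'" flag to the entity parser. It is not dillo's actual code: parse_entity(), decode(), the three-entry table, and the require_semicolon parameter are all hypothetical names made up for this example, and the real Html_parse_entity() has a different signature and a far larger table.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, tiny entity table; a real parser knows many more names. */
struct entity { const char *name; const char *utf8; };

static const struct entity entities[] = {
   { "amp",   "&" },
   { "copy",  "\xC2\xA9" },   /* U+00A9 COPYRIGHT SIGN */
   { "micro", "\xC2\xB5" },   /* U+00B5 MICRO SIGN */
};

/*
 * Try to decode a named character reference at s (s[0] must be '&').
 * If require_semicolon is nonzero, the name must be followed by ';'.
 * On success, return the UTF-8 replacement and store the number of
 * input bytes consumed in *len; otherwise return NULL.
 */
static const char *parse_entity(const char *s, int require_semicolon,
                                size_t *len)
{
   size_t n = 0, i;

   if (s[0] != '&')
      return NULL;
   while (isalnum((unsigned char)s[1 + n]))
      n++;
   if (n == 0)
      return NULL;
   if (require_semicolon && s[1 + n] != ';')
      return NULL;

   for (i = 0; i < sizeof(entities) / sizeof(entities[0]); i++) {
      if (strlen(entities[i].name) == n &&
          strncmp(s + 1, entities[i].name, n) == 0) {
         /* Consume '&', the name, and the ';' if one is present. */
         *len = 1 + n + (s[1 + n] == ';' ? 1 : 0);
         return entities[i].utf8;
      }
   }
   return NULL;
}

/* Copy src to dst (capacity cap), decoding entities along the way. */
static void decode(const char *src, int require_semicolon,
                   char *dst, size_t cap)
{
   size_t out = 0;

   assert(cap > 0);
   while (*src && out + 1 < cap) {
      size_t len = 0;
      const char *rep = (*src == '&')
         ? parse_entity(src, require_semicolon, &len) : NULL;

      if (rep && out + strlen(rep) < cap) {
         memcpy(dst + out, rep, strlen(rep));
         out += strlen(rep);
         src += len;
      } else {
         dst[out++] = *src++;   /* not an entity: copy the byte as-is */
      }
   }
   dst[out] = '\0';
}
```

With this sketch, an attribute caller would pass require_semicolon = 1, so the query string in the href above is left untouched, while a body-text caller could keep passing 0 and still get the lenient behaviour for unterminated entities.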
It'd be good to find the reason for the change before reverting it. I don't remember it now, but I do remember it was because the other way started to be perceived as worse in some sense.
Maybe GMANE has the mailing list archives...
I guess I'll put some time into digging around. Maybe it happens out there and I just don't hear about it, but I wonder why projects don't tend to keep track -- in some organized fashion by topic, like in a wiki or group of static web pages or something -- of all the decisions made on various issues and the reasoning behind them, since it's hard to remember details for years, people come and go, etc.