On Sun, May 04, 2014 at 09:09:48PM +0000, eocene wrote:
I wrote:
Jorge wrote:
AFAIR the original routine was written to require the trailing ';' and it worked well for some time. Then more pages started to show unterminated entities inside, and it got so annoying we decided to make it more flexible and not to require the ';' when the entity name was found (IIRC).
Yeah, this is why I was considering just changing the get_attr case, but of course I don't want to make the code messy and complicated unless I need to.
It'd be good to find the reason for the change before reverting it. I don't remember it now, but I do remember it was because the other way started to be perceived as worse in some sense.
Maybe GMANE has the mailing list archives...
I guess I'll put some time into digging around.
http://lists.dillo.org/pipermail/dillo-dev/2005-January/002502.html
where we get the end of a conversation between Jorge and Matthias Franz.
This msg says that it was changed because it wasn't required under certain conditions. HTML4 spec gives it as:
Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
...and there's an "IIRC" in the msg that XHTML requires it.
The HTML5 spec requires a terminating ';' in all cases.
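To make the trade-off concrete, here's a minimal sketch in Python. It is not Dillo's actual C code: the entity table, the function name decode_entities, and the sample strings are all made up for illustration, and the URL case is only my guess at the kind of attribute value where lenient decoding could hurt.

```python
import re

# Tiny illustrative table; a real parser would carry the full entity list.
ENTITIES = {"nbsp": "\u00a0", "amp": "&", "lt": "<", "gt": ">", "copy": "\u00a9"}

# '&', an entity name, and an optional trailing ';'.
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def decode_entities(text, require_semicolon):
    """Replace known named entities in text.

    require_semicolon=True:  only '&name;' is decoded.
    require_semicolon=False: '&name' is decoded even without the ';'.
    Unknown names are left untouched in both modes.
    """
    def replace(match):
        name, semicolon = match.group(1), match.group(2)
        if name not in ENTITIES:
            return match.group(0)
        if require_semicolon and not semicolon:
            return match.group(0)
        return ENTITIES[name]

    return _ENTITY_RE.sub(replace, text)


if __name__ == "__main__":
    # Page text with an unterminated NBSP, then a hypothetical href value.
    for sample in ("a&nbsp b", "page?a=1&copy=2"):
        print(repr(decode_entities(sample, require_semicolon=False)))
        print(repr(decode_entities(sample, require_semicolon=True)))
```

The lenient mode rescues the unterminated NBSP in page text, but it also rewrites "&copy" inside the hypothetical URL, which is the sort of thing that would argue for being strict in attribute values only.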
Then, it looks like requiring it again in this case may be the way to go (I seem to recall there were lots of unterminated NBSP).

A long long time ago people thought that SGML was the final solution, then XML, then HTML5, now they're looking for an alternative technology to base the web upon...

-- 
Cheers
Jorge.-