character references, trailing ';', urls
I was looking at how badly dillo handles something like:
<a href="http://www.dillo.org?asdf&copy=3&micro=zxcv">link</a>
It becomes a much more common problem with html5, which has a _lot_ more character references.
I could perhaps stick an argument on the Html_parse_entity() in Html_get_attr2(), telling it to insist upon finding a ';'.
If we still had cvs.auriga, I could dig through prehistory and try to see whether not demanding ';' termination was initially done with the strong belief that it was for the best overall (or maybe it was even inherited from gzilla), but we don't have cvs.auriga, and we don't have mailing list search working (not that that's generally very fun to dig through in any case). After all, maybe we should always insist upon proper termination.
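A minimal sketch of the "insist upon finding a ';'" idea, for illustration only. This is not Dillo's code: parse_entity(), the require_semicolon flag, and the four-entry table are hypothetical stand-ins, and the sketch assumes an entity name is the whole run of alphanumerics after the '&'.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mini entity table; a real browser knows far more names. */
typedef struct {
   const char *name;
   const char *utf8;   /* UTF-8 replacement text */
} Entity;

static const Entity entities[] = {
   { "amp",   "&" },
   { "copy",  "\xC2\xA9" },
   { "micro", "\xC2\xB5" },
   { "nbsp",  "\xC2\xA0" },
};

/*
 * Try to parse a named character reference starting at s (s[0] == '&').
 * On success, return the replacement text and store the number of input
 * bytes consumed in *consumed. Return NULL when the input should be kept
 * as literal text: unknown name, or no ';' while require_semicolon is set.
 */
const char *parse_entity(const char *s, bool require_semicolon,
                         size_t *consumed)
{
   size_t len = 0, i;

   if (s[0] != '&')
      return NULL;
   /* Assumption: the name is the full alphanumeric run after '&'. */
   while (isalnum((unsigned char)s[1 + len]))
      len++;
   for (i = 0; i < sizeof(entities) / sizeof(entities[0]); i++) {
      if (strlen(entities[i].name) == len &&
          strncmp(s + 1, entities[i].name, len) == 0) {
         bool terminated = (s[1 + len] == ';');

         if (require_semicolon && !terminated)
            return NULL;
         *consumed = 1 + len + (terminated ? 1 : 0);
         return entities[i].utf8;
      }
   }
   return NULL;
}
```

With the query string from the example link, a caller passing require_semicolon = true (the attribute case) would leave "&copy=3" and "&micro=zxcv" untouched, while a caller passing false (the lenient case) would replace them with the copyright and micro signs.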
Hi, On Sun, May 04, 2014 at 12:34:24AM +0000, eocene wrote:
I was looking at how badly dillo handles something like:
<a href="http://www.dillo.org?asdf&copy=3&micro=zxcv">link</a>
It becomes a much more common problem with html5, which has a _lot_ more character references.
I could perhaps stick an argument on the Html_parse_entity() in Html_get_attr2(), telling it to insist upon finding a ';'.
If we still had cvs.auriga, I could dig through prehistory and try to see whether not demanding ';' termination was initially done with the strong belief that it was for the best overall (or maybe it was even inherited from gzilla), but we don't have cvs.auriga, and we don't have mailing list search working (not that that's generally very fun to dig through in any case). After all, maybe we should always insist upon proper termination.
These heuristics are not simple. AFAIR the original routine was written to require the trailing ';' and it worked well for some time. Then more pages started to show unterminated entities inside, and it got so annoying we decided to make it more flexible and not to require the ';' when the entity name was found (IIRC).
It'd be good to find the reason for the change before reverting it. I don't remember it now, but I do remember it was because the other way started to be perceived as worse in some sense. Maybe GMANE has the mailing list archives... (a similar situation happens with the question of e.g. allowing H1 inside the A element).
A bit of history: in the very beginning Dillo had strict parsing. The motto was not to try to fix bad HTML. After a few years dillo became more and more annoying (tag soup or HTML violations were not fixed), and the "Tag soup" pages looked really bad in it (hence the bug meter). At some point we had to change the policy because it was a lost war and dillo was becoming more and more unusable/irrelevant.
At this point our policy is more or less: we try to render tag soup and use heuristics to do a good job of correcting the usual problems, but we haven't given up on informing the user/author of all the HTML errors we find in the page.
-- Cheers Jorge.-
Jorge wrote:
On Sun, May 04, 2014 at 12:34:24AM +0000, eocene wrote:
I was looking at how badly dillo handles something like:
<a href="http://www.dillo.org?asdf&copy=3&micro=zxcv">link</a>
It becomes a much more common problem with html5, which has a _lot_ more character references.
I could perhaps stick an argument on the Html_parse_entity() in Html_get_attr2(), telling it to insist upon finding a ';'.
If we still had cvs.auriga, I could dig through prehistory and try to see whether not demanding ';' termination was initially done with the strong belief that it was for the best overall (or maybe it was even inherited from gzilla), but we don't have cvs.auriga, and we don't have mailing list search working (not that that's generally very fun to dig through in any case). After all, maybe we should always insist upon proper termination.
These heuristics are not simple.
AFAIR the original routine was written to require the trailing ';' and it worked well for some time. Then more pages started to show unterminated entities inside, and it got so annoying we decided to make it more flexible and not to require the ';' when the entity name was found (IIRC).
Yeah, this is why I was considering just changing the get_attr case, but of course I don't want to make the code messy and complicated unless I need to.
It'd be good to find the reason for the change before reverting it. I don't remember it now, but I do remember it was because the other way started to be perceived as worse in some sense.
Maybe GMANE has the mailing list archives...
I guess I'll put some time into digging around. Maybe it happens out there and I just don't hear about it, but I wonder why projects don't tend to keep track -- in some organized fashion by topic, like in a wiki or group of static web pages or something -- all of the decisions made on various issues and the reasoning surrounding them, since it's hard to remember details for years, people come and go, etc.
I wrote:
Jorge wrote:
AFAIR the original routine was written to require the trailing ';' and it worked well for some time. Then more pages started to show unterminated entities inside, and it got so annoying we decided to make it more flexible and not to require the ';' when the entity name was found (IIRC).
Yeah, this is why I was considering just changing the get_attr case, but of course I don't want to make the code messy and complicated unless I need to.
It'd be good to find the reason for the change before reverting it. I don't remember it now, but I do remember it was because the other way started to be perceived as worse in some sense.
Maybe GMANE has the mailing list archives...
I guess I'll put some time into digging around.
http://lists.dillo.org/pipermail/dillo-dev/2005-January/002502.html
where we get the end of a conversation between Jorge and Matthias Franz.
This msg says that it was changed because it wasn't required under certain conditions. HTML4 spec gives it as:
Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
...and there's an "IIRC" in the msg that XHTML requires it.
The HTML5 spec requires a terminating ';' in all cases.
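One possible reading of the HTML4 note, sketched as a predicate on the character that follows the reference name. The function name is hypothetical, and treating end of input like a line break is an assumption, not something the note states.

```c
#include <stdbool.h>

/*
 * next is the character immediately after the entity name.
 * Per the HTML4 note, the ';' may be dropped at a line break or
 * immediately before a tag, but not in the middle of a word.
 * Assumption: end of input ('\0') is treated like a line break.
 */
bool semicolon_may_be_omitted(char next)
{
   return next == '\n' || next == '\r' || next == '<' || next == '\0';
}
```

Under this reading the '=' that follows "&copy" in the example URL does not qualify, so the reference there would not count as validly unterminated.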
On Sun, May 04, 2014 at 09:09:48PM +0000, eocene wrote:
I wrote:
Jorge wrote:
AFAIR the original routine was written to require the trailing ';' and it worked well for some time. Then more pages started to show unterminated entities inside, and it got so annoying we decided to make it more flexible and not to require the ';' when the entity name was found (IIRC).
Yeah, this is why I was considering just changing the get_attr case, but of course I don't want to make the code messy and complicated unless I need to.
It'd be good to find the reason for the change before reverting it. I don't remember it now, but I do remember it was because the other way started to be perceived as worse in some sense.
Maybe GMANE has the mailing list archives...
I guess I'll put some time into digging around.
http://lists.dillo.org/pipermail/dillo-dev/2005-January/002502.html
where we get the end of a conversation between Jorge and Matthias Franz.
This msg says that it was changed because it wasn't required under certain conditions. HTML4 spec gives it as:
Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
...and there's an "IIRC" in the msg that XHTML requires it.
The HTML5 spec requires a terminating ';' in all cases.
Then, it looks like requiring it again in this case may be the way to go (I seem to recall there were lots of unterminated NBSP).
A long long time ago people thought that SGML was the final solution, then XML, then HTML5, now they're looking for an alternative technology to base the web upon...
-- Cheers Jorge.-
Jorge wrote:
On Sun, May 04, 2014 at 09:09:48PM +0000, eocene wrote:
This msg says that it was changed because it wasn't required under certain conditions. HTML4 spec gives it as:
Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
...and there's an "IIRC" in the msg that XHTML requires it.
The HTML5 spec requires a terminating ';' in all cases.
Then, it looks like requiring it again in this case may be the way to go (I seem to recall there were lots of unterminated NBSP).
Are you saying always for html5, (probably) always for xhtml, and for attributes with html4?
A long long time ago people thought that SGML was the final solution, then XML, then HTML5, now they're looking for an alternative technology to base the web upon...
Where have they been talking about an alternative technology?
On Mon, May 05, 2014 at 12:54:06AM +0000, eocene wrote:
Jorge wrote:
On Sun, May 04, 2014 at 09:09:48PM +0000, eocene wrote:
This msg says that it was changed because it wasn't required under certain conditions. HTML4 spec gives it as:
Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
...and there's an "IIRC" in the msg that XHTML requires it.
The HTML5 spec requires a terminating ';' in all cases.
Then, it looks like requiring it again in this case may be the way to go (I seem to recall there were lots of unterminated NBSP).
Are you saying always for html5, (probably) always for xhtml, and for attributes with html4?
I'm saying we should find a simple heuristic that copes with the current situation.
A long long time ago people thought that SGML was the final solution, then XML, then HTML5, now they're looking for an alternative technology to base the web upon...
Where have they been talking about an alternative technology?
I remember reading somewhere in the news, a short while ago, that there were funds and a call for people with expertise to work on designing an alternative technology for the web (to try to tackle the enormous complexity that full-blown browsers have taken on, not to mention the disparate user experience this creates).
-- Cheers Jorge.-
Jorge wrote:
On Mon, May 05, 2014 at 12:54:06AM +0000, eocene wrote:
Jorge wrote:
On Sun, May 04, 2014 at 09:09:48PM +0000, eocene wrote:
This msg says that it was changed because it wasn't required under certain conditions. HTML4 spec gives it as:
Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
...and there's an "IIRC" in the msg that XHTML requires it.
The HTML5 spec requires a terminating ';' in all cases.
Then, it looks like requiring it again in this case may be the way to go (I seem to recall there were lots of unterminated NBSP).
Are you saying always for html5, (probably) always for xhtml, and for attributes with html4?
I'm saying we should find a simple heuristic that copes with the current situation.
If you want simple, I can just require it unconditionally and find out what happens.
A long long time ago people thought that SGML was the final solution, then XML, then HTML5, now they're looking for an alternative technology to base the web upon...
Where have they been talking about an alternative technology?
I remember reading somewhere in the news, a short while ago, that there were funds and a call for people with expertise to work on designing an alternative technology for the web (to try to tackle the enormous complexity that full-blown browsers have taken on, not to mention the disparate user experience this creates).
I wish them luck. HTML5 is the most ridiculous possible document.
On Mon, May 05, 2014 at 02:25:01AM +0000, eocene wrote:
Jorge wrote:
On Mon, May 05, 2014 at 12:54:06AM +0000, eocene wrote:
Jorge wrote:
On Sun, May 04, 2014 at 09:09:48PM +0000, eocene wrote:
This msg says that it was changed because it wasn't required under certain conditions. HTML4 spec gives it as:
Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
...and there's an "IIRC" in the msg that XHTML requires it.
The HTML5 spec requires a terminating ';' in all cases.
Then, it looks like requiring it again in this case may be the way to go (I seem to recall there were lots of unterminated NBSP).
Are you saying always for html5, (probably) always for xhtml, and for attributes with html4?
I'm saying we should find a simple heuristic that copes with the current situation.
If you want simple, I can just require it unconditionally and find out what happens.
Your first suggestion looks quite reasonable. Please try it and run some field tests. I'm currently working on the double imgbuf problem...
-- Cheers Jorge.-
participants (2): eocene@gmx.com, jcid@dillo.org