Handling of HTML entities without terminating semicolon (bug 1043)
Currently any inline HTML entity is converted if the whole entity name matches up to the next non-alphanumeric character, even if there is no trailing semicolon. For example, foo&lt,bar would be converted, but foo&ltbar wouldn't. This means that an improperly encoded URL like /foo.html?a=b&lang=en does not work, because &lang is translated to a Left Angle Bracket.

But we can't just require all HTML entities to have a terminating semicolon, because that would cause worse behavior on all the broken websites that rely on lazy entity termination.

WebKit has a fairly good solution for this. Their entity list (http://trac.webkit.org/browser/trunk/Source/WebCore/html/parser/HTMLEntityNa...) defines duplicate entries without trailing semicolons for only some entities. By my count, 106 of the 2125 entries have a duplicate semicolon-free definition. Does anybody know how WebKit chose which entities are valid without a semicolon? Is that defined in an RFC somewhere?

Another option would be to require entities to have a terminating semicolon when they're part of a tag attribute. That assumes most pages that encode their tag attributes do it properly. For example, <a href="error.cgi?msg=Unauthorized ."> wouldn't behave as it used to.
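To make the three behaviors discussed here easier to compare, here is a minimal Python sketch. It is not the project's parser or WebKit's code: the `ENTITIES` table, the `LEGACY_NO_SEMICOLON` set, and the `decode` function are all hypothetical names with a handful of illustrative entries, and the matching is simplified to "whole alphanumeric run after the ampersand".

```python
import re

# Hypothetical, tiny entity table for illustration only;
# a real table has on the order of two thousand entries.
ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "copy": "\u00a9", "lang": "\u27e8"}

# Assumed subset of names that may appear without a semicolon (illustrative).
LEGACY_NO_SEMICOLON = {"lt", "gt", "amp", "copy"}

# An ampersand, the full alphanumeric run after it, and an optional semicolon.
_ENTITY_RE = re.compile(r"&([A-Za-z0-9]+)(;?)")


def decode(text, allow_bare=None):
    """Decode named entities in text.

    allow_bare is the set of names that may be decoded without a
    trailing semicolon. None means every known name may (the current
    behavior); an empty set means the semicolon is always required.
    """
    def repl(match):
        name, semicolon = match.group(1), match.group(2)
        if name not in ENTITIES:
            return match.group(0)  # unknown name: leave untouched
        if not semicolon and allow_bare is not None and name not in allow_bare:
            return match.group(0)  # bare form not permitted for this name
        return ENTITIES[name]
    return _ENTITY_RE.sub(repl, text)


url = "/foo.html?a=b&lang=en"
print(decode(url))                       # current rule: &lang is converted
print(decode(url, LEGACY_NO_SEMICOLON))  # restricted list: URL left intact
print(decode("foo&lt,bar", set()))       # attribute rule: semicolon required
```

With `allow_bare=None` the URL is mangled, with the restricted set it survives while foo&lt,bar is still converted, and with an empty set (the attribute option) only properly terminated entities are decoded.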
participants (1): qartis@gmail.com