---------- Forwarded message ----------
Date: Thu, 13 May 2004 12:33:13 +0300 (EEST)
From: Jukka K. Korpela <jkorpela@cs.tut.fi>
To: Jorge Arellano Cid <jcid@dillo.org>
Subject: Re: Is the null byte allowed in HTML?

On Wed, 12 May 2004, Jorge Arellano Cid wrote:
I haven't yet found out whether the null byte character is allowed in HTML. Can you shed some light on this?
It is not. You could use http://validator.w3.org to check for disallowed characters (it reports "non SGML character number 0"), but the ultimate reference is

a) for HTML 4, the SGML declaration http://www.w3.org/TR/html4/sgml/sgmldecl.html where UNUSED effectively means 'disallowed'
b) for XHTML, the XML specification, see http://www.w3.org/TR/REC-xml/#charsets which says, among other things, that all characters below 9 (HT) are disallowed.

Thanks for a good question - I'm just finalizing a book on XHTML (in Finnish, sorry) and I realized that I had forgotten to discuss the character issue in sufficient detail. (I just realized that various generators may produce data with control characters.)

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
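As an illustration of point b), the character rule in the XML specification can be sketched as a small C predicate. The function name xml_char_allowed is hypothetical and not taken from any browser or library; the ranges follow the Char production of XML 1.0, and the function only classifies a code point that has already been decoded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the code point matches the Char
 * production of XML 1.0:
 *   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 * Everything below 9 (HT), including NUL, is rejected. */
bool xml_char_allowed(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

With this sketch, xml_char_allowed(0) is false while xml_char_allowed(0xE1) is true. Note that it covers only the XHTML case; the HTML 4 rule comes from the SGML declaration and is not modelled here.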
On Thu, May 13, 2004 at 08:08:30AM -0400, Jorge Arellano Cid wrote:
From: Jukka K. Korpela <jkorpela@cs.tut.fi>
On Wed, 12 May 2004, Jorge Arellano Cid wrote:
I haven't yet found out whether the null byte character is allowed in HTML. Can you shed some light on this?
It is not. You could use http://validator.w3.org to check for disallowed characters (it reports "non SGML character number 0"), but the ultimate reference is a) for HTML 4, the SGML declaration http://www.w3.org/TR/html4/sgml/sgmldecl.html where UNUSED effectively means 'disallowed' b) for XHTML, the XML specification, see http://www.w3.org/TR/REC-xml/#charsets which says, among other things, that all characters below 9 (HT) are disallowed.
That seems authoritative.

But for completeness, even though the character isn't allowed, it can appear, and how a browser handles it might matter.

As a specific example, the "rt" issue tracker has a web interface. The version I was evaluating accepted (as I recall) the name "Seán" as part of its input (the non-ascii character there is 0xe1, "LATIN SMALL LETTER A WITH ACUTE"). When it presented that back to the browser, the 0xe1 had become 0x00, which caused oddness in the display on a few browsers.

The content-type was declared as text/html;charset=utf-8. I don't know if that was a part of, or a trigger for, the display problem. But either way, storing or accessing text/html as a C string will not work in (broken but existing) cases like that.

If there's a straightforward way of flagging "I saw a NUL, this isn't text, complain to the web server administrator" in the bugmeter, that would probably be a reasonable failure (error correction) mode.

All the best,

f
-- 
Francis Daly francis@daoine.org
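The C string problem described above, and one possible way to detect it, can be sketched as follows. This is only an illustration under assumptions: the helper names contains_nul and c_string_visible_len are hypothetical, it is not Dillo's actual code, and it assumes the page body is kept together with an explicit byte count rather than relying on a terminating NUL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if any of the len bytes in buf is NUL.
 * The caller supplies the real length of the received data. */
bool contains_nul(const char *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}

/* Hypothetical helper: number of bytes that strlen-style code would
 * see, i.e. everything before the first NUL (or len if there is none). */
size_t c_string_visible_len(const char *buf, size_t len)
{
    const char *nul = memchr(buf, '\0', len);
    return nul ? (size_t)(nul - buf) : len;
}
```

For the four bytes 'S', 'e', 0x00, 'n', contains_nul reports true and c_string_visible_len returns 2, so code that treats the buffer as a C string silently loses the trailing 'n'. A caller could use the first check to raise the kind of "I saw a NUL" complaint suggested above.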