On Thu, May 13, 2004 at 08:08:30AM -0400, Jorge Arellano Cid wrote:
From: Jukka K. Korpela <jkorpela@cs.tut.fi>
On Wed, 12 May 2004, Jorge Arellano Cid wrote:
I haven't yet found out whether the null byte character is allowed in HTML. Can you shed some light on this?
It is not. You could use http://validator.w3.org to check for disallowed characters (it reports "non SGML character number 0"), but the ultimate reference is
a) for HTML 4, the SGML declaration http://www.w3.org/TR/html4/sgml/sgmldecl.html where UNUSED effectively means 'disallowed'
b) for XHTML, the XML specification, see http://www.w3.org/TR/REC-xml/#charsets which says, among other things, that all characters below 9 (HT) are disallowed.
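As a rough illustration of the rule quoted above, here is a minimal sketch in C that scans a buffer for C0 control bytes outside the set XML 1.0 allows (HT 0x09, LF 0x0A, CR 0x0D). The function names are hypothetical and are not taken from any browser or validator; the check covers only bytes below 0x20 and does not validate the encoding or any higher code points.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if byte c is a C0 control character
 * other than HT (0x09), LF (0x0A) or CR (0x0D), otherwise 0.
 * Only the range below 0x20 is examined. */
int is_disallowed_c0(unsigned char c)
{
    return c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
}

/* Hypothetical helper: returns the offset of the first disallowed
 * control byte in buf[0..len), or -1 if there is none.
 * The length is passed explicitly, so an embedded NUL does not
 * end the scan early. */
long first_disallowed(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (is_disallowed_c0(buf[i]))
            return (long)i;
    }
    return -1;
}
```

Passing the length separately, rather than relying on a terminator, is what lets the scan see a NUL byte at all.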
That seems authoritative.

But for completeness, even though the character isn't allowed, it can appear, and how a browser handles it might matter.

As a specific example, the "rt" issue tracker has a web interface. The version I was evaluating accepted (as I recall) the name "Seán" as part of its input (the non-ASCII character there is 0xe1, "LATIN SMALL LETTER A WITH ACUTE"). When it presented that back to the browser, the 0xe1 had become 0x00, which caused oddness in the display on a few browsers.

The content-type was declared as text/html;charset=utf-8. I don't know if that was a part of, or a trigger for, the display problem. But either way, storing or accessing text/html as a C string will not work in (broken but existing) cases like that.

If there's a straightforward way of flagging "I saw a NUL, this isn't text, complain to the web server administrator" in the bugmeter, that would probably be a reasonable failure (error correction) mode.

All the best,

f
-- 
Francis Daly        francis@daoine.org
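
To illustrate the C string point above, here is a minimal sketch, assuming the page body is held as a buffer with a separately tracked length. The helper name first_nul_offset is hypothetical and is not part of any existing browser code. With a body such as "Se", 0x00, "n", strlen() stops at the NUL and reports 2, while memchr() over the full length finds the NUL so that it can be reported.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the first NUL byte in
 * buf[0..len), or -1 if the buffer contains none. Unlike strlen(),
 * it relies on the explicit length, not on a terminator. */
long first_nul_offset(const char *buf, size_t len)
{
    const char *p = memchr(buf, '\0', len);
    return p ? (long)(p - buf) : -1L;
}
```

A caller could treat any result other than -1 as the "I saw a NUL" condition and raise an error at that point, instead of silently truncating the content.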