Re: [Dillo-dev]Re: null byte in HTML

May 13, 2004

      On Thu, May 13, 2004 at 08:08:30AM -0400, Jorge Arellano Cid wrote:
...
From: Jukka K. Korpela <jkorpela@cs.tut.fi>
On Wed, 12 May 2004, Jorge Arellano Cid wrote:
...
I haven't yet found whether the null byte character is allowed in
HTML. Can you shed some light on this?
It is not. You could use http://validator.w3.org to check for disallowed
characters (it reports "non SGML character number 0"), but the ultimate
reference is
a) for HTML 4, the SGML declaration
   http://www.w3.org/TR/html4/sgml/sgmldecl.html
   where UNUSED effectively means 'disallowed'
b) for XHTML, the XML specification, see
   http://www.w3.org/TR/REC-xml/#charsets
which say, among other things, that all characters below 9 (HT) are
disallowed.
That seems authoritative.
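
For reference, the XML rule can be written down as a small C check. This
is only a sketch of the "Char" production from the XML 1.0 specification
linked above; the function name xml_char_allowed is made up for
illustration and is not part of Dillo or any library.

```c
/* Returns 1 if code point cp matches the XML 1.0 "Char" production:
 *   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 * and 0 otherwise.  Hypothetical helper, for illustration only. */
int xml_char_allowed(unsigned long cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

With this check, 0x0 through 0x8 are rejected, and so are 0xB, 0xC and
0xE through 0x1F, while HT (0x9), LF (0xA) and CR (0xD) are accepted.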

But for completeness, even though the character isn't allowed, it can
appear, and how a browser handles it might matter.

As a specific example, the "rt" issue tracker has a web interface.  The
version I was evaluating accepted (as I recall) the name "Seán" as part
of its input (the non-ASCII character there is 0xe1, "LATIN SMALL LETTER
A WITH ACUTE").  When it presented that back to the browser, the 0xe1
had become 0x00, which caused oddness in the display on a few
browsers.

The content-type was declared as text/html;charset=utf-8.  I don't know
if that was a part of, or a trigger for, the display problem.  But either
way, storing or accessing text/html as a NUL-terminated C string will
not work in (broken but existing) cases like that, because functions such
as strlen() stop at the first 0x00.
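
To make the C string point concrete, here is a minimal sketch. It assumes
the page body is held as a byte buffer with a separately tracked length;
the function name visible_len is hypothetical and not taken from Dillo.

```c
#include <stddef.h>
#include <string.h>

/* Number of bytes that C-string code would "see" in buf: the offset of
 * the first NUL, or buf_len if the buffer contains no NUL at all.
 * Hypothetical helper, for illustration only. */
size_t visible_len(const char *buf, size_t buf_len)
{
    const char *nul = memchr(buf, '\0', buf_len);
    return nul ? (size_t)(nul - buf) : buf_len;
}
```

For the four bytes 'S', 'e', 0x00, 'n', visible_len() gives 2 although
the tracked length is 4, so the trailing 'n' is out of reach for any code
that treats the buffer as a C string.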

If there's a straightforward way of flagging "I saw a NUL, this isn't
text, complain to the web server administrator" in the bugmeter, that
would probably be a reasonable failure (error correction) mode.
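
A rough sketch of what such a check might look like, assuming the parser
has the raw bytes and their length available. The names count_nuls and
nul_warning and the message text are invented here; how the result would
be handed to the bugmeter is left open, since I have not looked at that
interface.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Count the NUL bytes in a buffer of known length. */
size_t count_nuls(const char *buf, size_t len)
{
    size_t n = 0;
    const char *p = buf;
    const char *end = buf + len;

    /* memchr() returns NULL when no further NUL is found, ending the loop. */
    while (p < end && (p = memchr(p, '\0', (size_t)(end - p))) != NULL) {
        n++;
        p++;    /* continue the search just past the NUL we found */
    }
    return n;
}

/* Fill msg with a diagnostic if buf contains any NUL byte.
 * Returns 1 if a NUL was seen, 0 otherwise (msg is then left empty). */
int nul_warning(const char *buf, size_t len, char *msg, size_t msg_size)
{
    size_t n = count_nuls(buf, len);

    if (n == 0) {
        if (msg_size > 0)
            msg[0] = '\0';
        return 0;
    }
    snprintf(msg, msg_size,
             "saw %lu NUL byte(s); this is not valid text/html",
             (unsigned long)n);
    return 1;
}
```

The caller would run nul_warning() once per page and, when it returns 1,
pass msg on to whatever reporting mechanism is in use.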

All the best,

	f
-- 
Francis Daly        francis@daoine.org