[Dillo-dev]White space handling

May 21, 2004

      Hi!

This mail just collects what I have found in the specs, and how I
understand it. I'll comment on Jorge's mails separately.

Sebastian.

----------------------------------------------------------------------

HTML
----
Unfortunately, I do not have the SGML spec at hand, so I refer only to
the parts of the HTML spec that cover general SGML topics.

There is something in chapter 6, although I did not find anything
about how #PCDATA is handled (#PCDATA is what the DTD uses for the
text that is relevant in this context), only about CDATA:
...
6.2 SGML basic types
The document type definition specifies the syntax of HTML element
content and attribute values using SGML tokens (e.g., PCDATA, CDATA,
NAME, ID, etc.). See [ISO8879] for their full definitions. The
following is a summary of key information:
* CDATA is a sequence of characters from the document character set
  and may include character entities. User agents should interpret
  attribute values as follows:
  o Replace character entities with characters,
  o Ignore line feeds,
  o Replace each carriage return or tab with a single space.
  User agents may ignore leading and trailing white space in CDATA
  attribute values (e.g., " myval " may be interpreted as
  "myval"). Authors should not declare attribute values with leading
  or trailing white space.
  For some HTML 4 attributes with CDATA attribute values, the
  specification imposes further constraints on the set of legal
  values for the attribute that may not be expressed by the DTD.
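
The CDATA rules quoted above could be sketched roughly as follows. This
is only an illustration: normalize_cdata_attr() is a hypothetical helper,
not an existing Dillo function, and it assumes that character entities
have already been replaced by an earlier step. It also applies the
optional stripping of leading and trailing spaces.

```c
#include <assert.h>
#include <string.h>

/*
 * Normalize a CDATA attribute value in place, following the summary in
 * HTML 4.01, section 6.2. Hypothetical helper; entity replacement is
 * assumed to have happened before this is called.
 */
static char *normalize_cdata_attr(char *value)
{
    char *src, *dst;
    size_t len;

    assert(value != NULL);

    /* Ignore line feeds; replace each carriage return or tab by a space. */
    for (src = dst = value; *src != '\0'; src++) {
        if (*src == '\n')
            continue;
        *dst++ = (*src == '\r' || *src == '\t') ? ' ' : *src;
    }
    *dst = '\0';

    /* Optional ("user agents may"): drop leading and trailing spaces. */
    src = value;
    while (*src == ' ')
        src++;
    len = strlen(src);
    while (len > 0 && src[len - 1] == ' ')
        len--;
    memmove(value, src, len);
    value[len] = '\0';

    return value;
}
```

With this sketch, the value " my", line feed, "val", tab would come out
as "myval", and "a", carriage return, line feed, "b" as "a b".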
About HTML:
...
9.1 White space
[...]
For all HTML elements except PRE, sequences of white space separate
"words" (we use the term "word" here to mean "sequences of non-white
space characters"). When formatting text, user agents should
identify these words and lay them out according to the conventions
of the particular written language (script) and target medium.
[...]
This is currently the difference between calls to
a_Dw_page_add_text(), which adds a word, and a_Dw_page_add_space(),
which adds a "word separation".
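
To make the word/separator distinction concrete, here is a minimal
sketch of how a run of text might be split. The callback types WordFn
and SpaceFn only stand in for a_Dw_page_add_text() and
a_Dw_page_add_space(); their signatures are made up for this example
and do not match the real Dillo ones. The Trace callbacks exist only to
make the call sequence visible.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the "add word" / "add space" calls. */
typedef void (*WordFn)(void *page, const char *word, size_t len);
typedef void (*SpaceFn)(void *page);

/* The ASCII white space characters (space, tab, LF, CR, form feed). */
static int is_html_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
}

/* Emit one word per run of non-white-space characters, and one
 * separator per run of white space, however long that run is. */
static void split_words(const char *text, void *page,
                        WordFn add_word, SpaceFn add_space)
{
    const char *p = text;

    assert(text != NULL);
    while (*p != '\0') {
        if (is_html_space(*p)) {
            while (is_html_space(*p))
                p++;
            add_space(page);
        } else {
            const char *start = p;
            while (*p != '\0' && !is_html_space(*p))
                p++;
            add_word(page, start, (size_t)(p - start));
        }
    }
}

/* Demo callbacks: record the call sequence as "[word]" and "_". */
typedef struct { char buf[256]; size_t used; } Trace;

static void trace_word(void *page, const char *word, size_t len)
{
    Trace *t = page;
    if (t->used + len + 2 < sizeof t->buf) {
        t->buf[t->used++] = '[';
        memcpy(t->buf + t->used, word, len);
        t->used += len;
        t->buf[t->used++] = ']';
        t->buf[t->used] = '\0';
    }
}

static void trace_space(void *page)
{
    Trace *t = page;
    if (t->used + 1 < sizeof t->buf) {
        t->buf[t->used++] = '_';
        t->buf[t->used] = '\0';
    }
}
```

Called on a zeroed Trace with the text "Thomas", two spaces, "is", line
feed, space, "watching", the recorded sequence is
[Thomas]_[is]_[watching]: every run of white space collapses into a
single separator.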
...
B.3.1 Line breaks
...
SGML (see [ISO8879], section 7.6.1) specifies that a line break
immediately following a start tag must be ignored, as must a line
break immediately before an end tag. This applies to all HTML
elements without exception.
The following two HTML examples must be rendered identically:
<P>Thomas is watching TV.</P>
<P>
Thomas is watching TV.
</P>
So must the following two examples:
<A>My favorite Website</A>
<A>
My favorite Website
</A>
Notice this: "This applies to all HTML elements without exception."
That is, it also applies to <pre>. But it does not say anything about
white space in general.
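
A possible reading of the B.3.1 rule, as a sketch: exactly one line
break is dropped directly after the start tag and one directly before
the end tag, and nothing else is touched. trim_tag_line_breaks() is a
hypothetical name, and treating CR LF, a lone CR, and a lone LF each as
one line break is an assumption of this example.

```c
#include <assert.h>
#include <string.h>

/*
 * Given the raw content between a start tag and its end tag, return a
 * pointer to the content with one leading line break skipped, and store
 * in *len_out the length with one trailing line break excluded.
 * The input is not modified. Hypothetical helper, not Dillo code.
 */
static const char *trim_tag_line_breaks(const char *content, size_t *len_out)
{
    size_t n;

    assert(content != NULL && len_out != NULL);
    n = strlen(content);

    /* Line break immediately following the start tag. */
    if (n >= 2 && content[0] == '\r' && content[1] == '\n') {
        content += 2;
        n -= 2;
    } else if (n >= 1 && (content[0] == '\n' || content[0] == '\r')) {
        content += 1;
        n -= 1;
    }

    /* Line break immediately before the end tag. */
    if (n >= 2 && content[n - 2] == '\r' && content[n - 1] == '\n')
        n -= 2;
    else if (n >= 1 && (content[n - 1] == '\n' || content[n - 1] == '\r'))
        n -= 1;

    *len_out = n;
    return content;
}
```

Under these assumptions, both forms of the <P> example above yield the
same content, "Thomas is watching TV.", and this would hold inside
<pre> as well.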

XHTML
-----
First of all from the XML 1.1 spec:
...
2.10 White Space Handling
In editing XML documents, it is often convenient to use "white
space" (spaces, tabs, and blank lines) to set apart the markup for
greater readability. Such white space is typically not intended for
inclusion in the delivered version of the document. On the other
hand, "significant" white space that should be preserved in the
delivered version is common, for example in poetry and source code.
An XML processor MUST always pass all characters in a document that
are not markup through to the application. A validating XML
processor MUST also inform the application which of these characters
constitute white space appearing in element content.
To begin with, we do not have a distinction between the (basic) XML
processor and the application. (In the CSS prototype, this is
clearer, and because it is always a good decision to make the
implementation structure resemble the structure of the specifications,
this should be kept in mind.)

How I basically understand this: It is (in most cases, see below) the
task of the application (in this case XHTML) to determine how to deal
with white space. Anyway, I did not find anything about this in the
XHTML 1.0 specification, so I believe that the same rules apply as for
HTML 4.01.
...
A special attribute named xml:space MAY be attached to an element to
signal an intention that in that element, white space should be
preserved by applications. In valid documents, this attribute, like
any other, MUST be declared if it is used. When declared, it MUST be
given as an enumerated type whose values are one or both of
"default" and "preserve". For example:
<!ATTLIST poem  xml:space (default|preserve) 'preserve'>
<!ATTLIST pre xml:space (preserve) #FIXED 'preserve'>
The value "default" signals that applications' default white-space
processing modes are acceptable for this element; the value
"preserve" indicates the intent that applications preserve all the
white space. This declared intent is considered to apply to all
elements within the content of the element where it is specified,
unless overridden with another instance of the xml:space
attribute. This specification does not give meaning to any value of
xml:space other than "default" and "preserve". It is an error for
other values to be specified; the XML processor MAY report the error
or MAY recover by ignoring the attribute specification or by
reporting the (erroneous) value to the application. Applications may
ignore or reject erroneous values.
If you look at the DTDs of XHTML 1.1, you'll find three elements
setting this: <style>, <script>, and <pre>. As to the patch: <style>
and <script> are currently rather unimportant, while for <pre>, the
HTML parser switches into the DILLO_HTML_PARSE_MODE_PRE mode, where
white space is handled differently.
...
The root element of any document is considered to have signaled no
intentions as regards application space handling, unless it provides
a value for this attribute or the attribute is declared with a
default value.
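
The inheritance of xml:space described in the quoted section might be
modelled like this. The names XmlSpace, parse_xml_space() and
effective_xml_space() are invented for this sketch, and recovering from
an erroneous value by ignoring the attribute is just one of the options
the spec allows.

```c
#include <assert.h>
#include <string.h>

/* XS_UNSET means "no intention signaled", as for a bare root element. */
typedef enum { XS_UNSET, XS_DEFAULT, XS_PRESERVE } XmlSpace;

/* Map an xml:space attribute value to the enum. NULL means that the
 * attribute is absent. Any other value is an error; here we recover by
 * ignoring the attribute specification. */
static XmlSpace parse_xml_space(const char *value)
{
    if (value == NULL)
        return XS_UNSET;
    if (strcmp(value, "default") == 0)
        return XS_DEFAULT;
    if (strcmp(value, "preserve") == 0)
        return XS_PRESERVE;
    return XS_UNSET;
}

/* The declared intent applies to all elements within the content of
 * the element, unless overridden with another xml:space attribute. */
static XmlSpace effective_xml_space(XmlSpace parent, const char *attr_value)
{
    XmlSpace own = parse_xml_space(attr_value);

    return own == XS_UNSET ? parent : own;
}
```

For the root element, XS_UNSET would be passed as the parent value. An
element inside <pre> without its own xml:space then still resolves to
XS_PRESERVE, while a nested element with xml:space="default" switches
back to XS_DEFAULT.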

Sebastian Geerken
