[Dillo-dev] White space handling
Hi!

This mail is just to collect what I have found in the specs, and how I understand it. I'll comment on Jorge's mails separately.

Sebastian.

----------------------------------------------------------------------

HTML
----

Unfortunately, I do not have the SGML spec at hand, so I can only refer to the parts of the HTML spec that cover general SGML topics. There is something in chapter 6, although I did not find anything about how #PCDATA (which the DTD uses for the text content relevant in this context) is handled, only about CDATA:
6.2 SGML basic types
The document type definition specifies the syntax of HTML element content and attribute values using SGML tokens (e.g., PCDATA, CDATA, NAME, ID, etc.). See [ISO8879] for their full definitions. The following is a summary of key information:
* CDATA is a sequence of characters from the document character set and may include character entities. User agents should interpret attribute values as follows:
  o Replace character entities with characters,
  o Ignore line feeds,
  o Replace each carriage return or tab with a single space.
User agents may ignore leading and trailing white space in CDATA attribute values (e.g., " myval " may be interpreted as "myval"). Authors should not declare attribute values with leading or trailing white space.
For some HTML 4 attributes with CDATA attribute values, the specification imposes further constraints on the set of legal values for the attribute that may not be expressed by the DTD.
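To make the CDATA rules quoted above concrete, here is a minimal sketch in C. The function normalize_cdata() and its trim flag are made up for illustration and are not part of Dillo; character entity replacement is left out on purpose, so only the line feed, carriage return, tab and optional trimming rules are covered.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the HTML 4.01 section 6.2 rules for CDATA attribute values:
 * line feeds are dropped, each carriage return or tab becomes one space,
 * and (optionally, as the spec permits) leading and trailing spaces are
 * removed. Character entity replacement is deliberately not handled.
 *
 * normalize_cdata() is a hypothetical helper, not a Dillo function.
 * It rewrites buf in place and returns the new length.
 */
size_t normalize_cdata(char *buf, int trim)
{
    size_t in, out = 0;

    assert(buf != NULL);

    for (in = 0; buf[in] != '\0'; in++) {
        char c = buf[in];
        if (c == '\n')
            continue;               /* ignore line feeds */
        if (c == '\r' || c == '\t')
            c = ' ';                /* CR or tab becomes a single space */
        buf[out++] = c;
    }
    buf[out] = '\0';

    if (trim) {
        size_t start = 0;
        while (out > 0 && buf[out - 1] == ' ')
            out--;                  /* drop trailing spaces */
        while (start < out && buf[start] == ' ')
            start++;                /* skip leading spaces */
        memmove(buf, buf + start, out - start);
        out -= start;
        buf[out] = '\0';
    }
    return out;
}
```

Whether to trim is left to the caller, because the spec only says that user agents "may" ignore leading and trailing white space.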
About HTML:
9.1 White space
[...] For all HTML elements except PRE, sequences of white space separate "words" (we use the term "word" here to mean "sequences of non-white space characters"). When formatting text, user agents should identify these words and lay them out according to the conventions of the particular written language (script) and target medium. [...]
This is currently the difference between calls to a_Dw_page_add_text(), which adds a word, and a_Dw_page_add_space(), which adds a "word separation".
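As a rough sketch of this word/separator split (not the actual Dillo code), the following C function walks a text run and reports each word and each run of white space through callbacks. The callback types WordFn and SpaceFn, the Counts struct and all function names are hypothetical stand-ins; the real a_Dw_page_add_text() and a_Dw_page_add_space() have different signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callbacks standing in for a_Dw_page_add_text() and
 * a_Dw_page_add_space(); the real Dillo signatures differ. */
typedef void (*WordFn)(const char *word, size_t len, void *data);
typedef void (*SpaceFn)(void *data);

/* ASCII white space as listed in HTML 4.01 section 9.1, plus line
 * breaks. The zero-width space (U+200B) is not handled in this sketch. */
static int is_html_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
}

/* Splits text into words and word separations (assumed helper). */
void split_words(const char *text, WordFn add_word, SpaceFn add_space,
                 void *data)
{
    size_t i = 0;

    assert(text != NULL && add_word != NULL && add_space != NULL);

    while (text[i] != '\0') {
        if (is_html_space(text[i])) {
            /* A whole run of white space yields one word separation. */
            while (is_html_space(text[i]))
                i++;
            add_space(data);
        } else {
            size_t start = i;
            while (text[i] != '\0' && !is_html_space(text[i]))
                i++;
            add_word(text + start, i - start, data);
        }
    }
}

/* Example callbacks that only count what they receive. */
typedef struct { int words; int spaces; } Counts;

void count_word(const char *word, size_t len, void *data)
{
    (void)word;
    (void)len;
    ((Counts *)data)->words++;
}

void count_space(void *data)
{
    ((Counts *)data)->spaces++;
}
```

Note that leading and trailing white space also produce a separation here; whether such a separation should actually be rendered is a separate question that this sketch does not try to answer.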
B.3.1 Line breaks
SGML (see [ISO8879], section 7.6.1) specifies that a line break immediately following a start tag must be ignored, as must a line break immediately before an end tag. This applies to all HTML elements without exception.
The following two HTML examples must be rendered identically:
<P>Thomas is watching TV.</P>
<P>
Thomas is watching TV.
</P>
So must the following two examples:
<A>My favorite Website</A>
<A>
My favorite Website
</A>
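The B.3.1 rule can be sketched in C as follows. This is a simplified illustration under the assumption that the text between a start tag and its end tag is available as one buffer; strip_tag_breaks() and its helpers are hypothetical names, not Dillo functions. A line break is taken to be a CR, an LF, or a CR LF pair, and only one break is removed at each end; other white space is left alone.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the line break at the start of s[0..len): 2 for CR LF,
 * 1 for a lone CR or LF, 0 if there is none. */
static size_t leading_break(const char *s, size_t len)
{
    if (len >= 2 && s[0] == '\r' && s[1] == '\n')
        return 2;
    if (len >= 1 && (s[0] == '\r' || s[0] == '\n'))
        return 1;
    return 0;
}

/* Same as above, for the line break at the end of s[0..len). */
static size_t trailing_break(const char *s, size_t len)
{
    if (len >= 2 && s[len - 2] == '\r' && s[len - 1] == '\n')
        return 2;
    if (len >= 1 && (s[len - 1] == '\r' || s[len - 1] == '\n'))
        return 1;
    return 0;
}

/*
 * Hypothetical helper: content points at the text between a start tag
 * and its end tag, *len is its length. On return, *len is the length
 * without the two ignorable line breaks, and the result points at the
 * first remaining character. The buffer itself is not modified.
 */
const char *strip_tag_breaks(const char *content, size_t *len)
{
    size_t n;

    assert(content != NULL && len != NULL);

    n = leading_break(content, *len);
    content += n;
    *len -= n;
    *len -= trailing_break(content, *len);
    return content;
}
```

With nested elements, the text directly after a start tag and the text directly before an end tag are usually different text runs, so a real parser would have to apply the two halves separately.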
Notice this: "This applies to all HTML elements without exception." That is, it also applies to <pre>. But it does not say anything about white space in general.

XHTML
-----

First of all, from the XML 1.1 spec:
2.10 White Space Handling
In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code.
An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content.
First of all, we do not have a distinction between the (basic) XML processor and the application. (In the CSS prototype, this is clearer, and because it is always a good decision to make the implementation structure resemble the structure of the specifications, this should be kept in mind.) How I basically understand this: in most cases (see below), it is the task of the application (in this case XHTML) to determine how to deal with white space. Anyway, I did not find anything in the XHTML 1.0 specification, so I believe that the same rules apply as for HTML 4.01.
A special attribute named xml:space MAY be attached to an element to signal an intention that in that element, white space should be preserved by applications. In valid documents, this attribute, like any other, MUST be declared if it is used. When declared, it MUST be given as an enumerated type whose values are one or both of "default" and "preserve". For example:
<!ATTLIST poem xml:space (default|preserve) 'preserve'>
<!ATTLIST pre xml:space (preserve) #FIXED 'preserve'>
The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute. This specification does not give meaning to any value of xml:space other than "default" and "preserve". It is an error for other values to be specified; the XML processor MAY report the error or MAY recover by ignoring the attribute specification or by reporting the (erroneous) value to the application. Applications may ignore or reject erroneous values.
If you look at the DTDs of XHTML 1.1, you'll find three elements setting this: <style>, <script>, and <pre>. As for the patch: <style> and <script> are currently rather unimportant, while for <pre>, the HTML parser switches into the DILLO_HTML_PARSE_MODE_PRE mode, where white space is handled differently.
The root element of any document is considered to have signaled no intentions as regards application space handling, unless it provides a value for this attribute or the attribute is declared with a default value.
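The inheritance of xml:space described in the quoted section could be resolved roughly like this. The Element struct, the XmlSpace enum and preserve_space() are assumptions made up for this sketch; Dillo has no such structures as far as this mail is concerned. The sketch also assumes that defaults declared in the DTD (such as the #FIXED 'preserve' on <pre>) have already been copied into the element by whoever builds the tree.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    XML_SPACE_UNSET,      /* no xml:space on this element */
    XML_SPACE_DEFAULT,    /* xml:space="default" */
    XML_SPACE_PRESERVE    /* xml:space="preserve" */
} XmlSpace;

/* Hypothetical element node; only the fields this sketch needs. */
typedef struct Element {
    const struct Element *parent;   /* NULL for the root element */
    XmlSpace xml_space;             /* value in effect on this element */
} Element;

/*
 * Returns 1 if white space should be preserved for el, 0 otherwise.
 * The nearest element (el itself or an ancestor) carrying xml:space
 * decides; "default" on an inner element overrides "preserve" on an
 * outer one.
 */
int preserve_space(const Element *el)
{
    assert(el != NULL);

    for (; el != NULL; el = el->parent) {
        if (el->xml_space == XML_SPACE_PRESERVE)
            return 1;
        if (el->xml_space == XML_SPACE_DEFAULT)
            return 0;
    }
    /* Even the root signalled nothing: use the application default. */
    return 0;
}
```

Walking up the parent chain on every query is the simplest approach; a parser that keeps a stack of open elements could instead carry the current value along as it descends.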
Sebastian Geerken