[Dillo-dev]Re: White spaces handling (was: Weird glitch with rendering)
Hi Sebastian,
Just FYI: I remember from the old discussion, that DwPage originally collapsed spaces (i.e. calls to a_Dw_page_add_space), and that I suggested some changes, so that this is not done anymore. Your patch would reverse this change again.
The discussion goes attached.
I read the whole thread you sent me. BTW, the whole whitespace issue re-started when I found one of the patches you mentioned. I then read the past emails, and started to try to follow. It is not easy to make a clean picture of it though! --This mainly because of the way it used to be handled inside Dillo and because the SPECS seem not definitive in this matter. Here I'll try to cite past emails and comment my current thoughts:
[...] This is because a_Dw_page_add_space is called twice, and this function does actually not add a space, but change the current one. I've worked on two solutions:
1. The first patch (underlined-spaces-clean.diff) changes the behavior of a_Dw_page_add_space, but needs some changes in the HTML parser, to control better when this function is called, i.e. to ignore spaces after <A> and before </A>.
This one seems the best candidate. With regard to fixing the parser for ignoring these spaces, see my comments below.
2. The second one (underlined-spaces-kludgy.diff) tries to adjust spaces, depending on rather hairy conditions. It works already (if I'm not wrong, it has quite the same results as the code before, except the bug).
This option is no longer valid. As you clearly described: <q>
2. The second one (underlined-spaces-kludgy.diff) tries to adjust spaces, depending on rather hairy conditions. It works already (if I'm not wrong, it has quite the same results as the code before, except the bug).
As I've now noticed, this will not work for CSS, e.g. the following code:
<u>One <span style="text-decoration=none">non-underlined</span> word</u>
will be displayed:
One non-underlined word ---- -----
just because this patch lets DwPage assume something about the document structure (change from non-underline to underline == beginning of a tag) which it can (and should) not.
</q> Now,
Despite of the file names, I'm not sure if the changes in the HTML parser can be done cleanly. So, if you think that this is difficult to realize, apply the second patch. Especially, there is probably also an other DwPage function necessary to remove again the last space, when </A> is read after a space.
I believe the parser is not very hard to modify for ignoring the spaces as patch 1.- requires. The problem is that the SPEC is not clear about exactly how these spaces should be collapsed. For instance:
A different case is "<u>Some </u> text". Your patch will make "<u>Some </u>text" of it, but it should be really be "<u>Some</u> text."
Yes, I agree, "collapsing" here should be: '<u>Some </u> text' => '<u>Some</u> text' as you note. but what do we do with this: '<u>Some </u>text' If we ignore white space after the start tag and before the end tag, it becomes '<u>Some</u>text' (with no space at all!) If we "collapse" as the SPEC says should be done, we have two possibilities: '<u>Some </u>text' (as it was: underline the whitespace) and '<u>Some</u> text' (move the space out of the tag) AFAICT, the SPEC leaves the choice open, and advices HTML authors against whitespace inside the tags. IMO, always collapsing white space after the start tag and before the end tag is the simplest to implement. Even more, as the SPEC doesn't define what to do in this case, it's an option left to the User Agent: <q source='HTML4.01 SPEC, 9.1'> In order to avoid problems with SGML line break rules and inconsistencies among extant implementations, authors should not rely on user agents to render white space immediately after a start tag or immediately before an end tag. Thus, authors, and in particular authoring tools, should write: <P>We offer free <A>technical support</A> for subscribers.</P> and not: <P>We offer free<A> technical support </A>for subscribers.</P> </q> Now, this solution would also account for the special SGML line break rules: <q source='HTML-4.01 SPEC B.3.1'> SGML (see [ISO8879], section 7.6.1) specifies that a line break immediately following a start tag must be ignored, as must a line break immediately before an end tag. This applies to all HTML elements without exception. The following two HTML examples must be rendered identically: <P>Thomas is watching TV.</P> <P> Thomas is watching TV. </P> So must the following two examples: <A>My favorite Website</A> <A> My favorite Website </A> </q> Note that Firebird doesn't follow the SGML line break rules! This is not to say we should follow, but that we may find some buggy pages out there, but at least at this point the SPEC is quite clear. ;)
I'm not completely sure, we should carefully evaluate the specs, I'd suggest to delay this for the 0.8.2 release.
Neither I am sure! :-) At this point I see that applying patch 1, plus making the parser ignore whitespaces after a start tag and before an end tag is a correct solution that can endure a reality test. I'll start coding it. Please let me know your thoughts or any point I'm still missing. Best Jorge.- PS: This work is for 0.8.2.
Sebastian,
[Jorge wrote:]
(...) At this point I see that applying patch 1, plus making the parser ignore whitespaces after a start tag and before an end tag is a correct solution that can endure a reality test.
Now I have a patch that does this (your patch 1.-, plus whitespace ignoring at the parser level). It works very well so far and passes these test pages: 1.- underline-strike-and-spaces.html 2.- bms.html 3.- Underlining of "PRE" in HTML 4.01 SPEC section 9.1 (the page itself, as HTML, has a good test case) Please download it and send me your comments: http://www.dillo.org/download/Jdiff.SPC2 Note: if you modify the patch like this, html.c:1007: - if (html->PrevWasOpenTag) { + if (0 && html->PrevWasOpenTag) { then, the splash page's tables render with a black row at the bottom. It'd be interesting to find out why. Cheers Jorge.-
Hi Sebastian, This HTML example shows the black row (after patch): <table border="1" cellspacing="1"> <tr> <td bgcolor=black> <table width=100% bgcolor=white> <tr> <td>1</td> <td>2</td> <td>3</td> </tr> </table> </td> </tr> </table> HTH Jorge.-
On Mon, May 17, Jorge Arellano Cid wrote:
Hi Sebastian,
This HTML example shows the black row (after patch): [...]
The DwPage below the topmost table gets an additional word, most probably a break, which it should not (and which it does not get with the unmodified parser). What the modified HTML parser does, seems still a bit immature. Sebastian
On Fri, 21 May 2004, Sebastian Geerken wrote:
On Mon, May 17, Jorge Arellano Cid wrote:
Hi Sebastian,
This HTML example shows the black row (after patch): [...]
The DwPage below the topmost table gets an additional word, most probably a break, which it should not (and which it does not get with the unmodified parser). What the modified HTML parser does, seems still a bit immature.
Actually it's not the changes to the parser, but to dw_page.c. If you pick a vanilla 0.8.1 and apply: + if (page->words[nw].orig_space != 0) { + /* We have already a space, add a blank word before. */ + a_Dw_page_add_text (page, g_strdup(""), style); + nw = page->num_words - 1; + } It will render the black row, even on the splash page. Cheers Jorge.-
On Fri, May 21, Jorge Arellano Cid wrote:
On Fri, 21 May 2004, Sebastian Geerken wrote:
On Mon, May 17, Jorge Arellano Cid wrote:
This HTML example shows the black row (after patch): [...]
The DwPage below the topmost table gets an additional word, most probably a break, which it should not (and which it does not get with the unmodified parser). What the modified HTML parser does, seems still a bit immature.
Actually it's not the changes to the parser, but to dw_page.c. If you pick a vanilla 0.8.1 and apply:
+ if (page->words[nw].orig_space != 0) { + /* We have already a space, add a blank word before. */ + a_Dw_page_add_text (page, g_strdup(""), style); + nw = page->num_words - 1; + }
It will render the black row, even on the splash page.
Interestingly, if you take the modifications in html.c, without the modifications in dw_page.c, everything works: the table above is rendered correctly, the other tables are centered, and even the page "underline-strike-and-spaces.html" looks like we want it to look like. Actually, the modifications in html.c should make the modifications in dw_page.c obsolete, since the latters handle the case that a_Dw_page_add_space is called twice in sucsession, without any new words between them. I would have thought that the modified HTML parser should avoid exactly this, but this does not seem the case. It would be interesting to insert the line g_return_if_fail (page->words[nw].orig_space != 0); instead of the code, which is inserted by the patch, and start dillo in gdb with the option "--g-fatal-warnings", to see, under which circumstances the code would have been called. Sebastian
On Sat, 22 May 2004, Sebastian Geerken wrote:
On Fri, May 21, Jorge Arellano Cid wrote:
[...] Actually it's not the changes to the parser, but to dw_page.c. If you pick a vanilla 0.8.1 and apply:
+ if (page->words[nw].orig_space != 0) { + /* We have already a space, add a blank word before. */ + a_Dw_page_add_text (page, g_strdup(""), style); + nw = page->num_words - 1; + }
It will render the black row, even on the splash page.
Interestingly, if you take the modifications in html.c, without the modifications in dw_page.c, everything works: the table above is rendered correctly, the other tables are centered, and even the page "underline-strike-and-spaces.html" looks like we want it to look like.
Funny! I found a render fault on the HTML-4.01 sec 9.1 page (as a test case). There was a missing space in "(calledinter-word space)" and in "user agents shouldcollapse input". I Fixed that, and uploaded to CVS. Not to say this is _the_ way to do the parsing, but as it does better than the former code, let it be. The question of how to do it properly remains open... (I'm yet to study the parser in the CSS prototype)
Actually, the modifications in html.c should make the modifications in dw_page.c obsolete, since the latters handle the case that a_Dw_page_add_space is called twice in sucsession, without any new words between them. I would have thought that the modified HTML parser should avoid exactly this, but this does not seem the case. It would be interesting to insert the line
g_return_if_fail (page->words[nw].orig_space != 0);
instead of the code, which is inserted by the patch, and start dillo in gdb with the option "--g-fatal-warnings", to see, under which circumstances the code would have been called.
I uploaded this test code to CVS: + /* todo: remove this test case */ + if (page->words[nw].orig_space != 0) { + g_warning("a_Dw_page_add_space:: already existing space!!!\n"); + } and found an interesting test case: <table> <tr> <td> <table> <tr> <td>1</td> <td>2</td> <td>3</td> </tr> </table> </td> </tr> </table> This code triggers one warning message, from the "<td>3</td>" line. If you remove it, there's no warning anymore. The funny part is that the "offending" line is the same that the former two (as the parser is concerned), but only the last one triggers the warning. Moreover, if you add a "<td>4</td>" line, there'll be two warnings! HTH Jorge.-
On Fri, May 14, Jorge Arellano Cid wrote:
[...] For instance:
A different case is "<u>Some </u> text". Your patch will make "<u>Some </u>text" of it, but it should be really be "<u>Some</u> text."
Yes, I agree, "collapsing" here should be:
'<u>Some </u> text' => '<u>Some</u> text'
as you note.
but what do we do with this:
'<u>Some </u>text'
If we ignore white space after the start tag and before the end tag, it becomes
'<u>Some</u>text' (with no space at all!)
What I find indeed reasonable. (See below for my reasons.)
If we "collapse" as the SPEC says should be done, we have two possibilities:
What part of the spec do you refer to? From what I have understood, spaces should be collapsed at the raw data leve. That is, if you have the part "<p><u>Some </u>text</p>", it will be parsed into the following tree: ... `- <p> +- <u> | `- "Some " `- "text" Then, the question is, what should be done with the space at the end of "Some ". Generally, I'd like to stick to this tree view, and especially regard, in this example, the words "Some " and "text" as lying in different levels in the tree, not within a flat list. In the history, this was not always very clear for HTML, but it is much clearer for XHTML. The current parser does not actually build a tree, but should a bit like as it does. This approach makes also the new HTML parser in the CSS prototype simpler.
'<u>Some </u>text' (as it was: underline the whitespace)
and
'<u>Some</u> text' (move the space out of the tag)
AFAICT, the SPEC leaves the choice open, and advices HTML authors against whitespace inside the tags.
IMO, always collapsing white space after the start tag and ^^^^^^^^^^
I'd say, we should (for most elements) simply ignore whitespaces after the opening tag, and before the closing tag. This solves generally the problem with "Some <u> underlined </u> text.", and we should not relate spaces in different elements, e.g. by collapsing them.
before the end tag is the simplest to implement. Even more, as the SPEC doesn't define what to do in this case, it's an option left to the User Agent:
<q source='HTML4.01 SPEC, 9.1'> In order to avoid problems with SGML line break rules and inconsistencies among extant implementations, authors should not rely on user agents to render white space immediately after a start tag or immediately before an end tag. Thus, authors, and in particular authoring tools, should write:
<P>We offer free <A>technical support</A> for subscribers.</P>
and not:
<P>We offer free<A> technical support </A>for subscribers.</P> </q>
So, at least, the authors are warned ;-)
Now, this solution would also account for the special SGML line break rules:
<q source='HTML-4.01 SPEC B.3.1'>
SGML (see [ISO8879], section 7.6.1) specifies that a line break immediately following a start tag must be ignored, as must a line break immediately before an end tag. This applies to all HTML elements without exception.
The following two HTML examples must be rendered identically:
<P>Thomas is watching TV.</P>
<P> Thomas is watching TV. </P>
So must the following two examples:
<A>My favorite Website</A>
<A> My favorite Website </A>
</q>
Rhis is actually something different: It is only about line breaks, and it applies to *all* elements, including <pre>. Sebastian
On Fri, 21 May 2004, Sebastian Geerken wrote:
On Fri, May 14, Jorge Arellano Cid wrote:
[...] but what do we do with this:
'<u>Some </u>text'
If we ignore white space after the start tag and before the end tag, it becomes
'<u>Some</u>text' (with no space at all!)
What I find indeed reasonable. (See below for my reasons.)
Agreed.
If we "collapse" as the SPEC says should be done, we have two possibilities:
What part of the spec do you refer to?
HTML-4.01, sec 9.1: <q> [...] Note that a sequence of white spaces between words in the source document may result in an entirely different rendered inter-word spacing (except in the case of the PRE element). In particular, user agents should collapse input white space sequences when producing output inter-word space. This can and should be done even in the absence of language information (from the lang attribute, the HTTP"Content-Language" header field (see [RFC2616], section 14.12), user agent settings, etc.). [...] </q>
From what I have understood, spaces should be collapsed at the raw data level. That is, if you have the part "<p><u>Some </u>text</p>", it will be parsed into the following tree:
... `- <p> +- <u> | `- "Some " `- "text"
Then, the question is, what should be done with the space at the end of "Some ".
If "the raw data level" means that this: <p><u>Some </u>text</p> would be parsed as: `- <p> +- <u> | `- "Some " `- "text" and after that, the final space removed, producing: `- <p> +- <u> | `- "Some " `- "text" it seems a good idea for an implementation to me.
Generally, I'd like to stick to this tree view, and especially regard, in this example, the words "Some " and "text" as lying in different levels in the tree, not within a flat list. In the history, this was not always very clear for HTML, but it is much clearer for XHTML. The current parser does not actually build a tree, but should a bit like as it does.
Sorry, my english, what does "a bit like as it does" mean?
This approach makes also the new HTML parser in the CSS prototype simpler.
Good.
[...] AFAICT, the SPEC leaves the choice open, and advices HTML authors against whitespace inside the tags.
IMO, always collapsing white space after the start tag and ^^^^^^^^^^
I'd say, we should (for most elements) simply ignore whitespaces after the opening tag, and before the closing tag. This solves generally the problem with "Some <u> underlined </u> text.", and we should not relate spaces in different elements, e.g. by collapsing them.
Yes, "ignore" is a better word.
Now, this solution would also account for the special SGML line break rules:
<q source='HTML-4.01 SPEC B.3.1'>
SGML (see [ISO8879], section 7.6.1) specifies that a line break immediately following a start tag must be ignored, as must a line break immediately before an end tag. This applies to all HTML elements without exception.
The following two HTML examples must be rendered identically:
<P>Thomas is watching TV.</P>
<P> Thomas is watching TV. </P>
So must the following two examples:
<A>My favorite Website</A>
<A> My favorite Website </A>
</q>
Rhis is actually something different: It is only about line breaks, and it applies to *all* elements, including <pre>.
If you consider that line breaks are also white space characters (HTML-4.01 sec 9.1), this becomes a special case of general white space handling. Cheers Jorge.- PS: It seems like we'll have to modify the parser to produce a parsing tree for CSS/XML to be supported properly.
On Fri, May 21, Jorge Arellano Cid wrote:
[...]
If we "collapse" as the SPEC says should be done, we have two possibilities:
What part of the spec do you refer to?
HTML-4.01, sec 9.1:
<q> [...] Note that a sequence of white spaces between words in the source document may result in an entirely different rendered inter-word spacing (except in the case of the PRE element). In particular, user agents should collapse input white space sequences when ^^^^^ Again, I understand this, that this refers to the lowest processing level, this is what the HTML parser has already done before.
producing output inter-word space. This can and should be done even in the absence of language information (from the lang attribute, the HTTP"Content-Language" header field (see [RFC2616], section 14.12), user agent settings, etc.). [...] </q>
[...]
Generally, I'd like to stick to this tree view, and especially regard, in this example, the words "Some " and "text" as lying in different levels in the tree, not within a flat list. In the history, this was not always very clear for HTML, but it is much clearer for XHTML. The current parser does not actually build a tree, but should a bit like as it does.
Sorry, my english, what does "a bit like as it does" mean?
Sorry, this should say "[should behave] a bit like as *if* it does [build a tree]".
[...]
Now, this solution would also account for the special SGML line break rules:
<q source='HTML-4.01 SPEC B.3.1'>
SGML (see [ISO8879], section 7.6.1) specifies that a line break immediately following a start tag must be ignored, as must a line break immediately before an end tag. This applies to all HTML elements without exception.
The following two HTML examples must be rendered identically:
<P>Thomas is watching TV.</P>
<P> Thomas is watching TV. </P>
So must the following two examples:
<A>My favorite Website</A>
<A> My favorite Website </A>
</q>
Rhis is actually something different: It is only about line breaks, and it applies to *all* elements, including <pre>.
If you consider that line breaks are also white space characters (HTML-4.01 sec 9.1), this becomes a special case of general white space handling.
Since this rule applies always, it should be handles at a level below (if I understand this correctly). I.e., first remove these linke breaks (also for <pre>, and then (except for <pre>), remove white spaces. BTW, I do not know whether this is also valid for XML, I did not find something equivalent in the XML spec.
PS: It seems like we'll have to modify the parser to produce a parsing tree for CSS/XML to be supported properly.
This is already halfway done in the CSS prototype, since CSS does indeed depend on a document tree (mostly because CSS is processed asynchronously). It has still to be considered, on which levels what white space handling is done. Sebastian
participants (2)
-
Jorge Arellano Cid
-
Sebastian Geerken