Hi Sebastian,
Just FYI: I remember from the old discussion that DwPage originally collapsed spaces (i.e. calls to a_Dw_page_add_space), and that I suggested some changes so that this is no longer done. Your patch would reverse this change again.
The discussion is attached.
I read the whole thread you sent me. BTW, the whole whitespace issue restarted when I found one of the patches you mentioned. I then read the past emails and tried to follow them. It is not easy to get a clear picture, though! This is mainly because of the way whitespace used to be handled inside Dillo, and because the SPECS do not seem definitive on this matter. Here I'll cite past emails and add my current thoughts:
[...] This is because a_Dw_page_add_space is called twice, and this function does not actually add a space, but changes the current one. I've worked on two solutions:
1. The first patch (underlined-spaces-clean.diff) changes the behavior of a_Dw_page_add_space, but needs some changes in the HTML parser, to control better when this function is called, i.e. to ignore spaces after <A> and before </A>.
This one seems the best candidate. With regard to fixing the parser for ignoring these spaces, see my comments below.
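To picture why calling a_Dw_page_add_space twice misbehaves, here is a toy model in C. It is only a sketch under assumptions: ToyPage, ToyWord, ToyStyle and the toy_* functions are hypothetical names and do not reflect the real DwPage structures. The assumption is that each word owns a single slot for its trailing space, so a second call overwrites that slot instead of appending another space.

```c
#include <stddef.h>

#define TOY_MAX_WORDS 16

/* Hypothetical stand-ins for illustration; none of these names are Dillo's. */
typedef enum { STYLE_PLAIN, STYLE_UNDERLINED } ToyStyle;

typedef struct {
    const char *text;
    int has_space;        /* 1 if a space follows this word */
    ToyStyle space_style; /* style the trailing space is drawn with */
} ToyWord;

typedef struct {
    ToyWord words[TOY_MAX_WORDS];
    size_t num_words;
} ToyPage;

void toy_page_init(ToyPage *page)
{
    page->num_words = 0;
}

/* Append a word with no trailing space yet. */
void toy_page_add_text(ToyPage *page, const char *text)
{
    if (page->num_words < TOY_MAX_WORDS) {
        ToyWord *w = &page->words[page->num_words++];
        w->text = text;
        w->has_space = 0;
        w->space_style = STYLE_PLAIN;
    }
}

/* "Add" a space: in this model it only (re)sets the single space slot
 * of the last word, so repeated calls overwrite rather than accumulate. */
void toy_page_add_space(ToyPage *page, ToyStyle style)
{
    ToyWord *w;

    if (page->num_words == 0)
        return;
    w = &page->words[page->num_words - 1];
    w->has_space = 1;
    w->space_style = style;
}
```

In this model, adding one word and then calling toy_page_add_space first with STYLE_PLAIN and then with STYLE_UNDERLINED leaves a single space whose style is the one from the second call.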
2. The second one (underlined-spaces-kludgy.diff) tries to adjust spaces, depending on rather hairy conditions. It works already (if I'm not wrong, it has quite the same results as the code before, except the bug).
This option is no longer valid. As you clearly described: <q>
2. The second one (underlined-spaces-kludgy.diff) tries to adjust spaces, depending on rather hairy conditions. It works already (if I'm not wrong, it has quite the same results as the code before, except the bug).
As I've now noticed, this will not work for CSS, e.g. the following code:
<u>One <span style="text-decoration: none">non-underlined</span> word</u>
will be displayed:
One non-underlined word ---- -----
just because this patch lets DwPage assume something about the document structure (a change from non-underlined to underlined == beginning of a tag), which it cannot (and should not) do.
</q> Now,
Despite the file names, I'm not sure whether the changes in the HTML parser can be done cleanly. So, if you think that this is difficult to implement, apply the second patch. In particular, another DwPage function is probably also necessary to remove the last space again when </A> is read after a space.
I believe the parser is not very hard to modify to ignore the spaces as patch 1 requires. The problem is that the SPEC is not clear about exactly how these spaces should be collapsed. For instance:
A different case is "<u>Some </u> text". Your patch will make "<u>Some </u>text" of it, but it should really be "<u>Some</u> text."
Yes, I agree, "collapsing" here should be:

    '<u>Some </u> text' => '<u>Some</u> text'

as you note. But what do we do with this:

    '<u>Some </u>text'

If we ignore white space after the start tag and before the end tag, it becomes '<u>Some</u>text' (with no space at all!). If we "collapse" as the SPEC says should be done, we have two possibilities:

    '<u>Some </u>text' (as it was: underline the whitespace), and
    '<u>Some</u> text' (move the space out of the tag).

AFAICT, the SPEC leaves the choice open and advises HTML authors against whitespace inside the tags. IMO, always collapsing white space after the start tag and before the end tag is the simplest to implement. Moreover, as the SPEC doesn't define what to do in this case, it's an option left to the User Agent:

<q source='HTML4.01 SPEC, 9.1'>
In order to avoid problems with SGML line break rules and inconsistencies among extant implementations, authors should not rely on user agents to render white space immediately after a start tag or immediately before an end tag. Thus, authors, and in particular authoring tools, should write:

    <P>We offer free <A>technical support</A> for subscribers.</P>

and not:

    <P>We offer free<A> technical support </A>for subscribers.</P>
</q>

Now, this solution would also account for the special SGML line break rules:

<q source='HTML-4.01 SPEC B.3.1'>
SGML (see [ISO8879], section 7.6.1) specifies that a line break immediately following a start tag must be ignored, as must a line break immediately before an end tag. This applies to all HTML elements without exception. The following two HTML examples must be rendered identically:

    <P>Thomas is watching TV.</P>

    <P>
    Thomas is watching TV.
    </P>

So must the following two examples:

    <A>My favorite Website</A>

    <A>
    My favorite Website
    </A>
</q>

Note that Firebird doesn't follow the SGML line break rules! This is not to say we should follow its lead, only that we may find some buggy pages out there; at least on this point the SPEC is quite clear. ;)
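The "always ignore" rule discussed above can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not the Dillo parser: toy_trim_tag_spaces is a hypothetical helper, it works on a whole string rather than on parser tokens, and it assumes a simplified tag syntax (every tag is "<...>", an end tag starts with "</", no comments, no '>' inside attribute values).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not Dillo code: copies `in` to `out`, dropping
 * white space directly after a start tag and directly before an end tag.
 * `out` is always NUL-terminated if out_size > 0; excess is truncated. */
void toy_trim_tag_spaces(const char *in, char *out, size_t out_size)
{
    size_t o = 0;
    int after_start_tag = 0;
    const char *p = in;

    if (out_size == 0)
        return;

    while (*p != '\0') {
        if (*p == '<') {
            const char *close = strchr(p, '>');
            /* An unterminated tag runs to the end of the input. */
            const char *end = close ? close + 1 : p + strlen(p);
            int is_end_tag = (p[1] == '/');

            /* Before an end tag: back up over white space already copied. */
            if (is_end_tag) {
                while (o > 0 && isspace((unsigned char)out[o - 1]))
                    o--;
            }
            /* Copy the tag itself verbatim. */
            while (p < end) {
                if (o + 1 < out_size)
                    out[o++] = *p;
                p++;
            }
            after_start_tag = !is_end_tag;
        } else if (after_start_tag && isspace((unsigned char)*p)) {
            p++; /* After a start tag: skip the white space. */
        } else {
            after_start_tag = 0;
            if (o + 1 < out_size)
                out[o++] = *p;
            p++;
        }
    }
    out[o] = '\0';
}
```

Fed '<u>Some </u> text', this yields '<u>Some</u> text'; fed '<u>Some </u>text', it yields '<u>Some</u>text', which is exactly the no-space-at-all case raised above. The rule is simple, but it does not try to move the space out of the element.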
I'm not completely sure; we should carefully evaluate the specs. I'd suggest delaying this until the 0.8.2 release.
Neither am I! :-) At this point I see that applying patch 1, plus making the parser ignore white space after a start tag and before an end tag, is a correct solution that can endure a reality test. I'll start coding it. Please let me know your thoughts or any point I'm still missing.

Best
Jorge.-

PS: This work is for 0.8.2.