Matthias,

Maybe the most important guideline in this answer is that we're trying to provide good hint messages for common HTML bugs, not to be as picky (or correct) as the W3C's validator.

The two main reasons behind this are that, first, we do not want to (nor can we) complicate the code inside dillo too much (some big browsers have several parsers inside), and second, we want to help fix the most problematic HTML bugs (mainly nesting), not all of them.

BTW, inside Dillo all the HTML-like content is currently parsed as HTML-4.01 with a few minor exceptions. HTML-4.01 is a good default because it tries hard to be backwards compatible.

The third reason is that if the need for a formal validation arises, the W3C does a great job on it! :)

On Wed, Oct 13, 2004 at 06:20:36PM +0200, Matthias Franz wrote:
Dear Jorge,
here is the anchor name patch I promised you a long time ago. It does the following:
* First of all, it evaluates the <!doctype> tag to find out whether the document is HTML or XHTML. If the tag is wrong or missing, an error is raised.
Parsing <!doctype ...> is a good idea. Putting that info in a structure like this one:

   typedef enum {
      DT_NONE,
      DT_HTML,
      DT_XHTML
   } DocumentType;

   typedef struct {
      DocumentType Type;
      float Version;
   } DocumentInfo;

allows for having all the information in one place, and to later decide whether to take some action or not.

e.g. DT_NONE + DT_HTML + 4.01 means no doctype was given and that HTML-4.01 is assumed as default. DT_HTML + 4.01 means it was stated explicitly in doctype.
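A minimal sketch of how such a structure could be filled, not the code from the patch. The names doctype_parse and find_nocase are hypothetical, and recognizing the doctype by the "//DTD HTML " and "//DTD XHTML " parts of the public identifier is an assumption made for illustration:

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef enum { DT_NONE, DT_HTML, DT_XHTML } DocumentType;

typedef struct {
   DocumentType Type;
   float Version;
} DocumentInfo;

/* Case-insensitive substring search (hypothetical helper). */
static const char *find_nocase(const char *hay, const char *needle)
{
   size_t n = strlen(needle);

   for (; *hay; hay++) {
      size_t i = 0;
      while (i < n && hay[i] &&
             tolower((unsigned char)hay[i]) ==
             tolower((unsigned char)needle[i]))
         i++;
      if (i == n)
         return hay;
   }
   return NULL;
}

/* Fill 'info' from the text of a <!doctype ...> tag (NULL if the document
 * has none). Returns 1 when a known public identifier was recognized. */
int doctype_parse(const char *tag, DocumentInfo *info)
{
   static const char html_key[] = "//DTD HTML ";
   static const char xhtml_key[] = "//DTD XHTML ";
   const char *p;
   char *end;
   float version;

   assert(info != NULL);

   /* Default: nothing stated, HTML-4.01 assumed. */
   info->Type = DT_NONE;
   info->Version = 4.01f;
   if (tag == NULL)
      return 0;

   if ((p = find_nocase(tag, xhtml_key)) != NULL) {
      info->Type = DT_XHTML;
      p += strlen(xhtml_key);
   } else if ((p = find_nocase(tag, html_key)) != NULL) {
      info->Type = DT_HTML;
      p += strlen(html_key);
   } else {
      return 0;   /* unknown doctype: keep the default */
   }

   /* The version number follows the key, e.g. "4.01//EN". */
   version = strtof(p, &end);
   if (end != p)
      info->Version = version;
   return 1;
}
```

With something like this, the parser only records what it saw; whether a missing or unknown doctype deserves a message can be decided later by looking at Type.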
* Dillo now distinguishes more carefully between head and body section
There was a bug in dillo (up to rc1). A patch is now in CVS. When the HTML meta refresh warning was sent, it switched from IN_HEAD to IN_BODY.

Note that for HTML-4.01:

   BODY: Start tag: optional, End tag: optional
   HEAD: Start tag: optional, End tag: optional
* The errors "<...> not allowed in body section" are now centralised in Html_process_tag
Could be.
Moreover, errors are raised in the following situations: (After all, this was the goal!)
* if an anchor name (defined by "name" or "id") is already defined
OK.
For performance reasons, I have changed (very) few lines in dw_page.c and dw_gtk_viewport.c.
(pending as for the latest bugs found...)
* if (in HTML mode) the "name" and "id" attributes of <a> differ
OK.
* if <a> tags are nested
OK.
* extra_warning if an anchor name (defined by "name") was illegal for "id"
OK.
NOT DONE:
* warning if in XHTML <a> is used with "name" and no "id" (according to the spec, this has no effect, which is probably not intended)
OK.
* the "refresh" warning causes (like before) an error if further head elements follow the <meta>
Fixed in CVS now (Björn Brill).
* I've discovered that some parts of the TagInfo structure are not used any more, for example TagLevel and bits 2^0 = 1 and 2^2 = 4 of Flags.
TagLevel is used extensively by the W3C+heuristics mode. Look at Html_tags_get_taglevel() calls. Yes, bits 0 and 2 are not yet used, but there they are just in case they're needed.
In particular, I didn't know how to define them for <!doctype> on line 4281.
HTML elements can be of type 'block' or 'inline' (well, also 'flow'). And they can be containers of 'inline' or containers of 'blocks'. This is what the flags are. I'll comment that inside the code.

For instance, <address> is a 'block' element, and a container of 'inline' elements.

   address   B8(0110)
                |||`- inline element
                ||`-- block element
                |`--- inline container
                `---- block container

This is well defined here:
   http://www.cs.tut.fi/~jkorpela/html/nesting.html

Now, as !doctype isn't there, an inline element that's a block container can appear almost anywhere (i.e. B8(0101)), and help to tackle the issue.
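To make the bit layout concrete, here is a small sketch that follows the diagram for <address> (bit 0 is the rightmost digit). The constant names, the TagFlags table and both functions are hypothetical and are not Dillo's TagInfo; the "em" and "div" entries are added here only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One bit per property, in the order drawn for <address>. */
enum {
   F_INLINE_ELEMENT   = 1 << 0,
   F_BLOCK_ELEMENT    = 1 << 1,
   F_INLINE_CONTAINER = 1 << 2,
   F_BLOCK_CONTAINER  = 1 << 3
};

typedef struct {
   const char *name;
   unsigned flags;
} TagFlags;

static const TagFlags tag_table[] = {
   /* block element, container of inline: 0110 */
   {"address", F_BLOCK_ELEMENT | F_INLINE_CONTAINER},
   /* inline element, container of inline: 0101 */
   {"em", F_INLINE_ELEMENT | F_INLINE_CONTAINER},
   /* block element, container of flow (block and inline): 1110 */
   {"div", F_BLOCK_ELEMENT | F_INLINE_CONTAINER | F_BLOCK_CONTAINER}
};

/* Look up the flags of a tag; returns 0 for tags not in the table. */
unsigned tag_flags(const char *name)
{
   size_t i;

   assert(name != NULL);
   for (i = 0; i < sizeof(tag_table) / sizeof(tag_table[0]); i++)
      if (strcmp(tag_table[i].name, name) == 0)
         return tag_table[i].flags;
   return 0;
}

/* Returns 1 if, judging only by these four bits, an element with
 * 'child' flags may sit directly inside one with 'parent' flags. */
int nesting_allowed(unsigned parent, unsigned child)
{
   if ((child & F_INLINE_ELEMENT) && (parent & F_INLINE_CONTAINER))
      return 1;
   if ((child & F_BLOCK_ELEMENT) && (parent & F_BLOCK_CONTAINER))
      return 1;
   return 0;
}
```

Under these assumptions <em> inside <address> is accepted, while <div> inside <address> is flagged, because <address> has no block-container bit.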
* IN_BUTTON in html.h is also not used any more; I've replaced it by the new IN_A.
Let IN_BUTTON be. As buttons can't be nested, it was meant to catch that one (not implemented yet).
* One change in Html_process_tag is more of a hack; I didn't want to start rewriting everything without contacting you first.
You see that there is still work to do in html.c, all the more because now one could add error messages based on the distinction between HTML and XHTML. (E.g., "&#X40;" is illegal in XHTML because of the uppercase "X".) Would this kind of change be welcome?
Hmmm, I think this is too much for now.
I hope this patch can still make it into rc2. If you have comments or questions, please let me know.
As explained before, it better not be in rc2. Just bug-fixes.

--
Regards
Jorge.-