[Dillo-dev]Re: dillo patch: anchor names
Matthias,
Maybe the most important guideline in this answer is that we're trying to provide good hint messages for common HTML bugs, not to be as picky (or correct) as the W3C's validator.
The two main reasons behind this are that, first, we do not want to (nor can we) complicate the code inside dillo too much (some big browsers have several parsers inside), and second, we want to help fix the most problematic HTML bugs (mainly nesting), not all of them. BTW, inside Dillo all the HTML-like content is currently parsed as HTML-4.01 with a few minor exceptions. HTML-4.01 is a good default because it tries hard to be backwards compatible.
The third reason is that if the need for a formal validation arises, the W3C does a great job on it! :)
On Wed, Oct 13, 2004 at 06:20:36PM +0200, Matthias Franz wrote:
Dear Jorge,
here is the anchor name patch I promised you a long time ago. It does the following:
* First of all, it evaluates the <!doctype> tag to find out whether the document is HTML or XHTML. If the tag is wrong or missing, an error is raised.
Parsing <!doctype ...> is a good idea. Putting that info in a structure like this one:
   typedef enum {
      DT_NONE, DT_HTML, DT_XHTML
   } DocumentType;
   typedef struct {
      DocumentType Type;
      float Version;
   } DocumentInfo;
allows for having all the information in one place, and to later decide whether to take some action or not. e.g. DT_NONE + DT_HTML + 4.01 means no doctype was given and that HTML-4.01 is assumed as default. DT_HTML + 4.01 means it was stated explicitly in doctype.
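A minimal sketch of how such a structure could be filled in, under stated assumptions. The function sniff_doctype() and the helper find_nocase() are hypothetical and are not part of Dillo or of the patch. The sketch reads "no doctype given" as Type == DT_NONE with Version holding the assumed default of 4.01, which is only one possible reading of the notation above. Matching version strings by substring is deliberately fuzzy.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { DT_NONE, DT_HTML, DT_XHTML } DocumentType;

typedef struct {
   DocumentType Type;   /* what the doctype declared, DT_NONE if absent */
   float Version;       /* declared version, or the assumed default */
} DocumentInfo;

/* Hypothetical helper: case-insensitive substring search. */
static const char *find_nocase(const char *hay, const char *needle)
{
   size_t n = strlen(needle);

   for (; *hay; hay++) {
      size_t i = 0;
      while (i < n && hay[i] &&
             tolower((unsigned char)hay[i]) ==
             tolower((unsigned char)needle[i]))
         i++;
      if (i == n)
         return hay;
   }
   return NULL;
}

/* Hypothetical: fill 'info' from the text of a <!doctype ...> tag.
 * Pass NULL when the document has no doctype at all. */
void sniff_doctype(const char *tag, DocumentInfo *info)
{
   /* Assumption: with no usable doctype, HTML-4.01 is the default. */
   info->Type = DT_NONE;
   info->Version = 4.01f;

   if (tag == NULL || find_nocase(tag, "<!doctype") == NULL)
      return;

   /* Longer version strings are tested first: "HTML 4.01" contains
    * "HTML 4.0", so the order of these checks matters. */
   if (find_nocase(tag, "XHTML 1.1")) {
      info->Type = DT_XHTML; info->Version = 1.1f;
   } else if (find_nocase(tag, "XHTML 1.0")) {
      info->Type = DT_XHTML; info->Version = 1.0f;
   } else if (find_nocase(tag, "HTML 4.01")) {
      info->Type = DT_HTML; info->Version = 4.01f;
   } else if (find_nocase(tag, "HTML 4.0")) {
      info->Type = DT_HTML; info->Version = 4.0f;
   } else if (find_nocase(tag, "HTML 3.2")) {
      info->Type = DT_HTML; info->Version = 3.2f;
   } else if (find_nocase(tag, "HTML 2.0")) {
      info->Type = DT_HTML; info->Version = 2.0f;
   }
   /* An unrecognised doctype keeps the DT_NONE + 4.01 default. */
}
```

With this reading, a caller can tell "assumed 4.01" (Type == DT_NONE) from "declared 4.01" (Type == DT_HTML) and decide later whether a warning is worth raising.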
* Dillo now distinguishes more carefully between head and body section
There was a bug in dillo (up to rc1). A patch is now in CVS. When the HTML meta refresh warning was sent, it switched from IN_HEAD to IN_BODY. Note that for HTML-4.01:
   BODY: Start tag: optional, End tag: optional
   HEAD: Start tag: optional, End tag: optional
* The errors "<...> not allowed in body section" are now centralised in Html_process_tag
Could be.
Moreover, errors are raised in the following situations: (After all, this was the goal!)
* if an anchor name (defined by "name" or "id") is already defined
OK.
For performance reasons, I have changed (very) few lines in dw_page.c and dw_gtk_viewport.c.
(pending as for the latest bugs found...)
* if (in HTML mode) the "name" and "id" attributes of <a> differ
OK.
* if <a> tags are nested
OK.
* extra_warning if an anchor name (defined by "name") was illegal for "id"
OK.
NOT DONE:
* warning if in XHTML <a> is used with "name" and no "id" (according to the spec, this has no effect, which is probably not intended)
OK.
* the "refresh" warning causes (like before) an error if further head elements follow the <meta>
Fixed in CVS now (Björn Brill).
* I've discovered that some parts of the TagInfo structure are not used any more, for example TagLevel and bits 2^0 = 1 and 2^2 = 4 of Flags.
TagLevel is used extensively by the W3C+heuristics mode. Look at Html_tags_get_taglevel() calls. Yes, bits 0 and 2 are not yet used, but there they are just in case they're needed.
In particular, I didn't know how to define them for <!doctype> on line 4281.
HTML elements can be of type 'block' or 'inline' (well, also 'flow'). And they can be containers of 'inline' or containers of 'blocks'. This is what the flags are. I'll comment that inside the code. For instance, <address> is a 'block' element, and a container of 'inline' elements.
   address B8(0110)
              |||`- inline element
              ||`-- block element
              |`--- inline container
              `---- block container
This is well defined here:
   http://www.cs.tut.fi/~jkorpela/html/nesting.html
Now, as !doctype isn't there, an inline element that's a block container can appear almost anywhere (i.e. B8(0101)), and help to tackle the issue.
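To make the four flag bits described above concrete, here is a small sketch. The FLAG_* names, ADDRESS_FLAGS and may_contain() are hypothetical and chosen for illustration; Dillo's own code uses the B8() notation and its real nesting logic is more involved than this simplified test.

```c
#include <stdbool.h>

/* Hypothetical names for the four bits, lowest bit first. */
enum {
   FLAG_INLINE_ELEMENT   = 1 << 0,   /* ...1 */
   FLAG_BLOCK_ELEMENT    = 1 << 1,   /* ..1. */
   FLAG_INLINE_CONTAINER = 1 << 2,   /* .1.. */
   FLAG_BLOCK_CONTAINER  = 1 << 3    /* 1... */
};

/* <address> as in the example: binary 0110, that is a block element
 * that is a container of inline elements. */
#define ADDRESS_FLAGS (FLAG_BLOCK_ELEMENT | FLAG_INLINE_CONTAINER)

/* Simplified nesting test (an assumption, not Dillo's actual rule):
 * a child fits if the parent is a container for the child's kind. */
bool may_contain(unsigned parent, unsigned child)
{
   if ((child & FLAG_BLOCK_ELEMENT) && (parent & FLAG_BLOCK_CONTAINER))
      return true;
   if ((child & FLAG_INLINE_ELEMENT) && (parent & FLAG_INLINE_CONTAINER))
      return true;
   return false;
}
```

Under these assumptions, may_contain(ADDRESS_FLAGS, FLAG_INLINE_ELEMENT) is true, while a pure block element inside <address> is rejected.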
* IN_BUTTON in html.h is also not used any more; I've replaced it by the new IN_A.
Let IN_BUTTON be. As buttons can't be nested, it was meant to catch that one (not implemented yet).
* One change in Html_process_tag is more of a hack; I didn't want to start rewriting everything without contacting you first.
You see that there is still work to do in html.c, all the more because now one could add error messages based on the distinction between HTML and XHTML. (E.g., "@" is illegal in XHTML because of the uppercase "X".) Would this kind of change be welcome?
Hmmm, I think this is too much by now.
I hope this patch can still make it into rc2. If you have comments or questions, please let me know.
As explained before, it better not be in rc2. Just bug-fixes.
-- 
Regards
Jorge.-
Firstly, hello - I'm a fan and user of Dillo, as well as being a web developer, but I am in no way a coder and I don't have any experience in building web browsers or any other programs - so bear with me if I am talking rubbish. I do know a bit about HTML and XHTML, though, and I was wondering about your doctype-sniffing patch. As I understand it, you're trying to distinguish between HTML 4.01 and XHTML 1.x by sniffing the doctype and giving appropriate warnings for invalid markup. Are you also going to alter the rendering for XHTML? I know that the big browsers such as Gecko or IE6 do doctype-sniffing to switch between a "quirks" mode or a "standards-compliant" mode - are you thinking of doing this, or is it just for showing an error dialog?
* First of all, it evaluates the <!doctype> tag to find out whether the document is HTML or XHTML. If the tag is wrong or missing, an error is raised.
Parsing <!doctype ...> is a good idea.
There's a good article here: http://www.hixie.ch/advocacy/xhtml which talks amongst other things about the impossibility of correctly identifying an XHTML document which might be of interest to you.
You see that there is still work to do in html.c, all the more because now one could add error messages based on the distinction between HTML and XHTML. (E.g., "@" is illegal in XHTML because of the uppercase "X".) Would this kind of change be welcome?
In my personal and certainly very humble opinion, if an XHTML 1.x document is served with the mime-type text/html (as virtually all are, and anyway Dillo doesn't do application/xhtml+xml), it should simply be parsed as HTML 4.01, precisely because the mime type is a clear indication that it is supposed to be an HTML 4.01 compatible document.
If you are doing doctype sniffing in the Gecko way to switch rendering modes, then I'm sure you'll do it better than IE6 and not assume that the doctype can only occur on the first line (IE6 messes up if there's an xml prolog or even a comment). Hey, it'll just be another reason why Dillo is better than IE6! Of course, the currently accepted convention for other browsers is that HTML 4.01 doctypes which include a full w3c DTD url are treated as standards-compliant, but 4.01 without the url and HTML 4.0 and earlier are not. XHTML doctypes are always standards-compliant whether or not a url is present.
Finally, and most importantly, I'd like to add my word of thanks to the developers of this really excellent little browser, and having subscribed recently to this list, I can see the level of dedication to making Dillo even better.
Richard Page-Wood
Hi Richard,
thanks for your comments!
On Fri, Oct 15, 2004 at 04:35:29PM -0400, Richard Page-Wood wrote:
I do know a bit about HTML and XHTML, though, and I was wondering about your doctype-sniffing patch. As I understand it, you're trying to distinguish between HTML 4.01 and XHTML 1.x by sniffing the doctype and giving appropriate warnings for invalid markup. Are you also going to alter the rendering for XHTML?
Certainly not as part of my patch. Its origin was simply the observation that Dillo refused anchor names like "Dürst" which are allowed in HTML if defined with the "name" attribute (see Section 12.2.3 of the HTML 4.01 spec).
There's a good article here:
http://www.hixie.ch/advocacy/xhtml
which talks amongst other things about the impossibility of correctly identifying an XHTML document which might be of interest to you.
Having looked at this article and the references given therein, I don't feel anymore that it would be a good idea to try and figure out whether the document type is HTML or XHTML.
I still like the idea of supporting XHTML in some way, mostly because XML lacks many of the strange features of SGML that make parsing difficult. For example, "<" and "&" are not allowed as ordinary characters in XML. But this has nothing to do with anchor names, so I will remove the XHTML parts of the patch (unless someone complains).
Jorge: Are you still interested in evaluating <!DOCTYPE> to figure out the HTML version? Maybe it would be ok for a small browser like Dillo to stick to HTML 4.01.
All the best,
-- 
Matthias Franz
Section de Mathématiques, Université de Genève, Suisse
Matthias,
Please excuse me for the delayed answer. I had a hard time fixing release candidates for the dillo-0.8.3 release...
On Tue, Oct 19, 2004 at 12:03:17PM +0200, Matthias Franz wrote:
Hi Richard,
thanks for your comments!
Yes, very interesting.
On Fri, Oct 15, 2004 at 04:35:29PM -0400, Richard Page-Wood wrote:
I do know a bit about HTML and XHTML, though, and I was wondering about your doctype-sniffing patch. As I understand it, you're trying to distinguish between HTML 4.01 and XHTML 1.x by sniffing the doctype and giving appropriate warnings for invalid markup. Are you also going to alter the rendering for XHTML?
Certainly not as part of my patch. Its origin was simply the observation that Dillo refused anchor names like "Dürst" which are allowed in HTML if defined with the "name" attribute (see Section 12.2.3 of the HTML 4.01 spec).
There's a good article here:
http://www.hixie.ch/advocacy/xhtml
which talks amongst other things about the impossibility of correctly identifying an XHTML document which might be of interest to you.
Having looked at this article and the references given therein, I don't feel anymore that it would be a good idea to try and figure out whether the document type is HTML or XHTML.
I still like the idea of supporting XHTML in some way, mostly because XML lacks many of the strange features of SGML that make parsing difficult. For example, "<" and "&" are not allowed as ordinary characters in XML. But this has nothing to do with anchor names, so I will remove the XHTML parts of the patch (unless someone complains).
Jorge: Are you still interested in evaluating <!DOCTYPE> to figure out the HTML version? Maybe it would be ok for a small browser like Dillo to stick to HTML 4.01.
Could be... As the suggested document explains, there's not a big gain in serving XHTML as such, and nowadays most of it is served as "text/html". BTW, it's hard to find a site that serves XHTML as "application/xhtml+xml", that's not intended for testing.
In our case the "detection" was just to try to provide a hintful HTML/XHTML warning. The easy solution is not to raise a warning, or to send it to extra warnings ;).
What worries me a bit more is what to do with XHTML served with the proper MIME type. Currently it's not rendered at all, though Dillo can perfectly cope with it. The reason is that the XHTML SPEC requires a validating client, and as Dillo doesn't include a formal XML parser this is not possible. Today this is not a problem because such sites are very seldom found.
Maybe a partial validation can serve the standards compliance objective. I mean, for instance: proper nesting, lowercase tags, tag names in the XHTML namespace. Not much more than that.
Perhaps MIME type detection, plus some doctype sniffing (to have "an idea" of whether we are dealing with HTML 2.0, 3.2, 4.0, 4.01 or XHTML), and having that information in a structure like the one suggested in the former mail, could serve to fine-tune the warning messages (or parser behaviour) a bit. For instance, having that info, the case of anchor names can lead to something as simple as:
   if (!isalpha(val[0]) && doctype == DOCTYPE_XHTML)
      MSG_HTML("first character of '%s' value is outside"
               " the [A-Za-z] set\n", attrname);
Just make the patch with comments where these messages should go. With the fuzzy detection code in place it'll be a matter of binding. Not the highest priority, but easy to merge.
-- 
Cheers
Jorge.-
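The anchor-name check hinted at above could be sketched as follows. The function first_bad_id_char() and its helpers are hypothetical names, not Dillo code. The character rule is the HTML 4.01 one for ID and NAME tokens: a letter from [A-Za-z] first, then letters, digits, "-", "_", ":" or ".". Explicit ASCII ranges are used instead of isalpha(), because isalpha() depends on the current locale and could accept characters such as "ü".

```c
#include <stdbool.h>
#include <stddef.h>

/* Locale-independent test for [A-Za-z]. */
static bool is_ascii_letter(unsigned char c)
{
   return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Locale-independent test for [0-9]. */
static bool is_ascii_digit(unsigned char c)
{
   return c >= '0' && c <= '9';
}

/* Hypothetical helper: return the index of the first character that
 * breaks the ID token rule, or -1 if 'val' is a valid ID token.
 * An empty value is reported as invalid at index 0. */
int first_bad_id_char(const char *val)
{
   size_t i;

   if (!is_ascii_letter((unsigned char)val[0]))
      return 0;
   for (i = 1; val[i] != '\0'; i++) {
      unsigned char c = (unsigned char)val[i];
      if (!is_ascii_letter(c) && !is_ascii_digit(c) &&
          c != '-' && c != '_' && c != ':' && c != '.')
         return (int)i;
   }
   return -1;
}
```

Whether a non-negative result becomes an error, a warning or an extra warning for a given document type is left to the caller; that is where the doctype information would come in.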
Jorge Arellano Cid wrote:
BTW, it's hard to find a site that serves XHTML as "application/xhtml+xml", that's not intended for testing.
I think within the web design/web standards blogging community, you're more likely to find them. There was a "hall of fame"-type list someone did a while back, though I can't remember what it was called or where it was. Wordpress generates valid XHTML by default and sends it as text/html -- but some of the more standards-zealous and/or technically-inclined WP users have modified their sites to send the proper mime type when appropriate. (I haven't since I still don't quite trust user-supplied text not to break the XHTML -- but several sections of my site do.) Also, XHTML sites are going to be hard to find with a browser that doesn't support it, because most sites that *do* use the proper mime type use content negotiation to decide which mime type to send. This is the only way to keep the site visible to HTML-only browsers like that IE thing some people still use ;-). So unless your UA's HTTP Accept: header includes application/xhtml+xml, it won't see many sites using that mime type.
Maybe a partial validation can serve the standards compliance objective. I mean, for instance: proper nesting, lowercase tags, tag names in the XHTML namespace. Not much more than that.
I don't recall exactly what the spec states, but from experience with Gecko browsers the main issue with broken XHTML appears to be with pages that aren't *well-formed*. So tossing in an unfamiliar attribute may break validation, but it's still well-formed XML (all tags nested and closed properly, including <img/>, <br/>, <meta/>, etc.) so the page will still display. The main errors I've noticed Mozilla complaining about are misnesting, missing closing tags, and undefined character entities.
On a related note: is there a reason Dillo doesn't send an Accept: header? I'd think that something basic like "text/html, image/png, image/jpeg, image/gif, text/plain;q=0.9" with an optional ", */*;q=0.1" at the end would be about right.
-- 
Kelson Vibber
www.hyperborea.org
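As an illustration of the Accept: suggestion, here is a sketch of a request builder that sends exactly the proposed value. The function build_get_request() and the ACCEPT_* macros are hypothetical; this is not how Dillo builds its queries, and the use of HTTP/1.0 with a Host: header is an assumption made to keep the example short.

```c
#include <stdio.h>

/* The media types proposed above, plus the optional catch-all. */
#define ACCEPT_BASE \
   "text/html, image/png, image/jpeg, image/gif, text/plain;q=0.9"
#define ACCEPT_ANY ", */*;q=0.1"

/* Hypothetical helper: write a minimal GET request into 'buf'.
 * Returns the length the request needs, as snprintf() does, so the
 * caller can detect truncation by comparing it against 'size'. */
int build_get_request(char *buf, size_t size,
                      const char *host, const char *path,
                      int accept_any)
{
   return snprintf(buf, size,
                   "GET %s HTTP/1.0\r\n"
                   "Host: %s\r\n"
                   "Accept: %s%s\r\n"
                   "\r\n",
                   path, host,
                   ACCEPT_BASE, accept_any ? ACCEPT_ANY : "");
}
```

Since application/xhtml+xml is not in the list, a server doing content negotiation would keep sending text/html to such a client.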
participants (4)
- Jorge Arellano Cid
- Kelson Vibber
- Matthias Franz
- Richard Page-Wood