Hi,

On Sat, Apr 20, 2024 at 02:35:10PM +0200, Rodrigo Arias wrote:
Hi,
On Sat, Apr 20, 2024 at 02:00:05PM +1000, Kevin Koster wrote:
This problem was present in 3.0.5 as well as in 3.1.0-rc1.
URL: http://www.lemis.com/
CSS: enabled or disabled
Summary: Won't render HTML with comments before <!DOCTYPE>.

Pages on this website aren't rendered, just displayed as source code. Although they are XHTML, this doesn't appear to be due to this bug:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036382

If I save the file as lemis.xhtml and remove the two comment lines before the <!DOCTYPE> declaration, then it renders when I load it with file:// or http://, otherwise it doesn't.

My reading is that comments there are valid for HTML 4 (which is declared in the page's <!DOCTYPE>) since the standard says:

  "White space (spaces, newlines, tabs, and comments) may appear before or after each section."

https://www.w3.org/TR/html401/struct/global.html#h-7.1
Yeah, the current detection mechanism in Dillo for content types is not very good. It searches for the doctype line at the beginning of the document[1] but it doesn't handle comments.
[1]:https://github.com/dillo-browser/dillo/blob/v3.1.0-rc1/src/misc.c#L148
We should rely on the Content-Type provided by the server, or at least improve the detection.
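To make the failure mode concrete, here is a minimal sketch in Python (not Dillo's actual C code) of prefix-only sniffing and of a variant that skips leading comments first. The function names, the marker list and the sample page are all made up for illustration; the real checks in src/misc.c may differ.

```python
import re

# Hypothetical marker list; the real one in Dillo's src/misc.c may differ.
HTML_MARKERS = (b"<!doctype html", b"<html")

# Optional whitespace followed by one complete comment.
_LEADING_COMMENT = re.compile(rb"\s*<!--.*?-->", re.DOTALL)


def guess_type_naive(data: bytes) -> str:
    """Skip leading whitespace, then look for an HTML marker as a prefix."""
    head = data.lstrip()[:64].lower()
    if head.startswith(HTML_MARKERS):
        return "text/html"
    return "text/plain"


def guess_type_skip_comments(data: bytes) -> str:
    """Same prefix check, but drop any leading comments first."""
    pos = 0
    while True:
        match = _LEADING_COMMENT.match(data, pos)
        if match is None:
            break
        pos = match.end()
    return guess_type_naive(data[pos:])


# Made-up page with two comments before the doctype, as in the report.
page = b"<!-- first -->\n<!-- second -->\n<!DOCTYPE html PUBLIC \"...\">\n<html>"
print(guess_type_naive(page))          # text/plain
print(guess_type_skip_comments(page))  # text/html
```

With the two comments in front, the plain prefix check falls back to "text/plain", while the comment-skipping variant still recognises the doctype.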
So, this is a tricky case. Dillo keeps several content types for a single document, sorted by priority. The first one that is set defines the content type of the document:

1. The "override type" used to override the type (highest priority)
2. The "meta type" given by the <meta ... content="..."> tag in HTML
3. The "http type" given by the HTTP Content-Type header
4. The "guessed type" based on the document data (lowest priority)

They all start set to NULL. At first, the server sends "text/html; charset=UTF-8" which defines the http type:
  % curl -sI http://www.lemis.com/ | grep Content
  Content-Type: text/html; charset=UTF-8
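The priority order described above can be sketched roughly like this in Python. It is only a model of the rule "the first type that is set wins", not Dillo's code, and effective_type and its parameter names are hypothetical.

```python
from typing import Optional


def effective_type(override: Optional[str],
                   meta: Optional[str],
                   http: Optional[str],
                   guessed: Optional[str]) -> Optional[str]:
    """Return the first type that is set, from highest to lowest priority."""
    for candidate in (override, meta, http, guessed):
        if candidate is not None:
            return candidate
    return None


# Only the HTTP header has been seen so far, so it decides the type.
print(effective_type(None, None, "text/html; charset=UTF-8", None))
# text/html; charset=UTF-8
```

Once a higher-priority slot is filled in, it takes over regardless of what the lower ones say, which is why a wrong value there is enough to change how the whole document is handled.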
The guessed type is also set, wrongly, to "text/plain": the comments at the beginning of the document prevent the "<!doctype" check from matching. This is the *first bug*.

As the document continues loading, the <meta> tag is found:

  % curl -s http://www.lemis.com/ | grep Content
  <meta http-equiv="Content-Type" content="text/xhtml; charset=utf-8"/>
This sets the meta type to "text/xhtml". So far we have this situation:

  override_type = NULL
  meta_type     = "text/xhtml; charset=utf-8"
  http_type     = "text/html; charset=UTF-8"
  guessed_type  = "text/plain"

While setting the meta type, there is also a special rule, added as a workaround for Doxygen pages, which checks if the content type of the meta tag begins with "text/xhtml" (which it does) and if so sets the override type to the guessed type:

https://github.com/dillo-browser/dillo/blob/a0151cbc86166731465b963ea3addb04...

So the types are left as follows:

  override_type = "text/plain"
  meta_type     = "text/xhtml; charset=utf-8"
  http_type     = "text/html; charset=UTF-8"
  guessed_type  = "text/plain"

This causes the document to be handled as "text/plain".

The "text/xhtml" type should instead be declared as "application/xhtml+xml", as the W3C describes:

https://www.w3.org/TR/xhtml-media-types/#media-types

Dillo handles that type fine. So, I'm thinking of transforming "text/xhtml" into "application/xhtml+xml", rather than relying on the guessed type. The types end up being:

  override_type = "application/xhtml+xml; charset=utf-8"
  meta_type     = "text/xhtml; charset=utf-8"
  http_type     = "text/html; charset=UTF-8"
  guessed_type  = "text/plain"

That solves the problem.

AFAIK, "text/xhtml" is not standardized. It was mentioned in the XHTML 1.0 draft in February 1999:

https://www.w3.org/TR/1999/WD-html-in-xml-19990224/#h-5.1.3

In March they raised concerns about it:

https://www.w3.org/TR/1999/WD-html-in-xml-19990304/
There is one issue that is still mildly contentious within the working group, and that we are especially interested in receiving comments on: whether we should register a new Internet media type "text/xhtml".
Very briefly the two opinions are: yes - that is the only way to recognise the application type without accessing the resource; no - all XML applications are going to have this problem, and the answer isn't to register every single application.
And in May it was removed:

https://www.w3.org/TR/1999/xhtml1-19990505/

It is not part of the XHTML 1.0 standard either:

https://www.w3.org/TR/xhtml1/

So I don't think it should ever be used.

Regarding the type guessing bug, I think I can improve it by assuming that if we find the "<!doctype html" string in the first 1024 bytes or so, it is an HTML-like type, but that incurs more overhead. So I think for now we can rely on the correction of "text/xhtml" to "application/xhtml+xml", which seems to work fine. I don't like adding quirks, but I will keep this one as it was already there.

Here is the PR:

https://github.com/dillo-browser/dillo/pull/140

I'll check with some Doxygen pages to see that it doesn't break anything. Interestingly, to this day they continue to generate documents with the wrong "text/xhtml" content type (for at least 13 years, based on the git blame):

https://github.com/doxygen/doxygen/blame/78422d3905e57acebf0374feefafa6578db...

I'll open an issue on their repo too.

Best regards,
Rodrigo.