[Dillo-dev] Re: Dillo not rendering HTML with comments before <!DOCTYPE>

April 23, 2024

Hi,

On Sat, Apr 20, 2024 at 02:35:10PM +0200, Rodrigo Arias wrote:
...
Hi,
On Sat, Apr 20, 2024 at 02:00:05PM +1000, Kevin Koster wrote:
...
This problem was present in 3.0.5 as well as in 3.1.0-rc1.
URL: http://www.lemis.com/
CSS: enabled or disabled
Summary: Won't render HTML with comments before <!DOCTYPE>.
Pages on this website aren't rendered, just displayed as source code.
Although they are XHTML, this doesn't appear to be due to this bug:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036382
If I save the file as lemis.xhtml and remove the two comment lines
before the <!DOCTYPE> declaration, then it renders when I load it with
file:// or http://, otherwise it doesn't.
My reading is that comments there are valid for HTML 4 (which is
declared in the page's <!DOCTYPE>) since the standard says:
White space (spaces, newlines, tabs, and comments) may appear before or
after each section.
https://www.w3.org/TR/html401/struct/global.html#h-7.1
Yeah, the current detection mechanism in Dillo for content types is 
not very good. It searches for the doctype line at the beginning of 
the document[1] but it doesn't handle comments.
[1]:https://github.com/dillo-browser/dillo/blob/v3.1.0-rc1/src/misc.c#L148
We should rely on the Content-Type provided by the server, or at least 
improve the detection.
So, this is a tricky case.

Dillo keeps several content types for a single document, sorted by
priority. The first one that is set defines the content type of the
document:

1. The "override type" used to override the type (highest priority)
2. The "meta type" given by the <meta ... content="..."> tag in HTML
3. The "http type" given by the HTTP Content-Type header
4. The "guessed type" based on the document data (lowest priority)

They all start set to NULL.
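
To make the priority rule concrete, here is a minimal sketch in C. The
struct and function names are hypothetical and chosen only for
illustration; they are not Dillo's actual data structures.

```c
#include <stddef.h>

/* Hypothetical container for the four content types (illustrative
 * names, not Dillo's real fields). All start as NULL. */
typedef struct {
   const char *override_type; /* highest priority */
   const char *meta_type;     /* from <meta ... content="..."> */
   const char *http_type;     /* from the HTTP Content-Type header */
   const char *guessed_type;  /* lowest priority, from document data */
} ContentTypes;

/* Return the first type that is set, in priority order,
 * or NULL if none has been set yet. */
static const char *effective_type(const ContentTypes *ct)
{
   if (ct->override_type)
      return ct->override_type;
   if (ct->meta_type)
      return ct->meta_type;
   if (ct->http_type)
      return ct->http_type;
   return ct->guessed_type;
}
```

With this rule, setting the override type always wins, no matter what
the other three contain.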

At first, the server sends "text/html; charset=UTF-8" which defines the 
http type:
...
% curl -sI http://www.lemis.com/ | grep Content
Content-Type: text/html; charset=UTF-8
The guessed type is also wrongly set to "text/plain" because the comments
at the beginning prevent the "<!doctype" match from succeeding. This is
the *first bug*.

As the document continues loading, the <meta> tag is found:
...
% curl -s http://www.lemis.com/ | grep Content
<meta http-equiv="Content-Type" content="text/xhtml; charset=utf-8"/>
Which sets the meta type to "text/xhtml". So far we have this situation:

override_type = NULL
meta_type = "text/xhtml; charset=utf-8"
http_type = "text/html; charset=UTF-8"
guessed_type = "text/plain"

While setting the meta type, there is also a special rule as a
workaround for Doxygen pages, which checks if the content type of
the meta tag begins with "text/xhtml" (which it does) and if so sets the
override type to the guessed type:

https://github.com/dillo-browser/dillo/blob/a0151cbc86166731465b963ea3addb04...

So the types are left as follows:

override_type = "text/plain"
meta_type = "text/xhtml; charset=utf-8"
http_type = "text/html; charset=UTF-8"
guessed_type = "text/plain"

This causes the type of the document to be handled as "text/plain".

The "text/xhtml" type should be defined as "application/xhtml+xml", as 
the W3C describes:

https://www.w3.org/TR/xhtml-media-types/#media-types

Which Dillo handles fine.

So, I'm thinking of transforming the "text/xhtml" into
"application/xhtml+xml", rather than relying on the guessed type. The
types end up being:

override_type = "application/xhtml+xml; charset=utf-8"
meta_type     = "text/xhtml; charset=utf-8"
http_type     = "text/html; charset=UTF-8"
guessed_type  = "text/plain"

That solves the problem.
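
As a rough sketch of that transformation (not the code in the PR), the
function below rewrites a meta type that begins with "text/xhtml" and
keeps any parameters such as the charset. The function name and the
buffer-based interface are assumptions made for this example.

```c
#include <stdio.h>
#include <strings.h> /* strncasecmp (POSIX) */

/* Hypothetical helper: if meta_type begins with "text/xhtml", write
 * the corrected type into out, preserving whatever follows the media
 * type (e.g. "; charset=utf-8"). Returns 1 when a corrected type was
 * written, 0 when meta_type does not match or out is too small. */
static int fix_xhtml_type(const char *meta_type, char *out, size_t out_size)
{
   static const char bad[] = "text/xhtml";
   static const char good[] = "application/xhtml+xml";
   size_t bad_len = sizeof(bad) - 1;
   int n;

   /* Same kind of prefix check as the existing workaround. */
   if (strncasecmp(meta_type, bad, bad_len) != 0)
      return 0;

   n = snprintf(out, out_size, "%s%s", good, meta_type + bad_len);
   return n >= 0 && (size_t)n < out_size;
}
```

The result would then be stored as the override type, in place of the
guessed type.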

AFAIK, "text/xhtml" is not standardized. It was mentioned in the
XHTML 1.0 draft in February 1999:

https://www.w3.org/TR/1999/WD-html-in-xml-19990224/#h-5.1.3

In March they raised concerns about it:

https://www.w3.org/TR/1999/WD-html-in-xml-19990304/
...
There is one issue that is still mildly contentious within the working 
group, and that we are especially interested in receiving comments on: 
whether we should register a new Internet media type "text/xhtml".
Very briefly the two opinions are: yes - that is the only way to 
recognise the application type without accessing the resource; no - 
all XML applications are going to have this problem, and the answer 
isn't to register every single application.
And in May it was removed: https://www.w3.org/TR/1999/xhtml1-19990505/

And is not part of the XHTML 1.0 standard: https://www.w3.org/TR/xhtml1/

So I don't think it should ever be used.

Regarding the type guessing bug, I think I can improve it by assuming
that if we find the "<!doctype html" string in the first 1024 bytes or
so, it is an HTML-like type, but that incurs more overhead.
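
A minimal sketch of that idea, assuming a case-insensitive scan limited
to the first 1024 bytes. The function name and the SCAN_LIMIT constant
are hypothetical; this is not the detection code in src/misc.c.

```c
#include <stddef.h>
#include <strings.h> /* strncasecmp (POSIX) */

#define SCAN_LIMIT 1024 /* assumed window size ("1024 bytes or so") */

/* Return 1 if "<!doctype html" appears (in any letter case) within
 * the first SCAN_LIMIT bytes of buf, 0 otherwise. */
static int looks_like_html(const char *buf, size_t len)
{
   static const char needle[] = "<!doctype html";
   size_t nlen = sizeof(needle) - 1;
   size_t limit = len < SCAN_LIMIT ? len : SCAN_LIMIT;
   size_t i;

   /* Try every start offset where the whole needle still fits. */
   for (i = 0; i + nlen <= limit; i++) {
      if (strncasecmp(buf + i, needle, nlen) == 0)
         return 1;
   }
   return 0;
}
```

The extra overhead comes from comparing at every offset in the window
instead of only at the start of the document.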

So I think for now we can rely on the correction of "text/xhtml" to
"application/xhtml+xml", which seems to work fine. I don't like adding
quirks, but I will keep this one as it was already there. Here is the 
PR:

https://github.com/dillo-browser/dillo/pull/140

I'll check with some Doxygen pages to make sure nothing breaks.
Interestingly, to this day they continue to generate documents with the
wrong "text/xhtml" content type (for at least 13 years, based on the 
git blame):

https://github.com/doxygen/doxygen/blame/78422d3905e57acebf0374feefafa6578db...

I'll open an issue on their repo too.

Best regards,
Rodrigo.