Dillo not rendering HTML with comments before <!DOCTYPE>

Kevin Koster

April 20, 2024

4 a.m.

This problem was present in 3.0.5 as well as in 3.1.0-rc1.

URL: http://www.lemis.com/
CSS: enabled or disabled
Summary: Won't render HTML with comments before <!DOCTYPE>.

Pages on this website aren't rendered, just displayed as source code. Although they are XHTML, this doesn't appear to be due to this bug:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036382

If I save the file as lemis.xhtml and remove the two comment lines before the <!DOCTYPE> declaration, then it renders when I load it with file:// or http://, otherwise it doesn't.

My reading is that comments there are valid for HTML 4 (which is declared in the page's <!DOCTYPE>) since the standard says:

  White space (spaces, newlines, tabs, and comments) may appear before or after each section.
  https://www.w3.org/TR/html401/struct/global.html#h-7.1


Rodrigo Arias

April 2024

12:35 p.m.

Hi,

On Sat, Apr 20, 2024 at 02:00:05PM +1000, Kevin Koster wrote:
> This problem was present in 3.0.5 as well as in 3.1.0-rc1.
>
> URL: http://www.lemis.com/
> CSS: enabled or disabled
> Summary: Won't render HTML with comments before <!DOCTYPE>.
>
> Pages on this website aren't rendered, just displayed as source code. Although they are XHTML, this doesn't appear to be due to this bug:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036382
>
> If I save the file as lemis.xhtml and remove the two comment lines before the <!DOCTYPE> declaration, then it renders when I load it with file:// or http://, otherwise it doesn't.
>
> My reading is that comments there are valid for HTML 4 (which is declared in the page's <!DOCTYPE>) since the standard says:
>
>   White space (spaces, newlines, tabs, and comments) may appear before or after each section.
>   https://www.w3.org/TR/html401/struct/global.html#h-7.1

Yeah, the current detection mechanism in Dillo for content types is not very good. It searches for the doctype line at the beginning of the document[1] but it doesn't handle comments.

[1]: https://github.com/dillo-browser/dillo/blob/v3.1.0-rc1/src/misc.c#L148

We should rely on the Content-Type provided by the server, or at least improve the detection.

Best,
Rodrigo.
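To illustrate the failure mode described in this reply, here is a minimal sketch in Python. It is not Dillo's code (the real check lives in src/misc.c and is written in C); the function name `naive_is_html` and the sample comment lines are made up for illustration. It only assumes that the check skips leading whitespace and then compares the prefix, which is what the description above suggests.

```python
def naive_is_html(data: bytes) -> bool:
    """Hypothetical simplified check: skip leading whitespace, then compare the prefix."""
    head = data.lstrip(b" \t\r\n")[:14].lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


# Sample inputs (invented for this sketch, not copied from www.lemis.com).
plain = b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN">\n<html>'
commented = b"<!-- generated file -->\n<!-- do not edit -->\n" + plain

print(naive_is_html(plain))      # True
print(naive_is_html(commented))  # False: the leading comments hide the doctype
```

With a prefix-only check like this, any comment ahead of the doctype makes the document look like something other than HTML, which would match the "text/plain" fallback seen in the report.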

Rodrigo Arias

9:29 p.m.

Hi,

On Sat, Apr 20, 2024 at 02:35:10PM +0200, Rodrigo Arias wrote:
> Hi,
>
> On Sat, Apr 20, 2024 at 02:00:05PM +1000, Kevin Koster wrote:
> > This problem was present in 3.0.5 as well as in 3.1.0-rc1.
> >
> > URL: http://www.lemis.com/
> > CSS: enabled or disabled
> > Summary: Won't render HTML with comments before <!DOCTYPE>.
> >
> > Pages on this website aren't rendered, just displayed as source code. Although they are XHTML, this doesn't appear to be due to this bug:
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036382
> >
> > If I save the file as lemis.xhtml and remove the two comment lines before the <!DOCTYPE> declaration, then it renders when I load it with file:// or http://, otherwise it doesn't.
> >
> > My reading is that comments there are valid for HTML 4 (which is declared in the page's <!DOCTYPE>) since the standard says:
> >
> >   White space (spaces, newlines, tabs, and comments) may appear before or after each section.
> >   https://www.w3.org/TR/html401/struct/global.html#h-7.1
>
> Yeah, the current detection mechanism in Dillo for content types is not very good. It searches for the doctype line at the beginning of the document[1] but it doesn't handle comments.
>
> [1]: https://github.com/dillo-browser/dillo/blob/v3.1.0-rc1/src/misc.c#L148
>
> We should rely on the Content-Type provided by the server, or at least improve the detection.

So, this is a tricky case.

Dillo has several content types for a single document sorted by priority, the first one set defines the content type of the document:

1. The "override type" used to override the type (highest priority)
2. The "meta type" given by the <meta ... content="..."> tag in HTML
3. The "http type" given by the HTTP Content-Type header
4. The "guessed type" based on the document data (lowest priority)

They all start set to NULL. At first, the server sends "text/html; charset=UTF-8" which defines the http type:

  % curl -sI http://www.lemis.com/ | grep Content
  Content-Type: text/html; charset=UTF-8

The guessed type is also wrongly set to "text/plain" due to the comments at the beginning, which cause a mismatch of the "<!doctype". This is the *first bug*.

As the document continues loading, the <meta> tag is found:

  % curl -s http://www.lemis.com/ | grep Content
  <meta http-equiv="Content-Type" content="text/xhtml; charset=utf-8"/>

Which sets the meta type to "text/xhtml". So far we have this situation:

  override_type = NULL
  meta_type     = "text/xhtml; charset=utf-8"
  http_type     = "text/html; charset=UTF-8"
  guessed_type  = "text/plain"

While setting the meta type, there is also a special rule as a workaround for Doxygen pages, which checks if the content type of the meta tag begins with "text/xhtml" (which it does) and if so sets the override type to the guessed type:

https://github.com/dillo-browser/dillo/blob/a0151cbc86166731465b963ea3addb04...

So the types are left as follows:

  override_type = "text/plain"
  meta_type     = "text/xhtml; charset=utf-8"
  http_type     = "text/html; charset=UTF-8"
  guessed_type  = "text/plain"

This causes the type of the document to be handled as "text/plain".

The "text/xhtml" type should be defined as "application/xhtml+xml", as the W3 describes:

https://www.w3.org/TR/xhtml-media-types/#media-types

Which Dillo handles fine. So, I'm thinking of transforming the "text/xhtml" to "application/xhtml+xml", rather than relying on the guessed type. The types end up being:

  override_type = "application/xhtml+xml; charset=utf-8"
  meta_type     = "text/xhtml; charset=utf-8"
  http_type     = "text/html; charset=UTF-8"
  guessed_type  = "text/plain"

That solves the problem.

AFAIK, the "text/xhtml" is not standardized. It was mentioned in the XHTML 1.0 draft in February 1999:

https://www.w3.org/TR/1999/WD-html-in-xml-19990224/#h-5.1.3

In March they raised concern about it:

https://www.w3.org/TR/1999/WD-html-in-xml-19990304/

> There is one issue that is still mildly contentious within the working group, and that we are especially interested in receiving comments on: whether we should register a new Internet media type "text/xhtml".
>
> Very briefly the two opinions are: yes - that is the only way to recognise the application type without accessing the resource; no - all XML applications are going to have this problem, and the answer isn't to register every single application.

And in May it got removed:

https://www.w3.org/TR/1999/xhtml1-19990505/

And is not part of the XHTML 1.0 standard:

https://www.w3.org/TR/xhtml1/

So I don't think it should ever be used.

Regarding the type guessing bug, I think I can improve it by assuming that if we find the "<!doctype html" string in the first 1024 bytes or so, it is an HTML-like type, but it incurs more overhead.

So I think for now we can rely on the correction of "text/xhtml" to "application/xhtml+xml", which seems to work fine. I don't like adding quirks, but I will keep this one as it was already there. Here is the PR:

https://github.com/dillo-browser/dillo/pull/140

I'll check with some Doxygen pages and see that nothing breaks. Interestingly, to this day they continue to generate documents with the wrong "text/xhtml" content type (for at least 13 years, based on the git blame):

https://github.com/doxygen/doxygen/blame/78422d3905e57acebf0374feefafa6578db...

I'll open an issue on their repo too.

Best regards,
Rodrigo.
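The priority rule and the proposed "text/xhtml" correction from this message can be sketched as follows. This is a rough Python model, not Dillo's implementation (which is in C); the class `DocTypes` and its method names are hypothetical, and only the four slot names and the example values come from the message above.

```python
from typing import Optional

XHTML_QUIRK_PREFIX = "text/xhtml"  # non-standard type emitted by some generators


class DocTypes:
    """Hypothetical holder for the four type slots described in the message."""

    def __init__(self) -> None:
        # All slots start unset (NULL in the original description).
        self.override_type: Optional[str] = None
        self.meta_type: Optional[str] = None
        self.http_type: Optional[str] = None
        self.guessed_type: Optional[str] = None

    def set_meta_type(self, value: str) -> None:
        self.meta_type = value
        # Proposed quirk: rewrite "text/xhtml" to "application/xhtml+xml",
        # keeping any parameters, instead of falling back to the guessed type.
        if value.lower().startswith(XHTML_QUIRK_PREFIX):
            params = value[len(XHTML_QUIRK_PREFIX):]
            self.override_type = "application/xhtml+xml" + params

    def effective(self) -> Optional[str]:
        # The first slot that is set, in priority order, wins.
        for t in (self.override_type, self.meta_type,
                  self.http_type, self.guessed_type):
            if t is not None:
                return t
        return None


doc = DocTypes()
doc.http_type = "text/html; charset=UTF-8"
doc.guessed_type = "text/plain"
doc.set_meta_type("text/xhtml; charset=utf-8")
print(doc.effective())  # application/xhtml+xml; charset=utf-8
```

Under the old rule the override slot would have received the guessed "text/plain" instead, which is why the page was shown as source.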

Kevin Koster

11:23 p.m.

On Tue, 23 Apr 2024 23:29:45 +0200 Rodrigo Arias <rodarima@gmail.com> wrote:
> On Sat, Apr 20, 2024 at 02:35:10PM +0200, Rodrigo Arias wrote:
> > Yeah, the current detection mechanism in Dillo for content types is not very good. It searches for the doctype line at the beginning of the document[1] but it doesn't handle comments.
> >
> > [1]: https://github.com/dillo-browser/dillo/blob/v3.1.0-rc1/src/misc.c#L148
> >
> > We should rely on the Content-Type provided by the server, or at least improve the detection.
>
> So, this is a tricky case.
>
> Dillo has several content types for a single document sorted by priority, the first one set defines the content type of the document:
>
> 1. The "override type" used to override the type (highest priority)
> 2. The "meta type" given by the <meta ... content="..."> tag in HTML
> 3. The "http type" given by the HTTP Content-Type header
> 4. The "guessed type" based on the document data (lowest priority)

Thanks for the explanation, this also makes clearer an issue I had with XHTML image indexes generated by ImageMagick Montage which were getting (by an unusual sequence of events) the incorrect HTTP Content-Type type of "text/xml" (and they don't contain a meta tag). They'd load properly via file:// but show as text over http://. Now I know to ideally force the HTTP Content-Type to "application/xhtml+xml" instead of "text/html" which I used to fix the problem originally.

> Regarding the type guessing bug, I think I can improve it by assuming that if we find the "<!doctype html" string in the first 1024 bytes or so, it is an HTML-like type, but it incurs more overhead.

But if it aborts that search upon encountering the first thing that isn't "spaces, newlines, tabs, and comments", most text files will be detected within the first few bytes.

I'm not sure how that approach would work with ImageMagick image index XHTML pages which start like this though:

  <?xml version="1.0" encoding="US-ASCII"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Example: http://www.ombertech.com/cnk/dillo/STS-133_Pictures/photo_index.html

I don't really understand how XHTML is supposed to work, and I don't have time to learn, so perhaps I'm ignoring a distinction between different flavours of XHTML that can begin in different ways? Anyway I like how ImageMagick image map pages are viewable in Dillo at the moment.

> So I think for now we can rely on the correction of "text/xhtml" to "application/xhtml+xml", which seems to work fine. I don't like adding quirks, but I will keep this one as it was already there. Here is the PR:
>
> https://github.com/dillo-browser/dillo/pull/140

I've built Dillo from that branch and pages on www.lemis.com now render correctly, thanks! If I save the homepage as lemis.xhtml it still shows as plain text when loaded with file://, though it is rendered if the comments before <!DOCTYPE> are removed or if the original file is saved as lemis.html. Not much of an issue, but it could cause confusion for someone.

Kevin Koster

4:27 a.m.

On Thu, 25 Apr 2024 09:23:38 +1000 Kevin Koster <dillo@ombertech.com> wrote:
> On Tue, 23 Apr 2024 23:29:45 +0200 Rodrigo Arias <rodarima@gmail.com> wrote:
> > Regarding the type guessing bug, I think I can improve it by assuming that if we find the "<!doctype html" string in the first 1024 bytes or so, it is an HTML-like type, but it incurs more overhead.
>
> But if it aborts that search upon encountering the first thing that isn't "spaces, newlines, tabs, and comments", most text files will be detected within the first few bytes.
>
> I'm not sure how that approach would work with ImageMagick image index XHTML pages which start like this though:
>
>   <?xml version="1.0" encoding="US-ASCII"?>
>   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
>
> Example: http://www.ombertech.com/cnk/dillo/STS-133_Pictures/photo_index.html
>
> I don't really understand how XHTML is supposed to work, and I don't have time to learn, so perhaps I'm ignoring a distinction between different flavours of XHTML that can begin in different ways? Anyway I like how ImageMagick image map pages are viewable in Dillo at the moment.

It seems that the text/html type can be valid for XHTML documents, and the relevant RFC 2854 has a section on recognising HTML and XHTML files:

  5. Recognizing HTML files

  Almost all HTML files have the string "<html" or "<HTML" near the beginning of the file.

  Documents conformant to HTML 2.0, HTML 3.2 and HTML 4.0 will start with a DOCTYPE declaration "<!DOCTYPE HTML" near the beginning, before the "<html". These dialects are case insensitive. Files may start with white space, comments (introduced by "<!--" ), or processing instructions (introduced by "<?") prior to the DOCTYPE declaration.

  XHTML documents (optionally) start with an XML declaration which begins with "<?xml" and are required to have a DOCTYPE declaration "<!DOCTYPE html".

  https://www.ietf.org/rfc/rfc2854.txt

Possibly old news for others, but it clears up some of my own XHTML-ignorant confusions.

For Dillo it doesn't look like it would harm performance much to add detection of comments and "<? >" on top of the existing detection of whitespace before looking for tags that indicate HTML-compatible content in misc.c. For non-(X)HTML data it will usually only mean checking for '<' as well as whitespace before it finds a byte that shouldn't be in an HTML-compatible document before the first tag. If it does find a '<' first then it will be a little more complicated to check and skip following characters. But only XML documents would normally have much of that and yet still end up displayed as plain text like they are already, so it seems like that would be a rare (and anyway minimal) performance issue.
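The detection idea discussed in this message could look roughly like the sketch below. It is written in Python for brevity and is not the code in misc.c; the function name `sniff_html` and the `limit` parameter are assumptions, with the 1024-byte window borrowed from the figure mentioned earlier in the thread. It skips whitespace, comments and "<? ... ?>" blocks, then checks the first remaining tag.

```python
def sniff_html(data: bytes, limit: int = 1024) -> bool:
    """Return True if the start of `data` looks like HTML or XHTML.

    Hypothetical sketch: skip whitespace, comments ("<!-- ... -->") and
    processing instructions ("<? ... ?>"), then test the first real tag.
    """
    buf = data[:limit]
    i = 0
    while i < len(buf):
        if buf[i:i + 1] in (b" ", b"\t", b"\r", b"\n"):
            i += 1
        elif buf.startswith(b"<!--", i):
            end = buf.find(b"-->", i + 4)
            if end < 0:
                return False  # comment not closed within the window
            i = end + 3
        elif buf.startswith(b"<?", i):
            end = buf.find(b"?>", i + 2)
            if end < 0:
                return False  # processing instruction not closed
            i = end + 2
        else:
            break  # first byte that is none of the above: stop skipping
    head = buf[i:i + 14].lower()
    return head.startswith((b"<!doctype html", b"<html"))


xml_decl = (b'<?xml version="1.0" encoding="US-ASCII"?>\n'
            b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN">')
print(sniff_html(xml_decl))                                   # True
print(sniff_html(b"<!-- a -->\n<!-- b -->\n<!DOCTYPE HTML>"))  # True
print(sniff_html(b"Plain text file"))                         # False
```

For ordinary text the loop stops at the first byte, so the extra cost only shows up for inputs that really do begin with comments or XML declarations, which is the point made above.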

Rodrigo Arias

7:36 p.m.

Hi,

On Thu, Apr 25, 2024 at 09:23:38AM +1000, Kevin Koster wrote:
> Thanks for the explanation, this also makes clearer an issue I had with XHTML image indexes generated by ImageMagick Montage which were getting (by an unusual sequence of events) the incorrect HTTP Content-Type type of "text/xml" (and they don't contain a meta tag). They'd load properly via file:// but show as text over http://. Now I know to ideally force the HTTP Content-Type to "application/xhtml+xml" instead of "text/html" which I used to fix the problem originally.

For Dillo, "application/xhtml+xml" and "text/html" are handled by the same HTML parser, which later identifies which version of HTML/XHTML the document is, based on the doctype. The problem is failing to set the content type to either of those two, like when using "text/xml".

AFAIK, the proper content type for XHTML is "application/xhtml+xml", which should be set in the HTTP Content-Type header.

> > Regarding the type guessing bug, I think I can improve it by assuming that if we find the "<!doctype html" string in the first 1024 bytes or so, it is an HTML-like type, but it incurs more overhead.
>
> But if it aborts that search upon encountering the first thing that isn't "spaces, newlines, tabs, and comments", most text files will be detected within the first few bytes.
>
> I'm not sure how that approach would work with ImageMagick image index XHTML pages which start like this though:
>
>   <?xml version="1.0" encoding="US-ASCII"?>
>   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
>
> Example: http://www.ombertech.com/cnk/dillo/STS-133_Pictures/photo_index.html
>
> I don't really understand how XHTML is supposed to work, and I don't have time to learn, so perhaps I'm ignoring a distinction between different flavours of XHTML that can begin in different ways? Anyway I like how ImageMagick image map pages are viewable in Dillo at the moment.

We can improve the content detection to handle both HTML and XML-style comments, but I prefer to defer it until after the 3.1.0 release. Websites shouldn't rely on the browser to guess the content type; it should be stated in the HTTP header or the meta tag. So I don't consider this a priority that should block the release for longer.

If you want to work on it, feel free to do so :-)

> > So I think for now we can rely on the correction of "text/xhtml" to "application/xhtml+xml", which seems to work fine. I don't like adding quirks, but I will keep this one as it was already there. Here is the PR:
> >
> > https://github.com/dillo-browser/dillo/pull/140
>
> I've built Dillo from that branch and pages on www.lemis.com now render correctly, thanks! If I save the homepage as lemis.xhtml it still shows as plain text when loaded with file://, though it is rendered if the comments before <!DOCTYPE> are removed or if the original file is saved as lemis.html. Not much of an issue, but it could cause confusion for someone.

I pushed another patch that should fix this issue. It is caused primarily by the ".xhtml" extension not being recognized by the file plugin, which then tries to detect the doctype and fails in the same way, falling back to text/plain.

Best,
Rodrigo.
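As a rough model of the behaviour described here, the sketch below looks up a type by file extension first and only falls back to content detection when the extension is unknown. It is Python, not the file plugin's actual code; the table `EXT_TYPES`, the function `type_for_file` and the `sniff` callback are all hypothetical, and the thread does not show what the real patch contains.

```python
import os
from typing import Callable

# Hypothetical extension table; the real file plugin's table may differ.
EXT_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".xhtml": "application/xhtml+xml",
    ".txt": "text/plain",
}


def type_for_file(path: str, data: bytes,
                  sniff: Callable[[bytes], str]) -> str:
    """Pick a content type for a local file.

    A known extension decides the type; otherwise the `sniff`
    callback is asked to guess from the file contents.
    """
    ext = os.path.splitext(path)[1].lower()
    if ext in EXT_TYPES:
        return EXT_TYPES[ext]
    return sniff(data)


def always_plain(data: bytes) -> str:
    # Stand-in for a detector that fails on leading comments.
    return "text/plain"


page = b"<!-- comment -->\n<!DOCTYPE html>"
print(type_for_file("lemis.xhtml", page, always_plain))  # application/xhtml+xml
print(type_for_file("lemis.unknown", page, always_plain))  # text/plain
```

With ".xhtml" in the table the weak detector is never consulted for that file, which is consistent with the saved lemis.xhtml rendering once the extension is recognized.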

Kevin Koster

12:39 p.m.

On Thu, 25 Apr 2024 21:36:04 +0200 Rodrigo Arias <rodarima@gmail.com> wrote:
> We can improve the content detection to handle both HTML and XML-style comments, but I prefer to defer it until after the 3.1.0 release. Websites shouldn't rely on the browser to guess the content type; it should be stated in the HTTP header or the meta tag. So I don't consider this a priority that should block the release for longer.

Yes, that's reasonable.

> If you want to work on it, feel free to do so :-)

I might have a go at it then.

> > I've built Dillo from that branch and pages on www.lemis.com now render correctly, thanks! If I save the homepage as lemis.xhtml it still shows as plain text when loaded with file://, though it is rendered if the comments before <!DOCTYPE> are removed or if the original file is saved as lemis.html. Not much of an issue, but it could cause confusion for someone.
>
> I pushed another patch that should fix this issue. It is caused primarily by the ".xhtml" extension not being recognized by the file plugin, which then tries to detect the doctype and fails in the same way, falling back to text/plain.

All working well now, thanks again.

Rodrigo Arias

8:56 p.m.

Hi,

> > If you want to work on it, feel free to do so :-)
>
> I might have a go at it then.

Nice :-)

> > I pushed another patch that should fix this issue. It is caused primarily by the ".xhtml" extension not being recognized by the file plugin, which then tries to detect the doctype and fails in the same way, falling back to text/plain.
>
> All working well now, thanks again.

Merged in:

https://github.com/dillo-browser/dillo/commit/d9c506df8528c2db84d724048f9bc5...

Best,
Rodrigo.

