Hi, On Thu, Apr 25, 2024 at 09:23:38AM +1000, Kevin Koster wrote:
Thanks for the explanation, this also makes clearer an issue I had with XHTML image indexes generated by ImageMagick Montage which were getting (by an unusual sequence of events) the incorrect HTTP Content-Type of "text/xml" (and they don't contain a meta tag). They'd load properly via file:// but show as text over http://. Now I know to ideally force the HTTP Content-Type to "application/xhtml+xml" instead of "text/html" which I used to fix the problem originally.
For Dillo, "application/xhtml+xml" and "text/html" are handled by the same HTML parser, which later identifies the HTML/XHTML version of the document based on the doctype. The problem is failing to set the content type to either of those two, as happens with "text/xml". AFAIK, the proper content type for XHTML is "application/xhtml+xml", which should be set in the HTTP Content-Type header.
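As a rough illustration of the point above (this is not Dillo's actual code), the check for "is this a type the HTML parser should get?" could look like the sketch below. The names equals_ci and is_html_like_type are hypothetical, and the sketch assumes the Content-Type value may carry parameters such as "; charset=utf-8" and may differ in letter case.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare the n bytes at s with a lowercase
 * literal, ignoring ASCII case. Returns 1 on a full match. */
static int equals_ci(const char *s, size_t n, const char *lit)
{
    size_t i;

    if (strlen(lit) != n)
        return 0;
    for (i = 0; i < n; i++) {
        if (tolower((unsigned char)s[i]) != lit[i])
            return 0;
    }
    return 1;
}

/* Hypothetical: returns 1 if the Content-Type value names one of the
 * two types that go to the HTML parser, 0 otherwise (e.g. "text/xml"). */
int is_html_like_type(const char *content_type)
{
    const char *start = content_type;
    size_t len = 0;

    /* Skip leading blanks. */
    while (*start == ' ' || *start == '\t')
        start++;

    /* The media type ends at ';' (parameters), a blank, or end of string. */
    while (start[len] != '\0' && start[len] != ';' &&
           start[len] != ' ' && start[len] != '\t')
        len++;

    return equals_ci(start, len, "text/html") ||
           equals_ci(start, len, "application/xhtml+xml");
}
```

With this sketch, "text/html" and "application/xhtml+xml; charset=utf-8" are accepted, while "text/xml" is not, which matches the behaviour described for the image index pages.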
Regarding the type guessing bug, I think I can improve it by assuming that if we find the "<!doctype html" string in the first 1024 bytes or so, it is an HTML-like type, but it incurs more overhead.
But if it aborts that search upon encountering the first thing that isn't "spaces, newlines, tabs, and comments", most text files will be detected within the first few bytes.
I'm not sure how that approach would work with ImageMagick image index XHTML pages which start like this though: <?xml version="1.0" encoding="US-ASCII"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
Example: http://www.ombertech.com/cnk/dillo/STS-133_Pictures/photo_index.html
I don't really understand how XHTML is supposed to work, and I don't have time to learn, so perhaps I'm ignoring a distinction between different flavours of XHTML that can begin in different ways? Anyway I like how ImageMagick image map pages are viewable in Dillo at the moment.
We can improve the content detection to handle both HTML and XML-style comments, but I prefer to defer it until after the 3.1.0 release. Websites shouldn't rely on the browser to guess the content type; it should be stated in the HTTP header or the meta tag. So I don't consider this a priority that should block the release for longer. If you want to work on it, feel free to do so :-)
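For anyone who wants to pick this up, here is a minimal sketch of the detection idea discussed in this thread. It is an assumption-laden illustration, not the code in Dillo: it skips whitespace, "<!-- ... -->" comments and "<? ... ?>" declarations (so the ImageMagick pages that start with an XML declaration are covered), then checks for "<!doctype html" case-insensitively, and gives up at the first thing that is none of those. The names SNIFF_LIMIT, match_ci, skip_past and looks_like_html are hypothetical; the 1024 byte limit is the figure mentioned earlier in the thread.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define SNIFF_LIMIT 1024 /* assumed cap, "first 1024 bytes or so" */

/* Returns 1 if buf[pos..] starts with the lowercase literal lit,
 * ignoring ASCII case, without reading past len. */
static int match_ci(const char *buf, size_t len, size_t pos, const char *lit)
{
    size_t i, n = strlen(lit);

    if (pos > len || len - pos < n)
        return 0;
    for (i = 0; i < n; i++) {
        if (tolower((unsigned char)buf[pos + i]) != lit[i])
            return 0;
    }
    return 1;
}

/* Returns the index just past the first occurrence of end at or after
 * pos, or len if the construct is not terminated within the buffer. */
static size_t skip_past(const char *buf, size_t len, size_t pos,
                        const char *end)
{
    size_t n = strlen(end);

    while (pos + n <= len) {
        if (memcmp(buf + pos, end, n) == 0)
            return pos + n;
        pos++;
    }
    return len;
}

/* Hypothetical: returns 1 if the start of buf looks like HTML/XHTML. */
int looks_like_html(const char *buf, size_t len)
{
    size_t pos = 0;

    if (len > SNIFF_LIMIT)
        len = SNIFF_LIMIT;

    for (;;) {
        while (pos < len && isspace((unsigned char)buf[pos]))
            pos++;
        if (match_ci(buf, len, pos, "<!--"))
            pos = skip_past(buf, len, pos + 4, "-->");
        else if (match_ci(buf, len, pos, "<?"))
            pos = skip_past(buf, len, pos + 2, "?>");
        else
            break; /* first thing that is not skippable: stop here */
    }
    return match_ci(buf, len, pos, "<!doctype html");
}
```

Because the loop stops at the first byte that is not whitespace, a comment or a declaration, an ordinary text file is rejected after only a few bytes, so the extra overhead should stay small in the common case.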
So I think for now we can rely on the correction of "text/xhtml" to "application/xhtml+xml", which seems to work fine. I don't like adding quirks, but I will keep this one as it was already there. Here is the PR:
I've built Dillo from that branch and pages on www.lemis.com now render correctly, thanks! If I save the homepage as lemis.xhtml it still shows as plain text when loaded with file://, though it is rendered if the comments before <!DOCTYPE> are removed or if the original file is saved as lemis.html. Not much of an issue, but it could cause confusion for someone.
I pushed another patch that should fix this issue. It is caused primarily by the ".xhtml" extension not being recognized by the file plugin, which then tries to detect the doctype and fails in the same way, falling back to text/plain. Best, Rodrigo.
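To illustrate the kind of fix described above, here is a small sketch of an extension lookup that knows about ".xhtml". It is not the actual patch: the table contents, the struct ext_type layout and the names ext_equals_ci and type_from_filename are all assumptions made for the example. A NULL result stands for "extension not recognized", where a caller would fall back to content detection.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical extension table; a real one would have more entries. */
struct ext_type {
    const char *ext;
    const char *type;
};

static const struct ext_type ext_table[] = {
    { ".html",  "text/html" },
    { ".htm",   "text/html" },
    { ".xhtml", "application/xhtml+xml" },
};

/* Compare two strings for equality, ignoring ASCII case. */
static int ext_equals_ci(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Hypothetical: returns the content type for a filename based on its
 * last extension, or NULL if the extension is missing or unknown. */
const char *type_from_filename(const char *filename)
{
    const char *dot = strrchr(filename, '.');
    size_t i;

    if (dot == NULL)
        return NULL;
    for (i = 0; i < sizeof ext_table / sizeof ext_table[0]; i++) {
        if (ext_equals_ci(dot, ext_table[i].ext))
            return ext_table[i].type;
    }
    return NULL;
}
```

With a table like this, "lemis.xhtml" maps to "application/xhtml+xml" directly, so the comments before the doctype no longer matter for files loaded via file://.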