On Thu, 25 Apr 2024 09:23:38 +1000 Kevin Koster <dillo@ombertech.com> wrote:
On Tue, 23 Apr 2024 23:29:45 +0200 Rodrigo Arias <rodarima@gmail.com> wrote:
Regarding the type guessing bug, I think I can improve it by assuming that if we find the "<!doctype html" string in the first 1024 bytes or so, it is an HTML-like type, but it incurs more overhead.
But if it aborts that search upon encountering the first thing that isn't "spaces, newlines, tabs, and comments", most text files will be detected within the first few bytes.
I'm not sure how that approach would work with ImageMagick image index XHTML pages which start like this though:

   <?xml version="1.0" encoding="US-ASCII"?>
   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Example: http://www.ombertech.com/cnk/dillo/STS-133_Pictures/photo_index.html
I don't really understand how XHTML is supposed to work, and I don't have time to learn, so perhaps I'm ignoring a distinction between different flavours of XHTML that can begin in different ways? Anyway I like how ImageMagick image map pages are viewable in Dillo at the moment.
It seems that the text/html type can be valid for XHTML documents, and the relevant RFC 2854 has a section on recognising HTML and XHTML files:

   5. Recognizing HTML files

   Almost all HTML files have the string "<html" or "<HTML" near the
   beginning of the file.

   Documents conformant to HTML 2.0, HTML 3.2 and HTML 4.0 will start
   with a DOCTYPE declaration "<!DOCTYPE HTML" near the beginning,
   before the "<html". These dialects are case insensitive. Files may
   start with white space, comments (introduced by "<!--" ), or
   processing instructions (introduced by "<?") prior to the DOCTYPE
   declaration.

   XHTML documents (optionally) start with an XML declaration which
   begins with "<?xml" and are required to have a DOCTYPE declaration
   "<!DOCTYPE html".

https://www.ietf.org/rfc/rfc2854.txt

Possibly old news for others, but it clears up some of my own XHTML-ignorant confusions.

For Dillo it doesn't look like it would harm performance much to add detection of comments and "<? >" on top of the existing detection of whitespace before looking for tags that indicate HTML-compatible content in misc.c. For non-(X)HTML data it will usually only mean checking for '<' as well as whitespace, until it hits a byte that shouldn't appear before the first tag of an HTML-compatible document. If it does find a '<' first then it will be a little more complicated to check and skip the following characters. But only XML documents would normally have much of that, and those would still end up displayed as plain text as they are already, so it seems like that would be a rare (and anyway minimal) performance issue.
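
To make the idea concrete, here is a rough sketch of the skipping logic in C. This is not the existing misc.c code: looks_like_html() and its two helpers are names made up for illustration, and it only checks for "<!doctype html" and "<html" rather than the full set of tags Dillo looks for. It skips whitespace, "<!-- -->" comments and "<? ?>" processing instructions, and gives up at the first byte that is none of those.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: case-insensitive test that buf[0..len) starts
 * with `prefix`, which must be given in lowercase. */
static int starts_with_ci(const char *buf, size_t len, const char *prefix)
{
   size_t n = strlen(prefix);
   if (len < n)
      return 0;
   for (size_t i = 0; i < n; i++)
      if (tolower((unsigned char)buf[i]) != prefix[i])
         return 0;
   return 1;
}

/* Hypothetical helper: offset just past the first `end` found in
 * buf[from..len), or len if the terminator is not in the buffer. */
static size_t skip_past(const char *buf, size_t len, size_t from,
                        const char *end)
{
   size_t n = strlen(end);
   for (size_t i = from; i + n <= len; i++)
      if (memcmp(buf + i, end, n) == 0)
         return i + n;
   return len;
}

/* Returns 1 if the buffer looks like (X)HTML, 0 otherwise. */
int looks_like_html(const char *buf, size_t len)
{
   size_t i = 0;
   while (i < len) {
      unsigned char c = (unsigned char)buf[i];
      if (isspace(c)) {
         i++;                       /* leading whitespace */
      } else if (c != '<') {
         return 0;                  /* usual plain-text case: stop early */
      } else if (starts_with_ci(buf + i, len - i, "<!--")) {
         i = skip_past(buf, len, i + 4, "-->");
      } else if (starts_with_ci(buf + i, len - i, "<!doctype html") ||
                 starts_with_ci(buf + i, len - i, "<html")) {
         return 1;
      } else if (starts_with_ci(buf + i, len - i, "<?")) {
         i = skip_past(buf, len, i + 2, "?>");
      } else {
         return 0;                  /* some other tag, e.g. an XML root */
      }
   }
   return 0;                        /* ran out of data without a match */
}
```

With this, ordinary text is rejected on its first non-whitespace byte, the ImageMagick-style "<?xml ... ?>" preamble followed by "<!DOCTYPE html" is accepted, and a generic XML document falls through to plain text once its root element turns out not to be an HTML marker.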