[Dillo-dev] Re: Dillo not rendering HTML with comments before <!DOCTYPE>

April 25, 2024

On Thu, 25 Apr 2024 09:23:38 +1000
Kevin Koster <dillo@ombertech.com> wrote:
...
On Tue, 23 Apr 2024 23:29:45 +0200
Rodrigo Arias <rodarima@gmail.com> wrote:
...
Regarding the type guessing bug, I think I can improve it by
assuming that if we find the "<!doctype html" string in the first
1024 bytes or so, it is an HTML-like type, but it incurs more
overhead.
But if it aborts that search upon encountering the first thing that
isn't "spaces, newlines, tabs, and comments", most text files will be
detected within the first few bytes.
I'm not sure how that approach would work with ImageMagick image index
XHTML pages which start like this though:
  <?xml version="1.0" encoding="US-ASCII"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
Example:
http://www.ombertech.com/cnk/dillo/STS-133_Pictures/photo_index.html
I don't really understand how XHTML is supposed to work, and I don't
have time to learn, so perhaps I'm ignoring a distinction between
different flavours of XHTML that can begin in different ways? Anyway I
like how ImageMagick image map pages are viewable in Dillo at the
moment.
It seems that the text/html type can be valid for XHTML documents, and
the relevant RFC 2854 has a section on recognising HTML and XHTML files:

5. Recognizing HTML files

   Almost all HTML files have the string "<html" or "<HTML" near the
   beginning of the file.

   Documents conformant to HTML 2.0, HTML 3.2 and HTML 4.0 will start
   with a DOCTYPE declaration "<!DOCTYPE HTML" near the beginning,
   before the "<html". These dialects are case insensitive.  Files may
   start with white space, comments (introduced by "<!--" ), or
   processing instructions (introduced by "<?") prior to the DOCTYPE
   declaration.

   XHTML documents (optionally) start with an XML declaration which
   begins with "<?xml" and are required to have a DOCTYPE declaration
   "<!DOCTYPE html".
https://www.ietf.org/rfc/rfc2854.txt

Possibly old news for others, but it clears up some of my own
XHTML-ignorant confusions. For Dillo it doesn't look like it would harm
performance much to add detection of comments and "<? >" on top of the
existing detection of whitespace before looking for tags that indicate
HTML-compatible content in misc.c. For non-(X)HTML data it will usually
only mean checking for '<' as well as whitespace before it finds a byte
that shouldn't be in an HTML-compatible document before the first tag.
If it does find a '<' first then it will be a little more complicated
to check and skip following characters. But only XML documents would
normally have much of that and yet still end up displayed as plain
text like they are already, so it seems like that would be a rare (and
anyway minimal) performance issue.
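
To make the idea above concrete, here is a rough sketch in C of the kind of
check I have in mind. It is not the code in misc.c: the names sniff_is_html,
has_prefix_ci, skip_past and SNIFF_LIMIT are all made up for illustration,
the 1024 byte limit is just the figure mentioned earlier in the thread, and
it only looks for "<!doctype html" and "<html" rather than whatever set of
tags Dillo really tests. It skips whitespace, comments and "<? ... ?>"
blocks, and gives up at the first byte that is anything else.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define SNIFF_LIMIT 1024   /* assumed cap on how many bytes are examined */

/* Case-insensitive test: does buf[i..len) start with tag? Requires i <= len. */
static int has_prefix_ci(const char *buf, size_t len, size_t i, const char *tag)
{
   size_t k, n = strlen(tag);

   if (len - i < n)
      return 0;
   for (k = 0; k < n; k++)
      if (tolower((unsigned char)buf[i + k]) != tolower((unsigned char)tag[k]))
         return 0;
   return 1;
}

/* Return the index just past the next occurrence of end at or after i,
 * or len if the terminator is not found within the buffer. */
static size_t skip_past(const char *buf, size_t len, size_t i, const char *end)
{
   size_t n = strlen(end);

   for (; i + n <= len; i++)
      if (memcmp(buf + i, end, n) == 0)
         return i + n;
   return len;
}

/* Hypothetical sniffer: 1 if the data looks HTML-compatible, 0 otherwise. */
int sniff_is_html(const char *buf, size_t len)
{
   size_t i = 0;

   assert(buf != NULL || len == 0);
   if (len > SNIFF_LIMIT)
      len = SNIFF_LIMIT;

   while (i < len) {
      unsigned char c = (unsigned char)buf[i];

      if (isspace(c)) {
         i++;                                  /* leading whitespace */
      } else if (c != '<') {
         return 0;                             /* plain text: abort early */
      } else if (has_prefix_ci(buf, len, i, "<!--")) {
         i = skip_past(buf, len, i + 4, "-->"); /* comment */
      } else if (has_prefix_ci(buf, len, i, "<?")) {
         i = skip_past(buf, len, i + 2, "?>");  /* XML decl or other PI */
      } else {
         /* First real tag decides the answer. */
         return has_prefix_ci(buf, len, i, "<!doctype html") ||
                has_prefix_ci(buf, len, i, "<html");
      }
   }
   return 0;   /* ran out of data without finding a tag */
}
```

With this sketch, a buffer starting with a comment followed by
"<!DOCTYPE html>" and the ImageMagick style "<?xml ...?>" plus XHTML DOCTYPE
header would both be treated as HTML, while ordinary text files are rejected
at their first non-whitespace byte and other XML documents (first tag is not
a DOCTYPE html or <html) still fall through to plain text.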