Hi Alex,

On Tue, Jun 18, 2024 at 03:28:20PM +0200, a1ex@dismail.de wrote:
> > This also allows patching the HTML of sites so you can fix them to work better (or at all) in Dillo. This is also done by Firefox from the webcompat[2] project in what they call "interventions", as sometimes page authors don't fix them or take a long time, so they patch it from the browser directly. You can open about:compat to see the long list of patches, here[3] is one for YouTube.
> This would be very impressive, a real step forward for Dillo in my opinion. Do you think this is something that would be relatively straight-forward to implement, or is it more of a long-term goal with lots of work required to get there? Either way, sounds like there are exciting times ahead for Dillo!
Adding a mechanism to rewrite the HTML is surprisingly not super complicated, as the internal design of Dillo is centered around the CCC, the "Concomitant Control Chain", which is basically a chain of bi-directional pipes connected together to pass data around.

Here is how Dillo currently receives data from a TLS server (AFAIK). I'm only drawing the incoming direction, but the outgoing link is similar:

 Net  +--------+    +-------+    +------+    +-------+
 ---->| TLS IO |--->|  IO   |--->| HTTP |--->| CACHE |-...
      +--------+    +-------+    +------+    +-------+
      src/tls.c     src/IO.c    src/http.c  src/capi.c

And adding a new rewrite module (named SED in the diagram) would require rerouting the chain to add a new element (not hard):

 Net  +--------+    +-------+    +------+    +=====+    +-------+
 ---->| TLS IO |--->|  IO   |--->| HTTP |---># SED #--->| CACHE |-...
      +--------+    +-------+    +------+    +=====+    +-------+
      src/tls.c     src/IO.c    src/http.c      |      src/capi.c
                                                |
                                           +---------+
                                           | rulesrc |
                                           |   ...   |
                                           +---------+

The module can then forward the content parsed by the HTTP module to the appropriate scripts defined in the rules, then read their output and forward it to the next step in the chain. When no rule applies, it can just forward the content to the cache as-is.

Now, the interesting part is that we can place another SED module between the IO and the HTTP nodes, so we can rewrite the HTML content *and* the HTTP headers too. This would allow, for example, writing a plugin that matches a given mime type and rewrites it on the fly into an HTML file, changing the Content-Type header.

This is already done by the plugins, but they mix the two things together. For example, we can display a .gmi file served via the "gemini:" protocol, but we cannot display a local .gmi file. Same for manual pages with the "man:" protocol, which cannot open manual pages served via "file:" or "http(s):".

The solution with the new design would involve:

 1) Open a "gemini:" link.
 2) The request is routed to the gemini: dpi handler (like now).
 3) The gemini plugin returns the .gmi file as-is as an HTTP response, instead of converting it to HTML.
 4) The .gmi mime type matches a rewrite rule and is rewritten into HTML in the SED node.

Now, if we open a .gmi via HTTP:

 1) Open an "https:" link.
 2) The request is routed to the usual HTTP/IO/TLS chain.
 3) The HTTP server returns the .gmi file as-is as an HTTP response.
 4) The .gmi mime type matches a rewrite rule and is rewritten into HTML in the SED node.

Notice that the HTTP content can be compressed. So, for example, this simple rewrite script:

  #!/bin/sh
  sed 's_www.youtube.com_inv.vern.cc_g'

would only work well in the SED node *after* the HTTP content is uncompressed and the headers removed. The rewrite rules should indicate in which position of the chain they apply.

As a side note, keep in mind that all of these pieces work in stream mode. Each node reads a bit of data, processes it and sends it to the next node of the chain, without the need to store the whole thing in memory. The same goes for the sed command I wrote as an example.

Best,
Rodrigo.
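P.S. Just to make the rules part a bit more concrete, here is a rough sketch of what a rule file could look like. The format, the stage names and the gmi2html filter are made up for illustration only, nothing here is decided:

  # stage   match-by   pattern        command
  body      mime       text/gemini    gmi2html
  body      mime       text/html      sed 's_www.youtube.com_inv.vern.cc_g'
  header    mime       text/gemini    sed 's_^Content-Type:.*_Content-Type: text/html_'

The "stage" column would say where in the chain the rule sits (a header rule runs in the SED node between IO and HTTP, a body rule after HTTP, once the content is uncompressed and the headers removed), and the command reads the matched content on stdin and writes the rewritten version on stdout, so it keeps working in stream mode like the rest of the chain.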