Rodrigo Arias <rodarima-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
On Wed, Jun 19, 2024 at 10:03:20AM +1000, Kevin Koster wrote:
This mechanism might suit an idea I've had to do remote downscaling of extremely large images, which are increasingly being included in web pages. The script would send a list of the URLs in all <img> tags within the HTML (or ideally just the ones for large image files) to a remote server, e.g. on a VPS, then rewrite the URLs in the HTML to point to the remote server, where the converted images are available over HTTP/S.
You can create a script that rewrites the <img> src attribute
<img src="https://foo.com/img1.png">
To point to an endpoint of your server:
<img src="https://yourserver.com/downscale?url=https://foo.com/img1.png">
Yes, that would work too. However, the idea of sending a URL list was that it allows my server to re-use the HTTPS connection to the image host server while downloading lots of images for a webpage, whereas with your approach each image request by my server would be a new HTTPS connection, and therefore potentially slower. It's a minor difference though.
And then on the server you simply downscale it. Here is how you could do it with rules:
# Script that would rewrite images to a server for downscaling
action downscale filter 'rewrite-img.sh'

define mime header 'Content-Type'

match mime 'text/html' action downscale
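A rough sketch of what rewrite-img.sh could look like, assuming the filter receives the HTML on stdin and writes the rewritten HTML to stdout (it only handles double-quoted absolute URLs and does not URL-encode the query value):

#!/bin/sh
# Point every <img> src at the downscaling endpoint instead of the
# original image URL.
sed -E 's|<img([^>]*)src="(https?://[^"]+)"|<img\1src="https://yourserver.com/downscale?url=\2"|g'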
That should work. I would personally find it clearer to read if there was a character indicating assignment of values, since this would make it obvious which words are commands and which are arguments. Such as:

# Script that would rewrite images to a server for downscaling
action downscale filter='rewrite-img.sh'

define mime header='Content-Type'

match mime='text/html' action=downscale

Using "action" as an argument to "match" as well as the name of a command might still be a little confusing.
Or a deeper approach would be to apply the same approach as this rewrite engine to binary content as well, and have Dillo do it transparently via 'rewrite'/convert rules for image MIME types. Then the HTML would stay the same and Dillo would trigger a command that requested a downscaled image from the converter server instead of the original image's server. That would be more elegant, but expands the scope of your proposed system a little.
Rewriting the binary image directly would be possible, but then you would have already wasted the bandwidth bringing it to Dillo, and you would still have to send it to the server to downscale it.
No, the idea is to reduce the bandwidth usage on Dillo's connection, so for this approach Dillo would have to abort the connection to the image server if the image size was over the limit and fetch it from the script instead, which might do:

wget -q -O - "https://yourserver.com/downscale?url=https://foo.com/img1.png"

A way for Dillo to take the replacement URL from the script would be better, but I suspect that would make this system more complicated to implement, because then state matters between different connections. Granted, I've lost the ability to re-use the HTTPS connection at my server with this approach, but I like that fetching small images wouldn't be slowed down by having them needlessly downloaded by my server first, which would generally be a bigger advantage overall. Most images are still small enough that I wouldn't downscale them; just the 1MB+ ones would get that treatment. The surprise 10MB+ ones are the real evil, and just the option to block these outright (abort the connection and don't run any script) would be better than nothing.

To complicate things further, it would be good to have a right-click menu option to bypass this rule and allow fetching the full-size image. Alternatively, an external handler could be assigned (via the earlier-discussed mechanism) to a script that downloads the image URL and opens it in an external image viewer, which might actually be better to use for that.
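Such a handler script could be roughly as simple as this sketch (assuming it receives the image URL as its first argument; the temporary file location and the viewer are just examples):

#!/bin/sh
# Download the full-size image to a temporary file and open it in an
# external image viewer.
url="$1"
tmp=$(mktemp /tmp/fullsize-XXXXXX) || exit 1
wget -q -O "$tmp" "$url" && xdg-open "$tmp"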
In any case, imagine you want to downscale it locally anyway. Here is how I would think about it:
# Script that would downscale an image and write to stdout
action downscale filter 'downscale-img.sh'

# Define headers from the HTTP content with shorter names
define mime header 'Content-Type'
define size header 'Content-Length'

# Downscale big images
match mime =~ 'image/.*' and size > 10K action downscale
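For example, downscale-img.sh could be little more than this sketch, assuming the filter receives the image data on stdin and has to write the converted image to stdout (it relies on ImageMagick, and the 1024-pixel limit is an arbitrary choice):

#!/bin/sh
# Shrink the image from stdin to fit within 1024x1024 and write it to
# stdout; with no explicit output format, convert keeps the input format.
convert - -resize '1024x1024>' -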
Notice that this can be triggered for any image, not only ones served over HTTP/HTTPS, but also ones fetched via other protocols like gemini, as long as they are adapted to speak HTTP and also provide a Content-Length header.
I added the =~ and > operators: the former matches a regex and the latter performs a numeric comparison. You can assume that if the header is not present, the default is to make any comparison fail.
I have also added the "define" keyword to define properties like "mime" or "size", which are parsed from the HTTP headers and are shorter and easier to write.
That looks good. Maybe I'd find it clearer without the whitespace around the operators, as in my '=' example above.
Maybe, since it still requires a remote Web server, this problem would be better solved via a Web proxy (I did look into Squid before, but drowned in confusing documentation). But I just thought I'd mention it as an example of a more complex usage for this proposed rewrite system.
But then you would need to pass all the traffic through the server so it can perform the substitution there.
Another solution which may be better is to mark from Dillo which requests are being done from img elements (filtering them before going to the network).
If Dillo marks those requests in the HTTP headers for example, then you could do:
# Script that transforms image HTTP requests to a server that
# downscales the image
action downscale filter 'downscale-req.sh'
define source header 'Dillo-Request-Source'
# Downscale images coming from <img> elements
match source 'img' action downscale
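As a rough sketch, downscale-req.sh could look like this, assuming the filter reads the original request URL on stdin and writes the URL that Dillo should fetch instead to stdout (no URL encoding of the query value is done here):

#!/bin/sh
# Redirect image requests through the downscaling endpoint.
while read -r url; do
    printf 'https://yourserver.com/downscale?url=%s\n' "$url"
done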
Yes, that's neat.
This would have the benefit that Dillo already performs the parsing of the HTML for you, and only the images that are loaded are passed to the downscaling server. Additionally, cookies would be sent in the HTTP request, so you can access login protected images this way too.
Good point, although for my usage login protected images wouldn't be much of a concern.