[Dillo-dev] Re: Issues with HTTP multipart/form-data file upload

Aug. 30, 2024

      Hi Rodrigo,
...
Dillo has a mechanism to read chunks of data from different sources as they are arriving and pass them to the next stage for processing. However, AFAIK it always reads a chunk and appends it to a large buffer. It doesn't free the processed part until it is done with the whole thing.
This would require a change in the way Dillo processes data, but I think it would be required for large files. There are more details in the
devdoc/CCCwork.txt file and in src/chain.c if you want to take a closer look.
As I'm planning to change the design of the CCC, I think I can take this into account too so it would be doable. I'll add it to the list of shortcomings of the current design.
Thank you. I am still unfamiliar with that part of Dillo, so please let
me know about any progress.
...
Okay, I'll focus on the boundary patch first, which is the easiest to merge and then I'll take a closer look at the others.
Yeah, I would assume a lot of implementations are broken, so we want to try to minimize the chances we run into problems.
Limiting ourselves to a-z, A-Z and 0-9 would still account for 62 out of
the 75 possible characters, so roughly 83% of the set. I think that
removing the quoting in favour of the limited set would reduce the risk
of hitting broken implementations, yet still provide a good amount of
randomness.
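
As a rough illustration of the limited set, a boundary could be drawn
from those 62 characters as sketched below. The names make_boundary,
BOUNDARY_LEN and ALNUM_SET are hypothetical and not taken from the Dillo
patch, and rand(3) only stands in for whatever random source Dillo
actually uses.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical names; not taken from the Dillo sources. */
#define BOUNDARY_LEN 70 /* maximum boundary length allowed by RFC 2046 */

static const char ALNUM_SET[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789";

/* Fills out[0..BOUNDARY_LEN-1] with characters from ALNUM_SET and
 * NUL-terminates it. out must hold BOUNDARY_LEN + 1 bytes.
 * rand(3) is a placeholder: it is not a good source of randomness
 * and the modulo introduces a slight bias. */
static void make_boundary(char *out)
{
    const size_t n = sizeof ALNUM_SET - 1; /* 62, without the NUL */

    for (size_t i = 0; i < BOUNDARY_LEN; i++)
        out[i] = ALNUM_SET[(size_t)rand() % n];

    out[BOUNDARY_LEN] = '\0';
}
```

Since no character in the set needs quoting, the resulting string could
be placed in the Content-Type header as-is.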
...
Check sizeof " ": https://godbolt.org/z/7Tso8ooYz
Interestingly, the " " character on your last email is not really a
<space> (<U0020>):

$ printf "%s" " " | hd
00000000  e2 80 88                                          |...|
00000003

Compared to an ASCII whitespace:

$ printf "%s" " " | hd
00000000  20                                                | |
00000001

Both Godbolt and my editor also flag that multi-byte character with a
yellow rectangle around it because it would be highly confusing
otherwise. For example:

printf("len=%zu\n", strlen(" "));

This confusingly prints "len=3".

I am not sure whether this was an intentional modification from your
side. My patch is adding a <space> as defined by POSIX.1-2017 [1], so
that sizeof " " would always return 2. Was it your intention to flag
this potential confusion?

Also, there was no strict reason to use sizeof " ". Any other character
would do, e.g.: sizeof "x", sizeof "A", etc.
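
To make the difference explicit, here is a small sketch that spells the
bytes out with escape sequences so they are unambiguous. The macro and
function names are made up for this example.

```c
#include <string.h>

/* Hypothetical names, for illustration only. */
#define ASCII_SPACE " "            /* <space>, U+0020: one byte */
#define PUNCT_SPACE "\xe2\x80\x88" /* U+2008 encoded as UTF-8: three bytes */

/* sizeof counts the terminating NUL byte; strlen(3) does not. */
static size_t ascii_space_size(void) { return sizeof ASCII_SPACE; }
static size_t punct_space_size(void) { return sizeof PUNCT_SPACE; }
static size_t punct_space_len(void) { return strlen(PUNCT_SPACE); }
```

Here sizeof ASCII_SPACE is 2, sizeof PUNCT_SPACE is 4 and
strlen(PUNCT_SPACE) is 3.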
...
You can also use dStr_append_c() to only append one character, so you only need a single character.
That would be an unnecessary use of the heap, because the size is static.
...
If we only use alphanumeric characters, we can just use isalnum() right?
According to POSIX.1-2017 [2], isalnum(3) depends on the current locale
configured by the system. For example, characters such as Ä or ú could
return non-zero. To avoid this, there are two possible solutions:

1. Use isalnum_l(3) to specify a locale_t object corresponding to the
"POSIX" locale (equivalent to "C" [3]), which must be previously
allocated by the newlocale(3) function [3] and released by the
freelocale(3) function [4]. A minimalist example is shown below:

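        /* Requires <ctype.h>, <locale.h> and <stdio.h>. */
        /* newlocale(3) takes a category mask and a base locale object. */
        locale_t l = newlocale(LC_CTYPE_MASK, "POSIX", (locale_t)0);

        /* Cover every unsigned char value, 255 included. */
        for (int i = 0; i <= 255; i++)
                printf("i=%d, c=%c, isalnum=%d\n", i, i,
                       isalnum_l(i, l));

        freelocale(l);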

2. Define a known subset from the portable character set defined by
POSIX.1-2017 [5] and use strspn(3), as already suggested by the patch.
IMHO this approach is better because:
	- It does not deal with locales, so developers not familiar with them
would understand the code better.
	- It is also portable outside a POSIX environment (not sure if this is a
requirement, though).
	- It does not require dynamic allocation via newlocale(3).
	- It is the only possible option if non-alnum characters, such as ':'
or '/', are appended to the boundary string.
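
A minimal sketch of the second approach follows. The helper name
is_plain_alnum is hypothetical and the actual patch may use strspn(3)
differently; the point is only that the accepted set is spelled out in
the source.

```c
#include <string.h>

/* Hypothetical helper; the set is fixed at compile time, so the
 * result does not depend on the current locale. */
static int is_plain_alnum(const char *s)
{
    static const char set[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789";

    /* strspn(3) returns the length of the leading run made only of
     * bytes from set; s qualifies if that run reaches its NUL. */
    return s[strspn(s, set)] == '\0';
}
```

Note that an empty string is also accepted by this sketch, and that
extending the set with ':' or '/' only means adding them to the array.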

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html
[2]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/isalnum.html
[3]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/newlocale.html
[4]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/freelocale.html
[5]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html
...
I meant when is the next KoVoꓘ concert :-)
No gigs ahead, but I will keep you informed. :)

Best regards,

Xavi

On 28/8/24 22:47, Rodrigo Arias wrote:
...
Hi Xavier,
On Wed, Aug 28, 2024 at 01:04:04AM +0200, Xavier Del Campo Romero wrote:
...
Hi Rodrigo,
...
Glad to read that you also consider Dillo for slcl, and thanks for
preparing the patches :-)
Thank you! I want slcl to be useful to anyone, including users who care
about minimalist software like Dillo. The web is already too crowded
with bloated "webapps" and other terrible things. :)
Agreed!
...
...
Sounds good, not sure how complicated it would be to do this.
I still need to investigate this further, but I assume this would
require Dillo to at least implement a sink callback.
In other words, the component responsible for transmitting the data
(probably src/IO/IO.c) should trigger a user-defined callback with an
arbitrarily-sized buffer (typically, of BUFSIZ bytes, as defined by
stdio.h) that must be filled with file data. Then, the user-defined
callback can fill from zero up to BUFSIZ bytes, which are eventually
transmitted to the server.
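
The idea can be sketched roughly as follows. Every name here (fill_cb,
emit_cb, fill_from_file, count_bytes, stream_all) is hypothetical and
none of them exist in Dillo; the sketch only shows the shape of a
callback-driven transfer that never holds more than BUFSIZ bytes.

```c
#include <stdio.h>

/* Hypothetical interfaces; none of these names exist in Dillo. */

/* Fills buf with up to cap bytes, returns the amount written (0 = done). */
typedef size_t (*fill_cb)(void *buf, size_t cap, void *user);

/* Consumes n bytes from buf, returns 0 on success. */
typedef int (*emit_cb)(const void *buf, size_t n, void *user);

/* Example fill callback: reads the next chunk from a FILE.
 * A read error is not told apart from end-of-file here. */
static size_t fill_from_file(void *buf, size_t cap, void *user)
{
    return fread(buf, 1, cap, user);
}

/* Example emit callback: only counts bytes instead of using a socket. */
static int count_bytes(const void *buf, size_t n, void *user)
{
    (void)buf;
    *(size_t *)user += n;
    return 0;
}

/* Transmitter loop: memory use stays at BUFSIZ regardless of file size. */
static int stream_all(fill_cb fill, void *fill_user,
                      emit_cb emit, void *emit_user)
{
    char buf[BUFSIZ];
    size_t n;

    while ((n = fill(buf, sizeof buf, fill_user)) > 0)
        if (emit(buf, n, emit_user))
            return -1;

    return 0;
}
```

A caller would pass fill_from_file together with an open FILE, and an
emit callback that writes to the connection instead of counting bytes.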
Dillo has a mechanism to read chunks of data from different sources as
they are arriving and pass them to the next stage for processing.
However, AFAIK it always reads a chunk and appends it to a large buffer.
It doesn't free the processed part until it is done with the whole thing.
This would require a change in the way Dillo processes data, but I think
it would be required for large files. There are more details in the
devdoc/CCCwork.txt file and in src/chain.c if you want to take a closer
look.
As I'm planning to change the design of the CCC, I think I can take this
into account too so it would be doable. I'll add it to the list of
shortcomings of the current design.
...
That said, I am still not sure how much actual effort this would take.
But I am glad to receive positive feedback so far - I will then continue
to find a solution.
...
However, being able to upload multiple files at the same time sounds
reasonable, so feel free to try on your own in the meantime.
Uploading multiple files at once seems doable - the patches I sent in my
previous email are probably already doing most of the required work.
Again, the trickiest task is to send data on-the-fly for each selected
file.
Okay, I'll focus on the boundary patch first, which is the easiest to
merge and then I'll take a closer look at the others.
...
...
Shouldn't it be 68 then?
I understand the opposite: the boundary string with the two leading
dashes ("--") included can be up to 72 bytes long, and 74 bytes long for
the ending boundary (which includes two more dashes after the boundary
string). This is confirmed by reading the BNF defined by RFC 2046 (some
bits omitted for simplicity), section 5.1.1 [1]:
...
boundary := 0*69<bchars> bcharsnospace
bchars := bcharsnospace / " "
bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
                      "+" / "_" / "," / "-" / "." /
                      "/" / ":" / "=" / "?"
dash-boundary := "--" boundary
                      ; boundary taken from the value of
                      ; boundary parameter of the
                      ; Content-Type field
multipart-body := [preamble CRLF]
                       dash-boundary transport-padding CRLF
                       body-part *encapsulation
                       close-delimiter transport-padding
                       [CRLF epilogue]
delimiter := CRLF dash-boundary
close-delimiter := delimiter "--"
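
Read literally, the grammar quoted above could be checked with a sketch
like the following. The helper names are hypothetical and this is not
code from Dillo or slcl.

```c
#include <string.h>

/* Hypothetical helper following the RFC 2046 grammar quoted above:
 * 1 to 70 characters from bchars, the last one not being a space. */
static int is_valid_boundary(const char *b)
{
    static const char bcharsnospace[] =
        "0123456789"
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "'()+_,-./:=?";
    const size_t len = strlen(b);

    if (len < 1 || len > 70)
        return 0;

    for (size_t i = 0; i < len; i++) {
        if (b[i] == ' ' && i + 1 < len)
            continue; /* space is allowed except as last character */
        if (!strchr(bcharsnospace, b[i]))
            return 0;
    }

    return 1;
}

/* Length of: "--" boundary */
static size_t dash_boundary_len(const char *b) { return 2 + strlen(b); }

/* Length of: "--" boundary "--", without the delimiter's leading CRLF */
static size_t close_boundary_len(const char *b) { return 4 + strlen(b); }
```

For a boundary of the maximum 70 characters, these give 72 and 74 bytes
respectively.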
Oh right! I see that we are already using 70 characters anyway.
...
Note: even if the specification tells receivers to handle transport
padding, for the time being I am assuming "transport-padding" as zero
length since composers must not generate non-zero length transport
padding. I am still not sure where transport padding would apply,
anyway. Probably outside web browsers?
...
I would leave out all the symbols to avoid quoting and only use A-Z
a-z and 0-9.
Interestingly, Dillo would always quote boundary strings [2], even if
only using A-Z, a-z and 0-9. In fact, this is one of the wrong
assumptions I spotted when testing slcl against Dillo.
Yeah, I would assume a lot of implementations are broken, so we want to
try to minimize the chances we run into problems.
Apart from slcl we should also test this with some sites and see if they
continue to work okay.
This will also increase the fingerprinting information to distinguish
Dillo among other browsers, but I think it is not more information than
that already leaked by the user agent.
...
...
Which, if I computed it correctly, is still too small to worry about.
Not only is the chance too small: if we really wanted to do "the
right thing" and make Dillo absolutely sure the boundary string is not
contained within the selected files, this would imply a noticeable
performance impact when dealing with large files, most likely for a
near-zero benefit.
I have not inspected their source code yet (and I do not want to), but I
understand both Gecko and Chromium are also making that assumption,
because otherwise it would take them a lot of CPU time to upload large
files.
But then they would be making that assumption with a "much larger"
probability that it hits the file.
Skipping it with 70 characters is safe for one file, but also probably
safe for all files ever uploaded with Dillo.
Maybe curl or other small codebases are easier to read, but not really
needed.
...
...
Why sizeof " " instead of just 2?
Because, to my eyes, sizeof " " has more meaningful semantics, compared
to a magic integer constant such as 2. However, for this simple
scenario, I would still consider both acceptable.
Check sizeof " ": https://godbolt.org/z/7Tso8ooYz
You can also use dStr_append_c() to only append one character, so you
only need a single character.
If we only use alphanumeric characters, we can just use isalnum() right?
...
I can replace it with 2 if you find the other construct unacceptable.
...
PS: When are you playing?
Sorry, I did not understand your last sentence. Could you please give a
bit more context? :)
I meant when is the next KoVoꓘ concert :-)
Best,
Rodrigo.