Issues with HTTP multipart/form-data file upload
Hello Dillo dev community, I am testing support among web browsers for slcl [1], a JS-less minimalist web storage solution. slcl relies on "multipart/form-data"-encoded HTTP requests to upload files to a server. Whereas Dillo helped me to uncover a few wrong assumptions on my code, I have realised a few issues on Dillo itself related to filue uploads that should be considered. Issue #1: While Dillo is fine uploading small files (up to a few MiB), things go wrong with larger files, so much that memory usage increases up to multiple GiB and can even lock the system up. This is because Dillo is designed to send requests always from memory, which means file contents must be dumped into memory first, and this might be unfeasible for large files. Suggestion: Instead, Dillo should ideally send file contents on-the-fly, so that memory usage is kept to a minimum regardless the file size. Issue #2: Even if Dillo generates a 70-byte, random boundary string (yet mostly filled with '-', similarly to Firefox [2]), it ensures it is not found anywhere inside the file contents. Again, this can be a serious bottleneck in the case of large files, as it requires to scan the whole file for a match. Suggestion: Define *all* of the 70 bytes in the boundary string as random, and assume they would never be found inside a file. The chance of accidental collision is so low that it is not worth the effort into checking them. Suggested patches: - 0001-dialog.cc-Generate-more-random-boundaries.patch) Issue #3: Dillo only supports uploading 1 file at a time. This is mostly because it relies (probably temporarily?) on the a_Dialog_save_file function [3]. However, this is not a limitation on the HTTP protocol, and other implementation such as Firefox or Chromium-based browsers support this. Suggested patches: - 0002-dialog-Add-a_Dialog_select_files.patch - 0003-WIP-multi-file-uploads.patch Conclusions: Dillo seems designed to always send requests from memory, so it is not straightforward to break this assumption in order to support large file uploads. 0003-WIP-multi-file-uploads.patch is an incomplete first step into fixing this, but it surely needs deeper design changes. I did not put more effort into these patches for the time being because, after seeing the potential complexity behind this task, I thought it was a better idea to ask the community for feedback and guidelines. Thank you very much for reading. Best regards, Xavier Del Campo Romero [1]: https://gitea.privatedns.org/xavi/slcl [2]: https://github.com/dillo-browser/dillo/blob/8a360e32ac3136494a494379a6dbbace... [3]: https://github.com/dillo-browser/dillo/blob/8a360e32ac3136494a494379a6dbbace...
Hi Xavier, On Mon, Aug 26, 2024 at 01:27:29AM +0200, Xavier Del Campo Romero wrote:
Hello Dillo dev community,
I am testing support among web browsers for slcl [1], a JS-less minimalist web storage solution. slcl relies on "multipart/form-data"-encoded HTTP requests to upload files to a server. Whereas Dillo helped me to uncover a few wrong assumptions on my code, I have realised a few issues on Dillo itself related to filue uploads that should be considered.
Glad to read that you also consider Dillo for slcl, and thanks for preparing the patches :-)
Issue #1:
While Dillo is fine uploading small files (up to a few MiB), things go wrong with larger files, so much that memory usage increases up to multiple GiB and can even lock the system up. This is because Dillo is designed to send requests always from memory, which means file contents must be dumped into memory first, and this might be unfeasible for large files.
Suggestion:
Instead, Dillo should ideally send file contents on-the-fly, so that memory usage is kept to a minimum regardless the file size.
Sounds good, not sure how complicated it would be to do this.
Issue #2:
Even if Dillo generates a 70-byte, random boundary string (yet mostly filled with '-', similarly to Firefox [2]), it ensures it is not found anywhere inside the file contents. Again, this can be a serious bottleneck in the case of large files, as it requires to scan the whole file for a match.
Suggestion:
Define *all* of the 70 bytes in the boundary string as random, and assume they would never be found inside a file. The chance of accidental collision is so low that it is not worth the effort into checking them.
Suggested patches:
- 0001-dialog.cc-Generate-more-random-boundaries.patch)
Yes, I think is a good idea. @@ -1246,6 +1246,24 @@ Dstr *DilloHtmlForm::buildQueryData(DilloHtmlInput *active_submit) return DataStr; } +static void generate_boundary(Dstr *boundary) +{ + for (int i = 0; i < 70; i++) { I think this is too long:
Boundary delimiters must not appear within the encapsulated material, and must be no longer than 70 characters, not counting the two leading hyphens.
Shouldn't it be 68 then? + /* Extracted from RFC 2046, section 5.1.1. */ + static const char set[] = "abcdefghijklmnopqrstuvwxyz" + "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + "0123456789" + "'()+_,-./:=? "; I would leave out all the symbols to avoid quoting and only use A-Z a-z and 0-9. Assuming you are left with 62 symbols (26*2 + 10), the probability of finding a random 70 character string in a random sample of 70 characters is (using ** as ^): p = 1/62**70 = 3.4086e-126 But as you will be dealing with large files, each line of the input could match it. We are interested in the probability that a given boundary will mach any of the N lines. So we compute the probability that it will match none of the lines (1-p)**N and then the opposite of that (matches at least one line): 1 - (1 - p) ** N For example, a 1 TB encoded file, with 2**40/70 lines will have: p = 1 - (1 - 1/62**70) ** (2**40 / 70) Of occurring, which is around 5.354e-116. Which, if I computed it correctly, is still too small to worry about. + char s[sizeof " "] = {0}; Why sizeof " " instead of just 2? + + do { + *s = rand(); + } while (!strspn(s, set)); + + dStr_append(boundary, s); + } +} + /** * Generate a boundary string for use in separating the parts of a * multipart/form-data submission.
Issue #3:
Dillo only supports uploading 1 file at a time. This is mostly because it relies (probably temporarily?) on the a_Dialog_save_file function [3]. However, this is not a limitation on the HTTP protocol, and other implementation such as Firefox or Chromium-based browsers support this.
Suggested patches:
- 0002-dialog-Add-a_Dialog_select_files.patch - 0003-WIP-multi-file-uploads.patch
Conclusions:
Dillo seems designed to always send requests from memory, so it is not straightforward to break this assumption in order to support large file uploads. 0003-WIP-multi-file-uploads.patch is an incomplete first step into fixing this, but it surely needs deeper design changes.
I did not put more effort into these patches for the time being because, after seeing the potential complexity behind this task, I thought it was a better idea to ask the community for feedback and guidelines.
I'm not very familiar with this part of Dillo, so I'll have to check a bit more before providing more feedback. However, being able to upload multiple files at the same time sounds reasonable, so feel free to try on your own in the meanwhile. PS: When are you playing? Best, Rodrigo.
Hi Rodrigo,
Glad to read that you also consider Dillo for slcl, and thanks for preparing the patches :-)
Thank you! I want slcl to be useful to anyone, including users who care about minimalist software like Dillo. The web is already too crowded with bloated "webapps" and other terrible things. :)
Sounds good, not sure how complicated it would be to do this.
I still need to investigate this further, but I assume this would require Dillo to at least implement a sink callback. In other words, the component responsible for transmitting the data (probably src/IO/IO.c) should trigger a user-defined callback with an arbitrarily-sized buffer (typically, of BUFSIZ bytes, as defined by stdio.h) that must filled with file data. Then, the user-defined callback can fill from zero up to BUFSIZ bytes, which are eventually trasmitted to the server. That said, I am still not sure how much actual effort this would take. But I am glad to receive positive feedback so far - I will then continue to find a solution.
However, being able to upload multiple files at the same time sounds reasonable, so feel free to try on your own in the meanwhile.
Uploading multiple files at once seems doable - the patches I sent on my previous email are probably already doing most of the required work. Again, the trickiest task is to send data on-the-fly for each selected file.
Shouldn't it be 68 then?
boundary := 0*69<bchars> bcharsnospace bchars := bcharsnospace / " " bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" / "+" / "_" / "," / "-" / "." / "/" / ":" / "=" / "?" dash-boundary := "--" boundary ; boundary taken from the value of ; boundary parameter of the ; Content-Type field> multipart-body := [preamble CRLF] dash-boundary transport-padding CRLF body-part *encapsulation close-delimiter transport-padding [CRLF epilogue] delimiter := CRLF dash-boundary close-delimiter := delimiter "--" Note: even if the specification tells receivers to handle transport
I understand the opposite: the boundary string with the two leading dashes ("--") included can be up to 72 bytes long, and 74 bytes long for the ending boundary (which includes two more dashes after the boundary string). This is confirmed by reading the BNF defined by RFC 2046 (some bits omitted for simplicity), section 5.1.1 [1]: padding, for the time being I am assuming "transport-padding" as zero length since composers must not generate non-zero length transport padding. I am still not sure where transport padding would apply, anyway. Probably outside web browsers?
I would leave out all the symbols to avoid quoting and only use A-Z a-z and 0-9.
Interestingly, Dillo would always quote boundary strings [2], even if only using A-Z, a-z and 0-9. In fact, this is one of the wrong assumptions I spotted when testing slcl against Dillo.
Which, if I computed it correctly, is still too small to worry about.
Not only it is too small of a chance: if we really wanted to do "the right thing" and make Dillo absolutely sure the boundary string is not contained within the selected files, this would imply a noticeable performance impact when dealing with large files, much likely for a near-zero benefit. I have not inspected their source code yet (and I do not want to), but I understand both Gecko and Chromium are also making that assumption, because otherwise it would take them a lot of CPU time to upload large files.
Why sizeof " " instead of just 2?
Because, to my eyes, sizeof " " has more meaningful semantics, compared to a magic integer constant such as 2. However, for this simple scenario, I would still consider both acceptable. I can replace it with 2 if you find the other construct unacceptable.
PS: When are you playing?
Sorry, I did not understand your last sentence. Could you please give a bit more context? :) Best regards, Xavi [1]: https://www.rfc-editor.org/rfc/rfc2046.html#section-5.1.1 [2]: https://github.com/dillo-browser/dillo/blob/8a360e32ac3136494a494379a6dbbace... On 27/8/24 22:59, Rodrigo Arias wrote:
Hi Xavier,
On Mon, Aug 26, 2024 at 01:27:29AM +0200, Xavier Del Campo Romero wrote:
Hello Dillo dev community,
I am testing support among web browsers for slcl [1], a JS-less minimalist web storage solution. slcl relies on "multipart/form-data"-encoded HTTP requests to upload files to a server. Whereas Dillo helped me to uncover a few wrong assumptions on my code, I have realised a few issues on Dillo itself related to filue uploads that should be considered.
Glad to read that you also consider Dillo for slcl, and thanks for preparing the patches :-)
Issue #1:
While Dillo is fine uploading small files (up to a few MiB), things go wrong with larger files, so much that memory usage increases up to multiple GiB and can even lock the system up. This is because Dillo is designed to send requests always from memory, which means file contents must be dumped into memory first, and this might be unfeasible for large files.
Suggestion:
Instead, Dillo should ideally send file contents on-the-fly, so that memory usage is kept to a minimum regardless the file size.
Sounds good, not sure how complicated it would be to do this.
Issue #2:
Even if Dillo generates a 70-byte, random boundary string (yet mostly filled with '-', similarly to Firefox [2]), it ensures it is not found anywhere inside the file contents. Again, this can be a serious bottleneck in the case of large files, as it requires to scan the whole file for a match.
Suggestion:
Define *all* of the 70 bytes in the boundary string as random, and assume they would never be found inside a file. The chance of accidental collision is so low that it is not worth the effort into checking them.
Suggested patches:
- 0001-dialog.cc-Generate-more-random-boundaries.patch)
Yes, I think is a good idea.
@@ -1246,6 +1246,24 @@ Dstr *DilloHtmlForm::buildQueryData(DilloHtmlInput *active_submit) return DataStr; }
+static void generate_boundary(Dstr *boundary) +{ + for (int i = 0; i < 70; i++) {
I think this is too long:
Boundary delimiters must not appear within the encapsulated material, and must be no longer than 70 characters, not counting the two leading hyphens.
Shouldn't it be 68 then?
+ /* Extracted from RFC 2046, section 5.1.1. */ + static const char set[] = "abcdefghijklmnopqrstuvwxyz" + "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + "0123456789" + "'()+_,-./:=? ";
I would leave out all the symbols to avoid quoting and only use A-Z a-z and 0-9.
Assuming you are left with 62 symbols (26*2 + 10), the probability of finding a random 70 character string in a random sample of 70 characters is (using ** as ^):
p = 1/62**70 = 3.4086e-126
But as you will be dealing with large files, each line of the input could match it. We are interested in the probability that a given boundary will mach any of the N lines. So we compute the probability that it will match none of the lines (1-p)**N and then the opposite of that (matches at least one line):
1 - (1 - p) ** N
For example, a 1 TB encoded file, with 2**40/70 lines will have:
p = 1 - (1 - 1/62**70) ** (2**40 / 70)
Of occurring, which is around 5.354e-116.
Which, if I computed it correctly, is still too small to worry about.
+ char s[sizeof " "] = {0};
Why sizeof " " instead of just 2?
+ + do { + *s = rand(); + } while (!strspn(s, set)); + + dStr_append(boundary, s); + } +} + /** * Generate a boundary string for use in separating the parts of a * multipart/form-data submission.
Issue #3:
Dillo only supports uploading 1 file at a time. This is mostly because it relies (probably temporarily?) on the a_Dialog_save_file function [3]. However, this is not a limitation on the HTTP protocol, and other implementation such as Firefox or Chromium-based browsers support this.
Suggested patches:
- 0002-dialog-Add-a_Dialog_select_files.patch - 0003-WIP-multi-file-uploads.patch
Conclusions:
Dillo seems designed to always send requests from memory, so it is not straightforward to break this assumption in order to support large file uploads. 0003-WIP-multi-file-uploads.patch is an incomplete first step into fixing this, but it surely needs deeper design changes.
I did not put more effort into these patches for the time being because, after seeing the potential complexity behind this task, I thought it was a better idea to ask the community for feedback and guidelines.
I'm not very familiar with this part of Dillo, so I'll have to check a bit more before providing more feedback.
However, being able to upload multiple files at the same time sounds reasonable, so feel free to try on your own in the meanwhile.
PS: When are you playing?
Best, Rodrigo. _______________________________________________ Dillo-dev mailing list -- dillo-dev@mailman3.com To unsubscribe send an email to dillo-dev-leave@mailman3.com
Hi Xavier, On Wed, Aug 28, 2024 at 01:04:04AM +0200, Xavier Del Campo Romero wrote:
Hi Rodrigo,
Glad to read that you also consider Dillo for slcl, and thanks for preparing the patches :-)
Thank you! I want slcl to be useful to anyone, including users who care about minimalist software like Dillo. The web is already too crowded with bloated "webapps" and other terrible things. :)
Agreed!
Sounds good, not sure how complicated it would be to do this.
I still need to investigate this further, but I assume this would require Dillo to at least implement a sink callback.
In other words, the component responsible for transmitting the data (probably src/IO/IO.c) should trigger a user-defined callback with an arbitrarily-sized buffer (typically, of BUFSIZ bytes, as defined by stdio.h) that must filled with file data. Then, the user-defined callback can fill from zero up to BUFSIZ bytes, which are eventually trasmitted to the server.
Dillo has a mechanism to read chunks of data from different sources as they are arriving and pass them to the next stage for processing. However, AFAIK it always reads a chunk and appends it to a large buffer. It doesn't free the processed part until is done with the whole thing. This would require a change in the way Dillo processes data, but I think it would be required for large files. There are more details in the devdoc/CCCwork.txt file and in src/chain.c if you want to take a closer look. As I'm planning to change the design of the CCC, I think I can take this into account too so it would be doable. I'll add it to the list of shortcomings of the current design.
That said, I am still not sure how much actual effort this would take. But I am glad to receive positive feedback so far - I will then continue to find a solution.
However, being able to upload multiple files at the same time sounds reasonable, so feel free to try on your own in the meanwhile.
Uploading multiple files at once seems doable - the patches I sent on my previous email are probably already doing most of the required work. Again, the trickiest task is to send data on-the-fly for each selected file.
Okay, I'll focus on the boundary patch first, which is the easiest to merge and then I'll take a closer look at the others.
Shouldn't it be 68 then?
I understand the opposite: the boundary string with the two leading dashes ("--") included can be up to 72 bytes long, and 74 bytes long for the ending boundary (which includes two more dashes after the boundary string). This is confirmed by reading the BNF defined by RFC 2046 (some bits omitted for simplicity), section 5.1.1 [1]:
boundary := 0*69<bchars> bcharsnospace bchars := bcharsnospace / " " bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" / "+" / "_" / "," / "-" / "." / "/" / ":" / "=" / "?" dash-boundary := "--" boundary ; boundary taken from the value of ; boundary parameter of the ; Content-Type field> multipart-body := [preamble CRLF] dash-boundary transport-padding CRLF body-part *encapsulation close-delimiter transport-padding [CRLF epilogue] delimiter := CRLF dash-boundary close-delimiter := delimiter "--"
Oh right! I see that we are already using 70 characters anyway.
Note: even if the specification tells receivers to handle transport padding, for the time being I am assuming "transport-padding" as zero length since composers must not generate non-zero length transport padding. I am still not sure where transport padding would apply, anyway. Probably outside web browsers?
I would leave out all the symbols to avoid quoting and only use A-Z a-z and 0-9.
Interestingly, Dillo would always quote boundary strings [2], even if only using A-Z, a-z and 0-9. In fact, this is one of the wrong assumptions I spotted when testing slcl against Dillo.
Yeah, I would assume a lot of implementations are broken, so we want to try to minimize the chances we run into problems. Apart from slcl we should also test this with some sites and see if they continue to work okay. This will also increase the fingerprinting information to distinguish Dillo among other browsers, but I think it is not more information that the already leaked by the user agent.
Which, if I computed it correctly, is still too small to worry about.
Not only it is too small of a chance: if we really wanted to do "the right thing" and make Dillo absolutely sure the boundary string is not contained within the selected files, this would imply a noticeable performance impact when dealing with large files, much likely for a near-zero benefit.
I have not inspected their source code yet (and I do not want to), but I understand both Gecko and Chromium are also making that assumption, because otherwise it would take them a lot of CPU time to upload large files.
But then they would be doing such assumption with a "much larger" probability it hits the file. Skipping it with 70 characters is safe for one file, but also probably safe for all files ever uploaded with Dillo. Maybe curl or other small codebases are easier to read, but not really needed.
Why sizeof " " instead of just 2?
Because, to my eyes, sizeof " " has more meaningful semantics, compared to a magic integer constant such as 2. However, for this simple scenario, I would still consider both acceptable.
Check sizeof " ": https://godbolt.org/z/7Tso8ooYz You can also use dStr_append_c() to only append one character, so you only need a single character. If we only use alphanumeric characters, we can just use isalnum() right?
I can replace it with 2 if you find the other construct unacceptable.
PS: When are you playing?
Sorry, I did not understand your last sentence. Could you please give a bit more context? :)
I meant when is the next KoVoꓘ concert :-) Best, Rodrigo.
Hi Rodrigo,
Dillo has a mechanism to read chunks of data from different sources as they are arriving and pass them to the next stage for processing. However, AFAIK it always reads a chunk and appends it to a large buffer. It doesn't free the processed part until is done with the whole thing.
This would require a change in the way Dillo processes data, but I think it would be required for large files. There are more details in the devdoc/CCCwork.txt file and in src/chain.c if you want to take a closer look.
As I'm planning to change the design of the CCC, I think I can take this into account too so it would be doable. I'll add it to the list of shortcomings of the current design.
Thank you. I am still unfamiliar with that part of Dillo, so please let me know about any progress.
Okay, I'll focus on the boundary patch first, which is the easiest to merge and then I'll take a closer look at the others.
Yeah, I would assume a lot of implementations are broken, so we want to try to minimize the chances we run into problems.
Limiting ourselves to a-z, A-Z and 0-9 would still account for 62 out of the 75 possible characters, so roughly 82% of the set. I think that removing the quoting in favour of the limited set reduce the risk for broken implementations, yet still provide a good amount of randomness.
Check sizeof " ": https://godbolt.org/z/7Tso8ooYz
Interestingly, the " " character on your last email is not really a <space> (<U0020>): $ printf "%s" " " | hd 00000000 e2 80 88 |...| 00000003 Compared to an ASCII whitespace: $ printf "%s" " " | hd 00000000 20 | | 00000001 Both Godbolt and my editor also flag that multi-byte character with a yellow rectangle around it because it would be highly confusing otherwise. For example: printf("len=%zu\n", strlen(" ")); Confusingly returns "len=3". I am not sure whether this was an intentional modification from your side. My patch is adding a <space> as defined by POSIX.1-2017 [1], so that sizeof " " would always return 2. Was it your intention to flag this potential confusion? Also, there was not strict reason to use sizeof " ". Any other character would do e.g.: sizeof "x", sizeof "A", etc.
You can also use dStr_append_c() to only append one character, so you only need a single character.
That would be an unnecessary use of the heap, because the size is static.
If we only use alphanumeric characters, we can just use isalnum() right?
According to POSIX.1-2017 [2], isalnum(3) depends on the current locale configured by the system. For example, characters such as Ä or ú could return non-zero. To avoid this, there are two possible solutions: 1. Use isalnum_l(3) to specify a locale_t object corresponding to the "POSIX" locale (equivalent to "C" [3]), which must be previously allocated by the newlocale(3) function [3] and released by the freelocal(3) function [4]. A minimalist example is shown below: locale_t l = newlocale(LC_CTYPE, "POSIX", NULL); for (unsigned char i = 0; i < 255; i++) printf("hhu=%hhu, c=%c, isalnum=%d\n", i, i, isalnum_l(i, l)); freelocale(l); 2. Define a known subset from the portable character set defined by POSIX.1-2017 [5] and use strspn(3), as already suggested by the patch. IMHO this approach is better because: - It does not deal with locales, so developers not familiar with them would understand the code better. - It is also portable outside a POSIX environment (not sure if this a requirement, though). - It does not require dynamic allication via newlocale(3). - It is the only possible option if non-alnum characters, such as ':' or '/', are appended to the boundary string. [1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html [2]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/isalnum.html [3]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/newlocale.html [4]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/freelocale.html [5]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html
I meant when is the next KoVoꓘ concert :-)
No gigs ahead, but I will keep you informed. :) Best regards, Xavi On 28/8/24 22:47, Rodrigo Arias wrote:
Hi Xavier,
On Wed, Aug 28, 2024 at 01:04:04AM +0200, Xavier Del Campo Romero wrote:
Hi Rodrigo,
Glad to read that you also consider Dillo for slcl, and thanks for preparing the patches :-)
Thank you! I want slcl to be useful to anyone, including users who care about minimalist software like Dillo. The web is already too crowded with bloated "webapps" and other terrible things. :)
Agreed!
Sounds good, not sure how complicated it would be to do this.
I still need to investigate this further, but I assume this would require Dillo to at least implement a sink callback.
In other words, the component responsible for transmitting the data (probably src/IO/IO.c) should trigger a user-defined callback with an arbitrarily-sized buffer (typically, of BUFSIZ bytes, as defined by stdio.h) that must filled with file data. Then, the user-defined callback can fill from zero up to BUFSIZ bytes, which are eventually trasmitted to the server.
Dillo has a mechanism to read chunks of data from different sources as they are arriving and pass them to the next stage for processing. However, AFAIK it always reads a chunk and appends it to a large buffer. It doesn't free the processed part until is done with the whole thing.
This would require a change in the way Dillo processes data, but I think it would be required for large files. There are more details in the devdoc/CCCwork.txt file and in src/chain.c if you want to take a closer look.
As I'm planning to change the design of the CCC, I think I can take this into account too so it would be doable. I'll add it to the list of shortcomings of the current design.
That said, I am still not sure how much actual effort this would take. But I am glad to receive positive feedback so far - I will then continue to find a solution.
However, being able to upload multiple files at the same time sounds reasonable, so feel free to try on your own in the meanwhile.
Uploading multiple files at once seems doable - the patches I sent on my previous email are probably already doing most of the required work. Again, the trickiest task is to send data on-the-fly for each selected file.
Okay, I'll focus on the boundary patch first, which is the easiest to merge and then I'll take a closer look at the others.
Shouldn't it be 68 then?
I understand the opposite: the boundary string with the two leading dashes ("--") included can be up to 72 bytes long, and 74 bytes long for the ending boundary (which includes two more dashes after the boundary string). This is confirmed by reading the BNF defined by RFC 2046 (some bits omitted for simplicity), section 5.1.1 [1]:
boundary := 0*69<bchars> bcharsnospace bchars := bcharsnospace / " " bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" / "+" / "_" / "," / "-" / "." / "/" / ":" / "=" / "?" dash-boundary := "--" boundary ; boundary taken from the value of ; boundary parameter of the ; Content-Type field> multipart-body := [preamble CRLF] dash-boundary transport-padding CRLF body-part *encapsulation close-delimiter transport-padding [CRLF epilogue] delimiter := CRLF dash-boundary close-delimiter := delimiter "--"
Oh right! I see that we are already using 70 characters anyway.
Note: even if the specification tells receivers to handle transport padding, for the time being I am assuming "transport-padding" as zero length since composers must not generate non-zero length transport padding. I am still not sure where transport padding would apply, anyway. Probably outside web browsers?
I would leave out all the symbols to avoid quoting and only use A-Z a-z and 0-9.
Interestingly, Dillo would always quote boundary strings [2], even if only using A-Z, a-z and 0-9. In fact, this is one of the wrong assumptions I spotted when testing slcl against Dillo.
Yeah, I would assume a lot of implementations are broken, so we want to try to minimize the chances we run into problems.
Apart from slcl we should also test this with some sites and see if they continue to work okay.
This will also increase the fingerprinting information to distinguish Dillo among other browsers, but I think it is not more information that the already leaked by the user agent.
Which, if I computed it correctly, is still too small to worry about.
Not only it is too small of a chance: if we really wanted to do "the right thing" and make Dillo absolutely sure the boundary string is not contained within the selected files, this would imply a noticeable performance impact when dealing with large files, much likely for a near-zero benefit.
I have not inspected their source code yet (and I do not want to), but I understand both Gecko and Chromium are also making that assumption, because otherwise it would take them a lot of CPU time to upload large files.
But then they would be doing such assumption with a "much larger" probability it hits the file.
Skipping it with 70 characters is safe for one file, but also probably safe for all files ever uploaded with Dillo.
Maybe curl or other small codebases are easier to read, but not really needed.
Why sizeof " " instead of just 2?
Because, to my eyes, sizeof " " has more meaningful semantics, compared to a magic integer constant such as 2. However, for this simple scenario, I would still consider both acceptable.
Check sizeof " ": https://godbolt.org/z/7Tso8ooYz
You can also use dStr_append_c() to only append one character, so you only need a single character.
If we only use alphanumeric characters, we can just use isalnum() right?
I can replace it with 2 if you find the other construct unacceptable.
PS: When are you playing?
Sorry, I did not understand your last sentence. Could you please give a bit more context? :)
I meant when is the next KoVoꓘ concert :-)
Best, Rodrigo. _______________________________________________ Dillo-dev mailing list -- dillo-dev@mailman3.com To unsubscribe send an email to dillo-dev-leave@mailman3.com
Hi Xavier,
Thank you. I am still unfamiliar with that part of Dillo, so please let me know about any progress.
For now I'm still writing the RFC and doing some proof of concepts. I can see that it will take a while.
Limiting ourselves to a-z, A-Z and 0-9 would still account for 62 out of the 75 possible characters, so roughly 82% of the set. I think that removing the quoting in favour of the limited set reduce the risk for broken implementations, yet still provide a good amount of randomness.
Yes, I think so too.
I am not sure whether this was an intentional modification from your side. My patch is adding a <space> as defined by POSIX.1-2017 [1], so that sizeof " " would always return 2. Was it your intention to flag this potential confusion?
Yes, my point was that it is that is not easy to determine the length of a UTF-8 string by just looking at it.
Also, there was not strict reason to use sizeof " ". Any other character would do e.g.: sizeof "x", sizeof "A", etc.
Same problem: ᕁ 𝓍 𝙭 х 𝐱 𝗑 ⤫ 𝑥 𝘅 ⤬ ᙮ ⨯ 𝕩 𝖝 × 𝔁 𝚡 x x ⅹ 𝔵 ᕽ 𝒙 𝘹 𝙰 A 𝐴 ᗅ 𝑨 𝚨 𝕬 𝖠 А Α 𝛢 𝝖 𝒜 𝜜 𖽀 𝓐 𝞐 A 𝗔 𝘈 Ꭺ ꓮ 𝐀 𝔄 𝔸 𐊠 𝘼 Check: https://util.unicode.org/UnicodeJsps/confusables.jsp It is generally safer to use the explicit length.
You can also use dStr_append_c() to only append one character, so you only need a single character.
That would be an unnecessary use of the heap, because the size is static.
Not sure what you mean with "unnecessary use of the heap". I meant something like this: static void generate_boundary(Dstr *boundary) { for (int i = 0; i < 70; i++) { /* Extracted from RFC 2046, section 5.1.1. */ static const char set[] = "abcdefghijklmnopqrstuvwxyz" "ABCDEFGHIJKLMNOPQRSTUVWXYZ" "0123456789"; int c; do { c = rand() & 0xff; } while (!strchr(set, c)); dStr_append_c(boundary, c); } } The variable c is stored in a register, as you can see in the disassembly (built with -Og): hop% r2 build/src/dillo WARN: Relocs has not been applied. Please use `-e bin.relocs.apply=true` or `-e bin.cache=true` next time [0x000e3860]> s sym.generate_boundary_Dstr_ [0x0012cf99]> af [0x0012cf99]> pdf ┌ 85: sym.generate_boundary_Dstr_ (int64_t arg1); │ ; arg int64_t arg1 @ rdi │ 0x0012cf99 55 push rbp ; Fl_Pixmap.H:1250 ; generate_boundary(Dstr*) │ 0x0012cf9a 4889e5 mov rbp, rsp │ 0x0012cf9d 4155 push r13 │ 0x0012cf9f 4154 push r12 │ 0x0012cfa1 53 push rbx │ 0x0012cfa2 4883ec08 sub rsp, 8 │ 0x0012cfa6 4989fd mov r13, rdi ; arg1 │ 0x0012cfa9 41bc00000000 mov r12d, 0 ; Fl_Pixmap.H:1251 │ ┌─< 0x0012cfaf eb2c jmp 0x12cfdd │ ┌┌──> 0x0012cfb1 ff1509cd1e00 call qword [reloc.rand] ; Fl_Pixmap.H:1253 ; [0x319cc0:8]=0 │ ╎╎│ 0x0012cfb7 0fb6d8 movzx ebx, al ; <--- Perform the "& 0xff" by zero-extending the lowest byte in eax │ ╎╎│ 0x0012cfba 89de mov esi, ebx ; Fl_Pixmap.H:1260 │ ╎╎│ 0x0012cfbc 488d3dbd5b.. lea rdi, obj.generate_boundary_Dstr_::set ; 0x2a2b80 ; "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789" │ ╎╎│ 0x0012cfc3 ff15afc51e00 call qword [reloc.strchr] ; [0x319578:8]=0 │ ╎╎│ 0x0012cfc9 4885c0 test rax, rax │ └───< 0x0012cfcc 74e3 je 0x12cfb1 │ ╎│ 0x0012cfce 89de mov esi, ebx ; Fl_Pixmap.H:1262 │ ╎│ 0x0012cfd0 4c89ef mov rdi, r13 │ ╎│ 0x0012cfd3 67e830570800 call sym.dStr_append_c │ ╎│ 0x0012cfd9 4183c401 add r12d, 1 ; Fl_Pixmap.H:1251 │ ╎│ ; CODE XREF from generate_boundary(Dstr*) @ 0x12cfaf(x) │ ╎└─> 0x0012cfdd 4183fc45 cmp r12d, 0x45 ; 'E' │ └──< 0x0012cfe1 7ece jle 0x12cfb1 │ 0x0012cfe3 4883c408 add rsp, 8 ; Fl_Pixmap.H:1264 │ 0x0012cfe7 5b pop rbx │ 0x0012cfe8 415c pop r12 │ 0x0012cfea 415d pop r13 │ 0x0012cfec 5d pop rbp └ 0x0012cfed c3 ret This method gives us a perfect uniform distribution when RAND_MAX is a power of 2, at the cost of executing rand() more times than required (70/(62/256) ≈ 289 on average). This other method has a slightly not uniform distribution, but only runs rand() 70 times: static void generate_boundary(Dstr *boundary) { /* Extracted from RFC 2046, section 5.1.1. */ static const char set[] = "abcdefghijklmnopqrstuvwxyz" "ABCDEFGHIJKLMNOPQRSTUVWXYZ" "0123456789"; static const int n = strlen(set); for (int i = 0; i < 70; i++) { int c = (unsigned char) set[rand() % n]; dStr_append_c(boundary, c); } } Which is probably okay for this case. Notice I haven't tested any of these methods yet, I'll need to add a form upload test case to be able to see them in action (or a unit test).
If we only use alphanumeric characters, we can just use isalnum() right?
According to POSIX.1-2017 [2], isalnum(3) depends on the current locale configured by the system. For example, characters such as Ä or ú could return non-zero.
Oh right, for some reason I was thinking this was the other way around, and isalnum only worked with ASCII. I think we should review the other uses of isalnum and friends as I think there may be used under similar assumptions.
To avoid this, there are two possible solutions:
1. Use isalnum_l(3) to specify a locale_t object corresponding to the "POSIX" locale (equivalent to "C" [3]), which must be previously allocated by the newlocale(3) function [3] and released by the freelocal(3) function [4]. A minimalist example is shown below:
locale_t l = newlocale(LC_CTYPE, "POSIX", NULL);
for (unsigned char i = 0; i < 255; i++) printf("hhu=%hhu, c=%c, isalnum=%d\n", i, i, isalnum_l(i, l));
freelocale(l);
2. Define a known subset from the portable character set defined by POSIX.1-2017 [5] and use strspn(3), as already suggested by the patch. IMHO this approach is better because: - It does not deal with locales, so developers not familiar with them would understand the code better. - It is also portable outside a POSIX environment (not sure if this a requirement, though). - It does not require dynamic allication via newlocale(3). - It is the only possible option if non-alnum characters, such as ':' or '/', are appended to the boundary string.
Then, I think using our own set may be the most readable solution here, which also allows the quick module method. Best, Rodrigo
Hi Rodrigo,
Not sure what you mean with "unnecessary use of the heap". I meant something like this:
You are right: I was too fixated with strcspn(3), when in fact strchr(3) would already do. Also, this use of dStr_append_c() looks good to me.
The variable c is stored in a register, as you can see in the disassembly (built with -Og):
Thanks for the detailed explanation. Anyway, I do not expect this function to become a performance bottleneck.
This other method has a slightly not uniform distribution, but only runs rand() 70 times:
Which is probably okay for this case.
I also think it is good enough.
Then, I think using our own set may be the most readable solution here, which also allows the quick module method.
I agree. Best regards, Xavi On 1/9/24 16:12, Rodrigo Arias wrote:
Hi Xavier,
Thank you. I am still unfamiliar with that part of Dillo, so please let me know about any progress.
For now I'm still writing the RFC and doing some proof of concepts. I can see that it will take a while.
Limiting ourselves to a-z, A-Z and 0-9 would still account for 62 out of the 75 possible characters, so roughly 82% of the set. I think that removing the quoting in favour of the limited set reduce the risk for broken implementations, yet still provide a good amount of randomness.
Yes, I think so too.
I am not sure whether this was an intentional modification from your side. My patch is adding a <space> as defined by POSIX.1-2017 [1], so that sizeof " " would always return 2. Was it your intention to flag this potential confusion?
Yes, my point was that it is that is not easy to determine the length of a UTF-8 string by just looking at it.
Also, there was not strict reason to use sizeof " ". Any other character would do e.g.: sizeof "x", sizeof "A", etc.
Same problem:
ᕁ 𝓍 𝙭 х 𝐱 𝗑 ⤫ 𝑥 𝘅 ⤬ ᙮ ⨯ 𝕩 𝖝 × 𝔁 𝚡 x x ⅹ 𝔵 ᕽ 𝒙 𝘹
𝙰 A 𝐴 ᗅ 𝑨 𝚨 𝕬 𝖠 А Α 𝛢 𝝖 𝒜 𝜜 𖽀 𝓐 𝞐 A 𝗔 𝘈 Ꭺ ꓮ 𝐀 𝔄 𝔸 𐊠 𝘼
Check: https://util.unicode.org/UnicodeJsps/confusables.jsp
It is generally safer to use the explicit length.
You can also use dStr_append_c() to only append one character, so you only need a single character.
That would be an unnecessary use of the heap, because the size is static.
Not sure what you mean with "unnecessary use of the heap". I meant something like this:
static void generate_boundary(Dstr *boundary) { for (int i = 0; i < 70; i++) { /* Extracted from RFC 2046, section 5.1.1. */ static const char set[] = "abcdefghijklmnopqrstuvwxyz" "ABCDEFGHIJKLMNOPQRSTUVWXYZ" "0123456789"; int c;
do { c = rand() & 0xff; } while (!strchr(set, c));
dStr_append_c(boundary, c); } }
The variable c is stored in a register, as you can see in the disassembly (built with -Og):
hop% r2 build/src/dillo WARN: Relocs has not been applied. Please use `-e bin.relocs.apply=true` or `-e bin.cache=true` next time [0x000e3860]> s sym.generate_boundary_Dstr_ [0x0012cf99]> af [0x0012cf99]> pdf ┌ 85: sym.generate_boundary_Dstr_ (int64_t arg1); │ ; arg int64_t arg1 @ rdi │ 0x0012cf99 55 push rbp ; Fl_Pixmap.H:1250 ; generate_boundary(Dstr*) │ 0x0012cf9a 4889e5 mov rbp, rsp │ 0x0012cf9d 4155 push r13 │ 0x0012cf9f 4154 push r12 │ 0x0012cfa1 53 push rbx │ 0x0012cfa2 4883ec08 sub rsp, 8 │ 0x0012cfa6 4989fd mov r13, rdi ; arg1 │ 0x0012cfa9 41bc00000000 mov r12d, 0 ; Fl_Pixmap.H:1251 │ ┌─< 0x0012cfaf eb2c jmp 0x12cfdd │ ┌┌──> 0x0012cfb1 ff1509cd1e00 call qword [reloc.rand] ; Fl_Pixmap.H:1253 ; [0x319cc0:8]=0 │ ╎╎│ 0x0012cfb7 0fb6d8 movzx ebx, al ; <--- Perform the "& 0xff" by zero-extending the lowest byte in eax │ ╎╎│ 0x0012cfba 89de mov esi, ebx ; Fl_Pixmap.H:1260 │ ╎╎│ 0x0012cfbc 488d3dbd5b.. lea rdi, obj.generate_boundary_Dstr_::set ; 0x2a2b80 ; "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789" │ ╎╎│ 0x0012cfc3 ff15afc51e00 call qword [reloc.strchr] ; [0x319578:8]=0 │ ╎╎│ 0x0012cfc9 4885c0 test rax, rax │ └───< 0x0012cfcc 74e3 je 0x12cfb1 │ ╎│ 0x0012cfce 89de mov esi, ebx ; Fl_Pixmap.H:1262 │ ╎│ 0x0012cfd0 4c89ef mov rdi, r13 │ ╎│ 0x0012cfd3 67e830570800 call sym.dStr_append_c │ ╎│ 0x0012cfd9 4183c401 add r12d, 1 ; Fl_Pixmap.H:1251 │ ╎│ ; CODE XREF from generate_boundary(Dstr*) @ 0x12cfaf(x) │ ╎└─> 0x0012cfdd 4183fc45 cmp r12d, 0x45 ; 'E' │ └──< 0x0012cfe1 7ece jle 0x12cfb1 │ 0x0012cfe3 4883c408 add rsp, 8 ; Fl_Pixmap.H:1264 │ 0x0012cfe7 5b pop rbx │ 0x0012cfe8 415c pop r12 │ 0x0012cfea 415d pop r13 │ 0x0012cfec 5d pop rbp └ 0x0012cfed c3 ret
This method gives us a perfect uniform distribution when RAND_MAX is a power of 2, at the cost of executing rand() more times than required (70/(62/256) ≈ 289 on average). This other method has a slightly not uniform distribution, but only runs rand() 70 times:
static void generate_boundary(Dstr *boundary) { /* Extracted from RFC 2046, section 5.1.1. */ static const char set[] = "abcdefghijklmnopqrstuvwxyz" "ABCDEFGHIJKLMNOPQRSTUVWXYZ" "0123456789"; static const int n = strlen(set);
for (int i = 0; i < 70; i++) { int c = (unsigned char) set[rand() % n]; dStr_append_c(boundary, c); } }
Which is probably okay for this case.
Notice I haven't tested any of these methods yet, I'll need to add a form upload test case to be able to see them in action (or a unit test).
If we only use alphanumeric characters, we can just use isalnum() right?
According to POSIX.1-2017 [2], isalnum(3) depends on the current locale configured by the system. For example, characters such as Ä or ú could return non-zero.
Oh right, for some reason I was thinking this was the other way around, and isalnum only worked with ASCII. I think we should review the other uses of isalnum and friends as I think there may be used under similar assumptions.
To avoid this, there are two possible solutions:
1. Use isalnum_l(3) to specify a locale_t object corresponding to the "POSIX" locale (equivalent to "C" [3]), which must be previously allocated by the newlocale(3) function [3] and released by the freelocal(3) function [4]. A minimalist example is shown below:
locale_t l = newlocale(LC_CTYPE, "POSIX", NULL);
for (unsigned char i = 0; i < 255; i++) printf("hhu=%hhu, c=%c, isalnum=%d\n", i, i, isalnum_l(i, l));
freelocale(l);
2. Define a known subset from the portable character set defined by POSIX.1-2017 [5] and use strspn(3), as already suggested by the patch. IMHO this approach is better because: - It does not deal with locales, so developers not familiar with them would understand the code better. - It is also portable outside a POSIX environment (not sure if this a requirement, though). - It does not require dynamic allication via newlocale(3). - It is the only possible option if non-alnum characters, such as ':' or '/', are appended to the boundary string.
Then, I think using our own set may be the most readable solution here, which also allows the quick module method.
Best, Rodrigo _______________________________________________ Dillo-dev mailing list -- dillo-dev@mailman3.com To unsubscribe send an email to dillo-dev-leave@mailman3.com
Hi Xavier, On Sun, Sep 01, 2024 at 11:17:53PM +0200, Xavier Del Campo Romero wrote:
This other method has a slightly not uniform distribution, but only runs rand() 70 times:
Which is probably okay for this case.
I also think it is good enough.
Then, I think using our own set may be the most readable solution here, which also allows the quick module method.
I agree.
Please, take a look at this PR: https://github.com/dillo-browser/dillo/pull/256 Here is the raw patch series for "git am": https://github.com/dillo-browser/dillo/pull/256.patch It would be nice if you can test it with slcl to know it works fine. Best, Rodrigo.
participants (2)
-
Rodrigo Arias
-
Xavier Del Campo Romero