Hi,

There was an interesting post[1] on HN today about 'curl-impersonate', which is a patch[2] to curl which allows it to act like various big browsers, bypassing various fingerprinting techniques which would otherwise prevent the client from accessing the page.

Looking at the patch, maybe there could be some useful ideas here for Dillo to use to load more sites. The SSL library also obviously plays a large role, maybe that's something we will need to consider as well.

-Alex

[1] https://news.ycombinator.com/item?id=42547820
[2] https://github.com/lexiforest/curl-impersonate/blob/main/chrome/patches/curl...
Hi Alex,

On Mon, Dec 30, 2024 at 05:35:50PM +0100, a1ex@dismail.de wrote:
Hi,
There was an interesting post[1] on HN today about 'curl-impersonate', which is a patch[2] to curl which allows it to act like various big browsers, bypassing various fingerprinting techniques which would otherwise prevent the client from accessing the page.
Looking at the patch, maybe there could be some useful ideas here for Dillo to use to load more sites. The SSL library also obviously plays a large role, maybe that's something we will need to consider as well.
I experienced problems with the user-agent being banned, and having to impersonate Firefox to load some sites. I haven't yet found examples of this deep fingerprinting for TLS or similar; have you?

In any case, it would be trivial to detect Dillo, as we don't support JS, so it can be banned if they so decide.

In my experience, it is generally not worth reading a website that performs this type of discrimination.

Best,
Rodrigo.
Rodrigo Arias wrote:
On Mon, Dec 30, 2024 at 05:35:50PM +0100, a1ex@dismail.de wrote:
There was an interesting post[1] on HN today about 'curl-impersonate', which is a patch[2] to curl which allows it to act like various big browsers, bypassing various fingerprinting techniques which would otherwise prevent the client from accessing the page.
Looking at the patch, maybe there could be some useful ideas here for Dillo to use to load more sites. The SSL library also obviously plays a large role, maybe that's something we will need to consider as well.
I experienced problems with the user-agent being banned, and having to impersonate Firefox to load some sites. I haven't yet found examples of this deep fingerprinting for TLS or similar; have you?
In any case, it would be trivial to detect Dillo, as we don't support JS, so it can be banned if they so decide.
I've found that sometimes I go to a webpage and see one of the "enable Javascript to continue" pages in Dillo, then I load the same page in Firefox with NoScript blocking all its scripts, and it comes up fine without running any such Javascript. That could be just the User-Agent header, though, because I don't try faking that.

Rather than add Chrome-faking features to Dillo, maybe this would be an extra application of the Rule-based content manipulation RFC:

https://github.com/dillo-browser/rfc/blob/rfc-002/rfc-002-rule-based-content...

Make a rule for some sites (or Web server responses?) that has Dillo call curl-impersonate to retrieve a Web page instead of doing it in Dillo?

By the way, being a Git failure, I really can't see where that MD document lives. I look at the "rfc" repo via the GitHub website in Dillo and there's just a readme. I clone the repo and I just get a readme. I had to look back to your RFC repo announcement to find that link. I guess they're in separate branches or something, but I forget things about Git faster than I learn them and can't be bothered learning how to use branches yet again today. I really think it would be better to list them together somewhere obvious, eg. a new Developer Documentation webpage.

I can see from this URL mangling that there are probably only two RFCs so far:

https://github.com/dillo-browser/rfc/tree/rfc-001/ (rfc-001-dillo-rfc-documents.md)
https://github.com/dillo-browser/rfc/tree/rfc-002/ (rfc-002-rule-based-content-manipulation.md)
https://github.com/dillo-browser/rfc/tree/rfc-003/ (404)
In my experience, it is generally not worth reading a website that performs this type of discrimination.
That's often my approach, but then big offenders are things like government websites, which one is sometimes obliged to read.
Hi Kevin,

On Tue, Dec 31, 2024 at 09:57:34AM +1100, Kevin Koster wrote:
I've found that sometimes I go to a webpage and see one of the "enable Javascript to continue" pages in Dillo, then I load the same page in Firefox with NoScript blocking all its scripts, and it comes up fine without running any such Javascript. That could be just the User-Agent header though because I don't try faking that.
Could this be happening because you have at some point solved a captcha that stored a cookie in Firefox, while Dillo doesn't load the JS to solve the captcha?

It would be nice to have a test case so I can reproduce this myself too.
Rather than add Chrome-faking features to Dillo, maybe this would be an extra application of the Rule-based content manipulation RFC: https://github.com/dillo-browser/rfc/blob/rfc-002/rfc-002-rule-based-content...
Make a rule for some sites (or Web server responses?) that has Dillo call curl-impersonate to retrieve a Web page instead of doing it in Dillo?
I believe this could be doable, yes.
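As a rough illustration of the rule idea (everything here is a hypothetical sketch, not Dillo code: the rule set and function names are made up, and `curl_chrome116` is one of the wrapper scripts the curl-impersonate project ships), a helper could pick the fetch command per host:

```python
# Hypothetical sketch: choose the fetch command per host, so a rule could
# route selected sites through curl-impersonate instead of Dillo's own
# HTTP code. The rule set and wrapper name are illustrative assumptions.
from urllib.parse import urlparse

# Example rule set: hosts known to block Dillo's native fetcher.
IMPERSONATE_HOSTS = {"www.autosurplus.com.au"}

def build_fetch_command(url: str) -> list[str]:
    host = urlparse(url).hostname
    if host in IMPERSONATE_HOSTS:
        # curl_chrome116 is one of the wrappers shipped by
        # curl-impersonate; it mimics Chrome's TLS/HTTP fingerprint.
        return ["curl_chrome116", "-s", url]
    return ["curl", "-s", url]
```

The resulting command list could then be run with subprocess and the body handed to the renderer, keeping the impersonation logic outside Dillo itself.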
By the way, being a Git failure, I really can't see where that MD document lives. I look at the "rfc" repo via the GitHub website in Dillo and there's just a readme. I clone the repo and I just get a readme. I had to look back to your RFC repo announcement to find that link. I guess they're in separate branches or something but I forget things about Git faster than I learn them and can't be bothered learning how to use branches yet again today. I really think it would be better to list them together somewhere obvious, eg. a new Developer Documentation webpage.
Yes, I'm still thinking about what a good organization could be. At first I was writing the proposals in Markdown, but I'm considering using HTML directly so we can do custom things.

I plan to make the RFCs available on the website, as soon as I think of a way to automatically render them. From Dillo it is hard to find the RFCs, as the GitHub UI doesn't work.
I can see from this URL mangling that there are probably only two RFCs so far:

https://github.com/dillo-browser/rfc/tree/rfc-001/ (rfc-001-dillo-rfc-documents.md)
https://github.com/dillo-browser/rfc/tree/rfc-002/ (rfc-002-rule-based-content-manipulation.md)
https://github.com/dillo-browser/rfc/tree/rfc-003/ (404)
I just added this one today to add support for UNIX sockets in URLs:

https://dillo-browser.github.io/rfc/003-unix-sockets/

But I haven't uploaded it to the RFC repository yet.

Best,
Rodrigo.
Rodrigo Arias wrote:
On Tue, Dec 31, 2024 at 09:57:34AM +1100, Kevin Koster wrote:
I've found that sometimes I go to a webpage and see one of the "enable Javascript to continue" pages in Dillo, then I load the same page in Firefox with NoScript blocking all its scripts, and it comes up fine without running any such Javascript. That could be just the User-Agent header though because I don't try faking that.
Could this be happening because you have at some point solved a captcha that stored a cookie in Firefox, while Dillo doesn't load the JS to solve the captcha?
No, Firefox is set to delete all cookies upon shutdown (which is done frequently), and it seems to do so.
It would be nice to have a test case so I can reproduce this myself too.
OK, here's one: https://www.autosurplus.com.au/?rf=kw&kw=Land+Cruiser

The homepage seems to load in Dillo, but searches like that URL don't, except sometimes _after_ loading the same URL in Firefox, or maybe just waiting a long time (the three-minute page auto-refresh duration?). I could spend all day trying to narrow down the exact behaviour, but it _seems_ to always work in Firefox.

That auto-refreshing page titled "Just a moment..." and saying "Enable JavaScript and cookies to continue" is typical. Lots of sites use that same anti-scraping/DDoS-protection service. Based on this string in the source code: "/cdn-cgi/challenge-platform/h/b/orchestrate/chl_page/v1" (about the most identifiable thing I could see in the noise), it looks like this is from CloudFlare:

https://github.com/scaredos/cfresearch/blob/master/README.md

No CloudFlare URLs are set to be allowed through NoScript. But none actually show in the list of blocked scripts for the page when it's loaded in Firefox either - the CloudFlare JS seems to have been bypassed.

I also get intermittently blocked by one website in Firefox (gumtree.com.au, apparently not using CloudFlare) even with scripts enabled. That doesn't happen if I use the same FF version and profile on a different internet connection, so something about some of the IP addresses that my ISP dynamically assigns looks suspicious to some protection service they use. That website requires JS itself, so Dillo isn't an option anyway, but it shows I might have trouble that doesn't happen to people with better-trusted IPs.

But, having said all that, I want to point out that getting deep into tricks to work around CloudFlare or other such services is a rabbit hole that I wouldn't ask anyone to go down.
If some people enjoy doing that, great, but it's probably better isolated to a separate project like curl-impersonate, rather than distracting from the challenge of implementing Web standards for rendering sites that aren't actively trying to block Dillo. Ideally, lots more people then start using Dillo, and CloudFlare or their users are motivated to fix this at their end (hence why I set an accurate User-Agent in Dillo, so Web admins can see it in their logs). Yes, in reality that ship sailed many years ago, but I think it's still the only practical approach.
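For what it's worth, the challenge interstitial described above can be recognized mechanically. A minimal sketch, with the marker strings taken from this thread and the function name purely hypothetical:

```python
# Hypothetical helper: spot Cloudflare's JS-challenge interstitial in a
# fetched HTML body, using the markers observed in this thread.
CHALLENGE_MARKERS = (
    "Just a moment...",
    "Enable JavaScript and cookies to continue",
    "/cdn-cgi/challenge-platform/",
)

def looks_like_cf_challenge(html: str) -> bool:
    """Return True if the body looks like a challenge page, not content."""
    return any(marker in html for marker in CHALLENGE_MARKERS)
```

Something like this could at least let a client distinguish "page blocked by a challenge" from "page rendered wrong", e.g. to retry later or report a clearer error.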
By the way, being a Git failure, I really can't see where that MD document lives. I look at the "rfc" repo via the GitHub website in Dillo and there's just a readme. I clone the repo and I just get a readme. I had to look back to your RFC repo announcement to find that link. I guess they're in separate branches or something but I forget things about Git faster than I learn them and can't be bothered learning how to use branches yet again today. I really think it would be better to list them together somewhere obvious, eg. a new Developer Documentation webpage.
Yes, I'm still thinking about what a good organization could be. At first I was writing the proposals in Markdown, but I'm considering using HTML directly so we can do custom things.
I plan to make the RFCs available on the website, as soon as I think of a way to automatically render them.
OK, great. I guess a Wiki with write access limited to the Dillo maintainers would be another option. Fully Dillo-compatible wikis are probably thin on the ground, though (I use Wikepage partly for this reason, but it hasn't had new development since 2008).
I can see from this URL mangling that there are probably only two RFCs so far:

https://github.com/dillo-browser/rfc/tree/rfc-001/ (rfc-001-dillo-rfc-documents.md)
https://github.com/dillo-browser/rfc/tree/rfc-002/ (rfc-002-rule-based-content-manipulation.md)
https://github.com/dillo-browser/rfc/tree/rfc-003/ (404)
I just added this one today to add support for UNIX sockets in URLs:
Great!
Hi,

On Sun, Jan 05, 2025 at 08:56:44AM -0300, Kevin Koster wrote:
OK, here's one: https://www.autosurplus.com.au/?rf=kw&kw=Land+Cruiser
The homepage seems to load in Dillo, but searches like that URL don't, except sometimes _after_ loading the same URL in Firefox, or maybe just waiting a long time (the three minute page auto-refresh duration?). I could spend all day trying to narrow down the exact behaviour, but it _seems_ to always work in Firefox.
Try this in dillorc:

http_user_agent="Mozilla/5.0 (PSP (PlayStation Portable); 2.00)"

Source: https://news.ycombinator.com/item?id=38852310

Best,
Rodrigo.
Rodrigo Arias wrote:
On Sun, Jan 05, 2025 at 08:56:44AM -0300, Kevin Koster wrote:
OK, here's one: https://www.autosurplus.com.au/?rf=kw&kw=Land+Cruiser
The homepage seems to load in Dillo, but searches like that URL don't, except sometimes _after_ loading the same URL in Firefox, or maybe just waiting a long time (the three minute page auto-refresh duration?). I could spend all day trying to narrow down the exact behaviour, but it _seems_ to always work in Firefox.
Try this in dillorc:
http_user_agent="Mozilla/5.0 (PSP (PlayStation Portable); 2.00)"
Same result (and I checked that the User-Agent really changed). Reloading after a few minutes does seem to work though. For that particular website anyway. Does it happen for you?
Hi,

On Sun, Jan 05, 2025 at 10:54:39AM -0300, Kevin Koster wrote:
Try this in dillorc:
http_user_agent="Mozilla/5.0 (PSP (PlayStation Portable); 2.00)"
Same result (and I checked that the User-Agent really changed).
Reloading after a few minutes does seem to work though. For that particular website anyway.
Does it happen for you?
I just checked once and I thought it always worked, but it doesn't seem so. For me, it looks like there is about a 1/2 chance it works. I'm not sure what is triggering the challenge, but it is related to CloudFlare.

Opening this link instead seems to always work for me, regardless of the Dillo user agent:

https://www.autosurplus.com.au/?kw=Land+Cruiser

I'm not entirely sure what criteria they are using. Mimicking Firefox/Chrome may work, but only temporarily, until bots do the same. A long-term approach is something like Privacy Pass, but I see several red flags in that proposal:

https://privacypass.github.io/

Best,
Rodrigo.
Rodrigo Arias wrote:
On Sun, Jan 05, 2025 at 10:54:39AM -0300, Kevin Koster wrote:
Try this in dillorc:
http_user_agent="Mozilla/5.0 (PSP (PlayStation Portable); 2.00)"
Same result (and I checked that the User-Agent really changed).
Reloading after a few minutes does seem to work though. For that particular website anyway.
Does it happen for you?
I just checked once and I thought it always worked, but it doesn't seem so. For me, it looks like there is about a 1/2 chance it works. I'm not sure what is triggering the challenge, but it is related to CloudFlare.
Opening this link instead seems to always work for me, regardless of the Dillo user agent:
Yes that seems to work for me too, strangely. Anyway that particular URL was just a random example I found in Firefox's history.
I'm not entirely sure what criteria they are using. Mimicking Firefox/Chrome may work, but only temporarily, until bots do the same. A long-term approach is something like Privacy Pass, but I see several red flags in that proposal:
It seems unlikely that it will be tied to a JS-free option for the initial "proof-of-work" challenge. But yes, it might be useful after completing the first challenge in Firefox, if switching browsers can work in practice.
Hi Kevin,

On Wed, Jan 01, 2025 at 12:00:32PM +1100, Kevin Koster wrote:
Yes that seems to work for me too, strangely. Anyway that particular URL was just a random example I found in Firefox's history.
I see; here is a similar thread on the Palemoon forum:

https://forum.palemoon.org/viewtopic.php?f=3&t=31339
It seems unlikely that it will be tied to a JS-free option for the initial "proof-of-work" challenge. But yes, it might be useful after completing the first challenge in Firefox, if switching browsers can work in practice.
Privacy Pass is not tied to a particular challenge. Cloudflare now supports a JS captcha, but other non-JS challenges may be added in the future.

On the other hand, this also opens the door to Apple enforcing that only non-modified iOS systems may access a website, which is just pure evil; see:

https://educatedguesswork.org/posts/private-access-tokens/

I think the idea of private tokens may eventually solve the spam problem, but in its current form I think it is harmful.

Best,
Rodrigo.
Greetings, and Happy New Year!

Client side
-----------
Some check/restriction pages are pseudo and can be bypassed with a refresh. See:

https://greasyfork.org/en/scripts/493323-xhreload

It can also work by clicking the back and forward buttons in some browsers.

Server side
-----------
However, some are server side, and therefore the so-called "PassKey" may be required, which is the only reason that I have FF installed, until it is available for Falkon.

fixoutlook.org
--------------
As was done with Outlook, which foisted a modified HTML format that could not be displayed by other email clients, I would suggest organizing a similar campaign, or better yet, a boycott. It was "fixoutlook.org" and later redirected to "email-standards.org". I sense that it was a form of controlled opposition in order to make a benevolent display or distraction to the public.

Boycott
-------
Therefore, I suggest beginning a boycott campaign, and consequently teaching and informing people of recommended means to publish and serve contents on the internet, including utilization of P2P file-sharing software.

It might be too radical to you, yet I am a Jew of the orthodox Jewish sector, and that sector has mostly inner communication exchanges and less with other sectors and authorities (i.e. government). So the practical means for them is to boycott and inform others of recommended commercial or cultural practices, if necessary. As a Jew, I have been a part of such a community of well over 100K for almost a decade, and I highly recommend trying that practice of boycott.

Happy New Year,
Schimon

On Wed, 1 Jan 2025 14:24:05 +0100 Rodrigo Arias <rodarima@gmail.com> wrote:
Hi Kevin,
On Wed, Jan 01, 2025 at 12:00:32PM +1100, Kevin Koster wrote:
Yes that seems to work for me too, strangely. Anyway that particular URL was just a random example I found in Firefox's history.
I see, here is a similar thread in Palemoon: https://forum.palemoon.org/viewtopic.php?f=3&t=31339
It seems unlikely that it will be tied to a JS-free option for the initial "proof-of-work" challenge. But yes, it might be useful after completing the first challenge in Firefox, if switching browsers can work in practice.
Privacy Pass is not tied to a particular challenge. Cloudflare now supports a JS captcha, but other non-JS challenges may be added in the future. On the other hand, this also opens the door to Apple enforcing that only non-modified iOS systems may access a website, which is just pure evil; see:

https://educatedguesswork.org/posts/private-access-tokens/
I think the idea of private tokens may eventually solve the spam problem, but in its current form I think it is harmful.
Best,
Rodrigo.

_______________________________________________
Dillo-dev mailing list -- dillo-dev@mailman3.com
To unsubscribe send an email to dillo-dev-leave@mailman3.com
participants (4)

- a1ex@dismail.de
- Kevin Koster
- Rodrigo Arias
- Schimon Jehudah