Rodrigo Arias wrote:
On Mon, Dec 30, 2024 at 05:35:50PM +0100, a1ex-J7K0XVabL0iELgA04lAiVw@public.gmane.org wrote:
There was an interesting post[1] on HN today about 'curl-impersonate', a patch[2] to curl that lets it act like various big browsers, bypassing fingerprinting techniques that would otherwise prevent the client from accessing the page.
Looking at the patch, maybe there could be some useful ideas here for Dillo to use to load more sites. The SSL library also obviously plays a large role, maybe that's something we will need to consider as well.
I have experienced problems with the user-agent being banned, and have had to impersonate Firefox to load some sites. I haven't yet found examples of this deep fingerprinting for TLS or similar, have you?
In any case, it would be trivial to detect Dillo since we don't support JS, so sites can ban it if they decide to.
I've found that sometimes I go to a webpage and see one of the "enable Javascript to continue" pages in Dillo, then I load the same page in Firefox with NoScript blocking all its scripts, and it comes up fine without running any such Javascript. That could be just the User-Agent header though because I don't try faking that.
Rather than add Chrome-faking features to Dillo, maybe this would be an extra application of the Rule-based content manipulation RFC:
https://github.com/dillo-browser/rfc/blob/rfc-002/rfc-002-rule-based-content...
Make a rule for some sites (or Web server responses?) that has Dillo call curl-impersonate to retrieve a Web page instead of doing it in Dillo?
By the way, being a Git failure, I really can't see where that MD document lives. I look at the "rfc" repo via the GitHub website in Dillo and there's just a readme. I clone the repo and I just get a readme. I had to look back to your RFC repo announcement to find that link. I guess they're in separate branches or something but I forget things about Git faster than I learn them and can't be bothered learning how to use branches yet again today. I really think it would be better to list them together somewhere obvious, eg. a new Developer Documentation webpage.
I can see from this URL mangling that there are probably only two RFCs so far:
https://github.com/dillo-browser/rfc/tree/rfc-001/ (rfc-001-dillo-rfc-documents.md)
https://github.com/dillo-browser/rfc/tree/rfc-002/ (rfc-002-rule-based-content-manipulation.md)
https://github.com/dillo-browser/rfc/tree/rfc-003/ (404)
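To make the rule idea above a bit more concrete, here is a rough sketch of how a per-site rule could pick an external fetcher. This is not how Dillo or the RFC works; the rule table, the `pick_fetcher` name and the `fetch-impersonated` command are all made up for illustration, and the real curl-impersonate command name would have to be filled in.

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit

# Hypothetical rule table: host pattern -> external command (argv prefix).
# "fetch-impersonated" is a placeholder, not a real program name.
RULES = [
    ("*.example.gov", ["fetch-impersonated"]),
]


def pick_fetcher(url, rules=RULES):
    """Return the argv list of the external fetcher for url,
    or None if the browser should load the page itself."""
    host = urlsplit(url).hostname or ""
    for pattern, command in rules:
        if fnmatch(host, pattern):
            # Build a new list so the rule table is never modified.
            return command + [url]
    return None
```

A caller would run the returned command (for example with `subprocess.run`) and feed its output to the renderer, falling back to the normal loader when `pick_fetcher` returns None.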
In my experience, it is generally not worth reading websites that perform this type of discrimination.
That's often my approach, but then big offenders are things like government websites which one is obliged to read sometimes.