patch: the http header Content-Type half of charset conversion

place＠gobigwest.com

Jan. 10, 2008

7:10 p.m.

Dug out my old patch and ripped out the meta tag stuff, the charset-setting dialog, etc. This much should be unobjectionable. Had good fun searching for things like dillo ??????? and seeing how many sites using windows-1251 or koi8-r worked now with this much of the code. Maybe half. Greek and Hebrew were somewhat less than half. And my fonts don't have enough coverage to make it fun trying Asian sites. Of course it doesn't matter much when everybody who participates here is a native speaker of some latin-1 language... Note that View Source will not show files as converted because the conversion is up in html.cc and View Source uses the cache.


jcid＠dillo.org

January 2008

1:17 a.m.

On Thu, Jan 10, 2008 at 06:00:12PM +0000, place wrote:

...

Dug out my old patch and ripped out the meta tag stuff, the charset-setting dialog, etc. This much should be unobjectionable.

OK, committed. I added support for surrounding whitespace in Html_get_charset(), and added a call to dStr_fit(Local_Buf).
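For readers following along, here is a minimal sketch of what tolerating whitespace around the charset parameter of a Content-Type value can look like. It is not the committed Html_get_charset() code; the function name content_type_charset and its signature are hypothetical, and strncasecmp is assumed to be available (POSIX).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/*
 * Hypothetical helper, not Dillo's Html_get_charset(): copy the charset
 * parameter of a Content-Type value into out (at most outsz-1 bytes).
 * Returns 1 if a non-empty charset was found, 0 otherwise.
 */
int content_type_charset(const char *ctype, char *out, size_t outsz)
{
    const char *p = ctype;

    assert(ctype != NULL && out != NULL);
    if (outsz == 0)
        return 0;
    out[0] = '\0';

    /* Parameters follow the media type, each introduced by ';'. */
    while ((p = strchr(p, ';')) != NULL) {
        p++;
        while (isspace((unsigned char)*p))      /* whitespace after ';' */
            p++;
        if (strncasecmp(p, "charset", 7) != 0)
            continue;
        p += 7;
        while (isspace((unsigned char)*p))      /* whitespace before '=' */
            p++;
        if (*p != '=')
            continue;
        p++;
        while (isspace((unsigned char)*p))      /* whitespace after '=' */
            p++;
        if (*p == '"')                          /* optional opening quote */
            p++;

        /* Copy the value up to whitespace, ';', a quote, or the end. */
        size_t n = 0;
        while (*p && *p != ';' && *p != '"' &&
               !isspace((unsigned char)*p) && n + 1 < outsz)
            out[n++] = *p++;
        out[n] = '\0';
        return n > 0;
    }
    return 0;
}
```

With this sketch, a value such as "text/html; charset = koi8-r " yields "koi8-r", and a value with no charset parameter leaves the output empty.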

...

Had good fun searching for things like dillo ?????????????? and seeing how many sites using windows-1251 or koi8-r worked now with this much of the code. Maybe half. Greek and Hebrew were somewhat less than half. And my fonts don't have enough coverage to make it fun trying Asian sites. Of course it doesn't matter much when everybody who participates here is a native speaker of some latin-1 language...

How did you make the search? I managed to get a few windows-125[2-3] but nothing else. URLs, or a dummies guide to searching is appreciated. ;)

...

Note that View Source will not show files as converted because the conversion is up in html.cc and View Source uses the cache.

Actually, to me, it looks like the cache is a good place to make the conversion (instead of html). That way text, html and view source would work in a similar way:

    file ----
             \                     .-> text
              '->                 /
                  Cache      ---> html
              .-> (w/decoding)    \
             /                     '-> View Source
    web  ----

               .-> verbatim from source
              /
          Save
              \
               '-> utf-8 (removing meta charset if present)

For saving, a preference may be set (or a dialog popped up).

Maybe the most important part now is to try to get HTML's meta charset working, and only after that to choose where to place the code.

--
Cheers
Jorge.-
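As a rough illustration of converting at the cache stage, the sketch below turns a buffer in a declared charset into UTF-8 once, so that every client reading from the cache would see the same decoded bytes. It assumes POSIX iconv is available; the helper name to_utf8 and its interface are hypothetical and are not taken from the Dillo sources.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert len bytes of src from charset `from` to
 * UTF-8. Returns a malloc'ed, NUL-terminated buffer (caller frees) and
 * stores its length in *outlen, or returns NULL on failure.
 */
char *to_utf8(const char *from, const char *src, size_t len, size_t *outlen)
{
    assert(from != NULL && src != NULL && outlen != NULL);

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return NULL;                  /* charset unknown to this iconv */

    size_t cap = len * 4 + 1;         /* generous bound for UTF-8 output */
    char *dst = malloc(cap);
    if (dst == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in = (char *)src;           /* iconv() wants char **, input is only read */
    char *out = dst;
    size_t inleft = len, outleft = cap - 1;

    /* A single call suffices here because the whole input is in memory. */
    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
        free(dst);                    /* invalid or incomplete input sequence */
        iconv_close(cd);
        return NULL;
    }
    iconv_close(cd);

    *outlen = (size_t)(out - dst);
    dst[*outlen] = '\0';
    return dst;
}
```

A caller would pass the charset name taken from the HTTP header, for example to_utf8("ISO-8859-1", buf, buflen, &n). Note that this sketch simply fails on invalid input; whether to substitute a replacement character instead is a separate design choice.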

place＠gobigwest.com

11:44 a.m.

Jorge wrote:

...

How did you make the search? I managed to get a few windows-125[2-3] but nothing else. URLs, or a dummies guide to searching is appreciated. ;)

I wonder whether "???????" came through. I was careful to send the msg as UTF-8, but I neglected to check the copy that came back via the list to see whether anything munged it up along the way.

The third hit on the first page via google gives me a koi8-r page for a charset patch to 0.6.6.

http://www.google.com/search?q=dillo+%D0%B1%D1%80%D0%B0%D1%83%D0%B7%D0%B5%D1...

I just stumbled upon the fact that google itself is a decent source.

www.google.cn is GB2312 (google, working to crush dissent and make information disappear. Big money is the only morality, after all.)
www.google.gr is ISO-8859-7
www.google.lt is windows-1257

Took me a while this time to find any Hebrew. pc.co.il is windows-1255. Displayed left-to-right, but it's a start.

...

Actually, to me, it looks like the cache is a good place to make the conversion (instead of html). That way text, html and view source would work in a similar way:

I think you're right. When I started it, I had the idea that cache was going to be "pure" and not know about what it was sending to clients, but, yeah, it does need to know.

...

Maybe the most important part now is to try to get HTML's meta charset working, and only after that to choose where to place the code.

Putting aside my earlier concerns about javascript, since it's not like javascript support is imminent, let's see...

http says it's charset A. cache.c translates A->utf8. meta says it's charset B. html.cc translates utf8->A followed by B->utf8, except that won't work because the utf8 that html.cc received is probably full of U+FFFD characters and things. And badly-written pages will already be displaying text and making buttons and tooltips and so on by the time the meta tag is reached.
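To make the round-trip problem concrete, here is a small sketch of a lossy decode. It assumes, purely for illustration, that charset A is US-ASCII and that the decoder substitutes U+FFFD for any byte it cannot map; the function ascii_to_utf8_lossy is hypothetical and stands in for whatever the first conversion stage would do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical decoder for illustration: treat src as US-ASCII and emit
 * UTF-8, replacing every byte >= 0x80 with U+FFFD (bytes EF BF BD).
 * dst must hold at least 3 * len + 1 bytes. Returns the output length.
 */
size_t ascii_to_utf8_lossy(const unsigned char *src, size_t len, char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        if (src[i] < 0x80) {
            dst[n++] = (char)src[i];            /* ASCII passes through */
        } else {
            memcpy(dst + n, "\xEF\xBF\xBD", 3); /* U+FFFD in UTF-8 */
            n += 3;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Two different source bytes, say 0xC2 and 0xD2, both come out as the same three-byte U+FFFD sequence, so no later utf8->A step can tell them apart and recover the original bytes for a second B->utf8 pass.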

place＠gobigwest.com

9 p.m.

Jorge wrote:

...

How did you make the search? I managed to get a few windows-125[2-3] but nothing else. URLs, or a dummies guide to searching is appreciated. ;)

Why google sucks, part n: UTF-8 queries work from the dialog, and I can click page 2, page 3, etc., but it turns out that if I use their search input, they _insist_ upon sending <input type=hidden name=ie value="ISO-8859-1"> unless I change the user-agent string. So what if my accept-charset is utf-8 (even tried removing iso-8859-1)? Doesn't matter. It must think it knows what dillo can do, but it is tragically mistaken. Maybe the proper solution is to allow hidden inputs to be "unhidden" as text inputs. Probably wouldn't be the only place where it would be useful...
