Jorge wrote:
On Fri, May 02, 2008 at 08:02:14PM +0000, corvid wrote:
I've been playing around with this now and then over the past few months when I haven't had anything else going on. I've been running on it for a day or so without incident, but I haven't gone through it with a fine-toothed comb because I know there will be changes.
It's 35K, so I realize I'm in little danger of hearing back too quickly :) Besides, I know you said you have some nbsp thing that needs attention...
I've done a second review of the code, and it doesn't look that bad to me. That is, the decoding in the cache is acceptable, even the stuffed cache entry, if it can be better isolated/tidied up in the future.
hurray!
The part that looks inconvenient to me is the charset-choosing dialog. I'd prefer Dillo to do this automatically. There's no need to try to recognize every charset in the world; just a few of the most used ones, like koi8-r, would do. BTW, UTF-8 has become the preferred solution.
For one thing, I was thinking of a site I knew of with a Russian forum, where the message-board software was too limited to handle the charset, so they told people to manually switch each page in that section to windows-1251 or something like that. But okay. That's obviously not any sort of fundamental requirement. (And because I liked switching to text/plain to get the source in a proper browser window (bw).)
The idea is that we can leave room for an expandable function just in case. BTW, there's an improved version of the data guesser in the ftp dpi. Please send me a koi8-r encoded page to test (or just let me know how to find one with google :).
Here's one that's apparently about 0.7.0: http://gazette.linux.ru.net/lg79/arndt.html
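And in case it helps pin down what "automatically" could mean here: the obvious first step is to honor any charset parameter the server already sends, before doing any guessing. A rough standalone sketch (not Dillo code; the helper name is made up, and a real version would match the parameter name case-insensitively):

#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'd copy of the charset parameter
 * in a Content-Type header value, or NULL if it is absent.
 * e.g. "text/html; charset=koi8-r"  ->  "koi8-r" */
static char *get_charset_param(const char *content_type)
{
   const char *p = strstr(content_type, "charset=");
   char *ret;
   size_t len = 0;

   if (!p)
      return NULL;
   p += strlen("charset=");
   if (*p == '"')                       /* the value may be quoted */
      ++p;
   while (p[len] && p[len] != '"' && p[len] != ';' &&
          !isspace((unsigned char)p[len]))
      ++len;
   if (!len)
      return NULL;
   ret = malloc(len + 1);
   if (ret) {
      memcpy(ret, p, len);
      ret[len] = '\0';
   }
   return ret;
}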
IIRC you sent me an explanation message on what this patch was for and remaining issues. I can't find it :(. Please resend it to me.
Oh, it was just... let's see...
"> BTW, please explain what is/are the main point(s) of it. After
> reading the comment it was a bit scary to apply. :-o
It's not for applying. It's "here's what I was able to come up with to get encoding translation for page source and plain text, but there are obviously issues, so I need suggestions for... how to do what needs to be done in the most dillolike manner possible" or... or something like that..."
There are plenty of issues. Going in order through the diff:
- *_get_content_type(). I searched for some Russian text files where a charset is given in the HTTP header. None of them were regarded as text by the old test for ASCII chars. Maybe the way to go is to have tests for more filetypes and treat unknown as text. Checking for all possible charsets would be no fun, anyway. (Tests for more filetypes might be needed for file uploading, as well.)
The problem with "unknown as text" is that some HTTP servers send downloadable files as text. It is quite practical in this case to get a download dialog from Dillo instead of a rendering of the binary data (especially for newbies). That's why the current data guesser is conservative.
Quite true.
The data guesser can be improved, e.g. by taking ASCII control characters as a strong indication of a binary file, and their absence as a text hint. I can give it a look after you send me the files.
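Something along these lines is roughly what I pictured (a standalone sketch with a made-up name, not the ftp dpi's guesser):

#include <stddef.h>

/* Hypothetical guesser: return 1 if the buffer looks like text,
 * 0 if it looks binary.  A NUL or any other C0 control character
 * (apart from HT, LF, FF, CR) is taken as a strong binary hint;
 * high-bit bytes are allowed, so koi8-r, windows-1251, UTF-8 etc.
 * still pass as text. */
static int looks_like_text(const unsigned char *buf, size_t len)
{
   size_t i;

   for (i = 0; i < len; ++i) {
      unsigned char c = buf[i];

      if (c == '\0')
         return 0;               /* embedded NUL: almost surely binary */
      if (c < 0x20 && c != '\t' && c != '\n' && c != '\f' && c != '\r')
         return 0;               /* other control chars: binary hint */
   }
   return 1;                     /* no control chars: treat as text */
}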
- CacheEntry_t now has a horrifying amount of junk in it. Maybe it would be good to pull out about half of it into some sort of Data thing.
It's stuffed but still understandable. If it can be tidied in a second pass, so much the better.
o TypeSet to be set from Meta or a user dialog. Meta can set it once; the dialog can set it repeatedly.
I'd like to avoid that dialog if possible.
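Just to spell out what I meant by that field, here is the rule in miniature (toy names, not the patch's code; if the dialog goes away, the second setter goes with it):

/* Toy illustration of the TypeSet rule: a META tag may set the
 * content type only while it is still unset, while a user override
 * may replace it any number of times. */
enum { TYPE_UNSET, TYPE_FROM_META, TYPE_FROM_USER };

struct type_sketch {
   const char *content_type;
   int TypeSet;
};

static void set_type_from_meta(struct type_sketch *t, const char *ctype)
{
   if (t->TypeSet == TYPE_UNSET) {      /* META only gets one shot */
      t->content_type = ctype;
      t->TypeSet = TYPE_FROM_META;
   }
}

static void set_type_from_user(struct type_sketch *t, const char *ctype)
{
   t->content_type = ctype;             /* a user choice always wins */
   t->TypeSet = TYPE_FROM_USER;
}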
o UTF8Data (LocalData?) that's been translated. cache.c has a lot of "entry->UTF8Data ? entry->UTF8Data : entry->Data" in it now that perhaps should be hidden behind something.
Yes.
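Something as small as this would hide it; a stand-in struct rather than the real CacheEntry_t, with plain pointers where the real code has its own buffer type:

/* Toy stand-in for the patched cache entry: just the two buffers
 * in question. */
typedef struct {
   char *Data;      /* raw data as received */
   char *UTF8Data;  /* decoded copy, or NULL if no decoding was done */
} EntrySketch;

/* Hypothetical accessor replacing the repeated ternary in cache.c. */
static char *Entry_current_data(const EntrySketch *entry)
{
   return entry->UTF8Data ? entry->UTF8Data : entry->Data;
}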
o DataReaders reference count. UTF8Data is destroyed when not needed.
o CharsetDecoder
- a_Cache_unref_buf, a_Capi_unref_buf, a_Nav_unref_buf all just to provide notification that the view source window isn't reading it anymore. Ugh.
Not pretty, but tolerable.
I'm thinking along the lines of getting the core of the patch committed and then polishing the rough edges.
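The reference counting itself needn't be fancy; roughly this (invented names again, plain pointers instead of the real buffer type):

#include <stdlib.h>

/* Toy version of the DataReaders idea: every consumer of the decoded
 * buffer (the parser, a view-source window, ...) takes a reference,
 * and when the last one lets go only the translated copy is freed;
 * the raw Data stays in the cache as before. */
struct reader_sketch {
   char *Data;          /* raw data, owned by the cache */
   char *UTF8Data;      /* decoded copy, or NULL */
   int DataReaders;     /* current consumers of UTF8Data */
};

static void sketch_ref(struct reader_sketch *e)
{
   e->DataReaders++;
}

static void sketch_unref(struct reader_sketch *e)
{
   if (--e->DataReaders == 0 && e->UTF8Data) {
      free(e->UTF8Data);
      e->UTF8Data = NULL;
   }
}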
- a_Cache_[gs]et_content_type and capi and nav. Ugh again. I don't know whether there's a timing issue with changing decoders due to a meta tag at the same time that a view source window is being opened.
Yes there is.
It's simple to solve by allowing view source only after the HEAD element has been parsed.
Do you think badly-formed html with, e.g., <head>..</head><body>...</body><head><meta..charset..></head> <body>...</body> is worth worrying about?
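Either way, I take it the "after HEAD only" rule amounts to something like this (toy code, made-up names):

/* Toy version of the proposed rule: view source is refused until the
 * parser has finished HEAD, so a META charset can no longer switch
 * decoders underneath the new window. */
struct page_sketch {
   int HeadParsed;      /* set once </head> has been seen */
};

static int view_source_allowed(const struct page_sketch *page)
{
   return page->HeadParsed;
}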
(Why content-type and not charset? I find switching to text/plain to use findtext on the source to be useful.)
Sorry, I don't understand the question.
I was saying that I was pushing around content-type instead of charset because I liked changing it to text/plain in the dialog. If there's no dialog, maybe I'll go back to charset.
- Decode_charset() no longer free()s its input. This inconsistency would be an excellent source of bugs, so it would probably be good to make it so that none of the decoders frees its input.
Yes, this is a must. Consistency should be our goal/horizon/friend/helper. :-)
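For the record, the convention I'd go with is caller-owns-input for every decoder, i.e. something shaped like this (identity "decoding" standing in for the real conversion; the name is made up, not the actual Decode_* functions):

#include <stdlib.h>
#include <string.h>

/* Sketch of the caller-owns-input convention: the decoder never
 * free()s what it is given; it returns a newly allocated buffer and
 * the caller disposes of both input and output. */
static char *decode_sketch(const char *input, size_t len, size_t *out_len)
{
   char *out = malloc(len);     /* identity "decoding" for illustration */

   if (out) {
      memcpy(out, input, len);
      *out_len = len;
   }
   return out;   /* caller frees this; input is left untouched */
}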