Hi, On Fri, May 02, 2008 at 08:02:14PM +0000, corvid wrote:
I've been playing around with this now and then over the past few months when I haven't had anything else going on. I've been running on it for a day or so without incident, but I haven't gone through it with a fine-toothed comb because I know there will be changes.
It's 35K, so I realize I'm in little danger of hearing back too quickly :) Besides, I know you said you have some nbsp thing that needs attention...
I've made a second review over the code, and it doesn't look that bad to me. This is, the decoding in cache, even the stuffed cache-entry; if it can be isolated/tidied in the future better. The part that looks inconvenient to me is the charset choosing dialog. I'd prefer Dillo do this automatically. There's no need to try to recognize every charset in the world, but just a few "most used" ones like koi8-r. BTW, UTF-8 has become the preferred solution. The idea is that we can leave space for a expandable function just in case. BTW, there's an improved version of the data guesser in the ftp dpi. Please send me a koi8-r encoded page to test (or just let me know how to find it with google :). IIRC you sent me an explanation message on what this patch was for and remaining issues. I can't find it :(. Please resend it to me.
There are plenty of issues. Going in order through the diff:
- *_get_content_type(). I searched for some Russian text files where a charset is given in the HTTP header. None of them were regarded as text by the old test for ascii chars. Maybe the way to go is to have tests for more filetypes and treat unknown as text. Checking for all possible charsets would be no fun, anyway. (Tests for more filetypes might be needed for file uploading, as well.)
The problem with "unknown as text" is that some HTTP servers send downloadable files as text. It is quite practical in this case to get a download dialog from dillo, intead of a binary rendering (especially for newbies). That's why the current data guesser is conservative. The data guesser can be improved. e.g. taking ASCCI control characters as a strong indication of a binary, and their absence as a text hint. I can give it a look after you send me the files.
- CacheEntry_t now has a horrifying amount of junk in it. Maybe it would be good to pull out about half of it into some sort of Data thing.
It's stuffed but still understandable. If it can be tidied in a second pass, better.
o TypeSet to be set from Meta or a user dialog. Meta can set it once; the dialog can set it repeatedly.
I'd like to avoid that dialog if possible.
o UTF8Data (LocalData?) that's been translated. cache.c has a lot of "entry->UTF8Data ? entry->UTF8Data : entry->Data" in it now that perhaps should be hidden behind something.
Yes.
o DataReaders reference count. UTF8Data is destroyed when not needed. o CharsetDecoder
- a_Cache_unref_buf, a_Capi_unref_buf, a_Nav_unref_buf all just to provide notification that the view source window isn't reading it anymore. Ugh.
Not pretty, but tolerable. I'm thinking along the line of getting the core of the patch committed and then polish the rough edges.
- a_Cache_[gs]et_content_type and capi and nav. Ugh again. I don't know whether there's a timing issue with changing decoders due to a meta tag at the same time that a view source window is being opened.
Yes there is. Simple to solve by allowing view source after parsing the HEAD element only.
(Why content-type and not charset? I find switching to text/plain to use findtext on the source to be useful.)
Sorry, I don't understand the question.
- Decode_charset() no longer free()s its input. This inconsistency would be an excellent source of bugs, so it would probably be good to make it so that none of the decoders free input.
Yes, this is a must. Consistency should be our goal/horizon/friend/helper. :-) -- Cheers Jorge.-