Re: (working) prototype of charset decoding in cache.c
Hi, On Fri, May 02, 2008 at 08:02:14PM +0000, corvid wrote:
I've been playing around with this now and then over the past few months when I haven't had anything else going on. I've been running on it for a day or so without incident, but I haven't gone through it with a fine-toothed comb because I know there will be changes.
It's 35K, so I realize I'm in little danger of hearing back too quickly :) Besides, I know you said you have some nbsp thing that needs attention...
I've made a second review over the code, and it doesn't look that bad to me. This is, the decoding in cache, even the stuffed cache-entry; if it can be isolated/tidied in the future better. The part that looks inconvenient to me is the charset choosing dialog. I'd prefer Dillo do this automatically. There's no need to try to recognize every charset in the world, but just a few "most used" ones like koi8-r. BTW, UTF-8 has become the preferred solution. The idea is that we can leave space for a expandable function just in case. BTW, there's an improved version of the data guesser in the ftp dpi. Please send me a koi8-r encoded page to test (or just let me know how to find it with google :). IIRC you sent me an explanation message on what this patch was for and remaining issues. I can't find it :(. Please resend it to me.
There are plenty of issues. Going in order through the diff:
- *_get_content_type(). I searched for some Russian text files where a charset is given in the HTTP header. None of them were regarded as text by the old test for ascii chars. Maybe the way to go is to have tests for more filetypes and treat unknown as text. Checking for all possible charsets would be no fun, anyway. (Tests for more filetypes might be needed for file uploading, as well.)
The problem with "unknown as text" is that some HTTP servers send downloadable files as text. It is quite practical in this case to get a download dialog from dillo, intead of a binary rendering (especially for newbies). That's why the current data guesser is conservative. The data guesser can be improved. e.g. taking ASCCI control characters as a strong indication of a binary, and their absence as a text hint. I can give it a look after you send me the files.
- CacheEntry_t now has a horrifying amount of junk in it. Maybe it would be good to pull out about half of it into some sort of Data thing.
It's stuffed but still understandable. If it can be tidied in a second pass, better.
o TypeSet to be set from Meta or a user dialog. Meta can set it once; the dialog can set it repeatedly.
I'd like to avoid that dialog if possible.
o UTF8Data (LocalData?) that's been translated. cache.c has a lot of "entry->UTF8Data ? entry->UTF8Data : entry->Data" in it now that perhaps should be hidden behind something.
Yes.
o DataReaders reference count. UTF8Data is destroyed when not needed. o CharsetDecoder
- a_Cache_unref_buf, a_Capi_unref_buf, a_Nav_unref_buf all just to provide notification that the view source window isn't reading it anymore. Ugh.
Not pretty, but tolerable. I'm thinking along the line of getting the core of the patch committed and then polish the rough edges.
- a_Cache_[gs]et_content_type and capi and nav. Ugh again. I don't know whether there's a timing issue with changing decoders due to a meta tag at the same time that a view source window is being opened.
Yes there is. Simple to solve by allowing view source after parsing the HEAD element only.
(Why content-type and not charset? I find switching to text/plain to use findtext on the source to be useful.)
Sorry, I don't understand the question.
- Decode_charset() no longer free()s its input. This inconsistency would be an excellent source of bugs, so it would probably be good to make it so that none of the decoders free input.
Yes, this is a must. Consistency should be our goal/horizon/friend/helper. :-) -- Cheers Jorge.-
Jorge wrote:
On Fri, May 02, 2008 at 08:02:14PM +0000, corvid wrote:
I've been playing around with this now and then over the past few months when I haven't had anything else going on. I've been running on it for a day or so without incident, but I haven't gone through it with a fine-toothed comb because I know there will be changes.
It's 35K, so I realize I'm in little danger of hearing back too quickly :) Besides, I know you said you have some nbsp thing that needs attention...
I've made a second review over the code, and it doesn't look that bad to me. This is, the decoding in cache, even the stuffed cache-entry; if it can be isolated/tidied in the future better.
hurray!
The part that looks inconvenient to me is the charset choosing dialog. I'd prefer Dillo do this automatically. There's no need to try to recognize every charset in the world, but just a few "most used" ones like koi8-r. BTW, UTF-8 has become the preferred solution.
For one thing, I was thinking of a site I knew of where they had a Russian forum and the message board was too limited to deal with it, so they told people to change each page in that part to windows-1251 or something like that. But okay. That's obviously not any sort of fundamental requirement. (And because I liked changing to text/plain to have source in a proper bw.)
The idea is that we can leave space for a expandable function just in case. BTW, there's an improved version of the data guesser in the ftp dpi. Please send me a koi8-r encoded page to test (or just let me know how to find it with google :).
Here's one that's apparently about 0.7.0: http://gazette.linux.ru.net/lg79/arndt.html
IIRC you sent me an explanation message on what this patch was for and remaining issues. I can't find it :(. Please resend it to me.
Oh, it was just...let's see... "> BTW, please explain what is/are the main point(s) of it. After
reading the comment it was a bit scary to apply. :-o
It's not for applying. It's "here's what I was able to come up with to get encoding translation for page source and plain text, but there are obviously issues, so I need suggestions for... how to do what needs to be done in the most dillolike manner possible" or... or something like that..."
There are plenty of issues. Going in order through the diff:
- *_get_content_type(). I searched for some Russian text files where a charset is given in the HTTP header. None of them were regarded as text by the old test for ascii chars. Maybe the way to go is to have tests for more filetypes and treat unknown as text. Checking for all possible charsets would be no fun, anyway. (Tests for more filetypes might be needed for file uploading, as well.)
The problem with "unknown as text" is that some HTTP servers send downloadable files as text. It is quite practical in this case to get a download dialog from dillo, intead of a binary rendering (especially for newbies). That's why the current data guesser is conservative.
Quite true.
The data guesser can be improved. e.g. taking ASCCI control characters as a strong indication of a binary, and their absence as a text hint. I can give it a look after you send me the files.
- CacheEntry_t now has a horrifying amount of junk in it. Maybe it would be good to pull out about half of it into some sort of Data thing.
It's stuffed but still understandable. If it can be tidied in a second pass, better.
o TypeSet to be set from Meta or a user dialog. Meta can set it once; the dialog can set it repeatedly.
I'd like to avoid that dialog if possible.
o UTF8Data (LocalData?) that's been translated. cache.c has a lot of "entry->UTF8Data ? entry->UTF8Data : entry->Data" in it now that perhaps should be hidden behind something.
Yes.
o DataReaders reference count. UTF8Data is destroyed when not needed. o CharsetDecoder
- a_Cache_unref_buf, a_Capi_unref_buf, a_Nav_unref_buf all just to provide notification that the view source window isn't reading it anymore. Ugh.
Not pretty, but tolerable.
I'm thinking along the line of getting the core of the patch committed and then polish the rough edges.
- a_Cache_[gs]et_content_type and capi and nav. Ugh again. I don't know whether there's a timing issue with changing decoders due to a meta tag at the same time that a view source window is being opened.
Yes there is.
Simple to solve by allowing view source after parsing the HEAD element only.
Do you think badly-formed html with, e.g., <head>..</head><body>...</body><head><meta..charset..></head> <body>...</body> is worth worrying about?
(Why content-type and not charset? I find switching to text/plain to use findtext on the source to be useful.)
Sorry, I don't understand the question.
I was saying that I was pushing around content-type instead of charset because I liked changing it to text/plain in the dialog. If there's no dialog, maybe I'll go back to charset.
- Decode_charset() no longer free()s its input. This inconsistency would be an excellent source of bugs, so it would probably be good to make it so that none of the decoders free input.
Yes, this is a must. Consistency should be our goal/horizon/friend/helper. :-)
On Mon, May 26, 2008 at 03:01:53PM +0000, corvid wrote:
[...] - a_Cache_[gs]et_content_type and capi and nav. Ugh again. I don't know whether there's a timing issue with changing decoders due to a meta tag at the same time that a view source window is being opened.
Yes there is.
Simple to solve by allowing view source after parsing the HEAD element only.
Do you think badly-formed html with, e.g., <head>..</head><body>...</body><head><meta..charset..></head> <body>...</body> is worth worrying about?
This is already handled in the parser. But now that you mention this, we have to be careful with reload. It deletes the entry from the cache before making the request. A simple solution is just to close the view-source window in this corner case.
(Why content-type and not charset? I find switching to text/plain to use findtext on the source to be useful.)
Sorry, I don't understand the question.
I was saying that I was pushing around content-type instead of charset because I liked changing it to text/plain in the dialog. If there's no dialog, maybe I'll go back to charset.
"View source" as a bw is not a bad idea. It reuses code and adds findtext for free, but currently loses the line numbers. (which match nicely with the bug meter). This is a separate problem. One way to have both features in "View source" is with a dpi that gets the source from dillo and sends back an html page. The simple way is to prefix line numbers and wrap with PRE. It can be sent with a no-cache directive, or removed from cache upon window close. The problem of sending back and forth a huge page may be not as bad as it looks because huge pages are seldom inspected for source from a browser, and it can be handled anyway. -- Cheers Jorge.-
Jorge wrote:
On Mon, May 26, 2008 at 03:01:53PM +0000, corvid wrote:
[...] - a_Cache_[gs]et_content_type and capi and nav. Ugh again. I don't know whether there's a timing issue with changing decoders due to a meta tag at the same time that a view source window is being opened.
Yes there is.
Simple to solve by allowing view source after parsing the HEAD element only.
Do you think badly-formed html with, e.g., <head>..</head><body>...</body><head><meta..charset..></head> <body>...</body> is worth worrying about?
This is already handled in the parser.
But I mean it finishes seeing the first head, enables view source, user views the source, and then the charset in the second head causes a repush. Is the parser avoiding that?
But now that you mention this, we have to be careful with reload. It deletes the entry from the cache before making the request.
A simple solution is just to close the view-source window in this corner case.
Ah, yes. Stupid corner cases!
I was saying that I was pushing around content-type instead of charset because I liked changing it to text/plain in the dialog. If there's no dialog, maybe I'll go back to charset.
"View source" as a bw is not a bad idea. It reuses code and adds findtext for free, but currently loses the line numbers. (which match nicely with the bug meter).
This is a separate problem.
One way to have both features in "View source" is with a dpi that gets the source from dillo and sends back an html page. The simple way is to prefix line numbers and wrap with PRE.
It can be sent with a no-cache directive, or removed from cache upon window close.
The problem of sending back and forth a huge page may be not as bad as it looks because huge pages are seldom inspected for source from a browser, and it can be handled anyway.
Maybe adding a callback that stuffs in line numbers before going into the regular plain callback would work and be tiny. a_Capi_open_url(Web, our_special_callback, ...) Oh well, I realize it's premature to dream too hard about this at this point :)
On Mon, May 26, 2008 at 07:31:14PM +0000, corvid wrote:
Jorge wrote:
On Mon, May 26, 2008 at 03:01:53PM +0000, corvid wrote:
[...] - a_Cache_[gs]et_content_type and capi and nav. Ugh again. I don't know whether there's a timing issue with changing decoders due to a meta tag at the same time that a view source window is being opened.
Yes there is.
Simple to solve by allowing view source after parsing the HEAD element only.
Do you think badly-formed html with, e.g., <head>..</head><body>...</body><head><meta..charset..></head> <body>...</body> is worth worrying about?
This is already handled in the parser.
But I mean it finishes seeing the first head, enables view source, user views the source, and then the charset in the second head causes a repush. Is the parser avoiding that?
Yes, once you are IN_BODY, HEAD can't be re-opened, and META doesn't work outside HEAD.
But now that you mention this, we have to be careful with reload. It deletes the entry from the cache before making the request.
A simple solution is just to close the view-source window in this corner case.
Ah, yes. Stupid corner cases!
:)
I was saying that I was pushing around content-type instead of charset because I liked changing it to text/plain in the dialog. If there's no dialog, maybe I'll go back to charset.
"View source" as a bw is not a bad idea. It reuses code and adds findtext for free, but currently loses the line numbers. (which match nicely with the bug meter).
This is a separate problem.
One way to have both features in "View source" is with a dpi that gets the source from dillo and sends back an html page. The simple way is to prefix line numbers and wrap with PRE.
It can be sent with a no-cache directive, or removed from cache upon window close.
The problem of sending back and forth a huge page may be not as bad as it looks because huge pages are seldom inspected for source from a browser, and it can be handled anyway.
Maybe adding a callback that stuffs in line numbers before going into the regular plain callback would work and be tiny. a_Capi_open_url(Web, our_special_callback, ...) Oh well, I realize it's premature to dream too hard about this at this point :)
I'd prefer to try the dpi before optimizing. Having one path helps a lot with... Oh, I already gave that lines. ;) The ones that read the paper, please raise your hands. :-) -- Cheers Jorge.-
Jorge wrote:
But I mean it finishes seeing the first head, enables view source, user views the source, and then the charset in the second head causes a repush. Is the parser avoiding that?
Yes, once you are IN_BODY, HEAD can't be re-opened, and META doesn't work outside HEAD.
But if I give it a page containing <html> <head> <title>first head</title> </head> <body> body </body> <head> <meta http-equiv="content-type" content="text/html; charset=iso-8859-1"> </head> </html> I get .. META Content-Type changes charset to: iso-8859-1 .. META Content-Type gave charset as: iso-8859-1 a_Nav_expect_done: repush!
I'd prefer to try the dpi before optimizing.
I don't know dpis, so to me that's the difficult way...
On Mon, May 26, 2008 at 08:50:19PM +0000, corvid wrote:
Jorge wrote:
But I mean it finishes seeing the first head, enables view source, user views the source, and then the charset in the second head causes a repush. Is the parser avoiding that?
Yes, once you are IN_BODY, HEAD can't be re-opened, and META doesn't work outside HEAD.
But if I give it a page containing
<html> <head> <title>first head</title> </head> <body> body </body> <head> <meta http-equiv="content-type" content="text/html; charset=iso-8859-1"> </head> </html>
I get
.. META Content-Type changes charset to: iso-8859-1 .. META Content-Type gave charset as: iso-8859-1 a_Nav_expect_done: repush!
That's a corner case bug. It passed because you closed BODY explicitly (i.e. if you remove the </body> tag, the fault is caught. Patch committed.
I'd prefer to try the dpi before optimizing.
I don't know dpis, so to me that's the difficult way...
Then let's focus in the decoder. -- Cheers Jorge.-
participants (2)
-
corvid@lavabit.com
-
jcid@dillo.org