Jorge wrote:
On Fri, May 02, 2008 at 08:02:14PM +0000, corvid wrote:
I've been playing around with this now and then over the past few months when I haven't had anything else going on. I've been running on it for a day or so without incident, but I haven't gone through it with a fine-toothed comb because I know there will be changes.
It's 35K, so I realize I'm in little danger of hearing back too quickly :) Besides, I know you said you have some nbsp thing that needs attention...
I've done a second review of the code, and it doesn't look that bad to me. That is, the decoding in the cache is acceptable, even the stuffed cache entry, if it can be better isolated/tidied up in the future.
hurray!
The part that looks inconvenient to me is the charset-choosing dialog. I'd prefer Dillo to do this automatically. There's no need to try to recognize every charset in the world; just a few of the most used ones, like koi8-r, would do. BTW, UTF-8 has become the preferred solution.
For one thing, I was thinking of a site I knew of with a Russian forum, where the message-board software was too limited to handle the charset, so they told people to manually switch each page in that section to windows-1251 or something like that. But okay. That's obviously not any sort of fundamental requirement. (And because I liked switching to text/plain to get the source in a proper browser window (bw).)
The idea is that we can leave room for an expandable function just in case. BTW, there's an improved version of the data guesser in the ftp dpi. Please send me a koi8-r encoded page to test (or just let me know how to find one with google :).
Here's one that's apparently about 0.7.0: http://gazette.linux.ru.net/lg79/arndt.html
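And in case it helps pin down what "automatically" could mean here: the obvious first step is to honor any charset parameter the server already sends, before doing any guessing. A rough standalone sketch (not Dillo code; the helper name is made up, and a real version would match the parameter name case-insensitively):

#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'd copy of the charset parameter
 * in a Content-Type header value, or NULL if it is absent.
 * e.g. "text/html; charset=koi8-r"  ->  "koi8-r" */
static char *get_charset_param(const char *content_type)
{
   const char *p = strstr(content_type, "charset=");
   char *ret;
   size_t len = 0;

   if (!p)
      return NULL;
   p += strlen("charset=");
   if (*p == '"')                       /* the value may be quoted */
      ++p;
   while (p[len] && p[len] != '"' && p[len] != ';' &&
          !isspace((unsigned char)p[len]))
      ++len;
   if (!len)
      return NULL;
   ret = malloc(len + 1);
   if (ret) {
      memcpy(ret, p, len);
      ret[len] = '\0';
   }
   return ret;
}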
IIRC you sent me an explanation message on what this patch was for and remaining issues. I can't find it :(. Please resend it to me.
Oh, it was just... let's see...
"> BTW, please explain what is/are the main point(s) of it. After
> reading the comment it was a bit scary to apply. :-o
It's not for applying. It's "here's what I was able to come up with to get encoding translation for page source and plain text, but there are obviously issues, so I need suggestions for... how to do what needs to be done in the most dillolike manner possible" or... or something like that..."
There are plenty of issues. Going in order through the diff:
- *_get_content_type(). I searched for some Russian text files where a charset is given in the HTTP header. None of them were regarded as text by the old test for ASCII chars. Maybe the way to go is to have tests for more filetypes and treat unknown as text. Checking for all possible charsets would be no fun, anyway. (Tests for more filetypes might be needed for file uploading, as well.)
The problem with "unknown as text" is that some HTTP servers send downloadable files as text. It is quite practical in this case to get a download dialog from Dillo instead of a rendering of the binary data (especially for newbies). That's why the current data guesser is conservative.
Quite true.
The data guesser can be improved, e.g. by taking ASCII control characters as a strong indication of a binary file, and their absence as a text hint. I can give it a look after you send me the files.
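Something along these lines is roughly what I pictured (a standalone sketch with a made-up name, not the ftp dpi's guesser):

#include <stddef.h>

/* Hypothetical guesser: return 1 if the buffer looks like text,
 * 0 if it looks binary.  A NUL or any other C0 control character
 * (apart from HT, LF, FF, CR) is taken as a strong binary hint;
 * high-bit bytes are allowed, so koi8-r, windows-1251, UTF-8 etc.
 * still pass as text. */
static int looks_like_text(const unsigned char *buf, size_t len)
{
   size_t i;

   for (i = 0; i < len; ++i) {
      unsigned char c = buf[i];

      if (c == '\0')
         return 0;               /* embedded NUL: almost surely binary */
      if (c < 0x20 && c != '\t' && c != '\n' && c != '\f' && c != '\r')
         return 0;               /* other control chars: binary hint */
   }
   return 1;                     /* no control chars: treat as text */
}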
- CacheEntry_t now has a horrifying amount of junk in it. Maybe it would be good to pull out about half of it into some sort of Data thing.
It's stuffed but still understandable. If it can be tidied in a second pass, so much the better.
o TypeSet to be set from Meta or a user dialog. Meta can set it once; the dialog can set it repeatedly.
I'd like to avoid that dialog if possible.
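Just to spell out what I meant by that field, here is the rule in miniature (toy names, not the patch's code; if the dialog goes away, the second setter goes with it):

/* Toy illustration of the TypeSet rule: a META tag may set the
 * content type only while it is still unset, while a user override
 * may replace it any number of times. */
enum { TYPE_UNSET, TYPE_FROM_META, TYPE_FROM_USER };

struct type_sketch {
   const char *content_type;
   int TypeSet;
};

static void set_type_from_meta(struct type_sketch *t, const char *ctype)
{
   if (t->TypeSet == TYPE_UNSET) {      /* META only gets one shot */
      t->content_type = ctype;
      t->TypeSet = TYPE_FROM_META;
   }
}

static void set_type_from_user(struct type_sketch *t, const char *ctype)
{
   t->content_type = ctype;             /* a user choice always wins */
   t->TypeSet = TYPE_FROM_USER;
}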
o UTF8Data (LocalData?) that's been translated. cache.c has a lot of "entry->UTF8Data ? entry->UTF8Data : entry->Data" in it now that perhaps should be hidden behind something.
Yes.
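Something as small as this would hide it; a stand-in struct rather than the real CacheEntry_t, with plain pointers where the real code has its own buffer type:

/* Toy stand-in for the patched cache entry: just the two buffers
 * in question. */
typedef struct {
   char *Data;      /* raw data as received */
   char *UTF8Data;  /* decoded copy, or NULL if no decoding was done */
} EntrySketch;

/* Hypothetical accessor replacing the repeated ternary in cache.c. */
static char *Entry_current_data(const EntrySketch *entry)
{
   return entry->UTF8Data ? entry->UTF8Data : entry->Data;
}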
o DataReaders reference count. UTF8Data is destroyed when not needed.
o CharsetDecoder
- a_Cache_unref_buf, a_Capi_unref_buf, a_Nav_unref_buf all just to provide notification that the view source window isn't reading it anymore. Ugh.
Not pretty, but tolerable.
I'm thinking along the lines of getting the core of the patch committed and then polishing the rough edges.
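The reference counting itself needn't be fancy; roughly this (invented names again, plain pointers instead of the real buffer type):

#include <stdlib.h>

/* Toy version of the DataReaders idea: every consumer of the decoded
 * buffer (the parser, a view-source window, ...) takes a reference,
 * and when the last one lets go only the translated copy is freed;
 * the raw Data stays in the cache as before. */
struct reader_sketch {
   char *Data;          /* raw data, owned by the cache */
   char *UTF8Data;      /* decoded copy, or NULL */
   int DataReaders;     /* current consumers of UTF8Data */
};

static void sketch_ref(struct reader_sketch *e)
{
   e->DataReaders++;
}

static void sketch_unref(struct reader_sketch *e)
{
   if (--e->DataReaders == 0 && e->UTF8Data) {
      free(e->UTF8Data);
      e->UTF8Data = NULL;
   }
}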
- a_Cache_[gs]et_content_type and capi and nav. Ugh again. I don't know whether there's a timing issue with changing decoders due to a meta tag at the same time that a view source window is being opened.
Yes there is.
It's simple to solve by allowing view source only after the HEAD element has been parsed.
Do you think badly-formed html with, e.g., <head>..</head><body>...</body><head><meta..charset..></head> <body>...</body> is worth worrying about?
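Either way, I take it the "after HEAD only" rule amounts to something like this (toy code, made-up names):

/* Toy version of the proposed rule: view source is refused until the
 * parser has finished HEAD, so a META charset can no longer switch
 * decoders underneath the new window. */
struct page_sketch {
   int HeadParsed;      /* set once </head> has been seen */
};

static int view_source_allowed(const struct page_sketch *page)
{
   return page->HeadParsed;
}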
(Why content-type and not charset? I find switching to text/plain to use findtext on the source to be useful.)
Sorry, I don't understand the question.
I was saying that I was pushing around content-type instead of charset because I liked changing it to text/plain in the dialog. If there's no dialog, maybe I'll go back to charset.
- Decode_charset() no longer free()s its input. This inconsistency would be an excellent source of bugs, so it would probably be good to make it so that none of the decoders frees its input.
Yes, this is a must. Consistency should be our goal/horizon/friend/helper. :-)
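For the record, the convention I'd go with is caller-owns-input for every decoder, i.e. something shaped like this (identity "decoding" standing in for the real conversion; the name is made up, not the actual Decode_* functions):

#include <stdlib.h>
#include <string.h>

/* Sketch of the caller-owns-input convention: the decoder never
 * free()s what it is given; it returns a newly allocated buffer and
 * the caller disposes of both input and output. */
static char *decode_sketch(const char *input, size_t len, size_t *out_len)
{
   char *out = malloc(len);     /* identity "decoding" for illustration */

   if (out) {
      memcpy(out, input, len);
      *out_len = len;
   }
   return out;   /* caller frees this; input is left untouched */
}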