On Thu, Mar 30, 2006 at 10:30:32PM +0100, Jeremy Henty (ie. me) wrote:
It breaks http://news.independent.co.uk/ , -rc3 renders it as plain text, -rc2 is fine.
It breaks http://www.slate.com/ too.
Aargh! *More* breakage!

I've discovered that MP3s often start with "ID3<lots of whitespace>", so -rc3 guesses they are text/plain. I don't know how to fix this.

http://www.nostalgia.com/nf_moreinfo.html?sku=10576 starts with "<!HAS_WEBDNA_TAGS>", and again -rc3 guesses text/plain. I can work around this by skipping all leading "<!...>" tags except "<!DOCTYPE...>".

http://www.techworld.com/applications/news/index.cfm?NewsID=5685&inkc=0 starts with a *huge* amount of whitespace and triggers a bug in the content type guesser: it gets a buffer full of whitespace, skips all the way to the end, and guesses based on whatever garbage follows in memory. This turns out to be 8 or so binary characters followed by ASCII text from a previous buffer. Sometimes the text is part of a previous page; sometimes it contains the message "waiting for the server". -rc3 guesses text/plain. I fixed this bug by adding "if (i==Size) return st;" before doing any guessing. Clearly Dillo should not try to guess if it has nothing to guess with.

I vote against content type guessing unless it can be improved a lot. It just doesn't work well enough. (BTW, I've frequently run Dillo -rcs before and this is the first one that gave me any trouble at all.)

Jeremy Henty
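For illustration, here is a minimal sketch of the two workarounds described above: stop when the buffer is exhausted, and skip leading "<!...>" tags other than "<!DOCTYPE...>". This is not Dillo's actual guesser. The names guess_content_type and starts_with_doctype are hypothetical, and returning NULL for "no guess" stands in for the "return st;" in the real fix. It treats a "<!...>" tag as ending at the first ">", which is a simplification, and it makes no attempt at the MP3 "ID3" case.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: does the data at p begin with "<!DOCTYPE"
 * (compared case-insensitively)? */
static int starts_with_doctype(const char *p, size_t n)
{
    static const char tag[] = "<!DOCTYPE";
    const size_t len = sizeof tag - 1;
    size_t k;

    if (n < len)
        return 0;
    for (k = 0; k < len; k++)
        if (toupper((unsigned char)p[k]) != tag[k])
            return 0;
    return 1;
}

/* Hypothetical guesser: returns a MIME type string, or NULL when the
 * buffer holds nothing to base a guess on. Never reads past buf[size-1]. */
static const char *guess_content_type(const char *buf, size_t size)
{
    size_t i = 0;

    assert(buf != NULL || size == 0);

    for (;;) {
        /* Skip leading whitespace, bounded by the buffer size. */
        while (i < size && isspace((unsigned char)buf[i]))
            i++;

        /* The guard from the post: nothing left, so do not guess. */
        if (i == size)
            return NULL;

        /* Skip a leading "<!...>" tag unless it is the DOCTYPE. */
        if (size - i >= 2 && buf[i] == '<' && buf[i + 1] == '!' &&
            !starts_with_doctype(buf + i, size - i)) {
            const char *end = memchr(buf + i, '>', size - i);
            if (end == NULL)
                return NULL;    /* tag not closed within this buffer */
            i = (size_t)(end - buf) + 1;
            continue;
        }
        break;
    }

    /* Crude placeholder decision: markup looks like HTML, the rest is text. */
    return buf[i] == '<' ? "text/html" : "text/plain";
}
```

With this shape, a buffer containing only whitespace yields no guess at all instead of a guess based on whatever follows it in memory, and a page starting with "<!HAS_WEBDNA_TAGS>" is judged by what comes after that tag.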