Jorge wrote:
- Reach the meta tag and undo/redo within html.cc. Somewhat fragile. Whatever I might come up with, some html out there would surely find a way to outsmart me.
This is the one I like most at first sight.
Considering the charset can be given by HTTP or a META element:
We can assume ASCII in the html text until the HEAD element is closed. If there's a charset in the META, then the decoder can be switched from null to the specified one.
This approach has the advantage of working whether the charset comes via HTTP or via META (<HEAD> content is ASCII).
We can even add a text buffer for the HEAD element and append it to the whole HTML content if the offset is hard to set for the new decoder.
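To make the scheme above a bit more concrete, here is a rough sketch of the "treat HEAD as ASCII, then switch decoders" idea. It is not dillo code: `HeadScan` and `scan_head` are made-up names, and it assumes the raw bytes seen so far are kept in a buffer until `</head>` shows up. It also ignores the case where `</head>` or `charset=` appears inside a comment or script, so take it as an illustration only.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type, not part of dillo.
struct HeadScan {
    bool head_closed;         // true once "</head>" has been seen
    std::size_t body_offset;  // byte offset just past "</head>"
    std::string charset;      // lowercased charset from a META, or empty
};

// Lowercase only A-Z; other bytes are left untouched.
static std::string ascii_lower(const std::string &s) {
    std::string out(s);
    for (char &c : out) {
        if (c >= 'A' && c <= 'Z')
            c = static_cast<char>(c - 'A' + 'a');
    }
    return out;
}

// Scan the buffered raw bytes, assuming everything up to </head> is ASCII.
HeadScan scan_head(const std::string &raw) {
    HeadScan r{false, 0, ""};
    const std::string lower = ascii_lower(raw);
    const std::string end_tag = "</head>";
    const std::string key = "charset=";

    const std::size_t end = lower.find(end_tag);
    if (end == std::string::npos)
        return r;  // HEAD not complete yet: keep buffering
    r.head_closed = true;
    r.body_offset = end + end_tag.size();

    // Look at each <meta ...> tag inside HEAD until a charset is found.
    std::size_t tag = lower.find("<meta");
    while (tag != std::string::npos && tag < end && r.charset.empty()) {
        const std::size_t close = lower.find('>', tag);
        if (close == std::string::npos || close > end)
            break;
        std::size_t p = lower.find(key, tag);
        if (p != std::string::npos && p < close) {
            p += key.size();
            // Skip optional quotes or spaces before the value.
            while (p < close &&
                   (lower[p] == '"' || lower[p] == '\'' || lower[p] == ' '))
                ++p;
            std::size_t q = p;
            while (q < close &&
                   (std::isalnum(static_cast<unsigned char>(lower[q])) ||
                    lower[q] == '-' || lower[q] == '_' ||
                    lower[q] == '.' || lower[q] == ':'))
                ++q;
            r.charset = lower.substr(p, q - p);
        }
        tag = lower.find("<meta", close);
    }
    return r;
}
```

With something like this, the caller would hand `raw.substr(r.body_offset)` to whatever decoder `r.charset` names (or to the one given by HTTP), while the buffered HEAD bytes are passed through unchanged.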
Do the problems you found apply to this scheme too?
My original code worked a bit like this. It would set a flag to make Html_write_raw() quit early and go back up to Html_write() for re-decoding. (It would just restart from the beginning, though, and I would ignore the error messages about multiple heads and so on.)
- At least the title tag has to be called again.
- Maybe a future dillo that does things with javascript has already seen some and done things with it -- possibly involving literal strings -- by the time we reach meta. I don't know javascript, though, so I don't know whether that's an actual concern.
I'm not quite sure what you're saying about the text buffer, though...