[patch]: do not require Html_process_word()s argument to be null-terminated
This is almost a trivial cleanup, but I'd be more comfortable if someone else checked I have not missed anything. (I have been using it locally for a week or two with no problems so far.) Html_process_word() requires its argument to be null-terminated. This forces Html_write_raw() (the only caller of Html_process_word()) to write a null character before calling Html_process_word() and restore the original character afterwards. It turns out to be easy to change Html_process_word() to not need the null-terminator. This eliminates the hack in Html_write_raw() and lets us make several function arguments "const char *" instead of "char *". It's a little tricky to see that the patch is correct. It works because none of the functions that Html_process_word() calls require a null-terminator, and the return value of a_Html_parse_entities() *is* null-terminated. Have I missed anything? If not, I'll push it. Regards, Jeremy Henty
On Sun, Sep 12, 2010 at 01:14:11PM +0100, Jeremy Henty wrote:
This is almost a trivial cleanup, but I'd be more comfortable if someone else checked I have not missed anything. (I have been using it locally for a week or two with no problems so far.)
Html_process_word() requires its argument to be null-terminated. This forces Html_write_raw() (the only caller of Html_process_word()) to write a null character before calling Html_process_word() and restore the original character afterwards.
It turns out to be easy to change Html_process_word() to not need the null-terminator. This eliminates the hack in Html_write_raw() and lets us make several function arguments "const char *" instead of "char *".
It's a little tricky to see that the patch is correct. It works because none of the functions that Html_process_word() calls require a null-terminator, and the return value of a_Html_parse_entities() *is* null-terminated.
Have I missed anything? If not, I'll push it.
A few years ago, parts of the parser were coded not to need null-terminators (for speed's sake). It got really hard to maintain and every time a change was necessary, the whole function-call stack needed a review. Simple semantically good loking patches introduced bugs because library functions couldn't be used there too. One day I decided to use standard null-terminated strings and everything got easier. Not to mention there was no noticeable speed penalty. IOW, I learnt this the hard way! ;) FWIW, profiling has surprised me a few times: once I wrote a minimal perfect hash for matching HTML elements and the added complexity didn't payoff against a simple binary search. -1 -- Cheers Jorge.-
Just to clarify, I wasn't expecting any speed-up from this, it was just a cleanup patch. I accept the point about the dangers of fragile APIs in general, but I don't find it very compelling in this particular case. Html_process_word() is called from only one place, and it's caller has to hack its around the entirely unnecessary requirement that the argument be null-terminated. I don't see that any API is damaged just by cleaning up this particular function call. Regards, Jeremy Henty
On Wed, Sep 15, 2010 at 07:45:06AM +0100, Jeremy Henty wrote:
Just to clarify, I wasn't expecting any speed-up from this, it was just a cleanup patch. I accept the point about the dangers of fragile APIs in general, but I don't find it very compelling in this particular case. Html_process_word() is called from only one place, and it's caller has to hack its around the entirely unnecessary requirement that the argument be null-terminated. I don't see that any API is damaged just by cleaning up this particular function call.
First I have to say that I agree with Jorge. Working with non NULL-terminated buffers is extremly ugly and error prone. The reason is that when using length based variants of the standard string routines (e.g. strncmp() ...) you always have to recompute the length relative to your current position in the buffer. With null-terminated strings no matter where you are in the string it will still be properly terminated. On the other hand I'd also like to see Html_write_raw() to work on a const-buffer and even drop the requirement that it's buf parameter needs to be null-terminated (which is not addressed in your patch atm). The reason is that currently we pass a buflen and _also_ require buf to be null-terminated. That's rather confusing. I don't have a solution yet, but I envision something, where e.g. Html_write_raw() or DilloHtml::write() works with a const buf and buflen (no null-termination required), but non-terminated buffers don't leak into lower-layer functions (like Html_process_space(), or Html_process_word()). Also, as there is no pressing need we should be extremely careful and think twice before changing things. Regards, Johannes
Johannes Hofmann wrote:
I don't have a solution yet, but I envision something, where e.g. Html_write_raw() or DilloHtml::write() works with a const buf and buflen (no null-termination required), but non-terminated buffers don't leak into lower-layer functions (like Html_process_space(), or Html_process_word()).
In that case, whenever a "high-layer" function like Html_write_raw() calls a lower-layer function it will have to either poke in a null byte (as Html_write_raw() does now) or copy a segment of its buffer into a new null-terminated buffer. If you want the original buffer to be const then it can't poke in a null, so it will have to copy. Unless there's a third way, but I can't see one. Would you consider all the Html_tag_open_*() functions to be lower-layer too? If so then it would be natural (as I understand your suggestion) for them to lose their length arguments. Is that right? And presumably Html_get_{attr,attr2}() also? The more I think about this the more functions in html.cc I think it might affect. Maybe we could clone a new repository to experiment with? Regards, Jeremy Henty
On Wed, Sep 15, 2010 at 09:24:58PM +0100, Jeremy Henty wrote:
Johannes Hofmann wrote:
I don't have a solution yet, but I envision something, where e.g. Html_write_raw() or DilloHtml::write() works with a const buf and buflen (no null-termination required), but non-terminated buffers don't leak into lower-layer functions (like Html_process_space(), or Html_process_word()).
In that case, whenever a "high-layer" function like Html_write_raw() calls a lower-layer function it will have to either poke in a null byte (as Html_write_raw() does now) or copy a segment of its buffer into a new null-terminated buffer. If you want the original buffer to be const then it can't poke in a null, so it will have to copy. Unless there's a third way, but I can't see one.
As I said, I don't have any good idea either. So maybe what we currently have is a good compromise - I don't know. Also I won't invest much time as there are is no real need to change things.
Would you consider all the Html_tag_open_*() functions to be lower-layer too? If so then it would be natural (as I understand your suggestion) for them to lose their length arguments. Is that right? And presumably Html_get_{attr,attr2}() also? The more I think about this the more functions in html.cc I think it might affect. Maybe we could clone a new repository to experiment with?
I would leave them as they are but feel free to experiment. Having a cloned repo for experimenting is always a good idea. I have a clone for every non-trivial thing I do. Regards, Johannes
participants (3)
-
jcid@dillo.org
-
Johannes.Hofmann@gmx.de
-
onepoint@starurchin.org