On Sun, May 25, 2008 at 01:12:38PM +0200, Justus Winter wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Hi :)
I was playing around with Michal Zalewskis html fuzzer 'mangle' trying to crash dillo and succeeded :).
Good. It's no surprise because dillo2 needs a careful review of behaviour when facing strange/malicious values. Obviously this phase was procrastinated until basic functionality was completed. For instance, most probably you'll get to crash dillo2 by passing it some negative values in attributes. The problem of being robust when parsing garbage or malicious code needs a general strategy. I was delighted to read DJB's paper: "Some thoughts on security after ten years of qmail 1.0" (Strongly suggested to everyone!). There're lots of good ideas there, tried and true. It's clear that implementing them all is a huge task, but just a couple of them can make wonders, and reading the paper makes you think in a different way when designing code. For instance, in dlib, I took the simple approach of checking each function's input before processing (not 100% complete, but you get the idea), and it has proven rock solid. The extra time spent on checks doesn't impact performance noticeably, but the simplicity and joy of relying on a strong ADT set is priceless. Another strategy that allowed me to simplify dillo a lot, was to avoid multiple execution paths. With multiple paths, code gets much more complex, and maintenance is even harder, minor changes have side effects to track and testing depends on the path taken. A single path is tested by common use all the time, has only one place to make changes, is simpler to understand/debug/modify, and the simplicity leverage clearly overcomes the possible performace tradeoff. Beautifuly expressed: "The key to performance is elegance, not battalions of special cases." Jon Bentley and Doug McIlroy e.g. The CCC used to have one-branch-per-protocol. When it was switched to a single "universal" branch, CCC code size was reduced importantly, a lot of the complexity vanished, fixing a bug benefited all protocols, there were no multiple paths to track, etc. e.g.2 Memory handling in dillo. The general idea of trying to make each function deallocate what it allocated, made a hell of memory traceback debugging vanish. And the performance impact was unnoticed!
Furthermore I was able to reduce the ~100k page that was triggering a NULL pointer dereferencing to three tags (see the attached page).
Some digging around (html.cc is a monster ^^) revealed that
style_attrs.setBorderColor ( Color::createShaded(HT2LT(html), style_attrs.backgroundColor->getColor()));
in Html_tag_open_table was failing due to the fact that style_attrs.backgroundColor was NULL.
Html_tag_open_select sets the backgroundColor to NULL and this is inherited by the table tag since it's a child of the select tag.
Greping through the code shows that there are at least two other functions that set backgroundColor (and color) to NULL (Html_tag_open_input and Html_tag_open_isindex) and twelve locations calling getColor() on either color or backgroundColor.
I have no idea how to _properly_ fix this issue.
Not me either. I'd check what NULL means. A quick look hints towards a non inherited value: /** * \brief Reset those style attributes to their standard values, which are * not inherited, according to CSS. */ void StyleAttrs::resetValues () A default backgroundColor should be a good place to start. HTH.
Furthermore I'd suggest to break down html.cc into several files. It is about 6000 lines long (-> hard to dig into) and it takes gcc roughly five seconds to compile on my box.
I agree. The hard part is a good choice on what to separate. ;) -- Cheers Jorge.-