There is a bug in the HTML parser: tags within quotes are interpreted:
<input type="text" name="test" value="<p>asdf</p>" />
I think it is line 3754f (src/html.cc) which causes the unwanted behaviour. I'd fix this bug by introducing two variables: the first one states whether we're currently inside a quote and the second one stores its type (single or double quote). As long as the current character does not equal the type, all characters between the starting and ending quote will be ignored. Of course we should also be able to deal with escaped quotes, allowing constructs similar to the following one:
<input type="text" name="test" value="<p>\"asdf\"</p>" />
--Tim
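The two-variable idea can be sketched roughly as follows. This is only an illustration of the proposal, not code from src/html.cc: the function name findTagEnd and its signature are made up for the example. Note also that the backslash handling follows the proposal above; HTML itself has no backslash escapes inside attribute values and uses character references such as &quot; instead.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not taken from src/html.cc): returns the index of the
// '>' that closes the tag starting at `start` (which should point at '<'),
// or std::string::npos if no '>' is found outside of quotes.
std::size_t findTagEnd(const std::string &buf, std::size_t start)
{
   bool inQuote = false;   // first variable: are we inside a quote?
   char quoteType = 0;     // second variable: '"' or '\'' while inQuote

   for (std::size_t i = start + 1; i < buf.size(); ++i) {
      char c = buf[i];
      if (inQuote) {
         if (c == '\\' && i + 1 < buf.size()) {
            ++i;                 // skip the escaped character (per the proposal)
         } else if (c == quoteType) {
            inQuote = false;     // matching quote ends the attribute value
         }
         // anything else inside the quote is ignored, including '<' and '>'
      } else if (c == '"' || c == '\'') {
         inQuote = true;
         quoteType = c;
      } else if (c == '>') {
         return i;               // first '>' outside quotes closes the tag
      }
   }
   return std::string::npos;     // ran off the end (e.g. unterminated quote)
}
```

Called as findTagEnd(html, 0) on the first example above, this should land on the final '>' of the <input> tag rather than on the '>' of the embedded <p>.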
Tim wrote:
There is a bug in the HTML parser: Tags within quotes are interpreted:
<input type="text" name="test" value="<p>asdf</p>" />
I think it is line 3754f (src/html.cc) which causes the unwanted behaviour. I'd fix this bug by introducing two variables: the first one states whether we're currently inside a quote and the second one stores its type (single or double quote). As long as the current character does not equal the type, all characters between the starting and ending quote will be ignored. Of course we should also be able to deal with escaped quotes, allowing constructs similar to the following one:
<input type="text" name="test" value="<p>\"asdf\"</p>" />
Strictly speaking, for attributes that are CDATA, at least, my impression is that the element is closed if the parser encounters an (SGML jargon!) end-tag open (ETAGO, i.e., "</") "delimiter-in-context" (meaning, I think, that it is followed by certain characters such as ASCII letters).
Of course, SGML is a horrible, loony monstrosity, and we don't follow it to the letter to begin with, so I'm not sure that I'd necessarily be opposed to making the parser behave as you say. (Remembers http://lists.auriga.wearlab.de/pipermail/dillo-dev/2008-January/003668.html)
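For illustration, the ETAGO rule as described above ("</" followed by an ASCII letter) could be checked like this. It is a sketch of that description only, under the assumption that "certain characters" means ASCII letters; the names isEtagoInContext and findEtago are hypothetical and do not come from Dillo.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if position i in buf starts "</" immediately
// followed by an ASCII letter (the "delimiter-in-context" reading above).
bool isEtagoInContext(const std::string &buf, std::size_t i)
{
   if (i + 2 >= buf.size())
      return false;
   char c = buf[i + 2];
   bool asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
   return buf[i] == '<' && buf[i + 1] == '/' && asciiLetter;
}

// Hypothetical helper: index of the first ETAGO-in-context at or after
// `from`, or std::string::npos if there is none.
std::size_t findEtago(const std::string &buf, std::size_t from)
{
   for (std::size_t i = from; i < buf.size(); ++i) {
      if (isEtagoInContext(buf, i))
         return i;
   }
   return std::string::npos;
}
```

Under this reading, the "</p>" inside value="<p>asdf</p>" would count as an ETAGO, while "</ " or "</3" would not.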
On Sat, Dec 05, 2009 at 08:47:56PM +0000, corvid wrote:
Strictly speaking, for attributes that are CDATA, at least, my impression is that the element is closed if the parser encounters an (SGML jargon!) end-tag open (ETAGO, i.e., "</") "delimiter-in-context" (meaning, I think, that it is followed by certain characters such as ASCII letters).
Of course, SGML is a horrible, loony monstrosity, and we don't follow it to the letter to begin with, so I'm not sure that I'd necessarily be opposed to making the parser behave as you say. (Remembers http://lists.auriga.wearlab.de/pipermail/dillo-dev/2008-January/003668.html)
Good advice! ;)
Although a better heuristic would be welcome, it's not easy. Beware of the sleeping dragon. You have to consider (at least):
* RFC (HTML, SGML, XML)
* What other browsers do.
* Heuristic worst case and average.
* Unexpected input (broken HTML or XML, attacks...).
* Complexity and overhead (it affects the tokenizer).
* Error reporting.
The above idea has been tried, but it fails when the attribute lacks its closing quote, sometimes eating lots of content when plain text follows. You can find more details in doc/HtmlParser.txt.
I wish we could find a better heuristic, for instance to clean up the IFRAME rendering of http://www.linuxfordevices.com/.
--
Cheers
Jorge.-
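The missing-closing-quote failure can be reproduced with a small sketch. The scanner below is a simplified stand-in for the quote-tracking idea, not Dillo's tokenizer; quoteAwareTagEnd and brokenSample are invented for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only: index of the first '>' found
// outside of quotes after `start`, or std::string::npos.
std::size_t quoteAwareTagEnd(const std::string &buf, std::size_t start)
{
   char quote = 0;   // 0 outside quotes, else the opening quote character
   for (std::size_t i = start + 1; i < buf.size(); ++i) {
      char c = buf[i];
      if (quote != 0) {
         if (c == quote)
            quote = 0;           // closing quote
      } else if (c == '"' || c == '\'') {
         quote = c;              // opening quote
      } else if (c == '>') {
         return i;
      }
   }
   return std::string::npos;
}

// Made-up broken input: the value of the first attribute is never closed,
// so the next '"' (in the <b> tag) is taken as its closing quote.
const std::string brokenSample =
   "<input value=\"abc> plain text that should stay visible "
   "<b title=\"t\">bold</b>";
```

With this input the quotes pair up wrongly, the scanner never sees a '>' outside quotes, and a tokenizer built on it would treat the plain text and the following markup as part of the <input> tag. A scanner that simply stops at the first '>' would end the tag right after "abc" instead.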
participants (3)
- corvid@lavabit.com
- jcid@dillo.org
- tim.nieradzik@gmx.de