[Dillo-dev] HTML Policy and New parser improvements

April 16, 2004

      Hi there,

  Most of us are familiar with the aphorism about our HTML
parsing policy stated in the [Project Notes]:

  "Our policy with HTML is not to try to render badly written
HTML, ideally send a warning message, and not to crash!"

  Here is the rationale behind it. I wrote it some time ago as
an email answer, and here I quote an improved version:

<q>

About our parsing policy
------------------------

  These days I have thought a lot about this subject. In fact, an
important part of the work of a project maintainer is to take a
stance on the difficult decisions: those that are not black or
white, but a trade-off.

  With  regard to our parsing policy, imagine a triangle with the
following vertices:

     * Standards (W3C)
     * Web site authors
     * Users

  Each vertex represents the exact position of the group it is
named after. The area inside is the whole space of stances anyone
could take.

  Each group has its own interests, sometimes opposed and
sometimes closely aligned. The position of a web browser can be
visualized as a point within that triangle, chosen by its
development team.

  Ideally, Dillo would sit at the W3C vertex, but that would
make it almost useless because of the horrible state of the HTML
on the Web (a.k.a. "Tag Soup").

  For  that  reason  we  make  some  exceptions so that dillo can
render  a  larger set of the web, by correcting some HTML faults,
but we keep close to the W3C vertex.

  A standards-compliant browser, such as Mozilla, should be close
to the W3C, but I understand it would never manage to be a
candidate to replace IE if it were (a trade-off). In fact it sits
along the authors-users side, AFAIU.

  As our main objective is the democratization of access to the
internet's information, and that is directly related to the use
of standards, we follow the path of respecting and promoting
them.

  Adding an HTML quality meter to the interface, in the form of
a face icon and the number of detected errors, emerges as a good
way to improve dillo as a QA tool for content authors.

  I've also thought of adding a "combat mode rendering" button.
That is, a way of parsing the worst sites into a basic and simple
rendering. That way, users would be one click away from being
able to "see" pages with awful HTML.

  These  two  ideas would help dillo to keep close to the correct
vertex  of  the  triangle, while also becoming a tool to help web
authors to provide more standards compliant content.

</q>

  After the new parser was introduced (0.8.0), Dillo featured
much better HTML error detection, but it rendered malformed HTML
a bit worse. It was a good trade-off from the point of view of
the "standards" vertex of the above-mentioned triangle, but I
also knew that it was not going to be much fun for the "users"
vertex.

  These  days  I've  been  working  on  improving  the parser and
bug-meter  by introducing information about the inline, block and
flow content models of HTML.

  With that information in place, it was easy to produce better
and more accurate bug detection, and also to bring the rendering
closer to what it used to be.
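
  To illustrate the idea, here is a minimal sketch in C of an
element table that carries content model information, plus a
check a parser or bug-meter could run when an element is opened.
It is not the code in CVS: every name in it (ElemInfo,
nesting_ok, the enum values) is made up for the example, the
table is a tiny subset that follows the HTML 4.01 Strict content
models, and special exclusions (such as "a" inside "a") are
ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not Dillo's element table. */

/* What kind of element this is. */
typedef enum { CLASS_INLINE, CLASS_BLOCK } ElemClass;

/* What kind of children the element may contain. */
typedef enum { HOLDS_INLINE, HOLDS_BLOCK, HOLDS_FLOW } ContentModel;

typedef struct {
   const char *name;       /* element name, lowercase */
   ElemClass cls;          /* inline or block element */
   ContentModel content;   /* inline, block or flow (both) content */
} ElemInfo;

/* A tiny subset, following the HTML 4.01 Strict content models. */
static const ElemInfo elem_table[] = {
   { "a",          CLASS_INLINE, HOLDS_INLINE },
   { "b",          CLASS_INLINE, HOLDS_INLINE },
   { "blockquote", CLASS_BLOCK,  HOLDS_BLOCK  },
   { "div",        CLASS_BLOCK,  HOLDS_FLOW   },
   { "em",         CLASS_INLINE, HOLDS_INLINE },
   { "h1",         CLASS_BLOCK,  HOLDS_INLINE },
   { "p",          CLASS_BLOCK,  HOLDS_INLINE },
};

/* Linear search by name; returns NULL for unknown elements. */
static const ElemInfo *find_elem(const char *name)
{
   size_t i;

   for (i = 0; i < sizeof(elem_table) / sizeof(elem_table[0]); i++)
      if (strcmp(elem_table[i].name, name) == 0)
         return &elem_table[i];
   return NULL;
}

/*
 * Return 1 if 'child' may appear directly inside 'parent',
 * 0 if that nesting is an HTML bug, and -1 if either element
 * is missing from the table.
 */
int nesting_ok(const char *parent, const char *child)
{
   const ElemInfo *p, *c;

   assert(parent != NULL && child != NULL);
   p = find_elem(parent);
   c = find_elem(child);
   if (p == NULL || c == NULL)
      return -1;
   switch (p->content) {
   case HOLDS_FLOW:   return 1;
   case HOLDS_BLOCK:  return c->cls == CLASS_BLOCK;
   case HOLDS_INLINE: return c->cls == CLASS_INLINE;
   }
   return -1;
}
```

  With such a table, nesting_ok("p", "div") returns 0, so a
bug-meter could count one error there, and the parser could
decide how to recover (for instance by closing the open "p"
first) instead of guessing.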

  So that's the good news: the new CVS contains code with an
improved parser that hopefully will be a pleasant surprise for
our users.

  From the Changelog:

   * Added container|inline model information to the HTML element table, and
     made the bug-meter and the parser aware of it. This both improves bug
     detection and rendering.
   * Fixed newly detected HTML bugs in bookmarks dpi and file.c.
   * Fixed opening files with a ':' character in their names (again).
   * Added binaryconst.h (allows for binary constants in C).
   * Fixed the ladder effect with lists (BUG#534).
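
  As background for the binaryconst.h entry: standard C has no
binary literals, but they can be emulated with the preprocessor.
The sketch below shows one well-known technique and is only an
assumption about the kind of thing such a header can offer; the
macro names (HEX__, B8__, B8) and the self-check function are
illustrative and are not taken from Dillo's file. The digits are
first pasted into a hexadecimal literal, and then each hex nibble
is mapped to one bit.

```c
#include <assert.h>

/* Paste the digits into a hex literal: 01010101 -> 0x01010101LU */
#define HEX__(n) 0x##n##LU

/* Each nonzero hex nibble of x contributes one bit of the result. */
#define B8__(x) ( (((x) & 0x0000000FLU) ?   1 : 0) \
                + (((x) & 0x000000F0LU) ?   2 : 0) \
                + (((x) & 0x00000F00LU) ?   4 : 0) \
                + (((x) & 0x0000F000LU) ?   8 : 0) \
                + (((x) & 0x000F0000LU) ?  16 : 0) \
                + (((x) & 0x00F00000LU) ?  32 : 0) \
                + (((x) & 0x0F000000LU) ?  64 : 0) \
                + (((x) & 0xF0000000LU) ? 128 : 0) )

/* 8-bit binary constant; write all eight digits, e.g. B8(00000101). */
#define B8(d) ((unsigned char)B8__(HEX__(d)))

/* Illustrative flag values written in binary (hypothetical names). */
enum {
   FLAG_INLINE = B8(00000001),
   FLAG_BLOCK  = B8(00000010),
   FLAG_FLOW   = B8(00000011)   /* inline | block */
};

/* Small sanity check of the macro against known values. */
void binaryconst_selfcheck(void)
{
   assert(B8(00000000) == 0);
   assert(B8(01010101) == 0x55);
   assert(B8(11111111) == 0xFF);
   assert(FLAG_FLOW == (FLAG_INLINE | FLAG_BLOCK));
}
```

  Because everything is resolved at compile time, such constants
cost nothing at run time and can be used wherever an integer
constant expression is needed, as in the enum above.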

  So go ahead and try it!

  Cheers
  Jorge.-

Jorge Arellano Cid
