Re: [Dillo-dev] entities display

May 30, 2003

      Nikita,
...
On Tue, 20 May 2003 13:39:23 -0400 (CLT)
Jorge Arellano Cid <jcid@softhome.net> wrote:
...
Hi Nikita,
Hello, Jorge !
...
Now,  considering the current development plan, it'd be good to
have an interim workaround.
After reviewing the implementation in the patch, I'd prefer not
to  intermix  it with the current (correct but incomplete) way of
handling entities.
For instance, defining another entities table that's to be used
when  the entity is not found in the current one is a good way to
separate the workaround code!
I'd  like  to  add  to  this  new  table some important numeric
character  references (UTF-8) that are usually found in web pages
these days. Things like:
  "&#8212;"
  "&#8217;"
  "&#8220;"
  "&#8221;"
This could help our rendering while we get to GTK+2.
  Please let me know if you want to improve it this way!
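
A minimal sketch of the separate-table idea described above, assuming a lookup that tries the existing Latin1 table first and only falls back to the workaround table when the code is not representable. All names here (Entity_t, EntityRepr_t, MainTable, WorkaroundTable, entity_resolve) are hypothetical and are not taken from dillo's sources or from the patch; the replacement strings are illustrative guesses, and only decimal numeric references are handled.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N_ELEMS(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical main table: entity name -> Latin1 code. */
typedef struct {
   const char *name;   /* text between '&' and ';' */
   int isocode;        /* code point, 0..255 */
} Entity_t;

/* Hypothetical workaround table: code above 255 -> Latin1 stand-in. */
typedef struct {
   int isocode;
   const char *repr;
} EntityRepr_t;

static const Entity_t MainTable[] = {
   { "amp", 38 }, { "copy", 169 }, { "eacute", 233 },
   { "gt", 62 }, { "lt", 60 }
};

static const EntityRepr_t WorkaroundTable[] = {
   { 8212, "--" },   /* em dash */
   { 8217, "'" },    /* right single quote */
   { 8220, "\"" },   /* left double quote */
   { 8221, "\"" }    /* right double quote */
};

/* Returns the code of a named entity, or -1 if it is unknown. */
static int main_table_lookup(const char *name)
{
   size_t i;

   for (i = 0; i < N_ELEMS(MainTable); i++)
      if (strcmp(MainTable[i].name, name) == 0)
         return MainTable[i].isocode;
   return -1;
}

/* Returns the Latin1 stand-in for a code, or NULL if there is none. */
static const char *workaround_lookup(long code)
{
   size_t i;

   for (i = 0; i < N_ELEMS(WorkaroundTable); i++)
      if (WorkaroundTable[i].isocode == code)
         return WorkaroundTable[i].repr;
   return NULL;
}

/*
 * tok is the text between '&' and ';' (e.g. "copy" or "#8212").
 * Writes a NUL-terminated Latin1 string to out (outlen >= 2).
 * Returns 1 if the entity was resolved, 0 otherwise.
 */
static int entity_resolve(const char *tok, char *out, size_t outlen)
{
   long code;
   const char *repr;

   if (tok[0] == '#') {
      char *end;
      code = strtol(tok + 1, &end, 10);
      if (end == tok + 1 || *end != '\0')
         return 0;
   } else {
      code = main_table_lookup(tok);
   }

   /* Regular path: the code fits in Latin1. */
   if (code > 0 && code <= 255) {
      snprintf(out, outlen, "%c", (int)code);
      return 1;
   }

   /* Workaround path: kept apart so it is easy to remove later. */
   repr = workaround_lookup(code);
   if (repr == NULL)
      return 0;
   snprintf(out, outlen, "%s", repr);
   return 1;
}
```

With this layout, entity_resolve("#8212", buf, sizeof buf) would fill buf with "--", while an unknown reference returns 0 and can be left to the caller. Dropping the workaround later would only mean deleting WorkaroundTable and workaround_lookup.
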
Cheers
  Jorge.-
Jorge, the current way of displaying entities does not satisfy me because
it throws away any entity whose code is greater than 255.
Well, that is for Latin1.
...
Is there any
reason to leave the table without representations in place and add
a new one _with_ them, which differs only in the representations field and
a (possibly) bigger number of entities?
Maybe not, but I didn't suggest _that_ table.

  I  thought  that  although  implicit it was clear, but it seems
that character handling code is always harder than it seems!

  The  main  reason  for  separating the tables was to make clear
what  was the workaround and what the code to be used with UCS in
the future.

  The  other  is  that the second table can be ordered by isocode
and thus allow for a bin search instead of the slow linear search
the patch uses.
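
  To illustrate the binary search point, here is a sketch that assumes a table kept sorted by ascending isocode and searched with the standard C bsearch(). The type ReprEntry_t, the table SortedReprTable, and the function repr_bsearch are hypothetical names, and the entries are only examples, not the contents of any real dillo table.

```c
#include <stdlib.h>

/* Hypothetical entry: code point -> Latin1 stand-in string. */
typedef struct {
   int isocode;
   const char *repr;
} ReprEntry_t;

/* Must stay sorted by ascending isocode for bsearch() to work. */
static const ReprEntry_t SortedReprTable[] = {
   { 8211, "-" },     /* en dash */
   { 8212, "--" },    /* em dash */
   { 8216, "'" },     /* left single quote */
   { 8217, "'" },     /* right single quote */
   { 8220, "\"" },    /* left double quote */
   { 8221, "\"" },    /* right double quote */
   { 8230, "..." }    /* horizontal ellipsis */
};

/* Compares the searched code (key) with a table entry. */
static int repr_cmp(const void *key, const void *elem)
{
   int code = *(const int *)key;
   const ReprEntry_t *entry = elem;

   return (code > entry->isocode) - (code < entry->isocode);
}

/* Returns the stand-in string for isocode, or NULL if not in the table. */
static const char *repr_bsearch(int isocode)
{
   const ReprEntry_t *hit;

   hit = bsearch(&isocode, SortedReprTable,
                 sizeof(SortedReprTable) / sizeof(SortedReprTable[0]),
                 sizeof(SortedReprTable[0]), repr_cmp);
   return hit ? hit->repr : NULL;
}
```

  A lookup costs O(log n) comparisons instead of O(n), at the price of having to keep the table sorted whenever an entry is added.
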

  While reviewing it in more depth, I found a bug in dillo's code
(fixed in CVS), and made a small UCS to Latin1 converter that helps
with very weird pages like:

  http://www.michigan.gov/minewswire/
         0,1607,7-136-3452_3518-51568--M_2002_9,00.html

  (all in one line)

  Finally, I can't see (yet) how having this entity character
representation helps when using another character set (as with
the Russian encoding workaround), unless you first convert the
regular HTML character encoding into the current locale!
Certainly a hairy problem, and one of the main reasons for the GTK+2
design decision of using UCS instead of the current locale as its
internal representation.
...
In the future (with GTK2 internals), the definition and use of all
representations could be cut from the source code. I don't see any problem
or pain with that; the corresponding changes would not be big.
IMHO the only progress we can make now is entering more entities in the table.
In any case, I am interested in further progress in this direction.
P.S. The entities you mention cannot be translated from UCS2 (the encoding
used in HTML numeric entities) to UTF-8. Lynx/links do not show them either.
I meant UCS to Latin1 (see the patch in CVS).

  Cheers
  Jorge.-