Nikita,
On Tue, 20 May 2003 13:39:23 -0400 (CLT) Jorge Arellano Cid <jcid@softhome.net> wrote:
Hi Nikita,
Hello, Jorge!
Now, considering the current development plan, it'd be good to have an interim workaround.
After reviewing the implementation in the patch, I'd prefer not to intermix it with the current (correct but incomplete) way of handling entities.
For instance, defining another entities table that's to be used when the entity is not found in the current one is a good way to separate the workaround code!
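A minimal sketch of that two-table idea, in C. The table layouts, the entries, and the name `entity_replacement` are all hypothetical and are not taken from dillo's source or from the patch; the point is only that the workaround table is consulted after the regular one misses, so it can be removed later without touching the main lookup.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: entity name -> single Latin1 byte (regular table). */
typedef struct {
    const char *name;
    unsigned char latin1;
} MainEntity;

/* Hypothetical layout: entity name -> plain-text stand-in (workaround). */
typedef struct {
    const char *name;
    const char *approx;
} FallbackEntity;

static const MainEntity main_entities[] = {
    { "amp", 38 }, { "copy", 169 }, { "eacute", 233 },
};

/* Workaround table: kept apart so it can be dropped once UCS is used. */
static const FallbackEntity fallback_entities[] = {
    { "mdash", "--" }, { "rsquo", "'" }, { "ldquo", "\"" }, { "rdquo", "\"" },
};

#define N_ELEMS(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Writes the replacement text for `name` into buf and returns its length,
 * or -1 if the name is in neither table or buf is too small.
 */
static int entity_replacement(const char *name, char *buf, size_t bufsize)
{
    size_t i;

    /* 1. Regular table first. */
    for (i = 0; i < N_ELEMS(main_entities); i++) {
        if (strcmp(name, main_entities[i].name) == 0) {
            if (bufsize < 2)
                return -1;
            buf[0] = (char)main_entities[i].latin1;
            buf[1] = '\0';
            return 1;
        }
    }
    /* 2. Only on a miss, try the workaround table. */
    for (i = 0; i < N_ELEMS(fallback_entities); i++) {
        if (strcmp(name, fallback_entities[i].name) == 0) {
            size_t len = strlen(fallback_entities[i].approx);
            if (len + 1 > bufsize)
                return -1;
            memcpy(buf, fallback_entities[i].approx, len + 1);
            return (int)len;
        }
    }
    return -1;
}
```

With these sample entries, `entity_replacement("mdash", buf, sizeof buf)` fills `buf` with the two-character stand-in from the workaround table, while an unknown name yields -1.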
I'd like to add to this new table some important numeric character references (UTF-8) that are usually found in web pages these days. Things like:
"—" "’" "“" "”"
This could help our rendering while we get to GTK+2. Please let me know if you want to improve it this way!
Cheers Jorge.-
Jorge, the current way of displaying entities does not satisfy me, because it throws away any entity whose code is greater than 255.
Well, that is for Latin1.
Is there any reason to leave the table without representations in place and add a new one _with_ them, which differs only in the representations field and a (possibly) bigger number of entities?
Maybe not, but I didn't suggest _that_ table. I thought it was clear even though implicit, but it seems that character handling code is always harder than it looks!
The main reason for separating the tables was to make clear which part is the workaround and which is the code to be used with UCS in the future. The other reason is that the second table can be ordered by isocode, which allows a binary search instead of the slow linear search the patch uses.
While reviewing it in more depth, I found a bug in dillo's code (fixed in CVS), and made a small UCS to Latin converter that helps with very weird pages like:
http://www.michigan.gov/minewswire/0,1607,7-136-3452_3518-51568--M_2002_9,00.html
Finally, I can't see (yet) how having this entities character representation helps when using another character set (as with the Russian encoding workaround), unless you first convert the regular HTML character encoding into the current locale! Certainly a hairy problem, and one of the main reasons for the GTK+2 design decision to use UCS instead of the current locale as its internal representation.
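To illustrate the binary search point, here is a rough sketch in C using the standard `bsearch()`. The struct `UcsApprox`, the function `ucs_to_latin1`, and the table contents are assumptions made up for this example; they do not reproduce the converter in CVS. The only requirement is that the table stays sorted by isocode.

```c
#include <stdlib.h>

/* Hypothetical entry: UCS code point -> plain Latin1/ASCII stand-in. */
typedef struct {
    int isocode;
    const char *latin1;
} UcsApprox;

/* Must stay sorted by isocode, ascending, for bsearch() to work. */
static const UcsApprox ucs_approx[] = {
    { 8211, "-"   },  /* en dash */
    { 8212, "--"  },  /* em dash */
    { 8216, "'"   },  /* left single quotation mark */
    { 8217, "'"   },  /* right single quotation mark */
    { 8220, "\""  },  /* left double quotation mark */
    { 8221, "\""  },  /* right double quotation mark */
    { 8230, "..." },  /* horizontal ellipsis */
};

/* bsearch() comparator: the key is an int, the element a table entry. */
static int cmp_isocode(const void *key, const void *elem)
{
    int code = *(const int *)key;
    int other = ((const UcsApprox *)elem)->isocode;

    return (code > other) - (code < other);
}

/* Returns the stand-in string for a code point, or NULL if not listed. */
static const char *ucs_to_latin1(int isocode)
{
    const UcsApprox *hit = bsearch(&isocode, ucs_approx,
                                   sizeof(ucs_approx) / sizeof(ucs_approx[0]),
                                   sizeof(ucs_approx[0]), cmp_isocode);

    return hit ? hit->latin1 : NULL;
}
```

A lookup costs O(log n) comparisons instead of a scan over the whole table, which matters more as entries are added. Code points missing from the table return NULL, so the caller decides what to show for them.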
In the future (GTK2 internals), all representation definitions and their use could be cut from the source code, and I don't see any problem or pain with that; the corresponding changes will not be big.
IMHO the only progress we can make now is entering more entities into the table.
In any case, I am interested in further progress along these lines.
P.S. The entities you mention cannot be translated from UCS2 (the encoding used in HTML numeric entities) to UTF-8. Lynx/links do not show them either.
I meant UCS to Latin1 (see the patch in CVS). Cheers Jorge.-