Workaround for bugs 396 & 445 (entities display)
Hello, all Dillo developers !
As mentioned in the HTML 4.01 spec, what should be done when an HTML renderer cannot display a symbol is up to the renderer. The current code renders all characters with codes 0..255 but silently skips all the others. I prefer the lynx/links approach to this problem: if they have to display a symbol that cannot be displayed properly in the current locale, they render an ASCII-only representation of it, e.g. "α" -> "a", "Χ" -> "X", "∗" -> "*".
The included small patch does that for all entities mentioned in html.c. Its table is much more compact than the one in links, and it does not transliterate foreign languages as links does.
I believe that this patch is a temporary solution; when dillo migrates to GTK2, the problem will be solved in a much more reliable and elegant way by using standard symbol fonts via fontsets.
I should also mention that if your encoding is not ISO-8859-1 (Latin-1), you should probably use the Russian patch http://stuphead.asplinux.ru/dillo/index.html for localization. It will be updated soon to translate entities to your encoding and show representations for the others.
Yours, Nikita
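For readers following the thread, here is a minimal sketch in C of the lynx/links-style fallback described in the message above. It is not the attached patch: the EntityFallback struct, the fallback table and entity_ascii_repr() are hypothetical names chosen for illustration, and only the three examples from the message are listed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fallback entry (not Dillo's real structure). */
typedef struct {
    const char *name;   /* entity name without '&' and ';' */
    int isocode;        /* Unicode code point of the entity */
    const char *repr;   /* ASCII-only stand-in to render instead */
} EntityFallback;

/* Only the three examples given in the message. */
static const EntityFallback fallback[] = {
    { "alpha",  945,  "a" },
    { "Chi",    935,  "X" },
    { "lowast", 8727, "*" },
};

/* Linear search by (case-sensitive) entity name.
 * Returns the ASCII stand-in, or NULL when none is known. */
static const char *entity_ascii_repr(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(fallback) / sizeof(fallback[0]); i++) {
        if (strcmp(fallback[i].name, name) == 0)
            return fallback[i].repr;
    }
    return NULL;
}
```

A caller would presumably keep the normal path for codes 0..255 and consult entity_ascii_repr() only for the rest, skipping the entity as before when it returns NULL.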
Hi Nikita, On Tue, 13 May 2003, Nikita V. Borodikhin wrote:
Hello, all Dillo developers !
As mentioned in the HTML 4.01 spec, what should be done when an HTML renderer cannot display a symbol is up to the renderer. The current code renders all characters with codes 0..255 but silently skips all the others. I prefer the lynx/links approach to this problem: if they have to display a symbol that cannot be displayed properly in the current locale, they render an ASCII-only representation of it, e.g. "α" -> "a", "Χ" -> "X", "∗" -> "*".
The included small patch does that for all entities mentioned in html.c. Its table is much more compact than the one in links, and it does not transliterate foreign languages as links does.
I believe that this patch is a temporary solution; when dillo migrates to GTK2, the problem will be solved in a much more reliable and elegant way by using standard symbol fonts via fontsets.
Yes, the plan is to handle them properly with GTK+2.
Now, considering the current development plan, it'd be good to have an interim workaround.
After reviewing the implementation in the patch, I'd prefer not to intermix it with the current (correct but incomplete) way of handling entities.
For instance, defining another entities table, to be used when the entity is not found in the current one, is a good way to separate the workaround code!
I'd like to add to this new table some important numeric character references (UTF-8) that are usually found in web pages these days. Things like: "—" "’" "“" "”"
This could help our rendering while we get to GTK+2. Please let me know if you want to improve it this way!
Cheers Jorge.-
PS: With regard to "saving entered data in forms", after some web research, I found it's somewhat unclear whether there's a "standard way" to do it. Some sources like caching and some others hate it.
As we know, caching form data can be a huge security threat. That mainly depends on what data is remembered and on the user's environment.
For instance, some banks refuse to work with browsers that cache pages (verifying browser signatures), in order to protect their customers from eavesdropping (as could happen when they make an operation and leave for a while to go to the bathroom: anyone in the same office can click Back a few times and read what was done, including the hand-typed fields).
I'd appreciate some more research on this topic (certainly a time-consuming task I can't work on now). Knowing what some serious browsers do is a good starting point. One of our objectives is personal security and privacy, and it takes a lot of work.
On Tue, 20 May 2003 13:39:23 -0400 (CLT) Jorge Arellano Cid <jcid@softhome.net> wrote:
Hi Nikita,
Hello, Jorge !
Now, considering the current development plan, it'd be good to have an interim workaround.
After reviewing the implementation in the patch, I'd prefer not to intermix it with the current (correct but incomplete) way of handling entities.
For instance, defining another entities table that's to be used when the entity is not found in the current one is a good way to separate the workaround code!
I'd like to add to this new table some important numeric character references (UTF-8) that are usually found in web pages these days. Things like:
"—" "’" "“" "”"
This could help our rendering while we get to GTK+2. Please let me know if you want to improve it this way!
Cheers Jorge.-
Jorge, the current way of displaying entities does not satisfy me because it throws away any entity whose code is greater than 255.
Is there any reason to leave the table without representations in place and add a new one _with_ them, which differs only in the representations field and a (possibly) bigger number of entities ?
In the future (GTK2 internals), all definition and use of representations could be cut from the source code; I don't see any problem or pain with that, and the corresponding changes would not be big.
IMHO the only progress we can make now is entering more entities in the table.
In any case, I am interested in further progress in this direction.
P.S. The entities you mention cannot be translated from UCS2 (the encoding used in HTML numeric entities) to UTF-8. Lynx/links do not show them either.
Yours, Nikita
Nikita,
Jorge, the current way of displaying entities does not satisfy me because it throws away any entity whose code is greater than 255.
Well, that is for Latin1.
Is there any reason to leave the table without representations in place and add a new one _with_ them, which differs only in the representations field and a (possibly) bigger number of entities ?
Maybe not, but I didn't suggest _that_ table. I thought it was clear even though implicit, but it seems that character handling code is always harder than it looks!
The main reason for separating the tables was to make clear which part is the workaround and which is the code to be used with UCS in the future. The other is that the second table can be ordered by isocode, which allows a binary search instead of the slow linear search the patch uses.
While reviewing it in more depth, I found a bug in dillo's code (fixed in CVS), and made a small UCS to Latin1 converter that helps with very weird pages like: http://www.michigan.gov/minewswire/0,1607,7-136-3452_3518-51568--M_2002_9,00.html
Finally, I can't see (yet) how having these entities' character representations helps when using another character set (as with the Russian encoding workaround), unless you first convert the regular HTML character encoding into the current locale! Certainly a hairy problem, and one of the main reasons for the GTK+2 design decision of using UCS instead of the current locale as its internal representation.
In the future (GTK2 internals), all definition and use of representations could be cut from the source code; I don't see any problem or pain with that, and the corresponding changes would not be big.
IMHO the only progress we can make now is entering more entities in the table.
In any case, I am interested in further progress in this direction.
P.S. The entities you mention cannot be translated from UCS2 (the encoding used in HTML numeric entities) to UTF-8. Lynx/links do not show them either.
I meant UCS to Latin1 (see the patch in CVS). Cheers Jorge.-
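To illustrate the point in the message above about ordering the second table by isocode, here is a rough C sketch of a workaround table searched with the standard bsearch(). It is not the converter committed to CVS: the UcsFallback struct, the ucs_fallback table and ucs_to_latin1_repr() are hypothetical names, and the chosen ASCII stand-ins for the four characters mentioned earlier in the thread are assumptions.

```c
#include <stdlib.h>

/* Hypothetical workaround entry, keyed by code point. */
typedef struct {
    int isocode;        /* UCS code point */
    const char *repr;   /* assumed Latin-1/ASCII stand-in */
} UcsFallback;

/* Must stay sorted by isocode, otherwise bsearch() is not valid. */
static const UcsFallback ucs_fallback[] = {
    { 8212, "-"  },  /* U+2014, em dash */
    { 8217, "'"  },  /* U+2019, right single quotation mark */
    { 8220, "\"" },  /* U+201C, left double quotation mark */
    { 8221, "\"" },  /* U+201D, right double quotation mark */
};

/* bsearch() comparator: the key is a pointer to the code being sought. */
static int cmp_isocode(const void *key, const void *elem)
{
    int code = *(const int *)key;
    const UcsFallback *e = (const UcsFallback *)elem;

    return (code > e->isocode) - (code < e->isocode);
}

/* Binary search by code point.
 * Returns the stand-in string, or NULL when the code is not listed. */
static const char *ucs_to_latin1_repr(int isocode)
{
    const UcsFallback *hit = bsearch(&isocode, ucs_fallback,
                                     sizeof(ucs_fallback) / sizeof(ucs_fallback[0]),
                                     sizeof(ucs_fallback[0]), cmp_isocode);

    return hit ? hit->repr : NULL;
}
```

Keeping this table apart from the regular entities table matches the separation argued for above: the lookup runs only when the normal path fails, and the whole table can be dropped once UCS is handled natively.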
Hello Jorge ! On Fri, 30 May 2003 12:52:04 -0400 (CLT) Jorge Arellano Cid <jcid@softhome.net> wrote:
Is there any reason to leave the table without representations in place and add a new one _with_ them, which differs only in the representations field and a (possibly) bigger number of entities ?
Maybe not, but I didn't suggest _that_ table.
I thought that although implicit it was clear, but it seems that character handling code is always harder than it seems!
The main reason for separating the tables was to make clear what was the workaround and what the code to be used with UCS in the future.
The other is that the second table can be ordered by isocode and thus allow for a bin search instead of the slow linear search the patch uses.
OK, Jorge, you convinced me. Having a separate table is really better. New patch (against current CVS) is attached. Yours, Nikita.
On Tue, 20 May 2003 13:39:23 -0400 (CLT) Jorge Arellano Cid <jcid@softhome.net> wrote:
PS: With regard to "saving entered data in forms", after some web research, I found it's somewhat unclear whether there's a "standard way" to do it. Some sources like caching and some others hate it.
As we know, caching form data can be a huge security threat. That mainly depends on what data is remembered and on the user's environment.
For instance, some banks refuse to work with browsers that cache pages (verifying browser signatures), in order to protect their customers from eavesdropping (as could happen when they make an operation and leave for a while to go to the bathroom: anyone in the same office can click Back a few times and read what was done, including the hand-typed fields).
I'd appreciate some more research on this topic (certainly a time-consuming task I can't work on now). Knowing what some serious browsers do is a good starting point. One of our objectives is personal security and privacy, and it takes a lot of work.
Jorge, at least until dillo has a normal cache (I mean cached page expiration), there is not much reason to work on it further. After that the approach must be different: values should be saved in the cache (or in a file ?), not in the browser window. Until a normal cache appears in the main branch there is not much reason to reimplement it, IMHO. Nikita
participants (2)
- Jorge Arellano Cid
- Nikita V. Borodikhin