If you've been following along on fltk.general, you'll know we've been in discussion over some unpleasantness regarding Turkic locales and strcasecmp/toupper/tolower where 'i' and 'I' are different letters. ( http://en.wikipedia.org/wiki/Dotted_and_dotless_I )
The typical solution is to make sort of ASCIIfied versions where you force i<->I. I think that, for our purposes, we can just use the ordinary POSIX locale and save ourselves the trouble.
To see this behaviour:
   export LC_ALL=tr_TR
and then look at:
   <span style="font-size: 0.5IN">these letters should be big</span> <p> <I>This text should be italic</I>
On Fri, Oct 21, 2011 at 05:29:49AM +0000, corvid wrote:
If you've been following along on fltk.general, you'll know we've been in discussion over some unpleasantness regarding Turkic locales and strcasecmp/toupper/tolower where 'i' and 'I' are different letters.
No I haven't, but if you were into the subject, please patch as necessary.
( http://en.wikipedia.org/wiki/Dotted_and_dotless_I )
The typical solution is to make sort of ASCIIfied versions where you force i<->I.
I think that, for our purposes, we can just use the ordinary POSIX locale and save ourselves the trouble.
FWIW, I don't remember the details but it seems to me that long ago we had a thread on the problems of changing the locale set by the user. -- Cheers Jorge.-
Jorge wrote:
FWIW, I don't remember the details but it seems to me that long ago we had a thread on the problems of changing the locale set by the user.
*has a look through the archives*
What I could find was:
- In downloads dpi, having to set C locale in the environment before exec()ing wget in order to parse the log correctly.
- Dillo resetting to C locale for gtk cmdline option parsing.
- Temporarily resetting LC_NUMERIC to C locale for prefs parsing.
I don't believe LC_NUMERIC is currently set in the fltk-1.3 era, although it might be good to reset in general because of the strtod() in Html_parse_length_or_multi_length().
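The strtod() concern can be illustrated with a minimal sketch that switches LC_NUMERIC to "C" around the call and restores it afterwards. The helper name parse_double_c_locale is hypothetical and is not taken from Dillo's sources; it only shows the general save/reset/restore pattern.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not Dillo code): parse a double with "C" locale
 * rules for the decimal point, then restore the previous LC_NUMERIC. */
static double parse_double_c_locale(const char *s, char **end)
{
   double val;
   char *saved = NULL;
   const char *cur;

   assert(s != NULL);
   cur = setlocale(LC_NUMERIC, NULL);   /* query only, changes nothing */
   if (cur) {
      /* Copy the name: later setlocale() calls may overwrite it. */
      saved = malloc(strlen(cur) + 1);
      if (saved)
         strcpy(saved, cur);
   }
   setlocale(LC_NUMERIC, "C");
   val = strtod(s, end);
   if (saved) {
      setlocale(LC_NUMERIC, saved);
      free(saved);
   }
   return val;
}
```

With a locale whose decimal separator is a comma, a plain strtod("1.5", ...) would stop at the dot; the wrapper is meant to avoid that. Note that setlocale() affects the whole process, so this pattern is not safe if other threads depend on the locale at the same time.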
Hello corvid. I gave you wrong information. I don't set the locale explicitly; it is set implicitly (probably through LANG, but I'm not sure). The locale command shows my current locale settings:
   LANG=ja_JP.eucJP
   LC_CTYPE="ja_JP.eucJP"
   LC_COLLATE="ja_JP.eucJP"
   LC_TIME="ja_JP.eucJP"
   LC_NUMERIC="ja_JP.eucJP"
   LC_MONETARY="ja_JP.eucJP"
   LC_MESSAGES="ja_JP.eucJP"
   LC_ALL=
Your locale patch (c6cbf3ae7ffd) has a side effect on my dillo: while converting to Kanji, the font width is halved and the text is not easy to recognize. Could you add a locale setting to the preferences so that dillo uses the preferred locale rather than the default "C" at startup? Regards, furaisanjin
Here's an idea of what we'd get: http://www.dillo.org/test/ascii_strcasecmp.diff
- The naming is clear but kind of ugly.
- I haven't added explanatory comments yet.
- For the moment, it just #defines dStr(n)casecmp to the ASCII version because there are a bunch of them, but I'd go through and change the calls.
Thoughts?
Skimming through the diff now, I'm noticing the cases like
   if (dASCIItolower(attrbuf[0]) == 'j')
and thinking maybe
   if (attrbuf[0] == 'j' || attrbuf[0] == 'J')
would be better, but then this issue is probably too tiny to waste precious seconds of my life in noticing. (Whereas continuing to use tolower() because we know that 'j' is safe would be bad practice, in my opinion.)
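For readers without access to the diff, the general idea of an ASCII-only comparison can be sketched as follows. This is not the code from the patch; the names ascii_tolower and ascii_strcasecmp are hypothetical, and the sketch assumes an ASCII-compatible execution character set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, not taken from the diff.
 * Fold only 'A'..'Z'; every other value is returned unchanged,
 * so the result never depends on the current locale. */
static int ascii_tolower(int c)
{
   return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Locale-independent counterpart of strcasecmp(). */
static int ascii_strcasecmp(const char *s1, const char *s2)
{
   const unsigned char *a = (const unsigned char *)s1;
   const unsigned char *b = (const unsigned char *)s2;

   assert(s1 != NULL && s2 != NULL);
   while (*a && ascii_tolower(*a) == ascii_tolower(*b)) {
      a++;
      b++;
   }
   return ascii_tolower(*a) - ascii_tolower(*b);
}
```

Because the folding is hard-coded, "0.5IN" and "0.5in", or "I" and "i", compare equal regardless of LC_ALL, which is the behaviour HTML and CSS keyword matching needs.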
On Sun, Oct 23, 2011 at 05:55:30AM +0000, corvid wrote:
Here's an idea of what we'd get:
http://www.dillo.org/test/ascii_strcasecmp.diff
- The naming is clear but kind of ugly. - I haven't added explanatory comments yet. - For the moment, it just #defines dStr(n)casecmp to the ASCII version because there are a bunch of them, but I'd go through and change the calls.
Thoughts?
I think it's nasty, but the right way to fix this. Did you find any cases where the localized version actually makes sense? Not sure what a lowercase Roman 2 should look like in tr_TR :-)
Skimming through the diff now, I'm noticing the cases like if (dASCIItolower(attrbuf[0]) == 'j') and thinking maybe if (attrbuf[0] == 'j' || attrbuf[0] == 'J') would be better, but then this issue is probably too tiny to waste precious seconds of my life in noticing.
I would guess that you end up with pretty much the same assembler code in both cases.
(Whereas continuing to use tolower() because we know that 'j' is safe would be bad practice, in my opinion.)
right. Independent of this problem I think we should decide on the layering of our base libraries (lout and dlib) to avoid code duplication. My favorite option would be basing dlib on lout. Any opinions? Cheers, Johannes
Johannes wrote:
Did you find any cases where the localized version actually makes sense?
I left the findtext code alone, since neither case will work well anyway.
Independent of this problem I think we should decide on the layering of our base libraries (lout and dlib) to avoid code duplication. My favorite option would be basing dlib on lout. Any opinions?
It would make sense. Speaking of lout, we still have #define prefs_show_msg 1 in lout/msg.h. Oh, and there's the BitSet/Bitvec duplication...
On Sun, Oct 23, 2011 at 10:28:43AM +0200, Johannes Hofmann wrote:
On Sun, Oct 23, 2011 at 05:55:30AM +0000, corvid wrote:
Here's an idea of what we'd get:
http://www.dillo.org/test/ascii_strcasecmp.diff
- The naming is clear but kind of ugly. - I haven't added explanatory comments yet. - For the moment, it just #defines dStr(n)casecmp to the ASCII version because there are a bunch of them, but I'd go through and change the calls.
Thoughts?
I think it's nasty, but the right way to fix this.
Yeah. After all, locales/utf-8 is seldom a simple thing to handle right. If we'd have to:
   s/dStrcasecmp/dStrASCIIcasecmp/g
   s/dStrncasecmp/dStrnASCIIcasecmp/g
then I'd keep the old function name and explain in the function's comment why it is restricted to, or handled in the new way. From a distance this global change looks to me like exposing ourselves to more troubles. In other words, why not:
   int dStrcasecmp(...)
   {
      if (locale has special rules)
         special treatment
      else
         strcasecmp(...)
   }
Disclaimer: I haven't studied the problem in detail. -- Cheers Jorge.-
Jorge wrote:
On Sun, Oct 23, 2011 at 10:28:43AM +0200, Johannes Hofmann wrote:
On Sun, Oct 23, 2011 at 05:55:30AM +0000, corvid wrote:
Here's an idea of what we'd get:
http://www.dillo.org/test/ascii_strcasecmp.diff
- The naming is clear but kind of ugly. - I haven't added explanatory comments yet. - For the moment, it just #defines dStr(n)casecmp to the ASCII version because there are a bunch of them, but I'd go through and change the calls.
Thoughts?
I think it's nasty, but the right way to fix this.
Yeah. After all, locales/utf-8 is seldom a simple thing to handle right.
If we'd have to:
s/dStrcasecmp/dStrASCIIcasecmp/g s/dStrncasecmp/dStrnASCIIcasecmp/g
then I'd keep the old function name and explain in the function's comment why it is restricted to, or handled in the new way.
Calling it 'strcasecmp' when it isn't really strcasecmp seems dangerous.
From a distance this global change looks to me like exposing ourselves to more troubles. In other words, why not:
int dStrcasecmp(...) { if (locale has special rules) special treatment else strcasecmp(...) }
Disclaimer: I haven't studied the problem in detail.
Having to know about the locales would be trouble. Judging by the wikipedia page, we'd have to look for 'tr' and 'az' for sure, and they've been considering switching back to Latin script in Kazakhstan, and probably not Tatar because they're mostly Cyrillic, but probably yes to Crimean Tatar because it looks like they're a bit more Latin than Cyrillic at the moment. So it's a touchy enough situation that different libc's may have different ideas of it all, and individual ones will change as circumstances change, and it would make more sense to have a test like checking whether toupper('i') == 'I'. And better than that would be to go through the whole ASCII alphabet at initialization time and see whether everything is as we wish, and then set some ptrs to functions accordingly.
On Mon, Oct 31, 2011 at 04:03:38AM +0000, corvid wrote:
Jorge wrote:
On Sun, Oct 23, 2011 at 10:28:43AM +0200, Johannes Hofmann wrote:
On Sun, Oct 23, 2011 at 05:55:30AM +0000, corvid wrote:
Here's an idea of what we'd get:
http://www.dillo.org/test/ascii_strcasecmp.diff
- The naming is clear but kind of ugly. - I haven't added explanatory comments yet. - For the moment, it just #defines dStr(n)casecmp to the ASCII version because there are a bunch of them, but I'd go through and change the calls.
Thoughts?
I think it's nasty, but the right way to fix this.
Yeah. After all, locales/utf-8 is seldom a simple thing to handle right.
If we'd have to:
s/dStrcasecmp/dStrASCIIcasecmp/g s/dStrncasecmp/dStrnASCIIcasecmp/g
then I'd keep the old function name and explain in the function's comment why it is restricted to, or handled in the new way.
Calling it 'strcasecmp' when it isn't really strcasecmp seems dangerous.
Good point...
From a distance this global change looks to me like exposing ourselves to more troubles. In other words, why not:
int dStrcasecmp(...) { if (locale has special rules) special treatment else strcasecmp(...) }
Disclaimer: I haven't studied the problem in detail.
Having to know about the locales would be trouble. Judging by the wikipedia page, we'd have to look for 'tr' and 'az' for sure, and they've been considering switching back to Latin script in Kazakhstan, and probably not Tatar because they're mostly Cyrillic, but probably yes to Crimean Tatar because it looks like they're a bit more Latin than Cyrillic at the moment.
So it's a touchy enough situation that different libc's may have different ideas of it all, and individual ones will change as circumstances change,
If libc has not nailed it already, it must be hard to solve the problem in a generic way.
and it would make more sense to have a test like checking whether toupper('i') == 'I'.
And better than that would be to go through the whole ASCII alphabet at initialization time and see whether everything is as we wish, and then set some ptrs to functions accordingly.
+1 (or just test the subset we know we can handle). -- Cheers Jorge.-
Jorge wrote:
On Mon, Oct 31, 2011 at 04:03:38AM +0000, corvid wrote:
Jorge wrote:
From a distance this global change looks to me like exposing ourselves to more troubles. In other words, why not:
int dStrcasecmp(...) { if (locale has special rules) special treatment else strcasecmp(...) }
Disclaimer: I haven't studied the problem in detail.
Having to know about the locales would be trouble. Judging by the wikipedia page, we'd have to look for 'tr' and 'az' for sure, and they've been considering switching back to Latin script in Kazakhstan, and probably not Tatar because they're mostly Cyrillic, but probably yes to Crimean Tatar because it looks like they're a bit more Latin than Cyrillic at the moment.
So it's a touchy enough situation that different libc's may have different ideas of it all, and individual ones will change as circumstances change,
If libc has not nailed it already, it must be hard to solve the problem in a generic way.
There are strcasecmp_l() and strncasecmp_l(), but they were only introduced in a very recent version of posix. http://www.gnu.org/s/hello/manual/gnulib/strcasecmp_005fl.html says "This function is missing on many platforms: MacOS X 10.3, FreeBSD 6.0, NetBSD 5.0, OpenBSD 3.8, AIX 5.1, HP-UX 11, IRIX 6.5, OSF/1 5.1, Solaris 11 2010-11, Cygwin, mingw, Interix 3.5, BeOS." As for whether strcasecmp_l() is cheap, in any case, I don't know.
Jorge wrote:
On Mon, Oct 31, 2011 at 04:03:38AM +0000, corvid wrote:
Jorge wrote:
From a distance this global change looks to me like exposing ourselves to more troubles. In other words, why not:
int dStrcasecmp(...) { if (locale has special rules) special treatment else strcasecmp(...) }
Disclaimer: I haven't studied the problem in detail.
Having to know about the locales would be trouble. Judging by the wikipedia page, we'd have to look for 'tr' and 'az' for sure, and they've been considering switching back to Latin script in Kazakhstan, and probably not Tatar because they're mostly Cyrillic, but probably yes to Crimean Tatar because it looks like they're a bit more Latin than Cyrillic at the moment.
So it's a touchy enough situation that different libc's may have different ideas of it all, and individual ones will change as circumstances change,
If libc has not nailed it already, it must be hard to solve the problem in a generic way.
and it would make more sense to have a test like checking whether toupper('i') == 'I'.
And better than that would be to go through the whole ASCII alphabet at initialization time and see whether everything is as we wish, and then set some ptrs to functions accordingly.
+1
(or just test the subset we know we can handle).
Here's what I ended up with when I added in some initialization on the dlib side: http://www.dillo.org/test/ascii_strcasecmp2.diff
On Fri, Nov 04, 2011 at 02:41:28AM +0000, corvid wrote:
Jorge wrote:
On Mon, Oct 31, 2011 at 04:03:38AM +0000, corvid wrote:
Jorge wrote:
From a distance this global change looks to me like exposing ourselves to more troubles. In other words, why not:
int dStrcasecmp(...) { if (locale has special rules) special treatment else strcasecmp(...) }
Disclaimer: I haven't studied the problem in detail.
Having to know about the locales would be trouble. Judging by the wikipedia page, we'd have to look for 'tr' and 'az' for sure, and they've been considering switching back to Latin script in Kazakhstan, and probably not Tatar because they're mostly Cyrillic, but probably yes to Crimean Tatar because it looks like they're a bit more Latin than Cyrillic at the moment.
So it's a touchy enough situation that different libc's may have different ideas of it all, and individual ones will change as circumstances change,
If libc has not nailed it already, it must be hard to solve the problem in a generic way.
and it would make more sense to have a test like checking whether toupper('i') == 'I'.
And better than that would be to go through the whole ASCII alphabet at initialization time and see whether everything is as we wish, and then set some ptrs to functions accordingly.
+1
(or just test the subset we know we can handle).
Here's what I ended up with when I added in some initialization on the dlib side: http://www.dillo.org/test/ascii_strcasecmp2.diff
Hm, what's the point of the dynamic check? We don't want our case insensitive compare function to depend on the user locale, so I would say using strcasecmp(3) and tolower(3) is just wrong in our case - even if it happens to return the right values in some setups. Why not just use a custom function as in the first patch? Or maybe test for strncasecmp_l() in configure and provide a custom solution if it doesn't exist? Cheers, Johannes
Johannes wrote:
Hm, what's the point of the dynamic check? We don't want our case insensitive compare function to depend on the user locale, so I would say using strcasecmp(3) and tolower(3) is just wrong in our case - even if it happens to return the right values in some setups. Why not just use a custom function as in the first patch?
I was curious whether something closer to what Jorge was talking about would turn out to be more pleasing in some way, and maybe a tad quicker, for that matter, with their optimized functions, but...yeah, it didn't really look nicer to me, either.
Or maybe test for strncasecmp_l() in configure and provide a custom solution if it doesn't exist?
I'm willing to give that a look as well. At first I was reluctant to have extra locale calls when the string functions are so cheap to begin with, but...
On Sat, Nov 05, 2011 at 06:45:19PM +0000, corvid wrote:
Johannes wrote:
Hm, what's the point of the dynamic check? We don't want our case insensitive compare function to depend on the user locale, so I would say using strcasecmp(3) and tolower(3) is just wrong in our case - even if it happens to return the right values in some setups. Why not just use a custom function as in the first patch?
I was curious whether something closer to what Jorge was talking about would turn out to be more pleasing in some way, and maybe a tad quicker, for that matter, with their optimized functions, but...yeah, it didn't really look nicer to me, either.
My idea was to have a custom function for the turkish locale (eventually extendable to a bigger set if the need arises), and to use the defaults for all the other cases. With the custom function handling the special case(s), and not handling all characters as special.
Or maybe test for strncasecmp_l() in configure and provide a custom solution if it doesn't exist?
Yes, this would also add to the solution.
I'm willing to give that a look as well. At first I was reluctant to have extra locale calls when the string functions are so cheap to begin with, but...
-- Cheers Jorge.-
I wrote:
Johannes wrote:
Or maybe test for strncasecmp_l() in configure and provide a custom solution if it doesn't exist?
I'm willing to give that a look as well. At first I was reluctant to have extra locale calls when the string functions are so cheap to begin with, but...
I thought the extra argument to the *_l functions was a string like "C", but it's a locale_t, which also appears to be something new. This makes it not very nice to simulate. Hmph.
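To make the locale_t point concrete, here is a minimal sketch of what using the POSIX.1-2008 interface involves. It assumes a platform that provides newlocale() and strcasecmp_l() through <locale.h> and <strings.h> (some systems, such as Mac OS X, declare them in <xlocale.h> instead); the wrapper name casecmp_c_locale is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* request POSIX.1-2008 declarations */
#endif
#include <assert.h>
#include <locale.h>
#include <strings.h>

/* Locale object for "C", created lazily on first use. */
static locale_t c_locale = (locale_t)0;

/* Hypothetical wrapper: compare ignoring case under "C" locale rules,
 * whatever the process-wide locale happens to be. */
static int casecmp_c_locale(const char *s1, const char *s2)
{
   assert(s1 != NULL && s2 != NULL);
   if (c_locale == (locale_t)0)
      c_locale = newlocale(LC_ALL_MASK, "C", (locale_t)0);
   assert(c_locale != (locale_t)0);   /* a real caller would handle failure */
   return strcasecmp_l(s1, s2, c_locale);
}
```

The extra object has to be created with newlocale() and eventually released with freelocale(), which is why emulating the *_l functions on platforms that lack locale_t is awkward.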
On Sun, Nov 06, 2011 at 04:03:47AM +0000, corvid wrote:
I wrote:
Johannes wrote:
Or maybe test for strncasecmp_l() in configure and provide a custom solution if it doesn't exist?
I'm willing to give that a look as well. At first I was reluctant to have extra locale calls when the string functions are so cheap to begin with, but...
I thought the extra argument to the *_l functions was a string like "C", but it's a locale_t, which also appears to be something new. This makes it not very nice to simulate. Hmph.
I'd just go with the approach in your first patch.
Johannes wrote:
I'd just go with the approach in your first patch.
All right then. Do you have any thoughts on what function names would be least unwieldy / most descriptive / best fitting in with their surroundings? I'm thinking maybe dAsciiToupper() and dStrAsciiCasecmp().
On Sun, Nov 06, 2011 at 08:19:07PM +0000, corvid wrote:
Johannes wrote:
I'd just go with the approach in your first patch.
All right then. Do you have any thoughts on what function names would be least unwieldy / most descriptive / best fitting in with their surroundings?
I'm thinking maybe dAsciiToupper() and dStrAsciiCasecmp().
Let's see what others do:
* KDE: kAsciiToUpper() [1]
* glib: g_ascii_toupper() [2]
* ffmpeg: av_toupper() [3]
* libevent: evutil_ascii_strcasecmp() [4]
So dAsciiToupper() and dStrAsciiCasecmp() look good to me. Cheers, Johannes
[1] http://api.kde.org/4.x-api/kdelibs-apidocs/kdecore/html/kascii_8h.html
[2] http://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html
[3] http://permalink.gmane.org/gmane.comp.video.ffmpeg.cvs/43420
[4] http://www.wangafu.net/~nickm/libevent-2.0/doxygen/html/util_8h.html
On Mon, Nov 07, 2011 at 10:25:39PM +0100, Johannes Hofmann wrote:
On Sun, Nov 06, 2011 at 08:19:07PM +0000, corvid wrote:
Johannes wrote:
I'd just go with the approach in your first patch.
All right then. Do you have any thoughts on what function names would be least unwieldy / most descriptive / best fitting in with their surroundings?
I'm thinking maybe dAsciiToupper() and dStrAsciiCasecmp().
Let's see what others do:
* KDE: kAsciiToUpper() [1] * glib: g_ascii_toupper() [2] * ffmpeg: av_toupper() [3] * libevent: evutil_ascii_strcasecmp() [4]
So dAsciiToupper() and dStrAsciiCasecmp() look good to me.
+1 -- Cheers Jorge.-
Jorge wrote:
On Mon, Nov 07, 2011 at 10:25:39PM +0100, Johannes Hofmann wrote:
On Sun, Nov 06, 2011 at 08:19:07PM +0000, corvid wrote:
Johannes wrote:
I'd just go with the approach in your first patch.
All right then. Do you have any thoughts on what function names would be least unwieldy / most descriptive / best fitting in with their surroundings?
I'm thinking maybe dAsciiToupper() and dStrAsciiCasecmp().
Let's see what others do:
* KDE: kAsciiToUpper() [1] * glib: g_ascii_toupper() [2] * ffmpeg: av_toupper() [3] * libevent: evutil_ascii_strcasecmp() [4]
So dAsciiToupper() and dStrAsciiCasecmp() look good to me.
+1
I think I'll go with D_ASCII_TOUPPER because the argument gets evaluated more than once, and we don't want nasty surprises. (I'm tempted to say "Or ASCII_TOUPPER?", but it'd be nice for the whole topic to come to an end, wouldn't it? :)
And *sigh* dStristr -> dStriAsciiStr, which seems more readable than dStrAsciiIStr.
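As an illustration of the two names just mentioned, a macro and an ASCII-only substring search could be sketched like this. Only the name D_ASCII_TOUPPER comes from the thread; its definition here and the function ascii_stristr are assumptions for illustration, not the contents of the eventual patch, and the 0x20 offset assumes an ASCII-compatible character set.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definition. The argument is evaluated more than once, which is
 * why an all-caps macro name is used: never pass it p++ or similar. */
#define D_ASCII_TOUPPER(c) (((c) >= 'a' && (c) <= 'z') ? (c) - 0x20 : (c))

/* Hypothetical sketch: case-insensitive (ASCII only) substring search.
 * Returns a pointer to the first match, or NULL if there is none. */
static const char *ascii_stristr(const char *haystack, const char *needle)
{
   size_t i;

   assert(haystack != NULL && needle != NULL);
   for (;; haystack++) {
      /* Compare needle against the text starting at haystack. */
      for (i = 0; needle[i]; i++) {
         if (D_ASCII_TOUPPER(haystack[i]) != D_ASCII_TOUPPER(needle[i]))
            break;
      }
      if (!needle[i])
         return haystack;      /* whole needle matched */
      if (!*haystack)
         return NULL;          /* ran out of text */
   }
}
```

The inner loop stops at the haystack's terminating NUL because it can never equal a non-NUL needle character, so the search does not read past the end of the string.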
On Tue, Nov 08, 2011 at 05:42:26PM +0000, corvid wrote:
Jorge wrote:
On Mon, Nov 07, 2011 at 10:25:39PM +0100, Johannes Hofmann wrote:
On Sun, Nov 06, 2011 at 08:19:07PM +0000, corvid wrote:
Johannes wrote:
I'd just go with the approach in your first patch.
All right then. Do you have any thoughts on what function names would be least unwieldy / most descriptive / best fitting in with their surroundings?
I'm thinking maybe dAsciiToupper() and dStrAsciiCasecmp().
Let's see what others do:
* KDE: kAsciiToUpper() [1] * glib: g_ascii_toupper() [2] * ffmpeg: av_toupper() [3] * libevent: evutil_ascii_strcasecmp() [4]
So dAsciiToupper() and dStrAsciiCasecmp() look good to me.
+1
I think I'll go with D_ASCII_TOUPPER because the argument gets evaluated more than once, and we don't want nasty surprises.
Hm, not sure what you mean. An inline function should behave just as a normal function; that's the nice thing about them. Let's test:
   #include <stdio.h>
   #include <stdlib.h>   /* atoi() */

   /* Function: the argument is evaluated exactly once. */
   static inline int test(int i) { return i > 0 ? i - 1 : i; }

   /* Macro: the argument is evaluated twice. */
   #define TEST(i) ((i) > 0 ? (i) - 1 : (i))

   int main (int argc, char **argv)
   {
      int i = atoi(argv[1]);
      int j = atoi(argv[1]);

      printf("%d\n", test(i++));
      printf("%d\n", TEST(j++));
      return 0;
   }

   ./inline 2
   1
   2
Cheers, Johannes
Johannes wrote:
On Tue, Nov 08, 2011 at 05:42:26PM +0000, corvid wrote:
I think I'll go with D_ASCII_TOUPPER because the argument gets evaluated more than once, and we don't want nasty surprises.
Hm, not sure what you mean. An inline function should behave just as a normal function; that's the nice thing about them.
If I'm in dlib.h and it's C90, can I use inline? I was under the impression that I couldn't.
On Tue, Nov 08, 2011 at 08:15:57PM +0000, corvid wrote:
Johannes wrote:
On Tue, Nov 08, 2011 at 05:42:26PM +0000, corvid wrote:
I think I'll go with D_ASCII_TOUPPER because the argument gets evaluated more than once, and we don't want nasty surprises.
Hm, not sure what you mean. An inline function should behave just as a normal function; that's the nice thing about them.
If I'm in dlib.h and it's C90, can I use inline? I was under the impression that I couldn't.
Argh, I knew there was a catch :-) So all uppercase macros then or just not trying to inline and use a normal function. Cheers, Johannes
Johannes wrote:
So all uppercase macros then or just not trying to inline and use a normal function.
Here's what I have now: http://www.dillo.org/test/ascii_strcasecmp3.diff
On Tue, Nov 08, 2011 at 11:18:12PM +0000, corvid wrote:
Johannes wrote:
So all uppercase macros then or just not trying to inline and use a normal function.
Here's what I have now: http://www.dillo.org/test/ascii_strcasecmp3.diff
Looks good to me. Cheers, Johannes
Hi Johannes, On Sun, Oct 23, 2011 at 10:28:43AM +0200, Johannes Hofmann wrote:
[...] Independent of this problem I think we should decide on the layering of our base libraries (lout and dlib) to avoid code duplication. My favorite option would be basing dlib on lout. Any opinions?
Historically:
* the reason for using C++ (partly) in Dillo was FLTK (to be able to use its C++ API).
* Sebastian and I worked independently (Dw/rest respectively) and ended with two libs.
* This approach allowed for Dw and Dlib to be standalone (thus helping Sebastian with other Dw projects and me/others to reuse Dlib in other pure C projects). Which is a good thing!
One way to keep the advantages is to base lout on dlib, mixing C/C++ for lout; this may be uglier/less convenient than the current status quo, so it would be a trade-off. OTOH basing dlib on lout would force using C++ in pure C programs. :-P
-- Cheers Jorge.-
participants (4)
- corvid@lavabit.com
- furaisanjin@gmail.com
- jcid@dillo.org
- Johannes.Hofmann@gmx.de