Hi Sebastian,

On Sun, Nov 18, 2012 at 02:29:32PM +0100, Sebastian Geerken wrote:
Hi!
At http://flpsed.org/hgweb/dillo_hyphen, you'll find some extensions for hyphenation I've not yet merged into the main repository. Still needs some documentation, but here is an overview:
There are now configuration variables for dillorc (see source): penalties for hyphens, as well as for the left and right side of an em-dash. The suffix "_2" means that the value is used for lines following a line which already ends with a hyphen. When this value is larger, two adjacent lines ending with a hyphen are avoided.
For values, see the definition of the "badness". Typical values:
  0 = Penalty used for normal spaces.
  1 = A justified line with spaces having 150% or 67% of the ideal space width has this as badness.
  8 = A justified line with spaces twice as wide as ideal has this as badness.
"inf" may be used (preventing a break in any case); also "-inf" (forcing a break), although the latter makes no sense and may lead to strange results.
There is a test page, test/hyphens-etc.html, to play around with.
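
As an illustration of the typical badness values listed above, here is a minimal sketch of one formula that reproduces them. It is an assumption, not taken from the dillo source: a TeX-style cubic badness in which a space may stretch by up to 50% or shrink by up to one third of its ideal width before the badness reaches 1. The names `badness`, `stretch` and `shrink` are hypothetical.

```python
# Hypothetical sketch: a cubic badness that reproduces the three typical
# values quoted above. The stretch/shrink limits are assumptions inferred
# from those values, not read from the dillo source.

def badness(space_ratio: float,
            stretch: float = 0.5,
            shrink: float = 1.0 / 3.0) -> float:
    """Return the badness of a justified line.

    space_ratio is (actual space width) / (ideal space width),
    so 1.0 means the spaces have exactly their ideal width.
    """
    if space_ratio >= 1.0:
        # Spaces are wider than ideal: measure against the stretch limit.
        adjustment = (space_ratio - 1.0) / stretch
    else:
        # Spaces are narrower than ideal: measure against the shrink limit.
        adjustment = (1.0 - space_ratio) / shrink
    return adjustment ** 3


if __name__ == "__main__":
    for ratio in (1.0, 1.5, 2.0 / 3.0, 2.0):
        print(f"spaces at {ratio:.0%} of ideal -> badness {badness(ratio):.2f}")
```

Under these assumptions, a hyphen penalty of 1 weighs about as much as stretching the spaces of a line to 150%, and a penalty of 8 about as much as doubling them.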
I wonder how breaking a single word in a line can be penalized with these controls. For instance, in [1], with both main dillo and dillo_hyphen, the word "hyphenation" is broken twice:

   hy-
   phen-
   ation

With the new controls, it could become:

   hyphen-
   ation

but in this particular case it should have been:

   hyphenation

On the same page there are several cases of the same problem (one line above, in a 1 row x 8 col table), where words are broken up to 4 times!

On the web, it's common to use the longest word in a line as the minimal width. OTOH, there's also the problem of overly long "word" strings.

In [1], the browser is clearly trying to optimize for a minimal page width. That is not the actual constraint here, but it could perfectly well have been one, imposed on the algorithm externally (by means of screen size, TABLE element directives, floats, etc.). So it is non-trivial.

I've worked enough on table rendering to know that making a decision based on the current textblock's min/max width would introduce too much complexity. E.g. in [1], just imagine the problem of deciding which words to break, and where, for a dynamic optimum of the table width. :-P

A much simpler approach would be to introduce a penalty for breaking a word that is alone in a line, unless the word is longer than a certain threshold, which could be relative to the browser window's width. For instance:

   penalty_one_word_line=5
   /* Penalty = (word_length > 1/4 window width) ? 0 : 5 */

or even simpler, in characters:

   penalty_one_word_line=18
   /* Don't try to break words shorter than 18 chars,
      when alone in a single line */

The advantage I see in a penalty that handles this case is that, with a simple dillorc option, it can help a lot with web rendering and also with more precise book rendering.

These are just ideas, not meant to be *the* solution. They have relatively simple implementations that could be field tested.

HTH.

[1] http://www.thefreedictionary.com/hyphenation

-- 
Cheers
Jorge.-
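
P.S. A minimal sketch of the two penalty_one_word_line variants above, in Python rather than dillo's C++ to keep it short. Everything here is hypothetical: the function names, the parameters, and the assumption that the returned value is an extra penalty added to each hyphenation break point of a word that is alone in its line. "Don't try to break" is modelled as an infinite penalty.

```python
# Hypothetical sketch of the proposed one-word-line penalty. None of these
# names exist in dillo; they only illustrate the two variants above.

INF = float("inf")  # an infinite penalty prevents the break in any case


def penalty_relative(word_width: float, window_width: float,
                     alone_in_line: bool, penalty: float = 5.0) -> float:
    """Window-relative variant: penalty_one_word_line=5.

    Words wider than a quarter of the window may be broken for free;
    narrower words that are alone in a line get the extra penalty.
    """
    if not alone_in_line:
        return 0.0
    return 0.0 if word_width > window_width / 4 else penalty


def penalty_chars(word: str, alone_in_line: bool,
                  min_chars: int = 18) -> float:
    """Character-based variant: penalty_one_word_line=18.

    A word shorter than min_chars that is alone in a line is never broken.
    """
    if alone_in_line and len(word) < min_chars:
        return INF
    return 0.0


if __name__ == "__main__":
    # "hyphenation" has 11 characters, so it would stay in one piece.
    print(penalty_chars("hyphenation", alone_in_line=True))
    # A 200px word in an 800px window is not wider than 1/4 of it.
    print(penalty_relative(200, 800, alone_in_line=True))
```

With the character-based variant and a threshold of 18, "hyphenation" in [1] would not be broken at all, while very long "word" strings could still be hyphenated.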