On my personal website <https://256-32.com/>, Dillo reports the following bug in a link that uses special characters:

HTML warning: line 64, URL has 8 illegal bytes in {00-1F, 7F-FF} range ('/computers/důvěřivý')

However, the HTTP response sent by my server specifies charset=utf-8 in the Content-Type header. These UTF-8 links work properly in Dillo despite the warning, and I believe my site is not in error here. <validator.w3.org> says my site is fine, so it seems there's nothing wrong with special characters in URLs. Dillo also doesn't report the same bytes as bad when they appear outside of a link href. The warning seems to originate from src/html.cc:196.
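For reference, the reported count matches the raw UTF-8 bytes of the path: each of the four accented letters encodes to two bytes in the {7F-FF} range. A minimal sketch of such a byte check in Python (not Dillo's actual code, which is in C++):

```python
def illegal_url_bytes(url: str) -> int:
    """Count bytes in the {00-1F, 7F-FF} range, as Dillo's warning reports.

    Each non-ASCII character contributes one count per UTF-8 byte,
    so a two-byte character like 'ů' counts twice.
    """
    return sum(1 for b in url.encode("utf-8") if b <= 0x1F or b >= 0x7F)

print(illegal_url_bytes("/computers/důvěřivý"))  # prints 8: ů, ě, ř, ý are 2 bytes each
```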
Hi,

On Sun, May 10, 2026 at 03:25:07AM +0100, 256@256-32.com wrote:
> On my personal website <https://256-32.com/>, Dillo reports the following bug in a link that uses special characters: HTML warning: line 64, URL has 8 illegal bytes in {00-1F, 7F-FF} range ('/computers/důvěřivý').
Yes, we follow RFC 3986 and the HTML 4.01 recommendation of marking illegal characters outside the unreserved set: https://www.w3.org/TR/html401/appendix/notes.html#h-B.2

In HTML by the WHATWG they added exceptions for UTF-8 URLs, but I don't think it's a good idea. This breaks software that doesn't handle UTF-8 URLs (i.e. anything that follows the RFC, not what Google says).

My recommendation is to percent-encode the URL: https://256-32.com/computers/d%C5%AFv%C4%9B%C5%99iv%C3%BD

This is the URL that is actually sent over HTTP, even though modern browsers don't render it as-is in the location bar. One of the problems with that is the Unicode "confusables": characters that render very similarly but are different, like this: https://аpple.com See: https://www.xudongz.com/blog/2017/idn-phishing/

Best,
Rodrigo.
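The recommended encoding can be produced mechanically: each non-ASCII character is converted to its UTF-8 bytes, and each byte is written as %XX. A sketch using Python's standard library (the round trip back to the original path is shown as a sanity check):

```python
from urllib.parse import quote, unquote

path = "/computers/důvěřivý"

# quote() percent-encodes each UTF-8 byte of characters outside the
# safe set; "/" is in the default safe set and is left as a delimiter.
encoded = quote(path)
print(encoded)  # /computers/d%C5%AFv%C4%9B%C5%99iv%C3%BD

# Decoding restores the original UTF-8 path.
print(unquote(encoded) == path)  # True
```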
Thanks for the HTML4 standard link; I wasn't aware it disallowed special characters in URLs. I have modified my static site generator to percent-encode these characters; it may be unimportant, but I may as well conform to HTML4.
> In HTML by the WHATWG they added exceptions for UTF-8 URLs, but I don't think it's a good idea. This breaks software that doesn't handle UTF-8 URLs (i.e. anything that follows the RFC, not what Google says).
Can you name any examples of software that would break when presented with un-encoded UTF-8 URLs? Even Lynx supports them. P.S.: I know about RFC 3986; I've implemented it :^). The source of my confusion was that HTTP 1.0 (RFC 1945) defines URIs in a way that treats bytes >=0x80 as unreserved (section 3.2.1), and I falsely assumed HTML must be the same.
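The unreserved set both messages refer to is defined in RFC 3986 section 2.3 as ALPHA / DIGIT / "-" / "." / "_" / "~"; everything else in a path must either be a reserved delimiter used as such or appear percent-encoded. A small illustrative check (hypothetical helper names, not from any of the implementations discussed):

```python
import string

# RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def needs_encoding(ch: str) -> bool:
    # True for characters that may not appear raw in a URL
    # (reserved characters like "/" are a separate case: they are
    # legal where they act as delimiters).
    return ch not in UNRESERVED

print([c for c in "důvěřivý" if needs_encoding(c)])  # prints ['ů', 'ě', 'ř', 'ý']
```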
participants (2)
- 256@256-32.com
- Rodrigo Arias