Problems with gzipped content-encoding pages, and a proposal
Hello.

I am a daily user of Dillo and I am running into problems with pages served with Content-Encoding: gzip and Dillo's MIME detection code. For example, when I try to open en.wikipedia.org or es.wikipedia.org, Dillo says:

"HTTP warning: Content-Type 'text/html; charset=utf-8' doesn't match the real data"

When I download the page with wget I see that the problem is that the page is gzipped, so I think Wikipedia now sends pages with Content-Encoding: gzip.

Proposal

I have a proposal to use DPIs to manage protocols, MIME types, encodings and maybe scripts, without modifying the Dillo source code each time a DPI is added.

Earlier on the list I argued for having MIME, protocol and script DPIs that call the right DPI for each task when Dillo asks for an unknown MIME type, protocol or script. But this does not fit very well with the DPI infrastructure, and it needs more DPI processes running, which is better to avoid. Sorry, I thought it was the best option.

I think the way to fit MIME handling into DPIs is to use the documented (but not fully implemented) concept of a DPI service. In the DPI docs, Dillo does not call a DPI directly: it asks dpid for a service and dpid returns a DPI. This way several DPIs can be installed for a given service and the user can use whichever one he wants. In fact, in the code Dillo already asks dpid for a service, but dpid cannot handle more than one DPI for a given service (it does strange things).

Basic idea: when Dillo does not know how to access a URL like mms:/something, or how to show a MIME type like image/xpm, it can ask for a protocol_mms service or a mime_image%2Fxpm service. We can fix dpid too. It will only need a configuration file to store the user's preferences for each service.

Details

Protocol DPIs can translate the unknown protocol to HTTP (like the https, ftp and data DPIs) or show an HTML page with actions for the protocol. For example, an mms DPI can show a page with options to listen to the mms stream (with mplayer) or download it (with mimms).

An extension to the download_gui DPI to support downloading unknown protocols could be useful. When the download is not one supported by wget, it can send open_url to Dillo or ask dpid ...

MIME types need a mime_x-dillo%2Fx-unknown service, because a file can be sent without a Content-Type and Dillo can fail to detect the MIME type. A DPI handling it can use 'file --mime ' and resend the file to Dillo with the correct MIME type. A way to avoid resending the file would be better; maybe Dillo can cache it if the file is small and is not a local file. If it is local, it can send the URL or path.

A more flexible dpid configuration is needed so users can use the 'file --mime' DPI with application/octet-stream if they want. A path to the DPI, relative to ~/.dillo/dpi or to the dpi_lib in dpidrc, may be sufficient.

A fallback DPI would surely be useful. If dpid cannot return a DPI because there is no service for that MIME type, Dillo can ask for a mime_x-dillo%2Fx-service-not-found or mime_x-dillo%2Fx-fallback service. That DPI can show an error, list DPIs to download and install, or show a search engine.

For security reasons, separate mime_... and inlined_mime_... groups of services may be needed, so that data for an image or an object tag loaded automatically by a page can be handled differently from data the user clicked on.

Content encoding can use a group of encoded_... service names (content_gzip for example). This does not need to detect the encoding, because if there is no Content-Encoding header the data is not encoded. A fallback may be needed.

Scripts need more thought, but if Dillo sends script tags to script_application%2Fjavascript and implements HTML intrinsic events, a DPI can start doing easy things such as redirects, status line scrolls (sigh) or javascript-based links. (Javascript-based links can easily be done with a protocol_javascript service.)

I mail this because I want comments and help. If you have read this boring (and badly written) text this far, please do not think about comments or help. Just do it! :)

Diego.

PS: I owe Dillo some parser code. It only needs a cleanup, but it is finished and it is smaller than the old glib parser code. :P
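A minimal sketch of how the service names in this mail (protocol_mms, mime_image%2Fxpm) could be built. The function name make_service_name and the set of characters left unescaped are assumptions made for illustration; nothing here is taken from the Dillo or dpid sources.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write "<prefix><value>" into out, percent-encoding
 * every byte of value that is not alphanumeric or one of "-._+".
 * Returns 0 on success, -1 if out is too small (out may then hold a
 * partial, unterminated result and must not be used).
 */
int make_service_name(const char *prefix, const char *value,
                      char *out, size_t outsz)
{
    assert(prefix != NULL && value != NULL && out != NULL);

    size_t n = strlen(prefix);
    if (n >= outsz)
        return -1;
    memcpy(out, prefix, n);

    for (const unsigned char *p = (const unsigned char *)value; *p; p++) {
        if (isalnum(*p) || strchr("-._+", *p) != NULL) {
            if (n + 1 >= outsz)          /* keep room for the final NUL */
                return -1;
            out[n++] = (char)*p;
        } else {
            if (n + 3 >= outsz)          /* "%XX" plus the final NUL */
                return -1;
            snprintf(out + n, 4, "%%%02X", (unsigned)*p);
            n += 3;
        }
    }
    out[n] = '\0';
    return 0;
}
```

With this sketch, make_service_name("mime_", "image/xpm", buf, sizeof buf) fills buf with mime_image%2Fxpm, and the "protocol_" prefix with "mms" gives protocol_mms. Escaping the '/' keeps the service name usable as a single directory or file name.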
Hello again.

I have started to code my proposal. I have a small patch with the easiest part (the Dillo protocol part), so people can test it if they want.

The patch is highly experimental. Dpid is not modified, so until it is (that is the most complex part) the DPI tree and the names of the protocol DPIs will be broken. This can be solved with a few moves and renames, but do not try it on a non-test system if you do not know what you are doing. The patch works with current CVS.

After compiling Dillo with the experimental protocol patch, when Dillo finds a URL that does not start with http: or about: it will ask dpid for a service called protocol_PROTOCOL-PART-OF-URL (examples: protocol_ftp for ftp URLs, protocol_data for data URLs, protocol_javascript for javascript URLs (if you delete the code in html.c that intercepts javascript), ...). I will use protocol_https in the examples from now on.

Since dpid is not changed, it will exec ~/.dillo/dpi/protocol_https/protocol_https.dpi or $dpi_lib/dpi/protocol_https/protocol_https.dpi (provided the file was there when dpid started). The changed dpid, when it is done, will exec the user-selected DPI for that service.

Until then, if anybody wants to play with this code: apply the patch, compile with care (do not install it, for example), create a DPI (a filter script, for example) in .../dpi/protocol_YOUR-PREFERRED-PROTOCOL/protocol_YOUR-PREFERRED-PROTOCOL.filter.dpi, stop and restart dpid, and test by pointing the compiled dillo at YOUR-PREFERRED-PROTOCOL:WHATEVER-YOU-WANT

I have made a script for javascript URLs to test it (see the comment above). I will now try to code a more complex part (MIME).

Diego.
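A rough sketch of the lookup rule described in this mail: URLs starting with http: or about: stay in the Dillo core, and anything else maps to a protocol_PROTOCOL-PART-OF-URL service. The function name, the return codes and the scheme validation are assumptions for illustration, not code from the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: derive "protocol_<scheme>" from a URL.
 * Returns  1 if the scheme should be requested from dpid,
 *          0 if the core handles it (http, about),
 *         -1 if the URL has no valid scheme or out is too small.
 * On 0 and 1, out holds the lower-cased service name.
 */
int protocol_service_for_url(const char *url, char *out, size_t outsz)
{
    static const char prefix[] = "protocol_";
    const size_t plen = sizeof prefix - 1;

    assert(url != NULL && out != NULL);

    const char *colon = strchr(url, ':');
    if (colon == NULL || colon == url)
        return -1;

    size_t slen = (size_t)(colon - url);
    if (plen + slen + 1 > outsz)
        return -1;

    memcpy(out, prefix, plen);
    for (size_t i = 0; i < slen; i++) {
        unsigned char c = (unsigned char)url[i];
        /* Scheme characters: letters, digits, '+', '-' and '.'. */
        if (!isalnum(c) && c != '+' && c != '-' && c != '.')
            return -1;
        out[plen + i] = (char)tolower(c);
    }
    out[plen + slen] = '\0';

    const char *scheme = out + plen;
    if (strcmp(scheme, "http") == 0 || strcmp(scheme, "about") == 0)
        return 0;
    return 1;
}
```

For https://example.org/ this yields protocol_https with return value 1, which with an unmodified dpid would correspond to the protocol_https/protocol_https.dpi path mentioned above.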
Hi,

On Fri, Aug 25, 2006 at 03:49:10AM +0200, Diego Sáenz wrote:
> Hello again.
> [...]
> I mail this because I want comments and help. If you have read this boring (and badly written) text this far, please do not think about comments or help. Just do it! :)
> [...]
> I have started to code my proposal.

Beware, your first mail is very hard to read and understand (I don't understand it, for instance). The subject matter is complex, so I'd advise you to: take your time, re-think, be more careful with your English, and try to express the ideas as clearly as possible.

With regard to gzip encoding: zlib can do gzip decoding, and as it is already linked into Dillo, gzip decoding can be implemented inside Dillo (not using a dpi).

This also avoids multiple dpi passes. For instance, with dpi-decompress, gzipped ISO Latin-2 content would require one dpi pass to uncompress and another to convert Latin-2 to UTF-8.

Inside-Dillo decoding also allows the idea of having a compressed cache in the future.

--
Cheers
Jorge.-
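A small sketch of the check that would come before in-core decoding: decide from the Content-Encoding header (and the gzip magic bytes, as a sanity check) whether the body must be inflated before MIME detection sees it. The function names are hypothetical, and the inflate step itself is left to zlib and not shown, so the sketch needs only the standard C library.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison of two NUL-terminated ASCII strings. */
static int eq_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* A gzip member starts with 0x1f 0x8b, followed by method 8 (deflate). */
int looks_gzipped(const unsigned char *data, size_t len)
{
    assert(data != NULL);
    return len >= 3 && data[0] == 0x1f && data[1] == 0x8b && data[2] == 0x08;
}

/*
 * Hypothetical policy: inflate only when the server declared gzip
 * (or its alias x-gzip) and the body really starts with a gzip header.
 * A missing header means "not encoded".
 */
int needs_gunzip(const char *content_encoding,
                 const unsigned char *data, size_t len)
{
    if (content_encoding == NULL)
        return 0;
    if (!eq_nocase(content_encoding, "gzip") &&
        !eq_nocase(content_encoding, "x-gzip"))
        return 0;
    return looks_gzipped(data, len);
}
```

Running MIME detection only on the decoded bytes is what would avoid the "doesn't match the real data" warning from the first mail, since the detector would then see HTML rather than a gzip stream.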
On Sun, 27 Aug 2006 12:53:47 -0400, Jorge Arellano Cid <jcid@dillo.org> wrote:
> Hi,
>
> On Fri, Aug 25, 2006 at 03:49:10AM +0200, Diego Sáenz wrote:
>> Hello again.
>> [...]
>> I mail this because I want comments and help. If you have read this boring (and badly written) text this far, please do not think about comments or help. Just do it! :)
>> [...]
>> I have started to code my proposal.
>
> Beware, your first mail is very hard to read and understand (I don't understand it, for instance).

Sorry. I almost forget how to write English when I have not written it for a long time.

> The subject matter is complex, so I'd advise you to: take your time, re-think, be more careful with your English, and try to express the ideas as clearly as possible.
OK, second try (I have changed a couple of things):

The general ideas are to enhance dpid to manage DPIs using services (as the documentation suggests) and to add code to Dillo to implement protocol DPIs (without hardcoding them in the Dillo core), MIME handler DPIs, and maybe script language DPIs and content encoding DPIs (encoding DPIs are now in doubt after your mail).

Dpid will need configuration lines (in dpidrc) to associate a service name with a DPI path. A path relative to the dpi directory allows one DPI to serve more than one service (in this proposal a javascript DPI will serve the protocol_javascript, script_javascript and maybe mime_text/javascript services).

When Dillo finds a protocol that it does not handle in core (currently it handles http:, about: and dpi:), it asks dpid for a service for that protocol, and dpid returns the DPI from the configuration file (really a socket to communicate with it). For the service name, Dillo appends the protocol name to the "protocol_" string.

The lines in dpidrc for the current services could be:

protocol_file=file/file.dpi
protocol_ftp=ftp/ftp.filter.dpi
protocol_https=https/https.filter.dpi
protocol_data=datauri/datauri.filter.dpi

I think I will allow the use of '*' in the configuration file to match any characters (even none) up to the end of the string. '*' entries will be checked last. It can be used by a ProtocolNotSupported.dpi that shows an error or a page with protocol DPIs to download. The dpidrc line could look like this:

protocol_*=misc/ProtocolNotSupported.dpi

MIME types work similarly. When the user clicks on a link or uses the URL bar to view a MIME type not handled by Dillo, the MIME type is appended to the "mime_" string to get the service name (Dillo currently handles mime_text/html, mime_text/plain, mime_text/*, mime_image/gif, mime_image/jpeg, mime_image/png). When an unhandled MIME type is opened because a page loads it, Dillo appends the "_inlined" string to the previous string (example: for a bmp file used as an image in an HTML page, Dillo will try the mime_image/x-bmp_inlined service). The '*' wildcard allows one DPI to manage both services with a single configuration line, if the user wants.

About MIME type checking/detection, I am thinking about different options (MIME detection/checking goes before MIME DPIs):

1a. A fixed service called for pages without a MIME type and for application/octet-stream (treating it as an unknown MIME type), which returns the detected MIME type.
1b. A fixed service called for pages without a MIME type and for a list of MIME types configured in dillorc.
2a. A check service per MIME type, with service names like mime_check_MIME/TYPE. Dillo sends the actual type, so the DPI handling the service can do something if the detected MIME type differs from the one sent by the server (ask the user, ignore the error, stop the load, ...).
2b. Like 2a, but only asking for a list of MIME types configured in dillorc.

Scripts and content encoding work similarly.

For more details, I can send what the dpi tags will look like, and more about Dillo-DPI communication and the internal Dillo changes.
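A minimal sketch of the dpidrc lookup described above: exact service names are tried first and entries ending in '*' are checked last. The struct, the function name and the "longest prefix wins" tie-break between several wildcard entries are assumptions for illustration; the mail only says that '*' matches any characters up to the end of the string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One "service=path" line from a hypothetical parsed dpidrc. */
struct service_entry {
    const char *pattern;   /* e.g. "protocol_https" or "protocol_*" */
    const char *dpi_path;  /* e.g. "https/https.filter.dpi" */
};

/*
 * Return the DPI path configured for a service name, or NULL if nothing
 * matches. Exact entries win; entries with a trailing '*' are tried
 * afterwards, and among those the longest matching prefix is chosen.
 */
const char *lookup_service(const struct service_entry *tab, size_t n,
                           const char *name)
{
    assert(tab != NULL && name != NULL);

    /* Pass 1: exact matches. */
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].pattern, name) == 0)
            return tab[i].dpi_path;

    /* Pass 2: wildcard entries, checked last. */
    const char *best = NULL;
    size_t best_len = 0;
    for (size_t i = 0; i < n; i++) {
        size_t plen = strlen(tab[i].pattern);
        if (plen == 0 || tab[i].pattern[plen - 1] != '*')
            continue;
        size_t prefix = plen - 1;  /* '*' may match zero characters */
        if (strncmp(tab[i].pattern, name, prefix) == 0 &&
            (best == NULL || prefix > best_len)) {
            best = tab[i].dpi_path;
            best_len = prefix;
        }
    }
    return best;
}
```

With the entries protocol_https and protocol_* from this mail, protocol_https resolves to https/https.filter.dpi and protocol_mms falls through to misc/ProtocolNotSupported.dpi. A single hypothetical line such as mime_image/x-bmp* would cover both the clicked and the _inlined service.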
> With regard to gzip encoding: zlib can do gzip decoding, and as it is already linked into Dillo, gzip decoding can be implemented inside Dillo (not using a dpi).

Oh, I forgot about that.

> This also avoids multiple dpi passes. For instance, with dpi-decompress, gzipped ISO Latin-2 content would require one dpi pass to uncompress and another to convert Latin-2 to UTF-8.
>
> Inside-Dillo decoding also allows the idea of having a compressed cache in the future.

I do not know if I can implement content encoding in Dillo's internals, so in terms of implementation complexity I will put it after the dpid changes.

I hope it is clearer now.

Diego
On Tue, Aug 29, 2006 at 06:40:40PM +0200, Diego Sáenz wrote:
> OK, second try (I have changed a couple of things):
>
> The general ideas are to enhance dpid to manage DPIs using services (as the documentation suggests) and to add code to Dillo to implement protocol DPIs (without hardcoding them in the Dillo core), MIME handler DPIs, and maybe script language DPIs and content encoding DPIs (encoding DPIs are now in doubt after your mail).
>
> Dpid will need configuration lines (in dpidrc) to associate a service name with a DPI path. A path relative to the dpi directory allows one DPI to serve more than one service (in this proposal a javascript DPI will serve the protocol_javascript, script_javascript and maybe mime_text/javascript services).
>
> When Dillo finds a protocol that it does not handle in core (currently it handles http:, about: and dpi:), it asks dpid for a service for that protocol, and dpid returns the DPI from the configuration file (really a socket to communicate with it). For the service name, Dillo appends the protocol name to the "protocol_" string.
>
> The lines in dpidrc for the current services could be:
>
> protocol_file=file/file.dpi
> protocol_ftp=ftp/ftp.filter.dpi
> protocol_https=https/https.filter.dpi
> protocol_data=datauri/datauri.filter.dpi
>
> I think I will allow the use of '*' in the configuration file to match any characters (even none) up to the end of the string. '*' entries will be checked last. It can be used by a ProtocolNotSupported.dpi that shows an error or a page with protocol DPIs to download.
>
> The dpidrc line could look like this:
>
> protocol_*=misc/ProtocolNotSupported.dpi
>
> MIME types work similarly. When the user clicks on a link or uses the URL bar to view a MIME type not handled by Dillo, the MIME type is appended to the "mime_" string to get the service name (Dillo currently handles mime_text/html, mime_text/plain, mime_text/*, mime_image/gif, mime_image/jpeg, mime_image/png). When an unhandled MIME type is opened because a page loads it, Dillo appends the "_inlined" string to the previous string (example: for a bmp file used as an image in an HTML page, Dillo will try the mime_image/x-bmp_inlined service).
>
> The '*' wildcard allows one DPI to manage both services with a single configuration line, if the user wants.
>
> About MIME type checking/detection, I am thinking about different options (MIME detection/checking goes before MIME DPIs):
>
> 1a. A fixed service called for pages without a MIME type and for application/octet-stream (treating it as an unknown MIME type), which returns the detected MIME type.
> 1b. A fixed service called for pages without a MIME type and for a list of MIME types configured in dillorc.
> 2a. A check service per MIME type, with service names like mime_check_MIME/TYPE. Dillo sends the actual type, so the DPI handling the service can do something if the detected MIME type differs from the one sent by the server (ask the user, ignore the error, stop the load, ...).
> 2b. Like 2a, but only asking for a list of MIME types configured in dillorc.
>
> Scripts and content encoding work similarly.
>
> For more details, I can send what the dpi tags will look like, and more about Dillo-DPI communication and the internal Dillo changes.
At this phase I'd appreciate the "general picture" much more than implementation details. I mean: What's the problem you're trying to solve? What's the scheme you propose? Have you tested that it suits current functionality, and maybe other problems too?

For instance, if you click on a network radio link, it's not a good idea to use Dillo to download the link data and pass it via dpi to a dpi-radio plugin (and worse, to cache the stream! :). It's much better to cut the network connection from Dillo and hand the URL over to a dedicated radio-stream program.

In the case of a PDF, it would be nice to save the data to some temporary directory while you pipe it into xpdf, or to let the user choose with a dialog whether to save it. Same for a movie stream, etc.
>> With regard to gzip encoding: zlib can do gzip decoding, and as it is already linked into Dillo, gzip decoding can be implemented inside Dillo (not using a dpi).
>
> Oh, I forgot about that.

Will you implement this?

>> This also avoids multiple dpi passes. For instance, with dpi-decompress, gzipped ISO Latin-2 content would require one dpi pass to uncompress and another to convert Latin-2 to UTF-8.
>>
>> Inside-Dillo decoding also allows the idea of having a compressed cache in the future.
>
> I do not know if I can implement content encoding in Dillo's internals, so in terms of implementation complexity I will put it after the dpid changes.

IMO this is much simpler than what you're willing to try first. Beware.

--
Cheers
Jorge.-
participants (2)
- Diego Sáenz
- Jorge Arellano Cid