Topic: settings: new option for filesystem (filenames) encoding (Read 20981 times)

macias · « **Reply #15 on:** December 24, 2007, 04:07:04 PM »

Quote from: Vollstrecker on December 24, 2007, 03:38:59 PM

Maybe it's more useful to upgrade this old software and use the locales the way their inventor intended.

:-)
1) Old software is by all means still useful -- Midnight Commander, for example (MC is the main reason I set LANG to en, not to pl).
2) LANG is an old way to express anything about encoding, so if anything should be upgraded, it is this old system of communicating with the world.

lfroen · « **Reply #16 on:** December 25, 2007, 07:56:32 AM »

Quote from: macias on December 24, 2007, 03:34:55 PM

Quote from: Vollstrecker on December 24, 2007, 02:26:58 PM
The problem is that you tell the app it has encoding en_EN, which in itself is legal, and your filesystem serves the filename in pl_PL or whatever, which in itself is again absolutely legal. But what you see is what happens when they don't communicate and each expects the other to use the same encoding as it does. Maybe the other apps just ignore the settings you gave them, and this is what makes them work.

No, the other apps have settings to override LANG. And it is pretty sane -- there is a lot of older software that relies heavily on LANG and LANG only and does not work too well with extended LANG. So it is useful to set LANG to the most widely used setting (en) for the sake of them working, and to set the really needed encoding (iso-8859-2) independently in a modern environment.

This paragraph of yours makes no sense. WTF is "settings to override LANG"?! Can you please point me to some application which has a setting like "my filesystem encoding is BLAH"? Did you see Firefox doing that? OpenOffice? Mplayer?
Your complaint that "it works for 100 applications but doesn't work for 2" is meaningless. All it means is that sometimes it works, sometimes not. Which in turn means that your settings are screwed.
Now, can you please stop for a second and explain why you started to mess with locale/encoding/LANG/whatever in the first place?!

macias · « **Reply #17 on:** December 25, 2007, 04:42:10 PM »

Quote from: lfroen on December 25, 2007, 07:56:32 AM

This paragraph of yours makes no sense. WTF is "settings to override LANG"?! Can you please point me to some application which has a setting like "my filesystem encoding is BLAH"? Did you see Firefox doing that? OpenOffice? Mplayer?

I already pointed it out -- KDE. You set up your locale, that's it.

Quote from: lfroen on December 25, 2007, 07:56:32 AM

Now, can you please stop for a second and explain why you started to mess with locale/encoding/LANG/whatever in the first place?!

I chose optimal settings to make all applications I use work. And the best solution was to drop utf8 and use plain 8-bit LANG.

Ok, about tests:
LANG=pl_PL
amule

and amule ignores the LANG, but

LANG=pl_PL
vlc

and vlc displays filenames with correct Polish characters.

Could it be, that:
a) amule ignores LANG or tries to do some dirty tricks with it? Like reading /etc/ files directly?
b) it requires LANG to use utf8 and does not handle 8-bit (iso-8859-2) characters?
c) or it checks whether the client is using utf8 and, if not, always assumes iso-8859-1 and nothing else?

I have no idea why amule cannot accept LANG set to pl_PL (I also tried RC_LANG and RC_LC_ALL). I set the language to Polish in amule as well. No change.

wuischke · « **Reply #18 on:** December 25, 2007, 10:38:51 PM »

IIRC we assume UTF-8 and will fall back to ISO 8859-1.

I don't like the corresponding code. Somewhere in there is the source of our not being able to share files with é,ñ,ô..., but the fix I tried (and which worked on my system and some others') made things even worse for others...

macias · « **Reply #19 on:** December 25, 2007, 11:03:57 PM »

Wuischke, thank you for the reply.

So maybe this could be my wish -- settings for fallback encoding? It would be good to have this configurable.

wuischke · « **Reply #20 on:** December 25, 2007, 11:26:53 PM »

Are you comfortable with compiling aMule? If yes, please open the file src/libs/common/ConvAmule.h, change the encoding used on line 59, and try it again on your system.

lfroen · « **Reply #21 on:** December 26, 2007, 01:18:00 AM »

Quote from: wuischke on December 25, 2007, 10:38:51 PM

IIRC we assume UTF-8 and will fall back to ISO 8859-1.

I don't like the corresponding code. Somewhere in there is the source of our not being able to share files with é,ñ,ô..., but the fix I tried (and which worked on my system and some others') made things even worse for others...

That's a true WTF. Why doesn't wx handle that internally?

phoenix · « **Reply #22 on:** December 26, 2007, 05:07:49 AM »

Using anything other than a UTF-8 locale is broken. I used to use ISO-8859-1 here, but I gave up. That meant I had to do some file name conversions, i.e. I had to rename some files manually.

Let me try to explain how aMule tries to handle this situation:

0) wxString uses UNICODE internally. ED2K network file names use UNICODE too. No problem here.

1) When reading a file name from the system, aMule will first try to use it as a UTF-8 encoded string. That might fail if the string is not a valid UTF-8 encoding, which is usual on systems that do not use UTF-8, like ISO-8859-*. In case of failure, aMule will treat it as ISO-8859-1. This is currently a hardcoded value, but there could be a preference tab to put any local encoding you like. This is certainly the reason why aMule does not recognize the Polish characters, and doing what wuischke said above will correct the issue for ISO-8859-2, but will break it for ISO-8859-1. Giving the user the possibility to choose the local fallback encoding is probably the best shot.

2) When converting ED2K file names to local encoding, for consistency, we first try ISO-8859-1. If the conversion is not possible (e.g. Chinese characters), then we use UTF-8. This is not a good policy if you have a UTF-8 system (like I do), but it avoids breaking ISO-8859-1 systems, which used to be common, but maybe not so much nowadays. That procedure will cause problems if the local encoding is ISO-8859-X, for X different from 1, as some characters cannot be converted and UTF-8 will be used.
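
To make points 1) and 2) concrete, here is a minimal Python sketch of the two conversion directions as described above. It is not aMule's actual code (which is C++ on top of wxWidgets); the function names and the `FALLBACK_ENCODING` constant are made up for illustration, and the `fallback` parameter stands in for the configurable preference being discussed.

```python
# Hypothetical names; this only mirrors the policy described in the post.
FALLBACK_ENCODING = "iso-8859-1"  # the value described as hardcoded


def local_to_unicode(raw: bytes, fallback: str = FALLBACK_ENCODING) -> str:
    """Point 1: try UTF-8 first, then fall back to an 8-bit encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


def unicode_to_local(name: str, preferred: str = FALLBACK_ENCODING) -> bytes:
    """Point 2: try the 8-bit encoding first, then fall back to UTF-8."""
    try:
        return name.encode(preferred)
    except UnicodeEncodeError:
        return name.encode("utf-8")


polish_raw = "żółw".encode("iso-8859-2")  # bytes as stored on an ISO-8859-2 system

# With the hardcoded fallback the Polish letters are misread.
misread = local_to_unicode(polish_raw)
# With a user-selected fallback they come out right.
correct = local_to_unicode(polish_raw, fallback="iso-8859-2")
```

With the ISO-8859-1 fallback the name is misread as "¿ó³w", which is the kind of garbling macias reported; passing "iso-8859-2" as the fallback recovers "żółw".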

In my opinion, we should stick to UTF-8 local encoding, even if the user has problems with it. #1 does not present a problem, and the user could in fact choose the fallback encoding to help aMule share the file with its correct name, but #2 is currently doing a bad job converting UNICODE to ISO-8859-1 and saving files with this encoding, when it should be using UTF-8. If we only use UTF-8 in #2 the user would possibly see some strange characters due to the lack of UTF-8 support from the system, but aMule and other UTF-8 aware applications would retrieve the file name correctly. My vote is to drop anything different from UTF-8 from #2.

Historically, wxWidgets did not help a lot in this matter, and in fact they were reluctant to accept this as being a problem the first time we reported it to them. Things improved a bit when they created a wxConvFileName object that originally used a "broken file name conversion object", which is currently replaced by aMule's ConvAmuleBrokenFileNames object. Funny thing, they call it a broken file name. It is broken because wx can't deal with it properly, indeed, a very convenient attitude. Another problem has been their lack of consistency in using wxConvFileName inside wxWidgets itself, but they seem to be improving in that matter. I just got tired of discussing with them. I don't do that anymore; I have other real-life stuff to spend my limited time on more effectively.

At the present moment, I believe that the problem has little to do with wxWidgets, unless they are still forgetting to use wxConvFileName somewhere. We just have to figure out what is the best policy and implement it.

wuischke · « **Reply #23 on:** December 26, 2007, 09:25:52 AM »

I personally agree on only supporting UTF8. The application will work very well and the negative impact for the few people not using Unicode is very small - even smaller than with today's solution. (They don't have to rename every file containing a special character anymore, it will only misdisplay the filename.)

macias · « **Reply #24 on:** December 26, 2007, 10:30:55 AM »

Quote from: wuischke on December 25, 2007, 11:26:53 PM

Are you comfortable with compiling aMule? If yes, please open the file src/libs/common/ConvAmule.h, change the encoding used on line 59, and try it again on your system.

Wuischke, thank you! I did it and things are getting better -- when I look at the shared files I can see properly displayed filenames now (iso-8859-2, success), but if I search for them amule still recognizes (the same files) as iso-8859-1.

I found out that iso-8859-1 is mentioned also in src/FileFunctions.cpp but I have no knowledge about amule so I don't know when those functions are used. Should I apply conversion to fallback encoding instead of direct copy of filename?

Phoenix,

Quote from: phoenix on December 26, 2007, 05:07:49 AM

In my opinion, we should stick to UTF-8 local encoding, even if the user has problems with it. #1 does not present a problem, and the user could in fact choose the fallback encoding to help aMule share the file with its correct name, but #2 is currently doing a bad job converting UNICODE to ISO-8859-1 and saving files with this encoding, when it should be using UTF-8. If we only use UTF-8 in #2 the user would possibly see some strange characters due to the lack of UTF-8 support from the system, but aMule and other UTF-8 aware applications would retrieve the file name correctly. My vote is to drop anything different from UTF-8 from #2.

I think it is ok, since when you get the file you can rename it according to your locale (in case anything is wrong, of course).

phoenix · « **Reply #25 on:** December 29, 2007, 09:02:29 PM »

Folks,

I am sorry, I forgot an important detail. It's been quite a while since I was last involved with this.

The file name conversion routines are, respectively, for #1 MB2WC() and for #2 WC2MB(). The tricky part is that in an ideal world, WC2MB( MB2WC( s ) ) == s, i.e., converting to UNICODE and then back to multibyte, should be an invariant. If you read my previous explanation, you will notice that this is not the case. So, what are the options:

1) Leave the code as it is. This means that the code will fail if there is a UTF-8 file name on your system that can be converted to ISO-8859-1.
aMule will report something like this for a UTF-8 file name containing the accented letter "ú" (u + '):

Quote

2007-12-29 17:14:02: Logger.cpp(268): Error: Failed to retrieve file times for '/home/myuser/.aMule/Incoming/filmes/asdú.mp4' (error 2: No such file or directory)
2007-12-29 17:14:02: FileFunctions.cpp(187): FileIO: Error on GetLastModificationTime from `/home/myuser/.aMule/Incoming/filmes/asdú.mp4'
2007-12-29 17:14:02: CFile.cpp(135): CFile: Error when opening file (/home/myuser/.aMule/Incoming/filmes/asdú.mp4): No such file or directory

This is pretty bad IMHO, because a UTF-8 configured system will fail for no apparent reason.

2) Change #1 and #2 so that we only recognize UTF-8 valid names. This means that the app will fail to read non-UTF-8 file names. The fact that your system is configured to use UTF-8 is irrelevant here, an application can always save a file name such that it is an invalid UTF-8 sequence. ISO-8859-1 file names would not be able to be shared. This is bad, but maybe not so much, non-UTF-8 systems are starting to get rare.

3) Leave #1 as it is, so that we are always able to read a file name from the system, and change #2 to always convert the UNICODE file name to UTF-8. This would also break things for ISO-8859-1 names: aMule will not be able to share these names because the invariant is broken. UTF-8 names will work fine. I see no big advantage over the previous choice, and maybe we are just postponing an error that should be caught sooner.

The big problem is that once the file name is converted to UNICODE by MB2WC(), there is no way for us to know if the original encoding was UTF-8 or ISO-8859, and we need this information to satisfy the invariance relation. The proper solution IMHO would be to patch wxWidgets so that it would remember the original file name string or the original encoding somewhere in its internal file structure.
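
The loss of information can be shown with a short Python sketch. The helpers `mb2wc` and `wc2mb` below are stand-ins that only imitate the policy described in this thread (UTF-8 first when reading, ISO-8859-1 first when writing); they are not the wxWidgets or aMule functions of the same name.

```python
def mb2wc(raw: bytes) -> str:
    """Stand-in for #1: multibyte -> UNICODE, UTF-8 first, then ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


def wc2mb(name: str) -> bytes:
    """Stand-in for #2: UNICODE -> multibyte, ISO-8859-1 first, then UTF-8."""
    try:
        return name.encode("iso-8859-1")
    except UnicodeEncodeError:
        return name.encode("utf-8")


utf8_name = "asdú.mp4".encode("utf-8")         # name on a UTF-8 system
latin1_name = "asdú.mp4".encode("iso-8859-1")  # name on an ISO-8859-1 system

# Two different byte strings collapse into one UNICODE string,
# so the original encoding cannot be recovered afterwards.
same_unicode = mb2wc(utf8_name) == mb2wc(latin1_name)

# The invariant WC2MB(MB2WC(s)) == s holds for one name only.
utf8_round_trip_ok = wc2mb(mb2wc(utf8_name)) == utf8_name
latin1_round_trip_ok = wc2mb(mb2wc(latin1_name)) == latin1_name
```

Under this policy the ISO-8859-1 name survives the round trip while the UTF-8 one comes back as different bytes, which matches the "No such file or directory" errors in the log quoted under option 1.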

I would really appreciate opinions here. This issue has been bugging us for too long, and all the cards are on the table now. I need help from the other people in the project as well as anyone wishing to contribute. My vote is for solution number 2, but we must be aware that this will break sharing for all non-UTF-8 names.

Hope to hear from lots of people.

EDIT: Another possibility (#4) would be not to use wxStrings to store file names. I don't know if this is possible given that wx file functions expect wxStrings.

macias · « **Reply #26 on:** December 30, 2007, 09:38:31 AM »

Phoenix,

Maybe I am missing something, but why not use pairs, something like this:
PAIR(FileName,wxString,local,wxString,outer_world);

The encoding for local filenames would also be taken from the system without any modification. System says it is iso-8859-1, fine, no problem. System says it is utf8 -- fine.

On the other hand, "outer_world" would always be the representation of how the world sees that file (most likely it would be utf8).

For example, I have a file to share. It is remembered with its original encoding, and yet it has "outer_world" set after conversion, so anybody else can search for that file and get correct results.

And vice versa -- when I get something, the "outer_world" is set to the filename as I see it through the net, but the "local" part is converted as it will be saved on my disk.

So it is close to your #4 solution, however at every point you will have a wxString. If you want to do something locally you take the "local" part, if you want to transfer something you use the "outer_world" part.
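
A rough Python sketch of the pair idea follows. The class and helper names are hypothetical, and the sketch departs from the proposal in one respect: the "local" part is kept as raw bytes rather than as a wxString, so that the original name is preserved exactly. The UTF-8 fallback in `from_network` is also an assumption, not something specified above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileNamePair:
    local: bytes      # name exactly as the local system stores it
    outer_world: str  # UNICODE name as the network sees it


def from_local(raw: bytes, local_encoding: str) -> FileNamePair:
    """Sharing a file: keep the original bytes, derive the network name."""
    return FileNamePair(local=raw, outer_world=raw.decode(local_encoding))


def from_network(name: str, local_encoding: str) -> FileNamePair:
    """Receiving a file: keep the network name, derive the name saved on disk."""
    try:
        local = name.encode(local_encoding)
    except UnicodeEncodeError:
        # Assumption: names the local encoding cannot represent are saved as UTF-8.
        local = name.encode("utf-8")
    return FileNamePair(local=local, outer_world=name)


shared = from_local("żółw.txt".encode("iso-8859-2"), "iso-8859-2")
received = from_network("żółw.txt", "iso-8859-2")
```

Here `shared.local` and `received.local` are the same ISO-8859-2 bytes, while both pairs carry "żółw.txt" as the name visible to other clients.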

My 2 cents :-)

have a nice day, bye

phoenix · « **Reply #27 on:** January 02, 2008, 05:05:25 PM »

Hi macias,

Thank you for your time.

File names on the network are UNICODE encoded. There is no provision for exchanging locale information. That is something we must live with, but that presents no problems because you can convert anything to UNICODE.

The problem is that we use a library called wxWidgets that uses some kind of file abstraction, which in my opinion is brain-damaged, because the function that maps character strings to UNICODE strings is not one-to-one. There are, e.g., two ways to encode the letter ú, namely ISO-8859-1 and UTF-8. To correctly access the file stored in your local system, you MUST use the same encoding that was used to find the file. In fact, Linux does not know anything about character encodings. This is the reason why I said before that WC2MB( MB2WC( s ) ) == s must be true. Whenever this relation is not satisfied, wxWidgets will fail; it is not aMule's fault. What we are trying to do here is either find a workaround or decide to break things on some systems. By the way, things are currently broken on lots of systems...

So, with this in mind, I totally agree with you that the clean solution would be to use a pair or a struct, but unfortunately we cannot use your suggestion, because it is something that must be implemented in wxWidgets and not in aMule. wxWidgets stores file names in wxStrings, which use UNICODE. That seems reasonable at first, but is BBD (broken by design (TM)) for the reasons explained above. I have told the wxDevs about this problem before, but I have failed to make them understand it.

Cheers!

Kry · « **Reply #28 on:** January 02, 2008, 09:24:56 PM »

time to use fopen?

phoenix · « **Reply #29 on:** January 03, 2008, 01:15:52 PM »

Quote from: Kry on January 02, 2008, 09:24:56 PM

time to use fopen?

Short answer, yes.

Long answer, we should store the file name and the encoding in a "FileNameData" class and use it consistently throughout the program. We would definitely use fopen, but you see, we might lose some portability in other file-related functions.

That solution sucks because we would be basically reimplementing wxWidgets functionality that they were unable to implement properly.
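
As an illustration only, a "FileNameData"-style holder might look like the Python sketch below. Nothing here is aMule code: the methods, the `scan_directory` helper, and the fallback parameter are assumptions. It relies on Python's file functions accepting byte-string paths and assumes a typical Linux filesystem that stores names as arbitrary bytes.

```python
import os


class FileNameData:
    """Hypothetical holder for the raw on-disk name and the encoding used to read it."""

    def __init__(self, raw: bytes, encoding: str):
        self.raw = raw            # bytes exactly as returned by the filesystem
        self.encoding = encoding  # encoding that successfully decoded them

    def display_name(self) -> str:
        """UNICODE form, for the GUI and the network."""
        return self.raw.decode(self.encoding)

    def open(self, directory: bytes, mode: str = "rb"):
        """Open the file by its original bytes, never by a re-encoded name."""
        return open(os.path.join(directory, self.raw), mode)


def scan_directory(directory: bytes, fallback: str = "iso-8859-1"):
    """List a directory as bytes and record which encoding fits each name."""
    entries = []
    for raw in sorted(os.listdir(directory)):
        try:
            raw.decode("utf-8")
            encoding = "utf-8"
        except UnicodeDecodeError:
            encoding = fallback
        entries.append(FileNameData(raw, encoding))
    return entries
```

Because the file is always opened through `raw`, a UTF-8 name and an ISO-8859-1 name that both display as "asdú.mp4" can sit in the same directory and each remains reachable.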
