aMule Forum
English => Feature requests => Topic started by: macias on December 23, 2007, 03:26:32 PM
-
Hello,
Currently (I guess) aMule assumes that the LANG environment variable corresponds to the filename encoding. But there is no guarantee of that (my system is proof), so please add an option in aMule to set the filename encoding -- this way all non-7-bit characters would be properly visible "outside".
Thank you in advance.
Merry Christmas btw :-)
-
Explain please.
-
I can use an EN LANG setting in the system but use ISO-8859-2 encoding in filenames. This works great with all apps, because older apps are not confused by a LANG other than EN, and in new apps I can set the encoding as I wish -- utf8, iso, etc. However, I know of two apps that assume LANG is correct for everything that comes from the system (a sane assumption) and do not allow any changes to it (not useful for the user). Those apps are Krusader and aMule. So in both, filenames are incorrectly recognized -- for example ę is changed into a Spanish (?) e^.
Thus this wish, option like
Local filenames [ encoding ]
where encoding is a combobox with entries: system, iso...., utf8..., codepage, ....
would be very helpful.
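The "e^" symptom is what one would expect if ISO-8859-2 bytes are read as ISO-8859-1. A minimal Python sketch of that mix-up follows; the function name is made up for illustration, and it only shows the byte-level effect, not what aMule does internally.

```python
# Illustration only: ISO-8859-2 file name bytes read by a program that
# wrongly assumes ISO-8859-1 (Latin-1).

def misread_as_latin1(name: str) -> str:
    """Encode as ISO-8859-2 (the bytes on disk), then decode as ISO-8859-1."""
    raw = name.encode("iso-8859-2")
    return raw.decode("iso-8859-1")


if __name__ == "__main__":
    for sample in ("ę", "ó", "ł"):
        # ę (byte 0xEA) comes out as ê, ó (0xF3) is unchanged, ł (0xB3) becomes ³
        print(sample, "->", misread_as_latin1(sample))
```

A combobox like the one proposed would, in effect, choose the codec name passed to the second step.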
-
Your setup is so wrong, that I don't know where to start.
* I have never seen an application that asks "what encoding your filesystem have"
* What kind of OS is that? On Linux ext2/3 uses Unicode; IIRC same thing on OSX.
* If you're using vfat from an external device, the filesystem driver takes care of filename encoding. Check the manual for the "mount" command options
-
Your setup is so wrong, that I don't know where to start.
Nah, it is quite ok. It is different than yours, that's all.
* I have never seen an application that asks "what encoding your filesystem have"
Me neither. But I've seen a lot of apps that allow the user not to rely on LANG settings. The KDE family, for example. And this wish is an example of that. Since aMule does not use currency, timezone, etc., the only common thing is interpreting filenames.
* What kind of OS is that? On Linux ext2/3 uses Unicode; IIRC same thing on OSX.
Opensuse 10.3, KDE 3.5.8, reiserfs. The point is -- you CAN use Unicode, but that does not mean amule interprets filenames this way. My guess is that it checks LANG first and then interprets filenames according to LANG. So if I type
touch test.txt
dir test.txt
I will see test.txt. As well as in amule.
And when I type
touch ęóąłś.txt
dir ęóąłś.txt
(those are all Polish characters) I will see ęóąłś.txt. In all apps... except amule and Krusader. Because they rely on LANG and my LANG is set to EN.
In every other app I can override LANG (Kate, Konsole, KDE itself, Kdevelop, you name it).
So maybe I rephrase my wish -- please add option to override the LANG system settings.
-
"Wrong" and "different" are two words that describe the same situation - for some reason you're messing with filename encoding vs. LANG vs. other system settings. I have no idea why you are doing that.
On my Linux (Fedora 7, ext3 root) there are files with names in three languages - Russian, Hebrew and English; and I did not touch a single setting to achieve that.
Now to the point
But I've seen a lot of apps that allow user not to rely on LANG settings.
Either I'm missing something, or you're asking for this script:
#!/bin/tcsh
# Override LANG for this process only, then replace the shell with aMule
setenv LANG whatever
exec amule
So maybe I rephrase my wish -- please add option to override the LANG system settings.
See script above.
-
[quote author=lfroen link=topic=14037.msg74133#msg74133 date=1198481248]
"Wrong" and "different" are 2 words that describe same situation - for some reason you're messing with filename encoding vs. LANG vs. other system settings. I have no idea why are you doing that.
On my Linux (Fedora 7, ext3 root) there's files with names on three languages - Russian, Hebrew and English; and I did not touched single setting to achieve that.
Because the default one works for you. For me it does not -- so I dropped utf8 from LANG and now it works for me too.
So maybe I rephrase my wish -- please add option to override the LANG system settings.
See script above.
[/quote]
But I've seen a lot of apps that allow user not to rely on LANG settings.
Either I missing something, or you're asking for this script:
#!/bin/tcsh
setenv LANG whatever
exec amule
Thank you, but amule still recognizes Polish characters (after setting LANG=pl_PL and the language to Polish in amule) as non-Polish.
Maybe I should explain: I can type "żółw" (for example) in the search and it is displayed correctly. amule also searches for that word, and if it finds anything it displays the matches correctly. But it misses my own files named "żółw". As a little test I can rename my file to "żółw zolw" -- then if I search for "zolw" I get that file, but the Polish characters are displayed (interpreted) incorrectly as some Western extended characters (except for "ó", because it has the same code and look).
To test it a bit further I applied your solution to Krusader (it is the second application that does not have settings for encoding). After
LANG=pl_PL
krusader
Krusader displays all filenames 100% correctly.
Wild guess -- could it be that amule assumes utf8 in filenames? Or anything like that.
-
Try LANG=pl_PL.iso88592
I don't guarantee that it works, aMule surely doesn't handle it but the underlying wxWidgets libraries might.
-
Try LANG=pl_PL.iso88592
I don't guarantee that it works, aMule surely doesn't handle it but the underlying wxWidgets libraries might.
Thank you, I tried this, and ISO88592 and ISO-8859-2. The result is always the same, my test "ę" is shown as "e^" (one character).
-
Because default one works for you. For me not -- so I dropped utf8 from LANG and it works for me to.
That's the question that should be answered first. Your settings are screwed. If you find yourself messing with LANG - chances are you're doing something you should not.
I don't guarantee that it works, aMule surely doesn't handle it but the underlying wxWidgets libraries might.
It may be a bug in wxWidgets itself - AFAIK aMule doesn't try to interpret filenames. You are welcome to check whether you see the same problem in other wxWidgets-based applications
-
Because default one works for you. For me not -- so I dropped utf8 from LANG and it works for me to.
That's the question that should be answered first. Your settings are screwed.
Excuse me, since when is LANG=en_EN screwed? It is a perfectly legal setting. And besides that, all the apps I use except two work perfectly well -- the numbers alone suggest it is more likely that something is not right with those two apps than with all the rest, don't you agree?
I don't guarantee that it works, aMule surely doesn't handle it but the underlying wxWidgets libraries might.
It may be a bug in wxWidgets itself - AFAIK aMule doesn't try to interpret filenames. You are welcomed to check if you see same problem in other wxWidgets based applications
Ok, I will do some tests, first I have to find what uses wxWidgets besides amule :-)
-
Excuse me, since when LANG=en_EN is screwed? It is perfectly legal setting. And even besides that all apps I use except two works perfectly well -- even numbers tell that it is more likely that there is something not right with those two apps than the rest of them, don't you agree?
Excused. The problem is that you tell the app that it has encoding en_EN, which in itself is legal, and your filesystem serves the filename in pl_PL or whatever, which in itself is again absolutely legal. But what you see is what happens when they don't communicate and each expects the other to use the same encoding. Maybe the other apps just ignore the settings you gave them, and that is why they work.
-
Try VLC, it uses wx as well.
-
The problem is, that you tell the app, that it has encoding en_EN with for itself is legal, and your filesystem serves the filename in pl_PL or whatever with for itself again is absolutely legal. But what you see is what happens when they don't communicate and expect the other to use the same as they do. Maybe the other apps just ignore the settings you gave them and this causes them to work.
No, the other apps have settings to override LANG. And it is pretty sane -- there is a lot of older software that relies heavily on LANG and LANG only and does not work too well with an extended LANG. So it is useful to set LANG to the most common setting (en) for the sake of that software, and to set the really needed encoding (iso-8859-2) independently in modern environments.
-
No, the other apps have settings to override LANG. And it is pretty sane -- there are a lot of older software that heavily rely on LANG and LANG only and do not work with extended LANG too well. So it is useful to set LANG to the most used settings (en) for the sake of them working and set it independently in modern environment to really needed encoding (iso-8859-2).
Maybe it's more useful to upgrade this old software and use the locales the way their inventor intended.
-
Maybe it's more usefull to upgrade this old software and use the locales in the sense of the inventor.
:-)
1) old software is by all means still useful -- Midnight Commander for example (MC is the main reason I set LANG to en, not to pl)
2) LANG is an old way to express anything about encoding, so if anything should be upgraded, it is this old system of communicating with the world
-
The problem is, that you tell the app, that it has encoding en_EN with for itself is legal, and your filesystem serves the filename in pl_PL or whatever with for itself again is absolutely legal. But what you see is what happens when they don't communicate and expect the other to use the same as they do. Maybe the other apps just ignore the settings you gave them and this causes them to work.
No, the other apps have settings to override LANG. And it is pretty sane -- there are a lot of older software that heavily rely on LANG and LANG only and do not work with extended LANG too well. So it is useful to set LANG to the most used settings (en) for the sake of them working and set it independently in modern environment to really needed encoding (iso-8859-2).
This paragraph of yours makes no sense. WTF is "settings to override LANG"?! Can you please point me to some application which has a setting like "my filesystem encoding is BLAH"? Did you see Firefox doing that? OpenOffice? Mplayer?
Your complaint that "it works for 100 applications but doesn't work for 2" is meaningless. All it means is that sometimes it works, sometimes not. Which in turn means that your settings are screwed.
Now, can you please stop for a second, and explain why you started to mess with locale/encoding/LANG/whatever in the first place?!
-
This paragraph of your makes no sense. WTF is "settings to override LANG"?! Can you please point me to some application which have setting like "my filesystem encoding is BLAH"? Did you saw Firefox doing that? OpenOffice? Mplayer?
I already pointed it out -- KDE. You set up your locale, that's it.
Now, can you please stop for a second, and explain why did you started to mess with local/encoding/LANG/whatever in a first place?!
I chose optimal settings to make all applications I use work. And the best solution was to drop utf8 and use plain 8-bit LANG.
Ok, about tests:
LANG=pl_PL
amule
and amule ignores the LANG, but
LANG=pl_PL
vlc
and vlc displays filenames with correct polish characters.
Could it be, that:
a) amule ignores LANG or tries to do some dirty tricks with it? Like reading /etc/ files directly?
b) requires LANG to use utf8 and does not handle 8-bit (iso-8859-2) characters?
c) or it checks whether the client is using utf8, and if not, it always assumes iso-8859-1 and nothing else
I have no idea why amule cannot accept LANG set to pl_PL (I also tried RC_LANG and RC_LC_ALL). I set the language to Polish in amule as well. No change.
-
IIRC we assume UTF-8 and will fall back to ISO 8859-1.
I don't like the corresponding code, somewhere there's the source of our not being able to share files with é,ñ,ô..., but the fix I tried (and which worked on my and some others' systems) made things even worse for others...
-
Wuischke, thank you for the reply.
So maybe this could be my wish -- settings for fallback encoding? It would be good to have this configurable.
-
Are you comfortable with compiling aMule? If yes, please open the file src/libs/common/ConvAmule.h and change the used encoding in line 59 and try it again on your system.
-
IIRC we assume UTF-8 and will fallback to ISO 8859-1.
I don't like the corresponding code, somewhere there's the source of our not being able to share files with é,ñ,ô..., but the fix I tried (and which worked on my and some other's systems) made things even worse for others...
That's a true WTF. Why doesn't wx handle that internally?
-
Using anything other than a UTF-8 locale is broken. I used to use ISO-8859-1 here, but I gave up. That meant I had to do some file name conversions, i.e. I had to rename some files manually.
Let me try to explain how aMule tries to handle this situation:
0) wxString uses UNICODE internally. ED2K network file names use UNICODE too. No problem here.
1) When reading a file name from the system, aMule will first try to use it as a UTF-8 encoded string. That might fail if the string is not a valid UTF-8 encoding, which is usual on systems that do not use UTF-8, such as ISO-8859-* ones. In the case of a failure, aMule will treat it as ISO-8859-1. This is currently a hardcoded value, but there could be a preference tab to set any local encoding you like. This is certainly the reason why aMule does not recognize the Polish characters, and doing what wuischke said above will correct the issue for ISO-8859-2, but will break ISO-8859-1. Giving the user the possibility to choose the local fallback encoding is probably the best shot.
2) When converting ED2K file names to the local encoding, for consistency, we first try ISO-8859-1. If the conversion is not possible (e.g. Chinese characters), then we use UTF-8. This is not a good policy if you have a UTF-8 system (like I do), but it avoids breaking ISO-8859-1 systems, of which there used to be a lot, but maybe not so many nowadays. That procedure will cause problems if the local encoding is ISO-8859-X, for X different from 1, as some characters cannot be converted and UTF-8 will be used.
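Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration of the policy as described here, not aMule's code: the function names and the `fallback` parameter (standing in for the proposed preference) are hypothetical.

```python
# Sketch of the two conversion policies described above.
# name_from_disk / name_to_disk are hypothetical names, not aMule functions.

def name_from_disk(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Step 1: try UTF-8 first; if the bytes are not valid UTF-8, use the fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


def name_to_disk(name: str) -> bytes:
    """Step 2: try ISO-8859-1 first; if a character does not fit, use UTF-8."""
    try:
        return name.encode("iso-8859-1")
    except UnicodeEncodeError:
        return name.encode("utf-8")


if __name__ == "__main__":
    polish = "ę".encode("iso-8859-2")              # b'\xea' on an ISO-8859-2 system
    print(name_from_disk(polish))                   # hardcoded fallback: shown as ê
    print(name_from_disk(polish, "iso-8859-2"))     # configurable fallback: shown as ę
    print(name_to_disk("ę"))                        # not in ISO-8859-1, so UTF-8 bytes
```

With the hardcoded fallback the Polish name is misread; with a user-chosen fallback it is read correctly, which is the case for making it a preference.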
In my opinion, we should stick to UTF-8 local encoding, even if the user has problems with it. #1 does not present a problem, and the user could in fact choose the fallback encoding to help aMule share the file with its correct name, but #2 is currently doing a bad job converting UNICODE to ISO-8859-1 and saving files with this encoding, when it should be using UTF-8. If we only use UTF-8 in #2 the user would possibly see some strange characters due to the lack of UTF-8 support from the system, but aMule and other UTF-8 aware applications would retrieve the file name correctly. My vote is to drop anything different from UTF-8 from #2.
Historically, wxWidgets did not help a lot in this matter, and in fact they were reluctant to accept this as being a problem the first time we reported it to them. Things have improved a bit since they created a wxConvFileName object that originally used a "broken file name conversion object", which is currently replaced by aMule's ConvAmuleBrokenFileNames object. Funny thing, they call it a broken file name. It is broken because wx can't deal with it properly, indeed, a very convenient attitude. Another problem has been their lack of consistency in using wxConvFileName internally in wxWidgets, but they seem to be improving in that matter. I just got tired of discussing with them, I don't do that anymore, I have other real life stuff to spend my limited time on more effectively.
At the present moment, I believe that the problem has little to do with wxWidgets, unless they are still forgetting to use wxConvFileName somewhere. We just have to figure out what is the best policy and implement it.
-
I personally agree on only supporting UTF8. The application will work very well and the negative impact for the few people not using Unicode is very small - even smaller than with today's solution. (They don't have to rename every file containing a special character anymore, it will only misdisplay the filename.)
-
Are you comfortable with compiling aMule? If yes, please open the file src/libs/common/ConvAmule.h and change the used encoding in line 59 and try it again on your system.
Wuischke, thank you! I did it and things are getting better -- when I look at the shared files I can see properly displayed filenames now (iso-8859-2, success), but if I search for them amule still recognizes (the same files) as iso-8859-1.
I found out that iso-8859-1 is mentioned also in src/FileFunctions.cpp but I have no knowledge about amule so I don't know when those functions are used. Should I apply conversion to fallback encoding instead of direct copy of filename?
Phoenix,
In my opinion, we should stick to UTF-8 local encoding, even if the user has problems with it. #1 does not present a problem, and the user could in fact choose the fallback encoding to help aMule share the file with its correct name, but #2 is currently doing a bad job converting UNICODE to ISO-8859-1 and saving files with this encoding, when it should be using UTF-8. If we only use UTF-8 in #2 the user would possibly see some strange characters due to the lack of UTF-8 support from the system, but aMule and other UTF-8 aware applications would retrieve the file name correctly. My vote is to drop anything different from UTF-8 from #2.
I think it is ok, since when you get the file you can rename it according to your locale (in case if anything is wrong of course).
-
Folks,
I am sorry, I forgot an important detail. It's been quite a while since I was last involved with this.
The file name conversion routines are, respectively, for #1 MB2WC() and for #2 WC2MB(). The tricky part is that in an ideal world, WC2MB( MB2WC( s ) ) == s, i.e., converting to UNICODE and then back to multibyte, should be an invariant. If you read my previous explanation, you will notice that this is not the case. So, what are the options:
1) Leave the code as it is. This means that the code will fail if there is a UTF-8 file name in your system that can be converted to ISO-8859-1, because WC2MB() will then produce the ISO-8859-1 bytes instead of the original UTF-8 ones.
aMule will report something like this for a UTF-8 file name containing an accented letter "ú" (u + acute accent):
2007-12-29 17:14:02: Logger.cpp(268): Error: Failed to retrieve file times for '/home/myuser/.aMule/Incoming/filmes/asdú.mp4' (error 2: No such file or directory)
2007-12-29 17:14:02: FileFunctions.cpp(187): FileIO: Error on GetLastModificationTime from `/home/myuser/.aMule/Incoming/filmes/asdú.mp4'
2007-12-29 17:14:02: CFile.cpp(135): CFile: Error when opening file (/home/myuser/.aMule/Incoming/filmes/asdú.mp4): No such file or directory
This is pretty bad IMHO, because a UTF-8 configured system will fail for no apparent reason.
2) Change #1 and #2 so that we only recognize UTF-8 valid names. This means that the app will fail to read non-UTF-8 file names. The fact that your system is configured to use UTF-8 is irrelevant here, an application can always save a file name such that it is an invalid UTF-8 sequence. ISO-8859-1 file names would not be able to be shared. This is bad, but maybe not so much, non-UTF-8 systems are starting to get rare.
3) Leave #1 as it is, so that we are always able to read a file name from the system, and change #2 to always convert the UNICODE file name to UTF-8. This would also break things for ISO-8859-1 names: aMule will not be able to share these names because the invariant is broken. UTF-8 names will work fine. I see no big advantage over the previous choice and maybe we are just postponing an error that should be caught sooner.
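The broken invariant under option 1 can be reproduced with a short Python sketch. The functions below only mimic the current policy as described in this thread (UTF-8 then ISO-8859-1 when reading, ISO-8859-1 then UTF-8 when writing); they are stand-ins, not the real MB2WC()/WC2MB() implementations.

```python
# Stand-ins for the current conversion policy, used to test WC2MB(MB2WC(s)) == s.

def mb2wc(raw: bytes) -> str:
    """Bytes from the file system -> Unicode: UTF-8 first, else ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


def wc2mb(name: str) -> bytes:
    """Unicode -> bytes for the file system: ISO-8859-1 first, else UTF-8."""
    try:
        return name.encode("iso-8859-1")
    except UnicodeEncodeError:
        return name.encode("utf-8")


def round_trips(raw: bytes) -> bool:
    """True when the name on disk survives the conversion in both directions."""
    return wc2mb(mb2wc(raw)) == raw


if __name__ == "__main__":
    samples = {
        "UTF-8 'asdú.mp4'": "asdú.mp4".encode("utf-8"),
        "ISO-8859-1 'asdú.mp4'": "asdú.mp4".encode("iso-8859-1"),
        "UTF-8 '文.mp4'": "文.mp4".encode("utf-8"),
    }
    for label, raw in samples.items():
        print(label, "round-trips:", round_trips(raw))
```

Only the first sample fails: it is valid UTF-8 whose characters also fit in ISO-8859-1, so it comes back as different bytes and the file can no longer be found under that name.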
The big problem is that once the file name is converted to UNICODE by MB2WC(), there is no way for us to know if the original encoding was UTF-8 or ISO-8859, and we need this information to satisfy the invariance relation. The proper solution IMHO would be to patch wxWidgets so that it would remember the original file name string or the original encoding, somewhere in its internal file structure.
I would really appreciate opinions here; this issue has been bugging us for too long, and all the cards are on the table now. I need help from the other people in the project as well as anyone wishing to contribute. My vote is for solution number 2, but we must be aware that this will break sharing for all non-UTF-8 names.
Hope to hear from lots of people.
EDIT: Another possibility (#4) would be not to use wxStrings to store file names. I don't know if this is possible given that wx file functions expect wxStrings.
-
Phoenix,
Maybe I'm missing something, but why not use pairs, something like this
PAIR(FileName,wxString,local,wxString,outer_world);
The encoding for local filenames would also be taken from the system without any modification. System says it is iso-8859-1, fine, no problem. System says it is utf8 -- fine.
On the other hand, "outer_world" would always be the representation of how the world sees that file (most likely it would be utf8).
For example, I have a file to share: it is remembered with its original encoding, and yet it has "outer_world" set after conversion, so anybody else can search for that file and get correct results.
And vice versa -- when I get something, "outer_world" is set to the filename as I see it through the net, but the "local" part is converted to what will be saved on my disk.
So it is close to your #4 solution, however at every point you will have a wxString. If you want to do something locally you take the "local" part, if you want to transfer something you use the "outer_world" part.
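The pair idea could look roughly like the Python sketch below. It is an adaptation rather than the PAIR macro itself: the "local" half is kept as raw bytes instead of a wxString, and the class and method names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileName:
    """Hypothetical pair: exact on-disk bytes plus the Unicode name seen by others."""
    local: bytes       # used for every local file operation, never re-encoded
    outer_world: str   # used for searching, display and the network

    @classmethod
    def from_disk(cls, raw: bytes, local_encoding: str) -> "FileName":
        # Keep the original bytes; derive the Unicode view from the local encoding.
        return cls(local=raw, outer_world=raw.decode(local_encoding, errors="replace"))

    @classmethod
    def from_network(cls, name: str, local_encoding: str) -> "FileName":
        # Keep the network name; convert once to the form saved on disk.
        return cls(local=name.encode(local_encoding, errors="replace"), outer_world=name)


if __name__ == "__main__":
    shared = FileName.from_disk("żółw.txt".encode("iso-8859-2"), "iso-8859-2")
    print(shared.local, shared.outer_world)
```

Because "local" is never recomputed from "outer_world", a file that was found on disk can always be opened again under the same bytes.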
My 2 cents :-)
have a nice day, bye
-
Hi macias,
Thank you for your time.
File names on the network are UNICODE encoded. There is no provision for exchanging locale information. That is something we must live with, but that presents no problems because you can convert anything to UNICODE.
The problem is that we use a library called wxWidgets that uses some kind of file abstraction which, in my opinion, is brain-damaged, because the function that maps character strings to UNICODE strings is not one-to-one. There are, e.g., two ways to encode the letter ú, namely ISO-8859-1 and UTF-8. To correctly access the file stored in your local system, you MUST use the same encoding that was used to find the file. In fact, Linux does not know anything about character encodings. This is the reason why I said before that WC2MB( MB2WC( s ) ) == s must be true. Whenever this relation is not satisfied, wxWidgets will fail, it is not aMule's fault. What we are trying here is either to find a workaround or decide to break things in some systems. By the way, things are currently broken on lots of systems...
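That file names are plain bytes to Linux can be checked with a small Python experiment. This sketch assumes a Linux file system that accepts arbitrary bytes in names (ext3/ext4 and tmpfs do; some other systems reject non-UTF-8 names); the function name is made up.

```python
import os
import tempfile


def two_files_one_visible_name() -> list:
    """Create 'ú.txt' twice, once per encoding, and list the directory as bytes."""
    latin1 = "ú.txt".encode("iso-8859-1")   # b'\xfa.txt'
    utf8 = "ú.txt".encode("utf-8")          # b'\xc3\xba.txt'
    with tempfile.TemporaryDirectory() as directory:
        dir_bytes = os.fsencode(directory)
        for raw in (latin1, utf8):
            # A bytes path is passed to the OS unchanged, with no encoding guess.
            with open(os.path.join(dir_bytes, raw), "wb") as handle:
                handle.write(b"x")
        return sorted(os.listdir(dir_bytes))


if __name__ == "__main__":
    # Two distinct files, although both names mean "ú.txt" to a human.
    print(two_files_one_visible_name())
```

Both entries decode to the same Unicode string under their own encodings, so a Unicode-only file name cannot say which of the two files is meant.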
So, with this in mind, I totally agree with you that the clean solution would be to use a pair or a struct, but unfortunately we cannot use your suggestion, because it is something that must be implemented in wxWidgets and not in aMule. wxWidgets stores file names in wxStrings, which use UNICODE, which seems reasonable at first, but is BBD (broken by design (TM)) for the reasons explained above. I have told the wxDevs before about this problem, but I have failed to make them understand that.
Cheers!
-
time to use fopen? :P
-
time to use fopen? :P
Short answer, yes.
Long answer, we should store the file name and the encoding in a "FileNameData" class and use it consistently throughout the program. We would definitely use fopen, but you see, we might lose some portability in other file-related functions.
That solution sucks because we would be basically reimplementing wxWidgets functionality that they were unable to implement properly.
-
we don't lose portability, we use fopen and then associate the FILE* to a wxFFile object and use it like we're doing now.
-
Everybody, please test tomorrow's SVN tarball. Xaignar has probably fixed this issue.
-
Just tried with amule and amuled, Incoming and Temp on a NTFS partition (mounted by ntfs-3g). Everything seems to work correctly!
Thanks again for your marvellous job!
EDIT: I forgot to report my locale settings: en_US.utf8 and ntfs-3g partitions mounted with options "locale=en_US.utf8"
Bye,
Mr Hyde
-
I tested the 20080205 version, and compared to the patched (ConvAmule) 20071226 one I would say it is a regression -- no matter what I do (*), I don't see Polish characters.
(*) LANG=en_EN or LANG=pl_PL.ISO-8859-2 (I use ISO-8859-2 characters in filenames)
In 20071216 (but only patched!) Polish characters are displayed 100% correctly.
-
If I'm not mistaken, this is intended behaviour.
We use only Unicode now, so you'll get a misdisplay, but at least all the filenames should be written correctly to disk. It would be preferable to use Unicode on your system, too, but iirc you stated some reasons not to use Unicode earlier.
Keep in mind that I haven't read the code after the changes yet, so I could be wrong about what I say.
-
We use only Unicode now, so you'll get a misdisplay, but at least all the filenames should be written correctly to disk.
Isn't that impossible by definition? Unicode can be properly written if the FS is Unicode. If it is not, and no conversion is done, there is no change (i.e. filenames will be written incorrectly).
imho amule should take LANG into account, because it is quite a good indicator of how the system is set up.
-
macias,
Xaignar is now looking at your issue and it will be fixed. Please test and report your results after he says it is committed.
Cheers!
-
We use only Unicode now, so you'll get a misdisplay, but at least all the filenames should be written correctly to disk.
It is impossible by definition?
No, the problem with not displaying the Polish chars is because of a bug in the demangling routines used for printing. The filename strings used when actually accessing the filesystem are not affected by this problem. In any case, I've committed a fix, so that these routines should now use the system locale when possible, instead of ISO-8859-1, which was the previously hardcoded fallback.
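The idea behind the fix, falling back to the system locale instead of a hardcoded ISO-8859-1, can be sketched in Python as follows. This is not the committed code: the function name is hypothetical, and `locale.getpreferredencoding(False)` is used here as one way to ask which encoding the current locale implies.

```python
import locale
from typing import Optional


def demangle_for_display(raw: bytes, fallback: Optional[str] = None) -> str:
    """Decode file name bytes for printing: UTF-8, then locale, then ISO-8859-1."""
    if fallback is None:
        # Encoding implied by the current locale settings (LANG, LC_ALL, ...).
        fallback = locale.getpreferredencoding(False)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    try:
        return raw.decode(fallback)
    except (UnicodeDecodeError, LookupError):
        # Last resort: ISO-8859-1 maps every byte, so this cannot fail.
        return raw.decode("iso-8859-1")


if __name__ == "__main__":
    # The fallback is given explicitly so the result does not depend on this machine.
    print(demangle_for_display("żółw".encode("iso-8859-2"), "iso-8859-2"))
```

Only the display string is affected; the bytes used to access the file stay untouched, which matches the distinction made above.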
-
No, the problem with not displaying the Polish chars is because of a bug in the demangling routines used for printing. The filename strings used when actually accessing the filesystem are not affected by this problem. In any case, I've commited a fix, so that these routines should now use the system locale when possible, instead of ISO-8859-1, which was the previously hardcoded fallback.
I downloaded 20080206 SVN and all filenames with Polish characters are now displayed 100% properly. THANK YOU!
-
I downloaded 20080206 SVN and all filenames with Polish characters are now displayed 100% properly. THANK YOU!
Glad to hear it, and apologies for the oversight on my part.