aMule Forum

Author Topic: Wrong files name in Chinese  (Read 6061 times)

ccav2000

  • Newbie
  • Karma: 0
  • Offline Offline
  • Posts: 3
Wrong files name in Chinese
« on: May 22, 2010, 04:01:52 AM »

UnescapeHTML has a problem converting Chinese UTF-8 to Unicode.

Code:
--- ../aMule-2.2.6/src/libs/common/StringFunctions.cpp 2009-03-29 03:29:59.000000000 +0800
+++ ./src/libs/common/StringFunctions.cpp 2010-05-08 09:52:24.382776700 +0800
@@ -159,6 +159,25 @@
 }
 
 
+__inline size_t UTF8Len(char byte)
+{
+   if ((byte & 0x80) == 0x0) {
+      return 1;
+   } else if ((byte & 0xE0) == 0xC0) {
+      return 2;
+   } else if ((byte & 0xF0) == 0xE0) {
+      return 3;
+   } else if ((byte & 0xF8) == 0xF0) {
+      return 4;
+   } else if ((byte & 0xFC) == 0xF8) {
+      return 5;
+   } else if ((byte & 0xFE) == 0xFC) {
+      return 6;
+   } else {
+      return 1;
+   }
+}
+
 wxString UnescapeHTML( const wxString& str )
 {
  wxString result;
@@ -168,14 +187,32 @@
  if ( str.GetChar(i) == wxT('%') && ( i + 2 < str.Len() ) ) {
  wxChar unesc = HexToDec( str.Mid( i + 1, 2 ) );
 
- if ( unesc ) {
- i += 2;
-
- result += unesc;
- } else {
- // If conversion failed, then we just add the escape-code
- // and continue past it like nothing happened.
- result += str.at(i);
+ size_t len;
+ if ((len=UTF8Len (unesc))>1) {
+    //multi chars
+    if ((i+(len*3)) >= str.Len()) break;
+    unsigned char s[8];
+            s[0] = (unsigned char)unesc;
+            size_t ii;
+            i+=3;
+    for (ii=1; ii<len; i+=3, ++ii) {
+       unesc = HexToDec (str.Mid(i+1,2));
+       s[ii] = (unsigned char)unesc;
+    }
+    s[ii] = 0;
+    result += UTF82unicode((char*)s);
+    --i;
+ }
+ else {
+            if ( unesc ) {
+               i += 2;
+
+               result += unesc;
+            } else {
+               // If conversion failed, then we just add the escape-code
+               // and continue past it like nothing happened.
+               result += str.at(i);
+            }
  }
  } else {
  result += str.at(i);
Logged

Kry

  • Ex-developer
  • Retired admin
  • Hero Member
  • *****
  • Karma: -665
  • Offline Offline
  • Posts: 5822
Re: Wrong files name in Chinese
« Reply #1 on: May 22, 2010, 11:01:30 AM »

Thanks for your contribution, but:

1) Please create the patch against current SVN, not the 2.2.x releases.
2) Please make sure the code is correctly indented, as the current patch is barely readable with the indentation being wrong all over the place.
Logged

ccav2000

  • Newbie
  • Karma: 0
  • Offline Offline
  • Posts: 3
Re: Wrong files name in Chinese
« Reply #2 on: May 25, 2010, 02:05:24 AM »

OK, where can I get the aMule SVN sources?
Logged

Kry

  • Ex-developer
  • Retired admin
  • Hero Member
  • *****
  • Karma: -665
  • Offline Offline
  • Posts: 5822
Logged

ccav2000

  • Newbie
  • Karma: 0
  • Offline Offline
  • Posts: 3
Re: Wrong files name in Chinese
« Reply #4 on: May 25, 2010, 05:52:48 AM »

Patch against aMule-SVN-r10179:
Code:
diff -Nur aMule-SVN-r10179/src/libs/common/StringFunctions.cpp aMule-SVN-r10179-ccav2000/src/libs/common/StringFunctions.cpp
--- aMule-SVN-r10179/src/libs/common/StringFunctions.cpp 2010-04-19 04:52:05.000000000 +0800
+++ aMule-SVN-r10179-ccav2000/src/libs/common/StringFunctions.cpp 2010-05-25 11:47:06.542029454 +0800
@@ -109,6 +109,24 @@
  return result;
 }
 
+size_t UTF8Len (char byte)
+{
+ if ((byte & 0x80) == 0x0) {
+ return 1;
+ } else if ((byte & 0xE0) == 0xC0) {
+ return 2;
+ } else if ((byte & 0xF0) == 0xE0) {
+ return 3;
+ } else if ((byte & 0xF8) == 0xF0) {
+ return 4;
+ } else if ((byte & 0xFC) == 0xF8) {
+ return 5;
+ } else if ((byte & 0xFE) == 0xFC) {
+ return 6;
+ } else {
+ return 1;
+ }
+}
 
 wxString UnescapeHTML( const wxString& str )
 {
@@ -118,15 +136,30 @@
  for ( size_t i = 0; i < str.Len(); ++i ) {
  if ( str.GetChar(i) == wxT('%') && ( i + 2 < str.Len() ) ) {
  wxChar unesc = HexToDec( str.Mid( i + 1, 2 ) );
-
- if ( unesc ) {
- i += 2;
-
- result += unesc;
- } else {
- // If conversion failed, then we just add the escape-code
- // and continue past it like nothing happened.
- result += str.at(i);
+ size_t len;
+ if ((len=UTF8Len(unesc))>1) {
+ if ((i+(len*3)) >= str.Len()) break;
+ unsigned char s[8];
+ memset (s, 0, sizeof(s));
+ s[0] = (unsigned char)unesc;
+ i += 3;
+ for (int j=1; j<len; ++j, i+=3) {
+ unesc = HexToDec (str.Mid(i+1,2));
+ s[j] = (unsigned char)unesc;
+ }
+ result += UTF82unicode ((char*)s);
+ --i;
+ }
+ else {
+ if ( unesc ) {
+ i += 2;
+
+ result += unesc;
+ } else {
+ // If conversion failed, then we just add the escape-code
+ // and continue past it like nothing happened.
+ result += str.at(i);
+ }
  }
  } else {
  result += str.at(i);
Logged

Kry

  • Ex-developer
  • Retired admin
  • Hero Member
  • *****
  • Karma: -665
  • Offline Offline
  • Posts: 5822
Re: Wrong files name in Chinese
« Reply #5 on: May 25, 2010, 03:23:11 PM »

There are a couple extra style problems, but I'll take care of them when I review and include it (and I will, most probably). Thanks for the contribution!
Logged

GonoszTopi

  • The current man in charge of most things.
  • Administrator
  • Hero Member
  • *****
  • Karma: 166
  • Offline Offline
  • Posts: 2679
Re: Wrong files name in Chinese
« Reply #6 on: May 25, 2010, 07:17:49 PM »

Code:
+ result += UTF82unicode ((char*)s);
The default wxConvUTF8 object will ignore invalid UTF-8 sequences, thus all %-escaped non-UTF-8 sequences will be removed from the result. For example, an ISO-8859-1 (Latin-1) encoded file name might include the "Ã" character encoded as %C3, which is also a valid (two-byte) UTF-8 sequence header. Since it's unlikely that the next character will follow as a valid UTF-8 sequence (still assuming the file name is encoded in ISO-8859-1), the whole sequence (the "Ã" and the following character) will be dropped from the file name.
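
The Latin-1 case above can be sketched in isolation. This is a minimal illustration, not aMule or wxWidgets code: it uses plain std::string for the raw bytes, and the helper names IsContinuationByte and IsValidTwoByteSequence are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if b has the form 10xxxxxx,
// i.e. it is a UTF-8 continuation byte.
inline bool IsContinuationByte(unsigned char b)
{
	return (b & 0xC0) == 0x80;
}

// Hypothetical helper: true if a well-formed two-byte UTF-8 sequence
// starts at pos. Lead bytes 0xC0 and 0xC1 are rejected because they
// could only encode overlong forms.
inline bool IsValidTwoByteSequence(const std::string& bytes, std::size_t pos)
{
	if (pos + 1 >= bytes.size()) {
		return false;	// no room for the continuation byte
	}
	const unsigned char lead = static_cast<unsigned char>(bytes[pos]);
	const unsigned char next = static_cast<unsigned char>(bytes[pos + 1]);
	return lead >= 0xC2 && lead <= 0xDF && IsContinuationByte(next);
}
```

With these helpers the Latin-1 bytes C3 62 ("Ã" followed by "b") are rejected, while C3 A9 (UTF-8 for "é") is accepted. A converter that silently skips the rejected pair loses both characters.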

Another thing is that this decoder requires the whole UTF-8 sequence be either %-encoded or left as-is. If only part of a sequence is %-encoded, it will either be ignored (starting byte is %-encoded) or %-decoded but each byte left as a separate character (starting byte isn't %-encoded).
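
One way to avoid the partial-encoding problem is to unescape into raw bytes first and leave any character-set interpretation for later. The following is a sketch under assumptions: it works on std::string instead of wxString, and HexDigitValue and PercentDecodeBytes are hypothetical names, not existing aMule functions.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: value of a hex digit, or -1 if c is not one.
inline int HexDigitValue(char c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'a' && c <= 'f') return c - 'a' + 10;
	if (c >= 'A' && c <= 'F') return c - 'A' + 10;
	return -1;
}

// Hypothetical helper: replaces every well-formed %XX escape with the
// byte it names and copies everything else unchanged. The result is
// still a sequence of raw bytes; no encoding is assumed here.
inline std::string PercentDecodeBytes(const std::string& in)
{
	std::string out;
	out.reserve(in.size());
	for (std::size_t i = 0; i < in.size(); ++i) {
		if (in[i] == '%' && i + 2 < in.size()) {
			const int hi = HexDigitValue(in[i + 1]);
			const int lo = HexDigitValue(in[i + 2]);
			// %00 is left as-is, mirroring the `if ( unesc )`
			// check in the patch.
			if (hi >= 0 && lo >= 0 && (hi | lo) != 0) {
				out += static_cast<char>(hi * 16 + lo);
				i += 2;
				continue;
			}
		}
		// Not an escape (or a malformed one): copy the byte verbatim.
		out += in[i];
	}
	return out;
}
```

Because escaped and unescaped bytes end up in the same buffer, "%E4%B8%AD", "%E4" followed by a raw B8 byte and "%AD", and a raw E4 byte followed by "%B8%AD" all yield the same three bytes E4 B8 AD.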

I'd therefore suggest that we first %-decode the whole string, and then try to interpret it as a UTF-8 encoded string. If that fails for some reason (e.g. the string is not encoded in UTF-8), fall back to some other (probably ISO-8859-1) encoding.
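
The second half of that suggestion (try UTF-8, fall back to ISO-8859-1) could look roughly like this. It is only a sketch: it assumes the string has already been %-decoded into raw bytes, it returns code points in a std::u32string rather than a wxString, and TryDecodeUTF8 and DecodeFileNameBytes are hypothetical names.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: strictly decodes UTF-8 into code points.
// Returns false on any malformed, truncated, overlong, surrogate or
// out-of-range sequence.
inline bool TryDecodeUTF8(const std::string& bytes, std::u32string& out)
{
	out.clear();
	std::size_t i = 0;
	while (i < bytes.size()) {
		const unsigned char lead = static_cast<unsigned char>(bytes[i]);
		std::size_t len;
		char32_t cp;
		char32_t lowest;	// smallest code point allowed for this length
		if (lead < 0x80)                { len = 1; cp = lead;        lowest = 0; }
		else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; lowest = 0x80; }
		else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; lowest = 0x800; }
		else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; lowest = 0x10000; }
		else return false;	// stray continuation byte or invalid lead byte
		if (len > bytes.size() - i) return false;	// truncated sequence
		for (std::size_t k = 1; k < len; ++k) {
			const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
			if ((c & 0xC0) != 0x80) return false;	// not a continuation byte
			cp = (cp << 6) | (c & 0x3F);
		}
		if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
			return false;
		}
		out += cp;
		i += len;
	}
	return true;
}

// Hypothetical helper: interpret the whole byte string as UTF-8 if
// possible, otherwise as ISO-8859-1, where each byte maps directly to
// the code point with the same value.
inline std::u32string DecodeFileNameBytes(const std::string& bytes)
{
	std::u32string result;
	if (TryDecodeUTF8(bytes, result)) {
		return result;
	}
	result.clear();
	for (char c : bytes) {
		result += static_cast<char32_t>(static_cast<unsigned char>(c));
	}
	return result;
}
```

Here the bytes E4 B8 AD decode to the single code point U+4E2D, while C3 62 fails the UTF-8 check and comes back as the two Latin-1 characters U+00C3 and U+0062 instead of being dropped. The fallback is applied to the whole string, so a name that mixes both encodings would be read entirely as ISO-8859-1.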
Logged
concordia cum veritate

Kry

  • Ex-developer
  • Retired admin
  • Hero Member
  • *****
  • Karma: -665
  • Offline Offline
  • Posts: 5822
Re: Wrong files name in Chinese
« Reply #7 on: May 25, 2010, 07:56:45 PM »

You know what, GonoszTopi just got the task of making this work.
Logged

GonoszTopi

  • The current man in charge of most things.
  • Administrator
  • Hero Member
  • *****
  • Karma: 166
  • Offline Offline
  • Posts: 2679
Re: Wrong files name in Chinese
« Reply #8 on: May 25, 2010, 08:03:55 PM »

Quote from Kry: "You know what, GonoszTopi just got the task of making this work."

Oh, nooo....

I don't feel like implementing it today, and I'll forget it by tomorrow ;) Anyone volunteering?
Logged
concordia cum veritate

Kry

  • Ex-developer
  • Retired admin
  • Hero Member
  • *****
  • Karma: -665
  • Offline Offline
  • Posts: 5822
Re: Wrong files name in Chinese
« Reply #9 on: May 25, 2010, 08:13:18 PM »

You.
Logged

GonoszTopi

  • The current man in charge of most things.
  • Administrator
  • Hero Member
  • *****
  • Karma: 166
  • Offline Offline
  • Posts: 2679
Re: Wrong files name in Chinese
« Reply #10 on: May 25, 2010, 10:22:44 PM »

Quote from Kry: "You."

k, then, this should do the trick.
Logged
concordia cum veritate

Stu Redman

  • Administrator
  • Hero Member
  • *****
  • Karma: 214
  • Offline Offline
  • Posts: 3762
  • Engines screaming
Re: Wrong files name in Chinese
« Reply #11 on: May 29, 2010, 07:48:02 PM »

Did someone take away your commit rights?  ;)
Logged
The image of mother goddess, lying dormant in the eyes of the dead, the sheaf of the corn is broken, end the harvest, throw the dead on the pyre -- Iron Maiden, Isle of Avalon

GonoszTopi

  • The current man in charge of most things.
  • Administrator
  • Hero Member
  • *****
  • Karma: 166
  • Offline Offline
  • Posts: 2679
Re: Wrong files name in Chinese
« Reply #12 on: May 29, 2010, 11:50:51 PM »

Quote from Stu Redman: "Did someone take away your commit rights?  ;)"
I hope not...

I was waiting for:
a) you to fix your last commit
b) someone to actually try and verify the patch
Logged
concordia cum veritate

wolverine864

  • Approved Newbie
  • *
  • Karma: 0
  • Offline Offline
  • Posts: 14
Re: Wrong files name in Chinese
« Reply #13 on: June 11, 2010, 03:58:30 PM »

OK, I didn't check whether it was meant to fix the blank file names containing Asian characters under OS X, but I tried the code and it did not solve the problem.
Logged

GonoszTopi

  • The current man in charge of most things.
  • Administrator
  • Hero Member
  • *****
  • Karma: 166
  • Offline Offline
  • Posts: 2679
Re: Wrong files name in Chinese
« Reply #14 on: June 12, 2010, 09:27:02 PM »

No, it wasn't meant to fix that.
Logged
concordia cum veritate