aMule Forum

English => en_Bugs => Topic started by: ccav2000 on May 22, 2010, 04:01:52 AM

Title: Wrong files name in Chinese
Post by: ccav2000 on May 22, 2010, 04:01:52 AM
UnescapeHTML fails to convert %-escaped Chinese UTF-8 file names to Unicode. This patch fixes it:

Code:
--- ../aMule-2.2.6/src/libs/common/StringFunctions.cpp 2009-03-29 03:29:59.000000000 +0800
+++ ./src/libs/common/StringFunctions.cpp 2010-05-08 09:52:24.382776700 +0800
@@ -159,6 +159,25 @@
 }
 
 
+__inline size_t UTF8Len(char byte)
+{
+   if ((byte & 0x80) == 0x0) {
+      return 1;
+   } else if ((byte & 0xE0) == 0xC0) {
+      return 2;
+   } else if ((byte & 0xF0) == 0xE0) {
+      return 3;
+   } else if ((byte & 0xF8) == 0xF0) {
+      return 4;
+   } else if ((byte & 0xFC) == 0xF8) {
+      return 5;
+   } else if ((byte & 0xFE) == 0xFC) {
+      return 6;
+   } else {
+      return 1;
+   }
+}
+
 wxString UnescapeHTML( const wxString& str )
 {
  wxString result;
@@ -168,14 +187,32 @@
  if ( str.GetChar(i) == wxT('%') && ( i + 2 < str.Len() ) ) {
  wxChar unesc = HexToDec( str.Mid( i + 1, 2 ) );
 
- if ( unesc ) {
- i += 2;
-
- result += unesc;
- } else {
- // If conversion failed, then we just add the escape-code
- // and continue past it like nothing happened.
- result += str.at(i);
+ size_t len;
+ if ((len=UTF8Len (unesc))>1) {
+    //multi chars
+    if ((i+(len*3)) >= str.Len()) break;
+    unsigned char s[8];
+            s[0] = (unsigned char)unesc;
+            size_t ii;
+            i+=3;
+    for (ii=1; ii<len; i+=3, ++ii) {
+       unesc = HexToDec (str.Mid(i+1,2));
+       s[ii] = (unsigned char)unesc;
+    }
+    s[ii] = 0;
+    result += UTF82unicode((char*)s);
+    --i;
+ }
+ else {
+            if ( unesc ) {
+               i += 2;
+
+               result += unesc;
+            } else {
+               // If conversion failed, then we just add the escape-code
+               // and continue past it like nothing happened.
+               result += str.at(i);
+            }
  }
  } else {
  result += str.at(i);
Title: Re: Wrong files name in Chinese
Post by: Kry on May 22, 2010, 11:01:30 AM
Thanks for your contribution, but:

1) Please create the patch against current SVN, not the 2.2.x releases.
2) Please make sure the code is correctly indented, as the current patch is barely readable with the indentation being wrong all over the place.
Title: Re: Wrong files name in Chinese
Post by: ccav2000 on May 25, 2010, 02:05:24 AM
OK, where can I get the SVN version of aMule?
Title: Re: Wrong files name in Chinese
Post by: Kry on May 25, 2010, 02:22:36 AM
http://forum.amule.org/index.php?topic=16651.0
Title: Re: Wrong files name in Chinese
Post by: ccav2000 on May 25, 2010, 05:52:48 AM
Here is the patch against aMule-SVN-r10179:
Code:
diff -Nur aMule-SVN-r10179/src/libs/common/StringFunctions.cpp aMule-SVN-r10179-ccav2000/src/libs/common/StringFunctions.cpp
--- aMule-SVN-r10179/src/libs/common/StringFunctions.cpp 2010-04-19 04:52:05.000000000 +0800
+++ aMule-SVN-r10179-ccav2000/src/libs/common/StringFunctions.cpp 2010-05-25 11:47:06.542029454 +0800
@@ -109,6 +109,24 @@
  return result;
 }
 
+size_t UTF8Len (char byte)
+{
+ if ((byte & 0x80) == 0x0) {
+ return 1;
+ } else if ((byte & 0xE0) == 0xC0) {
+ return 2;
+ } else if ((byte & 0xF0) == 0xE0) {
+ return 3;
+ } else if ((byte & 0xF8) == 0xF0) {
+ return 4;
+ } else if ((byte & 0xFC) == 0xF8) {
+ return 5;
+ } else if ((byte & 0xFE) == 0xFC) {
+ return 6;
+ } else {
+ return 1;
+ }
+}
 
 wxString UnescapeHTML( const wxString& str )
 {
@@ -118,15 +136,30 @@
  for ( size_t i = 0; i < str.Len(); ++i ) {
  if ( str.GetChar(i) == wxT('%') && ( i + 2 < str.Len() ) ) {
  wxChar unesc = HexToDec( str.Mid( i + 1, 2 ) );
-
- if ( unesc ) {
- i += 2;
-
- result += unesc;
- } else {
- // If conversion failed, then we just add the escape-code
- // and continue past it like nothing happened.
- result += str.at(i);
+ size_t len;
+ if ((len=UTF8Len(unesc))>1) {
+ if ((i+(len*3)) >= str.Len()) break;
+ unsigned char s[8];
+ memset (s, 0, sizeof(s));
+ s[0] = (unsigned char)unesc;
+ i += 3;
+ for (int j=1; j<len; ++j, i+=3) {
+ unesc = HexToDec (str.Mid(i+1,2));
+ s[j] = (unsigned char)unesc;
+ }
+ result += UTF82unicode ((char*)s);
+ --i;
+ }
+ else {
+ if ( unesc ) {
+ i += 2;
+
+ result += unesc;
+ } else {
+ // If conversion failed, then we just add the escape-code
+ // and continue past it like nothing happened.
+ result += str.at(i);
+ }
  }
  } else {
  result += str.at(i);
Title: Re: Wrong files name in Chinese
Post by: Kry on May 25, 2010, 03:23:11 PM
There are a couple of remaining style problems, but I'll take care of them when I review and include it (and I will, most probably). Thanks for the contribution!
Title: Re: Wrong files name in Chinese
Post by: GonoszTopi on May 25, 2010, 07:17:49 PM
Code:
+ result += UTF82unicode ((char*)s);
The default wxConvUTF8 object will ignore invalid UTF-8 sequences, thus all %-escaped non-UTF-8 sequences will be removed from the result. For example, an ISO-8859-1 (Latin-1) encoded file name might include the "Ã" character encoded as %C3, which is also a valid (two-byte) UTF-8 sequence header. Since it's unlikely that the next character will follow as a valid UTF-8 sequence (still assuming the file name is encoded in ISO-8859-1), the whole sequence (the "Ã" and the following character) will be dropped from the file name.

Another thing is that this decoder requires the whole UTF-8 sequence be either %-encoded or left as-is. If only part of a sequence is %-encoded, it will either be ignored (starting byte is %-encoded) or %-decoded but each byte left as a separate character (starting byte isn't %-encoded).

I'd therefore suggest that we first %-decode the whole string, and then try to interpret it as a UTF-8 encoded string. If that fails for some reason (e.g. the string is not encoded in UTF-8), fall back to some other (probably ISO-8859-1) encoding.
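
A minimal sketch of that two-step approach, for illustration only. It is not the patch that was eventually committed. It uses plain std::string and returns UTF-8 so that it compiles without wxWidgets, and every function name in it (HexDigit, PercentDecode, IsValidUTF8, Latin1ToUTF8, UnescapeToUTF8, RunExamples) is hypothetical. As in the existing UnescapeHTML, an escape that decodes to zero is treated as a failed conversion and left as-is.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Value of one hex digit, or -1 if c is not a hex digit.
static int HexDigit(char c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'a' && c <= 'f') return c - 'a' + 10;
	if (c >= 'A' && c <= 'F') return c - 'A' + 10;
	return -1;
}

// Step 1: %-decode into raw bytes, without interpreting them yet.
// Malformed escapes (and %00) are copied through unchanged.
std::string PercentDecode(const std::string& in)
{
	std::string out;
	for (std::size_t i = 0; i < in.size(); ++i) {
		if (in[i] == '%' && i + 2 < in.size()) {
			const int hi = HexDigit(in[i + 1]);
			const int lo = HexDigit(in[i + 2]);
			if (hi >= 0 && lo >= 0 && (hi * 16 + lo) != 0) {
				out += static_cast<char>(hi * 16 + lo);
				i += 2;
				continue;
			}
		}
		out += in[i];
	}
	return out;
}

// Step 2: check whether the bytes form well-formed UTF-8 (rejects truncated
// and overlong sequences, surrogates, and code points above U+10FFFF).
bool IsValidUTF8(const std::string& s)
{
	static const unsigned long minValue[] = { 0x0, 0x80, 0x800, 0x10000 };
	std::size_t i = 0;
	while (i < s.size()) {
		const unsigned char b = static_cast<unsigned char>(s[i]);
		std::size_t extra;
		unsigned long cp;
		if (b < 0x80)                { extra = 0; cp = b; }
		else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
		else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
		else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
		else return false;  // stray continuation byte or invalid lead byte
		if (extra > s.size() - i - 1) return false;  // sequence is cut off
		for (std::size_t k = 1; k <= extra; ++k) {
			const unsigned char c = static_cast<unsigned char>(s[i + k]);
			if ((c & 0xC0) != 0x80) return false;
			cp = (cp << 6) | (c & 0x3F);
		}
		if (cp < minValue[extra] || cp > 0x10FFFF) return false;
		if (cp >= 0xD800 && cp <= 0xDFFF) return false;
		i += extra + 1;
	}
	return true;
}

// Fallback: treat every byte as one ISO-8859-1 character and re-encode it.
std::string Latin1ToUTF8(const std::string& s)
{
	std::string out;
	for (std::size_t i = 0; i < s.size(); ++i) {
		const unsigned char b = static_cast<unsigned char>(s[i]);
		if (b < 0x80) {
			out += static_cast<char>(b);
		} else {
			out += static_cast<char>(0xC0 | (b >> 6));
			out += static_cast<char>(0x80 | (b & 0x3F));
		}
	}
	return out;
}

// Decode first, then decide on the encoding of the whole string.
std::string UnescapeToUTF8(const std::string& escaped)
{
	const std::string bytes = PercentDecode(escaped);
	return IsValidUTF8(bytes) ? bytes : Latin1ToUTF8(bytes);
}

// The three cases discussed above.
void RunExamples()
{
	// Fully %-encoded UTF-8 (two Chinese characters) is kept as UTF-8.
	assert(UnescapeToUTF8("%E4%B8%AD%E6%96%87.txt") == "\xE4\xB8\xAD\xE6\x96\x87.txt");
	// Only the lead byte is %-encoded: the sequence is still recognised.
	assert(UnescapeToUTF8("%E4\xB8\xAD.txt") == "\xE4\xB8\xAD.txt");
	// Latin-1 %C3 followed by '.' is not valid UTF-8, so the fallback
	// keeps the character instead of dropping it.
	assert(UnescapeToUTF8("%C3.txt") == "\xC3\x83.txt");
}
```

Because the encoding is decided for the whole decoded string rather than per escape sequence, partially escaped sequences and non-UTF-8 names both survive. A Latin-1 name whose bytes happen to form valid UTF-8 would still be read as UTF-8; this sketch does not try to detect that case.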
Title: Re: Wrong files name in Chinese
Post by: Kry on May 25, 2010, 07:56:45 PM
You know what, GonoszTopi just got the task of making this work.
Title: Re: Wrong files name in Chinese
Post by: GonoszTopi on May 25, 2010, 08:03:55 PM
Quote from: Kry
You know what, GonoszTopi just got the task of making this work.

Oh, nooo....

I don't feel like implementing it today, and I'll forget it by tomorrow ;) Anyone volunteering?
Title: Re: Wrong files name in Chinese
Post by: Kry on May 25, 2010, 08:13:18 PM
You.
Title: Re: Wrong files name in Chinese
Post by: GonoszTopi on May 25, 2010, 10:22:44 PM
Quote from: Kry
You.

k, then, this should do the trick.
Title: Re: Wrong files name in Chinese
Post by: Stu Redman on May 29, 2010, 07:48:02 PM
Did someone take away your commit rights?  ;)
Title: Re: Wrong files name in Chinese
Post by: GonoszTopi on May 29, 2010, 11:50:51 PM
Quote from: Stu Redman
Did someone take away your commit rights?  ;)
I hope not...

I was waiting for
a) you to fix your last commit
b) someone to actually try and verify the patch
Title: Re: Wrong files name in Chinese
Post by: wolverine864 on June 11, 2010, 03:58:30 PM
OK, I didn't check whether it was meant to fix the blank file names for files containing Asian characters under OS X, but I tried the code and it did not solve that problem.
Title: Re: Wrong files name in Chinese
Post by: GonoszTopi on June 12, 2010, 09:27:02 PM
No, it wasn't meant to fix that.