aMule Forum

English => en_Bugs => Topic started by: Gerd78 on November 11, 2005, 08:26:29 PM

Title: Inconsistent man page encodings
Post by: Gerd78 on November 11, 2005, 08:26:29 PM
Hi,

the man pages are encoded inconsistently. Most of them are in ISO-8859-1, but the following German ones are in UTF-8:

src/utils/aLinkCreator/docs/alc.de.1
src/utils/aLinkCreator/docs/alcc.de.1
docs/man/amule.de.1
docs/man/amulecmd.de.1
docs/man/amuleweb.de.1
src/utils/cas/docs/cas.de.1

The other German ones are also in ISO-8859-1.
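
For anyone who wants to reproduce this list, here is a minimal sketch, assuming Python 3 and a checkout of the source tree. The helper names (guess_encoding, report) are made up for illustration. The check can only tell "valid UTF-8" from "not valid UTF-8"; everything else is merely assumed to be ISO-8859-1, because any byte sequence decodes as ISO-8859-1.

```python
from pathlib import Path


def guess_encoding(data: bytes) -> str:
    """Rough guess: 'ascii', 'utf-8', or 'iso-8859-1' as a fallback."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8. ISO-8859-1 is only an assumption here,
        # since every byte sequence is decodable as ISO-8859-1.
        return "iso-8859-1"


def report(root: str) -> dict:
    """Map each *.1 man page found under root to its guessed encoding."""
    return {
        str(path): guess_encoding(path.read_bytes())
        for path in sorted(Path(root).rglob("*.1"))
    }
```

Calling report(".") from the top of the tree should then list each man page next to its guessed encoding. Pages without any umlauts show up as "ascii" and are unaffected either way.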

I don't know anything about the distros' patches, but the original vanilla man pager doesn't even pretend to support anything other than ISO-8859-1, even in UTF-8 locales. But even if UTF-8 were correct, the encoding should be consistent across all man pages and languages within the aMule package.
Title: Re: Inconsistent man page encodings
Post by: Vollstrecker on November 12, 2005, 01:27:15 AM
That depends on who created that page, and which editor he used.
Are they usable, making this just a nice-to-know report, or do you get problems because of this issue?
Title: Re: Inconsistent man page encodings
Post by: Gerd78 on November 12, 2005, 02:16:29 AM
Quote
Originally posted by Vollstrecker
Are they usable, making this just a nice-to-know report, or do you get problems because of this issue?
The latter.

Good man page in ISO-8859-1 (src/utils/wxCas/docs/wxcas.de.1):

(http://img396.imageshack.us/img396/5145/gut8my.png)

Bad man page in UTF-8 (src/utils/cas/docs/cas.de.1)

(http://img396.imageshack.us/img396/8970/schlecht8yt.png)

Please note that my system is set to LANG=de_DE.UTF-8. But this doesn't help with UTF-8 man pages because the program "/usr/bin/man" itself doesn't understand UTF-8 at all.
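
One way to picture where the garbage comes from, as a sketch only: it assumes the man pipeline treats every input byte as one ISO-8859-1 character, which may not match exactly what a given system does. The function name is made up for illustration.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def read_latin1_correctly(text: str) -> str:
    """Encode text as ISO-8859-1 and decode it with the same charset."""
    return text.encode("iso-8859-1").decode("iso-8859-1")
```

Under that assumption, misread_as_latin1("für") gives "fÃ¼r": the two UTF-8 bytes of "ü" are shown as two separate characters. read_latin1_correctly("für") gives back "für" unchanged.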

Man pages should be in the locale's native character set, i.e. ISO-8859-1 for English, French, German, Spanish and Hungarian. That is already the case for all non-German man pages and even for some of the German ones, but not all.

"man" will support UTF-8 in the future, but it's not yet decided how to do it without breaking existing packages. One approach is a special tag for UTF-8 in the man pages themselves, and another approach is using "/usr/share/man/de.UTF-8/man1" instead of "/usr/share/man/de/man1", but neither of them is implemented right now.

If some distros accept UTF-8 man pages, that is first of all still non-standard, and second, it probably means that they accept UTF-8 man pages only. Either way, the man pages inside the aMule package should at least all have the same encoding.

"recode" or "iconv" can be used to convert text files.
Title: Re: Inconsistent man page encodings
Post by: Gerd78 on November 12, 2005, 02:49:25 AM
This is a patch for today's CVS that converts all man pages to ISO-8859-1.
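
The patch itself is not reproduced here. As a rough illustration of the kind of conversion involved (about what "iconv -f UTF-8 -t ISO-8859-1" does), here is a minimal sketch assuming Python 3. The function names are hypothetical and this is not the patch's actual content.

```python
from pathlib import Path


def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO-8859-1.

    Raises UnicodeDecodeError if data is not valid UTF-8, and
    UnicodeEncodeError if a character has no ISO-8859-1 equivalent.
    """
    return data.decode("utf-8").encode("iso-8859-1")


def convert_file(src: str, dst: str) -> None:
    """Read src as UTF-8 and write it to dst as ISO-8859-1."""
    Path(dst).write_bytes(utf8_to_latin1(Path(src).read_bytes()))
```

Failing loudly on characters outside ISO-8859-1 is deliberate here: silently replacing them would hide exactly the kind of problem this thread is about.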
Title: Re: Inconsistent man page encodings
Post by: thedude0001 on November 12, 2005, 02:35:14 PM
Quote
Originally posted by Gerd78
Man pages should be in the locale's native character set, i.e. ISO-8859-1 for English, French, German, Spanish and Hungarian. That is already the case for all non-German man pages and even for some of the German ones, but not all.

I'm not familiar with man or the creation of man pages, but it seems wrong to me that ISO-8859-1 should be the native character set, since it doesn't even include the symbol for the currency we've been using for nearly 4 years now...
Title: Re: Inconsistent man page encodings
Post by: Vollstrecker on November 12, 2005, 04:38:13 PM
Good point, but how many man pages have you seen that use that symbol? I can't remember even one.
Title: Re: Inconsistent man page encodings
Post by: Jacobo221 on November 12, 2005, 05:17:01 PM
If vanilla man does not support UTF-8, then that sounds like enough of an argument to take UTF-8 out of our man pages and port them to 8859-1.
Title: Re: Inconsistent man page encodings
Post by: Gerd78 on November 12, 2005, 06:25:05 PM
Quote
Originally posted by thedude0001
I'm not familiar with man or the creation of man pages, but it seems wrong to me that ISO-8859-1 should be the native character set, since it doesn't even include the symbol for the currency we've been using for nearly 4 years now...
That's correct and that's why I was wrong: It must be ISO-8859-15, not ISO-8859-1. But it doesn't matter that much because in the above-mentioned man pages only the common subset of both is used anyway.
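
To check the "common subset" claim for a given file, here is a minimal sketch assuming Python 3 and its built-in "iso-8859-1" and "iso-8859-15" codecs. The function names are made up for illustration.

```python
def differing_bytes() -> dict:
    """Byte values that decode differently in ISO-8859-1 and ISO-8859-15."""
    diff = {}
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("iso-8859-1")
        latin9 = raw.decode("iso-8859-15")
        if latin1 != latin9:
            diff[value] = (latin1, latin9)
    return diff


def uses_common_subset(data: bytes) -> bool:
    """True if data reads the same under both encodings."""
    diff = differing_bytes()
    return not any(byte in diff for byte in data)
```

differing_bytes() should report eight positions, among them 0xA4, which is the generic currency sign in ISO-8859-1 and the euro sign in ISO-8859-15. A man page for which uses_common_subset() returns True reads identically under either name.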
Title: Re: Inconsistent man page encodings
Post by: stefanero on November 12, 2005, 09:50:36 PM
No, sorry, no ISO whatsoever!

I made them UTF-8 on purpose, since it's the most standard on most distros anyway.

Also, it's more common to all systems all over the world. Also, all po strings are UTF-8 as well.

So if you want, you can edit the rest to UTF-8, not "fix" the right ones.
Title: Re: Inconsistent man page encodings
Post by: Kry on November 12, 2005, 10:53:20 PM
There are man pages in a lot of encodings. "man" is not restricted to ISO-8859-1, or Japanese people would be really fucked up, if you know what I mean.
Title: Re: Inconsistent man page encodings
Post by: Gerd78 on November 13, 2005, 12:16:07 AM
Quote
Originally posted by stefanero
No, sorry, no ISO whatsoever!

I made them UTF-8 on purpose, since it's the most standard on most distros anyway.
Why? How do you know that this is the correct thing to do?

UTF-8 is indeed the most widely used character set in Linux distros today, but not for man pages. As you can see above, umlauts in UTF-8 man pages are garbage in a UTF-8 environment, whereas ISO-8859-1 man pages work fine. Man pages are not locale-specific.

Using a UTF-8 locale means, for example, that the file system is in UTF-8, but not that every file needs to be converted to UTF-8. Example: .desktop files always have to be UTF-8 in any(!) locale and European man pages always have to be ISO-8859-1 in any(!) locale, whereas HTML can be in any encoding because the encoding is stored in the documents themselves.
Quote
Originally posted by stefanero
Also, it's more common to all systems all over the world. Also, all po strings are UTF-8 as well.
Man pages have nothing to do with .po files. That's a completely different thing. There it's correct to use UTF-8.
Quote
Originally posted by stefanero
So if you want, you can edit the rest to UTF-8, not "fix" the right ones.
Please read the following thread:

http://mail.nl.linux.org/linux-utf8/2005-07/msg00004.html

Especially the following post:

http://mail.nl.linux.org/linux-utf8/2005-07/msg00009.html
Quote
Originally posted by Kry
There are man pages in a lot of encodings. "man" is not restricted to ISO-8859-1, or Japanese people would be really fucked up, if you know what I mean.
See here:

http://mail.nl.linux.org/linux-utf8/2005-07/msg00009.html

Japanese man pages are not UTF-8 either. They are EUC-JP. Japanese man pages in UTF-8 are unreadable in any locale, because /usr/bin/man expects EUC-JP input and treats every input as EUC-JP, even if it's UTF-8. The result is garbage.

If you misunderstood what I said: man is indeed not limited to ISO-8859-1, but this doesn't mean that it understands UTF-8. It doesn't. It understands only one specific character set per language, and this happens to be ISO-8859-1 for all languages you have man pages for.

If you read the linked thread carefully, you will see that the Linux man page maintainers want to support UTF-8 man pages in the future, but they don't know yet how to do it. Probably they will do it by introducing a special tag in UTF-8 man pages, like in HTML, so that the man pager knows how to handle them, but currently it doesn't.

BTW it's not that I would "have to" convince anyone. It's no problem for me to convert them myself.
Title: Re: Inconsistent man page encodings
Post by: thedude0001 on November 13, 2005, 02:24:06 AM
Quote
Originally posted by Vollstrecker
Good point, but how many man pages have you seen that use that symbol? I can't remember even one.

Actually there are some more differences between ISO-8859-1 and -15, check Wikipedia (http://de.wikipedia.org/wiki/ISO-8859-1#8859-1_vs._-15_vs._Windows-1252). Chances are that most of these are not very common in man pages, but some might actually appear...
Title: Re: Inconsistent man page encodings
Post by: Vollstrecker on November 13, 2005, 03:25:19 AM
Sure, but if someone complains about the missing currency sign, I say it isn't used (at least afaik). Nobody had asked about the other differences.
Title: Re: Inconsistent man page encodings
Post by: Gerd78 on November 13, 2005, 06:12:13 AM
Check "man iso_8859-15" (not the content, but the file itself). The file is /usr/share/man/man7/iso_8859-15.7.gz or similar. It's in ISO-8859-15 and works as expected.

By the way, to eliminate any misunderstandings: This is not an accusation or something like that, just a suggestion for a tiny problem. Broken umlauts are not nice, but they won't hurt anyone.
Title: Re: Inconsistent man page encodings
Post by: stefanero on November 13, 2005, 09:08:52 AM
Hmm, you only have broken umlauts on the console if your console does not support UTF-8....

I can read the manpages here just fine
Title: Re: Inconsistent man page encodings
Post by: Gerd78 on November 13, 2005, 11:11:08 PM
Quote
Originally posted by stefanero
Hmm, you only have broken umlauts on the console if your console does not support UTF-8....
My console supports UTF-8. UTF-8 plain text files are OK with "less" and "cat". But man pages are not plain text files...
Quote
Originally posted by stefanero
I can read the manpages here just fine
Which distro are you using? Can you read UTF-8 man pages only or do both work for you?