Accented characters (UTF8 to UTF16 shenanigans)

Posted by: Roger

Accented characters (UTF8 to UTF16 shenanigans) - 02/09/2006 07:08

Here's how I rip my music.

First, I rip the CD on my Windows XP Pro box using EAC. If the CD has accented characters in the name (for example, Café del Mar), then EAC writes the filenames to disk in UTF8. This means that the name is scrambled (é becomes Ã©).

Now, I can fix this on the PC end, converting the UTF8-on-UTF16 nonsense to the correct UTF16 character.
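
A minimal sketch of that repair in Python, assuming the scrambled names are UTF-8 bytes that were read back as Windows-1252 (the usual ANSI code page on a Western European install). The function name `repair_mojibake` is made up for illustration. Names that do not fit the pattern, including ones containing bytes that Python's cp1252 codec leaves undefined, are returned unchanged:

```python
def repair_mojibake(name, ansi_codepage="cp1252"):
    """Undo UTF-8 bytes that were misread as a single-byte ANSI code page."""
    try:
        # Recover the original bytes, then decode them as what they really are.
        return name.encode(ansi_codepage).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already correct): leave it alone.
        return name


print(repair_mojibake("CafÃ© del Mar"))
```

A name that is already correct fails the UTF-8 decode step and passes through untouched, so running this twice should be harmless.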

Then I copy the files onto my Ubuntu 6.06 fileserver, where Samba correctly converts my UTF16 (on NTFS) filenames back into UTF8 (on ext3) filenames. As long as my terminal (either xterm or PuTTY) knows to interpret them as UTF8, then they're displayed correctly, too.

So far, so good.

Where I run into problems is when I attempt to copy the MP3s to my work PC (Windows 2003 SP1), using Cygwin's rsync (v2.6.6).

It picks up the UTF8 names and turns them back into the individual octets before writing them back to the UTF16 filesystem, which means that my 'é' turns back into 'Ã©'.

For example:

Code:

C:\Temp>rsync -e ssh -auv peculiar.home.differentpla.net:/home/music/flac/Compilations/*.flac .
receiving file list ... done
01 - Jos\303\251 Padilla - Agua.flac
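
The `\303\251` in that listing is rsync printing two raw bytes as octal escapes. A small Python sketch (the helper name `decode_octal_escapes` is hypothetical) shows that they are the UTF-8 encoding of é, and what they look like when read as Windows-1252 instead:

```python
import re

_ESCAPE = re.compile(r"\\([0-7]{3})")


def decode_octal_escapes(text):
    """Turn backslash-octal escapes in an ASCII listing back into raw bytes."""
    out = bytearray()
    # With one capture group, split() alternates literal text and octal digits.
    for index, part in enumerate(_ESCAPE.split(text)):
        if index % 2:
            out.append(int(part, 8))
        else:
            out += part.encode("ascii")
    return bytes(out)


raw = decode_octal_escapes(r"01 - Jos\303\251 Padilla - Agua.flac")
print(raw.decode("utf-8"))   # the bytes read as UTF-8
print(raw.decode("cp1252"))  # the same bytes read as Windows-1252
```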



How do I persuade rsync that my filenames are Unicode, and to do the necessary UTF8 to UTF16 conversion?

I've tried googling for an answer, but my google-fu is weak this morning. Any ideas?
Posted by: mlord

Re: Accented characters (UTF8 to UTF16 shenanigans) - 02/09/2006 11:39

Quote:

Code:

C:\Temp>rsync -e ssh -auv peculiar.home.differentpla.net:/home/music/flac/Compilations/*.flac .
receiving file list ... done
01 - Jos\303\251 Padilla - Agua.flac



How do I persuade rsync that my filenames are Unicode, and to do the necessary UTF8 to UTF16 conversion?

I've tried googling for an answer, but my google-fu is weak this morning. Any ideas?


Just hack the rsync source code -- should be a quick fix.

Cheers
Posted by: Roger

Re: Accented characters (UTF8 to UTF16 shenanigans) - 02/09/2006 13:00

Quote:
Just hack the rsync source code -- should be a quick fix.


Nope. I took a look. It looks like it'd be quite an involved fix. The basic problem is that Cygwin simply doesn't support Unicode properly.

As far as I can tell, I'd (probably) have to add a custom _O_NAMEISUTF8 flag to the implementation of _open, which would then convert that to wide characters (UTF16) before calling the underlying Win32 CreateFile function.

Meaning that I'd need a custom Cygwin DLL, and a hack to rsync.

Dammit.

Maybe I'll just write a script that renames the files between the two character sets before I do the sync. I.e. it'd break the names on the Windows side, run rsync, and then fix them again.
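
A rough sketch of such a script in Python, assuming the broken form is "UTF-8 bytes read as Windows-1252". All names here (`ANSI`, `break_name`, `fix_name`, `rename_tree`) are invented for illustration, and names that do not convert cleanly in that code page are skipped rather than guessed at:

```python
import os

ANSI = "cp1252"  # assumption: the Windows ANSI code page in use


def break_name(name):
    """Correct name -> scrambled form (its UTF-8 bytes read as ANSI)."""
    return name.encode("utf-8").decode(ANSI)


def fix_name(name):
    """Scrambled form -> correct name (inverse of break_name)."""
    return name.encode(ANSI).decode("utf-8")


def rename_tree(root, convert):
    """Apply convert() to every file and directory name below root."""
    renamed = 0
    # Walk bottom-up so a directory is renamed only after its contents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            try:
                new_name = convert(name)
            except UnicodeError:
                continue  # does not convert in this code page; skip it
            if new_name != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, new_name))
                renamed += 1
    return renamed
```

The idea would be `rename_tree(root, break_name)`, then rsync, then `rename_tree(root, fix_name)`. Note that breaking an already-broken tree double-encodes the names, so each direction should run exactly once per sync.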
Posted by: Roger

Re: Accented characters (UTF8 to UTF16 shenanigans) - 02/09/2006 13:16

Quote:
As far as I can tell, I'd (probably) have to add a custom _O_NAMEISUTF8 flag to the implementation of _open, which would then convert that to wide characters (UTF16) before calling the underlying Win32 CreateFile function.


It looks like I'd have to change the Cygwin source (specifically fhandler_base::open_9x in winsup/cygwin/fhandler.cc) to call CreateFileW, instead of CreateFile, and make it convert the name from UTF8 to UTF16 at that point. I'd probably make it depend on an environment variable, rather than hack on rsync, though.

If I can get the Cygwin compiler up and running, I might give it a go.
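
The conversion step itself is small; the hard part is plumbing it through Cygwin. As an illustration only (this is Python, not the Cygwin code, and `utf8_name_to_utf16_units` is a made-up name), this is the mapping a UTF8-to-UTF16 step would need to perform on a file name before a wide-character call:

```python
def utf8_name_to_utf16_units(name_bytes):
    """Decode a UTF-8 file name and return its UTF-16 code units."""
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    text = name_bytes.decode("utf-8")
    # Little-endian, no byte-order mark.
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


units = utf8_name_to_utf16_units(b"Jos\xc3\xa9")
print([hex(u) for u in units])
```

The two bytes 0xC3 0xA9 collapse into the single code unit 0xE9, which is the 'é' that should end up on disk.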
Posted by: peter

Re: Accented characters (UTF8 to UTF16 shenanigans) - 02/09/2006 13:30

Quote:
I'd probably make it depend on an environment variable

LC_ALL?

Peter
Posted by: Roger

Re: Accented characters (UTF8 to UTF16 shenanigans) - 02/09/2006 14:06

Quote:
LC_ALL?


Sounds like a plan. Unfortunately, to do a proper job, it looks like I'd have to touch quite a lot of the file handling code. I'll see if I can find time to look at it later this week.
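
For what it's worth, the check itself could be as simple as looking at the codeset part of LC_ALL. A Python sketch, assuming the usual `language_TERRITORY.codeset@modifier` layout; the helper names are hypothetical and a real implementation would also need to consider LC_CTYPE and LANG:

```python
def locale_codeset(value):
    """Return the codeset part of a locale string, or None if absent."""
    value = value.split("@", 1)[0]  # drop any @modifier
    if "." not in value:
        return None
    return value.split(".", 1)[1]


def wants_utf8(environ):
    """True if LC_ALL in the given mapping names a UTF-8 codeset."""
    codeset = locale_codeset(environ.get("LC_ALL", ""))
    return codeset is not None and codeset.lower().replace("-", "") == "utf8"


print(wants_utf8({"LC_ALL": "en_GB.UTF-8"}))
```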
Posted by: mac

Re: Accented characters (UTF8 to UTF16 shenanigans) - 04/09/2006 09:42

What Windows really needs is a way to set the "Language for non-Unicode programs" on the third tab of "Regional and Language Options" to "Any (UTF-8)". ANSI (cough) programs in theory already have to deal with MBCS so UTF-8 isn't any harder for them. I wonder if it's possible to hack on the registry to make this happen? It might be worth asking Michael Kaplan.

Mike.
Posted by: pim

Re: Accented characters (UTF8 to UTF16 shenanigans) - 09/08/2007 14:38

I finally found a fix for this:

http://www.okisoft.co.jp/esc/utf8-cygwin/

Works excellently with rsync.

Pim
Posted by: Roger

Re: Accented characters (UTF8 to UTF16 shenanigans) - 09/08/2007 14:49

Quote:
It might be worth asking Michael Kaplan.


Somebody did. He deals with that here:
http://blogs.msdn.com/michkap/archive/2006/10/11/816996.aspx
...and here:
http://blogs.msdn.com/michkap/archive/2006/07/14/665714.aspx
Posted by: Roger

Re: Accented characters (UTF8 to UTF16 shenanigans) - 09/08/2007 14:49

Quote:
Works excellently with rsync.


Excellent. I'll take a look at that later.
Posted by: wfaulk

Re: Accented characters (UTF8 to UTF16 shenanigans) - 09/08/2007 15:34

Wow. Basically what he seems to be saying is "the code in our old text functions is shit, so instead of fixing it, we decided to create new incompatible functions". Sounds like Microsoft.
Posted by: webroach

Re: Accented characters (UTF8 to UTF16 shenanigans) - 09/08/2007 20:40

Yup. When it comes right down to it, Microsoft's handling (or rather, lack thereof) of Unicode is what finally got me to switch to the Mac. Clear sailing with Unicode ever since. Except for some problems with ext3 on Feisty getting filenames with accented characters to work right. And that has nothing to do with the Mac.

Don't get me wrong; I have no idea if my problem is related to ext3... just happens to be the file system in question.
Posted by: peter

Re: Accented characters (UTF8 to UTF16 shenanigans) - 10/08/2007 10:29

Quote:
Somebody did. He deals with that

Shortsightedly, IMO. Even if Microsoft aren't going to learn from Apple's greatest technical achievement and have each future iteration of Windows N contain an emulator for Windows N-1 (and have UTF-8 be the only "narrow" codepage in the very next version), they could at least do (inside each A function):

Code:
if (codepage == CP_UTF8)
    new_sane_code();
else
    legacy_code_they_darent_touch();



Peter
Posted by: pim

Re: Accented characters (UTF8 to UTF16 shenanigans) - 13/08/2007 21:07

Now that rsync handles Unicode names on both Windows and Linux just fine, the only stuff that can't be synced is under E.S.T., R.E.M. and T.A.T.U.

I wouldn't mind dropping the trailing dot on the Windows side, but I would on the Linux side.

Does anyone have an idea?

Pim
Posted by: webroach

Re: Accented characters (UTF8 to UTF16 shenanigans) - 14/08/2007 15:26

Quote:
Now that rsync handles Unicode names on both Windows and Linux just fine, the only stuff that can't be synced is under E.S.T., R.E.M. and T.A.T.U.

I wouldn't mind dropping the trailing dot on the Windows side, but I would on the Linux side.

Does anyone have an idea?

Pim


I wish I had a suggestion. That was a constant thorn in my side when I was still on Windows. I was pleased to see that iTunes (if you choose to allow it to manage your music library) happily replaces the trailing period with an underscore for folder names.

One solution to the t.A.T.u. dilemma is to put it in the original Russian format: Taty
Posted by: pim

Re: Accented characters (UTF8 to UTF16 shenanigans) - 14/08/2007 15:44

The software that I use to manage my music collection can replace any character from the tags to any replacement character. So I already have Unicode replacements for
Code:

| / \ " * :


that look quite similar.

But I can't find something that looks like a full stop but isn't.
U+2804 (Braille pattern dots-3) is the best I could find, but that's not available on Windows.
All the others are too high or too low or too bold or both.

And there doesn't seem to be a way to have only the trailing dot replaced, yet keeping other dots.

Pim
Posted by: wfaulk

Re: Accented characters (UTF8 to UTF16 shenanigans) - 14/08/2007 15:59

How about Dot Leader, One U+2024?
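
A quick way to compare the candidates mentioned so far is to print their code points and official Unicode names. A small Python sketch using the standard `unicodedata` module (the `describe` helper is just for illustration):

```python
import unicodedata

# The ordinary full stop plus the two look-alikes suggested so far.
CANDIDATES = [".", "\u2804", "\u2024"]


def describe(ch):
    """Format a character as 'U+XXXX NAME'."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


for ch in CANDIDATES:
    print(describe(ch))
```

Whether a given font on Windows actually has a glyph for each of these is a separate question that this does not answer.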
Posted by: peter

Re: Accented characters (UTF8 to UTF16 shenanigans) - 14/08/2007 16:01

Quote:
But I can't find something that looks like a full stop but isn't.

U+FF0E? For all my non-Windows-safe characters in tags, I use the "fullwidth form" in filenames (i.e. add 0xFEE0), which isn't quite right, but isn't at all bad.
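
The "add 0xFEE0" trick can be sketched in a few lines of Python. The set of unsafe characters below is an assumption (the characters Windows reserves in file names), not necessarily the exact list used above, and the function names are made up:

```python
# Assumption: the characters Windows does not allow in file names.
WINDOWS_UNSAFE = '\\/:*?"<>|'


def to_fullwidth(ch):
    """Map a printable ASCII character to its fullwidth form."""
    return chr(ord(ch) + 0xFEE0)


def make_windows_safe(name):
    """Replace only the unsafe characters, leaving everything else alone."""
    return "".join(to_fullwidth(c) if c in WINDOWS_UNSAFE else c
                   for c in name)


print(make_windows_safe("AC/DC: Live?"))
print("U+%04X" % ord(to_fullwidth(".")))
```

The offset only makes sense for printable ASCII from '!' (U+0021) to '~' (U+007E), which map onto U+FF01 to U+FF5E; the ordinary space is not part of that block.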

Peter
Posted by: pim

Re: Accented characters (UTF8 to UTF16 shenanigans) - 14/08/2007 16:24

Quote:
U+FF0E?


Yes, that's very good. It makes it look like there's a space behind the dot, but most of the time, that is appropriate.

Thanks, Peter.

Pim
Posted by: pim

Re: Accented characters (UTF8 to UTF16 shenanigans) - 14/08/2007 16:56

Hmm, my format is /mp3/Artist/Album/Tracknumber. Title.extension
If I ask for the dot to be replaced, jack does that too for the dot after
the tracknumber. This means every tune will be renamed.

I think I'd rather hack jack, and have only trailing dots replaced,
for directories only. And then I might as well just hardcode a
space after the dot.

Pim
Posted by: pim

Re: Accented characters (UTF8 to UTF16 shenanigans) - 15/08/2007 14:22

Darn, Windows will not allow filenames ending with dot-space either!

Fortunately there's the no-break space. Hacking jack was surprisingly easy:

Code:

if (x.endswith(".")):
    # append U+00A0 (no-break space) so the name no longer ends in a dot
    x = x + u"\u00A0".encode("utf-8")



I'm beginning to like python...
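
For anyone wanting the same effect outside jack, here is a standalone Python 3 sketch of the "directories only, trailing dot only" rule applied to a whole path. It is not the jack patch itself; `protect_trailing_dots` is an invented name and it assumes '/'-separated paths in the /mp3/Artist/Album/Track layout:

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def protect_trailing_dots(path):
    """Append a no-break space to directory components that end in a dot."""
    parts = path.split("/")
    dirs, filename = parts[:-1], parts[-1]
    fixed = [
        # Leave the special "." and ".." components alone.
        d + NBSP if d.endswith(".") and d not in (".", "..") else d
        for d in dirs
    ]
    # The file name itself is never changed, so "01. Title.mp3" stays as is.
    return "/".join(fixed + [filename])


print(repr(protect_trailing_dots("/mp3/R.E.M./Out of Time/01. Radio Song.mp3")))
```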

Pim