Unofficial empeg BBS

#286280 - 02/09/2006 07:08 Accented characters (UTF8 to UTF16 shenanigans)
Roger
carpal tunnel

Registered: 18/01/2000
Posts: 5680
Loc: London, UK
Here's how I rip my music.

First, I rip the CD on my Windows XP Pro box using EAC. If the CD has accented characters in the name (for example Café del Mar), then EAC writes the filenames to disk in UTF8. This means that the name is scrambled (é becomes Ã©).

Now, I can fix this on the PC end, converting the UTF8-on-UTF16 nonsense to the correct UTF16 character.
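
A minimal sketch of that kind of fix-up, assuming the scrambled names are UTF8 bytes that were read as Windows-1252 characters (the code page is an assumption, not something stated in the post). The function name fix_mojibake is made up for illustration.

```python
def fix_mojibake(name, ansi_codec="cp1252"):
    """Undo UTF-8 bytes that were misread as single-byte ANSI characters."""
    try:
        # Map each character back to the byte it came from, then decode
        # the resulting byte string as the UTF-8 it really was.
        return name.encode(ansi_codec).decode("utf-8")
    except UnicodeError:
        # Not representable in the ANSI codec, or not valid UTF-8:
        # assume the name was never scrambled and leave it alone.
        return name


print(fix_mojibake("Caf\u00c3\u00a9 del Mar"))
```

Names that are already correct fail the round trip and are returned unchanged, so the function should be safe to run over a mixed directory.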

Then I copy the files onto my Ubuntu 6.06 fileserver, where Samba correctly converts my UTF16 (on NTFS) filenames back into UTF8 (on ext3) filenames. As long as my terminal (either xterm or PuTTY) knows to interpret them as UTF8, then they're displayed correctly, too.

So far, so good.

Where I run into problems is when I attempt to copy the MP3s to my work PC (Windows 2003 SP1), using Cygwin's rsync (v2.6.6).

It picks up the UTF8 names and turns them back into the individual octets before writing them back to the UTF16 filesystem, which means that my 'é' turns back into 'Ã©'.

For example:

Code:

C:\Temp>rsync -e ssh -auv peculiar.home.differentpla.net:/home/music/flac/Compilations/*.flac .
receiving file list ... done
01 - Jos\303\251 Padilla - Agua.flac



How do I persuade rsync that my filenames are Unicode, and to do the necessary UTF8 to UTF16 conversion?

I've tried googling for an answer, but my google-fu is weak this morning. Any ideas?
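
For reference, the \303\251 in the rsync output is the octal form of the two UTF8 bytes for é (0xC3 0xA9). A small sketch that decodes such escapes and shows how the same bytes turn into two characters when read one octet at a time; the helper name is hypothetical and Windows-1252 is assumed as the ANSI code page.

```python
import re


def decode_octal_escapes(text):
    """Turn backslash-octal escapes in an ASCII string back into UTF-8 text."""
    raw = re.sub(
        rb"\\([0-3][0-7]{2})",
        # Each escape is one byte written as three octal digits.
        lambda m: bytes([int(m.group(1).decode("ascii"), 8)]),
        text.encode("ascii"),
    )
    return raw.decode("utf-8")


name = decode_octal_escapes(r"01 - Jos\303\251 Padilla - Agua.flac")
print(name)
# The same bytes, read one octet per character, give the scrambled form.
print(name.encode("utf-8").decode("cp1252"))
```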
_________________________
-- roger

Top
#286281 - 02/09/2006 11:39 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: Roger]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14478
Loc: Canada
Quote:

Code:

C:\Temp>rsync -e ssh -auv peculiar.home.differentpla.net:/home/music/flac/Compilations/*.flac .
receiving file list ... done
01 - Jos\303\251 Padilla - Agua.flac



How do I persuade rsync that my filenames are Unicode, and to do the necessary UTF8 to UTF16 conversion?

I've tried googling for an answer, but my google-fu is weak this morning. Any ideas?


Just hack the rsync source code -- should be a quick fix.

Cheers

Top
#286282 - 02/09/2006 13:00 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: mlord]
Roger
carpal tunnel

Registered: 18/01/2000
Posts: 5680
Loc: London, UK
Quote:
Just hack the rsync source code -- should be a quick fix.


Nope. I took a look. It looks like it'd be quite an involved fix. The basic problem is that Cygwin simply doesn't support Unicode properly.

As far as I can tell, I'd (probably) have to add a custom _O_NAMEISUTF8 flag to the implementation of _open, which would then convert that to wide characters (UTF16) before calling the underlying Win32 CreateFile function.

Meaning that I'd need a custom Cygwin DLL, and a hack to rsync.

Dammit.

Maybe I'll just write a script that renames the files between the two character sets before I do the sync. I.e. it'd break the names on the Windows side, run rsync, and then fix them again.
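
A rough sketch of what such a rename script could look like, assuming Windows-1252 is the single-byte code page involved. The names break_name, fix_name and rename_tree are invented for illustration; this is not a tested tool.

```python
import os


def break_name(name, codec="cp1252"):
    # Correct name -> one character per UTF-8 byte (the scrambled form).
    return name.encode("utf-8").decode(codec)


def fix_name(name, codec="cp1252"):
    # Scrambled form -> correct name.
    return name.encode(codec).decode("utf-8")


def rename_tree(root, transform):
    """Apply transform to every file and directory name below root."""
    renamed = []
    # Walk bottom-up so that a directory is renamed after its contents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            try:
                new_name = transform(name)
            except UnicodeError:
                continue  # name does not survive the round trip; skip it
            if new_name != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, new_name))
                renamed.append((name, new_name))
    return renamed
```

The idea would be to call rename_tree(root, break_name) before the sync and rename_tree(root, fix_name) afterwards.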
_________________________
-- roger

Top
#286283 - 02/09/2006 13:16 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: Roger]
Roger
carpal tunnel

Registered: 18/01/2000
Posts: 5680
Loc: London, UK
Quote:
As far as I can tell, I'd (probably) have to add a custom _O_NAMEISUTF8 flag to the implementation of _open, which would then convert that to wide characters (UTF16) before calling the underlying Win32 CreateFile function.


It looks like I'd have to change the Cygwin source (specifically fhandler_base::open_9x in winsup/cygwin/fhandler.cc) to call CreateFileW, instead of CreateFile, and make it convert the name from UTF8 to UTF16 at that point. I'd probably make it depend on an environment variable, rather than hack on rsync, though.
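
The conversion step itself is small. A sketch in Python of what the name would have to look like before being handed to a wide-character API (in C on Windows this is the job MultiByteToWideChar does with CP_UTF8); the function name is hypothetical and this is not Cygwin code.

```python
def utf8_name_to_utf16(name_bytes):
    """Convert a UTF-8 encoded file name to NUL-terminated UTF-16LE bytes."""
    text = name_bytes.decode("utf-8")
    # Wide Win32 functions expect little-endian 16-bit units ending in NUL.
    return (text + "\0").encode("utf-16-le")


print(utf8_name_to_utf16(b"Jos\xc3\xa9").hex())
```

The two UTF8 bytes C3 A9 collapse into the single 16-bit unit 00E9, which is what the filesystem should end up storing.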

If I can get the Cygwin compiler up and running, I might give it a go.
_________________________
-- roger

Top
#286284 - 02/09/2006 13:30 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: Roger]
peter
carpal tunnel

Registered: 13/07/2000
Posts: 4172
Loc: Cambridge, England
Quote:
I'd probably make it depend on an environment variable

LC_ALL?

Peter

Top
#286285 - 02/09/2006 14:06 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: peter]
Roger
carpal tunnel

Registered: 18/01/2000
Posts: 5680
Loc: London, UK
Quote:
LC_ALL?


Sounds like a plan. Unfortunately, to do a proper job, it looks like I'd have to touch quite a lot of the file handling code. I'll see if I can find time to look at it later this week.
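
A sketch of how the environment check might look, assuming the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG). The function name wants_utf8 is made up for illustration.

```python
import os


def wants_utf8(env=os.environ):
    """Return True if the locale variables in env select a UTF-8 charset."""
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            # Locale names look like "en_GB.UTF-8" or "en_GB.utf8@euro".
            charset = value.partition(".")[2].partition("@")[0]
            return charset.lower().replace("-", "") == "utf8"
    return False


print(wants_utf8({"LC_ALL": "en_GB.UTF-8"}))
```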
_________________________
-- roger

Top
#286286 - 04/09/2006 09:42 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: Roger]
mac
addict

Registered: 20/05/1999
Posts: 411
Loc: Cambridge, UK
What Windows really needs is a way to set the "Language for non-Unicode programs" on the third tab of "Regional and Language Options" to "Any (UTF-8)". ANSI (cough) programs in theory already have to deal with MBCS so UTF-8 isn't any harder for them. I wonder if it's possible to hack on the registry to make this happen? It might be worth asking Michael Kaplan.

Mike.

Top
#286287 - 09/08/2007 14:38 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: Roger]
pim
addict

Registered: 14/11/2000
Posts: 474
Loc: The Hague, the Netherlands
I finally found a fix for this:

http://www.okisoft.co.jp/esc/utf8-cygwin/

Works excellently with rsync.

Pim

Top
#286288 - 09/08/2007 14:49 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: mac]
Roger
carpal tunnel

Registered: 18/01/2000
Posts: 5680
Loc: London, UK
_________________________
-- roger

Top
#286289 - 09/08/2007 14:49 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: pim]
Roger
carpal tunnel

Registered: 18/01/2000
Posts: 5680
Loc: London, UK
Quote:
Works excellently with rsync.


Excellent. I'll take a look at that later.
_________________________
-- roger

Top
#286290 - 09/08/2007 15:34 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: Roger]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
Wow. Basically what he seems to be saying is "the code in our old text functions is shit, so instead of fixing it, we decided to create new incompatible functions". Sounds like Microsoft.
_________________________
Bitt Faulk

Top
#286291 - 09/08/2007 20:40 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: wfaulk]
webroach
old hand

Registered: 23/07/2003
Posts: 869
Loc: Colorado
Yup. When it comes right down to it, Microsoft's handling (or rather, lack thereof) of Unicode is what finally got me to switch to the Mac. Clear sailing with Unicode ever since. Except for some problems with ext3 on Feisty getting filenames with accented characters to work right. And that has nothing to do with the Mac.

Don't get me wrong; I have no idea if my problem is related to ext3... just happens to be the file system in question.
_________________________
Dave

Top
#286292 - 10/08/2007 10:29 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: Roger]
peter
carpal tunnel

Registered: 13/07/2000
Posts: 4172
Loc: Cambridge, England
Quote:
Somebody did. He deals with that

Shortsightedly, IMO. Even if Microsoft aren't going to learn from Apple's greatest technical achievement and have each future iteration of Windows N contain an emulator for Windows N-1 (and have UTF-8 be the only "narrow" codepage in the very next version), they could at least do (inside each A function):

Code:
if (codepage == CP_UTF8)
    new_sane_code();
else
    legacy_code_they_darent_touch();



Peter

Top
#286293 - 13/08/2007 21:07 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: Roger]
pim
addict

Registered: 14/11/2000
Posts: 474
Loc: The Hague, the Netherlands
Now that rsync handles Unicode names on both Windows and Linux just fine, the only stuff that can't be synced is under E.S.T., R.E.M. and T.A.T.U.

I wouldn't mind dropping the trailing dot on the Windows side, but I would on the Linux side.

Does anyone have an idea?

Pim

Top
#286294 - 14/08/2007 15:26 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: pim]
webroach
old hand

Registered: 23/07/2003
Posts: 869
Loc: Colorado
Quote:
Now that rsync handles Unicode names on both Windows and Linux just fine, the only stuff that can't be synced is under E.S.T., R.E.M. and T.A.T.U.

I wouldn't mind dropping the trailing dot on the Windows side, but I would on the Linux side.

Does anyone have an idea?

Pim


I wish I had a suggestion. That was a constant thorn in my side when I was still on Windows. I was pleased to see that iTunes (if you choose to allow it to manage your music library) happily replaces the trailing period with an underscore for folder names.

One solution to the t.A.T.u. dilemma is to put it in the original Russian format: Taty
_________________________
Dave

Top
#286295 - 14/08/2007 15:44 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: webroach]
pim
addict

Registered: 14/11/2000
Posts: 474
Loc: The Hague, the Netherlands
The software that I use to manage my music collection can replace any character in the tags with any replacement character. So I already have Unicode replacements for
Code:

| / \ " * :


that look quite similar.

But I can't find something that looks like a full stop but isn't.
U+2804 (Braille pattern dots-3) is the best I could find, but that's not available on Windows.
All the others are too high or too low or too bold or both.

And there doesn't seem to be a way to have only the trailing dot replaced, yet keeping other dots.

Pim

Top
#286296 - 14/08/2007 15:59 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: pim]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
How about Dot Leader, One U+2024?
_________________________
Bitt Faulk

Top
#286297 - 14/08/2007 16:01 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: pim]
peter
carpal tunnel

Registered: 13/07/2000
Posts: 4172
Loc: Cambridge, England
Quote:
But I can't find something that looks like a full stop but isn't.

U+FF0E? For all my non-Windows-safe characters in tags, I use the "fullwidth form" in filenames (i.e. add 0xFEE0), which isn't quite right, but isn't at all bad.
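
A sketch of the "add 0xFEE0" trick, which works because the fullwidth forms U+FF01 to U+FF5E mirror printable ASCII "!" to "~". The set of unsafe characters and the function names are assumptions for illustration; replacing a trailing dot is added here because that is the case under discussion.

```python
WINDOWS_UNSAFE = '<>:"/\\|?*'   # characters Windows rejects in file names
FULLWIDTH_OFFSET = 0xFEE0


def to_fullwidth(ch):
    # Only printable ASCII "!" .. "~" has a fullwidth twin.
    if "!" <= ch <= "~":
        return chr(ord(ch) + FULLWIDTH_OFFSET)
    return ch


def windows_safe(component):
    """Make one path component acceptable to Windows."""
    out = "".join(to_fullwidth(c) if c in WINDOWS_UNSAFE else c
                  for c in component)
    # Windows drops a trailing full stop, so swap it for U+FF0E.
    if out.endswith("."):
        out = out[:-1] + to_fullwidth(".")
    return out


print(windows_safe("R.E.M."))
```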

Peter

Top
#286298 - 14/08/2007 16:24 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: peter]
pim
addict

Registered: 14/11/2000
Posts: 474
Loc: The Hague, the Netherlands
Quote:
U+FF0E?


Yes, that's very good. It makes it look like there's a space after the dot, but most of the time that is appropriate.

Thanks, Peter.

Pim

Top
#286299 - 14/08/2007 16:56 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: pim]
pim
addict

Registered: 14/11/2000
Posts: 474
Loc: The Hague, the Netherlands
Hmm, my format is /mp3/Artist/Album/Tracknumber. Title.extension
If I ask for the dot to be replaced, jack does that too for the dot after
the tracknumber. This means every tune will be renamed.

I think I'd rather hack jack, and have only trailing dots replaced,
for directories only. And then I might as well just hardcode a
space after the dot.

Pim

Top
#286300 - 15/08/2007 14:22 Re: Accented characters (UTF8 to UTF16 shenanigans) [Re: pim]
pim
addict

Registered: 14/11/2000
Posts: 474
Loc: The Hague, the Netherlands
Darn, Windows will not allow filenames ending with dot-space either!

Fortunately there's the no-break space. Hacking jack was surprisingly easy:

Code:

# Append a UTF-8 encoded no-break space (U+00A0) after a trailing dot.
if x.endswith("."):
    x = x + u"\u00A0".encode("utf-8")



I'm beginning to like Python...
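
For comparison, a standalone Python 3 sketch of the same idea applied to directories only, using the /mp3/Artist/Album/Tracknumber. Title.extension layout mentioned earlier. This is not the actual jack patch; the function name and the sample path are made up.

```python
NBSP = "\u00a0"  # no-break space


def protect_trailing_dots(path):
    """Append a no-break space to directory names that end in a dot."""
    parts = path.split("/")
    # Every component except the last one is a directory.
    dirs = [p + NBSP if p.endswith(".") and p not in (".", "..") else p
            for p in parts[:-1]]
    return "/".join(dirs + parts[-1:])


print(repr(protect_trailing_dots("/mp3/R.E.M./Out of Time/01. Radio Song.mp3")))
```

The dot after the track number is left alone because the file name is the last component.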

Pim

Top