Unofficial empeg BBS

#177137 - 27/08/2003 13:18 id3 tags and character encoding
pim
addict

Registered: 14/11/2000
Posts: 474
Loc: The Hague, the Netherlands
Does anyone know of a tool, or have example code, that shows what
character encoding is used inside id3 tags?

I've seen very little documentation about the proper use of character
sets in id3 tags. From my understanding, id3v1 tags should always
be encoded using iso8859-1. id3v2 tags can be encoded as iso8859-1,
utf-8 or utf-16. The character encoding should be stored somewhere,
so a decoder knows how to interpret the data. Please correct me if
these assumptions are wrong.

Now that Samba 3 is finally out (well, it's a release candidate), I can
finally use UTF-8 encoded filenames on Linux and display them
right in Windows. Changing the filenames from iso8859-1 encoded
names to UTF-8 involved converting all my cached CDDB data to
UTF-8 and rerunning all my tunes through my ripper. This not only
changed the names, but also retagged the tunes. The tunes now
have UTF-8 encoded tags, but now several clients are corrupting
the tags, as they do another iso8859-1 to UTF-8 translation.
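
The corruption described here (a second iso8859-1 to UTF-8 translation applied to bytes that are already UTF-8) can be reproduced in a few lines. This is only a sketch of the failure mode, not what any particular client does internally; the function names are made up for illustration.

```python
def double_encode(text: str) -> bytes:
    """Simulate a client that misreads UTF-8 bytes as ISO-8859-1 and converts again."""
    utf8_bytes = text.encode("utf-8")            # what the ripper wrote
    misread = utf8_bytes.decode("iso-8859-1")    # the client assumes ISO-8859-1
    return misread.encode("utf-8")               # ...and "translates" it to UTF-8


def undo_double_encode(data: bytes) -> str:
    """Reverse one round of the mistaken translation (raises if it does not apply)."""
    return data.decode("utf-8").encode("iso-8859-1").decode("utf-8")


# "é" is C3 A9 in UTF-8; after the second translation it becomes C3 83 C2 A9,
# which displays as "Ã©".
broken = double_encode("Café")
```

If undo_double_encode() succeeds on a tag and yields sensible text, that is a hint (not proof) that the tag went through this kind of double translation.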

This makes me suspicious about my ripper. It may be using UTF-8
encoded tags but storing them as if they are iso8859-1 encoded.

Unfortunately, I have not been able to find any tool showing these
properties, and libid3tag, which I use for mp3tofid, just hides
all this mess and offers the tags UCS-4 encoded.

Pim

#177138 - 27/08/2003 13:25 Re: id3 tags and character encoding [Re: pim]
tfabris
carpal tunnel

Registered: 20/12/1999
Posts: 31578
Loc: Seattle, WA
You can examine the files yourself in a binary editor, and look at this spec to see what's what. That covers ID3v1, but has a link to the ID3v2 spec.
_________________________
Tony Fabris

#177139 - 27/08/2003 13:36 Re: id3 tags and character encoding [Re: pim]
Roger
carpal tunnel

Registered: 18/01/2000
Posts: 5682
Loc: London, UK
http://www.id3.org/develop.html

One thing to note: several broken tagging programs (in particular Japanese ones) store ID3v2 tags in the local codepage, but claim to be storing them in 8859-1.

As for ID3v1, I don't believe that the codepage is actually specified anywhere. Mostly, we just assume local codepage, IIRC.
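
Since an ID3v1 tag carries no encoding marker, a reader has to pick a codepage itself. A minimal sketch, assuming the usual 128-byte ID3v1 layout at the end of the file ("TAG", then 30-byte title, artist and album, 4-byte year, 30-byte comment, 1-byte genre); parse_id3v1 and its codepage parameter are illustrative names, not part of any library:

```python
def parse_id3v1(block: bytes, codepage: str = "iso-8859-1"):
    """Decode a 128-byte ID3v1 block using a caller-chosen codepage."""
    if len(block) != 128 or block[:3] != b"TAG":
        return None

    def field(start: int, length: int) -> str:
        # Fields are fixed width, padded with NULs or spaces.
        raw = block[start:start + length].split(b"\x00", 1)[0]
        return raw.decode(codepage, errors="replace").rstrip()

    return {
        "title": field(3, 30),
        "artist": field(33, 30),
        "album": field(63, 30),
        "year": field(93, 4),
        "comment": field(97, 30),
        "genre": block[127],
    }


def read_id3v1(path: str, codepage: str = "iso-8859-1"):
    """Read the last 128 bytes of a file and parse them as ID3v1."""
    with open(path, "rb") as f:
        f.seek(0, 2)
        if f.tell() < 128:
            return None
        f.seek(-128, 2)
        return parse_id3v1(f.read(128), codepage)
```

The same bytes give different text depending on the codepage passed in, which is exactly the ambiguity: byte $E9 is "é" in iso8859-1 but a Cyrillic letter in cp1251.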
_________________________
-- roger

#177140 - 27/08/2003 13:42 Re: id3 tags and character encoding [Re: tfabris]
pim
addict

Registered: 14/11/2000
Posts: 474
Loc: The Hague, the Netherlands
Right. This must be it:


If nothing else is said a string is represented as ISO-8859-1 characters in the range $20 - $FF. Such strings are represented as <text string>, or <full text string> if newlines are allowed, in the frame descriptions. All Unicode strings use 16-bit unicode 2.0 (ISO/IEC 10646-1:1993, UCS-2). Unicode strings must begin with the Unicode BOM ($FF FE or $FE FF) to identify the byte order.

All numeric strings and URLs are always encoded as ISO-8859-1. Terminated strings are terminated with $00 if encoded with ISO-8859-1 and $00 00 if encoded as unicode. If nothing else is said newline character is forbidden. In ISO-8859-1 a new line is represented, when allowed, with $0A only. Frames that allow different types of text encoding have a text encoding description byte directly after the frame size. If ISO-8859-1 is used this byte should be $00, if Unicode is used it should be $01. Strings dependent on encoding is represented as <text string according to encoding>, or <full text string according to encoding> if newlines are allowed. Any empty Unicode strings which are NULL-terminated may have the Unicode BOM followed by a Unicode NULL ($FF FE 00 00 or $FE FF 00 00).


This probably means it's either iso8859-1 or utf-16. As binary editing my tunes does not show
utf-16, they must have iso8859-1 encoded tags containing utf-8 text.
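
Instead of reading a hex dump by eye, the text encoding byte of each text frame can be listed with a short script. This is a rough sketch, assuming an ID3v2.3 or 2.4 tag at the start of the file with no extended header and no unsynchronisation; text_frame_encodings is a hypothetical helper, not a libid3tag function.

```python
def syncsafe(b: bytes) -> int:
    """Decode a 4-byte syncsafe integer (7 significant bits per byte)."""
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]


def text_frame_encodings(data: bytes):
    """Return (frame id, encoding byte, raw payload) for each text frame."""
    if len(data) < 10 or data[:3] != b"ID3":
        return []
    major = data[3]
    if major not in (3, 4):
        return []  # ID3v2.2 uses 3-byte frame ids; not handled here
    if data[5] & 0xC0:
        raise ValueError("unsynchronised or extended-header tags are not handled")
    end = min(len(data), 10 + syncsafe(data[6:10]))
    pos, found = 10, []
    while pos + 10 <= end:
        frame_id = data[pos:pos + 4]
        if frame_id[0] == 0:
            break  # reached the padding
        raw_size = data[pos + 4:pos + 8]
        # 2.4 stores frame sizes syncsafe, 2.3 as a plain big-endian integer.
        size = syncsafe(raw_size) if major == 4 else int.from_bytes(raw_size, "big")
        body = data[pos + 10:pos + 10 + size]
        if frame_id.startswith(b"T") and body:
            found.append((frame_id.decode("ascii", "replace"), body[0], body[1:]))
        pos += 10 + size
    return found
```

A frame that reports encoding byte $00 while its payload contains sequences such as C3 A9 would be consistent with UTF-8 text stored under an iso8859-1 label.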

Pim

#177141 - 27/08/2003 15:23 Re: id3 tags and character encoding [Re: tfabris]
RobotCaleb
pooh-bah

Registered: 15/01/2002
Posts: 1866
Loc: Austin
Tony, that page is almost exactly what I've been searching for for about a week now. Thanks.

#177142 - 28/08/2003 01:30 Re: id3 tags and character encoding [Re: pim]
peter
carpal tunnel

Registered: 13/07/2000
Posts: 4174
Loc: Cambridge, England
"text encoding description byte [...] If ISO-8859-1 is used this byte should be $00, if Unicode is used it should be $01."
A later (2.4?) version of the ID3v2 spec also allows $02 for "UTF-16 big endian without BOM" (why?) and $03 for UTF-8. But for best compatibility people should stick with $00 and $01. It's possible that your ripper is writing UTF-8 as per 2.4 spec and your other clients are ignoring it as per 2.3 spec -- but it sounds more likely that your ripper is just broken.
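
The four encoding bytes map naturally onto codec names. The sketch below shows one possible reader that accepts all four values and a writer that sticks to $00 and $01 as suggested above; CODECS, decode_text_frame and encode_text_frame are illustrative names, and multi-string frames are not handled.

```python
# Encoding byte -> Python codec. $00 and $01 exist in ID3v2.3; $02 and $03 are 2.4 only.
CODECS = {0: "iso-8859-1", 1: "utf-16", 2: "utf-16-be", 3: "utf-8"}


def decode_text_frame(body: bytes) -> str:
    """Decode a text frame body: one encoding byte followed by the text."""
    if not body:
        return ""
    codec = CODECS.get(body[0])
    if codec is None:
        raise ValueError(f"unknown text encoding byte ${body[0]:02X}")
    # The "utf-16" codec uses the BOM to pick the byte order and drops it.
    return body[1:].decode(codec).rstrip("\x00")


def encode_text_frame(text: str) -> bytes:
    """Write $00 (ISO-8859-1) when the text fits, otherwise $01 with a BOM."""
    try:
        return b"\x00" + text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return b"\x01\xff\xfe" + text.encode("utf-16-le")
```

A reader limited to the 2.3 spec would only know bytes $00 and $01, which is why a writer that avoids $02 and $03 is the more compatible choice.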

Peter

#177143 - 28/08/2003 10:38 Re: id3 tags and character encoding [Re: RobotCaleb]
tfabris
carpal tunnel

Registered: 20/12/1999
Posts: 31578
Loc: Seattle, WA
"Tony, that page is almost exactly what I've been searching for for about a week now. Thanks."
Yeah, I like that one. I keep referring back to it each time I do coding work that involves MP3 files.
_________________________
Tony Fabris

#177144 - 31/08/2003 14:49 Re: id3 tags and character encoding [Re: peter]
pim
addict

Registered: 14/11/2000
Posts: 474
Loc: The Hague, the Netherlands
"but it sounds more likely that your ripper is just broken"


Most likely. But this makes me wonder. How should rippers or other CDDB/freedb clients know what character encoding to use? It appears CDDB/freedb data can be either iso8859-1 or UTF-8 encoded, and there's nothing in the format to tell this.

Pim

#177145 - 01/09/2003 00:56 Re: id3 tags and character encoding [Re: pim]
peter
carpal tunnel

Registered: 13/07/2000
Posts: 4174
Loc: Cambridge, England
"How should rippers or other CDDB/freedb clients know what character encoding to use? It appears CDDB/freedb data can be either iso8859-1 or UTF-8 encoded, and there's nothing in the format to tell this."
All you can really do is check whether it's valid UTF-8 (fortunately, "accidentally valid" UTF-8 is reasonably rare) and, if not, assume it's local code page. If you're being really swish you can offer little menus for changing the interpretation -- otherwise you'll never see 8859-1 characters in JIS locales, or vice versa.
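
That heuristic is short to write down. A minimal sketch, assuming the raw CDDB/freedb bytes are already in hand; decode_cddb and its fallback parameter are hypothetical, with fallback standing in for the "little menu" override:

```python
import locale


def decode_cddb(raw: bytes, fallback=None):
    """Return (text, codec used): strict UTF-8 if valid, else a fallback code page."""
    try:
        # Python's UTF-8 decoder is strict, so text in a legacy 8-bit encoding
        # with any accented characters will usually fail here.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # No override given: assume the local code page.
        codec = fallback or locale.getpreferredencoding(False)
        return raw.decode(codec, errors="replace"), codec
```

Pure ASCII data passes the UTF-8 check trivially, which is harmless since it decodes the same way in the common code pages.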

Peter
