#177137 - 27/08/2003 13:18
id3 tags and character encoding
|
addict
Registered: 14/11/2000
Posts: 474
Loc: The Hague, the Netherlands
|
Does anyone know of a tool, or have example code, that shows what
character encoding is used inside id3 tags?
I've seen very little documentation about the proper use of character
sets in id3 tags. From my understanding, id3v1 tags should always
be encoded using iso8859-1. id3v2 tags can be encoded as iso8859-1,
utf-8 or utf-16. The character encoding should be stored somewhere,
so a decoder knows how to interpret the data. Please correct me if
these assumptions are wrong.
Now that Samba 3 is finally out (well, it's a release candidate), I can
finally use UTF-8 encoded filenames on Linux and display them
right in Windows. Changing the filenames from iso8859-1 encoded
names to UTF-8 involved converting all my cached CDDB data to
UTF-8 and rerunning all my tunes through my ripper. This not only
changed the names, but also retagged the tunes. The tunes now
have UTF-8 encoded tags, but now several clients are corrupting
the tags, as they do another iso8859-1 to UTF-8 translation.
This makes me suspicious about my ripper. It may be using UTF-8
encoded tags but storing them as if they are iso8859-1 encoded.
Unfortunately, I have not been able to find any tool showing these
properties, and libid3tag, which I use for mp3tofid, just hides
all this mess and offers the tags UCS-4 encoded.
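To make the suspected corruption concrete, here is a minimal Python sketch of what a second iso8859-1 to UTF-8 translation does to text that is already UTF-8. The sample title and the helper name `undo_double_encoding` are made up for illustration; this is not what any particular ripper or client actually does.

```python
# Hypothetical sample title; any non-ASCII text shows the same effect.
title = "Café"

# What a correct UTF-8 writer produces.
utf8_bytes = title.encode("utf-8")

# A client that believes these bytes are ISO-8859-1 sees two characters
# for every accented letter.
misread = utf8_bytes.decode("iso-8859-1")

# If that client then "converts" to UTF-8 again, the text is double-encoded.
double_encoded = misread.encode("utf-8")


def undo_double_encoding(data: bytes) -> str:
    """Reverse one spurious ISO-8859-1 -> UTF-8 pass.

    Raises UnicodeError if the data does not fit that pattern.
    """
    return data.decode("utf-8").encode("iso-8859-1").decode("utf-8")


print(repr(misread))
print(double_encoded)
print(undo_double_encoding(double_encoded))
```

If the round trip in `undo_double_encoding` succeeds on a tag value and yields readable text, that is a strong hint the value went through one translation too many.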
Pim
|
#177138 - 27/08/2003 13:25
Re: id3 tags and character encoding
[Re: pim]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31614
Loc: Seattle, WA
|
You can examine the files yourself in a binary editor, and look at this spec to see what's what. That covers ID3v1, but has a link to the ID3v2 spec.
|
#177139 - 27/08/2003 13:36
Re: id3 tags and character encoding
[Re: pim]
|
carpal tunnel
Registered: 18/01/2000
Posts: 5687
Loc: London, UK
|
http://www.id3.org/develop.html
One thing to note: several broken tagging programs (in particular Japanese ones) store ID3v2 tags in the local codepage, but claim to be storing them in 8859-1.
As for ID3v1, I don't believe that the codepage is actually specified anywhere. Mostly, we just assume local codepage, IIRC.
_________________________
-- roger
|
#177140 - 27/08/2003 13:42
Re: id3 tags and character encoding
[Re: tfabris]
|
addict
Registered: 14/11/2000
Posts: 474
Loc: The Hague, the Netherlands
|
Right. This must be it:
If nothing else is said a string is represented as ISO-8859-1 characters in the range $20 - $FF. Such strings are represented as <text string>, or <full text string> if newlines are allowed, in the frame descriptions. All Unicode strings use 16-bit unicode 2.0 (ISO/IEC 10646-1:1993, UCS-2). Unicode strings must begin with the Unicode BOM ($FF FE or $FE FF) to identify the byte order.
All numeric strings and URLs are always encoded as ISO-8859-1. Terminated strings are terminated with $00 if encoded with ISO-8859-1 and $00 00 if encoded as unicode. If nothing else is said newline character is forbidden. In ISO-8859-1 a new line is represented, when allowed, with $0A only. Frames that allow different types of text encoding have a text encoding description byte directly after the frame size. If ISO-8859-1 is used this byte should be $00, if Unicode is used it should be $01. Strings dependent on encoding is represented as <text string according to encoding>, or <full text string according to encoding> if newlines are allowed. Any empty Unicode strings which are NULL-terminated may have the Unicode BOM followed by a Unicode NULL ($FF FE 00 00 or $FE FF 00 00).
This probably means it's either iso8859-1 or utf-16. Since a binary editor does not show
any utf-16 in my tunes, they must have iso8859-1 encoded tags containing utf-8 text.
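For anyone who wants to check this without a hex editor, here is a rough Python sketch that walks the frames of an ID3v2.3 tag and reports the text encoding description byte of each text frame. It assumes a plain v2.3 tag with no header flags set (no unsynchronisation, no extended header) and refuses anything else. The function names and the hand-built demo tag are my own, not taken from any library.

```python
import struct

# Encoding bytes defined by the quoted ID3v2.3 text.
ENCODING_NAMES = {0x00: "ISO-8859-1", 0x01: "Unicode (16-bit, with BOM)"}


def synchsafe_to_int(b: bytes) -> int:
    """Decode the 4-byte tag size: 7 significant bits per byte."""
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]


def text_frame_encodings(data: bytes):
    """Yield (frame_id, encoding_byte, raw_text_bytes) for each text frame."""
    if data[:3] != b"ID3" or data[3] != 3:
        raise ValueError("not an ID3v2.3 tag")
    if data[5] != 0:
        raise ValueError("header flags set; not handled in this sketch")
    end = 10 + synchsafe_to_int(data[6:10])
    pos = 10
    while pos + 10 <= end:
        frame_id = data[pos:pos + 4]
        if frame_id[0] == 0:  # reached padding
            break
        # In v2.3 the frame size is a plain big-endian 32-bit integer.
        (size,) = struct.unpack(">I", data[pos + 4:pos + 8])
        body = data[pos + 10:pos + 10 + size]
        if frame_id.startswith(b"T") and body:
            yield frame_id.decode("ascii"), body[0], body[1:]
        pos += 10 + size


# Hand-built demo tag: one TIT2 frame labelled ISO-8859-1 ($00)
# but actually holding UTF-8 bytes.
body = b"\x00" + "Café".encode("utf-8")
frame = b"TIT2" + struct.pack(">I", len(body)) + b"\x00\x00" + body
tag = b"ID3\x03\x00\x00" + bytes([0, 0, 0, len(frame)]) + frame

for frame_id, enc, raw in text_frame_encodings(tag):
    print(frame_id, ENCODING_NAMES.get(enc, hex(enc)), raw)
```

For a real file, pass in the bytes read from the start of the MP3 instead of the demo `tag`. A frame reported as ISO-8859-1 whose raw bytes contain pairs like `\xc3\xa9` is the "utf-8 text in an iso8859-1 frame" case.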
Pim
|
#177141 - 27/08/2003 15:23
Re: id3 tags and character encoding
[Re: tfabris]
|
pooh-bah
Registered: 15/01/2002
Posts: 1866
Loc: Austin
|
tony, that page is almost exactly what ive been searching for for about a week now. thanks.
|
#177142 - 28/08/2003 01:30
Re: id3 tags and character encoding
[Re: pim]
|
carpal tunnel
Registered: 13/07/2000
Posts: 4181
Loc: Cambridge, England
|
text encoding description byte [...] If ISO-8859-1 is used this byte should be $00, if Unicode is used it should be $01.
A later (2.4?) version of the ID3v2 spec also allows $02 for "UTF-16 big endian without BOM" (why?) and $03 for UTF-8. But for best compatibility people should stick with $00 and $01. It's possible that your ripper is writing UTF-8 as per 2.4 spec and your other clients are ignoring it as per 2.3 spec -- but it sounds more likely that your ripper is just broken.
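As a sketch of how a reader could honour all four values, here is a small Python mapping from the encoding byte to a codec. The names `CODECS` and `decode_text_frame` are made up, and the function expects the frame body starting at the encoding byte; it is an illustration, not how any particular player does it.

```python
# Encoding byte -> Python codec name.
CODECS = {
    0x00: "iso-8859-1",
    0x01: "utf-16",     # the BOM selects the byte order
    0x02: "utf-16-be",  # ID3v2.4 only, no BOM
    0x03: "utf-8",      # ID3v2.4 only
}


def decode_text_frame(body: bytes) -> str:
    """Decode a text frame body whose first byte is the encoding byte."""
    if not body or body[0] not in CODECS:
        raise ValueError("missing or unknown text encoding byte")
    text = body[1:].decode(CODECS[body[0]])
    # Drop any string terminator(s) left at the end.
    return text.rstrip("\x00")


print(decode_text_frame(b"\x00Caf\xe9"))
print(decode_text_frame(b"\x01\xff\xfeC\x00a\x00f\x00\xe9\x00"))
print(decode_text_frame(b"\x03Caf\xc3\xa9"))
```

A strict 2.3 reader would reject $02 and $03 rather than decode them, which is one way the same file ends up looking different in different clients.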
Peter
|
#177143 - 28/08/2003 10:38
Re: id3 tags and character encoding
[Re: RobotCaleb]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31614
Loc: Seattle, WA
|
tony, that page is almost exactly what ive been searching for for about a week now. thanks.
Yeah, I like that one. I keep referring back to it each time I do coding work that involves MP3 files.
|
#177144 - 31/08/2003 14:49
Re: id3 tags and character encoding
[Re: peter]
|
addict
Registered: 14/11/2000
Posts: 474
Loc: The Hague, the Netherlands
|
but it sounds more likely that your ripper is just broken
Most likely. But this makes me wonder. How should rippers or other CDDB/freedb clients know what character encoding to use? It appears CDDB/freedb data can be either iso8859-1 or UTF-8 encoded, and there's nothing in the format to tell this.
Pim
|
Top
|
|
|
|
#177145 - 01/09/2003 00:56
Re: id3 tags and character encoding
[Re: pim]
|
carpal tunnel
Registered: 13/07/2000
Posts: 4181
Loc: Cambridge, England
|
How should rippers or other CDDB/freedb clients know what character encoding to use? It appears CDDB/freedb data can be either iso8859-1 or UTF-8 encoded, and there's nothing in the format to tell this.
All you can really do is check whether it's valid UTF-8 (fortunately, "accidentally valid" UTF-8 is reasonably rare) and, if not, assume it's local code page. If you're being really swish you can offer little menus for changing the interpretation -- otherwise you'll never see 8859-1 characters in JIS locales, or vice versa.
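A minimal Python sketch of that heuristic, assuming the raw bytes of one CDDB/freedb field are at hand. The function name `guess_decode` and its `fallback` parameter are made up; the parameter stands in for the "little menu" override.

```python
import locale
from typing import Optional, Tuple


def guess_decode(raw: bytes, fallback: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, codec_used): try UTF-8 first, else a fallback code page."""
    try:
        # Pure ASCII also passes here, which is harmless.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # No override given: assume the local code page.
        codec = fallback or locale.getpreferredencoding(False)
        return raw.decode(codec, errors="replace"), codec


print(guess_decode(b"Caf\xc3\xa9"))
print(guess_decode(b"Caf\xe9", fallback="iso-8859-1"))
```

The explicit `fallback` is what lets a user in a JIS locale still read 8859-1 data, or the other way round.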
Peter
|