id3 tags and character encoding | Programming | unofficial empeg BBS

#177137 - 27/08/2003 13:18 id3 tags and character encoding
pim addict Registered: 14/11/2000 Posts: 474 Loc: The Hague, the Netherlands

Does anyone know of a tool, or have example code, that shows which character encoding is used inside id3 tags? I've seen very little documentation about the proper use of character sets in id3 tags.

From my understanding, id3v1 tags should always be encoded as iso8859-1. id3v2 tags can be encoded as iso8859-1, utf-8 or utf-16. The character encoding should be stored somewhere, so a decoder knows how to interpret the data. Please correct me if these assumptions are wrong.

Now that Samba 3 is finally out (well, it's a release candidate), I can finally use UTF-8 encoded filenames on Linux and have them display correctly in Windows. Changing the filenames from iso8859-1 to UTF-8 involved converting all my cached CDDB data to UTF-8 and rerunning all my tunes through my ripper. This not only changed the names, but also retagged the tunes. The tunes now have UTF-8 encoded tags, but several clients are corrupting the tags, as they do another iso8859-1 to UTF-8 translation.

This makes me suspicious about my ripper. It may be writing UTF-8 encoded text but storing it as if it were iso8859-1 encoded. Unfortunately, I have not been able to find any tool showing these properties, and libid3tag, which I use for mp3tofid, just hides all this mess and offers the tags UCS-4 encoded.

Pim
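For illustration, here is a minimal Python sketch of the double translation described above: a client reads UTF-8 bytes as if they were iso8859-1 and converts them to UTF-8 a second time. The function names `double_encode` and `undo_double_encoding` are made up for this example, and the sample artist name is only a test string, not data from the thread.

```python
def double_encode(text):
    """Simulate a client that misreads UTF-8 bytes as iso8859-1
    and then 'converts' them to UTF-8 again."""
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("iso-8859-1")  # each byte becomes one character
    return misread.encode("utf-8")


def undo_double_encoding(data):
    """Reverse one round of double encoding.

    Raises UnicodeError if the bytes were not double-encoded, so it can
    also serve as a rough detector (pure ASCII passes through unchanged).
    """
    return data.decode("utf-8").encode("iso-8859-1").decode("utf-8")


original = "Björk"
corrupted = double_encode(original)
print(corrupted)                                    # b'Bj\xc3\x83\xc2\xb6rk'
print(undo_double_encoding(corrupted) == original)  # True
```

The two-byte UTF-8 sequence for "ö" turns into four bytes, which is the kind of garbage a second iso8859-1 to UTF-8 pass leaves behind in a tag.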

#177138 - 27/08/2003 13:25 Re: id3 tags and character encoding [Re: pim]
tfabris carpal tunnel Registered: 20/12/1999 Posts: 31578 Loc: Seattle, WA

You can examine the files yourself in a binary editor, and look at this spec to see what's what. That covers ID3v1, but has a link to the ID3v2 spec.

_________________________
Tony Fabris

#177139 - 27/08/2003 13:36 Re: id3 tags and character encoding [Re: pim]
Roger carpal tunnel Registered: 18/01/2000 Posts: 5682 Loc: London, UK

http://www.id3.org/develop.html

One thing to note: several broken tagging programs (in particular Japanese ones) store ID3v2 tags in the local codepage, but claim to be storing them in 8859-1.

As for ID3v1, I don't believe that the codepage is actually specified anywhere. Mostly, we just assume local codepage, IIRC.

_________________________
-- roger

#177140 - 27/08/2003 13:42 Re: id3 tags and character encoding [Re: tfabris]
pim addict Registered: 14/11/2000 Posts: 474 Loc: The Hague, the Netherlands

Right. This must be it:

> If nothing else is said a string is represented as ISO-8859-1 characters in the range $20 - $FF. Such strings are represented as <text string>, or <full text string> if newlines are allowed, in the frame descriptions. All Unicode strings use 16-bit unicode 2.0 (ISO/IEC 10646-1:1993, UCS-2). Unicode strings must begin with the Unicode BOM ($FF FE or $FE FF) to identify the byte order.
>
> All numeric strings and URLs are always encoded as ISO-8859-1. Terminated strings are terminated with $00 if encoded with ISO-8859-1 and $00 00 if encoded as unicode. If nothing else is said newline character is forbidden. In ISO-8859-1 a new line is represented, when allowed, with $0A only. Frames that allow different types of text encoding have a text encoding description byte directly after the frame size. If ISO-8859-1 is used this byte should be $00, if Unicode is used it should be $01. Strings dependent on encoding is represented as <text string according to encoding>, or <full text string according to encoding> if newlines are allowed. Any empty Unicode strings which are NULL-terminated may have the Unicode BOM followed by a Unicode NULL ($FF FE 00 00 or $FE FF 00 00).

This probably means it's either iso8859-1 or utf-16. Since a binary editor shows no utf-16 in my tunes, they must have tags marked as iso8859-1 that actually contain utf-8 text.

Pim
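As a rough stand-in for the binary editor, the Python sketch below walks the frames of an ID3v2.3 tag and reports the text encoding description byte of each text frame, together with the raw bytes that follow it. It is only a sketch under stated assumptions: the tag is version 2.3, has no extended header and no unsynchronisation, and the names `syncsafe_to_int`, `text_frame_encodings` and `make_frame` are invented for this example. The sample tag is built in memory for the demonstration; it is not taken from anyone's files.

```python
import struct


def syncsafe_to_int(b):
    """Decode the 4-byte tag size (7 significant bits per byte)."""
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]


def text_frame_encodings(data):
    """Return (frame id, encoding byte, raw payload) for each text frame.

    Assumes an ID3v2.3 tag without extended header or unsynchronisation.
    """
    if data[:3] != b"ID3" or data[3] != 3:
        raise ValueError("not an ID3v2.3 tag")
    tag_end = 10 + syncsafe_to_int(data[6:10])
    pos = 10
    found = []
    while pos + 10 <= tag_end:
        frame_id = data[pos:pos + 4]
        if frame_id[0] == 0:          # reached the padding after the last frame
            break
        (size,) = struct.unpack(">I", data[pos + 4:pos + 8])
        body = data[pos + 10:pos + 10 + size]   # skip 10-byte frame header
        if frame_id.startswith(b"T") and body:
            found.append((frame_id.decode("latin-1"), body[0], body[1:]))
        pos += 10 + size
    return found


def make_frame(frame_id, body):
    """Build a v2.3 frame with zeroed flags (demo helper only)."""
    return frame_id + struct.pack(">I", len(body)) + b"\x00\x00" + body


# Demo: a frame that claims iso8859-1 ($00) but carries UTF-8 bytes.
frames = make_frame(b"TPE1", b"\x00" + "Björk".encode("utf-8"))
n = len(frames)
size_bytes = bytes([(n >> 21) & 0x7F, (n >> 14) & 0x7F, (n >> 7) & 0x7F, n & 0x7F])
tag = b"ID3\x03\x00\x00" + size_bytes + frames
print(text_frame_encodings(tag))   # [('TPE1', 0, b'Bj\xc3\xb6rk')]
```

An encoding byte of 0 next to a payload containing the byte pair C3 B6 is the situation suspected above: the frame is marked iso8859-1, yet the bytes are UTF-8. For a real file, the bytes read from the start of the file could be passed in instead of the demo tag.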

#177141 - 27/08/2003 15:23 Re: id3 tags and character encoding [Re: tfabris]
RobotCaleb pooh-bah Registered: 15/01/2002 Posts: 1866 Loc: Austin

Tony, that page is almost exactly what I've been searching for for about a week now. Thanks.

#177142 - 28/08/2003 01:30 Re: id3 tags and character encoding [Re: pim]
peter carpal tunnel Registered: 13/07/2000 Posts: 4174 Loc: Cambridge, England

> text encoding description byte [...] If ISO-8859-1 is used this byte should be $00, if Unicode is used it should be $01.

A later (2.4?) version of the ID3v2 spec also allows $02 for "UTF-16 big endian without BOM" (why?) and $03 for UTF-8. But for best compatibility people should stick with $00 and $01.

It's possible that your ripper is writing UTF-8 as per the 2.4 spec and your other clients are ignoring it as per the 2.3 spec, but it sounds more likely that your ripper is just broken.

Peter
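The four encoding byte values mentioned in this post can be sketched as a small lookup table in Python. This is an illustration only: `ID3V2_TEXT_ENCODINGS` and `decode_text` are hypothetical names, and mapping $01 to Python's "utf-16" codec (which reads the BOM to pick the byte order) is an assumption that treats the spec's UCS-2 as a subset of UTF-16.

```python
# Encoding byte -> Python codec name. $02 and $03 are the later additions;
# a reader written against the older spec will not know them.
ID3V2_TEXT_ENCODINGS = {
    0x00: "iso-8859-1",
    0x01: "utf-16",      # byte order taken from the BOM
    0x02: "utf-16-be",   # big endian, no BOM
    0x03: "utf-8",
}


def decode_text(encoding_byte, payload):
    """Decode a text frame payload according to its encoding byte."""
    codec = ID3V2_TEXT_ENCODINGS.get(encoding_byte)
    if codec is None:
        raise ValueError("unknown text encoding byte: %#04x" % encoding_byte)
    # Terminators ($00 or $00 00) decode to NUL characters; drop them.
    return payload.decode(codec).rstrip("\x00")


print(decode_text(0x01, b"\xff\xfeB\x00j\x00"))   # Bj
print(decode_text(0x01, b"\xfe\xff\x00B\x00j"))   # Bj
```

A reader that honours only $00 and $01 would have to reject or guess at a frame marked $03, which is one way the mismatch between ripper and clients could arise.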

#177143 - 28/08/2003 10:38 Re: id3 tags and character encoding [Re: RobotCaleb]
tfabris carpal tunnel Registered: 20/12/1999 Posts: 31578 Loc: Seattle, WA

> tony, that page is almost exactly what ive been searching for for about a week now. thanks.

Yeah, I like that one. I keep referring back to it each time I do coding work that involves MP3 files.

_________________________
Tony Fabris

#177144 - 31/08/2003 14:49 Re: id3 tags and character encoding [Re: peter]
pim addict Registered: 14/11/2000 Posts: 474 Loc: The Hague, the Netherlands

> but it sounds more likely that your ripper is just broken

Most likely. But this makes me wonder. How should rippers or other CDDB/freedb clients know what character encoding to use? It appears CDDB/freedb data can be either iso8859-1 or UTF-8 encoded, and there's nothing in the format to tell which.

Pim

#177145 - 01/09/2003 00:56 Re: id3 tags and character encoding [Re: pim]
peter carpal tunnel Registered: 13/07/2000 Posts: 4174 Loc: Cambridge, England

> How should rippers or other CDDB/freedb clients know what character encoding to use? It appears CDDB/freedb data can be either iso8859-1 or UTF-8 encoded, and there's nothing in the format to tell this.

All you can really do is check whether it's valid UTF-8 (fortunately, "accidentally valid" UTF-8 is reasonably rare) and, if not, assume it's local code page. If you're being really swish you can offer little menus for changing the interpretation; otherwise you'll never see 8859-1 characters in JIS locales, or vice versa.

Peter
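The heuristic in this post could look roughly like the Python sketch below: try strict UTF-8 first and, if that fails, fall back to a code page. The function name `decode_freedb_text` and the `fallback` parameter are invented for illustration; the default of iso8859-1 is an assumption standing in for "local code page", and the parameter plays the role of the little menu for changing the interpretation.

```python
def decode_freedb_text(raw, fallback="iso-8859-1"):
    """Guess the encoding of a CDDB/freedb field.

    Returns (text, codec used). Valid UTF-8 wins; anything else is
    decoded with the fallback code page.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the caller's code page instead.
        return raw.decode(fallback, errors="replace"), fallback


print(decode_freedb_text(b"Bj\xc3\xb6rk")[1])   # utf-8
print(decode_freedb_text(b"Bj\xf6rk")[1])       # iso-8859-1
```

Pure ASCII data is valid UTF-8 as well, so it is reported as utf-8; that is harmless, since both interpretations give the same text.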