Excellent, excellent point, Dylan.
In fact, I just got done discussing this very subject with Tanstaafl in private. Although the objective measurements are vital to knowing the details of how the encoders perform, what good do they do if there's no audible difference from one encoder to the next?
My problem with the R3mix tests and this other recent test is this: The tests were made by computer hackers, not audiophiles. What we need, in addition to those detailed tests, are double-blind listening tests made by audiophiles with golden ears, using samples of real music.
I actually saw such a test, once. Unfortunately, it was made over a year ago, using older versions of a few different encoders. Since the encoders have changed so much with each version, the results of that test are useless today.
The test was done by a music magazine, and the testers were their audio reviewers: people whose jobs depended on being able to hear subtle differences in sound reproduction. The test had a very interesting protocol, one which I think may be the perfect protocol for tests of this type.
The author of the test first chose small musical passages from a variety of styles. There was classical, a single dry solo vocal (very good for hearing certain kinds of artifacts), rock/pop (including, to my pleasant surprise, the filter-sweep passage from "Ray of Light"), and some others.
Then he ripped those sections into .WAV files and encoded them to MP3 with the encoders he was testing. Then he decoded the resulting MP3 files back into .WAVs and burned those to a CD as audio tracks in a very interesting way:
Each "test" was a group of tracks. It would start with the original source file as the first track. Following that track would be the encoded versions scrambled randomly with a second copy of the original source file again. So unless you really had golden ears, you wouldn't know whether you were listening to a copy of the original track or an MP3-encoded one.
The listeners' job was to rate each of the "following" tracks in comparison to the first track in the group (the known original). The "control" was the repeat of the original track hidden among the encoded tracks: in theory, if the listeners were doing their job right, the original track would rate as perfect and the encoded versions would rate as somewhat less than perfect. This was the "blind" part of the experiment.
What made the experiment double-blind was that the author of the test made a few different sets of the CDs with the tracks randomized in different orders. Each one was given a serial number, and the actual order of tracks was recorded according to the serial number. But which listener got which CD was randomized, and was not known to the test author until the end, when the results were tallied.
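Mechanically, the blinding amounts to nothing more than a shuffled track list per disc plus an answer key that only the test author keeps. Here's a rough sketch of that step in Python; the encoder labels and the number of discs are made up for illustration, and the magazine presumably did this by hand rather than by script:

    import random

    ORIGINAL = "original"
    ENCODED = ["encoder_A", "encoder_B", "encoder_C"]   # made-up labels

    def make_disc(serial):
        # Track 1 is always the known original; the rest are the encoded
        # versions plus a hidden second copy of the original, shuffled.
        rng = random.Random(serial)        # order is reproducible from the serial
        following = ENCODED + [ORIGINAL]   # the duplicate original is the control
        rng.shuffle(following)
        return [ORIGINAL] + following

    # The answer key (serial number -> track order) stays with the test author;
    # which listener gets which serial is decided separately, so nobody knows
    # who heard which order until the results are tallied.
    answer_key = {serial: make_disc(serial) for serial in range(1, 5)}

The listener's rating sheet only ever shows track numbers; only the answer key ties a track number back to a particular encoder.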
In essence, it was a perfect double-blind listening test. It tested only the encoders' sound quality, not the quality of the playback equipment, since each listener got to listen on their own favorite home equipment using their favorite CD player. An exact, immediate, side-by-side comparison of the encoded versions against the original was possible, but without the "confirmation bias" of knowing whether they were really listening to an encoded version or not.
Even with the strict protocol, it was amazing how well these audiophiles were able to pick out the differences among the encoders. They consistently rated the various encoders lower than the "control" track and noted their reasons for doing so. Out of the whole test, there was only one aberrant "spike": one tester mistaking the control track for an encoded track. I was impressed.
Now I want to see this done for recent versions of LAME vs. Xing at high bit rates.
___________
Tony Fabris