The easiest[1] way I can think of to do timing is with a voice recognition (VR) system. Just have someone sing along. The VR software can read the lyrics so it knows what to expect, rather than needing a comprehensive, pre-learned dictionary. Then it matches up the timing of the playback with the singing. It's probably still better to use the mic than to try to pick the singer out of the mp3, so the music doesn't confuse it.
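
Here's a rough sketch of the matching step, just to show the idea. It assumes the voice recognition engine hands back a list of (word, seconds) pairs for what it heard at the mic; that output format and the name align_lyrics are made up for illustration, not any real engine's API. It uses Python's standard difflib to line up the heard words against the known lyrics, so misheard words just end up with no timing.

```python
from difflib import SequenceMatcher

def align_lyrics(lyrics, heard):
    """Attach playback times to known lyric words.

    lyrics: list of words from the lyric sheet.
    heard:  list of (word, seconds) pairs, assumed to come from a
            voice recognition engine listening to the mic (hypothetical format).
    Returns a list of (lyric_word, seconds or None).
    """
    lyric_words = [w.lower() for w in lyrics]
    heard_words = [w.lower() for w, _ in heard]
    timings = [None] * len(lyrics)

    # Find runs of words that appear in the same order in both lists.
    matcher = SequenceMatcher(a=lyric_words, b=heard_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            # Copy the time of each heard word onto the matching lyric word.
            timings[block.a + k] = heard[block.b + k][1]

    return list(zip(lyrics, timings))
```

Words that come back as None could be filled in by interpolating between their timed neighbors, but that part is left out here.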

[1]: Sure, once you've got a very good VR implementation, everything else is easy, right?
--The Amigo