Have you had success with this yourself?

No, I haven't done this myself. Here's how I'd go about it:

First, the quick and dirty way. Look at display_queue_add() in arch/arm/special/empeg_display.c - all screen updates go through there. After the final memcpy in that function you can modify the destination buffer to overlay your own image over whatever the player is displaying.
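As a rough sketch of that overlay step (untested against the real driver): the helper below is what you might call right after the final memcpy, passing the destination buffer. The name overlay_apply() and the idea of a per-byte mask are my own assumptions, not something already in empeg_display.c.

```c
#include <stddef.h>

/*
 * Overlay `image` onto `dst`. Bits set in `mask` are taken from the
 * overlay; bits clear in `mask` keep whatever the player drew.
 * All three buffers are assumed to be at least `len` bytes long.
 * (Hypothetical helper, not part of the existing driver.)
 */
static void overlay_apply(unsigned char *dst,
                          const unsigned char *image,
                          const unsigned char *mask,
                          size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char keep = (unsigned char)(dst[i] & ~mask[i]);
        unsigned char draw = (unsigned char)(image[i] & mask[i]);

        dst[i] = (unsigned char)(keep | draw);
    }
}
```

A mask byte of 0x00 leaves the player's pixels alone and 0xFF replaces them completely, so a small logo only needs non-zero mask bytes where it actually draws.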

You'll need to watch out for palette changes - the correspondence between values in memory and colours on the screen can be changed by the EMPEG_DISPLAY_WRITE_PALETTE ioctl. It's probably easiest to keep two copies of your overlay image (one for the standard palette and one for Toby's palette), then switch between them when the above ioctl is issued.
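One way the two-copy idea could look, purely as an illustration: keep both pre-rendered overlays plus a pointer to the active one, and flip that pointer from the palette ioctl path. The struct, the function name, and the trick of comparing against a saved copy of the standard palette are all assumptions on my part; I haven't checked how the driver stores palettes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bookkeeping: one pre-rendered overlay per palette. */
struct overlay_set {
    const unsigned char *for_standard;  /* drawn for the standard palette */
    const unsigned char *for_alternate; /* drawn for the other palette */
    const unsigned char *active;        /* the copy the overlay step should use */
};

/*
 * Call this when the palette ioctl is handled. `new_palette` is what was
 * just written and `standard_palette` is a saved copy of the default;
 * both are assumed to be `len` bytes long.
 */
static void overlay_palette_changed(struct overlay_set *set,
                                    const unsigned char *new_palette,
                                    const unsigned char *standard_palette,
                                    size_t len)
{
    if (memcmp(new_palette, standard_palette, len) == 0)
        set->active = set->for_standard;
    else
        set->active = set->for_alternate;
}
```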

That leaves the question of how to get your image into the kernel. You can't use the mmapped buffer, as the player is already using that. You can add a new ioctl that copies your image from user memory into a static kernel buffer - not as fast as mmap, but it should be OK if your image is small and you don't do it too often.
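The body of such an ioctl might look roughly like this. The argument struct, the OVERLAY_MAX limit and the function name are made up for the example, and copy_from_user() is replaced by a user-space stand-in so the sketch compiles outside the kernel; in the driver you would call the real one.

```c
#include <errno.h>
#include <string.h>

#define OVERLAY_MAX 2048  /* assumed upper bound on the image size, in bytes */

/* Hypothetical argument block passed to the new ioctl. */
struct overlay_upload {
    const unsigned char *data;  /* image in the caller's memory */
    unsigned long len;          /* number of bytes at `data` */
};

static unsigned char overlay_buf[OVERLAY_MAX];  /* the static kernel buffer */
static unsigned long overlay_len;               /* bytes currently valid */

/* Stand-in for the kernel's copy_from_user(), which returns the number
 * of bytes it could NOT copy. This version never fails. */
static unsigned long copy_from_user(void *to, const void *from, unsigned long n)
{
    memcpy(to, from, n);
    return 0;
}

/* Body of the hypothetical "set overlay" ioctl. Returns 0 or -errno. */
static int overlay_set_image(const struct overlay_upload *user_arg)
{
    struct overlay_upload up;

    if (copy_from_user(&up, user_arg, sizeof(up)))
        return -EFAULT;
    if (up.len > OVERLAY_MAX)
        return -EINVAL;         /* refuse anything larger than the buffer */
    if (copy_from_user(overlay_buf, up.data, up.len))
        return -EFAULT;

    overlay_len = up.len;
    return 0;
}
```

The length check has to come before the second copy, otherwise a careless caller could overrun the static buffer.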

If you want to do this properly, you first need to come up with some reasonable semantics for a shared display buffer that remains compatible with the way the player is currently using it. For example, you can allow any number of applications to open the device, all but one (the player) with some special flag. The player can do whatever it can do now. All the others can mmap the device, but each gets its own private buffer. They also provide a layer number and an overlay template. That template describes how to combine pixels from different layers (e.g. which pixels are transparent, which pixels to xor, etc.). The only ioctl allowed for the additional display users is "refresh", which just tells the kernel to make a private copy of the current contents of the buffer.

The actual refresh of the screen is still controlled by the player. On each frame, after the player's buffer is copied in display_queue_add(), the driver would go through the layers in order and combine them according to the template rules.
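To make the combining step a bit more concrete, here is a minimal sketch. The three rules and the one-rule-per-byte template are just one possible encoding I picked for the example (a real template would more likely work per pixel), and compose_layers() is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical template rules, one per byte of the buffer. */
enum layer_rule {
    RULE_TRANSPARENT,  /* keep whatever is underneath */
    RULE_COPY,         /* replace with the layer's byte */
    RULE_XOR           /* xor the layer's byte into what is underneath */
};

struct layer {
    const unsigned char *pixels;  /* private copy of the application's buffer */
    const unsigned char *tmpl;    /* one enum layer_rule value per byte */
};

/*
 * Combine `nlayers` layers into `frame`, which already holds the player's
 * image. `layers` is assumed to be sorted by ascending layer number, and
 * every buffer is assumed to be at least `len` bytes long.
 */
static void compose_layers(unsigned char *frame,
                           const struct layer *layers,
                           size_t nlayers,
                           size_t len)
{
    size_t l, i;

    for (l = 0; l < nlayers; l++) {
        for (i = 0; i < len; i++) {
            switch (layers[l].tmpl[i]) {
            case RULE_COPY:
                frame[i] = layers[l].pixels[i];
                break;
            case RULE_XOR:
                frame[i] ^= layers[l].pixels[i];
                break;
            default:  /* RULE_TRANSPARENT or unknown: leave as is */
                break;
            }
        }
    }
}
```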

On the driver side, you can use the private_data pointer of each struct file (you get a fresh one each time you open the device) to store the private buffer of that application. You can keep these in a linked list so that you can iterate over them in display_queue_add().
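The bookkeeping could be sketched like this. It is a user-space mock-up: struct file here is a stand-in with only the private_data field, calloc/free stand in for kmalloc/kfree, and the names, the buffer size, and the choice to keep the list sorted by layer number are my assumptions. A real driver would also need locking around the list.

```c
#include <errno.h>
#include <stdlib.h>

#define DISPLAY_BUF_SIZE 2048  /* assumed size of one private buffer */

/* Stand-in for the kernel's struct file; only the field used here. */
struct file {
    void *private_data;
};

/* Hypothetical per-open state for one additional display user. */
struct display_client {
    int layer;                               /* layer number given at open */
    unsigned char buffer[DISPLAY_BUF_SIZE];  /* this client's private buffer */
    struct display_client *next;
};

static struct display_client *clients;  /* list head, ascending layer order */

/* Called from open(): allocate the state and link it in by layer. */
static int client_open(struct file *filp, int layer)
{
    struct display_client *c = calloc(1, sizeof(*c));
    struct display_client **pp = &clients;

    if (!c)
        return -ENOMEM;
    c->layer = layer;
    while (*pp && (*pp)->layer <= layer)
        pp = &(*pp)->next;
    c->next = *pp;
    *pp = c;
    filp->private_data = c;
    return 0;
}

/* Called from release(): unlink and free this client's state. */
static void client_release(struct file *filp)
{
    struct display_client *c = filp->private_data;
    struct display_client **pp = &clients;

    while (*pp && *pp != c)
        pp = &(*pp)->next;
    if (*pp)
        *pp = c->next;
    free(c);
    filp->private_data = NULL;
}
```

Keeping the list sorted at open time means the per-frame walk in display_queue_add() can simply follow the next pointers without sorting anything.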

Yup, a lot of work if you just have one application in mind...

Sorry if this is a bit vague. I'd be happy to help with actual code if you can tell me more about what you want to do.

Good luck,
Borislav