I've been studying your code and thinking more about the problem.
It may not be quite as straightforward as I originally thought as I had forgotten about the 2 pixels per byte packing.
Code:
{
fy = iy + py;
fx = ix + px;
if ( fy < 0 || fy >= EMPEG_SCREEN_ROWS || fx < 0 || fx >= EMPEG_SCREEN_COLS )
continue;
if ( px == 0 && py == 0 )
continue;
n = 64 * fy + fx / 2;
This kills you because you hit it (or its sister) every time until done=1. An immediate obvious improvement would be to shift the "if ( px == 0 && py == 0 )" test up a few lines so that you can avoid the two additions and tortuous bounds test for that pixel. The real killer is the bounds test - it only has any effect for 318 out of 4096 pixels, yet you have to do the 4 comparisons every time.
This is something that I hope my method will avoid. By doing the 8 additions sequentially in turn, each addition could be bounded within the 'for' loops.
However, your algorithm only has to deal with the buffer packing problem once per pixel when you burn the bg buffer, or 4096 times per screen refresh. My solution would require this to be done for every addition...ie in the order of 32768 times. It's not the straightforward addition that I thought it was. You can't bytewise add either as you'd get carry problems between the nibbles.
There may be a way around this...it occurs to me that we only care if a pixel in our temporary buffer is non-zero after all the additions. Whether a nibble is 0x1 or 0xf is irrelevant - we burn the bg buffer for that pixel.
Because of that we may be able to use a bitwise OR instead of addition and actually achieve the same result in half the operations.
In order to shift the fg buffer one pixel left or right still is an issue though. If we could 'rotate with carry bits' across 64 bytes then it would be fairly easy.
I'm beginning to wonder whether unpacking the fg image into a byte per pixel copy might be worth considering to make the math faster.