There is an option called "Joint Stereo" for most encoders. I have only a vague understanding of how it works. I think it conserves bits by not encoding the tracks as two separate entities, but rather devotes bits to the stereo separation only as needed.

I don't know the exact algorithm either, but I've seen a simple explanation of what joint stereo does that sounds plausible. Suppose you want to store two 16-bit numbers A and B. One option is to store them separately, which takes 32 bits. Another is to store A+B and A-B. Each of those can need up to 17 bits in the worst case, but A-B needs far fewer if A and B are close. That's essentially what the MP3 encoder does in joint stereo mode, except it has a limited number of bits to work with (determined by the chosen bitrate). It just allocates more bits to the sum of the channels than to the difference, hoping that the stereo separation isn't too high. If the channels really are very different, the difference signal gets starved of bits, and that's where artifacts can come from.
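
To make the arithmetic concrete, here's a rough Python sketch of the sum/difference idea. It's only a toy working on single integer samples, not what a real MP3 encoder does internally, and the function names (encode, decode, bits_needed) are made up for illustration.

```python
def encode(a, b):
    """Turn two samples into (sum, difference)."""
    return a + b, a - b


def decode(s, d):
    """Recover the original samples from (sum, difference)."""
    # a + b and a - b always have the same parity, so both divisions are exact.
    return (s + d) // 2, (s - d) // 2


def bits_needed(value):
    """Smallest signed (two's-complement) width that can hold value."""
    if value >= 0:
        return value.bit_length() + 1
    return (-value - 1).bit_length() + 1


# Two similar channels: the difference is tiny.
s, d = encode(12000, 11990)
print(bits_needed(s), bits_needed(d))   # 16 5

# Two very different channels: the difference needs more bits than either input.
s, d = encode(32767, -32768)
print(bits_needed(s), bits_needed(d))   # 1 17
```

In the first case the pair fits in 21 bits instead of 32. The second case shows the worst case for the difference, which is why this trick only pays off when the two channels are similar.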

So yeah, try it without joint stereo and see if that helps.

Best,
Borislav