While H.264 is pretty efficient, it is still far from unbeatable. In fact it is possible to do even better than H.265. The trick is to use better and more complex prediction algorithms, possibly drawing on data from more frames to predict a given frame (and, theoretically, also on non-frame data, such as building a prediction from data that previously yielded mispredictions, knowing that it did mispredict).
Using more complex algorithms, however, requires more computational power (and possibly more memory).
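To make the idea concrete, here is a toy sketch of predicting a block of a frame from whichever of several earlier frames matches it best; only the residual against that prediction would then need to be stored. This is not the actual H.264/H.265 motion estimation: the block size, search window and SAD cost below are arbitrary choices made purely for illustration.

```python
import numpy as np

def best_block_prediction(block, references, by, bx, search=4):
    """Scan a small window around (by, bx) in each reference frame and
    return the candidate block with the lowest sum of absolute differences.
    Real codecs use far smarter motion estimation, sub-pixel precision,
    and compact ways to signal which reference and offset were chosen."""
    h, w = block.shape
    best, best_cost = None, float("inf")
    for ref in references:                      # more references = more work,
        for dy in range(-search, search + 1):   # more memory, better prediction
            for dx in range(-search, search + 1):
                y, x = by + dy, bx + dx
                if 0 <= y <= ref.shape[0] - h and 0 <= x <= ref.shape[1] - w:
                    cand = ref[y:y + h, x:x + w].astype(np.int16)
                    cost = int(np.abs(block.astype(np.int16) - cand).sum())
                    if cost < best_cost:
                        best, best_cost = cand, cost
    return best

# Encoder side: store only the residual between the real block and its prediction.
# frame, prev1, prev2 are assumed to be 2-D uint8 luma planes of the same size.
# block = frame[by:by + 16, bx:bx + 16]
# residual = block.astype(np.int16) - best_block_prediction(block, [prev1, prev2], by, bx)
```

The inner loops make the trade-off mentioned above visible: every extra reference frame and every enlargement of the search window improves the prediction a little, but multiplies the computation and the memory needed to keep those frames around.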
In fact we do more than just store differences between the actual frame and the predicted frame. We also predict pixels in a frame from other pixels of that same frame. And we exploit known regularities in the kind of data that actually appears in real images: by assuming those regularities are present, we can encode real images efficiently, at the price of being unable to correctly encode images that do not occur in reality.
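As a rough illustration of this intra-frame prediction, a block can be predicted from the already-decoded pixels just above and to the left of it, and only the residual is kept. The sketch below is a simplified take on the DC intra mode found in H.264/H.265; the 8x8 block size and the mid-grey fallback value are assumptions made for the example.

```python
import numpy as np

def dc_intra_predict(frame, y, x, size=8):
    """Predict an entire block as the mean of the already-decoded pixels
    directly above and to the left of it; a crude cousin of the codecs'
    DC intra mode. Flat regions of real images make this surprisingly good."""
    above = frame[y - 1, x:x + size] if y > 0 else np.empty(0, dtype=frame.dtype)
    left = frame[y:y + size, x - 1] if x > 0 else np.empty(0, dtype=frame.dtype)
    neighbours = np.concatenate([above, left]).astype(np.float32)
    dc = int(neighbours.mean()) if neighbours.size else 128   # mid-grey fallback
    return np.full((size, size), dc, dtype=np.int16)

# residual = frame[y:y + 8, x:x + 8].astype(np.int16) - dc_intra_predict(frame, y, x)
# The residual of a smooth, "real-looking" block is mostly near zero and compresses
# well; a block of pure random noise would defeat this prediction entirely.
```

The last comment is exactly the bet described above: real images are mostly smooth and self-similar, so a codec built around that assumption wins on real footage and loses only on images that never occur in practice.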
On top of this, our compression techniques are lossy. That is, we do not insist on encoding the frame exactly: we accept encoding a different frame as long as it "looks" sufficiently similar to the real one. To do this we exploit the limits of our visual system. In other words, we "delete" a lot of information from the frames, but we take care to delete only the kind of information that our eyes cannot detect. This is why it is important (in professional settings) to compress only the delivery file and to compress the raw shooting material and the intermediate files as little as possible (in other words, most consumer and prosumer cameras are not suited for high-end video post-production, even if the videos they produce are visually identical to, and sometimes perceptually better than, uncompressed or very lightly compressed material from pro equipment).
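One concrete example of this perceptual "deletion" is chroma subsampling. The eye resolves brightness far better than colour, so the colour planes can be stored at a quarter of the resolution with little visible loss, before quantization discards yet more detail the eye would not notice. The sketch below is a naive 4:2:0 averaging for illustration, not any particular codec's implementation.

```python
import numpy as np

def subsample_420(y_plane, cb, cr):
    """Naive 4:2:0 chroma subsampling: keep luma at full resolution and
    average every 2x2 block of the chroma planes. This alone halves the
    amount of raw pixel data, yet is nearly invisible to the eye, which
    is exactly the kind of information lossy codecs choose to delete."""
    def half(p):
        h, w = (p.shape[0] // 2) * 2, (p.shape[1] // 2) * 2
        p = p[:h, :w].astype(np.float32)
        return ((p[0::2, 0::2] + p[0::2, 1::2] +
                 p[1::2, 0::2] + p[1::2, 1::2]) / 4).round().astype(np.uint8)
    return y_plane, half(cb), half(cr)
```

This also explains the workflow advice above: each re-encode deletes a little more of what the previous pass left behind, so material that will be edited and re-encoded repeatedly should start with as few of these deletions as possible, and the aggressive lossy step should happen only once, on the delivery file.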