Add more cache layout optimizations
I suspect full-blown pyramids, as Sebastien demonstrated, are overkill, and that a truncated pyramid just deep enough to yield satisfactory L3 cache performance would perform better overall. This still needs to be tested.