I've found an interesting converse with macro benchmarks: a small improvement to any of the top functions in your profiler output tends to have a larger effect on overall speed than you'd expect. That is, you make a change that speeds up just one function a little when measured on its own (within the benchmark), but the overall run improves by more than that function's share would suggest. I think you end up freeing resources that everything else uses.
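One way to put a number on "larger than expected" is to compare the measured overall speedup against what Amdahl's law predicts from the function's share of the profile. This is only a rough sketch: the function names and all the figures below (a 20% profile share, a 1.10x local speedup, 100 s and 96 s run times) are made up for illustration and are not measurements.

```python
def expected_speedup(fraction, local_speedup):
    """Amdahl's law: overall speedup if only this function's own time shrinks.

    fraction      -- share of total run time spent in the function (0..1)
    local_speedup -- how much faster the function is in isolation (e.g. 1.10)
    """
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def surplus(fraction, local_speedup, before_s, after_s):
    """Measured overall speedup minus the Amdahl prediction.

    A positive result means the whole benchmark gained more than the
    function's own share of the profile accounts for.
    """
    measured = before_s / after_s
    return measured - expected_speedup(fraction, local_speedup)


# Hypothetical numbers, for illustration only.
predicted = expected_speedup(fraction=0.20, local_speedup=1.10)
extra = surplus(fraction=0.20, local_speedup=1.10, before_s=100.0, after_s=96.0)

print(f"predicted overall speedup: {predicted:.4f}x")
print(f"measured minus predicted:  {extra:+.4f}x")
```

If the surplus stays positive across repeated runs, the gain is coming from somewhere other than the function's own time, which would fit the "freeing resources" guess.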

The trick is to optimize the right macro benchmark: one that matches your customer's key use case, I suppose.