It has some new material, but it's not very well written, and some of the benchmarks are arguably unrealistic. For example, Brad should also have included results from real applications such as Redis (which uses jemalloc by default), MongoDB, and many others. In many scenarios not covered in the paper, the allocator can use much more memory than tcmalloc.
The "Vyukov" benchmark is an invented (and, I would argue, contrived) scenario that Vyukov came up with to cause memory bloat in Intel's TBB just by examining the code. Whether it actually occurs in a real application is debatable.
As you may have noticed, tcmalloc is not mentioned anywhere in the paper :) It is still a viable alternative today.
There are many other papers I would recommend in addition to this one, e.g., "Dynamic Storage Allocation: A Survey and Critical Review" by Wilson et al., despite it being quite old.
This might get me downvotes, but it is wise to remember that just because work has the MIT stamp on it doesn't mean it's always top quality.
The only limitation of "Dynamic Storage Allocation: A Survey and Critical Review" is that it does not include a discussion of multithreaded allocator design, because the paper predates the time when people really started caring. I'm unaware of a survey paper that covers that aspect of memory allocation.
(By the way: yours is a good comment! There's no need to mention you may get downvotes. We ask people not to do that, as it can tend to have an "I dare you" effect: https://news.ycombinator.com/newsguidelines.html)
Oops. Sorry, I will be careful next time. Thanks for pointing that out.
> does not include a discussion on multithreaded allocator design
This is very true. tcmalloc seems to have been the earliest design with thread-local pools. jemalloc didn't originally have this design[1], and over time many allocators just adopted it, including SuperMalloc and others.
Actually, thread-local pools predate tcmalloc by quite a few years. Cribbing from the related work section of a 2006 paper I co-authored (http://www.scott-a-s.com/files/ismm06.pdf):
"Streamflow uses segregated object allocation in thread-private
heaps, as in several other thread-safe allocators including Hoard
[3], Maged Michael’s lock-free memory allocator [18], Tcmalloc
from Google’s performance tools [10], LKmalloc [15], ptmalloc
[9], and Vee and Hsu’s allocator [25]. In particular, Streamflow
uses strictly thread-local object allocation, both thread-local and
remote deallocation and mechanisms for recycling free page blocks
to avoid false sharing and memory blowup [3, 18]."
[3] E. Berger, K. Mckinley, R. Blumofe, and P. Wilson. Hoard: A Scalable
Memory Allocator for Multithreaded Applications. In Proc. of the 9th
International Conference on Architectural Support for Programming
Languages and Operating Systems, pages 117–128, Cambridge, MA,
November 2000.
[15] P. Larson and M. Krishnan. Memory Allocation for Long-Running
Server Applications. In Proceedings of the First International
Symposium on Memory Management, pages 176–185, Vancouver,
BC, October 1998.
[18] M. Michael. Scalable Lock-free Dynamic Memory Allocation. In
Proceedings of the ACM SIGPLAN 2004 Conference on Programming
Language Design and Implementation, pages 35–46, Washington,
DC, June 2004.
[25] V. Vee and W. Hsu. A Scalable and Efficient Storage Allocator
on Shared Memory Multiprocessors. In Proceedings of the 1999
International Symposium on Parallel Architectures, Algorithms and
Networks, pages 230–235, Perth, Australia, June 1999.
The earliest appears to be Larson and Krishnan from 1998. It appears that in the late '90s and early 2000s, the focus was on SMP machines, for servers. Then in the early to mid 2000s, people (including my advisor) started realizing this whole "multicore" thing was for real, and system software would have to change.
I wasn't sure where it appeared first, either! I had to dig out that old related work section. There may be work that predates the '98 reference, but it may not have gotten much attention. (I had assumed Hoard would be the first in the literature, but that's from 2000.) I think when it shows up is more related to the available hardware at the time, and what people were doing with it. It's not a huge stretch to imagine thread-local pools, but I don't think enough people were paying attention to the problem before then.
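For anyone who hasn't seen the design the quoted passage describes, here's a toy sketch of thread-private heaps with local and remote deallocation. It is only a model under stated assumptions: the names (`ThreadHeap`, `remote_frees`, `free`) are made up for illustration, blocks are just integer indices, and the remote-free queue uses a plain lock, whereas Streamflow and Michael's allocator use lock-free mechanisms. The point is only that a block freed by another thread goes back to its owner's heap rather than piling up in the freeing thread's heap.

```python
import threading
from collections import deque


class ThreadHeap:
    """Toy thread-private heap: only the owning thread allocates from it."""

    def __init__(self, owner_id, num_blocks):
        self.owner_id = owner_id
        self.free_list = list(range(num_blocks))  # touched only by the owner
        self.remote_frees = deque()               # other threads push here
        self.remote_lock = threading.Lock()       # assumption: a lock, not lock-free

    def alloc(self):
        # Fast path: the owner pops from its own free list without locking.
        if not self.free_list:
            self._drain_remote()
        if not self.free_list:
            return None  # heap exhausted in this toy model
        return (self.owner_id, self.free_list.pop())

    def _drain_remote(self):
        # Slow path: reclaim blocks that other threads freed on our behalf.
        with self.remote_lock:
            while self.remote_frees:
                self.free_list.append(self.remote_frees.popleft())


def free(block, current_heap, heaps):
    """Free a block from the thread that owns current_heap."""
    owner_id, index = block
    if owner_id == current_heap.owner_id:
        # Local deallocation: no synchronization needed.
        current_heap.free_list.append(index)
    else:
        # Remote deallocation: hand the block back to its owner.
        owner = heaps[owner_id]
        with owner.remote_lock:
            owner.remote_frees.append(index)
```

In a producer-consumer pattern, where thread 0 allocates and thread 1 frees, the freed blocks end up in thread 0's `remote_frees` and get reused the next time thread 0 runs out, so thread 1's heap doesn't grow.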
If you haven't spent much time reading about allocation strategies and are looking for a good place to start, _Dynamic Storage Allocation: A Survey and Critical Review_ is a fantastic choice. One of my all-time favorite survey papers.
That's a good paper, but it's worth noting that most of the interesting work these days is in multithreaded memory allocators, which weren't as important in 1995, when that survey was written. Scaling well under multithreading (i.e., not taking a global malloc lock) changes the design space considerably: you need to have per-thread heaps and rebalance them from time to time, which is itself a very interesting problem.
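To make the rebalancing problem a bit more concrete, here's a minimal sketch, not any particular allocator's policy. The class names, the `batch` refill size, and the `high_water` threshold are all assumptions picked for illustration. Each thread keeps a private cache and only touches the locked global pool when the cache runs empty or grows past the threshold.

```python
import threading


class GlobalPool:
    """Shared pool of free blocks, protected by a single lock."""

    def __init__(self, blocks):
        self.lock = threading.Lock()
        self.blocks = list(blocks)

    def take(self, n):
        # Hand out up to n blocks in one locked operation.
        with self.lock:
            split = max(len(self.blocks) - n, 0)
            taken = self.blocks[split:]
            del self.blocks[split:]
            return taken

    def give(self, blocks):
        with self.lock:
            self.blocks.extend(blocks)


class ThreadCache:
    """Per-thread cache; hypothetical batch and high_water parameters."""

    def __init__(self, pool, batch=4, high_water=8):
        self.pool = pool
        self.batch = batch
        self.high_water = high_water
        self.cache = []  # touched only by the owning thread

    def alloc(self):
        # Refill from the global pool only when the local cache is empty.
        if not self.cache:
            self.cache = self.pool.take(self.batch)
        return self.cache.pop() if self.cache else None

    def free(self, block):
        self.cache.append(block)
        if len(self.cache) > self.high_water:
            # Rebalance: return the older half to the global pool.
            half = len(self.cache) // 2
            self.pool.give(self.cache[:half])
            del self.cache[:half]
```

Without the `high_water` check, a thread that only frees (the consumer in a producer-consumer setup) would hoard every block while the allocating thread starves. Picking the threshold and the batch size is where the tradeoff between lock traffic and memory blowup shows up.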
For anyone interested in this area, take a look at the memory allocator (and GC) for HotSpot. Specifically "Hierarchical PLABs, CLABs, TLABs in Hotspot" [0].