
If you'd like to learn more, here's an MIT research paper on fast memory allocation that has some really clever ideas: http://supertech.csail.mit.edu/papers/Kuszmaul15.pdf

Abstract:

SuperMalloc is an implementation of malloc(3) originally designed for X86 Hardware Transactional Memory (HTM). It turns out that the same design decisions also make it fast even without HTM. For the malloc-test benchmark, which is one of the most difficult workloads for an allocator, with one thread SuperMalloc is about 2.1 times faster than the best of DLmalloc, JEmalloc, Hoard, and TBBmalloc; with 8 threads and HTM, SuperMalloc is 2.75 times faster; and on 32 threads without HTM SuperMalloc is 3.4 times faster. SuperMalloc generally compares favorably with the other allocators on speed, scalability, speed variance, memory footprint, and code size.

SuperMalloc achieves these performance advantages using less than half as much code as the alternatives. SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap. It allocates 2 MiB chunks which contain objects all the same size. To translate chunk numbers to chunk metadata, SuperMalloc uses a simple array (most of which is uncommitted to physical memory). SuperMalloc takes care to avoid associativity conflicts in the cache: most of the size classes are a prime number of cache lines, and nonaligned huge accesses are randomly aligned within a page. Objects are allocated from the fullest non-full page in the appropriate size class. For each size class, SuperMalloc employs a 10-object per-thread cache, a per-CPU cache that holds about a level-2-cache worth of objects per size class, and a global cache that is organized to allow the movement of many objects between a per-CPU cache and the global cache using O(1) instructions. SuperMalloc prefetches everything it can before starting a critical section, which makes the critical sections run fast, and for HTM improves the odds that the transaction will commit.
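The "prime number of cache lines" trick from the abstract is easy to sketch. Here's a hedged illustration: the helper names `is_prime` and `size_class` are invented, and the real allocator applies this to most (not all) size classes, but the idea is just rounding requests up to a prime number of lines.

```c
#define CACHE_LINE 64u

/* Invented helper: trial-division primality test. */
static int is_prime(unsigned n) {
    if (n < 2) return 0;
    for (unsigned d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Round a request up to the next size class spanning a prime number of
 * cache lines, so successive same-class objects stride across different
 * cache sets instead of piling onto one associativity set. */
static unsigned size_class(unsigned request) {
    unsigned lines = (request + CACHE_LINE - 1) / CACHE_LINE;
    while (!is_prime(lines)) lines++;
    return lines * CACHE_LINE;
}
```

For example, a 300-byte request rounds up to 5 cache lines (320 bytes, already prime), while a 512-byte request rounds from 8 lines up to 11 (704 bytes).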



While the paper has some new material, it's also not very well written, and some of the benchmarks could be argued to be unrealistic. For example, Brad should have also included results from applications such as Redis (which uses jemalloc by default), MongoDB, or many others. In many other scenarios not detailed in the paper, the allocator can use much more memory than tcmalloc.

The "Vyukov" benchmark is an invented (and, I would argue, contrived) scenario that Vyukov came up with, just by examining the code, to cause memory bloat in Intel's TBB. Whether it actually occurs in a real application is debatable.

As you may have noticed, nowhere in the paper is tcmalloc mentioned :) It is still a viable alternative today.

There are many other papers I would recommend in addition to this one, e.g., "Dynamic Storage Allocation: A Survey and Critical Review" by Wilson et al., despite it being quite old.

This might get me downvotes, but it is wise to remember that just because work has the MIT stamp on it doesn't mean it's top quality.


The only limitation of "Dynamic Storage Allocation: A Survey and Critical Review" is that it does not include a discussion on multithreaded allocator design - because the paper predates when people really started caring about it. I'm unaware of a survey paper that covers that aspect of memory allocation.

(By the way: yours is a good comment! There's no need to mention you may get downvotes. We ask people not to do that, as it can tend to have a "I dare you" effect: https://news.ycombinator.com/newsguidelines.html)


> There's no need to mention you may get downvotes. We ask people not to do that, as it can tend to have a "I dare you" effect: https://news.ycombinator.com/newsguidelines.html

Oops. Sorry, I will be careful next time. Thanks for pointing that out.

> does not include a discussion on multithreaded allocator design

This is very true. tcmalloc seems to have been the earliest design with thread-local pools. jemalloc didn't originally have this design[1], and over time many allocators just adopted it, including SuperMalloc and others.

[1] https://www.facebook.com/notes/facebook-engineering/scalable... (search for tcmalloc)


Actually, thread-local pools predate tcmalloc by quite a few years. Cribbing from the related-work section of a 2006 paper I'm a co-author on (http://www.scott-a-s.com/files/ismm06.pdf):

"Streamflow uses segregated object allocation in thread-private heaps, as in several other thread-safe allocators including Hoard [3], Maged Michael’s lock-free memory allocator [18], Tcmalloc from Google’s performance tools [10], LKmalloc [15], ptmalloc [9], and Vee and Hsu’s allocator [25]. In particular, Streamflow uses strictly thread-local object allocation, both thread-local and remote deallocation and mechanisms for recycling free page blocks to avoid false sharing and memory blowup [3, 18]."

[3] E. Berger, K. Mckinley, R. Blumofe, and P. Wilson. Hoard: A Scalable Memory Allocator for Multithreaded Applications. In Proc. of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 117–128, Cambridge, MA, November 2000.

[9] Wolfram Gloger. Dynamic Memory Allocator Implementations in Linux System Libraries. http://www.dent.med.unimuenchen.de/wmglo/malloc-slides.html.

[10] Google. Google Performance Tools. http://goog-perftools.sourceforge.net/.

[15] P. Larson and M. Krishnan. Memory Allocation for Long-Running Server Applications. In Proceedings of the First International Symposium on Memory Management, pages 176–185, Vancouver, BC, October 1998.

[18] M. Michael. Scalable Lock-free Dynamic Memory Allocation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 35–46, Washington, DC, June 2004.

[25] V. Vee and W. Hsu. A Scalable and Efficient Storage Allocator on Shared Memory Multiprocessors. In Proceedings of the 1999 International Symposium on Parallel Architectures, Algorithms and Networks, pages 230–235, Perth, Australia, June 1999.

The earliest appears to be Larson and Krishnan from 1998. It appears that in the late '90s and early 2000s, the work was SMP-focused, aimed at servers. Then in the early to mid 2000s, people (including my advisor) started realizing this whole "multicore" thing was for real, and that system software would have to change.


I had read your paper :) but did not dig enough to find whether tcmalloc had introduced the concept or not. Thanks for pointing those out!


I wasn't sure where it appeared first, either! I had to dig out that old related work section. There may be work that predates the '98 reference, but it may not have gotten much attention. (I had assumed Hoard would be the first in the literature, but that's from 2000.) I think when it shows up is more related to the available hardware at the time, and what people were doing with it. It's not a huge stretch to imagine thread-local pools, but I don't think enough people were paying attention to the problem before then.


If you haven't spent much time reading about allocation strategies and are looking for a good place to start, _Dynamic Storage Allocation: A Survey and Critical Review_ is a fantastic starting point. One of my all-time favorite survey papers.

http://www.cs.northwestern.edu/~pdinda/ics-s05/doc/dsa.pdf


That's a good paper, but it's worth noting that most of the interesting work these days is in multithreaded memory allocators, which weren't as important in 2005. Scaling well under multithreading (i.e. not taking a global malloc lock) changes the design space considerably: you need to have per-thread heaps and rebalance them from time to time, which is itself a very interesting problem.


For anyone interested in this area, take a look at the memory allocator (and GC) for HotSpot. Specifically "Hierarchical PLABs, CLABs, TLABs in Hotspot" [0]

[0] http://cs.uni-salzburg.at/~hpayer/


Thanks for the link. Do you mind sharing some of these other all-time favorite survey papers?


On the subject of papers, for those who have not read them Jeff Bonwick's work on the Solaris slab allocator is well worth reading.

https://www.usenix.org/legacy/publications/library/proceedin...

https://www.usenix.org/legacy/event/usenix01/full_papers/bon...

Pretty much every OS kernel has adopted the techniques outlined in these papers.


These ideas are also excellent.

While slab allocators can be very fast and space-efficient, they are meant to be used in scenarios where you know the object sizes in advance, as each slab cache only allocates one kind of object. The kernel is a great use case, because you know which structs will be frequently allocated/destroyed, so you create a slab cache for just that one data structure (and others for other structs). And the kernel itself does not change frequently, nor does it allocate objects of sizes imposed by user space (as a file system or key-value store would have to).
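To make the slab idea concrete, here's a minimal single-slab cache for one object type, in the spirit of (but vastly simpler than) the kernel allocators described above. All names are invented, and real slab allocators add multiple slabs, object constructors, and per-CPU magazines:

```c
#include <stdlib.h>

/* One slab, one object size, free list threaded through the free
 * objects themselves. */
struct slab_cache {
    size_t obj_size;
    void *free_list;
    void *slab;        /* the backing memory */
};

static int slab_init(struct slab_cache *c, size_t obj_size, size_t nobjs) {
    if (obj_size < sizeof(void *)) obj_size = sizeof(void *);
    c->obj_size = obj_size;
    c->slab = malloc(obj_size * nobjs);
    if (!c->slab) return -1;
    c->free_list = NULL;
    for (size_t i = 0; i < nobjs; i++) {   /* push every object */
        void *obj = (char *)c->slab + i * obj_size;
        *(void **)obj = c->free_list;
        c->free_list = obj;
    }
    return 0;
}

static void *slab_alloc(struct slab_cache *c) {
    void *obj = c->free_list;
    if (obj) c->free_list = *(void **)obj;  /* pop */
    return obj;
}

static void slab_free(struct slab_cache *c, void *obj) {
    *(void **)obj = c->free_list;           /* push */
    c->free_list = obj;
}
```

Both alloc and free are a couple of pointer operations with zero per-object metadata, which is exactly why the technique is so fast when the object size is known up front.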

Memcached employs (or at least employed) its own allocator that is closely analogous to a slab allocator. It creates each cache not from knowing specific object sizes (it cannot - it is a key-value store), but from a range of sizes incremented by some runtime constant. This can lead to really severe memory bloat in some instances.

Memory allocators are not easy. Even "general-purpose" allocators are not truly meant for general-purpose allocation; the DSA paper mentioned above states (page 6 top left):

> It has been proven that for any possible allocation algorithm, there will always be the possibility that some application program will allocate and deallocate blocks in some fashion that defeats the allocator's strategy, and forces it into severe fragmentation


The SuperMalloc paper has some really interesting ideas about how to manage metadata. When you do a free(), you need to know the size of the underlying allocation. You also probably need access to some metadata relating to that allocation. Ideally, you even want to be a bit robust against someone freeing something that isn't a pointer to the beginning of a heap object.

Jemalloc uses a radix tree to map from addresses to the headers of the 1-2MB "chunks" it uses as the top level of allocation. SuperMalloc also allocates into 2MB chunks, but such that each chunk holds objects of only one size. It then uses an array with one double word per 2MB chunk in the address space, which holds the size of the objects in each chunk. That also allows you to quickly find the metadata, which is stored at the beginning of each chunk.

The array can get big (512MB for an array covering a 48-bit address space with 2MB chunks), but it's sparse, and the virtual memory system won't map it to real memory until each page is accessed.
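A hedged sketch of that array scheme (names are invented, and the demo array is a tiny stand-in rather than one spanning the whole address space):

```c
#include <stdint.h>

#define CHUNK_SHIFT 21   /* log2(2 MiB) */

/* Chunk number is just the address shifted down by the chunk size. */
static uint64_t chunk_index(const void *p) {
    return (uint64_t)(uintptr_t)p >> CHUNK_SHIFT;
}

/* In the real scheme this array has 2^48 / 2^21 = 2^27 entries of 4
 * bytes each (512MB of mostly-uncommitted address space); here it
 * covers only the first 256 chunks. */
static uint32_t obj_size_of_chunk[256];

static void register_chunk(uint64_t idx, uint32_t objsize) {
    obj_size_of_chunk[idx] = objsize;
}

/* free() can now find the object size with one shift and one load,
 * instead of a radix-tree walk. */
static uint32_t object_size_of(const void *p) {
    return obj_size_of_chunk[chunk_index(p)];
}
```

Any pointer into a chunk, not just one to the start of an object, maps to the right entry, which is what gives you the robustness mentioned above.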


I don't know how fast SuperMalloc is for other people, but in my experience it's not as fast as some other mallocs() (like Hoard or tcmalloc) and, in the end, it's still just another malloc().

Giving up the malloc() interface allows a very simple page-long allocator to outperform every malloc() I've ever tried (including SuperMalloc), and it would truly surprise me to learn there was some exotic memory technique (HTM, caching, etc) that could beat me.


Can you elaborate on what you mean by "Giving up the malloc() interface"?


The API for malloc is very simple:

  void* malloc(size_t len);
  void free(void* ptr);
Basically, you ask for an amount of memory and get back a pointer to it. But the interface does not allow you to communicate anything but the amount when allocating, and you can only hand back a pointer when freeing. If, for example, malloc had an interface like:

  void* malloc(size_t len, duration_t dur);
  enum duration_t { SHORT, MEDIUM, LONG};
One can imagine you could optimize based on the given hint. Once you open up one kind of hint, you can imagine many other kinds of things users could communicate explicitly to the memory allocator about the allocation and access patterns of their data.

All implementations of free also need to find the related metadata for the given pointer. We could also imagine an interface for free which required the caller to maintain that metadata, which could perhaps speed things up:

  void free(void* ptr, meta_t d);
Or, we could also imagine communicating how important it is to free this memory:

  void free(void* ptr, immediacy_t im);
  enum immediacy_t { IMMEDIATE, DELAYABLE };
In the latter case, maybe the memory allocator could put off doing delayable requests, and do them in a batch later. (Making it a bit more like garbage collection.)

I'm not sure any of these ideas would help, but the point is that because of the limited interface, we can't really explore them. We can, but then convincing the rest of the world to change such a basic building block of C code is quite hard.
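For what it's worth, the DELAYABLE idea sketched above fits in a few lines. This is a toy, not a real allocator: a fixed-size batch of pending frees that is drained when it fills up or on demand (all names invented):

```c
#include <stdlib.h>

typedef enum { IMMEDIATE, DELAYABLE } immediacy_t;

#define BATCH 64
static void *pending[BATCH];
static int npending;

/* Drain every delayed free in one go. */
static void flush_pending(void) {
    while (npending > 0)
        free(pending[--npending]);
}

/* IMMEDIATE frees happen now; DELAYABLE frees are queued and released
 * in a batch, either here when the queue fills or explicitly later. */
static void free_hinted(void *ptr, immediacy_t im) {
    if (im == IMMEDIATE) {
        free(ptr);
        return;
    }
    if (npending == BATCH)
        flush_pending();
    pending[npending++] = ptr;
}
```

Batching the frees like this amortizes the cost of taking the allocator's locks (or touching its metadata) over many objects, which is the garbage-collection-like win being alluded to.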


> In the latter case, maybe the memory allocator could put off doing delayable requests, and do them in a batch later. (Making it a bit more like garbage collection.)

Amusingly, when I've used an IMMEDIATE/DELAYABLE style hint, it was for the opposite purpose: I had some batched deallocs that I would either delay (to spread out over multiple frames instead of handling as a single batch, to eliminate the framerate hitch we were getting), or perform immediately as a single batch (to achieve greater throughput when switching scenes as delayed deallocation was adding untenable amounts of overhead.)

> We can, but then convincing the rest of the world to change such a basic building block of C code is quite hard.

Changing such a fundamental building block of C is impossible.

However, providing a second, alternative interface for those applications which could really benefit from such fiddly high-performance tweaks already happens a good bit, in games at least. Pool allocators, allocators with extra debug information, allocation of entirely different kinds of memory (e.g. write-combined memory for texture uploads)... lots of stuff out there. Some low-level graphics APIs now make you decide if e.g. you want to put shader constants in their own GPU buffers, or just interleave them into the command buffers themselves...


jemalloc has an alternative API that allows specifying the size of the allocation: sdallocx [1].

[1]: http://www.canonware.com/download/jemalloc/jemalloc-latest/d...


"jemalloc has an alternative API that allows specifying the size of the allocation"

I would like to see an API for malloc where you don't need to specify the size of the allocation :-)

For those who wonder: it takes additional flags specifying alignment, whether to zero memory, whether to store data in a thread-specific cache, or an arena to use.
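To see why handing the size back at free time can help at all, here's a toy size-segregated allocator (nothing here is jemalloc's actual implementation; names are invented, and it only handles sizes 1-64). A plain free() must first recover the size class from a header, while the sized variant skips that memory access entirely:

```c
#include <stdlib.h>

/* Four 16-byte-granular size classes. Blocks carry a 16-byte header
 * recording their class so plain my_free() can find it. */
#define NCLASSES 4

static void *free_lists[NCLASSES];

static int class_of(size_t size) { return (int)((size - 1) / 16); }

static void *my_malloc(size_t size) {
    int c = class_of(size);
    if (free_lists[c]) {                  /* fast path: pop cached block */
        void *p = free_lists[c];
        free_lists[c] = *(void **)p;
        return p;
    }
    int *hdr = malloc(16 + 16 * (size_t)(c + 1));
    if (!hdr) return NULL;
    *hdr = c;                             /* remember the class */
    return hdr + 4;                       /* skip the 16-byte header */
}

/* Sized free: the caller supplies the size, so no header read. */
static void my_sized_free(void *p, size_t size) {
    int c = class_of(size);
    *(void **)p = free_lists[c];
    free_lists[c] = p;
}

/* Plain free: must recover the class with an extra memory access. */
static void my_free(void *p) {
    int c = *((int *)p - 4);
    my_sized_free(p, 16 * (size_t)(c + 1));
}
```

In C++ this is why sized `operator delete` exists: the compiler statically knows `sizeof(T)` at most delete sites and can hand it to the allocator for free.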


Interesting idea.

Another avenue for improvement is to use alloca() for small, function-local objects, but there is no direct way to know if a variable is local or not (in C and C++ at least).

> We can, but then convincing the rest of the world to change such a basic building block of C code is quite hard.

It can be done with static analysis or binary instrumentation.


> but there is no direct way to know if a variable is local or not (in C and C++ at least).

If you only have one stack, and the stack is at the top of memory and it grows down, you can:

    int onstackp(void *x) { char a; return (char *)x > &a; }  /* hack: assumes a single, downward-growing stack */


That's so simple! Thanks a lot.


malloc(sizeof(foo)) is slower than alloc_foo() because the latter can simply be a pointer increment, but there is no way to tell the malloc() interface that you're going to be doing nothing but allocating foo for a while.

malloc() can guess this with heuristics, but a good malloc() needs to perform well for a wide variety of use cases: Surely you can appreciate that balance has a cost that the specialised allocator simply doesn't have to pay.
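To make the pointer-increment claim concrete, here's a hedged sketch of such an alloc_foo() over a fixed arena (all names invented; a real pool allocator would also handle freeing and arena growth):

```c
#include <stddef.h>

struct foo { double x, y; };

/* All foos come from one fixed arena; the entire fast path is a bounds
 * check plus a pointer increment. */
#define ARENA_OBJS 1024
static struct foo arena[ARENA_OBJS];
static size_t next_free;

static struct foo *alloc_foo(void) {
    if (next_free == ARENA_OBJS)
        return NULL;                 /* arena exhausted */
    return &arena[next_free++];
}
```

No size lookup, no free-list search, no locking: everything a general-purpose malloc() has to do per call is simply gone, because the caller communicated the type out of band.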


> SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap. It allocates 2 MiB chunks which contain objects all the same size.

Hmm, this must not interact well with huge pages: either all those (huge-page-sized) chunks actually take up physical memory, or you're stuck with 4 KiB pages.

Transparent huge pages on Linux make a noticeable performance difference in practice. In one of my applications, a coworker discovered that dTLB misses were a significant source of slowness. He pointed out that we had a lot of "holes" in our memory allocation caused by the malloc implementation calling madvise on 4 KiB pages when they were empty, thus preventing them from being coalesced into a 2 MiB huge page (or causing them to migrate back to small pages). Setting a command-line flag to prevent this decreased CPU usage by 7%.


HN discussion on the SuperMalloc paper: https://news.ycombinator.com/item?id=10770781


> SuperMalloc achieves these performance advantages using less than half as much code as the alternatives. SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap.

On the other hand, that can considerably increase the size of the page table, and the cost and frequency of TLB misses. So I'm not sure it's all a meaningful comparison.



