
If you'd like to learn more, here's an MIT research paper on fast memory allocation that has some really clever ideas: http://supertech.csail.mit.edu/papers/Kuszmaul15.pdf

Abstract:

SuperMalloc is an implementation of malloc(3) originally designed for X86 Hardware Transactional Memory (HTM). It turns out that the same design decisions also make it fast even without HTM. For the malloc-test benchmark, which is one of the most difficult workloads for an allocator, with one thread SuperMalloc is about 2.1 times faster than the best of DLmalloc, JEmalloc, Hoard, and TBBmalloc; with 8 threads and HTM, SuperMalloc is 2.75 times faster; and on 32 threads without HTM SuperMalloc is 3.4 times faster. SuperMalloc generally compares favorably with the other allocators on speed, scalability, speed variance, memory footprint, and code size.

SuperMalloc achieves these performance advantages using less than half as much code as the alternatives. SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap. It allocates 2 MiB chunks which contain objects all the same size. To translate chunk numbers to chunk metadata, SuperMalloc uses a simple array (most of which is uncommitted to physical memory). SuperMalloc takes care to avoid associativity conflicts in the cache: most of the size classes are a prime number of cache lines, and nonaligned huge accesses are randomly aligned within a page. Objects are allocated from the fullest non-full page in the appropriate size class. For each size class, SuperMalloc employs a 10-object per-thread cache, a per-CPU cache that holds about a level-2-cache worth of objects per size class, and a global cache that is organized to allow the movement of many objects between a per-CPU cache and the global cache using O(1) instructions. SuperMalloc prefetches everything it can before starting a critical section, which makes the critical sections run fast, and for HTM improves the odds that the transaction will commit.
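The "prime number of cache lines" trick from the abstract is easy to sketch. Here's a hedged illustration: the helper names `is_prime` and `size_class` are invented, and the real allocator applies this to most (not all) size classes, but the idea is just rounding requests up to a prime number of lines.

```c
#define CACHE_LINE 64u

/* Invented helper: trial-division primality test. */
static int is_prime(unsigned n) {
    if (n < 2) return 0;
    for (unsigned d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Round a request up to the next size class spanning a prime number of
 * cache lines, so successive same-class objects stride across different
 * cache sets instead of piling onto one associativity set. */
static unsigned size_class(unsigned request) {
    unsigned lines = (request + CACHE_LINE - 1) / CACHE_LINE;
    while (!is_prime(lines)) lines++;
    return lines * CACHE_LINE;
}
```

For example, a 300-byte request rounds up to 5 cache lines (320 bytes, already prime), while a 512-byte request rounds from 8 lines up to 11 (704 bytes).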



While the paper has some new material, it's also not very well written, and some of the benchmarks could be argued to be unrealistic. For example, Brad should have also included results from applications such as Redis (which uses jemalloc by default), MongoDB, or many others. In many other scenarios not detailed in the paper, the allocator can use much more memory than tcmalloc.

The "Vyukov" benchmark is an invented (and, I would argue, contrived) scenario that Vyukov came up with, just by examining the code, to cause memory bloat in Intel's TBB. Whether it actually occurs in a real application is debatable.

As you may have noticed, nowhere in the paper is tcmalloc mentioned :) It is still a viable alternative today.

There are many other papers I would recommend in addition to this one, e.g., "Dynamic Storage Allocation: A Survey and Critical Review" by Wilson et al., despite it being quite old.

This might get me downvotes, but it is wise to remember that just because work has the MIT stamp on it doesn't mean it's top quality.


The only limitation of "Dynamic Storage Allocation: A Survey and Critical Review" is that it does not include a discussion on multithreaded allocator design - because the paper predates when people really started caring about it. I'm unaware of a survey paper that covers that aspect of memory allocation.

(By the way: yours is a good comment! There's no need to mention you may get downvotes. We ask people not to do that, as it can tend to have a "I dare you" effect: https://news.ycombinator.com/newsguidelines.html)


> There's no need to mention you may get downvotes. We ask people not to do that, as it can tend to have a "I dare you" effect: https://news.ycombinator.com/newsguidelines.html

Oops. Sorry, I will be careful next time. Thanks for pointing that out.

> does not include a discussion on multithreaded allocator design

This is very true. tcmalloc seems to have been the earliest design with thread-local pools. jemalloc didn't originally have this design[1], and over time many allocators just adopted it, including SuperMalloc and others.

[1] https://www.facebook.com/notes/facebook-engineering/scalable... (search for tcmalloc)


Actually, thread-local pools predate tcmalloc by quite a few years. Cribbing from the related-work section of a 2006 paper I'm a co-author on (http://www.scott-a-s.com/files/ismm06.pdf):

"Streamflow uses segregated object allocation in thread-private heaps, as in several other thread-safe allocators including Hoard [3], Maged Michael’s lock-free memory allocator [18], Tcmalloc from Google’s performance tools [10], LKmalloc [15], ptmalloc [9], and Vee and Hsu’s allocator [25]. In particular, Streamflow uses strictly thread-local object allocation, both thread-local and remote deallocation and mechanisms for recycling free page blocks to avoid false sharing and memory blowup [3, 18]."

[3] E. Berger, K. Mckinley, R. Blumofe, and P. Wilson. Hoard: A Scalable Memory Allocator for Multithreaded Applications. In Proc. of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 117–128, Cambridge, MA, November 2000.

[9] Wolfram Gloger. Dynamic Memory Allocator Implementations in Linux System Libraries. http://www.dent.med.unimuenchen.de/wmglo/malloc-slides.html.

[10] Google. Google Performance Tools. http://goog-perftools.sourceforge.net/.

[15] P. Larson and M. Krishnan. Memory Allocation for Long-Running Server Applications. In Proceedings of the First International Symposium on Memory Management, pages 176–185, Vancouver, BC, October 1998.

[18] M. Michael. Scalable Lock-free Dynamic Memory Allocation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 35–46, Washington, DC, June 2004.

[25] V. Vee and W. Hsu. A Scalable and Efficient Storage Allocator on Shared Memory Multiprocessors. In Proceedings of the 1999 International Symposium on Parallel Architectures, Algorithms and Networks, pages 230–235, Perth, Australia, June 1999.

The earliest appears to be Larson and Krishnan from 1998. It appears that in the late '90s and early 2000s, the work was SMP-focused, aimed at servers. Then in the early to mid 2000s, people (including my advisor) started realizing this whole "multicore" thing was for real, and that system software would have to change.


I had read your paper :) but did not dig enough to find whether tcmalloc had introduced the concept or not. Thanks for pointing those out!


I wasn't sure where it appeared first, either! I had to dig out that old related work section. There may be work that predates the '98 reference, but it may not have gotten much attention. (I had assumed Hoard would be the first in the literature, but that's from 2000.) I think when it shows up is more related to the available hardware at the time, and what people were doing with it. It's not a huge stretch to imagine thread-local pools, but I don't think enough people were paying attention to the problem before then.


If you haven't spent much time reading about allocation strategies and are looking for a good place to start, _Dynamic Storage Allocation: A Survey and Critical Review_ is a fantastic starting point. One of my all-time favorite survey papers.

http://www.cs.northwestern.edu/~pdinda/ics-s05/doc/dsa.pdf


That's a good paper, but it's worth noting that most of the interesting work these days is in multithreaded memory allocators, which weren't as important in 2005. Scaling well under multithreading (i.e. not taking a global malloc lock) changes the design space considerably: you need to have per-thread heaps and rebalance them from time to time, which is itself a very interesting problem.


For anyone interested in this area, take a look at the memory allocator (and GC) for HotSpot. Specifically "Hierarchical PLABs, CLABs, TLABs in Hotspot" [0]

[0] http://cs.uni-salzburg.at/~hpayer/


Thanks for the link. Do you mind sharing some of these other all-time favorite survey papers?


On the subject of papers, for those who have not read them Jeff Bonwick's work on the Solaris slab allocator is well worth reading.

https://www.usenix.org/legacy/publications/library/proceedin...

https://www.usenix.org/legacy/event/usenix01/full_papers/bon...

Pretty much every OS kernel has adopted the techniques outlined in these papers.


These ideas are also excellent.

While slab allocators can be very fast and space-efficient, they are meant to be used in scenarios where you know the object sizes in advance, as each slab cache only allocates one kind of object. The kernel is a great use case, because you know which structs will be frequently allocated/destroyed, so you create a slab cache for just that one data structure (and others for other structs). And the kernel itself does not change frequently, nor does it allocate objects of sizes imposed by user space (as a file system or key-value store would have to).
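To make the slab idea concrete, here's a minimal single-slab cache for one object type, in the spirit of (but vastly simpler than) the kernel allocators described above. All names are invented, and real slab allocators add multiple slabs, object constructors, and per-CPU magazines:

```c
#include <stdlib.h>

/* One slab, one object size, free list threaded through the free
 * objects themselves. */
struct slab_cache {
    size_t obj_size;
    void *free_list;
    void *slab;        /* the backing memory */
};

static int slab_init(struct slab_cache *c, size_t obj_size, size_t nobjs) {
    if (obj_size < sizeof(void *)) obj_size = sizeof(void *);
    c->obj_size = obj_size;
    c->slab = malloc(obj_size * nobjs);
    if (!c->slab) return -1;
    c->free_list = NULL;
    for (size_t i = 0; i < nobjs; i++) {   /* push every object */
        void *obj = (char *)c->slab + i * obj_size;
        *(void **)obj = c->free_list;
        c->free_list = obj;
    }
    return 0;
}

static void *slab_alloc(struct slab_cache *c) {
    void *obj = c->free_list;
    if (obj) c->free_list = *(void **)obj;  /* pop */
    return obj;
}

static void slab_free(struct slab_cache *c, void *obj) {
    *(void **)obj = c->free_list;           /* push */
    c->free_list = obj;
}
```

Both alloc and free are a couple of pointer operations with zero per-object metadata, which is exactly why the technique is so fast when the object size is known up front.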

Memcached employs (or at least employed) its own allocator that is closely analogous to a slab allocator. It creates each cache not from knowing specific object sizes (it cannot - it is a key-value store), but from a range of sizes incremented by some runtime constant. This can lead to really severe memory bloat in some instances.

Memory allocators are not easy. Even "general-purpose" allocators are not truly meant for general-purpose allocation; the DSA paper mentioned above states (page 6 top left):

> It has been proven that for any possible allocation algorithm, there will always be the possibility that some application program will allocate and deallocate blocks in some fashion that defeats the allocator's strategy, and forces it into severe fragmentation


The SuperMalloc paper has some really interesting ideas about how to manage metadata. When you do a free(), you need to know the size of the underlying allocation. You also probably need access to some metadata relating to that allocation. Ideally, you even want to be a bit robust against someone freeing something that isn't a pointer to the beginning of a heap object.

Jemalloc uses a radix tree to map from addresses to the headers of the 1-2MB "chunks" it uses as the top level of allocation. SuperMalloc also allocates into 2MB chunks, but such that each chunk holds objects of only one size. It then uses an array with one double word per 2MB chunk in the address space, which holds the size of the objects in each chunk. That also allows you to quickly find the metadata, which is stored at the beginning of each chunk.

The array can get big (512MB for an array covering a 48-bit address space with 2MB chunks), but it's sparse, and the virtual memory system won't map it to real memory until each page is accessed.
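A hedged sketch of that array scheme (names are invented, and the demo array is a tiny stand-in rather than one spanning the whole address space):

```c
#include <stdint.h>

#define CHUNK_SHIFT 21   /* log2(2 MiB) */

/* Chunk number is just the address shifted down by the chunk size. */
static uint64_t chunk_index(const void *p) {
    return (uint64_t)(uintptr_t)p >> CHUNK_SHIFT;
}

/* In the real scheme this array has 2^48 / 2^21 = 2^27 entries of 4
 * bytes each (512MB of mostly-uncommitted address space); here it
 * covers only the first 256 chunks. */
static uint32_t obj_size_of_chunk[256];

static void register_chunk(uint64_t idx, uint32_t objsize) {
    obj_size_of_chunk[idx] = objsize;
}

/* free() can now find the object size with one shift and one load,
 * instead of a radix-tree walk. */
static uint32_t object_size_of(const void *p) {
    return obj_size_of_chunk[chunk_index(p)];
}
```

Any pointer into a chunk, not just one to the start of an object, maps to the right entry, which is what gives you the robustness mentioned above.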


I don't know how fast SuperMalloc is for other people, but in my experience it's not as fast as some other mallocs() (like Hoard or tcmalloc) and, in the end, it's still just another malloc().

Giving up the malloc() interface allows a very simple page-long allocator to outperform every malloc() I've ever tried (including SuperMalloc), and it would truly surprise me to learn there was some exotic memory technique (HTM, caching, etc) that could beat me.


Can you elaborate on what you mean by "Giving up the malloc() interface"?


The API for malloc is very simple:

  void* malloc(size_t len);
  void free(void* ptr);
Basically, you ask for an amount of memory and get back a pointer to it. But the interface does not allow you to communicate anything but the amount when allocating, and you can only hand back a pointer when freeing. If, for example, malloc had an interface like:

  void* malloc(size_t len, duration_t dur);
  enum duration_t { SHORT, MEDIUM, LONG};
One can imagine you could optimize based on the given hint. Once you open up one kind of hint, you can imagine many other kinds of things users could communicate explicitly to the memory allocator about the allocation and access patterns of their data.

All implementations of free also need to find the related metadata for the given pointer. We could also imagine an interface for free which required the caller to maintain that metadata, which could perhaps speed things up:

  void free(void* ptr, meta_t d);
Or, we could also imagine communicating how important it is to free this memory:

  void free(void* ptr, immediacy_t im);
  enum immediacy_t { IMMEDIATE, DELAYABLE };
In the latter case, maybe the memory allocator could put off doing delayable requests, and do them in a batch later. (Making it a bit more like garbage collection.)

I'm not sure any of these ideas would help, but the point is that because of the limited interface, we can't really explore them. We can, but then convincing the rest of the world to change such a basic building block of C code is quite hard.
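For what it's worth, the DELAYABLE idea sketched above fits in a few lines. This is a toy, not a real allocator: a fixed-size batch of pending frees that is drained when it fills up or on demand (all names invented):

```c
#include <stdlib.h>

typedef enum { IMMEDIATE, DELAYABLE } immediacy_t;

#define BATCH 64
static void *pending[BATCH];
static int npending;

/* Drain every delayed free in one go. */
static void flush_pending(void) {
    while (npending > 0)
        free(pending[--npending]);
}

/* IMMEDIATE frees happen now; DELAYABLE frees are queued and released
 * in a batch, either here when the queue fills or explicitly later. */
static void free_hinted(void *ptr, immediacy_t im) {
    if (im == IMMEDIATE) {
        free(ptr);
        return;
    }
    if (npending == BATCH)
        flush_pending();
    pending[npending++] = ptr;
}
```

Batching the frees like this amortizes the cost of taking the allocator's locks (or touching its metadata) over many objects, which is the garbage-collection-like win being alluded to.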


> In the latter case, maybe the memory allocator could put off doing delayable requests, and do them in a batch later. (Making it a bit more like garbage collection.)

Amusingly, when I've used an IMMEDIATE/DELAYABLE style hint, it was for the opposite purpose: I had some batched deallocs that I would either delay (to spread out over multiple frames instead of handling as a single batch, to eliminate the framerate hitch we were getting), or perform immediately as a single batch (to achieve greater throughput when switching scenes as delayed deallocation was adding untenable amounts of overhead.)

> We can, but then convincing the rest of the world to change such a basic building block of C code is quite hard.

Changing such a fundamental building block of C is impossible.

However, providing a second, alternative interface for those applications which could really benefit from such fiddly high-performance tweaks already happens a good bit, in games at least. Pool allocators, allocators with extra debug information, allocation of entirely different kinds of memory (e.g. write-combined memory for texture uploads)... lots of stuff out there. Some low-level graphics APIs now make you decide if e.g. you want to put shader constants in their own GPU buffers, or just interleave them into the command buffers themselves...


jemalloc has an alternative API that allows specifying the size of the allocation: sdallocx [1].

[1]: http://www.canonware.com/download/jemalloc/jemalloc-latest/d...


"jemalloc has an alternative API that allows specifying the size of the allocation"

I would like to see an API for malloc where you don't need to specify the size of the allocation :-)

For those who wonder: it takes additional flags specifying alignment, whether to zero memory, whether to store data in a thread-specific cache, or an arena to use.
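To see why handing the size back at free time can help at all, here's a toy size-segregated allocator (nothing here is jemalloc's actual implementation; names are invented, and it only handles sizes 1-64). A plain free() must first recover the size class from a header, while the sized variant skips that memory access entirely:

```c
#include <stdlib.h>

/* Four 16-byte-granular size classes. Blocks carry a 16-byte header
 * recording their class so plain my_free() can find it. */
#define NCLASSES 4

static void *free_lists[NCLASSES];

static int class_of(size_t size) { return (int)((size - 1) / 16); }

static void *my_malloc(size_t size) {
    int c = class_of(size);
    if (free_lists[c]) {                  /* fast path: pop cached block */
        void *p = free_lists[c];
        free_lists[c] = *(void **)p;
        return p;
    }
    int *hdr = malloc(16 + 16 * (size_t)(c + 1));
    if (!hdr) return NULL;
    *hdr = c;                             /* remember the class */
    return hdr + 4;                       /* skip the 16-byte header */
}

/* Sized free: the caller supplies the size, so no header read. */
static void my_sized_free(void *p, size_t size) {
    int c = class_of(size);
    *(void **)p = free_lists[c];
    free_lists[c] = p;
}

/* Plain free: must recover the class with an extra memory access. */
static void my_free(void *p) {
    int c = *((int *)p - 4);
    my_sized_free(p, 16 * (size_t)(c + 1));
}
```

In C++ this is why sized `operator delete` exists: the compiler statically knows `sizeof(T)` at most delete sites and can hand it to the allocator for free.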


Interesting idea.

Another avenue for improvement is to use alloca() for small, function-local objects, but there is no direct way to know if a variable is local or not (in C and C++ at least).

> We can, but then convincing the rest of the world to change such a basic building block of C code is quite hard.

It can be done with static analysis or binary instrumentation.


> but there is no direct way to know if a variable is local or not (in C and C++ at least).

If you only have one stack, and the stack is at the top of memory and it grows down, you can:

    int onstackp(void *x) { char a; return (char *)x > &a; }  /* hack: assumes a single, downward-growing stack */


That's so simple! Thanks a lot.


malloc(sizeof(foo)) is slower than alloc_foo() because the latter can simply be a pointer increment, but there is no way to tell the malloc() interface that you're going to be doing nothing but allocating foo for a while.

malloc() can guess this with heuristics, but a good malloc() needs to perform well for a wide variety of use cases: Surely you can appreciate that balance has a cost that the specialised allocator simply doesn't have to pay.
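To make the pointer-increment claim concrete, here's a hedged sketch of such an alloc_foo() over a fixed arena (all names invented; a real pool allocator would also handle freeing and arena growth):

```c
#include <stddef.h>

struct foo { double x, y; };

/* All foos come from one fixed arena; the entire fast path is a bounds
 * check plus a pointer increment. */
#define ARENA_OBJS 1024
static struct foo arena[ARENA_OBJS];
static size_t next_free;

static struct foo *alloc_foo(void) {
    if (next_free == ARENA_OBJS)
        return NULL;                 /* arena exhausted */
    return &arena[next_free++];
}
```

No size lookup, no free-list search, no locking: everything a general-purpose malloc() has to do per call is simply gone, because the caller communicated the type out of band.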


> SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap. It allocates 2 MiB chunks which contain objects all the same size.

Hmm, this must not interact well with huge pages: either all those (huge-page-sized) chunks actually take up physical memory, or you're stuck with 4 KiB pages.

Transparent huge pages on Linux make a noticeable performance difference in practice. In one of my applications, a coworker discovered that dTLB misses were a significant source of slowness. He pointed out that we had a lot of "holes" in our memory allocation caused by the malloc implementation calling madvise on 4 KiB pages when they were empty, thus preventing them from being coalesced into a 2 MiB huge page (or causing them to migrate back to small pages). Setting a command-line flag to prevent this decreased CPU usage by 7%.


HN discussion on the SuperMalloc paper: https://news.ycombinator.com/item?id=10770781


> SuperMalloc achieves these performance advantages using less than half as much code as the alternatives. SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap.

On the other hand, that can considerably increase the size of the page table, and the cost and frequency of TLB misses. So I'm not sure it's all a meaningful comparison.



