So what's wrong with 1975 programming? (2008) (varnish-cache.org)
207 points by nwjsmith on Dec 5, 2012 | 128 comments


> So what happens with squids elaborate memory management is that it gets into fights with the kernels elaborate memory management, and like any civil war, that never gets anything done.

This quote, much like various quantum mechanics quotes adopted by laymen, keeps haunting honest systems programmers because people with a little bit of knowledge read it, misinterpret (or misunderstand) it, and then share it.

Look, I don't know how Squid is designed, but most database systems use this strategy and it does not get into wars with the kernel for a whole slew of reasons that aren't addressed in the article. I know, because we've done a ton of sophisticated benchmarking comparing custom use case cache performance to general purpose page cache performance. Here are a few of the many, many reasons why this quote cannot be applied to sensibly designed pieces of systems software:

1. If the database/proxy/whatever server is designed correctly, it'll always use just enough RAM that it won't go into swap. That means the kernel won't magically page out its memory preventing it from doing its job.

2. In fact, kernels provide mechanisms (such as mlock) to guarantee this.

3. Also, if your process misbehaves, modern kernels will just deploy the OOM killer (depending on how things are configured), so you can't just get into fights with the page cache without being sniped.

4. Of course you have to be smart and read from the file in a way that bypasses the page cache (via O_DIRECT). Yes, it complicates things greatly for systems programmers (all sorts of alignment issues, journaling filesystem issues, etc.) but if you want high performance, especially on SSDs, and have special use cases to warrant it, it's worth it.

5. If you really know what you're doing, a custom cache can be significantly more efficient than the general purpose kernel cache, which in turn can have a significant impact on the performance bottom line. For example, a b-tree aware caching scheme has to do less bookkeeping, is more efficient, and has more information to make decisions than the general purpose LRU-K cache.
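To make point 5 a bit more concrete, here is a minimal sketch (Python, for brevity) of what "b-tree aware" could mean: interior pages stay pinned once loaded, and only leaf pages compete under LRU. The class name, the `load` callback, and the pin-the-interior policy are all assumptions made for illustration, not how any particular database does it.

```python
from collections import OrderedDict


class BTreeAwareCache:
    """Hypothetical cache: interior pages are pinned, leaf pages are LRU-evicted."""

    def __init__(self, leaf_capacity):
        self.leaf_capacity = leaf_capacity
        self.interior = {}            # pinned: never evicted in this sketch
        self.leaves = OrderedDict()   # oldest entry first

    def get(self, page_id, is_leaf, load):
        if not is_leaf:
            # Interior pages are touched on every lookup, so keep them resident.
            if page_id not in self.interior:
                self.interior[page_id] = load(page_id)
            return self.interior[page_id]
        if page_id in self.leaves:
            self.leaves.move_to_end(page_id)   # mark as most recently used
            return self.leaves[page_id]
        page = load(page_id)
        self.leaves[page_id] = page
        if len(self.leaves) > self.leaf_capacity:
            self.leaves.popitem(last=False)    # evict the least recently used leaf
        return page


loads = []


def load(page_id):
    """Stand-in for a disk read; records which pages were fetched."""
    loads.append(page_id)
    return f"page-{page_id}"


cache = BTreeAwareCache(leaf_capacity=2)
for leaf in ("a", "b", "c"):
    cache.get("root", False, load)   # every lookup walks through the root
    cache.get(leaf, True, load)
cache.get("a", True, load)           # "a" was evicted when "c" arrived
```

The root is fetched once no matter how many leaves churn through, which is the kind of decision a general purpose page cache cannot make because it does not know which pages are interior nodes.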

In fact, it is absolutely astounding how many 1975 abstractions translate wonderfully into the world of 2012. Architecturally, almost everything that worked back then still works now, including OS research, PL research, algorithms research, and software engineering research -- the four pillars that are holding up the modern software world. Some things are obsolete, perhaps, but far, far fewer than one might think.

Incidentally, this is also one of the reasons why I cringe when people say "the world is changing so fast, it's getting harder and harder to keep up". In matters of fashion, perhaps, but as far as core principles go (in computer science, mathematics, human emotions/interaction, and pretty much everything else of consequence) the world is moving at a glacial pace. Shakespeare might be a bit clunky to read these days because the language is a bit out of style, but what Hamlet had to say in 1600 is, amazingly, just as relevant today (and likely much more useful, because instead of actually reading Hamlet, most people read things like The Purple Cow, The 22 Immutable Laws of Marketing, The 99 Immutable Laws of Leadership, etc.)


As others have noted the rant is from 2008, which is interesting because this was the transition point from 32 bit OSes to widespread adoption of 64 bit OSes [1]. You can see lots of folks who are just getting their feet wet with 64 bit Linux [2], and configured memory sizes are getting up over 4GB of 'real' memory.

One of the wonderful things about 64 bit address spaces? You don't ever have to re-use an address. Once folks figure that out, you can do some amazing things that would have seemed stupid in the '70s: you can hard-code the address for every single library function on your machine. Can you even imagine how weird that would be? Linking time would be instantaneous. Calling printf? It's always at 0x10015081aaf10000. All of libc? Starting at 0x1000000000000 and working up. One giant database of the 'standard place' to put every single function. Remember when the 32 bit OS would put the kernel at 0x80000000? You know, right on the 2G border: above that, kernel space; below that, user space.

Anyway, I completely concur that abstractions that worked before, work wonderfully today. But I also don't worry about an array of 100M items on a server in RAM anymore. Using mmap to map a 1G file into the address space? Not a problem.

It's interesting to watch the behavior of systems when they are essentially all in memory.

[1] http://www.tomshardware.com/reviews/vista-workshop,1775-3.ht...

[2] http://blekko.com/ws/?q=64+bit+linux+%2Fdate%3D2008


> One of the wonderful things about 64 bit address spaces? You don't ever have to re-use an address.

Never say never. Current x86-64 CPUs only support 48 bits of address space, which you could scan through in under two hours on a 2.5GHz machine (at 16 bytes/cycle). 48 bits is a lot, but not infinite.

If and when it actually is 64 bits, that will be a different story. :)
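For anyone who wants to check the arithmetic, a quick sketch using only the figures quoted above (2.5GHz, 16 bytes/cycle); the function name is made up, and the 64-bit line simply reruns the same estimate with more bits.

```python
CLOCK_HZ = 2.5e9        # 2.5 GHz, as quoted above
BYTES_PER_CYCLE = 16    # throughput assumed in the estimate above


def scan_seconds(address_bits):
    """Seconds needed to touch every byte of a 2**address_bits space once."""
    return (2 ** address_bits) / (CLOCK_HZ * BYTES_PER_CYCLE)


hours_48 = scan_seconds(48) / 3600
years_64 = scan_seconds(64) / (3600 * 24 * 365)

print(f"48-bit space: {hours_48:.2f} hours")
print(f"64-bit space: {years_64:.1f} years")
```

So 48 bits comes out just under two hours, while a full 64 bits would take well over a decade at the same rate.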


True, the physically addressable memory is "only" 256TB but the virtual space is still 64 bits. The point in time when the 256TB becomes a problem is when your working set (stuff resident in memory) is ideally more than 256TB. We've got mostly 96GB machines and those are modestly hard to exhaust. I don't think even a 1TB physical memory machine would be exhaustible with our current data sets.

Even today, reading in a 32GB index shard from SSD still takes a couple of minutes.


Amen to that!

I remember how wonderful it was to have this flat 32-bit address space, where each memory location could be unambiguously identified by a single numeric address, and you did not have to make any crazy distinctions between short and long pointers.


0x80000000 on Windows. Linux i386 sets the boundary at 0xc0000000 :) This is yet another aspect where Linux made a better choice than Windows, as this gives processes 3GB of virtual memory instead of only 2GB.


Incidentally, you can change this on Windows i386 in the boot.ini [1] file using the /3GB option. This is for exactly the case where you want your database software (for instance) to be able to address more user-mode memory.

As you might imagine, it occasionally trips up badly written kernel-mode code that makes assumptions about what type of address it's dealing with based on whether the MSB is set or not.

[1] http://support.microsoft.com/kb/833721
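A toy illustration of the MSB assumption going wrong (Python just to show the bit arithmetic; the function names are hypothetical and real kernel code would not look like this):

```python
SPLIT_2G = 0x80000000   # default 32-bit Windows user/kernel boundary
SPLIT_3G = 0xC0000000   # boundary with /3GB (also the usual Linux i386 split)


def is_kernel_by_msb(addr):
    """The shortcut: any address with bit 31 set is assumed to be kernel space."""
    return bool(addr & 0x80000000)


def is_kernel(addr, split):
    """Compare against the boundary actually in effect."""
    return addr >= split


addr = 0x90000000   # between 2GB and 3GB

# With the default split both checks agree: this is a kernel address.
agree_2g = is_kernel_by_msb(addr) == is_kernel(addr, SPLIT_2G)

# With /3GB the same address is user space, but the MSB shortcut still says "kernel".
msb_says_kernel = is_kernel_by_msb(addr)
really_kernel_3g = is_kernel(addr, SPLIT_3G)
```

The shortcut is only correct when the boundary happens to sit exactly at 2GB, which is why moving it exposes the bug.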


Ah yes, I remember that 6 months of Moore's law where my Windows XP32 machine had 4GB of RAM. My code project grew to need the /3GB switch in order to compile successfully (until we just moved to 64 bit dev boxes).


0xC0000000 on Linux. Mac OS X i386 sets the boundary at 0x00000000 :) This is yet another aspect where Mac OS X made a better choice than Linux, as this gives processes 4GB of virtual memory instead of only 3GB.

Seriously: there are trade offs everywhere. Initially, a 2/2 split seemed good enough, it makes it easier to dicsciminate between user and kernel addresses, the difference between 2GB and 3GB is not that high (about a year in my restatement of Moore's law "a bit of address space every 18 months"), and Microsoft values backward compatibility higher than Linux does.


But this extra 1GB, that Mac OS X's 4GB/4GB split has over a 3GB/1GB split, is very costly as it forces a page table switch for each syscall. This is extra overhead that Linux doesn't have. Ingo Molnar, who developed a 4GB/4GB patch for Linux, explained this performance hit: http://lwn.net/Articles/39283/

This reinforces the fact that Linux's choice seems to be the better tradeoff between what Windows and Mac OS X do.


http://blogs.msdn.com/b/oldnewthing/archive/2004/08/06/20984...:

"One of the adverse consequences of the /3GB switch is that it forces the kernel to operate inside a much smaller space.

One of the biggest casualties of the limited address space is the video driver. To manage the memory on the video card, the driver needs to be able to address it, and the apertures required are typically quite large. When the video driver requests a 256MB aperture, the call is likely to fail since there simply isn't that much address space available to spare."

As I said: there are trade offs everywhere. AFAIK, Linux has not found a magic way round this. For example, http://us.download.nvidia.com/XFree86/Linux-x86/173.14.12/RE... states:

"The NVIDIA kernel module requires portions of the kernel's virtual address space for each GPU and for certain memory allocations. If no more than 128MB are available to the kernel and device drivers at boot time, the NVIDIA kernel module may be unable to initialize all GPUs, or fail memory allocations"


Yes there is a workaround: use the vmalloc=xxx parameter to increase it to a value greater than 128MB. This will work in most cases where the reserved virtual space can almost always at least be increased up to 256MB. I had to use this parameter myself 7 years ago when I was running 50-100 QEMU VMs on a 32-bit host with 32GB RAM and PAE.

The only cases where vmalloc=xxx won't solve your virtual address space shortage are exceedingly rare software/hardware configurations (100+ QEMU VMs, or probably more than about 4 Nvidia GPUs, since they seem to use ~64MB each).

So all in all, yes, for 99%+ of users not running these rare configs, the 3GB/1GB split is just fine. Which is why Linux never changed this default. Microsoft has more problems with a 3GB/1GB split because their kernel is less space-optimized (mostly due to bad coding practices: too many Windows driver/kernel developers assume that 2GB is available). I maintain my opinion that 3GB/1GB is a better tradeoff.


Would it be possible to change this in OS X without making the whole user-land incompatible?


Linux can be configured to set the boundary at 0x80000000. 3G/1G might be more common, but "Linux" didn't make any choice at all here.


Linux made a choice: the choice of what the default should be (3G/1G).


Strongly agree with all of this. I think PHK is right that you don't want user-space and the kernel to fight, but the solution isn't "leave it all to the kernel," it's "leave it out of the kernel."

The fact that you can outperform the kernel's abstractions (as you mentioned) clearly demonstrates that the kernel's algorithms aren't "one size fits all." So why are they in the kernel at all? Thank goodness that facilities like O_DIRECT let you bypass them (even though Linus thinks you should never do this: https://lkml.org/lkml/2007/1/10/233).

I am not arguing that everyone should roll their own memory management. We have a lot of sophisticated ways of sharing code in user-space, like shared libraries. The beautiful thing about user-space libraries is that you voluntarily choose them based on the value they provide to you. If caching library A gets one-up'd by caching library B, competitive forces will allow the upstart competitor to displace the incumbent. But OS kernels are in a privileged position; they have exclusive access to the hardware. It's not easy to just drop in a competing OS scheduler, even though other approaches may be better than what Linux does. That's why kernels should contain as little functionality as possible.

Another reason is accounting and isolation. The resources used by kernel-space are shared between processes, which can lower overall resource use but also means that no user-space process can be "charged" for them or guaranteed an isolated share of them. For this reason, some systems I've been in contact with lately actually do all disk caching in user-space, so that (for example) a huge read from one process can't evict all the pages from someone else's process.


BTW, that's called the Exokernel. http://en.wikipedia.org/wiki/Exokernel

The problem with using Unix like an Exokernel, as I see it, is that Unix really doesn't want to be an Exokernel.


The anecdote about databases really isn't true.

Here's a recent (54 day old) problem where PostgreSQL tweaks were at odds with a Linux kernel update[1], and the HN post to go with it.[2]

[1] http://lwn.net/Articles/518330/ [2] http://news.ycombinator.com/item?id=4640529


Sounds like an awful amount of work to bypass kernel caching and swapping to revert back to how things are without the virtual memory abstraction.

I'd just start as the author described. Then, if and when I know that I know better than the kernel, I might implement some sort of userspace caching instead, provided that it won't hamper overall performance. I might just mlock critical btree nodes or indexes and still leave data blobs or less frequently used structures under vmem.

Virtual memory is one of the things that surprisingly just works and does that without many drawbacks. I don't see the point of rejecting its offering straight off the bat.


The second thing you should probably do is use hinting mechanisms like posix_fadvise() to help the kernel do the right thing.
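A minimal sketch of what such hinting can look like, using Python's `os.posix_fadvise` wrapper (available on Linux and other Unixes that implement it, hence the `hasattr` guard). The function name, chunk size, and temp file are illustrative assumptions; whether the hints help is workload dependent and the kernel is free to ignore them.

```python
import os
import tempfile


def read_sequentially(path, chunk_size=1 << 20):
    """Read a file front to back while hinting the access pattern to the kernel."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            # offset=0, length=0 means "the whole file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        if hasattr(os, "posix_fadvise"):
            # We do not expect to re-read this data, so the cached pages may be dropped.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total


# Small demonstration on a throwaway 3 MiB file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (3 * 1024 * 1024))
    demo_path = tmp.name

bytes_read = read_sequentially(demo_path)
os.unlink(demo_path)
```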


I'd say the kernel going on a shooting spree wherein the OOM killer starts taking out innocent processes as they touch memory pages they were legitimately allocated is a perfect example of "getting into wars with the kernel".


My point was that (a) if the server actually goes to war with the kernel, on modern setups it will usually be taken out by the OOM killer instead of just getting slower like the article states, and (b) most database servers run just fine without being taken out by the OOM killer, which means they're designed in a way that actually does not go to war with the kernel.


Good point. Wars between userspace processes and the kernel are very short.


The first few lines mentioned "acoustic delay lines" which piqued my interest. Wikipedia has a page on this old technology: http://en.wikipedia.org/wiki/Delay_line_memory#Acoustic_dela...

It was a pretty amazing hack, before magnetic memory cores. Because sound moved at a slow rate through a medium like mercury, an acoustic wave (that is, a sound) could be applied to one side of a volume of mercury and be expected to arrive at the other end after a predictable, useful delay. So what would be done is that a column of mercury with transducers on both ends would function as speakers and microphones, which in an acoustic medium are the equivalent of read and write heads!

The system memory would be a collection of these columns, each I guess storing one bit. The memory would of course have to be refreshed: when the signal arrived at the other end, it would be fed back into the column, assuming I suppose that there wasn't a new signal waiting to be written to that bit instead. The article mentions that this was not randomly accessible memory, but rather serially accessible. From that and other bits of information, I gather that the device would visit each bit in sequence, according to some clock, and produce a signal on the read line corresponding to the value in that bit. You had to wait for the memory device to read out the particular bit you were waiting for.

Does anyone know if this is a correct understanding of how this kind of storage worked? What a cool way to store bits!
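Here is a toy model of the serial-access behaviour as I understand it from the description above, assuming many bits circulate in one line and a counter tracks which one is at the read end. It is a sketch of the idea only, not a model of any real machine, and all names are made up.

```python
from collections import deque


class DelayLine:
    """Toy recirculating memory: one bit emerges per clock tick and is fed back in."""

    def __init__(self, n_bits):
        self.line = deque([0] * n_bits)   # bits in flight; index 0 emerges next
        self.position = 0                 # address of the bit at the read end
        self.ticks = 0                    # master clock

    def _tick(self, new_bit=None):
        bit = self.line.popleft()         # pulse arrives at the receiving transducer
        # Re-send the same bit, or replace it if a write is pending.
        self.line.append(bit if new_bit is None else new_bit)
        self.position = (self.position + 1) % len(self.line)
        self.ticks += 1
        return bit

    def _wait_for(self, address):
        while self.position != address:   # no random access: just wait
            self._tick()

    def read(self, address):
        self._wait_for(address)
        return self._tick()

    def write(self, address, bit):
        self._wait_for(address)
        self._tick(new_bit=bit)


line = DelayLine(8)
line.write(5, 1)
value = line.read(5)        # has to wait almost a full loop for bit 5 to come around
ticks_used = line.ticks
```

In this model the write takes 6 ticks and the read another 8, which is the "wait for your bit to come around" cost described above.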


The Hodges biography of Turing has lots of in-passing mention of fascinating technology like this. I think my favourite was the use of a CRT as a memory array (by picking up the charge on the fluorescent screen and feeding it back to the electron gun to refresh it!) which suggested to Turing the idea of using light to stimulate the feedback cycle and thus writing directly to memory with a very real "light pen"!

I'm probably borking up the details there but my point that the biog is great stands.

I guess there are readers of HN who never encountered the later "light pens." A photodiode picks up the raster on a CRT based monitor and with appropriate timing logic uses this to decide where to draw pixels. I had a cheap light pen on an 80s microcomputer before I ever got my hands on a mouse.

Ah, happy (but often frustrating) days...


> A photodiode picks up the raster on a CRT based monitor and with appropriate timing logic uses this to decide where to draw pixels

Incidentally this is how "light guns" worked, and also why they sadly do not work on LCD or plasma screens.


Once I made a program that was like duck hunt, except you held a webcam at the screen. When you fired, the screen flashed black with a patch of white where the duck was, the camera took a snapshot and looked for how close to the center of the webcam's frame the white was to decide if you had hit it or not. It worked surprisingly well.
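The hit test can be sketched roughly like this, assuming a grayscale frame stored as a list of rows; the threshold, tolerance, and helper names are invented for the example rather than taken from the original program.

```python
def white_centroid(frame, threshold=200):
    """Return the (row, col) centroid of bright pixels, or None if there are none."""
    row_sum = col_sum = count = 0
    for r, line in enumerate(frame):
        for c, value in enumerate(line):
            if value >= threshold:
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        return None
    return row_sum / count, col_sum / count


def is_hit(frame, tolerance=0.15):
    """A hit if the white patch sits near the centre of the camera frame."""
    centroid = white_centroid(frame)
    if centroid is None:
        return False
    height, width = len(frame), len(frame[0])
    # Offsets from the frame centre, normalised by the frame size.
    dr = (centroid[0] - (height - 1) / 2) / height
    dc = (centroid[1] - (width - 1) / 2) / width
    return (dr * dr + dc * dc) ** 0.5 <= tolerance


def make_frame(height, width, top, left, size):
    """Black frame with one white square, standing in for the webcam snapshot."""
    frame = [[0] * width for _ in range(height)]
    for r in range(top, top + size):
        for c in range(left, left + size):
            frame[r][c] = 255
    return frame


centered = is_hit(make_frame(40, 60, 17, 27, 6))   # patch in the middle of the view
off_target = is_hit(make_frame(40, 60, 0, 0, 6))   # patch in the corner
no_patch = is_hit([[0] * 60 for _ in range(40)])   # camera not pointed at the screen
```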


Awesome! I think some light gun games used this method too :)


> Because sound moved at a slow rate through a medium like mercury

Actually, it moves pretty fast through mercury; 1450 m/s.

> each I guess storing one bit. (...) Does anyone know if this a correct understanding of how this kind of storage worked?

From the Wikipedia article:

> Typically many pulses would be "in flight" through the delay, and the computer would count the pulses by comparing to a master clock to find the particular bit it was looking for.

> EDSAC, designed to be the first stored-program digital computer, began operation with 512 35-bit words of memory, stored in 32 delay lines holding 576 bits each

> The average access time was about 222 microseconds

Hmm... That would be a memory loop time of 444 us. At a speed of 1450 m/s, you need a delay line at least 64.38 cm long. And you'd need an impressive 1.3 Mb/s data transmission rate.
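The same arithmetic spelled out, using only the figures quoted from the article (1450 m/s, 222 microseconds average access, 576 bits per line) and assuming the average access time is half a loop:

```python
SPEED_OF_SOUND_MERCURY = 1450.0   # m/s, figure quoted above
AVERAGE_ACCESS_S = 222e-6         # average access time from the article
BITS_PER_LINE = 576               # bits held in one delay line

# Assumption: on average you wait half a loop for the bit you want.
loop_time_s = 2 * AVERAGE_ACCESS_S

length_cm = SPEED_OF_SOUND_MERCURY * loop_time_s * 100
bit_rate = BITS_PER_LINE / loop_time_s

print(f"loop time: {loop_time_s * 1e6:.0f} us")
print(f"line length: {length_cm:.2f} cm")
print(f"bit rate: {bit_rate / 1e6:.2f} Mb/s")
```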

Another interesting fact:

> Since the speed of sound changes with temperature (because of the change in density with temperature) the tubes were heated in large ovens to keep them at a precise temperature. Other systems instead adjusted the computer clock rate according to the ambient temperature to achieve the same effect.


For an idea of scale, this page includes a photo of Maurice Wilkes next to the mercury delay lines of EDSAC: http://amturing.acm.org/info/wilkes_1001395.cfm

The tubes are 5 feet long (the whole tank is about 6' including the transducers at each end). I think that tank holds 16 tubes, so I'm guessing from the figures in geon's post that there must have been another tank too.


An aside, I believe the early computer witnessed by Lawrence Waterhouse in Cryptonomicon used this mercury column system, it was described as visibly moving up and down inside the columns.

I hadn't heard of anything like it at the time, it was an intriguing idea - I wasn't even sure Stephenson hadn't made the whole thing up.


I think the version in Cryptonomicon was based on actual sound waves (in tubes filled with air) rather than in mercury, as the book mentions the noise coming from his laboratory.


Earlier discussion, fwiw (though it was 2 1/2 years ago): http://news.ycombinator.com/item?id=1554656

Among other things, contains an interesting alternate perspective from a former Squid developer, about some of Squid's design decisions, some of which were driven by a goal of being maximally cross-platform and compatible with all possible clients/servers. Others were driven by the fact that Unix VM systems were actually not very good much more recently than 1975, like in the 1990s.


> Well, today computers really only have one kind of storage, and it is usually some sort of disk, the operating system and the virtual memory management hardware has converted the RAM to a cache for the disk storage.

I used to think that too. Specifically Windows NT was said to need a pagefile at least as large as physical RAM. This was back when a workstation might have 16MB RAM and a 1GB disk. I thought this was because the kernel might be eliminating the need for some indirection by direct mapping physical RAM addresses to pagefile addresses. I was wrong.

On the Linux side, you would typically see the recommendation to make a swap partition "twice the size of RAM". Despite the possibility of using swap files, most distros still give dire warnings if you don't define a fixed-size swap partition on installation.

I don't think there was ever a solid justification for this "twice RAM" heuristic. A better method might be something like "max amount of memory you're ever going to need minus physical RAM" or "max amount of time you're willing to be stuck in the weeds multiplied by the expected disk bandwidth under heavy thrashing".
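Both alternatives are easy to put numbers on. A small sketch, reading the second one as a size (stall budget times thrashing bandwidth); the function names and the example figures (16 GiB RAM, 24 GiB peak, 60 seconds, 50 MB/s) are made up for illustration.

```python
GIB = 1024 ** 3


def swap_for_peak_demand(peak_bytes, ram_bytes):
    """Heuristic 1: cover whatever peak memory demand exceeds physical RAM."""
    return max(0, peak_bytes - ram_bytes)


def swap_for_stall_budget(max_stall_seconds, thrash_bytes_per_second):
    """Heuristic 2: no more swap than the disk can move within the stall you tolerate."""
    return int(max_stall_seconds * thrash_bytes_per_second)


by_demand = swap_for_peak_demand(peak_bytes=24 * GIB, ram_bytes=16 * GIB)
by_stall = swap_for_stall_budget(max_stall_seconds=60,
                                 thrash_bytes_per_second=50 * 10**6)

print(f"peak-demand heuristic: {by_demand / GIB:.1f} GiB")
print(f"stall-budget heuristic: {by_stall / GIB:.1f} GiB")
```

With these example numbers the two heuristics disagree by a wide margin, and both differ from "twice RAM" (32 GiB here), which says more about the workload assumptions than about the rule of thumb.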

Regardless, if your server is actively swapping at all you're probably doing it wrong. It's not just that swapping is slow, it's that your database or your web cache has special knowledge about the workload that, in theory, should allow it to perform caching more intelligently.

I'd prefer to disable swap entirely, but there are occasions where it can make the difference in being able to SSH into a box on which some process has started running away with CPU and RAM.

But this guy is a kernel developer so he seems to feel that the kernel should manage the "one true cache". I like the ease and performance of memory-mapped files as much as the next guy, but I wouldn't go sneering at other developers for attempting to manage their disk IO in a more hands-on fashion.


On NT, if the kernel bugchecks and it's configured to do a full or kernel memory dump, it dumps to the pagefile [1] (which gets copied elsewhere after you reboot). If you were looking for another good reason to have a pagefile the size of your physical memory on NT, there you go :-)

[1] http://support.microsoft.com/kb/254649


Yep, me and like 3 other guys I know actually do look at kernel minidumps and have a big pagefile for that reason.


On at least one (admittedly, misconfigured) system of mine, Linux got up to a working set around 4x physical RAM before kswapd became CPU bound instead of disk IOPS bound and everything stopped working. Assuming that anecdote is the usual outcome, you can justify the RAM:Swap ratio rule of thumb as an upper limit on how much swap can be usefully used for anything other than a band-aid on a memory leak.

edit: NT really does need substantially more swap than most Linux configurations, as it always runs with the Linux equivalent of overcommit disabled and a high swappiness.


That's how the story plays out for me on a desktop system if a single process starts running away with memory.

But on a production server this is guaranteed to happen at the worst possible time (i.e. at the peak of the daily load cycle). As soon as you start swapping, it starts slowing down and the outstanding transactions begin stacking up. The response time goes all hockey-stick shaped and it's a death spiral.


If you have overcommit disabled and a process that is using most of your RAM (ahem, Firefox) wants to fork+exec, you will need a large amount of swap. This is not a problem for most Linux users since they run with overcommit on, but it could be a problem on more rigorous OSes. Just one of those dark corners that people don't think about.


Is not suspend to disk in Linux normally done with a swap partition? My understanding is that the swap size heuristic is influenced by that.


Yes, also in Windows. So it's another reason to have a swap partition, but not to actually swap to it in normal operation.


No harm swapping to it if you have it, as far as I can tell.


If you have a server attempting to process a steady stream of incoming transactions at the edge of its capacity and it begins turning memory IO operations into disk IO (which is 1000 times slower), that is harmful to response times.


There probably isn't a lot of sense in having swap for suspend to disk on a server anyway of course.

I've only heard the heuristic in the context of workstations/laptops.


tl;dr - Premature optimization is (still) the root of all evil.

I'm not familiar with squid, but I'm quite familiar with the idea of programmers writing their own systems on top of other systems that are basically a worse implementation of something the underlying system is already doing.

To my chagrin, I occasionally catch myself doing this sort of thing when I'm first moving into a new language/API/concept and don't really understand what is going on underneath.

It is always a good idea to try the simplest thing that could possibly work first, and then measure it, and only then try to improve it and always make sure you measure your "improvements" against the baseline. And make sure you're measuring the right things. I think this is a concept most developers are aware of but one of those things you have to constantly checklist yourself on because it is too easy to backslide on.


Except that the primary purpose of a web cache is to utilize storage to avoid duplication of work. Caching is a fundamental operation of the storage hierarchy, so as soon as the possibility exists that your cached data will exceed available physical RAM, you're "optimizing" disk IO whether you admit it or not.


One reason to reimplement something on top is portability. The tradeoffs between portability and performance are hard.


It's still generally a bad idea, it's much better to write an interface and then provide multiple implementations that take advantage of the native features of the OS.

I think we all remember the last language that decided to be so portable they shipped their own reimplemented GUI toolkit with the standard library.


> it's much better to write an interface and then provide multiple implementations

That is a reimplementation.


One thing to keep in mind with this talk of virtual memory is that current smartphones and tablets have basically regressed to the 1975 model when it comes to swap. That is to say, there isn't any. If you take the approach of "allocate plenty of memory and let the kernel sort out what should go to disk" then you'll end up killed by the OS if you're running on e.g. an iPhone, because there's no swap.


Coming from iOS:

You still get memory-mapped file I/O. There's no swap file, but you can still map files into virtual memory and the OS can page pieces in and out as necessary.

Virtual memory isn't just about swap.


The point is: you still have to make a conscious move to prevent memory overload.


I know that VM is more than just swap, but the way it was used in this article, it pretty much only discussed swap.


"Varnish allocate some virtual memory, it tells the operating system to back this memory with space from a disk file."

I'm pretty sure the article's talking specifically about mapping a gigantic file into memory and pretending it's all in RAM, and isn't talking about swap at all.
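For readers who haven't used it, this is roughly what "a pointer into virtual memory and a length" backed by a disk file looks like. A minimal sketch with Python's `mmap` module; the 16 MiB size and the temp file are arbitrary choices for the example, and this is not Varnish's actual code.

```python
import mmap
import tempfile

SIZE = 16 * 1024 * 1024   # 16 MiB backing file, chosen only for illustration

with tempfile.TemporaryFile() as f:
    f.truncate(SIZE)   # extend the file; sparse where the filesystem allows it
    with mmap.mmap(f.fileno(), SIZE) as m:
        # Plain memory writes; the kernel decides when pages reach the disk.
        m[0:5] = b"hello"
        m[SIZE - 5:SIZE] = b"world"
        head = bytes(m[0:5])
        tail = bytes(m[SIZE - 5:SIZE])
        m.flush()      # ask for dirty pages to be written back now
    # The same bytes are visible through ordinary file I/O.
    f.seek(0)
    on_disk = f.read(5)
```

No swap is involved anywhere: the pages are backed by the file, so the kernel can drop and re-read them as memory pressure dictates.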


It's hard to tell whether they mean that it's done explicitly or not:

"...all we need to have in Varnish is a pointer into virtual memory and a length, the kernel does the rest."

If you're manually memory mapping stuff, you'd need more than that. In any case, the first part of the article is definitely talking about swap when it comes to fighting with the kernel over whether something should be in RAM or on disk. Explicitly memory mapping a large file will work on iOS, although the lack of sparse file support on the filesystem would seem to make it painful.


Wait a minute -- I'm not a sysadmin guy, but all the servers I've ever dealt with had swapping / virtual memory turned off. Because you'd rather a web request failed than start churning things on disk.

When you're dealing with a web cache, don't you want to explicitly know whether your cache contents are in memory or on disk, and be able to fine-tune that? It seems like the last thing you want is the OS making decisions about memory vs disk for you. Am I missing something?


Yes you are missing something. Varnish doesn't use the swap. Varnish uses regular files, cached in memory via the kernel's regular buffercache mechanism.


> When you're dealing with a web cache, don't you want to explicitly know whether your cache contents are in memory or on disk, and be able to fine-tune that?

The point of this article is that you think you want to control that, but you actually don't. The kernel can probably do a fine job. (Of course, PHK develops the kernel, so when it doesn't do what he wants he can just change it. Many others are not so lucky.) PHK is telling normal programmers to follow the rules; extraordinary ninja rockstar programmers are smart enough to know when to break the rules. If you want extremely high performance you should manage everything yourself, but this is so difficult to do right that documenting how to do it would just encourage people to shoot themselves in the foot.

On a practical note, which is faster, HAProxy or Varnish?


HAProxy isn't a cache. It's not a good comparison.


This article is one of those interesting things that doesn't affect me directly because I don't do systems programming, but holds a great deal of fascination. I've often wondered about how the kernel allocates memory and deals with disk, and how that affects the behavior of an application that may do its own memory allocation.

In object oriented programming there is a thing called a CRC card[1] where you list what the responsibilities of important classes are. This helps the developer visualize and understand how the system works, and to keep things as orthogonal as practical. Here we have an example of someone pointing out that the system-level "CRC cards" are stepping on each other's toes. Pretty compelling stuff.

An aside - would there be any benefit to using `go` rather than `c` for writing something like varnish if you were starting in 2012?

[1] http://en.wikipedia.org/wiki/Class-responsibility-collaborat...


"An aside - would there be any benefit to using `go` rather than `c` for writing something like varnish if you were starting in 2012?"

go isn't fast yet, from what I can tell. So, you'd be writing a slow proxy, for the time being.

But, from a simplicity of design perspective, yes, it'd be awesome to work in a language with really good concurrency primitives. Proxies are an ideal example of an embarrassingly parallel problem; a thousand simultaneous users is a thousand independent tasks with very little shared state. go is designed for exactly this sort of task. And, in five years, by which time go will probably be really fast, you'll have a simple and fast proxy server.


Go's pretty damn fast now. It's fast enough that we've got physicists writing simulations in it. It's fast enough that a lot of the people I know (systems and HPC guys) just write in Go unless they have to write in anything else. I've personally used it to write cluster management tools, a cpu-intensive simulation, a modem emulator, and a decent handful of servers.

If nothing else, it's so easy to do concurrency that I end up hiding a lot of slowness.


Is it pure Go, or are they offloading the more demanding computations to external libraries written in other languages, like people tend to do with Python?


> It's fast enough that we've got physicists writing simulations in it.

Cool, source?


Sadly not available yet, sorry. It's a hassle to get new projects open-sourced, and given that HPC codes traditionally require that you have a million-dollar supercomputer anyway...


I meant the source of what you mentioned, not the source code :) Like are there any articles or papers?


I might be missing something here but assuming varnish is using mmap()+madvise(), accessing memory might block the thread until the page fault is served, which is not ideal for a user-facing server.

If you manage your own memory/swap, at least you can use async IO and free up the thread while the IO request is being served by the OS.


There are ways to be more clever than just mmap() + madvise() to improve performance in cases that might block (e.g. mincore() and some other system interfaces) but the further down that road you go the less portable and more complicated the implementation becomes.

In short, mmap() and friends were designed for a world where context-switching (e.g. multithread or multiprocess) is a great idea. Unfortunately, context-switching has become extraordinarily expensive for a lot of server software which makes multithreading a less palatable option.

These days, you want (1) native async I/O, (2) strict control of cache replacement behavior, (3) strict control of I/O scheduling, (4) minimal context-switching, and (5) memory locality control. On Linux, this means O_DIRECT, io_submit, managing your own physical cache RAM, and locking one thread to every core (ignoring hyperthreads for the moment). This is more complicated to implement because there is no simple, portable interface like mmap() widely available, but it is also much more efficient and performant when done correctly. To make matters worse, some useful interfaces (like io_submit) are poorly documented.

But yes, if you build your own memory and swapping subsystem directly on top of the native interfaces, and use threads as an abstraction of a core rather than a swappable context, you can build very efficient server engines that stall minimally on I/O. (Note: even using io_submit and O_DIRECT for async disk access, there are conditions that can cause blocking. They are just much rarer and easier to manage than with mmap().)


I'll just repost my famous comment from two years ago:

If you want your program to take page faults as PHK suggests, it has to be multithreaded. If you choose event-driven concurrency you can't afford to take page faults in mmap() or read(). When you make the threads vs. events decision you're implicitly making a bunch of related decisions about I/O and scheduling as well; a hybrid approach (like using events and mmap) won't work well. http://news.ycombinator.com/item?id=1760642
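To make the constraint concrete: in an event-driven server, one blocking read on the loop thread stalls every connection. A common workaround, which is exactly the kind of hybrid the comment is skeptical of, is to push blocking calls onto a thread pool. A minimal asyncio sketch follows; `blocking_read`, `handle_request`, and `serve_many` are hypothetical names, and a list of file paths stands in for real connections.

```python
import asyncio


def blocking_read(path):
    # Ordinary blocking I/O: may stall on disk (or on a page fault if mmap'd).
    with open(path, "rb") as f:
        return f.read()


async def handle_request(path):
    loop = asyncio.get_running_loop()
    # Run the blocking call on the default thread pool so the event loop
    # stays free to serve other connections in the meantime.
    return await loop.run_in_executor(None, blocking_read, path)


async def serve_many(paths):
    # Hypothetical stand-in for many concurrent connections.
    return await asyncio.gather(*(handle_request(p) for p in paths))
```

Note that this reintroduces threads (and their context switches) for every read that might block, which is the cost the comment is pointing at.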


Varnish is heavily threaded. It maintains queues where connections are put and worker threads pull them out. It is expected that a single connection gets a dedicated thread.


Since each thread is an actual kernel thread, this will limit the concurrent connections to the maximum number of threads a kernel can handle which isn't that high.


Not as such. The connection itself can be accepted and put into a queue on a different thread than the one that serves the request. This means that only the number of requests actually being fulfilled concurrently (a cached value is being retrieved from RAM or storage, or is being written to the socket) is limited by the number of kernel threads. With that in place, that number doesn't even have to be very high to handle lots of concurrent load.
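A minimal sketch of that arrangement, assuming nothing about Varnish's real internals: an acceptor only enqueues work, and a small fixed pool of workers drains the queue. `run_pool` and its parameters are made-up names for illustration.

```python
import queue
import threading


def run_pool(requests, handler, num_workers=4):
    """Serve a list of `requests` with a fixed pool of worker threads.

    Pending requests wait in the queue, so the number of waiting
    connections is not tied to the number of threads.
    """
    pending = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            if item is None:          # sentinel: shut this worker down
                return
            req_id, payload = item
            response = handler(payload)
            with lock:
                results[req_id] = response

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in workers:
        t.start()

    # "Acceptor": cheap, it just hands each connection to the queue.
    for req_id, payload in enumerate(requests):
        pending.put((req_id, payload))

    # One sentinel per worker, queued after all real work.
    for _ in workers:
        pending.put(None)
    for t in workers:
        t.join()
    return [results[i] for i in range(len(requests))]
```

For example, `run_pool(["a", "b", "c"], str.upper, num_workers=2)` serves three requests with two threads.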


Linux can create over 250,000 threads, but that may have been on a 32-bit system. On 64-bit it should be limited only by RAM.


The overhead of context switching becomes pretty high. Some say that context switching has become cheap, but at the very least you still need to update the TLB and schedule the next pthread.


At least the performance of context switching should scale with the number of cores, which seems to be the main direction of increased performance in hardware looking into the future.


Sometimes it is a good idea, but it works only if:

1) You have a threaded implementation, otherwise your single thread blocks every time you access a page on the swap.

2) You have decently sized contiguous objects. If instead a request involves many fragments of data from many different pages, it is not going to work well.

There are other issues but probably 1 & 2 are the most important.


Oh joy.

“these days so small that girls get disappointed if think they got hold of something else than the MP3 player you had in your pocket.”

An otherwise interesting article.


Here's the thing wrong with any kind of programming. The "best" way is highly contextual. Your situation, the OS, the hardware, the problem domain, the target market -- these all change the situation and bring their own particular trade-offs. There will always be something "wrong" with the way most anyone programs from the point of view of somebody not familiar with a particular situation.


I wish everything this guy said were true. I desperately wish a perfect virtual memory could relieve me of all the pains of caching.

Take an example: in a word processor, can we just keep all possible cursor positions (for moving the cursor around), all line-breaking and page-breaking info, and each character's location in virtual memory?


Moving the cursor in a word processor is a people-time operation; as long as you can manage it in 100ms or so, it's "good enough". That shouldn't be hard, but somehow LibreOffice still manages to take multiple seconds to update the cursor location...


Let me explain more on the details of moving a cursor:

In a terminal, fixed-width fonts with a fixed font size are usually used, so cursor positions can be computed easily.

But for a word processor, variable-width fonts with variable font sizes are more common. To move the cursor to its next position, we need to know the width of the current character; otherwise the program does not know where to show the cursor. Should we cache these widths or compute them on the fly? It is more complicated in some languages, where a word is rendered differently from a sequence of its individual characters. Should we cache character positions for these words? Should we keep them all in virtual memory?
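One middle ground between "cache everything" and "compute everything" is to cache only per-glyph widths and derive cursor positions for a line on demand. The sketch below assumes a made-up metrics table (`FAKE_WIDTHS`) in place of a real font engine, and it ignores ligatures and shaping, which is the harder case raised above.

```python
from functools import lru_cache
from itertools import accumulate

# Hypothetical glyph metrics; a real word processor would ask the font engine.
FAKE_WIDTHS = {"i": 3, "l": 3, "m": 11, "w": 10, " ": 4}


@lru_cache(maxsize=4096)
def glyph_width(ch, font_size):
    # Cached per (character, size): a small table, unlike per-position data.
    return FAKE_WIDTHS.get(ch, 7) * font_size / 10


def cursor_stops(line, font_size=10):
    """x-offsets of every cursor position in `line`, computed on demand."""
    widths = (glyph_width(ch, font_size) for ch in line)
    return [0.0] + list(accumulate(widths))
```

With these fake metrics, `cursor_stops("mil")` returns `[0.0, 11.0, 14.0, 17.0]`: one stop before each character plus one at the end of the line.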

Computers may not even have a swap partition allocated. Can an application just refuse to run in that case?


I understand that a word processor faces greater challenges in cursor positioning than a simple text editor--that's why I write my documents in LaTeX :)

As for swap, I always figure that if you're hitting swap, you're dead. It's time to buy more RAM. I've had instances where running Firefox and a kernel compile at the same time on my 4 GB laptop turns it into an absolute thrashfest--I have to walk away for 15-30 minutes, because I cannot do anything at that point. Applications, even plain old xterm, are big enough that swapping them in and out makes the machine unusable.


+1 for latex.


My 128k Mac circa 1985 seems to be able to move the cursor around in proportionally spaced text without noticeable lag.


All this abstraction with garbage collection and virtual memory etc etc is only taking us further away from the hardware. In some ways it's good to think like a 1975 programmer because you are acknowledging the fact that there's hardware underneath. If you completely ignore that and rely on the abstractions provided to you by an OS layer, the end result is a system that uses resources very wastefully. Look at how much software has bloated in the last 35 years. A large reason for that is the amount of abstraction of the hardware and lower layers of software we've started relying on. The more abstraction you use, the easier your job becomes, but it also results in a less lean system.


My goodness, non-sequiturs wrapped in inconsistencies disguised by absurdity!

Virtual memory uses CPU traps to do page faults in x86 land. Abstractions are useful and in fact even if you go all the way to assembly, you've still got a thin abstraction over the hardware. If you don't have an operating system, then no software will be written. If you look at software over the last 35 years, you'll notice it has become easier to use, and it does more complex things.


The memory, hard disk and CPU speed requirements of modern-day PCs have also gone up substantially. I wouldn't call them lean. Leanness, in my opinion, is where the future lies, and to get there we have to get closer to the hardware, not abstract it away further. Most programmers with a CS background tend to leave the details of the hardware to the OS's abstraction layers, thereby isolating themselves from the hardware designs of the EEs. In the future, we need design teams that do both. That's driven by the lean requirement.


This is a really interesting talk by the author of this article and program: http://archive.org/details/VarnishHttpCacheServer


This article is wrong, just wrong.

I would love it if there were just one kind of storage, and my code could ignore the distinction between disk and memory. But it can't, for three reasons: 10 ms seek times, RAM that is much smaller than disk, and garbage collection.

10 ms seek times mean that fast random access across large disk files just isn't possible. There is a vast amount of literature and research devoted to getting over this specific limitation. And it isn't old, either: all of the recent work on big data is aimed at resolving the tension between sequential disk access, which is fast, and random access, which is required for executing queries.

RAM that is smaller than disk means that virtual disk files don't work very well when you have large data files. If you try to map more than the amount of physical RAM you get a mess: http://stackoverflow.com/questions/12572157/using-lots-of-ma...

Garbage collection means that it is easy to allocate a bit of memory, and then let it go when the reference goes out of scope. There's no need to explicitly deallocate it. It's one of the things that makes modern programming efficient. With disk, you don't get that; if you write something, you've got to erase it or disk fills up.

In short, this guy's casual contempt for "1975 programming" is irksome, because it's clear that he isn't working on the same class of problems that the rest of us are. He may be able to get away with virtual memory for his limited application, but the rest of us can't.


(1) Varnish exists, so we can actually run it and analyze its performance. There's no need for "this can't work because X" arguments because we know whether it can work or not.

The author claims Varnish works with huge mappings too. In another article: "For example, Varnish does not ignore the fact that memory is virtual; it actively exploits it. A 300-GB backing store, memory mapped on a machine with no more than 16 GB of RAM, is quite typical."

(2) Varnish doesn't ignore the fact disk is slower than RAM. Its data structures are built to minimize page faults, and thus seeks, for this reason. See also: http://queue.acm.org/detail.cfm?id=1814327

The virtual memory abstraction leaks, just like every other abstraction. That doesn't make it worthless.

(3) Files aren't append-only: you can reuse space for a different purpose when you don't need it for its original purpose anymore. How do you think databases work? Or filesystems?

(4) The author's not talking about using disk-backed memory for your general purpose heap. He's talking about using the virtual memory system to access a giant cache on disk.

So is the author wrong about everything? Varnish seems to work, so if he's wrong he's getting away with it.
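Points (3) and (4) can be sketched in a few lines: a file-backed mapping used as a cache store, with freed space handed back out for new objects. This is an illustration only, not Varnish's storage layout; `MappedStore`, `SLOT`, and the 4-byte length prefix are all made up for the example.

```python
import mmap
import os

SLOT = 4096  # hypothetical fixed slot size, one page per object


class MappedStore:
    """Fixed-size slots in a file-backed mapping; freed slots are reused."""

    def __init__(self, path, slots):
        self._fd = os.open(path, os.O_RDWR | os.O_CREAT)
        os.ftruncate(self._fd, slots * SLOT)      # sparse until written
        self._map = mmap.mmap(self._fd, slots * SLOT)
        self._free = list(range(slots - 1, -1, -1))

    def put(self, data):
        if len(data) > SLOT - 4:
            raise ValueError("object too large for one slot")
        slot = self._free.pop()                   # IndexError if full
        base = slot * SLOT
        # 4-byte length prefix, then the payload.
        self._map[base:base + 4] = len(data).to_bytes(4, "little")
        self._map[base + 4:base + 4 + len(data)] = data
        return slot

    def get(self, slot):
        base = slot * SLOT
        size = int.from_bytes(self._map[base:base + 4], "little")
        return self._map[base + 4:base + 4 + size]

    def free(self, slot):
        self._free.append(slot)                   # space becomes reusable

    def close(self):
        self._map.close()
        os.close(self._fd)
```

The program never issues an explicit read or write to disk here. Which slots stay resident in RAM is left to the kernel, which is the design choice the article argues for and the parent comment disputes.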


I could not agree with you more. The author just has no clue about memory management. He imagines that the virtual memory model can solve all caching pains. The reality is much more painful.


What is a "virtual disk file"?


I think he is referring to mmap? Not entirely sure though...


Perhaps conflating swap files with virtual memory?


mmap makes use of the virtual memory system. The author does not appear to be conflating anything.


Yeah, that would be great if a. the link was to a question on StackExchange about mmap, but it's not, and b. if the use of mmap was commonly known as using a "virtual disk file", but again, it's not.


5.2% of the world's top 10,000 websites use it, as of July 11, 2012: http://royal.pingdom.com/2012/07/11/how-popular-is-varnish/

So the question is: if it is so great, why only 5.2%? I'm not being sarcastic. This is a totally serious question.


Every site is different. Varnish is great for a particular fairly common use case, but it's not an all-purpose "make site go faster" button. For example, putting Varnish in front of a CMS is usually a fantastic idea, as is putting it in front of anything that has a high ratio of reads to writes and serves the same pages to multiple users. That could be anything from a content blog or site to an online store. However, for other types of sites it doesn't make as much sense. A site like Facebook or Twitter would gain almost no advantage from it, since the overwhelmingly most common use case is for every single user to receive different pages on every single visit. Similarly, it doesn't make sense for search engines, or for web mail apps, etc.

Also, most really large sites have probably already developed some other method of caching if it suits their site needs, so it wouldn't make sense for them to switch over to varnish all of a sudden.


Facebook uses Varnish, and so does Twitter. They use it where it makes sense, where reads are high and content is less dynamic. To say they'd gain almost no advantage from it is an oversimplification, as they have various requirements and some of those do indeed benefit from caching.


>Sites that use Varnish normally return the X-Varnish HTTP response header when you access the site. We used this as the indicator if a site is using Varnish or not, and scanned the response headers for the top 10,000 websites in the world according to Alexa.

Terrible methodology. I suspect it is vastly more popular than that methodology would lead you to believe.
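As described in the quote, the check amounts to something like the sketch below (`reports_varnish` is a made-up name, and `headers` is assumed to be a plain mapping of response header names to values):

```python
def reports_varnish(headers):
    """True if a response advertises Varnish via the X-Varnish header.

    `headers` is a mapping of header name to value. HTTP header names
    are case-insensitive, so compare them lowercased.
    """
    return any(name.lower() == "x-varnish" for name in headers)
```

Any site that strips or renames that header before responding is counted as not using Varnish, so a scan like this can only undercount.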


The same reason that Windows 3.1x was used when people could have been using X. Sometimes these things take a while to catch on.


Because X sucked. I hated X Windows, nothing worked right at all back then. Actually, the last time I used it, it still sucked. And how has X caught on?


I am on a dual-screen X terminal, logged on to a remote Linux server (virtual quad core) and using rdesktop to access my desktop PC because I am not in my office. In front of me, all the workstations (38) are running Linux. X Windows is definitely the tool I use the most.


Worked fine for me. X is very popular now. Don't know when you last used it, but "it still sucked" brings nothing to this conversation.


Very popular, to whom? Linux has what, 1% of the desktop market share? Linux is very popular on Android, but I was pretty sure that Android didn't use X. That's like saying you were popular in high school because you had two friends. And apparently, from an article on HN, Ubuntu isn't going to be using it any further.


Very popular amongst Linux distributions, obviously not including Android. Sorry, is that the deafening silence of a non-answer to my question?


No, I've been compiling a list. It's a big list, so it's taking me a while.


Similar in size to the one I'm compiling about Windows 3.11?


Varnish isn't life-changing software. One thing Poul-Henning Kamp is very good at is selling Varnish, but once you use it for a bit you'll understand why it's a pretty advanced tool that many people avoid.

I've deployed Varnish once and I was annoyed by how it is configured. PHK will, of course, spin that as a positive, but my preference is not to write C when configuring my infrastructure support.

Beyond that, caching is a complicated topic with specific nuances for every deployment. As an example, I work on a high-traffic social site and we have absolutely no need for Varnish or any software like it. At its essence, Varnish is fixing a problem that many sites (a) do not have (yet), or (b) fix in other ways. We don't cache our dynamic views at all, and our system keeps up just fine. When it doesn't, we fix the system. Caching is on the radar as an improvement but we have determined that in terms of reward it does not make sense for our environment (invalidation is too frequent).

If you find yourself needing Varnish, ask yourself why. The answer to that question might lead you down some things to fix before investing in a big cache tier. There's a reason Facebook uses Varnish and you're wondering why others don't.


It isn't so great. We tried it and went back to squid.


If you take the premise of this article literally - then since 1975 computers have gotten inordinately more complex, but we've developed no abstractions to help programmers deal with it.


I rather thought he was saying that since 1975 there have been abstractions developed, but people aren't taking advantage of them. Or perhaps people are not trusting them in this case.


Isn't it the abstraction (virtual memory) creating a problem in the first place? By programmers not understanding that an abstraction has been applied.


No. Virtual memory has been known about since the early '60s; the issue was that the x86 architecture only picked it up in the mid-'80s with the 386 CPU.

The problem with programmers not knowing how the operating system works is not the fault of the operating system, it's the fault of the developers.


[2008]


The article is misleading, and the author has no clue about the complexity of user-space memory management. Random access to on-disk virtual memory will be a disaster if we just keep everything in so-called virtual memory without a complicated caching mechanism.


Let's not resort to attacks on the author. Consider that this article is at least 4 years old.


Are you sure? PHK is a quite seasoned FreeBSD kernel developer and did phkmalloc which was used in FreeBSD for a long time, see http://phk.freebsd.dk/pubs/malloc.pdf.


I understand he's a kernel developer, but to me this sounds exactly the same as the people who have kept repeating, for years:

"Don't create a ramdisk (a true, fixed size, one, that you prevent from ever getting to disk) because the (Linux) kernel is so good and so sentient that you won't gain anything by doing that"

Yet anyone compiling big projects made of thousands of source files from scratch knows that it's much faster to write the compiled files to the ramdisk.

I can't tell how many times I've seen this argument between "pro 'kernel is sentient'" and "pro 'compile into a real ramdisk'", but I can tell you that, from experience (and it's hard to beat that), the ramdisk Just Works [TM] faster than the 'sentient kernel'.

So how is it different this time?


You mean, there's a performance difference between writing to RAM, and writing to RAM and flushing writes to disk every few seconds? Doh!


Fuck this guy gets on my nerves, acting like he's the only person in the world who knows what virtual memory is or that paging is some kind of dark magic only understood by kernel developers, rather than standard subject matter for any intro to computer architecture/OS concepts course.


Consider: what proportion of programmers have taken an OS course? Specifically OSes, 'cause I know that virtual memory was not covered in any of the computer architecture courses I've taken.

Now, how many of those programmers who know about virtual memory (larger than the number who have taken a course on operating systems, but still far from 100%, I'd wager) have actually realized "hey, virtual memory can make it so that I never have to explicitly write stuff to disk?" I certainly hadn't.

Just because you expect people to know basic concepts, doesn't mean that an article explaining what they're useful for is useless. Quite the opposite, in fact.


My experience has been that most programmers are more or less clueless about how virtual memory works. At best, they understand it as letting you swap out memory to disk when memory runs low.

Even ignoring Kamp's "mmap the world" approach, if I brought up any of his other ideas in a meeting with most programmers, there'd be immediate cries of premature optimization (and reinventing the wheel, in the case of using a single malloc'd chunk for workspaces). Never mind what we're talking about building or what we already know about the performance of different approaches, and never mind that a lot of important performance decisions are architectural and are a lot harder to change later on.

These ideas just aren't on most programmers' radar—it's all evil voodoo to be avoided at all costs to them.

How many programmers know the effect of a write from one CPU on the next read from another CPU on the same cache line? How many programmers know the relative cost of a syscall vs. a function call? How many programmers ever think about optimizing their use of CPU cache?

Most of the time they get away with ignoring these things because they really don't matter in context. But sometimes they don't get away with it because these things do matter, and in those moments I wish more programmers had a better understanding of their machines and their operating systems.


Whoa! Why the hostility? This article seems interesting, not arrogant.


You'll often see comments like OP on articles or comments from Kamp. I've left some.

Poul-Henning Kamp is a polarizing figure. I sympathize with the sentiment on occasion, and I'm occasionally annoyed by his writing style ("everything but Varnish is written poorly" is evident in this one). Another one that bothered me was You're Doing it Wrong[1]; great information, offensive tone -- most of his publications in Queue are the same way.

That said, I can also disregard most of that annoyance as he's a genuinely smart guy, and I want to know what he's trying to tell me. Put it this way: I'd listen to anything PHK has to say, but I'd (probably) never hire him because I need to foster a team environment, and PHK can be divisive when he communicates.

I'm sure the commenters you see commit karma suicide like this feel the same way and are just quite poor at communicating what I have. OP dug in on a detail, but reading the rest of his comment another way, it's a criticism of the writing style and tone (in my reading).

[1]: http://queue.acm.org/detail.cfm?id=1814327


I am impressed with how deftly you reframed (and redeemed) the GGP's bilious outburst. That was charitable of you, and makes me wish more people would do it.


It did feel useful, as many new programs actually do manage much of their data exactly as he warns not to - keeping an in-memory cache of something they're (re)reading from disk.


It might be a subject matter for most, but how many can honestly say they make use of, and eventually even remember, it after years of programming in high level programming languages on modern OSs?



