Ask HN: How do Flickr, Youtube and other high traffic websites handle file uploads?
50 points by wave on Nov 12, 2008 | 14 comments
How do Flickr, Youtube and other high traffic websites handle file uploads? I am mostly interested to know what they do on the server side.

I might be mistaken, but I read PHP might not be great at being a daemon to handle file uploads due to memory leaks. Is this true? Is Python or Java a better choice? How do you handle your file uploads?

I appreciate any help in pointing me to a better solution.



Here are some interesting stats on how much data 4chan handles on any given day. The upstream might not quite compare to the likes of Flickr or YouTube, but from what I can tell, while software is certainly important, it is your hardware that will make or break whether your site can handle such volume.

4chan is currently powered by seven servers (five content, two database). We are colocated on a full 500mbps Global Crossing connection, allowing us to push over 5TB (5,000GB) of data per day [Image: http://content.4chan.org/img/traffic.feb5-12.png]
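As a sanity check on the numbers quoted above, back-of-the-envelope arithmetic on a saturated 500 Mbps link does land in the stated ballpark:

```python
# Quick sanity check of the quoted 4chan figures: a fully utilized
# 500 Mbps connection, run flat out for 24 hours, moves roughly 5 TB.
mbps = 500
seconds_per_day = 86_400
gb_per_day = mbps / 8 * seconds_per_day / 1000  # MB/s -> GB over a day
# roughly 5400 GB, consistent with "over 5TB (5,000GB) of data per day"
```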


I'm not entirely sure about them, but I use S3 and EC2 with SQS so that the user can upload the file, it will be encoded while waiting in a queue, then place itself back into storage. That way nothing ever touches my production server.
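The upload-to-S3-then-queue pattern can be sketched in a few lines. This is a sketch only, assuming the modern boto3 SDK (in 2008 this would have been the original boto library); the bucket name and queue URL are made-up placeholders:

```python
# Sketch of the S3 + SQS pipeline: store the raw upload in S3, queue an
# encode job, and never touch the production web server again.
import json

BUCKET = "my-upload-bucket"                                   # assumption
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123/encode"  # assumption

def job_message(bucket, key):
    """Build the queue payload telling a worker which object to encode."""
    return json.dumps({"bucket": bucket, "key": key})

def enqueue_upload(local_path, key):
    """Push the uploaded file to S3, then enqueue an encode job for it."""
    import boto3  # imported lazily so job_message stays dependency-free
    boto3.client("s3").upload_file(local_path, BUCKET, key)
    boto3.client("sqs").send_message(QueueUrl=QUEUE_URL,
                                     MessageBody=job_message(BUCKET, key))
```

A worker process would later receive that message, pull the object down, encode it, and write the result back to S3.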


Yep, I use this model too. It's great because multiple disparate servers can be dealing with the media coming from/heading to S3. You don't get stuck with media files on one box.

It's like a giant disconnected filesystem.

Oh, and use whatever language you are most comfortable using.


I use merb on ec2 with a redirect back to the app server to write the db reference. We are using a regular html based file uploader and also have a flash uploader - which sometimes gives us fits.


One of the Flickr engineers, Cal Henderson, wrote a book with a title something like "Building Scalable Web Sites" that was published by O'Reilly. I'm pretty sure he covers that topic. You may be able to get access online via your public library's web site (you can in Seattle, at least).

There are the obvious issues with file uploads: they can take a lot of bandwidth and disk space. But there are also a lot of less obvious problems.

1. File uploads take a lot longer than most web requests, both because of the size of the data and because most client connections download faster than they upload.

2. As a result, file upload requests hold server resources longer than other requests. This usually comes down to memory, but there can also be file handle and socket limits. Also, more in the past than in the present day, just the CPU overhead from dealing with lots of open sockets could get to be an issue.

3. File uploads often carry a lot of memory overhead. The braindead-simple way of handling file uploads in PHP, etc. ends up buffering the whole file in memory until the upload is complete. That can really add up. Furthermore, the process handling the upload request has the memory overhead of the PHP (or Ruby, or Python...) interpreter, and any code and libraries associated with your application. This overhead is carried even though most of that code and those data structures are unnecessary for most of the request duration.

This memory usage really stacks up when each upload request lives for seconds, or minutes, rather than the milliseconds required for most requests.

There are lots of ways to deal with the resource issues. Writing the upload to disk as it arrives is a big improvement. You can go further by handling uploads with a separate app/server instance that is tuned to minimize the size of each app/interpreter instance.
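The "write it to disk as it arrives" idea can be sketched in a few lines of Python. Here `stream` stands in for whatever file-like body object your framework exposes (e.g. `environ["wsgi.input"]` in a WSGI app), and the chunk size is an arbitrary assumption:

```python
# Spool an upload to disk in fixed-size chunks so memory use stays flat
# regardless of how large the file is or how slow the client uploads.
import os
import tempfile

CHUNK = 64 * 1024  # 64 KB per read; an arbitrary but typical choice

def spool_to_disk(stream, length, dest_dir="/tmp/uploads"):
    """Read `length` bytes from `stream` into a temp file, chunk by chunk.
    Returns the path of the spooled file."""
    os.makedirs(dest_dir, exist_ok=True)
    fd, path = tempfile.mkstemp(dir=dest_dir)
    remaining = length
    with os.fdopen(fd, "wb") as out:
        while remaining > 0:
            chunk = stream.read(min(CHUNK, remaining))
            if not chunk:          # client hung up early
                break
            out.write(chunk)
            remaining -= len(chunk)
    return path
```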

There are also ways to take advantage of file upload features built into a front-end webserver (like nginx) to buffer the whole upload to disk before your app has to get involved, not to mention the Amazon setups described elsewhere in the thread.

Turning to a specialized custom file upload server written in Java or C seems like an optimization to undertake only if you outgrow the other solutions (including adding more memory per server, or more servers).



Use this nginx module: http://www.grid.net.ru/nginx/upload.en.html

It is highly scalable: it spools the uploads to disk and does the MIME parsing in efficient nginx C code. Once it finishes, it just passes some params to your backend processes with the location of the file on disk, and you can process it however you want.
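On the application side this reduces to handling a tiny form POST. A sketch of that backend handler, with the caveat that the field names below are assumptions (they depend entirely on the `upload_set_form_field` directives in your nginx config, not on any defaults):

```python
# With the nginx upload module in front, the app never sees the file body --
# only small form fields describing where nginx spooled it on disk.
# Field names "file.path" / "file.name" are illustrative assumptions.
def handle_upload(form):
    """`form` is the dict of POST fields nginx forwards after spooling."""
    path = form["file.path"]   # where nginx wrote the upload on disk
    name = form["file.name"]   # the client's original filename
    # ...record `name` in the database, queue `path` for processing...
    return {"stored_at": path, "original_name": name}
```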


I use this too.

And, FWIW, I see some people suggesting Merb, which is cool, but even the guy who wrote Merb (ezmobius) uses this now.


One really easy way that works extremely well is to use an old school CGI script to handle the upload. It will die as soon as the upload process is finished, which keeps things very self contained and clean.
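A bare-bones one-shot script in that spirit might look like the following sketch: parse the multipart POST, stream the file to disk, exit. The upload directory and form field name are assumptions:

```python
# Old-school CGI-style upload handler: the process exists only for the
# duration of one upload, then dies, keeping things self-contained.
import os
import shutil

UPLOAD_DIR = "/var/spool/uploads"   # assumption

def save_upload(part, dest_dir=UPLOAD_DIR):
    """`part` is a parsed multipart field with .filename and .file
    attributes (e.g. one field of a cgi.FieldStorage)."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(part.filename))
    with open(dest, "wb") as out:
        shutil.copyfileobj(part.file, out)   # streams; never slurps the file
    return dest

if os.environ.get("GATEWAY_INTERFACE"):      # only when actually run as CGI
    import cgi                               # stdlib parser (removed in 3.13)
    print("Content-Type: text/plain\r\n")
    print("saved:", save_upload(cgi.FieldStorage()["file"]))
```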

The most important thing (as others have noted) is that you do processing asynchronously. Get the file on the server, queue it (however simply), and then process the uploads in however big of batches your machine(s) can handle optimally. 99% of the time you're going to want ffmpeg for videos and ImageMagick for images.
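The asynchronous worker loop described above can be sketched like this; the in-process queue is a stand-in for whatever you actually use (SQS, a database table, ...), and the ffmpeg/ImageMagick flags are illustrative assumptions:

```python
# Worker loop: pull queued upload paths and shell out to ffmpeg (video)
# or ImageMagick's convert (images). Flags are illustrative, not tuned.
import queue
import subprocess

jobs = queue.Queue()  # stand-in for a real queue (SQS, DB table, ...)

def build_command(path):
    """Pick the right external tool for the file type."""
    if path.endswith((".avi", ".mov", ".mpg")):
        return ["ffmpeg", "-i", path, "-y", path + ".flv"]
    return ["convert", path, "-resize", "800x800>", path + ".jpg"]

def worker():
    """Drain the queue forever, one conversion at a time."""
    while True:
        path = jobs.get()
        subprocess.run(build_command(path), check=True)
        jobs.task_done()
```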


So, what you want to do is stream the file to disk. The problem that most people will face is that someone is uploading a 100MB video file and your code is trying to hold it in memory. Bad! Get it on disk, then deal with it by opening the file.

In terms of a daemon, you don't need one. PHP and other languages can execute other processes. So, you want to convert that AVI to Flash? Save it to disk, then execute another process to convert it.


Sniff around the Amazon S3/EC2 documentation and you'll find pipelines demonstrating what you want to do, such as:

http://developer.amazonwebservices.com/connect/entry.jspa?ex...


If you plan to deploy your client application as an HTTP application then there isn't much you can do other than let your web application handle it (PHP/Python/Ruby, whatever).

Another option is writing an uploader as a stand-alone application (such as the Flickr uploader, Facebook's iPhoto plug-in, and the like).

The third option is BAD but still exists on Facebook: a Java applet within the browser.

The fourth option is to write the client in one (or more) of the browser extenders such as Google Gears, MS Silverlight, or Adobe AIR/Flex/Flash (look also at http://www.jnext.org and http://www.google.com/search?q=flash+uploader).

All of these options can be implemented on the server side in whichever language you choose.


You are mistaken and it's not true. If your site is php, let php handle the uploads.


So for the most part I don't think I can answer in detail, but in public situations hi5's response has been that they have a dedicated pool of servers handling the uploads, and they store the files on a static server. Once that server fills up, they provision a new one. I will call this the viking image upload process, as it's like a viking burial: each one of those images is then buffered by a CDN.

I should point out that hi5 has more photo uploads per day than Flickr does.






