MapD Open Sources GPU-Powered Database

ccleve · on May 8, 2017

I would very much like to see a blog post or some other explanation of how a GPU speeds up a database query.

Most traditional SQL databases are I/O-bound, meaning that the cost of pulling index pages from disk into memory overwhelms everything else, and adding more CPU doesn't do much good. Even for in-memory databases the cost of pulling a page into GPU memory is likely to be high. The calculations you do are pretty simple -- you mostly traverse arrays of pointers and do intersections and accumulations. No floating point, just increment and compare integers.

GPUs really shine when you can parallelize operations on blocks of data that are already in GPU memory. That's why they're great for graphics and games. I'm not seeing how they help with database queries over large volumes of data which cannot fit in GPU memory.

I'd really love to read some kind of tutorial on the topic. There's likely some kind of magic here that I don't see.

paulmd · on May 8, 2017

They don't help on large volumes of data which cannot fit into GPU memory. Obviously. PCIe is even slower than standard CPU memory, pulling non-trivial amounts of data across the PCIe bus is catastrophic for any GPU program.

GPUs shine because their VRAM has enormous volumes of bandwidth (up to ~512 GB/s for high-end modern GPUs). That's an order of magnitude more than CPU memory.

So basically you throw indexes and perhaps a few key columns in VRAM, but as little actual data as possible, and just return the row IDs (or ranges) of interesting data. That lets you cross-reference back into host memory, and at that point you would perhaps perform some further limitation criteria based on something that's not available in GPU memory.

GPU memory isn't super-abundant, but high-end cards have up to 24 GB, and if you stack a bunch of them in a rack then you are talking about a reasonable amount of memory for most use-cases. Four cards in a box is easy, and you can put four cards in a 1U chassis if you use GP100 mezzanine cards. So now we're talking 96GB of VRAM per box or per 1U, which works out to quite a bit of index data.

The other use-case would be something that's really computationally intensive to perform, but for some reason cannot be pre-computed and cached. So - for example find me all rows where "sha256(INTEGER my_table.my_val AND now() AS INTEGER) > 256". Which as you can imagine mostly trends towards crypto kinds of stuff.

dogma1138 · on May 8, 2017

Pascal supports upto 49bit memory addressing which covers 512TB worth of addresses, AST between the GPU and system memory addresses is done in hardware which means prefetches via usage hints / branch prediction and page faults are handled by the GPU.

In this case you are only bound by PCIE bandwidth (or NVLINK if you are using a supporting GPU) and performance wise it's pretty near native speeds.

You can also access network and other types of storage from within the GPU since MIMO is universal these days.

AMD will add similar functionality starting with VEGA.

This means that with both vendors you are no longer bound by the GPU onboard memory.

And if PCIE x16 isn't enough then NVLINK comes with upto 200GT/s these days and has an infiniband end point.

paulmd · on May 8, 2017

The problem is "PCIe bandwidth" is really bad. PCIe 3.0x16 is only 16 GB/s, which again is less than half of you would get from any old DDR4 memory. You might as well do that on the CPU then.

NVLink is fine as far as it goes... the problem is it's a two-party transaction, NVLink still has to be fed by something and that's either another GPU or the CPU or something like a SSD via DMA. With a database type program, the CPU and GPU use-cases are pointless because you could just be doing the search there and returning results directly instead of device-to-device DMA. And NVMe SSDs are usually limited to PCIe 2.0x4 which is 2 GB/s. Absolute maximum, 16 GB/s, same as any other PCIe-attached device.

Remember, GPUs aren't always used on compute-bound tasks. There is a very valid use-case for I/O bound tasks where you are just using it as a super-cache. GPU database programs often (but not always) fall into that category. You are just searching indexes/target columns like crazy and throwing the resulting row IDs back over the wall to the CPU.

dogma1138 · on May 8, 2017

NVLINK is GPU to GPU or GPU to CPU link, if you are using it as a GPU to CPU link you are taking to the CPU at native NVLINK speeds either through a native NVLINK bus if you are using a POWER CPU which supports it or via an Infiniband/OmniPath interconnect on Intel/AMD CPUs.

16GB/s is a bit tight but that's where prefetches based on pre-defined usage hints as well as the branch predictor in the GPU and the CUDA Compiler/Runtime that feeds the GPU without stalling as much as possible.

OFC you can I/O bound a GPU that's is actually pretty easy these days even in normal (gaming) workloads (e.g. draw call limit).

Let's say you are building a 4 GPU DB, if you are using NVLINK you each GPU can see the entire memory space of the other GPUs connected on the same NVLINK (there is also RDMA which can be done over the network but this is a complicated scenario) in this case you have 96GB of high speed VRAM + which ever amount of memory you can put on your CPU which can be upto 1.5TB or so per socket these days, you also have 64GB/s of bandwidth to feed that 96GB of memory with which means that it takes a second and a half to completely refill the memory completely (and I'm still not sure that the same lossless compression schemes that GPUs use these days for won't work for your data which would give you much higher effective bandwidth that the compression/decompression if free on the GPU side).

With these figures I can't see a reason why you can't optimize your memory residency to have the best of both worlds fast key-value/hash or index lookups for IO bound tasks and then programmatically prefetching the workload or if the branch prediction is good enough just letting it roll in processing bound tasks.

So yes it's not "perfect" but nothing ever will, the question is it much faster or even orders of magnitude faster for certain tasks / workloads / implementations than a CPU only DB and clearly it is. How well would it compete against other contenders like Intel's Xeon Phis, programmable FPGAs and other emerging technologies that aim to solve this problem from another direction i don't know.

paulmd · on May 8, 2017

Even if you have a super-fast interconnect, you can only fill as fast as something else can send it, right?

I mean, if it's in CPU memory than you are limited to whatever the CPU memory can deliver. Which is 40 GB/s per socket from quad-channel DDR4, more or less. You can probably double that with compression, but it's still vastly less than you'd get from VRAM.

And you get the same speedups from storing the data compressed in VRAM assuming you have some fast method for indexing into the stream, or you're searching the whole stream. But yes, assuming your data can be dehydrated and then stored in CPU memory you can squeeze a little more bandwidth out of it this way.

If it's GPU memory (in another GPU across NVLink/GPUDirect/etc), then you should be doing the searching on that GPU instead.

The stipulation here has always been "tasks that don't fit into GPU memory aren't going to work" so if you are assuming that the data lives on another GPU... then you are violating the pre-condition of all of this discussion, because it fits into GPU memory.

Yes, if your data fits into GPU memory it's very fast, because VRAM bandwidth is enormous. The catch is always getting it to fit into GPU memory. That's why you need a "tiered" system - indexes or important columns live on the GPU, then you return them to the CPU where you do additional processing with them. It's like L1/L2/L3/memory hierarchy on a CPU - sure, your program that fits into cache is super fast, but many real-world tasks don't fit into cache.

If your entire dataset is small enough that it fits in VRAM on one box's worth of GPUs (4-8 per box), or within a few racks or whatever... that's great, go hog wild. But it's not very cost effective versus a tiered approach, this is like asking for a CPU with 16 GB of cache (or enough CPUs to add up to 16 GB of cache). It's outright impossible for very large datasets (96-192 GB is not a very large dataset in this context, but it's a reasonable amount of index space for a 1.5 TB or 3 TB dataset on the CPU socket).

Once you start having to transfer stuff into and out of VRAM, GPU performance typically starts to rapidly degrade. It's "only a second or two" to you... but that's a terabyte's worth of VRAM bandwidth that sat idle for that time. "Rolling" approaches like this do not work very well on GPUs. You want to avoid transferring stuff on or off as much as possible because you just can't do it fast enough.

Computational intensity is one way to avoid doing that, both in VRAM and to host memory. If you can transfer a byte (per core) every P cycles, and you perform at least P cycles' worth of computation on it... you're compute bound. And at that point you can probably make a "rolling" algorithm work. But the problem is doing that (without just being obviously wasteful). In practice, P is like 1 byte every 72 cycles or something (working from memory here). It's really hard to find that much work to do on a byte in many cases. So despite their massive compute power... GPUs are often I/O bound on most tasks. Crypto tasks (hashing) are a notable exception.

Note that I/O bound here can happen in different areas too. You can be I/O bound over the GPU bus, or on the VRAM. It depends on where the piece of data you need lives.

Also, in gaming it's typically the CPU which is bound in draw-calls. The CPU can't assemble the command lists fast enough to saturate the GPU (at least not on a single thread). I'm not sure I'd say it's necessarily I/O bound in this case, either, but it's possible.

Normally GPUs will only use a fraction of their VRAM bandwidth while gaming - however, like anything else, throwing superfluous resources at it will still produce a speedup even if it's not really "the KEY bottleneck". Having memory calls return in fewer cycles will let your GPU get its shaders back to work quicker, even though it means the memory may be idle for a greater percent of time (i.e. utilization continues to fall).

Assuming you were going to do a DB where you store literally everything on the GPU - I'm not sure whether it would be better to go columnar or by row. In a columnar approach you would have one(or more) GPUs per column, and they would either sequentially broadcast their resulting rowIDs and all perform set-intersection on their own datasets, or dump their set of output rowIDs to a single "master" GPU which would find the set intersection. In a row approach, you would have each GPU find "candidate" row-IDs and then dump them to the CPU, and the CPU handles broadcasting and intersecting (since hopefully it's not a large set-intersection).

Columnar might be faster for supported queries, but row would be simpler to implement and more flexible (since the critical data processing would more or be taking place on the CPU).

I would have to play with it to be sure, and I don't have access to a cluster of GPUs anymore (and never had access to one with NVLink or GPUDirect).

dogma1138 · on May 8, 2017

But the AST is effectively a "tiered system" lets say you have a 20TB of DB, 96GB fits in VRAM, 2 fits in your RAM, 17 and change are on your disks.

Today the GPU can see this as a single continuous memory space, you have it does the AST and handles page faults when needed and tries to optimize the execution of each task by prefetching the needed data to the VRAM by it self through branch prediction or by programming the usage hints yourself.

Here is a short paper/write up on the unified memory in Pascal, look at the difference between optimizing/profiling your memory usage and using the prefetch hints, you going from increasing your data set by 4 times of max GPU memory and getting performance reduced by about 3.8-4 times relying on page faults alone to only losing less than 50% of your performance while going over 4 times your maximum available VRAM.

The same thing can be done on cards with 24GB, and the magic is that the performance drop actually levels off at about 2-3 times your maximum available VRAM which is why the difference between 2 and 4 times the amount of memory isn't that big in terms of performance hit/gain to over allocation ratio, so 8 times your amount of memory is not that far behind the penalty you already pay for using 4 times more.

And yes there is no scenario in which you do not lose performance, but the unified memory solution in NVIDIA GPUs is a very good solution to reduce the penalty and if you really need to go balls to the wall you'll actually benefit from the scaling.

tmostak · on May 8, 2017

These are relatively old but might be helpful: https://www.mapd.com/blog/2015/06/04/mapd/ and https://www.mapd.com/blog/2015/06/25/mapd-massive-throughput....

Bedon292 · on May 8, 2017

To my understanding it makes improvements in a few areas. The first being the tiered approach to where the data is stored. So for instance geospatial data may only store coordinates, and datetime info in the actual gpu memory. So you get massive speed ups on that. Then the next most interesting fields are in main memory, for a bit better performance. And than the rest, is on disk. I believe they also designed ways of indexing the data which perform better on GPUs with massive parallelization. And then the other major improvement is using the GPUs to render visualizations of the data you are trying to understand. That way what gets passed from the server could only be an image and not tens of millions of points of data.

goldenshale · on May 9, 2017

The benefits would come from running query kernels in parallel on thousands of GPU cores simultaneously, which necessarily only have access to memory on the card. Of course pre-computation is always preferable, but that's not feasible for data exploration and many types of queries. My guess is they start with databases or i dexes that can sit entirely in GPU RAM, and then later versions will support streaming this data from main memory and disk.

Some types of queries that would likely benefit from this strategy would be reductions (sum, avg, count), filtering (find items matching a particular value or predicate), and geospatial queries against geometry. These all require iterating over all or most of the data, and so doing that massively parallel will result in major speedups.

To support larger datasets you can always shard across GPUs and machines. This is how the google index is structured, for example.

I bet real-time metrics will be a good application.

fulafel · on May 8, 2017

Most SQL databases fit in RAM cheaply (and most GPUs share memory with the CPU).

rubatuga · on May 8, 2017

Most discrete GPUs do not share memory

valarauca1 · on May 8, 2017

Actually all GPUs do.

PCIe is DMA (Direct Memory Access). The Vulkan API's have specific constructs for building a "shared" memory space between the native application, and its shaders on GPU. Then the driver will manage prefetching and caching from RAM to NVRAM on the GPU..

---

OFC Vulkan is only supported on newer cards. But OpenCL has had this feature for nearly a decade.

fulafel · on May 9, 2017

There's a good argument that server GPU acceleration is more likely to become mainstream through integrated GPUs than through discrete GPUs.

gopalv · on May 8, 2017

> Most traditional SQL databases are I/O-bound, meaning that the cost of pulling index pages from disk into memory overwhelms everything else, and adding more CPU doesn't do much good.

A traditional database is somewhat tied to a disk for persistence, particularly if it is transactional, but this is more of a top fuel dragster with a mile to run.

The first assumption is that you can fit the current working set into memory - this might not be a raw stream of data, but perhaps hundreds of data-points per cellphone tower which probably occupies around 60-100Gb.

Something like Postgres would be stuck on a single core at this point mostly waiting on memory bandwidth (with associated NUMA bandwidth issues of picking 1 core).

Being able to use the memory bandwidth of the GPUs, plus being forced to write code which is entirely branch independent (CPUs are good, but they still have silicon to expect branches), gives you a massive scan speedup once your data fits within the scale-up single node situation.

If you're willing to shard out your data across multiple machines and never do joins across the shards, you can get distributed scans.

This is never going to be your primary system of record, but just something that can hold a cube/matrix and slice/dice through it faster than a CPU based system can.

If you replace the "GPU and open-source" with "custom silicon and proprietary", you might be looking at something which is very much like Netezza.

anarazel · on May 8, 2017

> A traditional database is somewhat tied to a disk for persistence, particularly if it is transactional, but this is more of a top fuel dragster with a mile to run.

Sure, for writing that's somewhat true (in many traditional RDBMs you can relax guarantees e.g. with UNLOGGED tables in postgres). But for reads, not really. Most databases have a buffer layer in front of the disk / os.

> Something like Postgres would be stuck on a single core at this point mostly waiting on memory bandwidth (with associated NUMA bandwidth issues of picking 1 core).

Obviously depends on the type of query (filtering, aggregation, joins ...) you're doing, but postgres can actually parallelize things to multiple cores these days (since 9.6, a lot more to be released with v10).

Unfortunately, if you're forced to use a single core, memory bandwidth is seldomly the actual bottleneck. In many cases postgres will be CPU bound elsewhere ("tuple deforming", evaluating filters / aggregation expressions), or for other types of queries memory latency will be much more important due to random access nature (think of hash-joins/aggs or nested-loop joins). We're working on that.

> Being able to use the memory bandwidth of the GPUs

That only really helps you if a significant portion of the working set fits onto GPUs. Which'll usually not if we're talking 60-100GB of main memory resident data. If you're reading sequentially through all that data you'll be bound by the main memory latency (including those NUMA issues etc you mentioned).

tmostak · on May 9, 2017

60-100GB? Many of our users run with 8 GPUs and 192GB VRAM, with larger VRAM buffers likely coming soon. And that's before you consider the fact that you don't need to cache all columns and partitions in a column store or compress the data. Its pretty easy to query 15+ active columns on 5B rows on a single server, and then there's scale out if you need more.

qeternity · on May 9, 2017

I want to see how this compares to something like Memsql, which we've had amazing success with.

arnon · on May 8, 2017

The GPU may help in compressing and decompressing data, encrypting and decrypting as well as sorting it (block-wise) ahead of heavy operations.

Compressing and decompressing are examples of reducing I/O, and shifting it over to compute instead - and the GPU is ideal for that.

Most decent columnar databases are compute bound. I/O is less of a bottleneck compared to OLTP style databases.

dtnewman · on May 8, 2017

From their product page [1]:

> MapD Core is an in-memory, column store, SQL relational database that was designed from the ground up to run on GPUs.

However, memory is getting cheaper and cheaper these days.

[1] https://www.mapd.com/products/core/

imaginenore · on May 8, 2017

Your claim of I/O bound is probably only true for very large databases where indexes don't fit into tens/hundreds of GB of RAM typically available.

Modern SSDs, even at the consumer level are already doing 2.5GB/s reads and 100K IOPs.

I'd say most database setups are CPU limited.

drej · on May 8, 2017

I ran cloc to get a better idea of the tech in this. Interesting.

Edit: I noticed a lot of Go, but only one live script in the repo. So I excluded the 'ThirdParty' folder, where some vendored libraries reside. Edit2: That 90K SLOC single-file C goodness is a full SQLite release, which is not vendored in the ThirdParty folder, taken it out as well.

  --------------------------------------------------------------------------------
  Language                      files          blank        comment           code
  --------------------------------------------------------------------------------
  C++                             127           4385           3960          54609
  C/C++ Header                    158           2960           4273          14301
  Java                             42           1599           1498           9539
  CUDA                              8            301             47           3125
  CMake                            36            377            967           2616
  yacc                              1            133            156            742
  Maven                             6              0            126            507
  Go                                1             72             12            491
  Bourne Shell                      5             72             43            369
  Bourne Again Shell                3             38              0            208
  lex                               1             28             12            204
  XML                               1             12             12            182
  Python                            5             65             25            168
  C#                                1             12             11            112
  JSON                              1              0              0             73
  SQL                               3              0              0             25
  --------------------------------------------------------------------------------
  SUM:                            399          10054          11142          87271
  --------------------------------------------------------------------------------

solidr53 · on May 8, 2017

Wow thats interesting, most of the languages other than C,C++, CUDA, Go and Java must be build related things, right?

mattnewton · on May 8, 2017

I assume they are bindings? But I haven't found an API doc that confirms that.

Edit: never mind, many of the 1-2 file languages are excluded from the latest list, so they were in the "third party" folder.

wesm · on May 8, 2017

I see they are supporting Apache Arrow (http://arrow.apache.org/) in their result set converter, which will be nice for interoperability with other data system that use Arrow: https://github.com/mapd/mapd-core/blob/21fc39fab9e1dc2c1682b...

(I'm on the Arrow PMC)

tmostak · on May 8, 2017

Yes its definitely alpha functionality now but we're very excited about this piece and potentially supporting more of the Arrow spec. In case you missed it check out the first project we are working on with Continuum and H2O with the "GPU Data Frame" here: https://github.com/gpuopenanalytics/pygdf.

Bedon292 · on May 8, 2017

Available on the AWS marketplace already set up: https://aws.amazon.com/marketplace/pp/B071H71L2Y Can test it for as little as $0.90 / hour on a p2.xlarge. I know I absolutely cannot wait to play with it.

reacharavindh · on May 8, 2017

This blog has a few performance numbers based on a NYC taxi rides dataset, and some setup notes. Might be relevant for HNers here. http://tech.marksblogg.com/billion-nyc-taxi-rides-nvidia-pas...

I have nothing to do with the blog. Just sharing.

mmcclellan · on May 8, 2017

This is where I first got interested in MapD. The author has a new post detailing how to compile MapD: http://tech.marksblogg.com/compiling-mapd-ubuntu-16.html

rsp1984 · on May 8, 2017

This is quite cool. I'm curious though what the monetization strategy is now. After all they're a for-profit company that has taken a lot of VC.

Sure, they could sell integration and maintenance services on top but services are typically much harder to scale than software -- not quite what gets investors excited.

shusson · on May 9, 2017

My understanding is that the distributed[1] capability of MapD is not open sourced. And I'm guessing a lot of the companies (especially the ones with money to burn) that would use MapD require at least some horizontal scaling.

[1]http://docs.mapd.com/latest/mapd-core-guide/distributed/

wiradikusuma · on May 8, 2017

For the laymen, is this something that I can use to replace MySQL/Postgre (with ACID), assuming I only use ANSI SQL and no stored-proc?

remus · on May 8, 2017

Probably not. MySQL and postgres are both very general purpose relational databases that have a broad range of features aimed at a wide range of use cases.

MapD is designed for more specific use cases, specifically analytic querying and visualisation.

Meai · on May 8, 2017

The demo https://www.mapd.com/demos/tweetmap/ is pretty laggy, I think you should probably make it more performant so people don't think it's the DB that is causing the problem.

tigershark · on May 8, 2017

And why finland is "scrambled" with that big rectangle of noisy tweets with uniform distribution even in the sea? O_o

tmostak · on May 8, 2017

It's a bot.

latenightcoding · on May 8, 2017

After reading a book on RDBMS design I had a similar idea, but after thinking for a couple of minutes I just could't come up with a use case of a GPU-powered database, it would be significantly slower in most scenarios.

sargun · on May 8, 2017

I imagine that it's less interesting for general purpose OLTP, which is what a lot of RDBMSs target. GPUs become really interesting for OLAP use cases where you're trying to push through hypercubes as quickly as possible, and computation across them is embarrassingly parralizable.

_wmd · on May 8, 2017

Just curious, which book did you read?

latenightcoding · on May 8, 2017

I read a couple this is one of them: http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf

polskibus · on May 8, 2017

This is great stuff! I wonder how it makes IBM, Oracle and MS feel right now.

I can even see INSERTs allowed, wonder how it works with JOINs and if it is ever going to support transactions.

red2awn · on May 8, 2017

How does this compares to other GPU database like BlazingDB?

ape4 · on May 8, 2017

Is processing speed the bottleneck for databases?

remus · on May 8, 2017

It depends what you're doing. I haven't looked in a while, but last time I did the idea behind MapD was to produce interactive visualisations over lots of data. That can certainly get pretty computationally expensive.

random3 · on May 8, 2017

That's the big question I have too. I think typically it's I/O in one form or another. With GPUs sending data to the GPU seems to be another cost in the path.

However if you consider the GPU part as a way to push down processing power close to data, then you can probably solve more complex problems like graph processing efficiently within the DB..

tmostak · on May 8, 2017

GPUs have phenomenal memory bandwidth (multiple terabytes per second in a multi-GPU server) in addition to the computational parallelism they are known for. MapD caches the most recently touched data in the VRAM across multiple GPUs, meaning it can execute scan queries over billions of records in milliseconds.

jandrewrogers · on May 8, 2017

Bandwidth for moving data is typically the limiting factor if you are doing everything correctly, very rarely computation. Bottlenecks for workloads are commonly memory or network bandwidth these days.

manigandham · on May 8, 2017

It can be. It's a complex balance between processing speed and IO (from disk, ram and network).

GPUs are much more parallel and a great fit for most queries which can be broken up, so combine that with very fast GPU ram and local storage and you can eliminate network/distributed fetches to make scans even faster.

detaro · on May 8, 2017

GPUs can also have better memory bandwidth, if the relevant parts of the dataset fit in GPU memory and the access patterns work out. Which could happen for clever index structures and things like that. Or complex processing over limited results of (sub-)queries.