I like how he demonstrates the thought process. It seems like quite a mechanical recipe that can be applied to a lot of similar problems:
Step 1: A slow but easy implementation. It lets you make sure the algorithm is correct, and later lets you validate the faster variants.
Step 2: Algorithm and data-structure optimization, guided by profiling.
Step 3: Micro-optimization, again guided by profiling. A standard bag of tricks is applied. I didn't see cache friendliness/tiling in here, which surprised me.
Step 4: Parallelism, first by SIMD then by threads.
I am very much not putting the author down: this is a great demonstration of the process, it's a lot of hard brainy work, and we didn't see everything he tried that failed. So thank you, author.
My point is: it is not magic. It is engineering. It is a skill that can be acquired, taught, even planned and measured.
> My point is: it is not magic. It is engineering. It is a skill that can be acquired, taught, even planned and measured.
Unfortunately in this day and age, the art of optimization is lost. Companies are willing to ship shitty code to production and increase performance by throwing more hardware at it.
I think the intent is fine; it's an interesting problem to want to find the question or set of questions that is the best predictor of the other questions.
But I'd have thought there must be heuristics to help narrow the search.
For example you could calculate how each question correlates to all the other questions[1]. If you picked the set of k questions with the best correlation, then you have a good sense of which will be in the actual "k-corr set".
You could widen your search a little bit, but by chopping out any questions which correlate poorly with the overall score, you narrow your search greatly: going from 200 choose 5 to 10 choose 5 cuts the number of candidate subsets by a factor of roughly ten million, and you've probably considered all the sets that are likely to be in the final bucket.
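The pruning idea above can be sketched in plain Python. All the data and sizes here are made up for illustration; the real article's dataset and scoring differ:

```python
import math
import random
from itertools import combinations

random.seed(0)
n_students, n_questions = 500, 20  # toy sizes, not the article's 200 questions

# Fake 0/1 answer matrix and an overall score per student.
answers = [[random.randint(0, 1) for _ in range(n_questions)] for _ in range(n_students)]
totals = [sum(row) for row in answers]

def corr(xs, ys):
    # Pearson correlation, written out in pure python for self-containment.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Score each question by its correlation with the overall score...
per_question = [corr([row[q] for row in answers], totals) for q in range(n_questions)]

# ...keep only the 10 most correlated, and search 5-subsets inside that shortlist.
shortlist = sorted(range(n_questions), key=lambda q: per_question[q], reverse=True)[:10]
candidate_sets = list(combinations(shortlist, 5))
print(len(candidate_sets))                          # → 252, i.e. 10 choose 5
print(math.comb(200, 5) // math.comb(10, 5))        # the ~10-million-fold reduction
```

The heuristic can miss a subset whose members are individually weak but jointly strong, which is presumably why it only narrows the search rather than replacing it.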
It's crazy to me how the author spends most of his time finding out what hides in his abstractions, and then changing that. Especially those macros, and a few random crates used, make this code quite hard to reason about.
Moore's law is dead; that means developers who will not waste CPU cycles will be in demand for the next generation of compute-intensive development. Big Data needs big compute, so obviously these kinds of articles are not only pertinent but are the way ahead.
I would not be surprised if we rewrite all of the past 20 years' code into Rust or Go or another performant language.
Let's agree on this: you are smart and the author is right.
I am sorry, but the people who are complaining are not seeing where the ball is going.
I’m not sure why you’re being downvoted; even if you aren’t correct about the law, we’re seeing rising costs of cloud computing. We’ve cut our Azure cloud spending almost in half by being more efficient and rewriting core computing elements in performant languages. We’ve gone with C++ because that’s also what we use for our embedded solar plant programming, but we did make a few proofs of concept using Rust and liked what we learned. I think we’ll eventually begin to move more of our C/C++ into Rust as Rust matures, but we’re also not in a rush to do so.
But to get back to what you’re talking about here, we recently spoke with Lunar about their transition from a Node backend to a Java backend and then finally to a Go backend and how they can’t imagine building anything without Go going forward. Very interesting considering that the JVM is quite performant itself. I’m not sure we’re going to follow them down the Go rabbit hole, I think we’ll remain on a Node backend for non-performant stuff since it lets us use one language for most things. But I do think your predictions are right.
> the point of this post isn’t to compare highly-optimized Python to highly-optimized Rust. The point is to compare “standard-Jupyter-notebook” Python to highly-optimized Rust.
I guess the title gets clicks, but I'm curious how good python gets. I'm under the impression pandas is pretty fast despite it being python
Pandas can be pretty fast, but DuckDB and Polars are both faster than Pandas. DuckDB supports vectorized and parallelized operations on Pandas dataframes, while Polars is written in Rust.
I feel, though, that the killer is that inner loop where dataframe operations are performed across a large number of iterations, and there's significant overhead there.
For-loops are usually not the most performant solution in Python.
I could be wrong, but I feel that there's a SQL way to answer that question, in which case DuckDB might be able to exploit vectorization, parallelization, indexing and query optimization across the dataset in one fell swoop.
as a rule of thumb, pure python code is often 1000x slower than naively written unoptimized native code.
code in pandas can be very slow (standard pure-python speed) or "fast-for-python" depending on if you are going with or against the grain. a pandas dataframe is basically a bunch of numpy arrays, one array per column. if you do columnar calculations that can be reduced to numpy operations, like summing over a column, then numpy will execute the operation in native code, and it will be fast-for-python, but there will still be some overhead due to wrapping things in python, as well as perhaps temporary array allocation etc.
if you do something against the grain, such as expressing all your pandas calculations as row-wise operations instead of column-wise operations, or using "apply(lambda x: pure_python_expression_of(x))", then pandas cannot execute it efficiently, as the operations go against the grain of how things are stored in memory and cannot be reduced to the native code primitives on the column arrays that are implemented in numpy.
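a tiny illustration of the same point, on a made-up toy dataframe:

```python
# with-the-grain vs against-the-grain pandas on a toy frame
import pandas as pd

df = pd.DataFrame({"a": range(10_000), "b": range(10_000)})

# with the grain: one vectorized numpy operation per column
fast = df["a"] + df["b"]

# against the grain: a python-level lambda invoked once per row
slow = df.apply(lambda row: row["a"] + row["b"], axis=1)

assert (fast == slow).all()  # same answer; the apply version is much slower
```

both produce identical results, but the first runs as a single native-code loop over two contiguous arrays while the second pays python interpreter overhead on every row.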
another alternative to switching to rust is using cython to define a native python module. by starting with python code and using many of the same optimization rules of thumb in this post (static typing! avoid frequent tiny allocation, preallocate stuff! avoid hashing complex things, prefer arrays with indexes!), you can translate idiomatic (and very slow) pure python code into simple code that looks closer to array-oriented fortran-in-C, that runs very fast and compiles to a native python module that is easy to integrate.
Pandas is ok for numerical, but Polars (rust-based) is absolutely the way to go for big datasets.
The article is fascinating if you're a Python developer and you need to stray off the path of things Polars can do.
I think everybody agrees and is not pretending otherwise.
But my thing is, I think the problem could be solved in a different way in Python where we can use performant libraries to get an answer in a reasonable time.
The author says the naive Python implementation with a for-loop takes 36 milliseconds per iteration, and the problem requires 2.5 billion iterations (= 2.9 years, which is unreasonable) while optimized Rust takes 8 mins (corrected).
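For what it's worth, those figures check out, assuming the 2.5 billion iterations come from the 200-choose-5 subset count mentioned elsewhere in the thread:

```python
import math

# sanity-checking the numbers quoted above
subsets = math.comb(200, 5)            # ≈ 2.5 billion candidate subsets
seconds = subsets * 0.036              # 36 ms per iteration
years = seconds / (3600 * 24 * 365)
print(round(years, 1))                 # → 2.9
```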
I believe we can solve the problem in Python in a reasonable amount of time (not 2.9 years) by expressing it differently. And I believe we can do it in Python without trying to optimize Python operations like the author is doing with Rust.
Imagine your boss came up to you and said he needs the answer by this week and that you could only use Python: you would need to come up with a way to solve it. I wouldn't start by trying to optimize Python's for-loop -- I would break out of the loop paradigm altogether and use arrays, database indices, and optimized dataframe libraries (probably written in C++ or Rust) to get there. Because Python is not fast -- everyone knows this -- Python programmers will often think of other ways (generally reaching for libraries) to solve the problem.
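As a rough sketch of what that could look like (fabricated data, and a scaled-down 30-choose-3 search so it runs instantly), the whole subset search collapses into a few numpy array operations instead of a python-level loop per subset:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
answers = rng.integers(0, 2, size=(500, 30))        # students x questions, fake data
totals = answers.sum(axis=1).astype(float)          # stand-in for the "overall score"

subsets = np.array(list(combinations(range(30), 3)))  # every 3-question subset
# one fancy-index gather + reduction scores every subset at once
subset_scores = answers[:, subsets].sum(axis=2)       # shape (students, n_subsets)

# pearson correlation of each subset score against the total, fully vectorized
s = subset_scores - subset_scores.mean(axis=0)
t = totals - totals.mean()
num = (s * t[:, None]).sum(axis=0)
den = np.sqrt((s ** 2).sum(axis=0)) * np.sqrt((t ** 2).sum())
corrs = num / np.maximum(den, 1e-12)   # epsilon guards against a zero-variance subset

best = subsets[int(corrs.argmax())]
print(best)   # indices of the best-correlated 3-question subset
```

Whether this beats optimized Rust is beside the point; it turns billions of interpreter iterations into a handful of native array passes, which is the idiomatic Python move.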
The industry agreed to standardize on Python for the task of describing compute-graphs that get executed by compute engines implemented in something other than Python. Python is not meant to be used for the computation itself.
And then other people came along, using other languages, and realised we could do the same compute, with a fraction of the hardware, in about the same time, in the languages they’re already using, which really diminishes the appeal of Python, which boils down to:
“Setup some infra, and get someone who knows how to operate it, and then write new code in python, and babysit it as it ossifies into a core piece of infrastructure”
_or_ optimise it in a language you’re already using, or is semi-common in your stack already, and then move on with things.
Yeah, this seems like really unoptimized Python. It would be far more useful to see tricks for torturing Python/Pandas to make things faster and how far you can take it.
The author wrote in a strongly typed, ahead-of-time compiled language, ran a profiler to identify hotspots for optimization, and the program is now dramatically faster than unoptimized Python.
I suppose there are already many articles showing how to speed up calculations by avoiding/optimizing pandas.
It does feel like a somewhat unfair comparison. Everyone knows that for loops are slow in Python, as is much of the core library. But pushing analysis to C using pythonic APIs (numpy/numba/pytorch) is fairly trivial.
There’s an optimization technique I’ve used that I think could be useful for this kind of problem. It’s similar in spirit to the post.
The post deals a lot both with strings and maps with strings as keys. One idea is to intern these strings using an interner like [1], which returns string keys as monotonically increasing integers starting at 0. Then instead of a map you can just have a regular vector with the interned key as the index into the vector. This gives you the map functionality with very good performance.
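In Python terms, the idea looks something like this (a from-scratch sketch, not the crate the comment links to):

```python
# A minimal interner: strings map to dense integers 0, 1, 2, ...
# so a plain list can replace a {string: value} dict.
class Interner:
    def __init__(self):
        self._ids = {}

    def intern(self, s: str) -> int:
        # Return the existing id, or assign the next dense integer.
        return self._ids.setdefault(s, len(self._ids))

interner = Interner()
a = interner.intern("alice")            # 0
b = interner.intern("bob")              # 1
assert interner.intern("alice") == a    # stable: same string, same id

# Dense ids let a list stand in for a map keyed by strings:
counts = [0] * 2
counts[a] += 1
counts[b] += 5
print(counts)  # → [1, 5]
```

The win is that an indexed load into a vector is a single bounds-checked array access, with no hashing and no key comparison on the hot path.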
The more I use rust, the more I fall in love with writing code again.
This is after many years of corporate java/c#/python/ruby shops. The number of "developers" and "architects" that don't understand low level concepts is draining and disappointing. Worked with too many corporate "code monkeys" that masquerade themselves as "engineers" or "architects". It has got me jaded sometimes.
The author of this post has given me hope that there are people out there willing to push the boundaries of their code with their years of experience and understanding of low level computer concepts.
I love demo apps like this. You can see some idiomatic code and then watch it descend into chaos but you can see the pathway. Otherwise you sometimes see the end result and you don't know why it is what it is, but the pathway makes it comprehensible. Thanks for writing it.
Roughly half the article could have been done without leaving python though. I don't know if python has a good bitset implementation, but at the very least everything before that could have been done in python. So it is more "comparing manually optimized code against unoptimized crap"
The difference between baseline Python and baseline Rust is only 8x here, because most of the heavy lifting in the Python version is done by native code.
True, but since the same example can't be run on a Rust-based bytecode interpreter or a Python AOT compiler, due to a lack of implementations to use for benchmarking purposes, the remark remains valid.
Great engineering. Since the article asks about analytical tips for performance improvement: it looks like you should be able to get the complexity down substantially using something like least angle regression (i.e. what's used by efficient implementations of Lasso shrinkage estimators), or basically any approach where you successively add the most highly correlated variable to your set instead of checking all size-k subsets.
This is a legitimate problem we encounter in the healthcare data analytics space, where it would be great to find the minimum list of diagnosis/procedure codes that correlate most with a particular disease, etc.
Clearly Betamax. It certainly has its uses and its development is led by the right people, but there's no magic.
It's just another LLVM frontend.
Also I'm slightly annoyed by the fact that they kept talking about great GPU performance in their marketing material and all you can find in the docs is: "it's coming. probably. only to Nvidia hardware btw."
I think it's unlikely. As I understand it, it's essentially a nicer version of Cython, and that is still extremely niche.
The problem is you're advertising performance and static typing to Python programmers, but pretty much anyone who is voluntarily choosing to use Python has already decided they don't care a jot about performance or static typing.
Everyone else who is forced to use Python doesn't have the option of introducing another language anyway.
Comparing highly optimized code (including total algorithm rewrite and relying on unsafe and SIMD operations) without doing the same on the other side is a pointless exercise.
It's like showing how much faster you can get your handcrafted assembly code to run vs a bash script.
I don't think the point of the article is a rust-v-python comparison. Sure, unoptimized-python2rust2optimized rust was a part of the clickbait 180,000x title but...
- the author explicitly says the python2rust bit is only a factor of 8.
- the Python bit is just one section of a pretty long article
- the author says they typically start implementing something in Python so... why wouldn't they include their initial implementation?
I don't think this article should be dismissed just for including a simple python implementation- that isn't the point.
The author was clear about what he was setting out to demonstrate in the piece.
This format is likely very useful for anyone working on such large sets using Jupyter/Python and waiting days for scripts to complete -- there is nothing wrong with those final tricky optimisations when applied to single-use scripts.
I found it a useful reminder that there is often more to be squeezed out of inner loops with a few minutes' more thought.
articles like this aren't pointless, i don't think it is meant to be framed as a fair comparison of "rust-the-language is 5 orders of magnitude faster than python-the-language". this article does offer examples of how to use a profiler and some ideas of what kind of things can be bottlenecks and how to eliminate them, and also helps people who only ever work with very slow programming tech stacks to understand what kind of speedups can be attainable in practice when using a tech stack that can use a machine's resources efficiently. it's a good write up and can be educational in a variety of ways for some segments of readers. if you only ever do pure python programming, and that informs your beliefs of what "fast" and "slow" programs feel like, seeing what performance cheap modern CPUs are actually capable of when used effectively is a bit like being exposed to alien technology.
Intentional or not, it does come off as kind of “disruptive” in the negative sense of the word, ie “you’re all doing it wrong in field X. Let me show you how to work faster”.
But in a real scenario there are often so many other constraints that “optimized code” is far from the top. Salaries are also expensive, so if you factor in the lifetime cost of writing and maintaining a bit of exotic Rust for an exotic problem, it might actually be cheaper to buy a bigger machine and brute force the problem.
Sure, we could all benefit from learning a bit of Rust then, but if you’re in the Python data space you know that it’s almost as crowded as frontend frameworks, and learning Rust comes across as a yet-another-framework problem, that few of us really feel like we have time for.
Personally I’m stuck maintaining a wizard’s solo project (he has since left the company) which initially seemed like a great solution, but has caused more grief in maintenance and bugs than any existing not-as-custom-tailored community solution out there. So if I were to tell my manager “I’ve spent a week rewriting this job that spends 40 minutes at 2 AM to spend 1 millisecond at 2 AM, from a language that everyone uses to a language that only I use”, I’m not sure he’d appreciate it.
i agree that in typical business situations often the main constraints to solve for aren't squeezing the most compute out of a CPU, it's rarely the bottleneck (although of course some people can no doubt chime in from contexts where compute is frequently the bottleneck and what they spend most of their time fretting about).
good engineering should figure out what the main constraints are, and solve for those, while doing a "good enough" job for everything else and no more -- like you allude to, the main bottlenecks to solve for are usually things like reducing engineering effort, reducing schedule, reducing long term maintenance cost of the overall system (not just this isolated component) - especially how easy it is to hire someone who can understand and fix the thing.
but, this article isn't a "how to make good whole-system engineering tradeoffs for the long-term benefit or your employer / client" article, it's an article showing how to write fast rust code.
> but, this article isn't a "how to make good whole-system engineering tradeoffs for the long-term benefit or your employer / client" article, it's an article showing how to write fast rust code.
As you can probably tell from my sentiment above, it doesn’t help the author’s cause by basing the premise on a strawman. If the intention was to push Rust for Python users, this article shoots far, really far, but completely misses the mark. Nobody was telling themselves that Python was the fastest language out there.
I very much disagree with your assessment. I will agree that the title is extremely clickbaity, but the author goes into how the move from Python to Rust only accounts for an 8x increase, which is impressive, but then we all know Python isn’t performant. Aside from the clickbait title I think the article is extremely interesting and rather relevant, as it takes you through the craft of optimising Rust code.
I guess you could make a point out of how the author didn’t need to bring Python into the article, but on the flip side, they clearly state that they typically prototype in Python. So it would also be an incomplete article if they left out the initial implementation for the problem they are solving.
you're getting downvoted because you said it in a weird way, but fundamentally I have to agree. Both C and C++, if written idiomatically, have less syntactic sugar and macros are less accepted, so that their code ends up hiding a lot less complexity than rust code.
Fairly complex code like what the author showed turns out to use a bunch of different random things nobody expected.
> code ends up hiding a lot less complexity than rust code
Are you really saying this about C++? For real?
I quite like C++ but the idea that it doesn't hide complexity is hilarious. Something as simple as `a = b` can call copy constructors, move constructors, assignment operators (depending on a lot of details), do implicit type conversion, throw exceptions, etc. etc.
In Rust it's just memcpy.
Maybe you're thinking of proc macros which can do a lot more than in C++, but they're just a more convenient way of doing codegen which you can equally do in C++ (e.g. to generate Protobuf code).
yep, i agree. most pro rust arguments are from a safety perspective or that rust does a lot for you.
the problem is that in a lot of applications nobody cares about safety or security. it is just that devs operate in environments where that matters, and they think it therefore matters everywhere, all the time.
when a language does something for you, you typically don't learn what it actually does under the hood, which will come back to bite you later. i have some of that problem with the latest C++ features as well, btw, which is why my C++ tends to look like C with classes and here-and-there use of a modern feature.
I disagree. There was an article somewhere about Servo, mentioning advantages of Rust vs C++. One example was how fewer intermediate copies were made, as Rust made it clear what data was immutable. Another was how Rust made threading easier, as it was clearer which data needed locks.
Rust safety at the small scale is an enabler for stability at the large scale, giving the possibility to take riskier optimizations. This is exactly where C and C++ fall flat. You can write small-scale stable software with them, but multi-decade, multi-team, 1000k+ line software turns into a CVE fest. A lot more effort in documenting interface details and validation needs to be done, because the compiler can't do it for you.
Implicit copy in C++ is a horrible thing, but very consistent. You learn once that anything that doesn't fit in a register should be passed by (const) reference, and to always delete your copy ctors by default, and that does a lot for you.
The author found that HashMap::get was taking most of the time and didn't swap the hashing function??
It's well known that Rust's default HashMap hashing function is slow because it's designed to be safe against DoS [1]. If you want your HashMap to be faster, just use a faster hashing function like ahash[2], which can be up to 40% faster.
> Note that everywhere we refer to HashMap is actually an alias for fxhash::FxHashMap, which is just std::collections::HashMap with a more efficient hashing algorithm