I like how he demonstrates the thought process. It seems like quite a mechanical recipe that can be applied to a lot of similar problems:
Step 1: A slow but easy implementation. It lets you make sure the algorithm is correct, and later lets you validate the faster variants.
Step 2: Algorithm and data-structure optimization, guided by profiling.
Step 3: Micro-optimization, again guided by profiling. A standard bag of tricks is applied. I didn't see cache friendliness/tiling in here, which surprised me.
Step 4: Parallelism, first by SIMD then by threads.
I am very much not putting the author down: this is a great demonstration of the process, it's a lot of hard brainy work, and we didn't see everything he tried that failed. So thank you, author.
My point is: it is not magic. It is engineering. It is a skill that can be acquired, taught, even planned and measured.
> My point is: it is not magic. It is engineering. It is a skill that can be acquired, taught, even planned and measured.
Unfortunately in this day and age, the art of optimization is lost. Companies are willing to ship shitty code to production and increase performance by throwing more hardware at it.
I think the intent is fine; it's an interesting problem to want to find the question or set of questions that is the best predictor of the other questions.
But I'd have thought there must be heuristics to help narrow the search.
For example you could calculate how each question correlates to all the other questions[1]. If you picked the set of k questions with the best correlation, then you have a good sense of which will be in the actual "k-corr set".
You could widen your search a little bit, but by chopping out any questions which correlate poorly with the overall score, you narrow your search greatly: going from 200 choose 5 to 10 choose 5 cuts the number of candidate subsets by a factor of roughly ten million, and you've probably considered all the sets that are likely to be in the final bucket.
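The pruning idea above can be sketched in plain Python. All the data and sizes here are made up for illustration; the real article's dataset and scoring differ:

```python
import math
import random
from itertools import combinations

random.seed(0)
n_students, n_questions = 500, 20  # toy sizes, not the article's 200 questions

# Fake 0/1 answer matrix and an overall score per student.
answers = [[random.randint(0, 1) for _ in range(n_questions)] for _ in range(n_students)]
totals = [sum(row) for row in answers]

def corr(xs, ys):
    # Pearson correlation, written out in pure python for self-containment.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Score each question by its correlation with the overall score...
per_question = [corr([row[q] for row in answers], totals) for q in range(n_questions)]

# ...keep only the 10 most correlated, and search 5-subsets inside that shortlist.
shortlist = sorted(range(n_questions), key=lambda q: per_question[q], reverse=True)[:10]
candidate_sets = list(combinations(shortlist, 5))
print(len(candidate_sets))                          # → 252, i.e. 10 choose 5
print(math.comb(200, 5) // math.comb(10, 5))        # the ~10-million-fold reduction
```

The heuristic can miss a subset whose members are individually weak but jointly strong, which is presumably why it only narrows the search rather than replacing it.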
It's crazy to me how the author spends most of his time finding out what hides in his abstractions, and then changing that. Especially those macros, and a few random crates used, make this code quite hard to reason about.
Moore's law is dead; that means developers who will not waste CPU cycles will be in demand for the next generation of compute-intensive development. Big Data needs big compute, so obviously these kinds of articles are not only pertinent but are the way ahead.
I would not be surprised if we rewrite all of the past 20 years' code into Rust or Go or another performant language.
Let's agree on this: you are smart and the author is right.
I am sorry, but the people who are complaining are not seeing where the ball is going.
I’m not sure why you’re being downvoted; even if you aren’t correct about the law, we’re seeing rising costs of cloud computing. We’ve cut our Azure cloud spending almost in half by being more efficient and rewriting core computing elements in performant languages. We’ve gone with C++ because that’s also what we use for our embedded solar plant programming, but we did make a few proofs of concept using Rust and liked what we learned. I think we’ll eventually begin to move more of our C/C++ into Rust as Rust matures, but we’re also not in a rush to do so.
But to get back to what you’re talking about here, we recently spoke with Lunar about their transition from a Node backend to a Java backend and then finally to a Go backend and how they can’t imagine building anything without Go going forward. Very interesting considering that the JVM is quite performant itself. I’m not sure we’re going to follow them down the Go rabbit hole, I think we’ll remain on a Node backend for non-performant stuff since it lets us use one language for most things. But I do think your predictions are right.
> the point of this post isn’t to compare highly-optimized Python to highly-optimized Rust. The point is to compare “standard-Jupyter-notebook” Python to highly-optimized Rust.
I guess the title gets clicks, but I'm curious how good python gets. I'm under the impression pandas is pretty fast despite it being python
Pandas can be pretty fast, but DuckDB and Polars are both faster than Pandas. DuckDB supports vectorized and parallelized operations on Pandas dataframes, while Polars is written in Rust.
I feel, though, that the killer is that inner loop where dataframe operations are performed across a large number of iterations, and there's significant overhead there.
For-loops are usually not the most performant solution in Python.
I could be wrong, but I feel that there's a SQL way to answer that question, in which case DuckDB might be able to exploit vectorization, parallelization, indexing and query optimization across the dataset in one fell swoop.
as a rule of thumb, pure python code is often 1000x slower than naively written unoptimized native code.
code in pandas can be very slow (standard pure-python speed) or "fast-for-python" depending on if you are going with or against the grain. a pandas dataframe is basically a bunch of numpy arrays, one array per column. if you do columnar calculations that can be reduced to numpy operations, like summing over a column, then numpy will execute the operation in native code, and it will be fast-for-python, but there will still be some overhead due to wrapping things in python, as well as perhaps temporary array allocation etc.
if you do something against the grain, such as expressing all your pandas calculations as row-wise operations instead of column-wise operations, or using "apply(lambda x: pure_python_expression_of(x))", then pandas cannot execute it efficiently, as the operations go against the grain of how things are stored in memory and cannot be reduced to the native code primitives on the column arrays that are implemented in numpy.
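a tiny illustration of the same point, on a made-up toy dataframe:

```python
# with-the-grain vs against-the-grain pandas on a toy frame
import pandas as pd

df = pd.DataFrame({"a": range(10_000), "b": range(10_000)})

# with the grain: one vectorized numpy operation per column
fast = df["a"] + df["b"]

# against the grain: a python-level lambda invoked once per row
slow = df.apply(lambda row: row["a"] + row["b"], axis=1)

assert (fast == slow).all()  # same answer; the apply version is much slower
```

both produce identical results, but the first runs as a single native-code loop over two contiguous arrays while the second pays python interpreter overhead on every row.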
another alternative to switching to rust is using cython to define a native python module. by starting with python code and using many of the same optimization rules of thumb in this post (static typing! avoid frequent tiny allocation, preallocate stuff! avoid hashing complex things, prefer arrays with indexes!), you can translate idiomatic (and very slow) pure python code into simple code that looks closer to array-oriented fortran-in-C, that runs very fast and compiles to a native python module that is easy to integrate.
Pandas is ok for numerical, but Polars (rust-based) is absolutely the way to go for big datasets.
The article is fascinating if you're a Python developer and you need to stray off the path of things Polars can do.
I think everybody agrees and is not pretending otherwise.
But my thing is, I think the problem could be solved in a different way in Python where we can use performant libraries to get an answer in a reasonable time.
The author says the naive Python implementation with a for-loop takes 36 milliseconds per iteration, and the problem requires 2.5 billion iterations (= 2.9 years, which is unreasonable) while optimized Rust takes 8 mins (corrected).
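For what it's worth, those figures check out, assuming the 2.5 billion iterations come from the 200-choose-5 subset count mentioned elsewhere in the thread:

```python
import math

# sanity-checking the numbers quoted above
subsets = math.comb(200, 5)            # ≈ 2.5 billion candidate subsets
seconds = subsets * 0.036              # 36 ms per iteration
years = seconds / (3600 * 24 * 365)
print(round(years, 1))                 # → 2.9
```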
I believe we can solve the problem in Python in a reasonable amount of time (not 2.9 years) by expressing it differently. And I believe we can do it in Python without trying to optimize Python operations like the author is doing with Rust.
Imagine your boss came up to you and said he needs the answer by this week and that you could only use Python: you would need to come up with a way to solve it. I wouldn't start by trying to optimize Python's for-loop -- I would break out of the loop paradigm altogether and use arrays, database indices, and optimized dataframe libraries (probably written in C++ or Rust) to get there. Because Python is not fast -- everyone knows this -- Python programmers will often think of other ways (generally reaching for libraries) to solve the problem.
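As a rough sketch of what that could look like (fabricated data, and a scaled-down 30-choose-3 search so it runs instantly), the whole subset search collapses into a few numpy array operations instead of a python-level loop per subset:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
answers = rng.integers(0, 2, size=(500, 30))        # students x questions, fake data
totals = answers.sum(axis=1).astype(float)          # stand-in for the "overall score"

subsets = np.array(list(combinations(range(30), 3)))  # every 3-question subset
# one fancy-index gather + reduction scores every subset at once
subset_scores = answers[:, subsets].sum(axis=2)       # shape (students, n_subsets)

# pearson correlation of each subset score against the total, fully vectorized
s = subset_scores - subset_scores.mean(axis=0)
t = totals - totals.mean()
num = (s * t[:, None]).sum(axis=0)
den = np.sqrt((s ** 2).sum(axis=0)) * np.sqrt((t ** 2).sum())
corrs = num / np.maximum(den, 1e-12)   # epsilon guards against a zero-variance subset

best = subsets[int(corrs.argmax())]
print(best)   # indices of the best-correlated 3-question subset
```

Whether this beats optimized Rust is beside the point; it turns billions of interpreter iterations into a handful of native array passes, which is the idiomatic Python move.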
The industry agreed to standardize on Python for the task of describing compute-graphs that get executed by compute engines implemented in something other than Python. Python is not meant to be used for the computation itself.
And then other people came along, using other languages, and realised we could do the same compute, with a fraction of the hardware, in about the same time, in the languages they’re already using, which really diminishes the appeal of Python, which boils down to:
“Setup some infra, and get someone who knows how to operate it, and then write new code in python, and babysit it as it ossifies into a core piece of infrastructure”
_or_ optimise it in a language you’re already using, or is semi-common in your stack already, and then move on with things.
Yeah, this seems like really unoptimized Python. It would be far more useful to see tricks for torturing Python/Pandas to make things faster and how far you can take it.
The author wrote in a strongly typed, ahead-of-time compiled language, ran a profiler to identify hotspots for optimization, and the program is now dramatically faster than unoptimized Python.
I suppose there are already many articles showing how to speed up calculations by avoiding/optimizing pandas.
It does feel like a somewhat unfair comparison. Everyone knows that for loops are slow in Python, as is much of the core library. But pushing analysis to C using pythonic APIs (numpy/numba/pytorch) is fairly trivial.
There’s an optimization technique I’ve used that I think could be useful for this kind of problem. It’s similar in spirit to the post.
The post deals a lot both with strings and maps with strings as keys. One idea is to intern these strings using an interner like [1], which returns string keys as monotonically increasing integers starting at 0. Then instead of a map you can just have a regular vector with the interned key as the index into the vector. This gives you the map functionality with very good performance.
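In Python terms, the idea looks something like this (a from-scratch sketch, not the crate the comment links to):

```python
# A minimal interner: strings map to dense integers 0, 1, 2, ...
# so a plain list can replace a {string: value} dict.
class Interner:
    def __init__(self):
        self._ids = {}

    def intern(self, s: str) -> int:
        # Return the existing id, or assign the next dense integer.
        return self._ids.setdefault(s, len(self._ids))

interner = Interner()
a = interner.intern("alice")            # 0
b = interner.intern("bob")              # 1
assert interner.intern("alice") == a    # stable: same string, same id

# Dense ids let a list stand in for a map keyed by strings:
counts = [0] * 2
counts[a] += 1
counts[b] += 5
print(counts)  # → [1, 5]
```

The win is that an indexed load into a vector is a single bounds-checked array access, with no hashing and no key comparison on the hot path.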
The more I use rust, the more I fall in love with writing code again.
This is after many years of corporate java/c#/python/ruby shops. The number of "developers" and "architects" that don't understand low level concepts is draining and disappointing. Worked with too many corporate "code monkeys" that masquerade themselves as "engineers" or "architects". It has got me jaded sometimes.
The author of this post has given me hope that there are people out there willing to push the boundaries of their code with their years of experience and understanding of low level computer concepts.
I love demo apps like this. You can see some idiomatic code and then watch it descend into chaos but you can see the pathway. Otherwise you sometimes see the end result and you don't know why it is what it is, but the pathway makes it comprehensible. Thanks for writing it.
Roughly half the article could have been done without leaving python though. I don't know if python has a good bitset implementation, but at the very least everything before that could have been done in python. So it is more "comparing manually optimized code against unoptimized crap"
The difference between baseline Python and baseline Rust is only 8x here, because most of the heavy lifting in the Python version is done by native code.
True, but since the same example can't be run on a Rust-based bytecode interpreter or a Python AOT compiler, due to a lack of implementations to use for benchmarking purposes, the remark remains valid.
Great engineering. Since the article asks about analytical tips for performance improvement: it looks like you should be able to get the complexity down substantially using something like least angle regression (i.e. what's used by efficient implementations of Lasso shrinkage estimators), or basically any approach where you successively add the most highly correlated variable to your set instead of checking all size-k subsets.
This is a legitimate problem we encounter in the healthcare data analytics space, where it would be great to find the minimum list of diagnosis/procedure codes that correlate most with a particular disease, etc.
Clearly Betamax. It certainly has its uses and its development is led by the right people, but there's no magic.
It's just another LLVM frontend.
Also I'm slightly annoyed by the fact that they kept talking about great GPU performance in their marketing material and all you can find in the docs is: "it's coming. probably. only to Nvidia hardware btw."
I think it's unlikely. As I understand it, it's essentially a nicer version of Cython, and that is still extremely niche.
The problem is you're advertising performance and static typing to Python programmers, but pretty much anyone who is voluntarily choosing to use Python has already decided they don't care a jot about performance or static typing.
Everyone else who is forced to use Python doesn't have the option of introducing another language anyway.
Comparing highly optimized code (including total algorithm rewrite and relying on unsafe and SIMD operations) without doing the same on the other side is a pointless exercise.
It's like showing how much faster you can get your handcrafted assembly code to run vs a bash script.
I don't think the point of the article is a rust-v-python comparison. Sure, unoptimized-python2rust2optimized rust was a part of the clickbait 180,000x title but...
- the author explicitly says the python2rust bit is only a factor of 8.
- the Python bit is just one section of a pretty long article
- the author says they typically start implementing something in Python so... why wouldn't they include their initial implementation?
I don't think this article should be dismissed just for including a simple python implementation- that isn't the point.
The author was clear about what he was setting out to demonstrate in the piece.
This format is likely very useful for anyone working on such large sets using Jupyter/Python and waiting days for scripts to complete -- there is nothing wrong with those final tricky optimisations when applied to single-use scripts.
I found it a useful reminder that there is often more to be squeezed out of inner loops with a few minutes' more thought.
articles like this aren't pointless, i don't think it is meant to be framed as a fair comparison of "rust-the-language is 5 orders of magnitude faster than python-the-language". this article does offer examples of how to use a profiler and some ideas of what kind of things can be bottlenecks and how to eliminate them, and also helps people who only ever work with very slow programming tech stacks to understand what kind of speedups can be attainable in practice when using a tech stack that can use a machine's resources efficiently. it's a good write up and can be educational in a variety of ways for some segments of readers. if you only ever do pure python programming, and that informs your beliefs of what "fast" and "slow" programs feel like, seeing what performance cheap modern CPUs are actually capable of when used effectively is a bit like being exposed to alien technology.
Intentional or not, it does come off as kind of “disruptive” in the negative sense of the word, ie “you’re all doing it wrong in field X. Let me show you how to work faster”.
But in a real scenario there are often so many other constraints that “optimized code” is far from the top. Salaries are also expensive, so if you factor in the lifetime cost of writing and maintaining a bit of exotic Rust for an exotic problem, it might actually be cheaper to buy a bigger machine and brute force the problem.
Sure, we could all benefit from learning a bit of Rust then, but if you’re in the Python data space you know that it’s almost as crowded as frontend frameworks, and learning Rust comes across as a yet-another-framework problem, that few of us really feel like we have time for.
Personally I’m stuck maintaining a wizard’s solo project (he has since left the company) which initially seemed like a great solution, but has caused more grief in maintenance and bugs than any existing not-as-custom-tailored community solution out there. So if I were to tell my manager “I’ve spent a week rewriting this job that spends 40 minutes at 2 AM to spend 1 millisecond at 2 AM, from a language that everyone uses to a language that only I use”, I’m not sure he’d appreciate it.
i agree that in typical business situations often the main constraints to solve for aren't squeezing the most compute out of a CPU, it's rarely the bottleneck (although of course some people can no doubt chime in from contexts where compute is frequently the bottleneck and what they spend most of their time fretting about).
good engineering should figure out what the main constraints are, and solve for those, while doing a "good enough" job for everything else and no more -- like you allude to, the main bottlenecks to solve for are usually things like reducing engineering effort, reducing schedule, reducing long term maintenance cost of the overall system (not just this isolated component) - especially how easy it is to hire someone who can understand and fix the thing.
but, this article isn't a "how to make good whole-system engineering tradeoffs for the long-term benefit or your employer / client" article, it's an article showing how to write fast rust code.
> but, this article isn't a "how to make good whole-system engineering tradeoffs for the long-term benefit or your employer / client" article, it's an article showing how to write fast rust code.
As you can probably tell from my sentiment above, it doesn’t help the author’s cause by basing the premise on a strawman. If the intention was to push Rust for Python users, this article shoots far, really far, but completely misses the mark. Nobody was telling themselves that Python was the fastest language out there.
I very much disagree with your assessment. I will agree that the title is extremely clickbaity, but the author goes into how the move from Python to Rust only accounts for an 8x increase, which is impressive, but then we all know Python isn’t performant. Aside from the clickbait title I think the article is extremely interesting and rather relevant, as it takes you through the craft of optimising Rust code.
I guess you could make a point out of how the author didn’t need to bring Python into the article, but on the flip side, they clearly state that they typically prototype in Python. So it would also be an incomplete article if they left out the initial implementation for the problem they are solving.
you're getting downvoted because you said it in a weird way, but fundamentally I have to agree. Both C and C++, if written idiomatically, have less syntactic sugar and macros are less accepted, so that their code ends up hiding a lot less complexity than rust code.
Fairly complex code like what the author showed turns out to use a bunch of different random things nobody expected.
> code ends up hiding a lot less complexity than rust code
Are you really saying this about C++? For real?
I quite like C++ but the idea that it doesn't hide complexity is hilarious. Something as simple as `a = b` can call copy constructors, move constructors, assignment operators (depending on a lot of details), do implicit type conversion, throw exceptions, etc. etc.
In Rust it's just memcpy.
Maybe you're thinking of proc macros which can do a lot more than in C++, but they're just a more convenient way of doing codegen which you can equally do in C++ (e.g. to generate Protobuf code).
yep, i agree. most pro rust arguments are from a safety perspective or that rust does a lot for you.
the problem is that in a lot of applications nobody cares about safety or security. it is just that devs operate in environments where that matters, and they think it therefore matters everywhere, all the time.
when a language does something for you, you typically don't learn what it actually does under the hood, which will come back to bite you later. i have some of that problem with the latest C++ features as well, btw, which is why my C++ tends to look like C with classes and here-and-there use of a modern feature.
I disagree. There was an article somewhere about Servo, mentioning advantages of Rust vs C++. One example was how fewer intermediate copies were made, as Rust made it clear what data was immutable. Another was how Rust made threading easier, as it was clearer which data needed locks.
Rust safety at the small scale is an enabler for stability at the large scale, giving the possibility to take riskier optimizations. This is exactly where C and C++ fall flat. You can write small-scale stable software with them, but multi-decade, multi-team, 1000k+ line software turns into a CVE fest. A lot more effort in documenting interface details and validation needs to be done, because the compiler can't do it for you.
Implicit copy in C++ is a horrible thing, but very consistent. You learn once that anything that doesn't fit in a register should be passed by (const) reference, and to always delete your copy ctors by default, and that does a lot for you.
The author found that HashMap::get was taking most of the time and didn't swap the hashing function??
It's well known that Rust's default HashMap hashing function is slow because it's designed to be safe against DoS [1]. If you want your HashMap to be faster, just use a faster hashing function like ahash[2], which can be up to 40% faster.
> Note that everywhere we refer to HashMap is actually an alias for fxhash::FxHashMap, which is just std::collections::HashMap with a more efficient hashing algorithm