Well, Claude 3.5 can do translation from one language to another in a fairly competent manner if the languages are close enough. I've used it for that task myself with success (Java -> JavaScript).
But, this isn't just about rewriting code from one language to another. It's about reverse engineering complex information out of the code, which may not be immediately visible in it, and then finding a way to make it "safe" according to Rust's type system. Where's the training data for that? It'd be really hard even for skilled humans.
Personally I think the most pragmatic way to make C/C++ memory safe quicker is one of two approaches:
1. Incrementally. Make std::vector[] properly bounds checked (still not done even in chrome!), convert allocations to allocations that know their own size and do bounds checking e.g. https://issues.chromium.org/issues/40285824
2. Or, go the whole hog and use runtime techniques like garbage collection and runtime bounds checks.
A good example of approach (2) is Managed Sulong, which extends the JVM to execute LLVM bitcode directly whilst exposing to the C/C++/FORTRAN a virtualized Linux syscall interface. The whole piece of code can be sandboxed with permissions, and memory safety errors are caught at runtime. The compiler tries to optimize out as many bounds checks as possible. The interesting thing about this approach is it doesn't require big changes to the source code (as long as it's already been ported to Linux), which means the work of making something safe can be done by teams independent of the original authors. In practice "rewrite it in Rust" will usually mean a fork, which introduces lots of complicated technical, cultural and economic issues.
Managed Sulong is also a research project and has a bunch of problems to solve, for instance it needs to lose the JITC dependency and go fully AOT compiled (doable, there's no theoretical issue with it and much of the needed infra already exists). And performance/memory usage can always be improved of course, it regresses vs the original C. But those are "just" systems engineering problems, not rewrite-the-world and solve-static-analysis problems.
Disclosure: I do work part time at Oracle Labs which developed Managed Sulong, but I don't work on it.
> But, this isn't just about rewriting code from one language to another. It's about reverse engineering complex information out of the code, which may not be immediately visible in it, and then finding a way to make it "safe" according to Rust's type system. Where's the training data for that? It'd be really hard even for skilled humans.
That might not be too bad.
A combination of a formal system and an LLM might work here. Suppose we see a C function
void somefn(char* buf, int n);
First question: is "buf" a pointer to an array, or a pointer to a single char? That can be answered by looking at what the function does with "buf", and what callers pass to it.
If it's an array, how big is it? We don't have enough info to know that yet. But a reasonable guess, and one that an LLM might make, is that the length of buf is "n".
Following that assumption, it's reasonable to translate this to Rust as
fn somefn(buf: &[u8])
and, if n is needed within the function, use
buf.len()
The next step is to validate that guess. The run-time approach is to rewrite all calls to "somefn" as
assert!(buf.len() == n);
somefn(buf, n);
Maybe formal methods can prove the assert true, and we can take it out. Or if a SAT solver or a fuzz tester can generate a counterexample, we know that the guess was wrong and this has to be done the hard way, as
fn somefn(buf: &[u8], n: usize)
implying more subscript checks inside "somefn".
The idea is to recognize common C idioms and do clean translations to Rust for them. This should handle a high percentage of cases.
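A minimal sketch of that guess-then-validate step. The body of "somefn" isn't given, so summing the bytes is just a placeholder, and "somefn_shim" and "somefn_fallback" are invented names for illustration:

```rust
/// Idiomatic translation under the guess that `n` is always the length of `buf`.
/// The body (summing bytes) is a placeholder; the original body is not given.
fn somefn(buf: &[u8]) -> u32 {
    buf.iter().map(|&b| u32::from(b)).sum()
}

/// Hypothetical transitional shim: keeps the C-style (buf, n) call shape and
/// checks the guessed invariant at run time before delegating.
fn somefn_shim(buf: &[u8], n: usize) -> u32 {
    assert!(buf.len() == n, "guess violated: buf.len() = {}, n = {}", buf.len(), n);
    somefn(buf)
}

/// Fallback if the guess is refuted: `n` stays a separate parameter and every
/// access is checked against the real slice length.
fn somefn_fallback(buf: &[u8], n: usize) -> Option<u32> {
    let mut total = 0u32;
    for i in 0..n {
        // `get` returns None instead of reading out of bounds.
        total += u32::from(*buf.get(i)?);
    }
    Some(total)
}

fn main() {
    let data = [1u8, 2, 3];
    println!("{}", somefn_shim(&data, data.len()));
    println!("{:?}", somefn_fallback(&data, 5));
}
```

If fuzzing never trips the assert in the shim (or a prover discharges it), callers can be switched to "somefn" directly and the shim deleted.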
Yes, this is similar to what IntelliJ does for Java->Kotlin. Do a first pass that's extremely non-idiomatic and mechanical, then do lots of automated refactoring to bring it closer to idiomatic.
But if you're going to do it that way, the right place to start is probably with a safer form of C++, not Rust. That way code can be ported file-at-a-time or even function-at-a-time, and so you'll have a chance to run the assertions in the context of the original code. Which of course may not have good test coverage, as C codebases often don't, so you'll have to be testing your assertions in production.
std::vector [] has had bounds checking since forever if you set the correct compiler flag. Since they aren't using it, this is a choice; presumably they prefer the speed gain.
You mean _GLIBCXX_DEBUG? It's got some issues. Linux only, it doesn't always work [1] and it's all or nothing. What's really needed is the ability to selectively opt-out on a per-instantiation level so very hot paths can keep the needed performance whilst all the rest gets opted into safety checks.
but it doesn't seem to actually make std::vector[] safe.
It's frustrating that low hanging fruit like this doesn't get harvested.
[1] "although there are precondition checks for some string operations, e.g. operator[], they will not always be run when using the char and wchar_t specializations (std::string and std::wstring)."
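For comparison, here is roughly what "checked by default, opt out on hot paths" looks like in Rust. This is only an analogy, not a C++ mechanism: the opt-out is per call site rather than per instantiation, and the function names are invented for illustration.

```rust
/// Checked by default: an out-of-range index yields None rather than
/// reading past the end of the buffer.
fn checked_get(v: &[u32], i: usize) -> Option<u32> {
    v.get(i).copied()
}

/// Hot path that opts out of per-element checks. The single up-front
/// assert establishes the invariant the unsafe block relies on.
fn sum_prefix_unchecked(v: &[u32], n: usize) -> u32 {
    assert!(n <= v.len());
    let mut total = 0u32;
    for i in 0..n {
        // SAFETY: i < n <= v.len(), guaranteed by the assert above.
        total = total.wrapping_add(unsafe { *v.get_unchecked(i) });
    }
    total
}

fn main() {
    let v = [10u32, 20, 30];
    println!("{:?}", checked_get(&v, 1));
    println!("{:?}", checked_get(&v, 3));
    println!("{}", sum_prefix_unchecked(&v, 2));
}
```

The point of the design is that the unchecked access is visible and greppable (it needs an `unsafe` block), while everything else keeps its checks.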
With MSVC you can use _CONTAINER_DEBUG_LEVEL=1 to get a fast bounds check that can be used in release builds. Or just use it in development to catch errors.
> We talked about this at the weekly maintainer meeting and decided that we're not comfortable enough with the (lack of) design of this feature to begin documenting it for wide usage.
As far as I am aware, the standard doesn't mandate bounds checking for std::vector::operator[] and probably never will for backwards compatibility reasons. Most standard library implementations have opt-out std::vector[] bounds checking in unoptimized builds, but not in optimized builds.
I tried a toy example with GCC [1], Clang [2], and MSVC [3], and none of them emit bounds checks with basic optimization flags.
As I said, you need the correct flag set. With MSVC, use _CONTAINER_DEBUG_LEVEL=1; it can be used in release builds. They have had this feature since 2010 or so, though the flag name has changed.