Well, not quite 'native'. TLB refills are 4x to 5x as expensive, and anything that needs a context switch tends to be at a minimum twice as expensive, and it's common to balloon even farther from there.
I guess that's mostly the case if you're running a full operating system inside it, typically under QEMU. It doesn't have to be, though; it could just be a program. Tiny programs running in KVM can use big pages and never cause or require any page-table changes.
The guest has its own page tables on top of the nested guest-phys->host-phys tables.
> What context switch time? It takes 5 micros to enter and leave the guest. The rest is just "workload".
And then the kernel doesn't know what to do with nearly every guest exit on KVM, so you trap out to host user space. Host user space usually can't do much without the host kernel, so you transition back into kernel space to actually perform whatever I/O is needed, then back to host user space, then back into the host kernel to restart the guest, then from the host kernel back into the guest. That's six total context swaps on a good day: guest->host_kern->host_user->host_kern->host_user->host_kern->guest.
Right, that's very true! It's clear that you know what you're talking about when it comes to KVM and maybe even the internal structure in Linux. However, I/O can be avoided. Imagine a guest that needs no I/O, doesn't have any interrupts enabled, and simply runs a workload straight on the CPU (given that it has all the bits it needs). That is what I have made for $COMPANY, which is in production, and serves a ... purpose. I can't really elaborate more than I already have. But you get the gist of it. It works great. It does the job, and it sandboxes a piece of code at native speed. Lots of ifs and buts and memory sharing and tricks to get it to be fast and low latency. No need for JIT, which is a security and complexity nightmare.
The topic of this thread is about Blink, which happens to be a userspace emulator. Hence my comment.
I work on a C library. Some of the functions I've written, like memmove(), take about 7 picoseconds per byte for sizes that are within the L1 cache, thanks to enhanced rep movsb.
That's a very special case, though, since it's hardware-optimized to move up to a cache line at a time, and it's not related at all to the syscall cost mentioned in the parent comment.
The 5us was the setup time needed to enter the sandbox. A system call is around 1us, but rarely used. So, in general, the overhead of using the sandbox is around 5us; everything else is pure workload.