Well, not quite 'native'. TLB refills are 4x to 5x as expensive, and anything that needs a context switch tends to be at a minimum twice as expensive, and it's common to balloon even farther from there.
I guess that's mostly the case if you're running a full operating system inside it, typically under QEMU. It doesn't have to be, though; it could just be a program. Tiny programs running in KVM can use big pages and never cause or require any page-table changes.
The guest has its own page tables on top of the nested guest-phys->host-phys tables.
> What context switch time? It takes 5 micros to enter and leave the guest. The rest is just "workload".
And then the kernel doesn't know what to do with nearly every guest exit on KVM, so you trap out to host user space. Host user space usually can't do much without the host kernel, so you transition back into kernel space to actually perform whatever I/O is needed, then back to host user space, then back into the host kernel to restart the guest, then from the host kernel back into the guest. That's six total context swaps on a good day: guest->host_kern->host_user->host_kern->host_user->host_kern->guest.
Right, that's very true! It's clear that you know what you're talking about when it comes to KVM and maybe even the internal structure in Linux. However, I/O can be avoided. Imagine a guest that needs no I/O, doesn't have any interrupts enabled, and simply runs a workload straight on the CPU (given that it has all the bits it needs). That is what I have made for $COMPANY, which is in production, and serves a ... purpose. I can't really elaborate more than I already have. But you get the gist of it. It works great. It does the job, and it sandboxes a piece of code at native speed. Lots of ifs and buts and memory sharing and tricks to get it to be fast and low latency. No need for JIT, which is a security and complexity nightmare.
The topic of this thread is about Blink, which happens to be a userspace emulator. Hence my comment.
I work on a C library. Some of the functions I've written, like memmove(), take about 7 picoseconds per byte for sizes that are within the L1 cache, thanks to enhanced rep movsb.
That's a very special case, though, since it's hardware-optimized to move up to a cache line at a time, and it's not related at all to the syscall cost mentioned in the parent comment.
The 5us was the setup time needed to enter the sandbox. A system call is around 1us, but rarely used. So, in general, the overhead of using the sandbox is around 5us; everything else is pure workload.