Proposed futex2 allows Linux to mimic the NT kernel for better Wine performance (lkml.org)
389 points by marcodiego on April 28, 2021 | hide | past | favorite | 196 comments


Artificial benchmarks show a 4% improvement in some cases, which is huge. There aren't many tests to demonstrate the improvements on common use cases but there are already reports of subjective improvements with games having less stutter: https://www.phoronix.com/forums/forum/phoronix/latest-phoron...


- For requeue, I measured an average of 21% decrease in performance compared to the original futex implementation. This is expected given the new design with individual flags. The performance trade-offs are explained at patch 4 ("futex2: Implement requeue operation").

It's not clear that this is a win overall. But hey, if Windows games run better under wine...


Better yet, since Wine is there to serve as a “smart” shim, it can offer a launch config parameter for the app, that will get Wine to point its futex symbols at its previous userland impl. (Sort of like the various Windows compatibility-mode flags.) So even Windows apps that do tons of requeues (and so presumably would be slowed down by unilaterally being made to do kernel futexes) would be able to be tuned back up to speed. And distros of Wine that provide pre-tuned installer scripts for various apps, like CrossOver, can add that flag to their scripts, so users don’t even have to know.


The old interface will remain available, so if requeue performance is important, you can keep using that. Also at least one kernel maintainer considers requeue to be historical ballast [1], so it's probably not used much anymore.

[1] https://lwn.net/ml/linux-kernel/87o8thg031.fsf@nanos.tec.lin...


IIRC glibc stopped using requeue a while ago for condition-variable wait morphing, so maybe requeue is not a hugely important use case anymore.


> 4% performance improvement for up to 800 threads.

800 threads waiting on something like WaitForMultipleObjects is already a horrific design if you even remotely care about performance. The only viable case for 800 threads (and Wine) would be blocking IO, which is an extremely boring one.


Sure, but that's irrelevant for wine, which doesn't get to choose how the things that it's running (e.g. games) are implemented.


Of course. Yet I would be massively surprised to see games so heavily multithreaded, with massive contention on WaitForMultipleObjects. No one programs that way (again, save for blocking IO... and then it totally doesn't matter)


Unfortunately, people do program that way, and I've seen it way more in closed source Windows code than in the open source world where someone will (rightly) call them out.

For instance, in a previous job, a C++ codebase existed where the architect had horrendously misunderstood the actor model and had instated a "all classes must extend company::Thread" rule in the coding standard. So close, yet so incredibly far away.

At the end of the day, wine (like windows) still has to run these programs and do its best with terrible choices that were made in the past.


Having a policy that requires extending something is properly bad already. For actors you likely need a lock-free queue, and much preferably a bounded one: each message is then a CAS (very likely uncontended) and there is an upper limit on how much stuff can be enqueued.

I do believe there is a ton of incredibly horrid code out there (esp. when it comes to concurrency), but most games tend to use already-established engines, which are less likely to have glaring concurrency design blunders.


I mean, yeah, you don't make a choice like that without making a lot of other terrible choices too. Like I said "horrendously misunderstood the actor model". And as you seem to be picking up on, they had a very C++98 view at best of how inheritance should work.

In case it's not clear to anyone reading along: if you find yourself doing anything like this, take a long, deep look in the mirror; you're almost certainly making the wrong choice. Best I can say is that the architect learned from some of their choices in the years since from the resulting consequences.

However, that code is still out there. The engineers have tried to rewrite it since then, but it turns out that time spent rewriting isn't time spent adding new features. I don't agree with the company's calculus on the matter, but that's a tale as old as time.

And you'd be surprised how many games aren't written well. Writing a game engine is a bit of a rite of passage among young coders, and a surprising subset of those actually ship. In my experience, a lack of maturity in the field and (we'll be courteous and say) 'questionable' threading models (and as you point out inheritance models too) go hand in hand.

The point at the end of the day is that Wine users (like Windows users) expect the runtime to do the best it can even against terribly written applications. The users want to run that code, it isn't going to be fixed, so anything Wine can do better to paper over the (incredibly obvious to someone skilled in the art) deficiencies of that code is in their users' benefit.


I wouldn’t be so certain with regards to game engine code quality. Game dev is about producing playable art; code “quality” isn’t a massive focus.

In this particular example you're probably right, though, because performance does matter to game dev, and that would be a horrific architecture perf-wise.


It kind of makes sense for waiting on a single object, if the program uses it as a replacement for a condition variable. E.g. multiple threads could wait on a shared work queue, and wait on an object to get signaled when work is available. Now ideally just one thread should be woken, but with a bit of imagination and naive coding its easy to imagine someone wakes up a lot of worker threads which then just go to sleep again.


IIRC, WaitForMultipleObjects can only block on 63 handles. 800 threads blocked on 63 handles seems pretty weird.


>IIRC, WaitForMultipleObjects can only block on 63 handles

Yup (64, though); it's an extra annoying limitation. It has been there since the '90s in WinNT (can't recall the Win95 situation, as I never used it with Win95). It's limited by MAXIMUM_WAIT_OBJECTS, which is 64.

For instance, Java implements non-blocking IO (java.nio.Selector) by using multiple threads and syncing between them to effectively achieve what Linux's epoll provides.

----

My point was mostly that the 'huge' 4% happens with massive numbers of threads, and WaitForMultipleObjects won't see even that... so it's kind of click-baity. Side note: sync between threads can be done better than relying on OS primitives, aside from real wait/notify stuff (which indeed is implemented via futex).


I don't remember if the ABI kind of made it so that they can't increase that constant. Like if the next meaningful return code started at 65... As I recall they had this WAIT_OBJECT_0 + offset scheme, maybe offset 65 was something else meaningful? It's been a while.

But if they had meaningfully increased that, then they would run into the same scaling problems as poll(2) - since the return code gives you an index, you wouldn't have to loop through the handle set in user mode like pollfd requires, but every syscall would need an O(n) scan of the HANDLE[] array on the kernel side. You need something like iocp/epoll/kqueue where kernel mode has a list of interesting handles, that doesn't need to be scanned and copied at every syscall.


Linux has a similar limitation with FD_SETSIZE and select() --- which I believe comes from the implementation using a fixed-size bit array indexed by fd, instead of a more complex data structure, for performance reasons.


I'm confused by this comment. It appears to imply that FD_SETSIZE is also very small (63?) on Linux. To be clear: I am not an expert on Linux kernel params, nor select() and its friends. However, recent Linux kernels appear to have FD_SETSIZE around 1024.

Red Hat says: <<FD_SETSIZE is hardcoded to 1024 in /usr/include/bits/typesizes.h and can not be changed.>>

Source: https://access.redhat.com/solutions/488623

If I misunderstand, please let me know.


I wasn't implying it's also small, but that it's a fixed value too.


Very good. Thank you for clarifying.


Really curious, why is 4% improvement huge?


"Huge" means different things in different contexts. A compiler optimisation that delivered a 4% increase in speed for a whole class of well-tuned C programs would be regarded as huge by people who follow compiler technology, but would not strike most users of computers as a big deal. I take it "huge" is meant in this sense of extraordinary within a class of potential improvements.

If you can accumulate a series of 4% improvements, then even the computer consumer will start to see the significance.


A 4% improvement of the 100m dash world record would have been considered phenomenal, I imagine.


This reminded me of the 4 minute mile. The first 4 minute mile was a 3:59 run by Roger Bannister in 1954. Prior to that, the record was a 4:01 run in 1945.

A 4% improvement on the 4:01 would be a 3:51. A 3:51 wasn't achieved until 1966, about 20 years later.

So 4% can be quite a jump!


Extending your point, 4% more would go down to just under 3:42.... which has not yet happened, over a half century later.


To put that into context, Bolt's 9.58s (current WR, considered phenomenal in and of itself, I think?) is about 1.7% faster than Powell's 9.74s (last pre-Bolt WR).

A 4% improvement on Bolt's 9.58s would be about 9.19s - probably not something we'll see for several decades, if ever.


I wonder if I’ll ever see a Bolt again in my lifetime.

Such dominance and grace.

He jogged over the line with that record too.


I can’t find where the 4% number comes from, but I expect that it comes from a micro benchmark that does little more than futex calls. Reason: if it is from a benchmark that represents real-world use, real-world use must spend at least 4% of its time in futex calls. If that were the case, somebody would have seen that, and work would have been spent on this.

So no, this won’t give you 4% in general.


From the lkml post, it's neither from a real-world case nor from an especially constructed futex benchmark, but from an existing kernel benchmark where it was thought it would be interesting to see how it performed with the new call.


> If that were the case, somebody would have seen that, and work would have been spent on this.

I would think that the linked patch is good evidence that work is being spent on this.


The microbenchmark part is part of the article, no idea why the downvotes here.


Because the benchmark isn't doing what GP says it does.


Because you don't have to change a single line of your previously running software, or replace your hardware, to get this benefit. Of course, you will need a new version of the kernel and Wine, but that's just a matter of time until it hits distro repositories. Also, consider that this will accumulate when combined with new user-space libs, servers, compilers, drivers... Software will improve without any change, just by keeping your distro updated.

Yes, a 4% improvement is huge.


Well, it's a relative measure. There's been a ton of optimization already. All the low-hanging fruit is gone. It's probably been a long time since a single optimization gained 4%.


I don't know much about Wine and I'm speaking in generalities, but I've worked jobs where we spent weeks to get <10% performance increases. Very often that last bit of optimization is the hardest part.


Intel Xeons have huge price differences at the top tier for relatively minor performance differences.

So if you get 4% for free here, that could mean hundreds of dollars savings compared to having to buy the next tier hardware. Multiplied by the number of Intel customers.

4% could mean the difference between "I can still use my old computer" and "I need to buy an upgrade".


The price difference is due to artificial market segmentation; for example, only Xeons support ECC. It is not due to performance.

(consumer chips are usually basically identical, just with functionality disabled)


This is interesting. Does anyone know of previous figures on Windows virtualization improvement?


It's not 4% improvement for the wine use case, it's 4% improvement in existing synthetic benchmarks for the futex syscall. And a 21% regression in performance in another synthetic benchmark.

It's likely the wine WMO implementation is getting a much more significant increase in performance than 4%, but the patch description doesn't provide numbers.


This might get me some angry replies, but does the feature set of Linux ever end? Will there ever be a day where someone says "you know what, Linux isn't built for this and your code adds too much complexity, let's not". It really just seems like Linux has become a dumping ground for anyone to introduce whatever they feel like as long as it passes code review, with nobody looking at the big picture.


Features get rejected all the time, even some pretty substantial ones. That said, I think there are two things to keep in mind:

1. If someone is willing to take the time to develop and submit something to the kernel, it probably has some redeeming factors to it. The original idea or code might not get merged, but something similar to meet those needs likely can be if it's really something the kernel can't do.

2. The modularity of the Linux Kernel build system means that adding new features doesn't have to be that big of a deal, since anybody who doesn't want them can just choose not to compile them in. It's not always that simple, but for a lot of the added features it is.


I mean, the futex2 proposal has been around for over a year.

A lot of the history and discussion can be found here: https://lkml.org/lkml/2021/4/27/1208

I personally feel like the kernel maintainers have been very conservative over the years regarding the introduction of new system calls, but then again I've maybe gotten used to the trauma of being familiar with windows internals.


This shouldn't be getting down-votes.

Consider the slogan "composition not configuration": Linux is squarely the latter. All the hardware, all the different use cases, are just added to what's mostly a monolith. You can disable them, but not really pull things apart.

Now, Linux also has some of the most stringent code review in the industry, and they are well aware of issues of maintainability, but fundamentally C is not a sufficiently expressive language for them to architect the code base in a compositional manner.

What really worries me is that as the totality of the userland uses an increasingly large interface to the kernel, it will be very hard to ever replace Linux with something else once we do have something more compositional. I am therefore interested in, and somewhat working on, stuff that hopefully would end up with a multi-kernel NixOS (think Debian GNU/kFreeBSD) to try to get some kernel competition going again, and to incentivize some userland portability.

Note that I don't think portability <===> just use old system calls. I am no Unix purist. I would really like a kernel that does add new interfaces but also dispenses with the old cruft; think Rust vs C++.


> Consider the slogan "composition not configuration": Linux is squarely the latter. All the hardware, all the different use cases, are just added to what's mostly a monolith. You can disable them, but not really pull things apart.

Linux is a Monolithic Kernel, sounds like you are more interested in Microkernels. https://www.wikiwand.com/en/Microkernel

One microkernel alternative could be Redox-OS: https://www.wikiwand.com/en/Redox_(operating_system)


You could write a monolithic kernel using composition for your code structure. It would just be composition at build time rather than at runtime.

It seems pretty hard to write a microkernel without using composition, though, since the runtime requirements kind of dictate your code structure.


> but fundamentally C is not a sufficiently expressive language for them to architect the code base in a compositional manner.

Plan 9 would like to have a word with you.


Well, the original plan was to use Alef.


Linus has done exactly that a few times; though usually he defines what would be needed to be merged.

Some of the changes aren't compatible with the kernel direction as a whole and so they're a separate branch. Given enough time (and if the change is major AND ends up being widely used) it may get merged in.

Those who want a specific kernel with a defined feature set often will look to one of the BSDs as a base (as often they also have reduced hardware requirements, think routers).


That’s the beauty of open source! Just fork your own Linux and then refuse any further patches or PRs. True freedom.


Linus Torvalds looks at the big picture all the time. That is basically his job as he doesn't write a lot of low level code anymore.

Do you have an example of something that you consider part of the "dumping ground?"


> with nobody looking at the big picture

As I understand it, this is exactly what Linus does. He's the final arbiter of what makes it in, and has a history of setting high level direction for how stuff should be (usually prefixed with "I'm rejecting this because").


That's a fair criticism, but the kernel is very modular and the legacy stuff that no one will maintain gets cut all the time, so I think it's all a fair tradeoff. I'm not angry, just pragmatic: if you don't evolve you die; look at how much market share BSD has lost over the past 15 years.


I think it's hard to blame BSD's (collective) loss of market share on their reluctance to add system calls, or even their pace of evolution more generally.

Are there some specific evolutions that you think Linux adopted, and eg. FreeBSD didn't, that were/are crucial to Linux' now overwhelming dominance?


No, because the Linux kernel is designed explicitly as a monolithic kernel, so as long as there is anything anyone wants to do that is "OS-like", it expects to own that.


Getting a new syscall into Linux is hard. You'll often need user-space users before it gets accepted; that alone is hard. The Linux syscall set is very small considering the breadth of cases it serves today. Compared to NT, the Linux syscall set is small and well thought out.


Windows has around 500 syscalls, Linux around 300.

Both have several escape-hatches for arbitrary driver-userspace communication, which makes the number of syscalls pretty irrelevant. (E.g. Linux has ioctl, netlink, procfs, sysfs, ...)


Well, there's also win32k.sys which provides an additional 1000 or so syscalls.


> Both have several escape-hatches for arbitrary driver-userspace communication, which makes the number of syscalls pretty irrelevant. (E.g. Linux has ioctl, netlink, procfs, sysfs, ...)

Yes, we need a better metric for accounting for those things, which will paint a much more sobering picture of the complexity and growth of complexity of the userland <-> kernel interface.



The Linux syscall set is very small and POSIX. No matter what you think, the first version is always a prototype.


As a relative layman, this all seemed reasonable and desirable - and the fact that he integrated it with pthreads seems particularly cool to me.

NUMA awareness especially stood out as highly significant.

The new supported sizes seems nice and shiny - if well utilised, it could improve performance and capacity in general, capabilities in the low-end and scaling in the high-end.


I think that will be the day somebody comes and creates a new FOSS kernel which gets rid of mostly unused legacy code. The problem with this sort of stuff is that it is hard to know what abstractions will actually be useful before said abstractions are available.

I don't think anyone in their right mind would design the IndexedDB APIs the way they are with the background of modern JS. It would certainly be promise based and I think there would be a better distinction between an actual error with IndexedDB vs just an aborted transaction. It's also extremely confusing about how long you can have transactions ongoing. I suspect they would replace the nonsensical autocommit behavior of transactions currently with some sort of timeout mechanism. There is way too much asynchronous code out there these days to rely on the event loop for transactions committing.

Linux is honestly pretty clean as far as outdated abstractions go, but Windows is definitely very guilty of keeping around old outdated abstractions. Having execution policies for PowerShell while batch scripts get off scot-free is really strange. These restrictions would only make sense if Windows didn't let users run VBScript, batch, etc. Can anyone tell me how running a PowerShell script is inherently more dangerous than a VBScript or batch script?


I don’t think this is an achievable goal for a kernel, since it has strictly more permissions than user code. There are some parts of user code that simply cannot be made faster by changing how they’re written: no amount of rewriting in Rust/C/C++ will help if the relevant kernel primitives are not there. The kernel must move with the times, and incorporate new features as they are (tastefully) decided upon.


There are pieces of linux - say the networking - that already don't meet many people's use cases. People seem to like BSD for certain networking stuff due to differences in the respective networking stacks.


> People seem to like BSD for certain networking stuff due to differences in the respective networking stacks.

Is that why *BSD has lower market share for servers and other networking equipment than it did 15 years ago?


It doesn’t, really - 15 or 10 years ago was pretty much the low point, by then everyone who wanted to jump ship did so.


This patch both cleans up existing complexity and adds functionality needed for a significant use case.


So what? This code (and most drivers, modules, additional functionality, etc.) will only be compiled in if necessary, and will only actually run if you use the feature.


My concern is not about what my flavor of kernel includes. It's about the size of the codebase, and whether the limited number of maintainers can keep up with the countless features being introduced.


I mean I think in terms of maintainer resources Linux is one of the projects with the least to worry about, there are always companies who are willing to have a team work on stuff.


Those megacorps also accumulate endless complexity internally, as does the industry as a whole. It's a problem larger than the discipline of reviewers of C code could ever hope to fix.

Even the larger world outside the field is somewhat aware of the looming complexity problem: https://www.theatlantic.com/technology/archive/2017/09/savin...

Basically, on our current trajectory we will end up with something like the biosphere: able to do stuff, but too complex to fully understand, and probably far more fragile than that, too.


That’s what testing is for. We can’t really reason globally about much smaller codebases either. Software is complex.


The kernel is regularly shedding off unmaintained code.

So the codebase is actually not growing as much as it could.


The code they're shedding off is drivers. They won't ever get rid of old syscalls because Linus refuses to break userspace, ever.


Frequently the old syscall just ends up as a stub that calls into the new implementation though.


The BSDs are always an option.


The BSDs were an option, but they are becoming more moribund and less used every year.


[citation needed]


I'm assuming if you build the kernel yourself, you may be able to strip out features you don't want.


Yep, there's loads and loads of config options. Linux isn't a big immutable blob; there's a reason it gets used in both tiny devices and supercomputers, and you can build your kernel however you like.

It's not a microkernel but it does a pretty good job of being flexible for a monolithic kernel.

If you're curious, clone the kernel source and run `make menuconfig` and you can play with it yourself


What purpose is a general-purpose kernel not built for?


Bummer that this doesn’t seem to support or explicitly plan for the proposed FUTEX_SWAP operation[1] (unless I’m missing it).

[1] https://lwn.net/Articles/824409/


I haven't read anything about FUTEX_SWAP since Aug 2020. Do you know if there is ongoing progress?


I lowkey have to wonder why Linux seemingly goes all-in on monster functions that implement a variety of operations depending on the arguments (as far as I can see from casual encounters). In my experience, such code often both reads poorly on the calling side and looks very messy inside. I came to prefer making distinct functions that spell out their intended action in the names. Windows apparently went the same path.

E.g.: syscall doesn't accept multiple handles and other bells and whistles? Well duh, it's not supposed to—just make a new one that does.

Of course, it's probably too late to change this design choice now...


I can only offer speculation, but it could be that the alternative, a complicated API that requires several subsequent calls to use, is seen as more of a risk for bad programming and state mis-tracking. We definitely don't want programs leaving the kernel in a bad state by calling the first half of a sequence of syscalls then later on getting distracted or crashing and never finishing it. A single call gives atomicity guarantees that are otherwise hard to come by in languages without state tracking primitives like futures or borrow checked references.

Another thing to remember is that in C, it's easier to make a call more specific, by hardcoding values than it is to make it more general, by picking between variants in a branch. I can see programmers wishing the API was more dynamic while writing a giant if-else tree to select the appropriate version of the syscall, but the opposite problem, when the same invocation is always needed but the API takes a lot of dynamic options, is not as bad because it is easy to write a bunch of literals in the line that calls the function. (If C had named arguments that approach would be almost flawless.)


I don't see how `an_op_syscall(param)` is ‘more complicated’ or requires more calls or state than `a_syscall(OP_NAME, param)`.


Since C doesn't have first-class functions, if you needed to dynamically pick between values of OP_NAME you'd be glad it was a parameter. Imagine ten options parameters, which your program needs to choose between depending on the circumstances. That's a huge branch tree. Imagine choosing between two options - you'd need to keep track of function pointers, or at least write your own enum. C makes it far nicer to handle enum values than it is to handle functions, because ints are first-class. I agree that there's no difference between a symbol in the name and a value in the args when your program always needs the same value, but sometimes it needs dynamism.


I don't see how function pointers are any harder to manage than an enum. Just slap those function pointers into an array or a switch, and away you go.


Enum values are type checked based on the sum type you define them with. Function pointers are type checked based only on their signature. Also, C does not support partial application, but it does support partially filling out a struct of enum types.


You're talking here about the second half of your previous comment. This reasoning is ok by me, though I have to wonder how often, and why, do programmers need to decide dynamically in mid-flight whether they want to wait instead of wake (or whatever choices futex() offers), or that they need one handle instead of a vector.

I'm talking above about the first half of your previous comment, where you say that having operation names in the function names would somehow require multiple calls and break atomicity. I can only understand this if Linux syscalls are parameterized to do one thing and then another thing and something else, in one call—which sounds horrible, but still can be expressed in function names.


There are no direct function calls into the kernel.

At the end of the day, a syscall is an interrupt with the syscall parameters in a bunch of registers. As you do not want to use a different interrupt for each function (there are only so many interrupts [1]), the syscall id is passed as a parameter in a register. On the kernel side, the id is used to index into a function pointer table to find and call the actual system function.

Turns out this table is actually limited in size so some related syscalls use another level of indirection with additional private tables (I think this is no longer a limitation and these syscall multiplexers are frowned upon for newer syscalls).

All of this to say that FUTEX_WAIT and FUTEX_WAKE are just implementation details. A proper syscall wrapper would just expose distinct functions for each operation.

But for a very long time glibc did not expose any wrapper for futexes, as they were considered Linux-specific implementation details and one was supposed to use the portable POSIX functions. So the futex man pages documented the raw kernel interface. That didn't stop anybody from using futexes, of course, and I think glibc now has proper wrappers.

[1] yes, these days you would use sysenter or similar, but same difference.


Thanks, this actually answers my original question in regard to the kernel's part of it. I'll take a look at what names and signatures libc has, and if they're still like that then the question is readdressed to libc—though I guess the answer then would be “because that's how it is in the kernel”.

P.S. Come to think of it, I sorta knew about the interrupt business, and vaguely about the later but fundamentally similar method (but not that there are many interrupts for syscalls, though thinking about eight-bit registers clears this up quickly). Guess I just forgot that libc is kept separate from the kernel, in the Linux world.


>I'm talking above about the first half of your previous comment, where you say that having operation names in the function names

I was addressing operation names in the function names in the second half of my comment, in the first half I was addressing breaking up specification into a state machine process like OpenGL does. State machines are another way of breaking up big functions with lots of options. They have pseudo default argument support as an advantage, and ease of implementation via static globals as a historical (before multithreading) advantage. State machines are kind of like builder pattern as a way to get named arguments and default arguments in a language that doesn't have either, except instead of the builder being a class it's a singleton (with all the attendant disadvantages.)


Efficient OpenGL programming was always hard for me, as I don't just need to learn the API, but also try to figure out how it's implemented, which is often undocumented and unspecified.

It's actually hard to understand for me that an API that was created for efficient computing doesn't specify/guarantee the relative speed/latency of operations.


Yeah, there are some really interesting unexpected performance implications - for instance, it's actually a great idea to send updates to buffers in a way that implies new allocation, because most drivers will silently and automatically set up a multiple-buffering system that reuses resources efficiently and doesn't stall the pipeline. Sending partial updates looks better on the surface, but because it can stall while the driver waits for the resource you're updating to fall out of use, it can be worse unless you set up the triple buffering yourself!


Ah, no, that's also contrary to my taste. I love clear function names and I hate state. Immutable return values are way better than builders, in my experience, even if that means many arguments. And static globals are a whole other level of bad.


Builders can be a huge help when you get up to a dozen arguments and above. Seriously, I never want to see foo(13,15,91,39,LIB_YES,LIB_NO,5.0,183.5) ever again haha. Good engineers would comment each argument, good languages support named arguments, but when you have neither it can be a mess unless the library author used builders to force annotation of the values. A close relative of builders are settings struct arguments, which have their own pros and cons. I think Vulkan uses settings structs.


I think OOP builders can work without extra state if they implement `->addThis()->addThat()` by actually doing functional `addThat(addThis(defaultObj))` and creating immutable values inside. But ultimately there's no good answer to the absence of named arguments, and by now not having them is a pure atrocity.

Of course, I'm guessing C programmers would cringe at the prospect of going full COW for such use-cases.


The downside of your second syntax for builders is when you need to add arguments. People will have a hard time remembering whether it's supposed to be addFoo(1,addBar(2)), addFoo(addBar(2),1), or any of the other combinatorially exploding number of possible arrangements as the complexity of the construction is increased. With the OOP design, the values go inside the addFoo(1)->addBar(2) and it's obviously constrained to that one way of writing it. When writing function signatures that are going to be nested I try to keep one type in the arguments and I try to make them commutative (so that it works the same way in any order.) That's not always possible but it helps people remember how to write calls.


Well, the convention in theory is usually that `object->method(param)` is equivalent to `method_function(object, param)`, and likewise you have the signature of `method(self, param)` in languages where `self` is an explicit argument—if only because it's always present so makes sense to always have it first. Thus, translating an OOP builder would make the functions `addThis(object, param)` =)

With the difference in my proposal that the object is treated as a COW.


You're talking about the problem of multiplexer system calls. The kernel is actually moving away from that model towards one of fine grained operations.


Probably the main reasons are:

  - Syscalls are not designed to be called directly from end-user programs. Instead, lower-level libs call syscalls.

  - Getting a syscall accepted is a long and tedious process, so they should be as general and future-proof as possible.


Can someone please ELI5 what is a futex? Is it like a mutex?


Futex stands for fast userspace mutex. It is a basic building block for implementing a mutex (amongst others).

Afaik, if locking a mutex succeeds, futexes avoid a context switch to and from the kernel. The fast path happens completely in userspace. Only when a mutex lock causes a thread to sleep, the kernel is needed. All of this improves mutex performance.

There are other locking primitives than mutexes, and the current futex interface is not optimal for implementing common windows ones. Hence this proposal.


>"context switch to and from the kernel"

This is a common misconception, a "context" switch is a lot more expensive than a "mode" switch that goes from user mode to the kernel mode (and back).

Of course, if the futex has to wait, it'd be a context switch for real.


ESYNC is the way this is currently done, it's hit and miss in that it doesn't work unless the number of open files is set high enough for some games. It's also slower than this proposal.


It's a primitive that lets you build mutexes; you've got an atomic integer in userspace, attached to a wait queue in kernel space. Eli Bendersky has a pretty good rundown: https://eli.thegreenplace.net/2018/basics-of-futexes/


This is a good post, thanks. One thing that confuses me is that these two sentences seem to contradict each other:

>"Before the introduction of futexes, system calls were required for locking and unlocking shared resources (for example semop)."

>"A user-space program employs the futex() system call only when it is likely that the program has to block for a longer time until the condition becomes true"

The first indicates futexes are an improvement because they avoid the overhead of system call but the second states that futex() is a system call.

Is this similar to how a vDSO or vsyscall works where a page from kernel address space is mapped into all users processes? In other words is it similar to how gettimeofday works?[1]

[1] https://0xax.gitbooks.io/linux-insides/content/SysCall/linux...


It's not a contradiction. The two sentences are perfectly clear to me. Of course futex() is itself a system call, but in case of no contention you don't need to invoke the system call. That's why futex() is an improvement. The key phrase is "only when" in the second sentence.


IMHO the nice thing about futexes is not that they allow fast-pathing uncontended synchronization.

The userspace fast path can in principle be implemented on top of any kernel only synchronization primitive (a pipe for example).

The nice thing about futexes is that they do not consume any kernel resource when uncontended, i.e. there are no open or close futex syscalls. Any kernel state is created lazily when needed (i.e. on the first wait) and destroyed automatically when not needed (i.e. when the last waiter is woken up). Coupled with the minimal userspace footprint (just a 32-bit word), it means that futexes can be embedded very cheaply in a lot of places.
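To make the "no open/close" point concrete, here is a minimal sketch (the helper names are ours, not from any library) of raw FUTEX_WAIT/FUTEX_WAKE on a plain 32-bit word. The kernel's wait-queue state exists only while someone is actually sleeping on the address:

```c
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int flag;  /* the entire userspace footprint: one 32-bit word */

static void *waiter(void *arg) {
    (void)arg;
    /* Sleep while flag == 0. If flag changed between our load and the
       syscall, FUTEX_WAIT returns immediately with EAGAIN (the built-in
       TOCTOU guard), so we just recheck in a loop. */
    while (atomic_load(&flag) == 0)
        syscall(SYS_futex, &flag, FUTEX_WAIT, 0, NULL, NULL, 0);
    return NULL;
}

static void signal_waiter(void) {
    atomic_store(&flag, 1);
    /* Wake up to one sleeper; a cheap no-op if nobody is waiting. */
    syscall(SYS_futex, &flag, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

Note there is no create/destroy pair anywhere: once the waiter is woken, the kernel-side state evaporates on its own.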


>The userspace fast path can in principle be implemented on top of any kernel only synchronization primitive (a pipe for example).

>The nice thing about futexes is that they do not consume any kernel resource when uncontended, i.e there is no open or close futex syscalls.

This is an extremely interesting and useful observation, thank you for making it.

Futexes are missing a lot of useful functionality - see the OP, and as another example it's very hard to integrate them with an event loop - so I've been dreading building anything with them, but I thought I needed it to get fast synchronization. But for my use cases, I already have setup and teardown calls. Maybe I can do fast userspace synchronization on top of some other kernel object - a pipe, or something, as you suggest.

Do you have any more to share about this observation, or pointers to any implementations of fast userspace synchronization on top of something other than futexes?


You can use eventfd as a pollable waiting primitive.

But there isn't really ever a reason to mix futexes and event loops. Futexes are not a synchronization primitive, just a waiting strategy. Decouple the two and you can integrate your synchronization primitives with an event loop while still using futexes for the non event-loop cases.


>But there isn't really ever a reason to mix futexes and event loops. Futexes are not a synchronization primitive, just a waiting strategy. Decouple the two and you can integrate your synchronization primitives with an event loop while still using futexes for the non event-loop cases.

I'm not sure what you mean, can you elaborate? Suppose I had a mutex (synchronization) implemented with a futex, and a shared memory queue (waiting) implemented with a futex; suppose both are being used to coordinate between multiple processes. I guess you're suggesting that one of those two doesn't need to be integrated with an event loop? But why? Both of those seem useful to have as part of an event loop.


You wouldn't want to use an event loop to wait for a mutex as it doesn't really make much sense. It does make sense to use an event loop to wait for empty/non empty events (which on a normal queue you would implement with a cond var).

You have two options: you can use an eventfd to implement the queue full/empty event, or provide a generic notification interface (i.e. a callback). Any non-event-loop users would simply have the callback signal a futex, but it allows for more complex use cases: obviously you can wake up the event loop from the callback, but you could resume a coroutine, send an async signal, or whatever makes sense for the application.
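As a sketch of the eventfd option (the helper names are our own; eventfd(2) itself is the real API), the counter semantics map naturally onto "items available" notifications, and the resulting fd can go straight into poll/epoll alongside sockets, which a futex cannot:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create a pollable notification channel: counter starts at 0,
   reads block until the counter is nonzero. */
static int make_event(void) {
    return eventfd(0, 0);
}

/* Producer side: add n to the counter (e.g. n items enqueued). */
static void notify(int efd, uint64_t n) {
    uint64_t v = n;
    write(efd, &v, sizeof v);
}

/* Consumer side: read returns the accumulated count and resets it to 0
   (default, non-semaphore mode). */
static uint64_t consume(int efd) {
    uint64_t v = 0;
    read(efd, &v, sizeof v);
    return v;
}
```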


>'The key phrase is "only when" in the second sentence'

Thank you for pointing this out. That makes sense. This is sort of like the idea that the fastest system call is the one that is never made. Can you say how something in userland knows whether it "would" block without actually crossing the user/kernel boundary? Does the kernel expose a queue length counter or similar via a read-only page?


The trick is, the memory address futex cares about is how you signal. Normally, in the "uncontended" case, everybody involved uses atomic memory access to manipulate this memory location (for example they set it to 1 as a flag to say somebody is in the mutually excluded code and then back to zero to say now nobody is in there) and they pay no kernel cost at all for this feature. The kernel has no idea the memory is even special.

But in the contended case the flag is already set by somebody else, so you leave a sign (maybe you set it to 42) about the contention in the memory address, and you tell the kernel that you're going to sleep and it should wake you back up once whoever is blocking you releases. When whoever was in there is done, they try to put things back to zero, but they discover you've left that sign that you were waiting, so they tell the kernel they're done with the address and it's time to release you.

None of the conventional semantics are enforced by the operating system. If you write buggy code which just scribbles nonsense on the magic flag location, your program doesn't work properly. Too bad. The kernel does contain some code that helps do some tricks lots of people want to do using futexes, and so for that reason you should stick to the conventions, but nobody will force you to.
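The flag-scribbling protocol described above can be sketched with C11 atomics. This follows the well-known three-state scheme from Drepper's "Futexes are Tricky" (0 = free, 1 = locked, 2 = locked with waiters present); the names are ours and this is illustrative, not glibc's actual implementation:

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { atomic_int state; } fmutex;  /* 0 free, 1 locked, 2 contended */

static void fmutex_lock(fmutex *m) {
    int c = 0;
    /* Fast path: one CAS, the kernel never hears about it. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;
    /* Contended: leave the "waiters present" sign (2) and sleep.
       FUTEX_WAIT only sleeps if the word still equals 2. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        syscall(SYS_futex, &m->state, FUTEX_WAIT, 2, NULL, NULL, 0);
        c = atomic_exchange(&m->state, 2);
    }
}

static void fmutex_unlock(fmutex *m) {
    /* If the old value was 2, someone left the sign: wake one waiter. */
    if (atomic_exchange(&m->state, 0) == 2)
        syscall(SYS_futex, &m->state, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

In the uncontended case both lock and unlock are a single atomic instruction each; the syscalls only happen once somebody has actually had to sleep.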


Thanks for this insight. A couple quick questions:

>"When whoever was in there is done, they try to put things back to zero, but they discover you've left that sign that you were waiting, so they tell the kernel they're done with the address and it's time to release you.

Did you mean to say "and it's time to release to you" here?

>"None of the conventional semantics are enforced by the operating system"

Ah OK, this is the part that I feel is maybe always left out of the things I've read on futex(). I guess this is just always implied then that some library implements these semantics correctly? And that library is generally going to glibc?


> Ah OK, this is the part that I feel is maybe always left out of the things I've read on futex(). I guess this is just always implied then that some library implements these semantics correctly? And that library is generally going to glibc?

Yup. For example, the pthread_foo functions are built on futexes under the hood.


On Linux.

The POSIX threading API accommodates many possible implementations, and so in Linux many of the bits that look expensive (and might be on other systems) are more or less free thanks to futexes, e.g. setting up the mutex in Linux is just allocating an aligned 32-bit value on the heap and setting it to some initialising value, but on some systems it involves an OS system call to get a mutex handle.

This is the true benefit of the futex. The typical scheme for using it looks almost exactly like a well known trick for speeding up conventional mutual exclusion features that have low contention, which you'd see in say BeOS as the Benaphore, or described in several books about high performance programming. But those tricks all imply the expense of first obtaining a handle from the operating system for every such mutex whereas with a futex that part is free and you only talk to the operating system at all once there is contention for it to help you manage.


>"The typical scheme for using it looks almost exactly like a well known trick for speeding up conventional mutual exclusion features that have low contention, which you'd see in say BeOS as the Benaphore, or described in several books about high performance programming."

Do you have any links that document this trick in Linux? Is there name for it or search term I could use to find out more?


The phrase you'd have used to describe this when it wasn't just how everything works on a good OS is "Lightweight mutex". Here's Be explaining the "Benaphore" for example:

https://www.haiku-os.org/legacy-docs/benewsletter/Issue1-26....

Note that when I say "typical scheme for using it" I did not mean, that your program should implement such a scheme itself, instead futex is intended for the sort of people who implement the concurrency features in your language or APIs to use - and this is how they'd go about doing that to deliver the features you just use. So, for most of us this is interesting to know about but unlikely to impact our day-to-day practice.


The most common implementation of mutexes using futex() is that the userland simply spins a while waiting to acquire the mutex (just atomic operations, no system call). After a while the userland gives up spinning and makes the system call.
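A deliberately simplified sketch of that spin-then-sleep pattern (our own names; a production lock would also track waiters, as in the three-state scheme, so unlock can skip the FUTEX_WAKE syscall when nobody is sleeping):

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_word;  /* 0 = free, 1 = held */

static void spin_then_wait_lock(void) {
    for (;;) {
        /* Spin a bounded number of times hoping the holder releases soon. */
        for (int spins = 0; spins < 100; spins++) {
            int c = 0;
            if (atomic_compare_exchange_weak(&lock_word, &c, 1))
                return;  /* acquired without any syscall */
        }
        /* Give up spinning; sleeps only if the lock is still held (== 1). */
        syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
    }
}

static void spin_then_wait_unlock(void) {
    atomic_store(&lock_word, 0);
    /* Unconditional wake (wasteful but correct in this simplified form). */
    syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```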


Thanks this implementation detail is helpful. Cheers.


See https://lwn.net/Articles/823513/ for a look at both the old and proposed new futex interfaces.


The best introduction is probably the original presentation paper from 2002.

Fuss, Futexes, and Furwoks: Fast Userlevel Locking in Linux

https://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pd...

Another great paper that came after some additional experience by glibc developers:

Futexes are Tricky, by Ulrich Drepper

https://www.akkadia.org/drepper/futex.pdf


I see Drepper resisted the urge to pluralise futex to futices.


A futex is an ephemeral list of waiting threads or processes keyed on an address. It can be used to implement any synchronization or signaling primitive, like mutexes, condition variables, semaphores, etc.


This is a really nice insight into the kernel's model of what's going on. It's probably not the best way for an absolute beginner to think about why they want this, but it's much better than some of the other explanations if you have trickier technical or performance questions, and it makes it obvious why it's clever and why you need the OS kernel to actually implement it.


Okay, consider a multi-threaded program. These programs have multiple threads running on multiple cores. Sometimes you want to coordinate the execution of multiple threads in such programs. A common example is when a thread is in a "critical section" where it is only safe for one thread to access/modify state. Sometimes threads will have to wait for each other. This is where the futex comes in. Instead of having the thread busy-wait while occupying the processor, it can tell the OS that it needs to wait for this futex to become free. The OS will then suspend the thread and find another one to schedule in its place.


Ah Interesting. So is the concept of futex() similar to how a process in kernel mode will yield the CPU by putting itself on a wait_queue and call the schedule() function before it issues an I/O request then?

I realize that one is userspace and the other kernel but would the above be the right mental model?


Yes, a futex is basically a waitqueue abstraction coupled with a TOCTOU guard.


Can someone please ELI5 what is a mutex?


(Edited because I transposed "mutex" and "semaphore" in my mind.)

A mutex is a lock -- if multiple users attempt to access a resource simultaneously, they each attempt to acquire the lock serially, and those not first are excluded (blocked) until the current holder releases the lock.

Mutexes are typically implemented with semaphores:

First a more developer-oriented explanation (because I imagine that's what you really want):

A semaphore locks a shared resource by implementing two operations, "up" and "down". First, the semaphore is initialized to some positive integer, which represents how many concurrent "users" (typically threads) can use that resource. Then, each time a user wants to take advantage of that resource, it "down"s the semaphore, which atomically subtracts one from the current value -- but the key is that the semaphore's value can _never_ be negative, so if the value is currently zero, "down"ing a semaphore implies blocking until it's non-zero again. "Up"ing is just what you'd imagine: atomically increment the semaphore value, and unblock someone waiting, if necessary.

Semaphores are generally seen as a fundamental primitive (one reason being that locks/mutexes can be implemented trivially as a semaphore initialized to one), but they also have broad use as a general synchronization mechanism between threads.
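A sketch of the up/down operations using C11 atomics, with a futex for the blocking case (the names are ours and this is illustrative only, not any library's implementation):

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { atomic_int count; } fsem;

static void fsem_up(fsem *s) {
    atomic_fetch_add(&s->count, 1);
    /* Wake one blocked "down"er, if any. */
    syscall(SYS_futex, &s->count, FUTEX_WAKE, 1, NULL, NULL, 0);
}

static void fsem_down(fsem *s) {
    for (;;) {
        int c = atomic_load(&s->count);
        /* The value can never go negative: we only decrement if c > 0,
           and the CAS retries if another thread raced us. */
        if (c > 0 && atomic_compare_exchange_weak(&s->count, &c, c - 1))
            return;
        if (c == 0)  /* sleep until an "up" bumps the counter */
            syscall(SYS_futex, &s->count, FUTEX_WAIT, 0, NULL, NULL, 0);
    }
}
```

Initializing the counter to 1 gives you exactly the "semaphore as mutex" construction mentioned above.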

For a true ELI5 of semaphores (I enjoy writing these):

Imagine everyone in class needs to write something, but there are only three pencils available, so you devise a system (a system of mutual exclusion, per se) for everyone to follow while the pencils are used: First, you note how many pencils are available -- three. Then, each time someone asks you for one, (which we call "down"ing the semaphore) you subtract one from the number in your head and give them a pencil. But if that number is zero, you don't have any pencils left, so you ask the person to wait. When someone returns and is done with their pencil, you hand it off to someone waiting, or, if nobody is waiting, you add one to that number in your head and take the pencil back ("up"ing the semaphore). It's important that you decide to only handle one request at a time (atomicity) -- if you tried to do too many things at once, you might lose track of your pencil count and accidentally promise more pencils than you have.


> First, you note how many pencils are available -- three. Then, each time someone asks you for one, (which we call "down"ing the mutex) you subtract one from the number in your head and give them a pencil

Isn't that a semaphore rather than a mutex?

For ELI5 on semaphores, it's like going to the train station, airport, or McDonald's: There is a single queue for multiple counters. Everyone is given a paper with a number, and once a counter is free, the one with the lowest number can walk to it. A semaphore is just the number of free counters, it decreases when someone goes to one, and increases when one leaves. If the semaphore is zero, there is no free spot, and people will have to queue.

It's quite close to another synchronization primitive, barriers: if you have to wait for n tasks to finish, you make one with n spots, and release everyone when all the spots are filled.

To me, mutexes are semaphores, but with only one counter: there is only exactly one or zero resources available. If you take the mutex, others will have to wait for you to be done before they can take it. They can queue in any fashion, depending on the implementation: first come, first served; fastest first; most important first; random; etc.


Oh no, you're absolutely right. My sleep-deprived brain has confused the names -- thanks for the heads up. I'll edit my post.


But it can’t be turtles all the way down? Isn’t a semaphore counter also shared state that needs to be protected if several threads want to “down” the value? Can it become negative?


Well, at the bottom of the turtle stack you have hardware primitives that guarantee atomicity of a bunch of word-sized arithmetic operations (at minimum CAS or LL/SC).
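For illustration, here is the bottom turtle sketched in C11 (assuming the platform maps these operations to a hardware CAS, e.g. LOCK CMPXCHG on x86): a spinlock needs no further protection because the compare-and-swap itself is the atomic step.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock;  /* 0 = free, 1 = held */

static void spin_lock(spinlock *l) {
    int expected = 0;
    /* Atomically: if locked == 0, set it to 1 and we own the lock.
       On failure the CAS wrote the current value into `expected`,
       so reset it and retry. */
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1))
        expected = 0;
}

static void spin_unlock(spinlock *l) {
    atomic_store(&l->locked, 0);  /* release: make the lock free again */
}
```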


Thanks, great analogue there, it really helped it 'click'!

Please excuse my ignorance...


Mutex is named after "mutual exclusion", a primitive that is used to guarantee that some parts of code will run exclusively, never at the same time. This is important for good performance and synchronization in multithreaded programs.


A mutex is the software equivalent of the lock on your toilet door. There is a zone in your code where only 1 thread at a time is allowed. So before entering, you lock the mutex. Now either you are allowed in or are added to a set of waiters. When you leave, you unlock the mutex, and 1 other waiter is allowed to enter.

This sounds easy, but is in practice fraught with peril. You need high performance, not wasting too much CPU time on sleepers, some notion of fairness when you have lots of waiters, minimal memory usage, etc...


My ELI5 version of that locking mechanism, e.g. a mutex, uses a toilet and the disaster that ensues if multiple people try to take a dump at the same time.


I just found this article which I think is really helpful as it has some good historical context as well:

"Detailed approach of the futex"

http://www.rkoucha.fr/tech_corner/the_futex.html


futex stands for Fast Userspace muTEX. It's the syscall that helps implement a mutex: when two or more threads are waiting on the lock, it lets other processes/threads be scheduled and do useful work during the wait.


As far as I remember, a futex is a fast mutex.


It is the thing behind/inside a mutex. It makes uncontended mutex acquire/release ops a pure userspace thing.


> The use case lies in the Wine implementation of the Windows NT interface WaitMultipleObjects.

This can't only be a win for emulating Windows, surely (future) Linux native code could benefit from this too?


Linux's answer is io_uring with eventfd. I've noticed it has higher latency than WaitForMultipleObjects though. I've had a hard time getting the 99th percentile latency below 5usec.

It would be nice if io_uring polling had a futex-like feature that spun in userspace rather than having to syscall io_uring_wait every time. I feel like that feature would be more universal than futex2.

Adding an atomic int somewhere in the userspace would add ~100ns in the kernel path but save many microseconds in the userspace-spin path during high contention so it seems like a win to me.


My understanding of io_uring is that you can already poll the CQ with atomics without any syscall. io_uring_enter() is only needed to inform the kernel that new SQ entries exist and that too can be replaced by polling via IORING_SETUP_SQPOLL in privileged processes.


Spinning consumes a lot of power, so I'd guess one would only want to do that when very low latency is truly required. At least I certainly wouldn't want it to be the default for the generic case.

Of course a low latency option should also exist for those cases that require it and currently that pretty much means spinning.


> I've had a hard time getting the 99th percentile latency below 5usec.

Can we take a moment to appreciate how mind-blowingly astonishing a 99th-percentile latency in the µs range is on a general-purpose, multi-user OS with support for virtual memory?


> This can't only be a win for emulating Windows, surely (future) Linux native code could benefit from this too?

That would certainly help porting Windows software to Linux.



It's still a thing! kernel: https://repo.or.cz/linux/zf.git/shortlog/refs/heads/winesync wine: https://repo.or.cz/wine/zf.git/shortlog/refs/heads/fastsync2

They decided to rename it since the kernel maintainers don't want NT interfaces, and having NT in the name doesn't help.


That's good to hear. The Darling devs also went with a kernel-module approach. It's weird the Wine devs did not do the same until now. Wineserver is nice, but as we see, it's sometimes impossible to do accurate emulation of some APIs there.

https://docs.darlinghq.org/internals/basics/system-call-emul...


I guess they didn't go down that path in order to increase portability. A good chunk of crossover (wine with paid support) customers run macOS.


That's a good point


Too bad bpf isn't allowed for non-root yet. A bpf futex program like xok's wake programs would be more general, and way nicer.


What are you talking about?

I have been using bpf in seccomp for non-root for years.

Maybe your distro patched that out. The stock kernel allows bpf for non-root.


Sort of... there's been pushback in the post-Spectre days against expanding non-root usage of BPF. Right now, a subset of trace programs and seccomp are allowed for non-root, but that hasn't expanded since 2016. Everything else requires CAP_SYS_ADMIN.

They would probably remove those too if the balance of Linux didn't put "don't break user space" higher than "defense in depth security".


Just in recent months we've had CVE-2020-2717{0,1} (memory exposure) followed by CVE-2021-29154 (arbitrary code execution on x86). None of which have even had fixes applied by certain distros[0].

I don't really know what the use cases for unprivileged BPF are though.

[0] https://access.redhat.com/security/cve/cve-2021-29154


what's the nt api call called? WaitForMultipleObjects? takes me back 20 years just thinking about it...


The NT API would begin NtXxxx; WaitForMultipleObjects is a Win32 function.


I don't know why you are being downvoted, you are completely right. The NT api function is NtWaitForMultipleObjects. The Win32 apis WaitForMultipleObjects and WaitForMultipleObjectsEx are thin wrappers over this.


Is this NT API actually possible to be called from userspace? Or even from drivers?


Yes of course, they are all exported from ntdll.dll (in userspace) and ntoskrnl.exe (in kernelspace). You may need to link dynamically or make your own import libraries for many of them though.



Based on the docs, kernel-mode drivers would call KeWaitForMultipleObjects instead, it appears. User space should just call WaitForMultipleObjects. While NtWaitForMultipleObjects and ZwWaitForMultipleObjects exist, they seem to be undocumented, which means they probably have funky behavior. Anything in KernelBase.dll is a pretty thin wrapper that I'd be hesitant to call Win32 so much as a "really thin syscall wrapper".


think of it as select and pthread_cond_wait rolled into one generic api for waiting on one or more things from the kernel.

it's actually pretty nice how generic they made it.


If we're being pedantic here, it's not Win32.

> (Note that this was formerly called the Win32 API. The name Windows API more accurately reflects its roots in 16-bit Windows and its support on 64-bit Windows.)

Though Win32 is going to stick around for a while as can be seen in the URL for the page that's quoted from: https://docs.microsoft.com/en-us/windows/win32/apiindex/wind...


i think it's still called win32 today because win32 was a massive overhaul. 16 bit windows was loaded with that far and near pointer gunk to stretch those 16 bit pointers which win32 was able to do away with thanks to larger pointer widths.

regarding win32 vs. the nt api. i'm not super familiar with the pedantic distinctions there, but i do know, much to the programmer's chagrin, the thousands of windows api functions have varying levels of kernel entry, where i always have appreciated the (relatively) small number of unix style syscalls and the clear delineation between kernel and userspace api. (although i suppose futexes blur that, in name at least!)


Well, since introduction of vsyscall and vDSO, the Linux syscalls too have varying levels of kernel entry.


NtXxxx or ZwXxxx


By the way, hearing about futex implementation and their API considerations should also bring everyone about 20 years back.


WaitOnAddress and co. But that one was added with Windows 8, so it can't take you back 20 years :)


WFMO is indeed the api that the new futex_waitv is supposed to help accelerate under wine.



Unclear to me from the writeup: Would it be possible to implement the original futex() API on top of futex2() or is Linux doomed to preserve most of two complete futex implementations?


Would it be possible? Probably yes, but not performant.


A bit OT but I don't understand why kernel interfaces are single purpose vs a data-driven api, it seems like such a wasted opportunity to optimize and to have a fluid api


if linux people focused on offering compelling ecosystem for app/game distribution, they wouldn't need to bloat their os with slow and bloated emulators like wine

the reason why linux isn't gaining shares on steam anymore is because valve openly said linux has no ecosystem and is forced to emulate windows -- proton


The alternative is expecting companies to make native ports for what would be less than 1% of their userbase - and the handful that have done so haven't done it well, and even if they did it well it wouldn't change any outcomes for them.

Before Wine actually got good, the few "native" ports we had of major games on Linux were poorly optimised and often worse than just running the Windows versions in Wine.

Sometimes you gotta take what you can get. Valve's approach to invest in Wine means I don't need to dual boot any more. Native games are the cherry on top.

>the reason why linux isn't gaining shares on steam anymore

Did anyone ever think Linux had any kind of comparable gaming ecosystem? lol

Valve invests in Linux as a safety net in case Microsoft ever flicks a switch and forces all game distribution through the Windows Store or similar.


> The alternative is expecting companies to make native ports for what would be less than 1% of their userbase - and the handful that have done so haven't done it well, and even if they did it well it wouldn't change any outcomes for them.

Valve decided to seek short term gains instead of thinking of a long term strategy by working with linux people to develop a nice ecosystem with libraries / drivers / app distribution

They could collaborate with Epic/Unity to facilitate linux builds

Valve went selfish and the result is here, linux share isn't growing, despite massive efforts from microsoft to kill themselves with vista, windows 8, and window 10 massive spyware drama and now the updates that break gaming

What if windows 11 is again a free upgrade and completely breaks wine? by introducing massive architecture changes

That'll be the death of wine and linux gaming, because no thought went into securing a nice and flourish ecosystem


> Valve decided to seek short term gains instead of thinking of a long term strategy by working with linux people to develop a nice ecosystem with libraries / drivers / app distribution.

The part that you suggest they should do "instead" is exactly what they have been doing. As far as I know, every game by Valve is native on Linux, along with their source engine.

> They could collaborate with Epic/Unity to facilitate linux builds

Both Unreal Engine and Unity can export games for Linux. Both seem to be able to be run on Linux for development as well. What more is needed?

> What if windows 11 is again a free upgrade and completely breaks wine?

There is literally nothing Microsoft could do to break wine; it isn't their software. If Microsoft breaks backwards compatibility, no existing software for Windows would run on the new version. Linux + WINE would be the best option for running the majority of Windows software with the most recent updates (one could of course still use older versions of Windows instead, but that's a bad idea long term).


You seem to have better ideas on how to grow the Linux ecosystem

By looking at Steam stats it doesn't seem to work, but I'll trust your judgement since it seems to.. not work at all.. but that might be the goal, to waste time, effort, money, and they gotta justify the 30% tax somehow, hey we are working on things nobody cares about!


Wine is not an emulator, it is a compatibility layer for Windows system calls on a Unix-like platform. If something runs on wine, the performance overhead should be unnoticeable/very small for the typical user.


A compatibility layer wouldn't require that you essentially emulate the same file system layout, registry and functionality. It is basically emulating a Windows environment.


In many cases it is indeed not so simple:

* Most nontrivial Windows programs would get very confused if exposed to a trivially mapped Unix filesystem. It's not so bad if every program used environment variables to locate stuff, but you can't discount the possibility that some of them hardcode paths. Also, application data in Windows is usually kept in a common folder, while in Unix filesystems it is usually split across `/usr/bin`, `/usr/share`, `/usr/lib` and so on.

* Many Windows applications store their configuration in the registry. That part is not too bad though; it should be completely doable in userspace.

* The application might issue system calls that have different semantics, or that don't even exist on Linux. TFA is significant because it paves the way to an efficient implementation of one of these system calls.


> Most nontrivial Windows programs would get very confused if exposed to a trivially mapped Unix filesystem. It wouldn't be so bad if every program used environment variables to locate things, but you can't discount the possibility that some of them hardcode paths. Also, application data in Windows is usually kept in a common folder, while on Unix filesystems it is usually split across `/usr/bin`, `/usr/share`, `/usr/lib` and so on.

The application data in Windows would not be the equivalent of the `/usr/` paths; it would be the equivalent of `~/.config` and `~/.local`. The last thing anyone on Linux would want is for Windows apps to start populating their system folders, let alone creating hundreds of top-level folders in their `~/.config` or `~/.local` directory.

> Many Windows applications store their configuration in the registry. That part is not too bad though; it should be completely doable in userspace.

The Windows registry is an atrocious monster that should have died already. You can rarely ever truly delete an app in Windows, as there will sometimes be hundreds of registry keys left over.

> TFA is significant because it paves the way to an efficient implementation of one of these system calls.

If there is mutual benefit outside of Windows, then sure why not. If it is built just for better Wine compatibility with Windows then I hope to see a rant by Linus telling them how ridiculous they are.


Sorry, my bad here. Of course, only the binaries and static assets are stored in C:\Program Files. Your point is the perfect argument, though, for why Wine emulates a Windows file system hierarchy.

I don't like the registry either. My `not too bad` refers to the effort required to support it. Wine should, though, make it possible to give each application its own copy of the registry; this copy could be nuked along with the application when it is uninstalled.

Linus doesn't like a lot of things. ZFS, for example. It certainly sucks to maintain out-of-kernel patches, but if there is enough benefit, someone will do it. And it would actually be pretty sweet and ironic if Windows applications ran faster on the Linux desktop than on Windows. Also, I am optimistic that futex2 will find many applications outside of Wine.


So Windows 10 emulates Windows XP/2000/98 too?


Pretty much... they took a pile of crap and kept adding onto it.


That has nothing in common with emulating, though. Running ARM code on x64 is emulating. If adding code were emulating, then Linux is emulating Linux v0.1. Wine isn't an emulator. PCSX2 is an emulator.


The word "emulate" is overloaded, but because there's often an assumption of ISA emulation, Wine for a time called itself "WINE Is Not an Emulator" to prevent said assumption.

Taking "emulate" at its dictionary-definition, which is to imitate or copy, it's easy to argue that Wine is an emulator, because it copies and imitates the Windows ABI and API (the earliest versions of Wine even called itself a "WINdows Emulator"). They just happen to draw a hard line against emulating x86.


Ironically, it seems quite often that Wine runs apps/games faster than Windows does.


w.i.n.e = Wine Is Not an Emulator


do not trust what people say they are

define people by what they do

that works with software too

