Metal shader converter and the missing device-scoped barrier

bronxbomber92 · on June 13, 2023

I believe this post is referring to device-scoped memory barriers - also sometimes called fences - as opposed to execution barriers.

The former is a mechanism to ensure memory accesses follow a well-defined order (e.g. it'd be bad if the memory accesses executed inside a critical section could be reordered to before the lock call or after the unlock call).

The latter is a mechanism that ensures all threads (within some scope, perhaps all threads running on the "device") reach the same point in the program before any are allowed to proceed.
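To make the distinction concrete, here is a rough CPU-side analogy in Rust rather than a shading language. It is only a sketch: the function names `publish_with_fence` and `rendezvous` are made up for illustration, and CPU fences are not the same thing as GPU device-scoped barriers, though the ordering idea carries over.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU32, AtomicUsize, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

/// Memory barrier (fence): orders the payload write before the flag write.
/// No thread is forced to wait for the others at a common program point.
fn publish_with_fence() -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));
    let (d, r) = (Arc::clone(&data), Arc::clone(&ready));

    let producer = thread::spawn(move || {
        d.store(42, Ordering::Relaxed); // payload
        fence(Ordering::Release); // payload may not sink below this point
        r.store(true, Ordering::Relaxed); // flag
    });

    // Consumer: spin until the flag is visible, then fence before reading.
    while !ready.load(Ordering::Relaxed) {
        std::hint::spin_loop();
    }
    fence(Ordering::Acquire); // payload read may not hoist above this point
    let seen = data.load(Ordering::Relaxed);
    producer.join().unwrap();
    seen
}

/// Execution barrier: every thread blocks in `wait` until all `n` have arrived.
fn rendezvous(n: usize) -> Vec<usize> {
    let barrier = Arc::new(Barrier::new(n));
    let arrived = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let b = Arc::clone(&barrier);
            let a = Arc::clone(&arrived);
            thread::spawn(move || {
                a.fetch_add(1, Ordering::SeqCst);
                b.wait(); // nobody proceeds until everyone has incremented
                a.load(Ordering::SeqCst)
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

In the first function the consumer is guaranteed to see the payload once it sees the flag; in the second, every thread observes the full arrival count after the barrier.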

raphlinus · on June 13, 2023

That's correct, it's the memory scope that I expect to be device-scoped. GPUs tend not to have execution barriers in the shader language beyond workgroup scope; generally the next coarser granularity for synchronization is a separate dispatch. However, single-pass prefix sum algorithms, including decoupled look-back, can function just fine with device-scoped memory barriers, and do not require execution barriers with coarser scope than workgroup.
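As a rough illustration of why memory ordering alone is enough, here is a CPU simulation of the look-back idea in Rust, with one thread standing in for each workgroup. This is a sketch under assumptions, not the Vello code: the names (`PartitionState`, `single_pass_scan`, the flag constants) are invented for the example, release/acquire on the flag stands in for the device-scoped barrier, and it relies on the OS eventually scheduling every thread, which is a forward-progress guarantee GPUs do not always give.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

const FLAG_NOT_READY: u32 = 0;
const FLAG_AGGREGATE: u32 = 1; // this partition's own sum is published
const FLAG_PREFIX: u32 = 2; // inclusive prefix through this partition is published

struct PartitionState {
    flag: AtomicU32,
    aggregate: AtomicU32,
    prefix: AtomicU32,
}

/// Inclusive prefix sum; `part_size` must be non-zero.
fn single_pass_scan(input: &[u32], part_size: usize) -> Vec<u32> {
    let n_parts = (input.len() + part_size - 1) / part_size;
    let state: Vec<PartitionState> = (0..n_parts)
        .map(|_| PartitionState {
            flag: AtomicU32::new(FLAG_NOT_READY),
            aggregate: AtomicU32::new(0),
            prefix: AtomicU32::new(0),
        })
        .collect();
    let mut output = vec![0u32; input.len()];

    thread::scope(|s| {
        let chunks = input.chunks(part_size).zip(output.chunks_mut(part_size));
        for (i, (chunk_in, chunk_out)) in chunks.enumerate() {
            let state = &state;
            s.spawn(move || {
                // Publish the local sum: data first, then the flag with Release.
                let local = chunk_in.iter().fold(0u32, |a, &x| a.wrapping_add(x));
                state[i].aggregate.store(local, Ordering::Relaxed);
                state[i].flag.store(FLAG_AGGREGATE, Ordering::Release);

                // Look back over predecessors until one has a full prefix.
                let mut exclusive = 0u32;
                let mut look = i;
                while look > 0 {
                    let prev = &state[look - 1];
                    match prev.flag.load(Ordering::Acquire) {
                        FLAG_PREFIX => {
                            exclusive = exclusive.wrapping_add(prev.prefix.load(Ordering::Relaxed));
                            break;
                        }
                        FLAG_AGGREGATE => {
                            exclusive = exclusive.wrapping_add(prev.aggregate.load(Ordering::Relaxed));
                            look -= 1;
                        }
                        _ => std::hint::spin_loop(), // predecessor not published yet
                    }
                }

                // Publish the inclusive prefix so successors can stop here.
                state[i].prefix.store(exclusive.wrapping_add(local), Ordering::Relaxed);
                state[i].flag.store(FLAG_PREFIX, Ordering::Release);

                // Final local scan, seeded with the exclusive prefix.
                let mut running = exclusive;
                for (out, &x) in chunk_out.iter_mut().zip(chunk_in) {
                    running = running.wrapping_add(x);
                    *out = running;
                }
            });
        }
    });
    output
}
```

Note that no thread ever waits for all the others to reach a common point; each one only needs to see a predecessor's data once it has seen that predecessor's flag.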

samus · on June 14, 2023

The post also mentions unspecified behavior (mixing atomic and non-atomic memory accesses) where everybody has to cross their fingers and hope that the hardware designers had the same idea about how it should work. Which is almost fine with enough test coverage, but a shader translation layer adds uncomfortable complexity on top of it.

tedunangst · on June 13, 2023

So how [well] does MoltenVK work? The prevailing attitude I've seen is basically "just target vulkan for everything because it just works" but I'm not sure how much experience is reflected in such claims.

raphlinus · on June 13, 2023

If you're doing advanced compute work (including lock-free data structures), then it's best effort.

https://github.com/linebender/vello/issues/42 is an issue from when Vello (then piet-gpu) had a single-pass prefix sum algorithm. Looking back, I'm fairly confident that it's a shader translation issue and that it wouldn't work with MoltenVK either, but we stopped investigating when we moved to a more robustly portable approach.

HexDecOctBin · on June 14, 2023

So in general, am I right in assuming that any advanced compute work would be a no-go on Apple Silicon?

I am working on a 3D SDF renderer for games (fully compute driven), and my older PC is starting to croak. I was thinking of going Mac Studio 2, but if their GPU doesn't support memory barriers and such (even though I am not using them yet), I guess it's not worth the risk?

raphlinus · on June 14, 2023

It really depends on your workload. Prefix sum is pretty important, but you can also work around the missing barrier by doing extra dispatches. You'll want to do that if your goal is portable code. Metal also has some nice things, including real pointers (available in Vulkan 1.3 but not earlier versions, and not in HLSL).
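For a sense of what the extra-dispatch workaround looks like, here is a sequential Rust sketch of a three-pass scan in which each function stands in for one dispatch, so the boundary between dispatches provides the synchronization. The function names and the three-pass split are illustrative assumptions, not a description of any particular codebase.

```rust
/// Pass 1: each workgroup reduces its partition to a single sum.
fn reduce_partitions(input: &[u32], part_size: usize) -> Vec<u32> {
    input
        .chunks(part_size)
        .map(|chunk| chunk.iter().fold(0u32, |a, &x| a.wrapping_add(x)))
        .collect()
}

/// Pass 2: exclusive scan over the (much smaller) array of partition sums.
fn exclusive_scan(sums: &[u32]) -> Vec<u32> {
    let mut running = 0u32;
    sums.iter()
        .map(|&s| {
            let before = running;
            running = running.wrapping_add(s);
            before
        })
        .collect()
}

/// Pass 3: each workgroup scans its partition, seeded with its offset.
fn scan_with_offsets(input: &[u32], offsets: &[u32], part_size: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(input.len());
    for (chunk, &offset) in input.chunks(part_size).zip(offsets) {
        let mut running = offset;
        for &x in chunk {
            running = running.wrapping_add(x);
            out.push(running);
        }
    }
    out
}

/// Inclusive prefix sum built from the three passes; `part_size` must be non-zero.
fn multi_pass_scan(input: &[u32], part_size: usize) -> Vec<u32> {
    let sums = reduce_partitions(input, part_size);
    let offsets = exclusive_scan(&sums);
    scan_with_offsets(input, &offsets, part_size)
}
```

The cost relative to a single-pass approach is that the input is read twice and there are more dispatches, but nothing here depends on a device-scoped barrier.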

richdodd · on June 13, 2023

Does the M1/M2 use ARM designs in the GPU as well as the CPU? If so, it might be possible to work out what could be implemented by looking at the [arm docs](https://developer.arm.com/documentation/102203/0100/Valhall-...).

DeRock · on June 13, 2023

Apple doesn’t use ARM IP for either, and hasn’t for many years.

raphlinus · on June 13, 2023

The most complete documentation is in the applegpu repo[1] by dougallj showing a great deal of recent activity (including by alyssarosenzweig). Last I checked, the documentation of barrier instructions wasn't complete enough to tell whether these device-scoped barriers are possible. (Note: on RDNA2, they're accomplished by DLC and GLC flags on memory accesses, combined with cache flush instructions such as S_GL1_INV).

There's also a lot of great material, accessibly written, on Alyssa's blog[2], see in particular the posts titled "Dissecting the Apple M1 GPU, part ${I}".

[1]: https://github.com/dougallj/applegpu

[2]: https://rosenzweig.io/

nicoburns · on June 13, 2023

No, they have a custom GPU design originally derived from Imagination Technologies PowerVR GPUs.

richdodd · on June 13, 2023

Hmm, OK. According to the documentation they designed the GPU themselves, so there's no public information on it.

pjmlp · on June 14, 2023

> The Vulkan ecosystem is notorious for this: the extension list at vulkan.gpuinfo.org currently lists 146 extensions.

Proudly following up on the OpenGL extension-spaghetti developer experience.

As mentioned at Vulkanised 2023, Khronos keeps pouring extensions at a rate no one is able to catch up with.

No wonder proprietary APIs keep being preferred by game studios, with their developer-first tooling approach.

samus · on June 14, 2023

Khronos can't really do anything else, since it is a consortium in which hardware vendors take part, and they can block or outright ignore anything they don't like. It's Microsoft who can impose APIs and "standards" thanks to their market dominance.

Many extensions eventually become parts of a core profile. Some of them are clarifications or improvements to existing extensions. Many are APIs for proprietary features that never become part of any core profile.

pjmlp · on June 14, 2023

And Apple, and Sony, and Nintendo.

Extension spaghetti is only the same API in name, given how different many code paths tend to be.

Do you mean the hardware vendors, happily collaborating with platform owners as means to design their future hardware roadmap, leaving Khronos extensions to 2nd place, like ray tracing, mesh shaders, direct IO,...?

Also the fact that GLSL development is kind of done, and almost everyone is using HLSL to SPIR-V instead?

moonchild · on June 14, 2023

While you're here, I should ask: what do you think of my middle-ground proposal (https://lobste.rs/s/oxzs1q/note_on_metal_shader_converter) that the interface contain a black-box scan primitive, rather than stronger low-level guarantees?

raphlinus · on June 14, 2023

My personal feeling is that higher level layers should be built on top of strong lower level layers, and we don't have that yet. Scan is of course useful, and there are a ton of them in Vello, but there are other things that are somewhat scan-like but use low-level primitives in a different way, like my stack monoid work.
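To illustrate what "scan-like" can mean beyond plain addition, here is a small Rust sketch of a scan over a different associative operation: counting unmatched brackets. To be clear, this is not the stack monoid itself, only a much simpler relative chosen for illustration, and the names `Bracket` and `bracket_scan` are made up.

```rust
/// Summary of a bracket substring: how many `)` are unmatched on the left
/// and how many `(` are unmatched on the right.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Bracket {
    closes: u32,
    opens: u32,
}

impl Bracket {
    const IDENTITY: Bracket = Bracket { closes: 0, opens: 0 };

    fn from_char(c: char) -> Bracket {
        match c {
            '(' => Bracket { closes: 0, opens: 1 },
            ')' => Bracket { closes: 1, opens: 0 },
            _ => Self::IDENTITY,
        }
    }

    /// Associative combine: opens on the left cancel against closes on the right.
    fn combine(self, rhs: Bracket) -> Bracket {
        let matched = self.opens.min(rhs.closes);
        Bracket {
            closes: self.closes + rhs.closes - matched,
            opens: self.opens + rhs.opens - matched,
        }
    }
}

/// Sequential inclusive scan; because `combine` is associative, the same
/// result could be computed by any parallel scan strategy.
fn bracket_scan(text: &str) -> Vec<Bracket> {
    let mut acc = Bracket::IDENTITY;
    text.chars()
        .map(|c| {
            acc = acc.combine(Bracket::from_char(c));
            acc
        })
        .collect()
}
```

A black-box scan primitive would handle this case as long as it accepts a user-supplied operator, which is part of why the shape of the primitive matters.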

Out of scope for this blog post, but I also believe there is tremendous potential for hardware that broadly resembles existing GPUs (same register layout, instruction scheduling, memory hierarchy) but with fewer limitations and performance problems. For example, the queue that's currently hardcoded between vertex and fragment shaders could be exposed as a general purpose queue primitive, allowing more dynamic scheduling of multiple different kernels in parallel. I haven't done a deep enough dive into hardware to make concrete proposals, but as I explore, I get more and more indications that what we have access to is a fairly limited subset of what's possible.

robbies · on June 14, 2023

That's an interesting idea. I wouldn't really call the Raster Block a 'queue' in most GPUs. The most queue-like thing is the buffer used to store non-position exports from VS, but then those are interpolated before being dropped off into PS. There's a lot of fixed-function machinery to convert vertices into fragments (setup, cull, clip, rasterize, early Z/S).

And that's not even considering the vast differences between AMD and NV here. And then there are the mobile GPU constraints.

Still, would be fun to export more of the controls here. If you have access to a console devkit, some of these knobs are exposed.

Animats · on June 13, 2023

Apple having to Think Different means we need about two more layers in portable games.

pjmlp · on June 14, 2023

And Sony, Nintendo, Microsoft,...