12bitfloat

295d

PSA: The smaller the compute shader workgroups, the more efficient they are, down to the wave size (32 on NVIDIA). Not exactly sure why, but it looks like if you don't need group shared memory, you should always make your workgroups wave sized

Just this alone gave me a 30%+ performance increase. And combined with a few other changes got me from 50 µs to 10 µs, yay!
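For anyone who wants to sanity-check the sizes being discussed, here is a rough Python sketch of how a local size maps to waves and dispatch counts. It is not the shader code from the rant: the function names and the 1920x1080 extent are made up for illustration, and the 32-wide wave is the NVIDIA figure mentioned above (other hardware may differ).

```python
from math import prod

WAVE_SIZE = 32  # assumption: 32-wide waves, as quoted for NVIDIA above


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def waves_per_workgroup(local_size, wave_size=WAVE_SIZE):
    """Number of waves needed to run one workgroup of the given local size."""
    return ceil_div(prod(local_size), wave_size)


def idle_lanes(local_size, wave_size=WAVE_SIZE):
    """Lanes left unused when the workgroup is not a multiple of the wave size."""
    return waves_per_workgroup(local_size, wave_size) * wave_size - prod(local_size)


def dispatch_counts(extent, local_size):
    """Workgroup counts per axis needed to cover the extent (rounded up)."""
    return tuple(ceil_div(e, l) for e, l in zip(extent, local_size))


# Hypothetical fullscreen pass, one invocation per pixel.
extent = (1920, 1080, 1)
for local_size in [(8, 4, 1), (8, 8, 1), (16, 16, 1)]:
    print(local_size,
          "waves/group:", waves_per_workgroup(local_size),
          "idle lanes:", idle_lanes(local_size),
          "groups:", dispatch_counts(extent, local_size))
```

This only shows the bookkeeping (8*4*1 is exactly one wave, 8*8*1 is two, 16*16*1 is eight). It says nothing about which one is faster, so profile on your own hardware.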

random

vulkan

psa

Ranter

Comments

4

12bitfloat

10655

295d

Update: Actually I'm kinda wrong. I have a fullscreen workload and that is fastest with 8*8*1 workgroups. Both 8*4*1 (wave sized) and 16*16*1 are noticeably slower...

Guess if you're reading from an image per globalInvocationId, cache also plays a big role and having 64 threads closer together in terms of cache access outweighs some of the gains of smaller workgroup sizes?
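To make that guess a bit more concrete, here is a toy Python model. Big assumption: it pretends the image is stored in 8x8 texel blocks and that a whole block gets fetched whenever any texel in it is read. That block size is hypothetical; real GPU image layouts are vendor specific, so treat this as an illustration of the idea, not a description of any actual cache.

```python
def blocks_touched(origin, local_size, block=(8, 8)):
    """Count the hypothetical memory blocks covered by one workgroup's 2D tile."""
    x0, y0 = origin
    w, h = local_size
    blocks_x = (x0 + w - 1) // block[0] - x0 // block[0] + 1
    blocks_y = (y0 + h - 1) // block[1] - y0 // block[1] + 1
    return blocks_x * blocks_y


def fetch_utilization(origin, local_size, block=(8, 8)):
    """Fraction of the fetched texels that this workgroup actually reads."""
    used = local_size[0] * local_size[1]
    fetched = blocks_touched(origin, local_size, block) * block[0] * block[1]
    return used / fetched


# Workgroup tiles placed at the image origin, one texel read per invocation.
for local_size in [(8, 4), (8, 8), (16, 16)]:
    print(local_size,
          "blocks:", blocks_touched((0, 0), local_size),
          "utilization:", fetch_utilization((0, 0), local_size))
```

Under this made-up layout an 8*4 group uses only half of the block it pulls in, so a second group has to touch the same block again, while 8*8 uses all of it. It does not explain why 16*16 is slower; that would have to come from something else, like the workgroup size effect from the original rant.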
2

Lensflare

19801

295d

No idea what you are talking about but sounds cool
2

atheist

10881

294d

Welcome to the world of performance. Profile everything, your assumptions are probably wrong.
2

CoreFusionX

3562

294d

SIMD performance can be really hard to analytically predict.

In the case of compute shaders, it really boils down to them needing access to something else besides their own vertex/geometry/pixel/whatever.

That forces intrinsic dependencies between them, which coupled with, as you correctly said, caching and threading phenomena, can unpredictably impact performance.
2

Wisecrack

9365

292d

@Lensflare I just came here to say this. No idea, but sounds cool, and also write more about the topic we don't understand because it has the same flavor of fun as reading about pseudo-esoteric wizard rituals in third party DnD supplements.

More blood sacrifice please.

devRant © 2021 Hexical Labs LLC