V2 - Difference between Vector3Wide and Vector3

Post by **Nockawa** » Sun Aug 19, 2018 12:40 pm

First, congratulations on this amazing lib; the perf is awesome for a .NET project!
I'm trying to understand things a little better and I have a few questions.
The first is about the Vector*Wide classes you implemented: how do they differ from the System.Numerics.Vector* types?

Thanks

Post by **Norbo** » Sun Aug 19, 2018 9:21 pm

Vector3 represents a single 3d vector, while Vector3Wide represents N different 3d vectors stored in a bundle. In the current implementation based on Vector<T>, N is Vector<float>.Count.

So, anywhere you see a Vector*Wide, whatever operation it's doing is actually applied in parallel across a batch of multiple lanes. All the constraints and most collision detection routines operate on N different constraints/collision pairs at a time.
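To make the "bundle of N vectors" idea concrete, here is a minimal sketch of such a layout. `Vector3Bundle`, `GetLane`, and `BundleLayoutSketch` are hypothetical names used only for illustration and are not the library's actual types; only `Vector<float>` and `Vector3` come from System.Numerics.

```csharp
using System.Numerics;

// Hypothetical stand-in for a Vector*Wide type; not the library's actual definition.
public struct Vector3Bundle
{
    // One SIMD vector per component: lane i of X, Y, and Z together form the i-th 3D vector.
    public Vector<float> X;
    public Vector<float> Y;
    public Vector<float> Z;

    // Number of 3D vectors held by one bundle (the N described above).
    public static int LaneCount => Vector<float>.Count;

    // Collects lane i of each component into an ordinary Vector3.
    public Vector3 GetLane(int i) => new Vector3(X[i], Y[i], Z[i]);
}

public static class BundleLayoutSketch
{
    // Builds a bundle whose lane i holds the vector (i, i + 0.25, i + 0.5).
    public static Vector3Bundle CreateExample()
    {
        int n = Vector3Bundle.LaneCount;
        var xs = new float[n];
        var ys = new float[n];
        var zs = new float[n];
        for (int i = 0; i < n; ++i)
        {
            xs[i] = i;
            ys[i] = i + 0.25f;
            zs[i] = i + 0.5f;
        }
        Vector3Bundle bundle;
        bundle.X = new Vector<float>(xs);
        bundle.Y = new Vector<float>(ys);
        bundle.Z = new Vector<float>(zs);
        return bundle;
    }
}
```

Calling `BundleLayoutSketch.CreateExample().GetLane(1)` would return the second 3D vector in the bundle. The lane count depends on the machine and runtime, so the sketch never hard-codes it.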

The reason for this is performance. All modern CPU architectures are able to execute multiple operations in a single cycle, both by having multiple execution ports for certain instructions and, critically, by having Single Instructions that work on Multiple pieces of Data (SIMD). So, rather than just adding two numbers together, you can add two sets of numbers together.

System.Numerics.Vector* types are recognized by the JIT, which emits such SIMD instructions for them, but they are limited to the size of the vector in question. Adding two Vector3s together only gets you 3 simultaneous adds in one instruction. In contrast, pretty much all new processors can do 8-wide operations on 32-bit floating point numbers, and some can do 16-wide operations (though I don't think CoreCLR exposes AVX512 yet).

Since Vector3Wide conceptually represents a number of Vector3 instances equal to the machine/runtime SIMD width, you can make full use of the available computational throughput. In other words, with each line representing one operation, 8 Vector3 adds look like this:


            [rx0, ry0, rz0] = [x0, y0, z0] + [x0, y0, z0]
            [rx1, ry1, rz1] = [x1, y1, z1] + [x1, y1, z1]
            [rx2, ry2, rz2] = [x2, y2, z2] + [x2, y2, z2]
            [rx3, ry3, rz3] = [x3, y3, z3] + [x3, y3, z3]
            [rx4, ry4, rz4] = [x4, y4, z4] + [x4, y4, z4]
            [rx5, ry5, rz5] = [x5, y5, z5] + [x5, y5, z5]
            [rx6, ry6, rz6] = [x6, y6, z6] + [x6, y6, z6]
            [rx7, ry7, rz7] = [x7, y7, z7] + [x7, y7, z7]

and now the same result, but with Vector3Wide.Add:


[rx0, rx1, rx2, rx3, rx4, rx5, rx6, rx7] = [x0, x1, x2, x3, x4, x5, x6, x7] + [x0, x1, x2, x3, x4, x5, x6, x7]
[ry0, ry1, ry2, ry3, ry4, ry5, ry6, ry7] = [y0, y1, y2, y3, y4, y5, y6, y7] + [y0, y1, y2, y3, y4, y5, y6, y7]
[rz0, rz1, rz2, rz3, rz4, rz5, rz6, rz7] = [z0, z1, z2, z3, z4, z5, z6, z7] + [z0, z1, z2, z3, z4, z5, z6, z7]
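The three wide adds above can be sketched in C# roughly as follows. `WideVec3` and `WideAddSketch` are hypothetical names for illustration, and the `Add` body is an assumption about the general shape of such a function rather than the library's actual code.

```csharp
using System.Numerics;

// Hypothetical stand-in for Vector3Wide; not the library's actual type.
public struct WideVec3
{
    public Vector<float> X, Y, Z;

    // Three vector adds, each covering Vector<float>.Count lanes at once.
    public static void Add(in WideVec3 a, in WideVec3 b, out WideVec3 result)
    {
        result.X = a.X + b.X;
        result.Y = a.Y + b.Y;
        result.Z = a.Z + b.Z;
    }
}

public static class WideAddSketch
{
    // Fills lane i with the vector (i, 10 * i, 100 * i), then adds the bundle to itself.
    public static WideVec3 AddBundleToItself()
    {
        int n = Vector<float>.Count;
        var xs = new float[n];
        var ys = new float[n];
        var zs = new float[n];
        for (int i = 0; i < n; ++i)
        {
            xs[i] = i;
            ys[i] = 10 * i;
            zs[i] = 100 * i;
        }
        WideVec3 a;
        a.X = new Vector<float>(xs);
        a.Y = new Vector<float>(ys);
        a.Z = new Vector<float>(zs);
        WideVec3.Add(a, a, out var result);
        return result;
    }
}
```

Each line of `Add` corresponds to one row of the wide listing: every lane of the X components is summed in one operation, then Y, then Z.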

Post by **Nockawa** » Sun Aug 19, 2018 10:49 pm

Thank you for the detailed explanation. But there's still something I don't fully understand.
Let's take the Triangle/TriangleWide classes as an example: Triangle holds one triangle and TriangleWide should hold 8 triangles on AVX2.
I don't understand how you store the eight triangles in a single TriangleWide instance.
I saw two methods that involve creation/writing: Broadcast, which seems to set up a Wide object from only one instance, and WriteFirst, which defers to its Vector3Wide counterpart and seems to set up only the first lane.

I'm obviously missing something and trying to figure it out.

Oh my, I just may have found the answer: the GatherScatter.GetOffsetInstance is used to shift by the given index the *Wide object to return the desired lane?
But it means GetOffsetInstance returns a *Wide object but which may actually not hold Vector<float>.Count elements but Count-offset, right?
Then you use Broadcast on the returned *Wide object to store the actual data and ReadFirst to store a lane on a Vector3 object?

Post by **Norbo** » Sun Aug 19, 2018 11:54 pm

> Oh my, I just may have found the answer: the GatherScatter.GetOffsetInstance is used to shift by the given index the *Wide object to return the desired lane?
> But it means GetOffsetInstance returns a *Wide object but which may actually not hold Vector<float>.Count elements but Count-offset, right?

Yup!

> Then you use Broadcast on the returned *Wide object to store the actual data and ReadFirst to store a lane on a Vector3 object?

Broadcast fills every lane of the current Wide instance, so it could stomp memory past the end of the 'real' *Wide object. To fill only a single lane, the WriteFirst functions are used. Those reinterpret the memory as scalar values for direct writing. ReadFirst does the same for direct reading.
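As a rough sketch of how these pieces could fit together, the code below uses `Unsafe.As` and `Unsafe.Add` from System.Runtime.CompilerServices to reinterpret and shift references. `WideVec3` and `LaneAccessSketch` are hypothetical types; the method names mirror those mentioned in this thread, but the bodies are assumptions and not the library's actual implementation. The bundle lives in an array here so that the example does not depend on how the JIT enregisters locals.

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for Vector3Wide; bodies are illustrative assumptions.
public struct WideVec3
{
    public Vector<float> X, Y, Z;

    // Fills every lane of every component with the same source vector.
    public static void Broadcast(in Vector3 source, out WideVec3 wide)
    {
        wide.X = new Vector<float>(source.X);
        wide.Y = new Vector<float>(source.Y);
        wide.Z = new Vector<float>(source.Z);
    }

    // Reinterprets each component as a scalar and writes only its first lane.
    public static void WriteFirst(in Vector3 source, ref WideVec3 target)
    {
        Unsafe.As<Vector<float>, float>(ref target.X) = source.X;
        Unsafe.As<Vector<float>, float>(ref target.Y) = source.Y;
        Unsafe.As<Vector<float>, float>(ref target.Z) = source.Z;
    }

    // Reinterprets each component as a scalar and reads only its first lane.
    public static Vector3 ReadFirst(ref WideVec3 source)
    {
        return new Vector3(
            Unsafe.As<Vector<float>, float>(ref source.X),
            Unsafe.As<Vector<float>, float>(ref source.Y),
            Unsafe.As<Vector<float>, float>(ref source.Z));
    }
}

public static class LaneAccessSketch
{
    // Returns a reference shifted forward by laneIndex floats, so that the "first lane"
    // of the returned reference is lane laneIndex of the original bundle.
    // Only Count - laneIndex lanes of the result are backed by the original bundle's memory.
    public static ref WideVec3 GetOffsetInstance(ref WideVec3 bundle, int laneIndex)
    {
        return ref Unsafe.As<float, WideVec3>(
            ref Unsafe.Add(ref Unsafe.As<WideVec3, float>(ref bundle), laneIndex));
    }

    // Writes value into one lane of storage[0], then reads that lane back.
    public static Vector3 WriteAndReadLane(WideVec3[] storage, int laneIndex, Vector3 value)
    {
        ref WideVec3 slot = ref GetOffsetInstance(ref storage[0], laneIndex);
        // WriteFirst/ReadFirst touch one float per component, so they stay inside the bundle.
        // Calling Broadcast through 'slot' instead could write past the end of storage[0].
        WideVec3.WriteFirst(value, ref slot);
        return WideVec3.ReadFirst(ref slot);
    }
}
```

The sketch relies on the struct's three fields being laid out contiguously in declaration order, which is the same kind of layout assumption the offset trick depends on.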

Broadcast is only used in cases where the same value is desired in every lane. That shows up often when dealing with multiple things (like rays) being tested against single things (like tree node bounding boxes).
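The "one thing against many things" pattern can be illustrated with a much simpler case than rays and bounding boxes: testing a single scalar against N intervals at once. This is a simplified, hypothetical example (`BroadcastTestSketch` and its methods are not part of the library), using only System.Numerics comparison helpers.

```csharp
using System.Numerics;

public static class BroadcastTestSketch
{
    // Tests one scalar against N intervals at once.
    // Lane i of the result is -1 (all bits set) if min[i] <= value <= max[i], otherwise 0.
    public static Vector<int> ContainsMask(float value, Vector<float> min, Vector<float> max)
    {
        // Broadcast: every lane of wideValue holds the same scalar.
        var wideValue = new Vector<float>(value);
        return Vector.BitwiseAnd(
            Vector.GreaterThanOrEqual(wideValue, min),
            Vector.LessThanOrEqual(wideValue, max));
    }

    // Counts how many of the first Vector<float>.Count intervals contain the value.
    // Both arrays must hold at least Vector<float>.Count elements.
    public static int CountContaining(float value, float[] mins, float[] maxs)
    {
        var mask = ContainsMask(value, new Vector<float>(mins), new Vector<float>(maxs));
        int count = 0;
        for (int i = 0; i < Vector<float>.Count; ++i)
        {
            if (mask[i] != 0)
                ++count;
        }
        return count;
    }
}
```

A real ray versus bounding box test would broadcast the ray's origin and direction in the same way and compare them against wide min/max bounds per axis.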

It's worth mentioning that all of this is a bit of a hacky workaround that makes assumptions about how the JIT handles enregistration. It's fragile and could break if the compiler changes significantly. A safer/faster solution would be to use explicit gathers and scatters as exposed by the platform intrinsics. Even though those don't tend to be remarkably fast (they're still bound by the fundamentals of memory access), they can generate better code than the GetOffsetInstance hack. Unfortunately, those intrinsics weren't available when I wrote everything, and they're still a ways off from completion last I checked.

When they get further along (and I get some time), I'll probably revisit all the gather/scatter cases, among other things.

Post by **Nockawa** » Mon Aug 20, 2018 8:45 am

Thanks for the information, that is very valuable!

In case you're not aware, you may want to follow this guy: https://github.com/fiigii. He's doing a lot of .NET Core SIMD code.
He's currently working on the Gather intrinsic: https://github.com/dotnet/coreclr/pull/19392

As you said, the design challenge is to organize the data in a totally different way (by columns, not by rows) in order to take advantage of the SIMD unit's full power. I remember fiigii talking about this just before .NET Core 2.1 was released. I believe there's work in progress to address this kind of thing for the next minor release of .NET Core.

Post by **Norbo** » Mon Aug 20, 2018 10:08 pm

> In case you're not aware, you may want to follow this guy: https://github.com/fiigii. He's doing a lot of .NET Core SIMD code.
> He's currently working on the Gather intrinsic: https://github.com/dotnet/coreclr/pull/19392

I've been glancing now and again at his work and the related efforts, but I should indeed probably pay a bit closer attention as things move along.
I suppose even 3.0 isn't too far away now- should be fun to play with.
