Vector3 represents a single 3d vector, while Vector3Wide represents N different 3d vectors stored in a bundle. In the current implementation based on Vector<T>, N is Vector<float>.Count.
So, anywhere you see a Vector*Wide, whatever operation it's doing is actually applied in parallel across a batch of multiple lanes. All the constraints and most collision detection routines operate on N different constraints/collision pairs at a time.
The reason for this is performance. All modern CPU architectures are able to execute multiple operations in a single cycle, both by having multiple execution ports for certain instructions and, critically, by having Single Instructions that work on Multiple pieces of Data (SIMD). So, rather than just adding two numbers together, you can add two sets of numbers together.
System.Numerics.Vector* types are recognized by the JIT, which emits such SIMD instructions for them, but they are limited to the size of the vector in question. Adding two Vector3s together only gets you 3 simultaneous adds in one instruction. In contrast, pretty much all new processors can do 8-wide operations on 32-bit floating point numbers, and some can do 16-wide operations (though I don't think CoreCLR exposes AVX512 yet).
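If you want to see what width your machine and runtime actually give you, something along these lines should work. It's just a sketch: SimdWidthSketch and AddLanes are names I made up for illustration, while Vector<float>.Count, Vector.IsHardwareAccelerated, the array constructor, CopyTo and the + operator are the standard System.Numerics API.

```csharp
using System;
using System.Numerics;

public static class SimdWidthSketch
{
    // Hypothetical helper: adds the first Vector<float>.Count elements of a and b
    // with a single Vector<float> add. Both arrays must hold at least that many elements.
    public static float[] AddLanes(float[] a, float[] b)
    {
        var va = new Vector<float>(a);   // loads the first Count elements of a
        var vb = new Vector<float>(b);   // loads the first Count elements of b
        var sum = va + vb;               // one add across every lane
        var result = new float[Vector<float>.Count];
        sum.CopyTo(result);
        return result;
    }

    public static void Main()
    {
        // N depends on the hardware and runtime, so print it rather than assume it.
        Console.WriteLine($"Hardware accelerated: {Vector.IsHardwareAccelerated}");
        Console.WriteLine($"Lanes per Vector<float>: {Vector<float>.Count}");

        int n = Vector<float>.Count;
        var a = new float[n];
        var b = new float[n];
        for (int i = 0; i < n; ++i)
        {
            a[i] = i;
            b[i] = 10 * i;
        }
        Console.WriteLine(string.Join(", ", AddLanes(a, b)));
    }
}
```

The lane count is whatever Vector<float>.Count reports at runtime, so code written against it shouldn't hardcode 4 or 8.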
Since Vector3Wide conceptually represents a number of Vector3 instances equal to the machine/runtime SIMD width, you can make full use of the available computational throughput. In other words, representing each operation as a line:
8 Vector3 adds:
[rx0, ry0, rz0] = [x0, y0, z0] + [x0, y0, z0]
[rx1, ry1, rz1] = [x1, y1, z1] + [x1, y1, z1]
[rx2, ry2, rz2] = [x2, y2, z2] + [x2, y2, z2]
[rx3, ry3, rz3] = [x3, y3, z3] + [x3, y3, z3]
[rx4, ry4, rz4] = [x4, y4, z4] + [x4, y4, z4]
[rx5, ry5, rz5] = [x5, y5, z5] + [x5, y5, z5]
[rx6, ry6, rz6] = [x6, y6, z6] + [x6, y6, z6]
[rx7, ry7, rz7] = [x7, y7, z7] + [x7, y7, z7]
and now the same result, but with Vector3Wide.Add:
[rx0, rx1, rx2, rx3, rx4, rx5, rx6, rx7] = [x0, x1, x2, x3, x4, x5, x6, x7] + [x0, x1, x2, x3, x4, x5, x6, x7]
[ry0, ry1, ry2, ry3, ry4, ry5, ry6, ry7] = [y0, y1, y2, y3, y4, y5, y6, y7] + [y0, y1, y2, y3, y4, y5, y6, y7]
[rz0, rz1, rz2, rz3, rz4, rz5, rz6, rz7] = [z0, z1, z2, z3, z4, z5, z6, z7] + [z0, z1, z2, z3, z4, z5, z6, z7]
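In C#, the layout above could be sketched roughly like this. To be clear, this is not the library's actual Vector3Wide: Vector3WideSketch, FromLanes and GetLane are hypothetical names used only to show the idea of one Vector<float> per component, with each lane holding a different Vector3.

```csharp
using System;
using System.Numerics;

// Hypothetical stand-in for a wide vector: one Vector<float> per component.
// Lane i of X, Y and Z together make up the i-th Vector3 in the bundle.
public struct Vector3WideSketch
{
    public Vector<float> X;
    public Vector<float> Y;
    public Vector<float> Z;

    // Three wide adds cover Vector<float>.Count Vector3 additions.
    public static void Add(in Vector3WideSketch a, in Vector3WideSketch b, out Vector3WideSketch result)
    {
        result.X = a.X + b.X;
        result.Y = a.Y + b.Y;
        result.Z = a.Z + b.Z;
    }

    // Transposes up to Vector<float>.Count Vector3 values into the wide layout.
    // Any lanes beyond source.Length are left at zero.
    public static Vector3WideSketch FromLanes(Vector3[] source)
    {
        int n = Vector<float>.Count;
        var x = new float[n];
        var y = new float[n];
        var z = new float[n];
        for (int i = 0; i < n && i < source.Length; ++i)
        {
            x[i] = source[i].X;
            y[i] = source[i].Y;
            z[i] = source[i].Z;
        }
        return new Vector3WideSketch
        {
            X = new Vector<float>(x),
            Y = new Vector<float>(y),
            Z = new Vector<float>(z)
        };
    }

    // Reads a single lane back out as an ordinary Vector3.
    public Vector3 GetLane(int i) => new Vector3(X[i], Y[i], Z[i]);
}

public static class Vector3WideSketchDemo
{
    public static void Main()
    {
        int n = Vector<float>.Count;
        var a = new Vector3[n];
        var b = new Vector3[n];
        for (int i = 0; i < n; ++i)
        {
            a[i] = new Vector3(i, 2 * i, 3 * i);
            b[i] = new Vector3(1, 1, 1);
        }

        var wideA = Vector3WideSketch.FromLanes(a);
        var wideB = Vector3WideSketch.FromLanes(b);
        Vector3WideSketch.Add(wideA, wideB, out var sum);

        // Each lane should match the scalar Vector3 add for the same index.
        for (int i = 0; i < n; ++i)
        {
            Console.WriteLine($"lane {i}: {sum.GetLane(i)} (scalar: {a[i] + b[i]})");
        }
    }
}
```

The transposition in FromLanes is only there to make the demo readable. The benefit comes from keeping data in the wide layout across many operations, so you don't pay to gather and scatter around every add.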