Thank you both so much for your help with this, I very much appreciate it!
@test44x:
I did see your inlining comment, but I think there's more going on:
(a.M11.rawVal * b.M11.rawVal) >> Fast64.Q
This should more or less turn into a single mul machine instruction and a shift, but aren't you losing half the range of your fixed-point value to potential overflow in the multiplication? The 64-bit multiply keeps only the low 64 bits of the product, so anything that overflows is gone before the shift happens. That is why the multiplication method in FixedMath.Net is such a convoluted thing.
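To make the overflow concern concrete, here is a minimal sketch of the two approaches. It assumes a Q31.32 layout (Q = 32) with the raw value stored in a long. The class and method names are made up for illustration, and this is not the actual FixedMath.Net source; it only follows the same split-into-halves pattern that shows up in the disassembly in this post (mask, sar, four imuls, shr, shl, adds).

```csharp
using System;

// Hypothetical helper, for illustration only.
static class FixedMulSketch
{
    // Assumed number of fractional bits (Q31.32).
    const int Q = 32;

    // Naive version: one multiply and one shift.
    // The 64-bit product wraps around before the shift can rescale it.
    public static long NaiveMul(long a, long b)
    {
        return unchecked(a * b) >> Q;
    }

    // Split version: multiply the 32-bit halves separately so that no
    // partial product overflows, then recombine them at the right scale.
    public static long SplitMul(long a, long b)
    {
        ulong alo = (ulong)(a & 0xFFFFFFFFL); // low 32 bits, unsigned
        long ahi = a >> Q;                    // high 32 bits, sign-extended
        ulong blo = (ulong)(b & 0xFFFFFFFFL);
        long bhi = b >> Q;

        ulong lolo = alo * blo;               // fraction * fraction
        long lohi = (long)alo * bhi;          // fraction * integer
        long hilo = ahi * (long)blo;          // integer * fraction
        long hihi = ahi * bhi;                // integer * integer

        // lolo carries 64 fractional bits, so drop the lowest 32;
        // hihi carries none, so shift it up by 32.
        return unchecked((long)(lolo >> Q) + lohi + hilo + (hihi << Q));
    }
}
```

With these definitions, multiplying 3.0 by 2.0 (raw values 3L << 32 and 2L << 32) gives 0 from NaiveMul, because the true product is 6 * 2^64 and none of it survives in the low 64 bits, while SplitMul gives 6L << 32 as expected. The sketch leaves out any overflow or saturation handling for results that do not fit in the final 64 bits.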
I'll try your suggestion of starting with Vector3.Dot and Norbo's of using full ref-mode.
EDIT: Does not seem to make any difference. (Sorry, wall of code incoming)
New code:
public static void Dot(ref Vector3 a, ref Vector3 b, out Fix64 product)
{
    Fix64 result = Fix64.Zero;
    Fix64 temp = new Fix64();
    Fix64.mul(ref a.X, ref b.X, ref result);
    Fix64.mul(ref a.Y, ref b.Y, ref temp);
    Fix64.add(ref result, ref temp, ref result);
    Fix64.mul(ref a.Z, ref b.Z, ref temp);
    Fix64.add(ref result, ref temp, ref result);
    product = result;
}
Resulting assembly:
Fix64 result = Fix64.Zero;
000007FE76702D10 push rdi
000007FE76702D11 push rsi
Fix64.mul(ref a.X, ref b.X, ref result);
000007FE76702D12 mov rax,qword ptr [rcx]
000007FE76702D15 mov r9,qword ptr [rdx]
000007FE76702D18 mov r10d,0FFFFFFFFh
000007FE76702D1E and r10,rax
000007FE76702D21 mov r11d,0FFFFFFFFh
000007FE76702D27 and r11,r9
000007FE76702D2A sar r9,20h
000007FE76702D2E mov rsi,r10
000007FE76702D31 imul rsi,r11
000007FE76702D35 imul r10,r9
000007FE76702D39 sar rax,20h
000007FE76702D3D imul r11,rax
000007FE76702D41 shr rsi,20h
000007FE76702D45 imul rax,r9
000007FE76702D49 shl rax,20h
000007FE76702D4D add rsi,r10
000007FE76702D50 add rsi,r11
000007FE76702D53 add rax,rsi
Fix64.mul(ref a.Y, ref b.Y, ref temp);
000007FE76702D56 cmp dword ptr [rcx],ecx
000007FE76702D58 lea r9,[rcx+8]
000007FE76702D5C cmp dword ptr [rdx],edx
000007FE76702D5E lea r10,[rdx+8]
000007FE76702D62 mov r9,qword ptr [r9]
000007FE76702D65 mov r10,qword ptr [r10]
000007FE76702D68 mov r11d,0FFFFFFFFh
000007FE76702D6E and r11,r9
000007FE76702D71 mov esi,0FFFFFFFFh
000007FE76702D76 and rsi,r10
000007FE76702D79 sar r10,20h
000007FE76702D7D mov rdi,r11
000007FE76702D80 imul rdi,rsi
000007FE76702D84 imul r11,r10
000007FE76702D88 sar r9,20h
000007FE76702D8C imul rsi,r9
000007FE76702D90 shr rdi,20h
000007FE76702D94 imul r9,r10
000007FE76702D98 shl r9,20h
000007FE76702D9C add rdi,r11
000007FE76702D9F add rdi,rsi
000007FE76702DA2 add r9,rdi
Fix64.add(ref result, ref temp, ref result);
000007FE76702DA5 add rax,r9
Fix64.mul(ref a.Z, ref b.Z, ref temp);
000007FE76702DA8 cmp dword ptr [rcx],ecx
000007FE76702DAA add rcx,10h
000007FE76702DAE cmp dword ptr [rdx],edx
000007FE76702DB0 add rdx,10h
000007FE76702DB4 mov rcx,qword ptr [rcx]
000007FE76702DB7 mov rdx,qword ptr [rdx]
000007FE76702DBA mov r9d,0FFFFFFFFh
000007FE76702DC0 and r9,rcx
000007FE76702DC3 mov r10d,0FFFFFFFFh
000007FE76702DC9 and r10,rdx
000007FE76702DCC sar rdx,20h
000007FE76702DD0 mov r11,r9
000007FE76702DD3 imul r11,r10
000007FE76702DD7 imul r9,rdx
000007FE76702DDB sar rcx,20h
000007FE76702DDF imul r10,rcx
000007FE76702DE3 shr r11,20h
000007FE76702DE7 imul rcx,rdx
000007FE76702DEB mov rdx,rcx
000007FE76702DEE shl rdx,20h
000007FE76702DF2 add r11,r9
000007FE76702DF5 add r11,r10
000007FE76702DF8 add rdx,r11
Fix64.add(ref result, ref temp, ref result);
000007FE76702DFB add rax,rdx
product = result;
000007FE76702DFE mov qword ptr [r8],rax
000007FE76702E01 pop rsi
000007FE76702E02 pop rdi
000007FE76702E03 ret
Old code:
public static void Dot(ref Vector3 a, ref Vector3 b, out Fix64 product)
{
    product = a.X * b.X + a.Y * b.Y + a.Z * b.Z;
}
Assembly:
product = a.X * b.X + a.Y * b.Y + a.Z * b.Z;
000007FE76722D10 push rdi
000007FE76722D11 push rsi
000007FE76722D12 mov rax,qword ptr [rcx]
000007FE76722D15 mov r9,qword ptr [rdx]
000007FE76722D18 mov r10d,0FFFFFFFFh
000007FE76722D1E and r10,rax
000007FE76722D21 mov r11d,0FFFFFFFFh
000007FE76722D27 and r11,r9
000007FE76722D2A sar r9,20h
000007FE76722D2E mov rsi,r10
000007FE76722D31 imul rsi,r11
000007FE76722D35 imul r10,r9
000007FE76722D39 sar rax,20h
000007FE76722D3D imul r11,rax
000007FE76722D41 shr rsi,20h
000007FE76722D45 imul rax,r9
000007FE76722D49 shl rax,20h
000007FE76722D4D add rsi,r10
000007FE76722D50 add rsi,r11
000007FE76722D53 add rax,rsi
000007FE76722D56 mov r9,qword ptr [rcx+8]
000007FE76722D5A mov r10,qword ptr [rdx+8]
000007FE76722D5E mov r11d,0FFFFFFFFh
000007FE76722D64 and r11,r9
000007FE76722D67 mov esi,0FFFFFFFFh
000007FE76722D6C and rsi,r10
000007FE76722D6F sar r10,20h
000007FE76722D73 mov rdi,r11
000007FE76722D76 imul rdi,rsi
000007FE76722D7A imul r11,r10
000007FE76722D7E sar r9,20h
000007FE76722D82 imul rsi,r9
000007FE76722D86 shr rdi,20h
000007FE76722D8A imul r9,r10
000007FE76722D8E shl r9,20h
000007FE76722D92 add rdi,r11
000007FE76722D95 add rdi,rsi
000007FE76722D98 add r9,rdi
000007FE76722D9B add rax,r9
000007FE76722D9E mov rcx,qword ptr [rcx+10h]
000007FE76722DA2 mov rdx,qword ptr [rdx+10h]
000007FE76722DA6 mov r9d,0FFFFFFFFh
000007FE76722DAC and r9,rcx
000007FE76722DAF mov r10d,0FFFFFFFFh
000007FE76722DB5 and r10,rdx
000007FE76722DB8 sar rdx,20h
000007FE76722DBC mov r11,r9
000007FE76722DBF imul r11,r10
000007FE76722DC3 imul r9,rdx
000007FE76722DC7 sar rcx,20h
000007FE76722DCB imul r10,rcx
000007FE76722DCF shr r11,20h
000007FE76722DD3 imul rcx,rdx
000007FE76722DD7 mov rdx,rcx
000007FE76722DDA shl rdx,20h
000007FE76722DDE add r11,r9
000007FE76722DE1 add r11,r10
000007FE76722DE4 add rdx,r11
000007FE76722DE7 add rax,rdx
000007FE76722DEA mov qword ptr [r8],rax
000007FE76722DED pop rsi
000007FE76722DEE pop rdi
000007FE76722DEF ret
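As far as I can tell, the two assembly listings are about 95% identical. The only difference is in how the .Y and .Z values are loaded: the ref version emits an extra cmp plus an lea/add before each load, where the operator version uses a single mov with an offset. So I don't think the issue lies with inlining or method call overhead, but simply with the complexity of the multiplication method itself. Or am I missing something?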