Canonical Stereo Code
Prime Directive
This is the prime directive code, in ASM for DX9 and HelixMod, and in HLSL for 3Dmigoto. In all cases, we are implementing the formula specified by NVIDIA:
clipPos.x += EyeSign * Separation * ( clipPos.w - Convergence )
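As a quick sanity check of the formula, here is a minimal CPU-side sketch in Python (not shader code). The function name and the separation and convergence values are hypothetical, chosen only for illustration; in a real fix those values come from the driver.

```python
def stereo_correct_x(clip_x, clip_w, separation, convergence, eye_sign):
    """Return clip-space x after applying the stereo shift from the formula."""
    return clip_x + eye_sign * separation * (clip_w - convergence)


# Hypothetical values, for illustration only.
separation = 0.05
convergence = 2.0

for w in (1.0, 2.0, 10.0):
    left = stereo_correct_x(0.0, w, separation, convergence, -1.0)
    right = stereo_correct_x(0.0, w, separation, convergence, +1.0)
    print(f"w={w:5.1f}  left={left:+.3f}  right={right:+.3f}")
```

A vertex whose w equals the convergence value is not shifted at all, and the two eyes are always shifted by equal and opposite amounts.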
ASM (both HelixMod and 3Dmigoto ASM)
This is the wordy version with extra comments to make it more clear what is happening.
    ...
    // The required constant for texldl in c200.z
    def c200, 1.0, 600, 0.0625, 0

    // Sampler used to fetch stereo params,
    // s0 sampler is default for VS
    dcl_2d s0
    ...
    // At this point r0 is the output position, correctly
    // placed, but without stereo.

    // To create stereo effects, we need to calculate:
    //   Xnew = Xold + Separation * (W - Convergence)

    // Fetch the Separation (r30.x) and Convergence (r30.y)
    // using the Helix NVapi trick
    texldl r30, c200.z, s0

    // (W - Convergence)
    add r30.w, r0.w, -r30.y

    // multiply that times Separation for:
    //   Separation * (W - Convergence)
    mul r30.z, r30.x, r30.w

    // Add that to Xold for the complete:
    //   Xold + Separation * (W - Convergence)
    add r0.x, r0.x, r30.z
Another variant is more concise, but less clear. As you get more accustomed to seeing this sequence of code, or are sharing with an expert crowd, it's less necessary to fully document this part: it is always the same sequence, and it is easy to recognize because of the texldl using the 0.0625 constant.
    ...
    // Stereo correction constants
    def c200, 1.0, 600, 0.0625, 0
    dcl_2d s0
    ...
    // At this point r0 is the output position, correctly
    // placed, but without stereo.

    // To create stereo effects, we need to calculate:
    //   Xnew = Xold + Separation * (W - Convergence)
    texldl r30, c200.z, s0
    add r30.w, r0.w, -r30.y
    mad r0.x, r30.x, r30.w, r0.x
Another very common variant you'll see in HelixMod fixes is the four-line version, with no comments. It uses the same instructions as the wordy version, so it is one instruction longer than the mad variant, but it is worth seeing to be able to recognize it.
    ...
    // Stereo correction constants
    def c200, 1.0, 600, 0.0625, 0
    dcl_2d s0
    ...
    // At this point r0 is the output position, correctly
    // placed, but without stereo.
    texldl r30, c200.z, s0
    add r30.w, r0.w, -r30.y
    mul r30.z, r30.x, r30.w
    add r0.x, r0.x, r30.z
HLSL (3Dmigoto only)
The HLSL version is very similar, but since it's a compiled language there is no need to be terse when writing the code. We can make it "self-documenting" by using good variable names.
    ...
    // .Load should only be done once, but can be done anywhere in the code.
    float4 stereo = StereoParams.Load(0);
    float separation = stereo.x;
    float convergence = stereo.y;

    // At this point, "location" is the output position, but missing stereo.
    location.x += separation * (location.w - convergence);
Matrix Inversion
We often need to invert a ViewProjectionMatrix, in order to be able to stereo correct a location in the proper viewspace (usually projection space for deferred rendering).
We only need to do this in cases where the inverted matrix is not available. A lot of games have both matrices available, and they can be used directly.
We have code samples for both ASM and HLSL. Note that for HelixMod fixes in DX9, HelixMod already supports inverted matrices directly, so this code should not be used there, because all this extra code will impact performance. Try to avoid using it in a pixel shader in particular, because inverting a matrix for every pixel is very costly.
HLSL (3Dmigoto only)
The best way to invert a matrix is to use the runtime shader feature from d3dx.ini. This is best because it runs once per frame, which keeps the performance impact low. It was created by DarkStarSword.
It is described here. The actual shader is found here. How to use it is described here.
If that doesn't work for some reason (not all games are compatible), it's possible to do it in HLSL directly. This example is backwards in that it starts with an inverted matrix, then creates the forward matrix from that. It is used in an example here. Again, it's worth noting that doing this in a pixel shader is sub-optimal.
The original version of this came from Mike_ar69.
    ...
    matrix ivp, vp;
    ivp = matrix(InstanceConsts[1], InstanceConsts[2], InstanceConsts[3], InstanceConsts[4]);

    // Work out the view-projection matrix from its inverse:
    vp[0].x = ivp[1].y*(ivp[2].z*ivp[3].w - ivp[2].w*ivp[3].z)
            + ivp[1].z*(ivp[2].w*ivp[3].y - ivp[2].y*ivp[3].w)
            + ivp[1].w*(ivp[2].y*ivp[3].z - ivp[2].z*ivp[3].y);
    vp[0].y = ivp[0].y*(ivp[2].w*ivp[3].z - ivp[2].z*ivp[3].w)
            + ivp[0].z*(ivp[2].y*ivp[3].w - ivp[2].w*ivp[3].y)
            + ivp[0].w*(ivp[2].z*ivp[3].y - ivp[2].y*ivp[3].z);
    vp[0].z = ivp[0].y*(ivp[1].z*ivp[3].w - ivp[1].w*ivp[3].z)
            + ivp[0].z*(ivp[1].w*ivp[3].y - ivp[1].y*ivp[3].w)
            + ivp[0].w*(ivp[1].y*ivp[3].z - ivp[1].z*ivp[3].y);
    vp[0].w = ivp[0].y*(ivp[1].w*ivp[2].z - ivp[1].z*ivp[2].w)
            + ivp[0].z*(ivp[1].y*ivp[2].w - ivp[1].w*ivp[2].y)
            + ivp[0].w*(ivp[1].z*ivp[2].y - ivp[1].y*ivp[2].z);
    vp[1].x = ivp[1].x*(ivp[2].w*ivp[3].z - ivp[2].z*ivp[3].w)
            + ivp[1].z*(ivp[2].x*ivp[3].w - ivp[2].w*ivp[3].x)
            + ivp[1].w*(ivp[2].z*ivp[3].x - ivp[2].x*ivp[3].z);
    vp[1].y = ivp[0].x*(ivp[2].z*ivp[3].w - ivp[2].w*ivp[3].z)
            + ivp[0].z*(ivp[2].w*ivp[3].x - ivp[2].x*ivp[3].w)
            + ivp[0].w*(ivp[2].x*ivp[3].z - ivp[2].z*ivp[3].x);
    vp[1].z = ivp[0].x*(ivp[1].w*ivp[3].z - ivp[1].z*ivp[3].w)
            + ivp[0].z*(ivp[1].x*ivp[3].w - ivp[1].w*ivp[3].x)
            + ivp[0].w*(ivp[1].z*ivp[3].x - ivp[1].x*ivp[3].z);
    vp[1].w = ivp[0].x*(ivp[1].z*ivp[2].w - ivp[1].w*ivp[2].z)
            + ivp[0].z*(ivp[1].w*ivp[2].x - ivp[1].x*ivp[2].w)
            + ivp[0].w*(ivp[1].x*ivp[2].z - ivp[1].z*ivp[2].x);
    vp[2].x = ivp[1].x*(ivp[2].y*ivp[3].w - ivp[2].w*ivp[3].y)
            + ivp[1].y*(ivp[2].w*ivp[3].x - ivp[2].x*ivp[3].w)
            + ivp[1].w*(ivp[2].x*ivp[3].y - ivp[2].y*ivp[3].x);
    vp[2].y = ivp[0].x*(ivp[2].w*ivp[3].y - ivp[2].y*ivp[3].w)
            + ivp[0].y*(ivp[2].x*ivp[3].w - ivp[2].w*ivp[3].x)
            + ivp[0].w*(ivp[2].y*ivp[3].x - ivp[2].x*ivp[3].y);
    vp[2].z = ivp[0].x*(ivp[1].y*ivp[3].w - ivp[1].w*ivp[3].y)
            + ivp[0].y*(ivp[1].w*ivp[3].x - ivp[1].x*ivp[3].w)
            + ivp[0].w*(ivp[1].x*ivp[3].y - ivp[1].y*ivp[3].x);
    vp[2].w = ivp[0].x*(ivp[1].w*ivp[2].y - ivp[1].y*ivp[2].w)
            + ivp[0].y*(ivp[1].x*ivp[2].w - ivp[1].w*ivp[2].x)
            + ivp[0].w*(ivp[1].y*ivp[2].x - ivp[1].x*ivp[2].y);
    vp[3].x = ivp[1].x*(ivp[2].z*ivp[3].y - ivp[2].y*ivp[3].z)
            + ivp[1].y*(ivp[2].x*ivp[3].z - ivp[2].z*ivp[3].x)
            + ivp[1].z*(ivp[2].y*ivp[3].x - ivp[2].x*ivp[3].y);
    vp[3].y = ivp[0].x*(ivp[2].y*ivp[3].z - ivp[2].z*ivp[3].y)
            + ivp[0].y*(ivp[2].z*ivp[3].x - ivp[2].x*ivp[3].z)
            + ivp[0].z*(ivp[2].x*ivp[3].y - ivp[2].y*ivp[3].x);
    vp[3].z = ivp[0].x*(ivp[1].z*ivp[3].y - ivp[1].y*ivp[3].z)
            + ivp[0].y*(ivp[1].x*ivp[3].z - ivp[1].z*ivp[3].x)
            + ivp[0].z*(ivp[1].y*ivp[3].x - ivp[1].x*ivp[3].y);
    vp[3].w = ivp[0].x*(ivp[1].y*ivp[2].z - ivp[1].z*ivp[2].y)
            + ivp[0].y*(ivp[1].z*ivp[2].x - ivp[1].x*ivp[2].z)
            + ivp[0].z*(ivp[1].x*ivp[2].y - ivp[1].y*ivp[2].x);
    vp /= determinant(ivp);
    ...
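The HLSL above reads as the classic "adjugate divided by determinant" inverse, with each cofactor written out by hand. A minimal CPU-side sketch of the same technique in Python can be handy for checking a matrix dumped from a game. All function names and the sample matrix here are hypothetical and not taken from any fix.

```python
def minor(m, row, col):
    """Copy of m with the given row and column removed."""
    n = len(m)
    return [[m[r][c] for c in range(n) if c != col]
            for r in range(n) if r != row]


def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** c * m[0][c] * det(minor(m, 0, c))
               for c in range(len(m)))


def inverse_by_cofactors(m):
    """Inverse as adjugate / determinant."""
    n = len(m)
    d = det(m)
    if d == 0:
        raise ValueError("matrix is singular")
    # Element (r, c) of the adjugate is the cofactor of element (c, r).
    return [[(-1) ** (r + c) * det(minor(m, c, r)) / d for c in range(n)]
            for r in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


# Hypothetical projection-like matrix, for illustration only.
sample = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 2.0, 0.0, 0.0],
          [0.0, 0.0, -1.5, -1.0],
          [0.0, 0.0, -0.5, 0.0]]

product = matmul(sample, inverse_by_cofactors(sample))
for row in product:
    print(row)
```

Multiplying a matrix by its computed inverse should give the identity matrix, up to floating point error, which is a cheap way to confirm that a hand-written inversion is wired up correctly.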
ASM (3Dmigoto ASM)
This one hasn't been fully tested, but should work. It was created by mx-2 and can be found on his GitHub.
    ...
    //
    // inverseMatrix.asm
    //
    // Matrix inversion with Gauss-Jordan elimination
    // algorithm on GPU.
    //
    // This algorithm uses 128 instructions, from which
    // 83 (best case) to 110 (worst case) are executed.
    //
    // input matrix is in r0-r3
    // output will be in r4-r7
    // r8, r9 are used as temporary registers
    // c200 = (1,0,0,0) is required
    //
    // r0.x r0.y r0.z r0.w | r4.x, r4.y, r4.z, r4.w
    // r1.x r1.y r1.z r1.w | r5.x, r5.y, r5.z, r5.w
    // r2.x r2.y r2.z r2.w | r6.x, r6.y, r6.z, r6.w
    // r3.x r3.y r3.z r3.w | r7.x, r7.y, r7.z, r7.w
    //
    // Test:
    // SETR r0, 0, 2, 3, 4
    // SETR r1, 1, 0, 5, 4
    // SETR r2,-1, 2, 3, 4
    // SETR r3, 0, 2, 3, 5
    //
    // Should produce
    // r4 = 1.0, 0.0, -1.0, 0.0
    // r5 = 1.6, -0.3, -0.3, -0.8
    // r6 = 0.6, 0.2, 0.2, -0.8
    // r7 = -1.0, 0.0, 0.0, 1.0
    //

    // Init registers
    def c200, 1, 0, 0, 0
    mov r4, c200.xyzw
    mov r5, c200.wxyz
    mov r6, c200.zwxy
    mov r7, c200.yzwx

    // Pivot first column
    mov r8, c200
    if_eq r0.x, r8.y
      if_eq r1.x, r8.y
        if_eq r2.x, r8.y
          mov r9, r0
          mov r0, r3
          mov r3, r9
          mov r9, r4
          mov r4, r7
          mov r7, r9
        else
          mov r9, r0
          mov r0, r2
          mov r2, r9
          mov r9, r4
          mov r4, r6
          mov r6, r9
        endif
      else
        mov r9, r0
        mov r0, r1
        mov r1, r9
        mov r9, r4
        mov r4, r5
        mov r5, r9
      endif
    endif

    // First column
    rcp r8.x, r0.x
    mul r8.y, r8.x, r1.x
    mul r9, r0, r8.y
    add r1, r1, -r9
    mul r9, r4, r8.y
    add r5, r5, -r9
    mul r8.y, r8.x, r2.x
    mul r9, r0, r8.y
    add r2, r2, -r9
    mul r9, r4, r8.y
    add r6, r6, -r9
    mul r8.y, r8.x, r3.x
    mul r9, r0, r8.y
    add r3, r3, -r9
    mul r9, r4, r8.y
    add r7, r7, -r9

    // Pivot second column
    mov r8, c200
    if_eq r1.y, r8.y
      if_eq r2.y, r8.y
        mov r9, r1
        mov r1, r3
        mov r3, r9
        mov r9, r5
        mov r5, r7
        mov r7, r9
      else
        mov r9, r1
        mov r1, r2
        mov r2, r9
        mov r9, r5
        mov r5, r6
        mov r6, r9
      endif
    endif

    // Second column
    rcp r8.x, r1.y
    mul r8.y, r8.x, r2.y
    mul r9, r1, r8.y
    add r2, r2, -r9
    mul r9, r5, r8.y
    add r6, r6, -r9
    mul r8.y, r8.x, r3.y
    mul r9, r1, r8.y
    add r3, r3, -r9
    mul r9, r5, r8.y
    add r7, r7, -r9

    // Pivot third column
    mov r8, c200
    if_eq r2.z, r8.y
      mov r9, r2
      mov r2, r3
      mov r3, r9
      mov r9, r6
      mov r6, r7
      mov r7, r9
    endif

    // Third column
    rcp r8.x, r2.z
    mul r8.y, r8.x, r3.z
    mul r9, r2, r8.y
    add r3, r3, -r9
    mul r9, r6, r8.y
    add r7, r7, -r9

    // Normalize r3.w
    rcp r8.x, r3.w
    mul r3, r3, r8.x
    mul r7, r7, r8.x

    // Fourth column
    mul r8, r3, r2.w
    mul r9, r7, r2.w
    add r2, r2, -r8
    add r6, r6, -r9
    mul r8, r3, r1.w
    mul r9, r7, r1.w
    add r1, r1, -r8
    add r5, r5, -r9
    mul r8, r3, r0.w
    mul r9, r7, r0.w
    add r0, r0, -r8
    add r4, r4, -r9

    // Normalize r2.z
    rcp r8.x, r2.z
    mul r2, r2, r8.x
    mul r6, r6, r8.x

    // Third column (upper part)
    mul r8, r2, r1.z
    mul r9, r6, r1.z
    add r1, r1, -r8
    add r5, r5, -r9
    mul r8, r2, r0.z
    mul r9, r6, r0.z
    add r0, r0, -r8
    add r4, r4, -r9

    // Normalize r1.y
    rcp r8.x, r1.y
    mul r1, r1, r8.x
    mul r5, r5, r8.x

    // Second column (upper part)
    mul r8, r1, r0.y
    mul r9, r5, r0.y
    add r0, r0, -r8
    add r4, r4, -r9

    // Normalize first column
    rcp r8.x, r0.x
    mul r0, r0, r8.x
    mul r4, r4, r8.x
    ...
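Since the ASM version hasn't been fully tested, a CPU-side reference is useful for comparison. The following is a minimal Gauss-Jordan sketch in Python, run against the test matrix given in the listing's header comment. The function name is hypothetical, and the structure is simplified: it normalizes each pivot row and clears the whole column in one pass, where the ASM does forward elimination first and then works back up.

```python
def invert_gauss_jordan(m):
    """Invert a square matrix by Gauss-Jordan elimination on [A | I]."""
    n = len(m)
    # Augment a float copy of the input with the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if r == c else 0.0 for c in range(n)]
           for r, row in enumerate(m)]
    for col in range(n):
        # Pivot: find a row at or below the diagonal with a non-zero entry.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0.0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Normalize the pivot row so the diagonal entry becomes 1.
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    # The right half of the augmented matrix is now the inverse.
    return [row[n:] for row in aug]


# Test matrix from the header comment of the ASM listing.
test_matrix = [[0, 2, 3, 4],
               [1, 0, 5, 4],
               [-1, 2, 3, 4],
               [0, 2, 3, 5]]

inverse = invert_gauss_jordan(test_matrix)
for row in inverse:
    print([round(v, 4) for v in row])
```

The rows printed here correspond to r4-r7 in the ASM, so they can be compared directly against the "Should produce" values in the header comment.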
Here are two more variants, both from Helifax, in case the previous ones don't seem quite right.
HLSL
    //Work out Inverse
    //...Variables
    float4 a1, a2, a3, a4;
    float4 b1, b2, b3, b4;
    float det;

    //...Original Matrix
    a1 = g_invViewProjMatrix._m00_m10_m20_m30;
    a2 = g_invViewProjMatrix._m01_m11_m21_m31;
    a3 = g_invViewProjMatrix._m02_m12_m22_m32;
    a4 = g_invViewProjMatrix._m03_m13_m23_m33;

    //...Determinant
    det  = a1.x*(a2.y*(a3.z*a4.w - a3.w*a4.z) + a2.z*(a3.w*a4.y - a3.y*a4.w) + a2.w*(a3.y*a4.z - a3.z*a4.y));
    det += a1.y*(a2.x*(a3.w*a4.z - a3.z*a4.w) + a2.z*(a3.x*a4.w - a3.w*a4.x) + a2.w*(a3.z*a4.x - a3.x*a4.z));
    det += a1.z*(a2.x*(a3.y*a4.w - a3.w*a4.y) + a2.y*(a3.w*a4.x - a3.x*a4.w) + a2.w*(a3.x*a4.y - a3.y*a4.x));
    det += a1.w*(a2.x*(a3.z*a4.y - a3.y*a4.z) + a2.y*(a3.x*a4.z - a3.z*a4.x) + a2.z*(a3.y*a4.x - a3.x*a4.y));

    //...Inverse Matrix Elements
    b1.x = a2.y*(a3.z*a4.w - a3.w*a4.z) + a2.z*(a3.w*a4.y - a3.y*a4.w) + a2.w*(a3.y*a4.z - a3.z*a4.y);
    b1.y = a1.y*(a3.w*a4.z - a3.z*a4.w) + a1.z*(a3.y*a4.w - a3.w*a4.y) + a1.w*(a3.z*a4.y - a3.y*a4.z);
    b1.z = a1.y*(a2.z*a4.w - a2.w*a4.z) + a1.z*(a2.w*a4.y - a2.y*a4.w) + a1.w*(a2.y*a4.z - a2.z*a4.y);
    b1.w = a1.y*(a2.w*a3.z - a2.z*a3.w) + a1.z*(a2.y*a3.w - a2.w*a3.y) + a1.w*(a2.z*a3.y - a2.y*a3.z);

    b2.x = a2.x*(a3.w*a4.z - a3.z*a4.w) + a2.z*(a3.x*a4.w - a3.w*a4.x) + a2.w*(a3.z*a4.x - a3.x*a4.z);
    b2.y = a1.x*(a3.z*a4.w - a3.w*a4.z) + a1.z*(a3.w*a4.x - a3.x*a4.w) + a1.w*(a3.x*a4.z - a3.z*a4.x);
    b2.z = a1.x*(a2.w*a4.z - a2.z*a4.w) + a1.z*(a2.x*a4.w - a2.w*a4.x) + a1.w*(a2.z*a4.x - a2.x*a4.z);
    b2.w = a1.x*(a2.z*a3.w - a2.w*a3.z) + a1.z*(a2.w*a3.x - a2.x*a3.w) + a1.w*(a2.x*a3.z - a2.z*a3.x);

    b3.x = a2.x*(a3.y*a4.w - a3.w*a4.y) + a2.y*(a3.w*a4.x - a3.x*a4.w) + a2.w*(a3.x*a4.y - a3.y*a4.x);
    b3.y = a1.x*(a3.w*a4.y - a3.y*a4.w) + a1.y*(a3.x*a4.w - a3.w*a4.x) + a1.w*(a3.y*a4.x - a3.x*a4.y);
    b3.z = a1.x*(a2.y*a4.w - a2.w*a4.y) + a1.y*(a2.w*a4.x - a2.x*a4.w) + a1.w*(a2.x*a4.y - a2.y*a4.x);
    b3.w = a1.x*(a2.w*a3.y - a2.y*a3.w) + a1.y*(a2.x*a3.w - a2.w*a3.x) + a1.w*(a2.y*a3.x - a2.x*a3.y);

    b4.x = a2.x*(a3.z*a4.y - a3.y*a4.z) + a2.y*(a3.x*a4.z - a3.z*a4.x) + a2.z*(a3.y*a4.x - a3.x*a4.y);
    b4.y = a1.x*(a3.y*a4.z - a3.z*a4.y) + a1.y*(a3.z*a4.x - a3.x*a4.z) + a1.z*(a3.x*a4.y - a3.y*a4.x);
    b4.z = a1.x*(a2.z*a4.y - a2.y*a4.z) + a1.y*(a2.x*a4.z - a2.z*a4.x) + a1.z*(a2.y*a4.x - a2.x*a4.y);
    b4.w = a1.x*(a2.y*a3.z - a2.z*a3.y) + a1.y*(a2.z*a3.x - a2.x*a3.z) + a1.z*(a2.x*a3.y - a2.y*a3.x);

    b1.xyzw /= det;
    b2.xyzw /= det;
    b3.xyzw /= det;
    b4.xyzw /= det;
    //End Inverse
ASM
Generated from the HLSL, using fxc.
    // Declare how many registers we use
    // The code uses registers from r38 to r53.
    dcl_temps 60

    // 3DMigoto StereoParams:
    dcl_resource_texture1d (float,float,float,float) t120
    dcl_resource_texture2d (float,float,float,float) t125
    ld_indexable(texture1d)(float,float,float,float) r41.xyzw, l(0, 0, 0, 0), t120.xyzw
    ld_indexable(texture2d)(float,float,float,float) r40.xyzw, l(0, 0, 0, 0), t125.xyzw

    // Inverse
    // cb0[0], etc is the inverseMatrix
    mov r0.xyzw, cb0[0].xyzw
    mov r1.xyzw, cb0[1].xyzw
    mov r2.xyzw, cb0[2].xyzw
    mov r3.xyzw, cb0[3].xyzw

    // Determinant, accumulated in r4.x
    mul r4.x, r2.z, r3.w
    mul r4.y, r2.w, r3.z
    mov r4.y, -r4.y
    add r4.x, r4.y, r4.x
    mul r4.x, r1.y, r4.x
    mul r4.y, r2.w, r3.y
    mul r4.z, r2.y, r3.w
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r1.z, r4.y
    add r4.x, r4.y, r4.x
    mul r4.y, r2.y, r3.z
    mul r4.z, r2.z, r3.y
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r1.w, r4.y
    add r4.x, r4.y, r4.x
    mul r4.x, r0.x, r4.x
    mul r4.y, r2.w, r3.z
    mul r4.z, r2.z, r3.w
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r1.x, r4.y
    mul r4.z, r2.x, r3.w
    mul r4.w, r2.w, r3.x
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r1.z, r4.z
    add r4.y, r4.z, r4.y
    mul r4.z, r2.z, r3.x
    mul r4.w, r2.x, r3.z
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r1.w, r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r0.y, r4.y
    add r4.x, r4.y, r4.x
    mul r4.y, r2.y, r3.w
    mul r4.z, r2.w, r3.y
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r1.x, r4.y
    mul r4.z, r2.w, r3.x
    mul r4.w, r2.x, r3.w
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r1.y, r4.z
    add r4.y, r4.z, r4.y
    mul r4.z, r2.x, r3.y
    mul r4.w, r2.y, r3.x
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r1.w, r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r0.z, r4.y
    add r4.x, r4.y, r4.x
    mul r4.y, r2.z, r3.y
    mul r4.z, r2.y, r3.z
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r1.x, r4.y
    mul r4.z, r2.x, r3.z
    mul r4.w, r2.z, r3.x
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r1.y, r4.z
    add r4.y, r4.z, r4.y
    mul r4.z, r2.y, r3.x
    mul r4.w, r2.x, r3.y
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r1.z, r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r0.w, r4.y
    add r4.x, r4.y, r4.x

    // b1 of the HLSL version (before the divide) -> r5
    mul r4.y, r2.z, r3.w
    mul r4.z, r2.w, r3.z
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r1.y, r4.y
    mul r4.z, r2.w, r3.y
    mul r4.w, r2.y, r3.w
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r1.z, r4.z
    add r4.y, r4.z, r4.y
    mul r4.z, r2.y, r3.z
    mul r4.w, r2.z, r3.y
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r1.w, r4.z
    add r5.x, r4.z, r4.y
    mul r4.y, r2.w, r3.z
    mul r4.z, r2.z, r3.w
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r0.y, r4.y
    mul r4.z, r2.y, r3.w
    mul r4.w, r2.w, r3.y
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.z, r4.z
    add r4.y, r4.z, r4.y
    mul r4.z, r2.z, r3.y
    mul r4.w, r2.y, r3.z
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.w, r4.z
    add r5.y, r4.z, r4.y
    mul r4.y, r1.z, r3.w
    mul r4.z, r1.w, r3.z
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r0.y, r4.y
    mul r4.z, r1.w, r3.y
    mul r4.w, r1.y, r3.w
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.z, r4.z
    add r4.y, r4.z, r4.y
    mul r4.z, r1.y, r3.z
    mul r4.w, r1.z, r3.y
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.w, r4.z
    add r5.z, r4.z, r4.y
    mul r4.y, r1.w, r2.z
    mul r4.z, r1.z, r2.w
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r0.y, r4.y
    mul r4.z, r1.y, r2.w
    mul r4.w, r1.w, r2.y
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.z, r4.z
    add r4.y, r4.z, r4.y
    mul r4.z, r1.z, r2.y
    mul r4.w, r1.y, r2.z
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.w, r4.z
    add r5.w, r4.z, r4.y

    // b2 -> r6
    mul r4.y, r2.w, r3.z
    mul r4.z, r2.z, r3.w
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r1.x, r4.y
    mul r4.z, r2.x, r3.w
    mul r4.w, r2.w, r3.x
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r1.z, r4.z
    add r4.y, r4.z, r4.y
    mul r4.z, r2.z, r3.x
    mul r4.w, r2.x, r3.z
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r1.w, r4.z
    add r6.x, r4.z, r4.y
    mul r4.y, r2.z, r3.w
    mul r4.z, r2.w, r3.z
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r0.x, r4.y
    mul r4.z, r2.w, r3.x
    mul r4.w, r2.x, r3.w
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.z, r4.z
    add r4.y, r4.z, r4.y
    mul r4.z, r2.x, r3.z
    mul r4.w, r2.z, r3.x
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.w, r4.z
    add r6.y, r4.z, r4.y
    mul r4.y, r1.w, r3.z
    mul r4.z, r1.z, r3.w
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r0.x, r4.y
    mul r4.z, r1.x, r3.w
    mul r4.w, r1.w, r3.x
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.z, r4.z
    add r4.y, r4.z, r4.y
    mul r4.z, r1.z, r3.x
    mul r4.w, r1.x, r3.z
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.w, r4.z
    add r6.z, r4.z, r4.y
    mul r4.y, r1.z, r2.w
    mul r4.z, r1.w, r2.z
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r0.x, r4.y
    mul r4.z, r1.w, r2.x
    mul r4.w, r1.x, r2.w
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.z, r4.z
    add r4.y, r4.z, r4.y
    mul r4.z, r1.x, r2.z
    mul r4.w, r1.z, r2.x
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.w, r4.z
    add r6.w, r4.z, r4.y

    // b3 -> r7
    mul r4.y, r2.y, r3.w
    mul r4.z, r2.w, r3.y
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r1.x, r4.y
    mul r4.z, r2.w, r3.x
    mul r4.w, r2.x, r3.w
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r1.y, r4.z
    add r4.y, r4.z, r4.y
    mul r4.z, r2.x, r3.y
    mul r4.w, r2.y, r3.x
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r1.w, r4.z
    add r7.x, r4.z, r4.y
    mul r4.y, r2.w, r3.y
    mul r4.z, r2.y, r3.w
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r0.x, r4.y
    mul r4.z, r2.x, r3.w
    mul r4.w, r2.w, r3.x
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.y, r4.z
    add r4.y, r4.z, r4.y
    mul r4.z, r2.y, r3.x
    mul r4.w, r2.x, r3.y
    mov r4.w, -r4.w
    add r4.z, r4.w, r4.z
    mul r4.z, r0.w, r4.z
    add r7.y, r4.z, r4.y
    mul r4.y, r1.y, r3.w
    mul r4.z, r1.w, r3.y
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r0.x, r4.y
    mul r4.z, r1.w, r3.x
    mul r3.w, r1.x, r3.w
    mov r3.w, -r3.w
    add r3.w, r3.w, r4.z
    mul r3.w, r0.y, r3.w
    add r3.w, r3.w, r4.y
    mul r4.y, r1.x, r3.y
    mul r4.z, r1.y, r3.x
    mov r4.z, -r4.z
    add r4.y, r4.z, r4.y
    mul r4.y, r0.w, r4.y
    add r7.z, r3.w, r4.y
    mul r3.w, r1.w, r2.y
    mul r4.y, r1.y, r2.w
    mov r4.y, -r4.y
    add r3.w, r3.w, r4.y
    mul r3.w, r0.x, r3.w
    mul r2.w, r1.x, r2.w
    mul r1.w, r1.w, r2.x
    mov r1.w, -r1.w
    add r1.w, r1.w, r2.w
    mul r1.w, r0.y, r1.w
    add r1.w, r1.w, r3.w
    mul r2.w, r1.y, r2.x
    mul r3.w, r1.x, r2.y
    mov r3.w, -r3.w
    add r2.w, r2.w, r3.w
    mul r0.w, r0.w, r2.w
    add r7.w, r0.w, r1.w

    // b4 -> r8
    mul r0.w, r2.z, r3.y
    mul r1.w, r2.y, r3.z
    mov r1.w, -r1.w
    add r0.w, r0.w, r1.w
    mul r0.w, r0.w, r1.x
    mul r1.w, r2.x, r3.z
    mul r2.w, r2.z, r3.x
    mov r2.w, -r2.w
    add r1.w, r1.w, r2.w
    mul r1.w, r1.w, r1.y
    add r0.w, r0.w, r1.w
    mul r1.w, r2.y, r3.x
    mul r2.w, r2.x, r3.y
    mov r2.w, -r2.w
    add r1.w, r1.w, r2.w
    mul r1.w, r1.w, r1.z
    add r8.x, r0.w, r1.w
    mul r0.w, r2.y, r3.z
    mul r1.w, r2.z, r3.y
    mov r1.w, -r1.w
    add r0.w, r0.w, r1.w
    mul r0.w, r0.w, r0.x
    mul r1.w, r2.z, r3.x
    mul r2.w, r2.x, r3.z
    mov r2.w, -r2.w
    add r1.w, r1.w, r2.w
    mul r1.w, r0.y, r1.w
    add r0.w, r0.w, r1.w
    mul r1.w, r2.x, r3.y
    mul r2.w, r2.y, r3.x
    mov r2.w, -r2.w
    add r1.w, r1.w, r2.w
    mul r1.w, r0.z, r1.w
    add r8.y, r0.w, r1.w
    mul r0.w, r1.z, r3.y
    mul r1.w, r1.y, r3.z
    mov r1.w, -r1.w
    add r0.w, r0.w, r1.w
    mul r0.w, r0.w, r0.x
    mul r1.w, r1.x, r3.z
    mul r2.w, r1.z, r3.x
    mov r2.w, -r2.w
    add r1.w, r1.w, r2.w
    mul r1.w, r0.y, r1.w
    add r0.w, r0.w, r1.w
    mul r1.w, r1.y, r3.x
    mul r2.w, r1.x, r3.y
    mov r2.w, -r2.w
    add r1.w, r1.w, r2.w
    mul r1.w, r0.z, r1.w
    add r8.z, r0.w, r1.w
    mul r0.w, r1.y, r2.z
    mul r1.w, r1.z, r2.y
    mov r1.w, -r1.w
    add r0.w, r0.w, r1.w
    mul r0.x, r0.w, r0.x
    mul r0.w, r1.z, r2.x
    mul r1.z, r1.x, r2.z
    mov r1.z, -r1.z
    add r0.w, r0.w, r1.z
    mul r0.y, r0.w, r0.y
    add r0.x, r0.y, r0.x
    mul r0.y, r1.x, r2.y
    mul r0.w, r1.y, r2.x
    mov r0.w, -r0.w
    add r0.y, r0.w, r0.y
    mul r0.y, r0.y, r0.z
    add r8.w, r0.y, r0.x

    // Divide by the determinant
    div r0.xyzw, r5.xyzw, r4.xxxx
    div r1.xyzw, r6.xyzw, r4.xxxx
    div r2.xyzw, r7.xyzw, r4.xxxx
    div r3.xyzw, r8.xyzw, r4.xxxx

    // Store results for later use as r0-r4 are most
    // likely to be used by the default shader code.
    // r50 equivalent of matrix._m00_m01_m02_m03
    // r51 equivalent of matrix._m10_m11_m12_m13
    // r52 equivalent of matrix._m20_m21_m22_m23
    // r53 equivalent of matrix._m30_m31_m32_m33
    mov r50.xyzw, r0.xyzw
    mov r51.xyzw, r1.xyzw
    mov r52.xyzw, r2.xyzw
    mov r53.xyzw, r3.xyzw