
IA-32 Intel® Architecture Optimization
Figure 5-2 shows how one result would be computed with seven instructions
if the data were organized as AoS and only SSE were used; four results
would require 28 instructions.
Figure 5-2 Dot Product Operation
Example 5-1 Pseudocode for Horizontal (xyz, AoS) Computation
mulps     ; x*x', y*y', z*z'
movaps    ; reg->reg move, since next steps overwrite
shufps    ; get b,a,d,c from a,b,c,d
addps     ; get a+b,a+b,c+d,c+d
movaps    ; reg->reg move
shufps    ; get c+d,c+d,a+b,a+b from prior addps
addps     ; get a+b+c+d,a+b+c+d,a+b+c+d,a+b+c+d
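The seven steps of Example 5-1 can be sketched with SSE intrinsics as shown here. This is an illustration under stated assumptions, not code from the manual: the function name dot_aos_sse is hypothetical, each input is assumed to be one AoS element stored as four floats, and for xyz data the fourth float is assumed to be zero so that it does not contribute to the sum. The register copies that the pseudocode performs with movaps appear as plain assignments; a compiler may or may not emit them as separate instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Hypothetical helper: horizontal dot product of two 4-float AoS elements.
   Assumption: for xyz data the fourth float of each input is 0.0f. */
float dot_aos_sse(const float *a, const float *b)
{
    assert(a != NULL && b != NULL);

    /* mulps: lane-wise products, called a,b,c,d in Example 5-1 */
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));

    /* movaps: copy, since the shuffle overwrites its destination */
    __m128 t = p;
    /* shufps: get b,a,d,c from a,b,c,d */
    t = _mm_shuffle_ps(t, t, _MM_SHUFFLE(2, 3, 0, 1));
    /* addps: get a+b,a+b,c+d,c+d */
    p = _mm_add_ps(p, t);

    /* movaps: copy again */
    t = p;
    /* shufps: get c+d,c+d,a+b,a+b */
    t = _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2));
    /* addps: every lane now holds a+b+c+d */
    p = _mm_add_ps(p, t);

    /* Return the lowest lane; all four lanes hold the same sum. */
    return _mm_cvtss_f32(p);
}
```

All four lanes end up holding the same value, so three of the four result slots carry no new information. That redundancy is why producing four results this way costs four times the seven-instruction sequence.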
Figure 5-2 operands and results:
  (X1 X2 X3 X4) x (Fx Fx Fx Fx)
+ (Y1 Y2 Y3 Y4) x (Fy Fy Fy Fy)
+ (Z1 Z2 Z3 Z4) x (Fz Fz Fz Fz)
+ (W1 W2 W3 W4) x (Fw Fw Fw Fw)
= (R1 R2 R3 R4)
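The operation in Figure 5-2 computes each result as Ri = Xi*Fx + Yi*Fy + Zi*Fz + Wi*Fw for i = 1..4. A minimal sketch of that arithmetic with SSE intrinsics follows. It assumes the X, Y, Z and W values are already available as four separate arrays of four floats, which is how the figure draws them; the function name dot4_packed and its parameter layout are hypothetical and do not come from the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Hypothetical helper: r[i] = x[i]*f[0] + y[i]*f[1] + z[i]*f[2] + w[i]*f[3]
   for i = 0..3, where f holds Fx, Fy, Fz, Fw in that order. */
void dot4_packed(const float *x, const float *y, const float *z,
                 const float *w, const float *f, float *r)
{
    assert(x != NULL && y != NULL && z != NULL && w != NULL);
    assert(f != NULL && r != NULL);

    /* Broadcast each F component to all four lanes, as the figure shows. */
    __m128 fx = _mm_set1_ps(f[0]);
    __m128 fy = _mm_set1_ps(f[1]);
    __m128 fz = _mm_set1_ps(f[2]);
    __m128 fw = _mm_set1_ps(f[3]);

    /* Four packed multiplies and three packed adds yield four results. */
    __m128 acc = _mm_mul_ps(_mm_loadu_ps(x), fx);
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(y), fy));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(z), fz));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(w), fw));

    /* Store R1..R4. */
    _mm_storeu_ps(r, acc);
}
```

Unaligned loads and stores are used so the sketch works with any float arrays; 16-byte aligned data could use _mm_load_ps and _mm_store_ps instead.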