Intel ARCHITECTURE IA-32 User Manual: Use of cvttps2pi/cvttss2si Instructions, Example 5-10

Chapter 5: Optimizing for SIMD Floating-point Applications
Use of cvttps2pi/cvttss2si Instructions
The cvttps2pi and cvttss2si instructions encode the truncate/chop rounding mode implicitly in the instruction, thereby taking precedence over the rounding mode specified in the MXCSR register. This behavior can eliminate the need to change the rounding mode from round-nearest to truncate/chop, and then back to round-nearest to resume computation. Frequent changes to the MXCSR register should be
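
As a rough illustration of the behavior described above, the following sketch compares the two scalar conversions through their C intrinsics. The helper names convert_mxcsr_mode and convert_truncate are hypothetical (they do not come from the manual), and the comments assume MXCSR still holds its default round-to-nearest setting.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: cvtss2si rounds according to the mode currently
   set in MXCSR (round-to-nearest unless the program has changed it). */
int convert_mxcsr_mode(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* Hypothetical helper: cvttss2si always truncates toward zero,
   whatever rounding mode MXCSR specifies, and leaves MXCSR untouched. */
int convert_truncate(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

Under those assumptions, convert_mxcsr_mode(2.7f) yields 3 while convert_truncate(2.7f) yields 2, and neither call writes to MXCSR.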
Example 5-10 Horizontal Add Using Intrinsics with movhlps/movlhps
#include <xmmintrin.h>  // SSE intrinsics

// Vertex_soa is not defined in this excerpt. _mm_load_ps and _mm_store_ps
// require in->x, in->y, in->z, in->w, and out to be 16-byte aligned.
void horiz_add_intrin(Vertex_soa *in, float *out)
{
    __m128 tmm0, tmm1, tmm2, tmm3, tmm4, tmm5, tmm6;  // Temporary variables

    tmm0 = _mm_load_ps(in->x);                // tmm0 = A1 A2 A3 A4
    tmm1 = _mm_load_ps(in->y);                // tmm1 = B1 B2 B3 B4
    tmm2 = _mm_load_ps(in->z);                // tmm2 = C1 C2 C3 C4
    tmm3 = _mm_load_ps(in->w);                // tmm3 = D1 D2 D3 D4

    tmm5 = tmm0;                              // tmm5 = A1 A2 A3 A4
    tmm5 = _mm_movelh_ps(tmm5, tmm1);         // tmm5 = A1 A2 B1 B2
    tmm1 = _mm_movehl_ps(tmm1, tmm0);         // tmm1 = A3 A4 B3 B4
    tmm5 = _mm_add_ps(tmm5, tmm1);            // tmm5 = A1+A3 A2+A4 B1+B3 B2+B4

    tmm4 = tmm2;                              // tmm4 = C1 C2 C3 C4
    tmm2 = _mm_movelh_ps(tmm2, tmm3);         // tmm2 = C1 C2 D1 D2
    tmm3 = _mm_movehl_ps(tmm3, tmm4);         // tmm3 = C3 C4 D3 D4
    tmm3 = _mm_add_ps(tmm3, tmm2);            // tmm3 = C1+C3 C2+C4 D1+D3 D2+D4

    // _mm_shuffle_ps takes its two low elements from the first operand and
    // its two high elements from the second, so tmm5 (A, B) must come first.
    tmm6 = _mm_shuffle_ps(tmm5, tmm3, 0x88);  // tmm6 = A1+A3 B1+B3 C1+C3 D1+D3
    tmm5 = _mm_shuffle_ps(tmm5, tmm3, 0xDD);  // tmm5 = A2+A4 B2+B4 C2+C4 D2+D4
    tmm6 = _mm_add_ps(tmm6, tmm5);            // tmm6 = A1+A2+A3+A4 B1+B2+B3+B4
                                              //        C1+C2+C3+C4 D1+D2+D3+D4
    _mm_store_ps(out, tmm6);
}
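
One way to check a horizontal add such as Example 5-10 is to compare its output against a plain scalar reference. The sketch below is only an assumption about how that could look: the type Vertex_soa_ref and the function horiz_add_ref are hypothetical names, and the layout of four floats per coordinate array is assumed, since this excerpt does not show the definition of Vertex_soa.

```c
#include <stddef.h>

/* Assumed layout (not taken from the manual): four values per coordinate. */
typedef struct {
    float x[4];  /* A1 A2 A3 A4 */
    float y[4];  /* B1 B2 B3 B4 */
    float z[4];  /* C1 C2 C3 C4 */
    float w[4];  /* D1 D2 D3 D4 */
} Vertex_soa_ref;

/* Scalar reference: out[0] = sum of x, out[1] = sum of y,
   out[2] = sum of z, out[3] = sum of w. */
void horiz_add_ref(const Vertex_soa_ref *in, float *out)
{
    const float *rows[4] = { in->x, in->y, in->z, in->w };
    size_t r;

    for (r = 0; r < 4; ++r) {
        /* Pair the terms as (e1+e3) + (e2+e4), the same order in which
           the SIMD example forms its partial sums. */
        out[r] = (rows[r][0] + rows[r][2]) + (rows[r][1] + rows[r][3]);
    }
}
```

Using the same association order as the SIMD code keeps the floating-point results directly comparable; a different summation order could differ in the last bits for inputs that are not exactly representable.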