Use of cvttps2pi/cvttss2si Instructions, Example 5-10 – Intel ARCHITECTURE IA-32 User Manual

Optimizing for SIMD Floating-point Applications

Use of cvttps2pi/cvttss2si Instructions

The cvttps2pi and cvttss2si instructions encode the truncate/chop rounding mode implicitly in the instruction, thereby taking precedence over the rounding mode specified in the MXCSR register. This can eliminate the need to change the rounding mode from round-nearest to truncate/chop, and then back to round-nearest to resume computation. Frequent changes to the MXCSR
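The contrast between the truncating conversion and the MXCSR-controlled conversion can be sketched with compiler intrinsics. This is a minimal illustration and not part of the manual's listing: the helper names float_to_int_trunc and float_to_int_mxcsr are hypothetical, and the sketch assumes a compiler that provides the SSE intrinsics in xmmintrin.h and that MXCSR still holds its default round-nearest setting.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ss, _mm_cvttss_si32, _mm_cvtss_si32 */

/* Hypothetical helper: truncating conversion (maps to cvttss2si).
   The truncate/chop mode is part of the instruction itself, so the
   result does not depend on the rounding control in MXCSR. */
int float_to_int_trunc(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}

/* Hypothetical helper: conversion that follows the current MXCSR
   rounding mode (maps to cvtss2si). With the default round-nearest
   setting, 2.7f converts to 3. */
int float_to_int_mxcsr(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}
```

With MXCSR left at round-nearest, float_to_int_trunc(2.7f) yields 2 while float_to_int_mxcsr(2.7f) yields 3, so code that needs C-style truncation can use the first form without writing MXCSR at all.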

Example 5-10 Horizontal Add Using Intrinsics with movhlps/movlhps

#include <xmmintrin.h>  // SSE intrinsics

// in->x, in->y, in->z, in->w and out must be 16-byte aligned
// (required by _mm_load_ps and _mm_store_ps).
void horiz_add_intrin(Vertex_soa *in, float *out)
{
    __m128 tmm0, tmm1, tmm2, tmm3, tmm4, tmm5, tmm6;  // Temporary variables

    tmm0 = _mm_load_ps(in->x);                // tmm0 = A1 A2 A3 A4
    tmm1 = _mm_load_ps(in->y);                // tmm1 = B1 B2 B3 B4
    tmm2 = _mm_load_ps(in->z);                // tmm2 = C1 C2 C3 C4
    tmm3 = _mm_load_ps(in->w);                // tmm3 = D1 D2 D3 D4

    tmm5 = tmm0;                              // tmm5 = A1 A2 A3 A4
    tmm5 = _mm_movelh_ps(tmm5, tmm1);         // tmm5 = A1 A2 B1 B2
    tmm1 = _mm_movehl_ps(tmm1, tmm0);         // tmm1 = A3 A4 B3 B4
    tmm5 = _mm_add_ps(tmm5, tmm1);            // tmm5 = A1+A3 A2+A4 B1+B3 B2+B4

    tmm4 = tmm2;                              // tmm4 = C1 C2 C3 C4
    tmm2 = _mm_movelh_ps(tmm2, tmm3);         // tmm2 = C1 C2 D1 D2
    tmm3 = _mm_movehl_ps(tmm3, tmm4);         // tmm3 = C3 C4 D3 D4
    tmm3 = _mm_add_ps(tmm3, tmm2);            // tmm3 = C1+C3 C2+C4 D1+D3 D2+D4

    // The first operand of _mm_shuffle_ps supplies the two low result
    // elements and the second operand supplies the two high ones.
    tmm6 = _mm_shuffle_ps(tmm5, tmm3, 0x88);  // tmm6 = A1+A3 B1+B3 C1+C3 D1+D3
    tmm5 = _mm_shuffle_ps(tmm5, tmm3, 0xDD);  // tmm5 = A2+A4 B2+B4 C2+C4 D2+D4

    tmm6 = _mm_add_ps(tmm6, tmm5);            // tmm6 = A1+A2+A3+A4 B1+B2+B3+B4
                                              //        C1+C2+C3+C4 D1+D2+D3+D4
    _mm_store_ps(out, tmm6);
}
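For checking the output of Example 5-10, a scalar reference can be useful. The sketch below is an illustration only: the excerpt does not show the definition of Vertex_soa, so the layout used here (four arrays of four floats, under the hypothetical name Vertex_soa_sketch) is an assumption, and horiz_add_scalar is a hypothetical helper name.

```c
/* Assumed layout: the Vertex_soa definition is not shown in this
   excerpt, so this stand-in type is a guess at its shape. */
typedef struct {
    float x[4];  /* A1 A2 A3 A4 */
    float y[4];  /* B1 B2 B3 B4 */
    float z[4];  /* C1 C2 C3 C4 */
    float w[4];  /* D1 D2 D3 D4 */
} Vertex_soa_sketch;

/* Hypothetical scalar reference: each output element is the sum of the
   four elements of one input array, which is the result the final
   comment in Example 5-10 describes. */
void horiz_add_scalar(const Vertex_soa_sketch *in, float *out)
{
    out[0] = in->x[0] + in->x[1] + in->x[2] + in->x[3];
    out[1] = in->y[0] + in->y[1] + in->y[2] + in->y[3];
    out[2] = in->z[0] + in->z[1] + in->z[2] + in->z[3];
    out[3] = in->w[0] + in->w[1] + in->w[2] + in->w[3];
}
```

Comparing the four outputs of this routine with those of the SIMD version on the same input is a quick way to catch a wrong shuffle operand order or immediate. The scalar version adds in a different order than the SIMD version, so an exact match is only guaranteed for inputs whose sums are exactly representable.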