IBM 990 User Manual
Chapter 2. System structure and design
The high prediction success rate of the BHT design contributes a great deal to the superscalar
performance of the z990, because the architecture rules require the result of a branch to be
predicted correctly before an instruction stream can be executed in parallel
successfully.
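As a rough illustration of why the prediction success rate matters, the sketch below simulates a generic branch history table built from 2-bit saturating counters. The counter scheme, the table size, the address-based indexing, and the names `BranchHistoryTable`, `predict`, `update`, and `success_rate` are assumptions made for illustration; the text does not describe the internal organization of the z990 BHT.

```python
class BranchHistoryTable:
    """Toy BHT: one 2-bit saturating counter per entry (assumed scheme, not the z990 design)."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Counter values 0..1 predict "not taken", 2..3 predict "taken".
        self.counters = [1] * entries

    def _index(self, address):
        # Hypothetical indexing: low-order part of the branch address.
        return address % self.entries

    def predict(self, address):
        return self.counters[self._index(address)] >= 2

    def update(self, address, taken):
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def success_rate(bht, trace):
    """Fraction of branches in trace [(address, taken), ...] that were predicted correctly."""
    correct = 0
    for address, taken in trace:
        if bht.predict(address) == taken:
            correct += 1
        bht.update(address, taken)  # learn the actual outcome
    return correct / len(trace)


# A loop-closing branch: taken 9 times, then not taken once, repeated 10 times.
loop_trace = [(0x1000, i % 10 != 9) for i in range(100)]
rate = success_rate(BranchHistoryTable(), loop_trace)
print(f"success rate: {rate:.2f}")
```

In this trace the toy predictor is wrong only on the very first branch and on each loop exit. Every such miss corresponds to work that a superscalar pipeline would have started down the wrong path.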
IEEE Floating Point
The IEEE Standard for Binary Floating Point Arithmetic (IEEE 754-1985) was included in
S/390 to further enhance the value of this platform for this type of calculation. The
initial implementation added 121 floating-point instructions over prior S/390 CMOS models
(Hexadecimal Floating Point had 54 instructions). Later, with the introduction of the 64-bit
architecture, 12 additional instructions were added for conversion between IEEE Binary
Floating Point and 64-bit integers.
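To illustrate why conversion between 64-bit integers and binary floating point is an operation in its own right, the following sketch shows that the IEEE 754 64-bit format carries a 53-bit significand, so not every 64-bit integer converts exactly. It relies on Python's `float`, which is an IEEE 754 double on common platforms, and does not model the z990 instructions themselves. The helper name `to_double_bits` is illustrative.

```python
import struct

def to_double_bits(value):
    """Convert an integer to an IEEE 754 double and return its 64-bit encoding as hex."""
    return struct.pack(">d", float(value)).hex()

exact = 2**53          # fits in the 53-bit significand
inexact = 2**53 + 1    # needs 54 bits, so the conversion has to round

print(to_double_bits(exact))
print(to_double_bits(inexact))
print(int(float(inexact)) == inexact)   # round trip loses the low-order bit
```

Both integers map to the same encoding, which is why a conversion has to define its rounding behavior.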
The key point is that Java and C/C++ applications tend to use IEEE Binary Floating Point
operations more frequently than legacy applications. This means that the better the hardware
implementation of this set of instructions, the better the performance of e-business
applications will be.
On earlier systems, the emphasis was on traditional hexadecimal floating point
arithmetic. The z990 has a Binary Floating Point unit that matches the performance of the
traditional hexadecimal floating point unit by halving the number of cycles that binary
floating point operations previously required.
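The two formats differ in radix and field layout. As a minimal sketch, the functions below decode a 32-bit short value under each format. Hexadecimal floating point uses a 7-bit excess-64 exponent of base 16 with a 24-bit fraction, while IEEE binary floating point uses an 8-bit excess-127 exponent of base 2 with a 23-bit fraction and an implied leading 1. The function names are illustrative, and the binary decoder handles only normalized numbers (no zeros, denormals, infinities, or NaNs).

```python
import struct

def decode_hfp_short(word):
    """Decode a 32-bit hexadecimal floating point value: sign, 7-bit exponent, 24-bit fraction."""
    sign = -1.0 if word >> 31 else 1.0
    exponent = (word >> 24) & 0x7F            # excess-64, power of 16
    fraction = (word & 0xFFFFFF) / 2**24      # 0.f with six hexadecimal digits
    return sign * fraction * 16.0 ** (exponent - 64)

def decode_bfp_short(word):
    """Decode a normalized 32-bit IEEE 754 binary value: sign, 8-bit exponent, 23-bit fraction."""
    sign = -1.0 if word >> 31 else 1.0
    exponent = (word >> 23) & 0xFF            # excess-127, power of 2
    fraction = 1.0 + (word & 0x7FFFFF) / 2**23    # implied leading 1
    return sign * fraction * 2.0 ** (exponent - 127)

# The value 1.0 encoded in each format.
print(decode_hfp_short(0x41100000))
print(decode_bfp_short(0x3F800000))
# Cross-check the binary decoder against the platform's own IEEE interpretation.
print(struct.unpack(">f", bytes.fromhex("3F800000"))[0])
```

The same numeric value has different bit patterns in the two formats, so data exchanged between hexadecimal and binary floating point code has to be converted rather than reinterpreted.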
Translation Lookaside Buffer
The Translation Lookaside Buffers (TLBs) in the Instruction and Data L1 caches now have a
secondary TLB to enhance performance. In addition, a translator unit has been added to
translate requests that miss in the secondary TLB.
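The lookup order described above can be sketched as follows. This is a simplified software model: it assumes dictionary-backed TLBs, a 4 KB page size, and a hypothetical `translator` callback standing in for the translator unit. Sizes, replacement policy, and all names are illustrative rather than taken from the z990 design.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch

class TwoLevelTLB:
    def __init__(self, translator):
        self.primary = {}      # small TLB consulted first
        self.secondary = {}    # larger second-level TLB
        self.translator = translator  # invoked only when both TLBs miss
        self.stats = {"primary": 0, "secondary": 0, "translator": 0}

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.primary:
            self.stats["primary"] += 1
            frame = self.primary[page]
        elif page in self.secondary:
            self.stats["secondary"] += 1
            frame = self.secondary[page]
            self.primary[page] = frame      # refill the primary TLB
        else:
            self.stats["translator"] += 1
            frame = self.translator(page)   # full address translation
            self.secondary[page] = frame
            self.primary[page] = frame
        return frame * PAGE_SIZE + offset


# Hypothetical page table used as the translator's backing data.
page_table = {0: 7, 1: 3}
tlb = TwoLevelTLB(page_table.__getitem__)

print(hex(tlb.translate(0x0010)))   # first touch: both TLBs miss
print(hex(tlb.translate(0x0020)))   # same page: primary hit
tlb.primary.clear()                 # simulate eviction from the primary TLB
print(hex(tlb.translate(0x0030)))   # secondary hit, no translator call
print(tlb.stats)
```

The secondary level pays off in the third lookup: the entry has left the primary TLB, yet the translation is still found without invoking the translator.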
Instruction fetching and instruction decode
The superscalar design of the z990 microprocessor allows for the decoding of up to two
instructions per cycle and the execution of up to three instructions per cycle. Execution
takes place in order, but storage accesses for instruction and operand fetching may occur
out of sequence.
Instruction fetching
Instruction fetch in non-z990 models tries to get as far ahead of instruction decode and
execution as possible because of the relatively large instruction buffers available. In the z990
microprocessor, smaller instruction buffers are used. The operation code is fetched from the
I-cache and put in instruction buffers that hold pre-fetched data awaiting decode.
Instruction decoding
The processor can decode one or two instructions per cycle. The result of the decoding
process is queued and subsequently used to form a group.
Instruction grouping
From the instruction queue, one simple branch instruction and up to two general instructions
can be issued every cycle. The instructions are taken from the instruction queue and
assembled into a group according to the instruction grouping rules. A complete
description of the rules is beyond the scope of this redbook.
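The per-cycle limits stated above can be illustrated with a toy grouping routine. It models only the limit of one branch and two general instructions per group, taken in program order. The actual z990 grouping rules, which this text does not describe, are not modeled. The function name `form_groups` and the two-way classification into "branch" and "general" are assumptions for illustration.

```python
from collections import deque

MAX_BRANCHES = 1   # one simple branch per group (from the text)
MAX_GENERAL = 2    # up to two general instructions per group (from the text)

def form_groups(instructions):
    """Split an in-order list of (mnemonic, kind) pairs into issue groups.

    kind must be "branch" or "general". Only the per-group limits are
    modeled; dependencies and the real grouping rules are ignored.
    """
    queue = deque(instructions)
    groups = []
    while queue:
        group, branches, general = [], 0, 0
        while queue:
            mnemonic, kind = queue[0]
            if kind not in ("branch", "general"):
                raise ValueError(f"unknown instruction kind: {kind!r}")
            if kind == "branch" and branches < MAX_BRANCHES:
                branches += 1
            elif kind == "general" and general < MAX_GENERAL:
                general += 1
            else:
                break  # next instruction does not fit; preserve program order
            queue.popleft()
            group.append(mnemonic)
        groups.append(group)
    return groups


stream = [
    ("L", "general"), ("AR", "general"), ("ST", "general"),
    ("BRC", "branch"), ("LR", "general"), ("BRC", "branch"),
    ("BRC", "branch"),
]
groups = form_groups(stream)
for cycle, group in enumerate(groups, start=1):
    print(cycle, group)
```

Even in this simplified model, the order in which instructions appear determines how full each group is. Two adjacent branches cannot share a group, whereas a branch placed between general instructions can.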
It is the compiler’s responsibility to select instructions that best fit with the z990 superscalar
microprocessor and abide by the grouping rules to create code that best exploits the
superscalar implementation.