
Freescale Semiconductor StarCore SC140 User Manual


Instruction Timing

SC140 DSP Core Reference Manual


5.3.2.2 Delayed COF

When a change-of-flow (COF) instruction is executed, the core must wait for the pipeline to refill, starting
with a new pre-fetch from memory. A delay slot is the next VLES after a delayed change-of-flow instruction.
Because the delay slot can be used to continue executing previously fetched instructions, the instruction set
includes special delayed instructions. These instructions use part or all of the delay cycles to execute one
additional execution set, which effectively reduces the penalty of a change-of-flow operation. If the
additional execution set in the delay slot is included in the cycle count, the number of cycles for the
change-of-flow instruction is effectively reduced. Refer to Section 5.3.2, “Change-Of-Flow Instruction
Timing,” on page 5-17 for further details.

5.3.2.3 COF Execution Cycles

The basic change-of-flow JMP instruction takes three cycles to execute. However, the number of cycles is
different for the following change-of-flow instructions:

PC-relative instructions such as BRA require an additional cycle to calculate the destination.

Delayed instructions such as JMPD effectively require the same cycle count as the non-delayed
version (in this example JMP) minus the execution cycle count of the set in the delay slot, because
the pipeline fill-up time is used to execute a useful execution set. The actual time taken to jump
to the new address is the same for the delayed and non-delayed versions. However, the effective
cycle count is lower for the delayed version, since the instructions in the delay slot would cost
extra cycles if the non-delayed version were used.

The delay slot lasts for the full execution time of the set in the delay slot, which may be more than
one cycle. The minimum execution time of a delayed instruction is one cycle. For example:

    JMPD dest                ; takes 1 cycle (3-2=1), because the next instruction
    MOVE.W d0,(sp + xxx)     ; takes 2 cycles

Stalls that originate in delay slot instructions, and are caused by a memory access wait-state or
contention, stall the whole core, and are not deducted from the cycle count.

Conditional change-of-flow instructions (JT/JF/BT/BF) require four cycles to execute (if taken),
and one cycle to execute (if not taken).
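
The cycle counts described above can be collected into a small model. The following is only a sketch: the constants are the figures quoted in the text, while the function and parameter names are hypothetical and not part of any Freescale tool. It assumes no memory wait-states or contention, since those stalls are not deducted from the cycle count, and it does not model delayed forms of the conditional instructions, which the text does not quantify.

```python
# Illustrative cycle model; constants are taken from the text above,
# all names are hypothetical.
JMP_CYCLES = 3              # basic change-of-flow (JMP)
PC_RELATIVE_EXTRA = 1       # extra cycle for PC-relative forms such as BRA
COND_TAKEN_CYCLES = 4       # JT/JF/BT/BF when taken
COND_NOT_TAKEN_CYCLES = 1   # JT/JF/BT/BF when not taken
MIN_DELAYED_CYCLES = 1      # a delayed instruction never costs less than this


def cof_cycles(pc_relative=False, delay_slot_cycles=None):
    """Effective cycles of an unconditional change-of-flow instruction.

    delay_slot_cycles is None for a non-delayed form (JMP, BRA). For a
    delayed form (such as JMPD) it is the execution time of the set in
    the delay slot, which is subtracted from the non-delayed count.
    """
    base = JMP_CYCLES + (PC_RELATIVE_EXTRA if pc_relative else 0)
    if delay_slot_cycles is None:
        return base
    return max(MIN_DELAYED_CYCLES, base - delay_slot_cycles)


def conditional_cof_cycles(taken):
    """Cycles of a non-delayed conditional change-of-flow instruction."""
    return COND_TAKEN_CYCLES if taken else COND_NOT_TAKEN_CYCLES
```

With these definitions, `cof_cycles(delay_slot_cycles=2)` reproduces the JMPD example (3-2=1), and `cof_cycles(pc_relative=True)` gives the four cycles of a BRA.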

The core implements a mechanism for fast return from subroutine. The return address of a
subroutine is kept in a hidden return address stack (RAS) register in addition to being pushed to
the stack, so it does not have to be read from the stack in memory upon return. However, this
hidden register is not valid if there was another jump to a subroutine before the return, in which
case the core adds two cycles to the RTS instruction to read the return address from the stack.
Refer to Section 5.5.5, “Fast Return from Subroutines,” for a more detailed description of the
fast return mechanism.
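
The RTS penalty rule can be sketched as a one-flag model. This is a simplified illustration, not a description of the hardware: the class and method names are hypothetical, and the model assumes that once an RTS has consumed the hidden register, the next RTS has to read its return address from the stack. Only the two-cycle penalty comes from the text; the base RTS cycle count is not modeled.

```python
RTS_STACK_READ_PENALTY = 2  # extra cycles when the RAS register is not valid


class ReturnAddressTracker:
    """Toy model of the hidden return address stack (RAS) register."""

    def __init__(self):
        # Assumption: nothing is cached before the first subroutine call.
        self.ras_valid = False

    def call(self):
        # A jump to subroutine loads the RAS with the newest return address,
        # replacing whatever an outer call had stored there.
        self.ras_valid = True

    def rts_extra_cycles(self):
        """Return the extra cycles this RTS pays (0 or 2)."""
        extra = 0 if self.ras_valid else RTS_STACK_READ_PENALTY
        # Assumption: the register is consumed by this return, so an
        # enclosing subroutine's RTS must fetch its address from memory.
        self.ras_valid = False
        return extra
```

In this model a leaf subroutine returns with no penalty, while a subroutine that itself called another subroutine pays the two extra cycles on its own RTS.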

The core keeps a “shadow” version of SP-8 to save pre-calculation time in case of a POP. If SP
was explicitly changed by a TFRA or an AGU arithmetic instruction, the shadow SP is not valid
and another cycle is needed for the first POP pre-calculation (or equivalent, such as RTE). Refer to
Section 5.5.4, “Shadow Stack Pointer Registers,” for a more detailed description of the shadow SP
mechanism.
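
The shadow SP rule can be sketched in the same way. Again this is a hedged illustration with hypothetical names: it assumes the shadow value starts out valid and that the first POP after an explicit SP change recomputes it, so that later POPs are not penalized. Only the one-cycle penalty and the invalidating events (TFRA, AGU arithmetic on SP) come from the text.

```python
POP_RECALC_PENALTY = 1  # extra cycle for the first POP after an explicit SP change


class ShadowSpTracker:
    """Toy model of the shadow SP-8 validity flag."""

    def __init__(self):
        # Assumption: the shadow value is valid at the start.
        self.shadow_valid = True

    def explicit_sp_write(self):
        # SP changed by TFRA or an AGU arithmetic instruction.
        self.shadow_valid = False

    def pop_extra_cycles(self):
        """Return the extra cycles this POP (or equivalent, such as RTE) pays."""
        extra = 0 if self.shadow_valid else POP_RECALC_PENALTY
        # Assumption: the pre-calculation done here makes the shadow valid again.
        self.shadow_valid = True
        return extra
```

Under these assumptions only the first POP after an explicit SP write costs the extra cycle.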

A change of flow (jump, branch, interrupt, or long loop iteration) to a destination execution set
that is spread over two fetch sets requires an additional cycle for memory access.
An execution set is not necessarily aligned to a fetch set, and can overlap two fetch sets. The core
keeps two fetch sets in a buffer, so this is not normally a problem. However, when a