This program wastes two machine cycles, one for each of the two nops. It can be modified to be a bit more efficient by moving useful instructions into the delay slots.
/* first.s */
        .global _main
_main:
        save    %sp, -64, %sp
        mov     9, %l0
        sub     %l0, 1, %o0
        sub     %l0, 7, %o1
        call    .mul
        nop                     ! wasted machine cycle
        sub     %l0, 11, %o1
        call    .div
        nop                     ! wasted machine cycle
        mov     %o0, %l1
        mov     1, %g1
        ta      0
        .global _main
_main:
        save    %sp, -64, %sp
        mov     9, %l0
        sub     %l0, 1, %o0
        call    .mul
        sub     %l0, 7, %o1     ! Delay slot filled
        call    .div
        sub     %l0, 11, %o1    ! Delay slot filled
        mov     %o0, %l1
        mov     1, %g1
        ta      0
Z - set if the result was zero
N - set if the result was negative
V - set if execution resulted in a number too large to store in the register
C - set if there was a carry out of the register after execution of the instruction

Based on the value of each of these condition codes, a branch may be taken or not taken. Only certain instructions in SPARC assembly set the condition codes. Once the condition codes are set, the branch outcome depends on their values. The condition codes remain in their state until they are overwritten by another instruction that modifies them. Hence, placing such an instruction in the delay slot of a branch, or anywhere before the branch has actually taken place, might not be a good idea.
The ordinary instructions do not set the condition codes. For example, the add instruction does not modify any condition codes. To set them we need modified instructions: the addcc instruction adds two numbers and, at the same time, uses the result of the addition to set the condition codes.
        add     r1, r2(or c), r3        ! no condition codes set
        addcc   r1, r2(or c), r3        ! condition codes are set
Example:
Let us assume for the sake of simplicity that the maximum number that can be stored in a register is 255 and the minimum is -256; that is, we are assuming that numbers are stored in a 9-bit register.
Which of the condition codes (Z, C, V, N) will be set by the following statement?
addcc r1, r2(or c), r3
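One way to check answers to questions of this kind is to model the instruction in C. The following is a minimal sketch, assuming the hypothetical 9-bit two's complement register described above (range -256 to 255). The names `addcc9` and `struct icc` are invented for this illustration and are not part of any SPARC toolchain.

```c
#include <stdbool.h>

#define BITS 9
#define MASK ((1u << BITS) - 1u)   /* 0x1FF: the nine register bits */
#define SIGN (1u << (BITS - 1))    /* 0x100: the sign bit           */

/* Hypothetical container for the four condition codes. */
struct icc { bool z, n, v, c; };

/* Model of "addcc a, b, result" on a 9-bit register. */
struct icc addcc9(int a, int b, int *result)
{
    unsigned ua  = (unsigned)a & MASK;   /* 9-bit pattern of a        */
    unsigned ub  = (unsigned)b & MASK;   /* 9-bit pattern of b        */
    unsigned sum = ua + ub;              /* may need a tenth bit      */
    unsigned r   = sum & MASK;           /* what the register keeps   */
    struct icc f;

    f.z = (r == 0);                      /* result was zero           */
    f.n = (r & SIGN) != 0;               /* sign bit of the result    */
    f.c = ((sum >> BITS) & 1u) != 0;     /* carry out of the top bit  */
    /* Overflow: operands share a sign that the result does not have. */
    f.v = (~(ua ^ ub) & (ua ^ r) & SIGN) != 0;

    /* Sign-extend the 9-bit pattern back to an int. */
    *result = (r & SIGN) ? (int)r - (1 << BITS) : (int)r;
    return f;
}
```

For instance, adding 200 and 100 in this model leaves -212 in the register with N and V set, while adding -1 and 1 leaves 0 with Z and C set.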
The format is: b... label (b... is one of the branches testing integer condition codes)
The various branch mnemonics are:
ba      branch always, similar to goto
bn      branch never, similar to a nop
bl      branch on less than zero
ble     branch on less than or equal to zero
be      branch on equal to zero
bne     branch on not equal to zero
bge     branch on greater than or equal to zero
bg      branch on greater than zero

Note what the complements (opposites) are:
Condition    Complement
---------    ----------
bl           bge
ble          bg
be           bne
bne          be
bge          bl
bg           ble

Less than what (for example)? The branches take effect based on condition codes stored in the control unit. These are bits that are modified by certain instructions; the ones you'll use most are
        addcc   reg, reg_or_imm, reg
        subcc   reg, reg_or_imm, reg

The condition codes are set based on the value of the last operand above. They are:
Z - was it zero?
N - was it negative?
V - did it generate an overflow, as a number too large to be stored in the register?
C - did it generate a carry-out?
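To make the link between the condition codes and the signed branches concrete, here is a small C sketch. It assumes 32-bit registers and models `subcc a, b, %g0` followed by a branch; the helper names (`subcc32`, `taken_bl`, and so on) are hypothetical and exist only for this illustration. It also suggests an answer to "less than what?": after a subcc, `bl` is taken when the first operand is less than the second as signed numbers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the four condition codes. */
struct icc { bool z, n, v, c; };

/* Model of "subcc a, b, %g0": compute a - b and keep only the flags. */
struct icc subcc32(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t r  = ua - ub;                       /* wraps modulo 2^32 */
    struct icc f;

    f.z = (r == 0);                              /* result was zero      */
    f.n = (r >> 31) != 0;                        /* result was negative  */
    /* Overflow: operand signs differ and the result's sign differs from a. */
    f.v = (((ua ^ ub) & (ua ^ r)) >> 31) != 0;
    f.c = ua < ub;                               /* borrow on subtraction */
    return f;
}

/* Signed branch conditions expressed on the flags. */
bool taken_bl (struct icc f) { return f.n != f.v; }
bool taken_ble(struct icc f) { return f.z || (f.n != f.v); }
bool taken_be (struct icc f) { return f.z; }
bool taken_bne(struct icc f) { return !f.z; }
bool taken_bge(struct icc f) { return f.n == f.v; }
bool taken_bg (struct icc f) { return !f.z && (f.n == f.v); }
```

Each condition is the logical negation of its complement in the table above, which is why `bl`/`bge`, `ble`/`bg`, and `be`/`bne` pair up.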
As such, it is an intermediary language. It's really not intended, in general, for programmer use.
The fact that SPARC machine language can be efficiently implemented in hardware is clear from the fact that processor clock speeds for SPARC chips are very fast. A typical SPARC CPU might run at 100 MHz. This means that the interpreter, running in hardware, can execute 100 million instructions each second.
To illustrate how SPARC assembly language is useful for implementing a wide range of control structures efficiently, we'll look at how common C control structures are implemented in SPARC assembly.
Along the way we'll get more of the flavor of assembly language programming and see more examples of how to fill delay slots. Note that in each case, we start by filling each delay slot with a nop. Once we are sure the loop is correct, we can move instructions around to replace the nops and fill the delay slots with useful instructions.
Here are the C Control structures that we'll examine:
Note: You can use these examples as templates for your own Assembly language programming; understand and copy them into your code when you want specific control structures.
/* third.c */
/* Calculate y = (x-1) * (x-7) / (x-11) for x = 0 to 10. */
#define A2 1
#define A1 7
#define A0 11

int main(void)
{
    int x = 0, y;

    do {
        y = (x - A2) * (x - A1) / (x - A0);
        x++;
    } while (x < 11);
    return 0;
}

Now let's see the assembly code corresponding to this HLL code:
/* third.m */
define(a2, 1)
define(a1, 7)
define(a0, 11)
define(x_r, l0)
define(y_r, l1)

        .global _main
_main:
        save    %sp, -64, %sp
        clr     %x_r
        .global loop
loop:
        sub     %x_r, a2, %o0
        call    .mul
        sub     %x_r, a1, %o1
        call    .div
        sub     %x_r, a0, %o1
        mov     %o0, %y_r
        add     %x_r, 1, %x_r
        subcc   %x_r, 11, %g0
        bl      loop
        nop
        mov     1, %g1
        ta      0
After passing through the macro-processor m4, we get the actual assembly code as shown below:
/* third.s */
        .global _main
_main:
        save    %sp, -64, %sp
        clr     %l0
        .global loop
loop:
        sub     %l0, 1, %o0
        call    .mul                    ! branch instruction
        sub     %l0, 7, %o1             ! delay slot
        call    .div                    ! branch instruction
        sub     %l0, 11, %o1            ! delay slot
        mov     %o0, %l1
        add     %l0, 1, %l0             ! x++
        subcc   %l0, 11, %g0            ! check for x less than 11
        bl      loop                    ! actual branch instruction
        nop
        mov     1, %g1
        ta      0
Problem: the final delay slot cannot be filled. Why?
The instructions just ahead of the branch cannot be moved into the delay slot without changing the logic of the program. The add instruction modifies register %l0, which the subcc then tests, so it must stay ahead of the subcc; and the subcc sets the condition codes that bl depends on, so it must stay ahead of the branch.
Solution: We can modify the order in which the instructions will be executed, as shown below:
/* third.1.s */
        .global _main
_main:
        save    %sp, -64, %sp
        clr     %l0
        .global loop
loop:
        sub     %l0, 1, %o0
        call    .mul
        sub     %l0, 7, %o1
        call    .div
        sub     %l0, 11, %o1
        add     %l0, 1, %l0
        subcc   %l0, 11, %g0
        bl      loop
        mov     %o0, %l1                ! nop eliminated
        mov     1, %g1
        ta      0

Here, the mov instruction has been moved into the delay slot, since doing so does not change the logic of the program.
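The control flow of third.1.s can be traced in C by writing the do-while loop with an explicit label and goto. This is only an illustrative sketch: the function name `third_do_while` and the output array are assumptions added so the values can be inspected, and C's `*` and `/` stand in for the calls to .mul and .div.

```c
/* Goto-style rendering of third.1.s; y_out[x] receives y for x = 0..10. */
void third_do_while(int y_out[11])
{
    int x = 0;                      /* clr  %l0                          */
    int o0;                         /* stands in for register %o0        */
loop:
    o0 = (x - 1) * (x - 7);         /* call .mul (delay slot loads %o1)  */
    o0 = o0 / (x - 11);             /* call .div (delay slot loads %o1)  */
    x = x + 1;                      /* add  %l0, 1, %l0                  */
    y_out[x - 1] = o0;              /* delay slot of bl: mov %o0, %l1    */
    if (x < 11)                     /* subcc %l0, 11, %g0 ; bl loop      */
        goto loop;
}
```

The store to `y_out` plays the role of the mov in the delay slot: it runs on every pass, whether or not the branch back to `loop` is taken, which is why moving it there is safe.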
While loops are the most common high level language (HLL) construct, but are actually somewhat tricky at the assembly language level. We will develop an efficient while loop in assembly language in a series of steps. Consider this fragment of C code:
Example from the book:
/* fourth.c */
while (a <= 17) {
    a = a + b;
    c++;
}
/* fourth.s */
test:
        cmp     %a_r, 17
        bg      done
        nop
        add     %a_r, %b_r, %a_r
        add     %c_r, 1, %c_r
        ba      test
        nop
done:
How many instructions are executed for each pass thru the loop? Answer: 7. First observation: we have two control transfer instructions, but only one loop. We should test the condition at the bottom of the loop, and leave the loop then.
/* fourth.1.s */
test:
        cmp     %a_r, 17
        bg      done
        nop
loop:
        add     %a_r, %b_r, %a_r
        add     %c_r, 1, %c_r
        cmp     %a_r, 17
        ble     loop
        nop
done:
Now how many instructions are in the loop? 5. That is 2 of 7 instructions saved on each pass, roughly a 29% decrease in running time.
Now, why replicate code? Simply branch to the bottom of the loop to get started:
/* fourth.2.s */
        ba      test
        nop
loop:
        add     %a_r, %b_r, %a_r
        add     %c_r, 1, %c_r
test:
        cmp     %a_r, 17
        ble     loop
        nop
done:
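The shape of fourth.2.s, with a jump to the test at the bottom before the first pass, can be mirrored in C using goto. This is a sketch for checking the logic only; the function name `while_bottom_test` and the pointer parameters are assumptions made so the result can be observed from a caller.

```c
/* Goto-style rendering of fourth.2.s: enter at the test, loop from the bottom. */
void while_bottom_test(int *a, int b, int *c)
{
    goto test;              /* ba test                          */
loop:
    *a = *a + b;            /* add %a_r, %b_r, %a_r             */
    *c = *c + 1;            /* add %c_r, 1, %c_r                */
test:
    if (*a <= 17)           /* cmp %a_r, 17 ; ble loop          */
        goto loop;
    /* done: */
}
```

Starting from a = 0, b = 5, c = 0, this leaves a = 20 and c = 4, the same result as the while loop in fourth.c; starting from a = 20 the body never runs, so the zero-iteration case is preserved.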
Decreased code length *and* the loop is more maintainable this way. Now let's remove the nop.
/* fourth.3.s */
        ba      test
        cmp     %a_r, 17
loop:
        add     %a_r, %b_r, %a_r
        add     %c_r, 1, %c_r
        cmp     %a_r, 17
test:
        ble     loop
        nop
done:
Now we would like to eliminate the nop inside the loop (more important than the first nop). Can we move an instruction from the loop body into the delay slot? Say, like this:

/* fourth.4.s */
        ba      test
        cmp     %a_r, 17
loop:
        add     %c_r, 1, %c_r
        cmp     %a_r, 17
test:
        ble     loop
        add     %a_r, %b_r, %a_r
done:

(A rule for debugging loops: always test what happens if the loop executes no times, one time, and two times.) The no times case shows the problem most clearly: a is modified, which it shouldn't be. Because a plain ble always executes its delay slot, the same extra add also happens when the branch falls through at the end of a loop that has run one or more times.
To handle this situation in loops, branch instructions can be annulled. An annulled conditional branch does not execute the delay slot instruction if the branch is not taken. The basic idea is that the delay slot can be part of the loop this way: when the branch back into the loop isn't taken, the delay slot isn't executed either.
So our final, correct and efficient version with second nop also removed is:
/* fourth.5.s */
        ba      test
        cmp     %a_r, 17
loop:
        add     %c_r, 1, %c_r
        cmp     %a_r, 17
test:
        ble,a   loop
        add     %a_r, %b_r, %a_r
done:
If the branch is not taken, then the instruction after the branch (already in the pipeline) is not executed; it is annulled.
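The effect of the annul bit can be checked with a small C model of this loop. The sketch below is an assumption-laden illustration, not a SPARC simulator: `run_loop` is a hypothetical function in which the `annul` flag selects between a plain `ble` (delay slot always executes) and `ble,a` (delay slot executes only when the branch is taken).

```c
#include <stdbool.h>

/* Model of the loop with the add in the delay slot of the conditional branch.
 * annul == false behaves like "ble loop"; annul == true like "ble,a loop". */
void run_loop(int *a, int b, int *c, bool annul)
{
    bool taken;

    taken = (*a <= 17);     /* cmp %a_r, 17 in the delay slot of "ba test"  */
    goto test;
loop:
    *c = *c + 1;            /* add %c_r, 1, %c_r                             */
    taken = (*a <= 17);     /* cmp %a_r, 17                                  */
test:
    if (taken) {            /* branch taken: delay slot executes, then loop  */
        *a = *a + b;        /* add %a_r, %b_r, %a_r                          */
        goto loop;
    }
    if (!annul)             /* plain ble: delay slot also runs on fall-through */
        *a = *a + b;
    /* done: */
}
```

With a = 20 and b = 3, the unannulled model leaves a = 23 even though the loop body never runs, while the annulled model leaves a = 20. With a = 0 and b = 5, the annulled model ends at a = 20, c = 4, matching the C while loop.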
Now we have four instructions inside the loop. This makes sense: two to do the work, one to set the condition, and one to conditionally branch.
It's probably simplest just to copy this example whenever you need a while loop --- use it as a template.
The for loop is defined in terms of a while loop:

for (ex1; ex2; ex3)
    st

This is the same as:

ex1;
while (ex2) {
    st
    ex3;
}

So, to translate something like:

/* fifth.c */
for (a = 1; a <= b; a++)
    c *= a;

We would wind up with (after filling delay slots!):

/* fifth.s */
        ba      test
        mov     1, %a_r                 ! a = 1
loop:
        call    .mul
        mov     %c_r, %o1
        mov     %o0, %c_r
        add     %a_r, 1, %a_r
test:
        cmp     %a_r, %b_r
        ble,a   loop
        mov     %a_r, %o0

Note that the overall structure of the embedded while loop is evident.
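The same for-to-while translation can be traced in C with goto, following the layout of fifth.s. This is a sketch under stated assumptions: `product_upto` is a hypothetical wrapper that takes b and the starting value of c as parameters, and C's `*` stands in for the call to .mul.

```c
/* Goto-style rendering of fifth.s: for (a = 1; a <= b; a++) c *= a; */
int product_upto(int b, int c)
{
    int a;

    a = 1;                  /* ex1: mov 1, %a_r (delay slot of "ba test") */
    goto test;              /* ba test                                    */
loop:
    c = c * a;              /* st:  call .mul, then mov %o0, %c_r         */
    a = a + 1;              /* ex3: add %a_r, 1, %a_r                     */
test:
    if (a <= b)             /* ex2: cmp %a_r, %b_r ; ble,a loop           */
        goto loop;
    return c;
}
```

Calling `product_upto(5, 1)` multiplies c by 1 through 5, and a call with b = 0 returns c unchanged because the test fails before the body ever runs.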