Programming the armlet
All right, that was a lengthy description of how the machine operates and how it is designed. Now it is time to get back to programming and convince ourselves that the machine actually is programmable.
Our first armlet
program
Here is our first program for the armlet
. As you can
observe, the program is not written in binary:
# Stores into $2 the sum of the first $0 positive integers
mov $0, 10 # initially 10 summands remain
mov $1, 1 # positive integers start from 1
mov $2, 0 # the sum starts at 0
@loop:
cmp $0, 0 # compare number of summands remaining with zero
beq >done # ... branch to label @done if zero summands remain
add $2, $2, $1 # add the current integer to the sum
add $1, $1, 1 # update the current integer to next
sub $0, $0, 1 # decrease number of integers to sum
jmp >loop # jump to label @loop
@done:
hlt # halt the processor
Of course, our poor armlet
has no idea about the program text
above. There is only so far 5574 gates can be pushed in terms
of reading comprehension.
However, the armlet
has been purpose-built with extreme
care to understand the following sequence of 16-bit words, stored in
the memory unit, starting from address 00000
:
00000: 0000000000011010
00001: 0000000000001010
00002: 0000000001011010
00003: 0000000000000001
00004: 0000000010011010
00005: 0000000000000000
00006: 0000000000100011
00007: 0000000000000000
00008: 0000000000100101
00009: 0000000000010001
00010: 0001010010000110
00011: 0000001001011110
00012: 0000000000000001
00013: 0000000000011111
00014: 0000000000000001
00015: 0000000000100100
00016: 0000000000000110
00017: 0000000000111111
Indeed, after exactly 120 clock ticks, the armlet
halts
with the value \(55 = 1 + 2 + \ldots + 10\) in $2
.
Assembly language
In the early days of computing it used to be the case that coding meant actually the act of toggling in the program to a memory unit, in binary (or other) code, using physical switches or other physical means to interact with the memory unit. (See here.) Fortunately, these days are past.
The program text above is written in the armlet
assembly language, which is a symbolic,
human-readable, and human-comment-decorated representation of the
binary code, with minor convenience features. One of the convenience
features is the possibility to use symbolic labels such as “@done
”
and “@loop
” to label addresses in the eventual binary, and
the possibility to refer to these addresses using symbolic references
such as “>done
” and “>loop
”.
The assembler
Once the program text is ready, it is then the responsibility of a
program, the armlet
assembler to automatically
turn, or assemble, the decorated assembly-language program text
into the binary code that the armlet
understands if it is loaded
(in precise terms, stored) into the memory unit.
Again our pedagogical guideline is to hide nothing, so we observe
that the armlet
assembler is less than 300 lines of Scala code
and is part of package armlet
.
The input to the assembler is a string of program text and the output
is a sequence of binary words that will put the armlet
to work
if we so choose.
A playful introduction to Ticker
It is about time for a bit of play. And now that the armlet
is ready,
perhaps it is time to improve our play beyond toggling.
Towards this end, we are pleased to introduce Ticker
,
a complete interactive program to play with the armlet
architecture,
either one clock tick at a time, or (one expects) at least at
a few thousand simulated clock ticks per second.
Copy and paste the following into a console:
import armlet._ // Packages in Round 4 and Round 5
new Ticker()
A Ticker
interface should pop up. The top left text area accepts
program text in armlet
assembly language.
So let us type in some program text, that is, our first armlet
program above:
We can now click “>> Assemble >>
” to turn the program text
into a binary.
And indeed the binary appears in the loader window on the top right:
If we click “>> Load >>
”, the binary is loaded into the
memory unit and the armlet
is reset.
We are ready to tick or run:
If we choose to run, we get a halt at tick 120:
Pressing “[[ Stop ]]
” will bring back the Assembler/Loader
view. Incidentally, you can change between views by clicking on the
titles “Assembler/Loader” and “Processor/Memory”.
You are now cordially invited to play with the armlet
.
Watch the processor obediently execute whatever you want, with full
view of what happens, one tick at a time, or at a few kHz until halt.
Beyond the first example
Let us now practice programming the armlet
a little bit to gain
confidence that the design is fully programmable and not just a useless
toy. Indeed, even if the armlet
may appear like a humble design,
it is in fact equally capable as any computer, and only restricted by
the limitations of the 16-bit design and the \(2^{16} = 65536\) words
of memory it can address. With wider word length to enable the processor
to address more memory, the armlet
could simulate any modern
computer. (See
here
for a more detailed discussion.)
Let us now look at a few examples how to program the armlet
.
Again we start simple and get more ambitious gradually.
Scanning data
Computers are used to process information, so it is perhaps natural to start with code that searches and indexes numeric data.
To prepare a numeric array for input to the armlet
, we
can declare it in the assembly language as follows:
# Here is my data, which consists of 20 words:
@my_data:
%data 296, 573, 291, 415, 825, 674, 426, 632, 793, 701, 884, 1, 989, 912, 254, 869, 462, 296, 767, 220
In other words, we have a label “my_data
” for the memory address
that contains the data, which is a sequence of (20) words that we declare with
the %data
keyword.
We should also record the length of the data so that our eventual program will know how many words of data it has to process:
# Here is my data:
@length_of_my_data:
%data 20
@my_data:
%data 296, 573, 291, 415, 825, 674, 426, 632, 793, 701, 884, 1, 989, 912, 254, 869, 462, 296, 767, 220
The declarations above will now cause the assembler to include 21 words of
data into the compiled (assembled) binary. The loader will then load the
data to the armlet
for whatever processing we like.
So let us get started with something simple. Let us compute the sum
of the data in $0
and then halt the processor.
Here we go:
# Computes (to $0) the sum of the data and then halts.
# Let us first set things up ...
mov $0, 0 # the sum starts with 0
mov $2, >length_of_my_data # set up memory address where to get length
loa $2, $2 # load the length from memory to $2
mov $1, >my_data # set up memory address where to get the data
# Now let us loop through the data and accumulate the sum ...
@sum_loop:
cmp $2, 0 # compare length with 0
beq >done # ... branch to label @done if $2 == 0
loa $3, $1 # load a data item from memory
add $0, $0, $3 # accumulate the sum in $0
add $1, $1, 1 # advance to next data item
sub $2, $2, 1 # decrement length by one
jmp >sum_loop # continue the summation
# ... until we are done
@done:
hlt # the processor stops here
# Our data follows the program code in the binary ...
# Here is my data:
@length_of_my_data:
%data 20
@my_data:
%data 296, 573, 291, 415, 825, 674, 426, 632, 793, 701, 884, 1, 989, 912, 254, 869, 462, 296, 767, 220
Using Ticker
we find that the processor halts at tick 272, at
which point we find 11280 in $0
. Thus, we find that the armlet
has no problem with simple processing of numeric data.
So let us get a little more ambitious. Say, let us find the maximum value in the data:
# Finds the maximum value ($0) and its position ($1) in data and then halts.
# Let us first set things up ...
mov $3, >length_of_my_data # set up memory address where to get length
loa $3, $3 # load the length from memory to $3
mov $2, >my_data # set up memory address where to get the data
cmp $3, 0 # compare length with 0
bab >have_data # ... if length > 0, branch to process data
hlt # ... otherwise halt immediately
@have_data:
mov $4, $2 # set up first address
add $5, $4, $3 # set up last address
loa $0, $4 # set up a candidate maximum
sub $1, $4, $2 # set up position of candidate maximum
add $4, $4, 1 # advance current address
cmp $4, $5 # are we at the last address?
bbw >max_scan_loop # ... if not, continue scanning
jmp >done # ... otherwise we are done
# ... and then scan for the maximum, updating the candidate maximum
@max_scan_loop:
loa $6, $4 # load current data item
cmp $6, $0 # compare current with candidate maximum
bbe >no_update # ... if current <= candidate, no update
mov $0, $6 # ... update candidate maximum
sub $1, $4, $2 # ... and its position
@no_update:
add $4, $4, 1 # advance current address
cmp $4, $5 # compare current address with last address
bbw >max_scan_loop # ... if curr addr < last addr, continue scan
# ... until we are done
@done:
hlt # the processor stops here
# Our data follows the program code in the binary ...
# Here is my data:
@length_of_my_data:
%data 20
@my_data:
%data 296, 573, 291, 415, 825, 674, 426, 632, 793, 701, 884, 1, 989, 912, 254, 869, 462, 296, 767, 220
Executing the code above, a halt occurs at tick 218, at which point we
find 989 in $0
and 12 in $1
. It would thus appear that
the armlet
is perfectly able to do minor tasks. We also observe
that the structure of the code has an effect on performance. In particular,
the simple summation took 272 ticks, whereas the maximum-finding took
only 218 ticks by a slightly more careful use of the registers and
structuring of the loops.
An important lesson to be learned while programming the armlet
in assembly language is that it is not particularly pleasant to program
in assembly language from the programmer’s perspective. Indeed, the machine
will do exactly what we tell it to do, but there are few conveniences
available. The programmer must mentally keep track of what is in the
registers at any given point of execution and must access memory with
explicit addresses. A much greater level of programmer productivity and
program portability is achieved by using a high-level programming language
such as Scala, even if program performance may, and in general is, lost
in terms of increased running time or increased memory usage.
Yet an important fact to keep in mind is that all high-level programming
tools are simply convenience tools to assist the programmer. At heart,
every modern computer processor is, from a qualitative perspective, not
substantially different from the armlet
. Even the Scala programming
language is just a convenience tool for translating the programmer’s
high-level instructions for execution on physical hardware.
Sorting data
Scanning through the data may not appear that impressive, so let us turn to
a somewhat more challenging task, sorting. In particular, with the
armlet
we have neither convenience tools nor existing code available
to help us with sorting, so we must implement everything ourselves,
from scratch.
There are a myriad of different strategies to sort data. Since we already know how to scan data for the maximum element, perhaps the sorting strategy that is easiest to implement is to repeatedly find the maximum element in the current array, transpose the maximum element with the last element in the current array, and disregard the last element of the current array from consideration. The data is sorted when the current array is empty. This strategy is called selection sort since we repeatedly select the maximum element.
Here is an implementation of selection sort for the armlet
:
# Selection sort for the armlet
mov $0, >data # set up start address
mov $1, >datalen # set up address where to get length
loa $1, $1 # load length
add $1, $1, $0
sub $1, $1, 1 # address of the last data item
@select_loop:
cmp $0, $1 # compare start address and last address
bae >done # ... if start addr >= last addr, we are done
loa $2, $0 # set up a candidate maximum
mov $3, $0 # set up address of candidate maximum
add $4, $0, 1 # set up current address
@max_scan_loop:
cmp $4, $1 # compare current address with last address
bab >scan_done # ... if curr addr > last addr, we have the max
loa $5, $4 # load current data item
cmp $5, $2 # compare current item with candidate max
bbe >no_update # ... if current <= candidate, no need to update
mov $2, $5 # update candidate maximum
mov $3, $4 # update address of candidate maximum
@no_update:
add $4, $4, 1 # advance to next element
jmp >max_scan_loop # continue scanning
@scan_done:
# at this point $2 is the max item and $3 is its addr in array
# transpose max item and last item in current array ...
sub $4, $4, 1 # address of last item
loa $5, $4 # load last item
sto $4, $2 # store max to last position
sto $3, $5 # store last item to max position
sub $1, $1, 1 # remove last item (now =max) from consideration
jmp >select_loop # continue sorting the remaining array
@done:
hlt
@datalen:
%data 100
@data:
%randperm 1234567, 100
Observe the assembler directive %randperm 1234567, 100
.
This is a convenience tool in the assembler to produce a random permutation
of the integers \(1, 2, \ldots, 100\) using the random seed 1234567 to
seed the Scala pseudorandom generator.
That is to say, the assembler will create the pseudorandom permutation
that is then included in the binary.
The program halts at tick 62305, having sorted the random permutation
into the sorted order \(1, 2, \ldots, 100\).
(Do check this with the memory view in Ticker
!)
But that was only 100 integers. Hardly something to write home about.
Let us really push the armlet
now. How about a permutation with,
say, 30000 integers? (Remember that the armlet has only 65536 words of
memory!)
Incidentally, selection sorting is not a good idea, since using selection sort we need to scan 30000 times for the maximum, and scan number \(j = 1, 2, \ldots, 30000\) requires us to iterate over \(30000 - j\) entries, implying that no matter what, selection-sorting 30000 words will thus require about 450 million ticks, and in fact many more. This will take a macroscopic amount of time even if executed on a serious computer.
Below is an implementation of a different sorting strategy, namely a heap sort, which will be studied in more detail later in the context of algorithms and data structures. For now, all we need to observe is that the code looks to be doing what appears to be something more serious:
# Heap sort for the armlet
mov $6, >datalen
loa $6, $6
mov $7, >data
sub $7, $7, 1 # index origin is 1
mov $0, 1 # start heapifying at the origin
@heapify:
cmp $0, $6 # done with heapify?
bab >getmax # ... yes
mov $1, $0 # current index
add $3, $7, $1 # address of current
loa $2, $3 # load current value
@upheap_loop:
cmp $1, 1 # at heap top?
bbe >heapify_next # ... yes
lsr $3, $1, 1 # index of parent
add $4, $7, $3 # address of parent
loa $4, $4 # load parent
cmp $4, $2 # compare parent and current
bae >heapify_next # ... heap property holds
add $5, $7, $1 # address of current
sto $5, $4 # store parent to current
mov $1, $3 # move to parent
jmp >upheap_loop
@heapify_next:
add $3, $7, $1 # address of current
sto $3, $2 # store current value here
add $0, $0, 1
jmp >heapify
@getmax:
cmp $6, 0 # done with max extraction?
beq >done # yes
add $0, $7, 1 # address of heap top
loa $3, $0 # load heap top
add $2, $7, $6 # address of last element
loa $1, $2 # load last element
sto $2, $3 # heap top goes to last element
sub $6, $6, 1 # decrease heap size
mov $0, 1 # start downheap, $1 is current
@downheap_loop:
lsl $2, $0, 1 # left child index
cmp $2, $6 # is there a left child?
bab >downheap_done # ... no
add $4, $2, $7 # left child address
loa $4, $4 # load left child
add $3, $2, 1 # right child index
cmp $3, $6 # is there a right child?
bab >downheap_left # ... no
add $5, $3, $7 # right child address
loa $5, $5 # load right child
cmp $4, $5 # compare children
bae >downheap_left # ... go left
@downheap_right:
cmp $1, $5 # compare current and right
bae >downheap_done # ... ok to insert at current
add $2, $7, $0 # current address
sto $2, $5 # insert right to current
mov $0, $3 # move to right
jmp >downheap_loop
@downheap_left:
cmp $1, $4 # compare current and left
bae >downheap_done # ... ok to insert at current
add $3, $7, $0 # current address
sto $3, $4 # insert left to current
mov $0, $2 # move to left
jmp >downheap_loop
@downheap_done:
add $2, $7, $0 # current address
sto $2, $1 # store current here
jmp >getmax
@done:
hlt
@datalen:
%data 30000
@data:
%randperm 1234567, 30000
In the last line above we observe a random permutation with 30000 elements.
It will take some time for the armlet
to finish the sort, in fact
a halt will occur at tick 12647480, or in about 13 million ticks for this
particular random permutation.
At this point it should be immediate that the armlet
with
its 5574 gates is able to do serious work if we are serious about programming
it. The exercises will present some further challenges in this regard.
Some conventions and hints for the exercises
(You can download a quick-reference for armlet here
)
Let us first summarize the armlet
instruction set.
Here are the arithmetic and logic instructions that operate on registers:
nop # no operation
mov $L, $A # $L = $A (copy the value of $A to $L)
and $L, $A, $B # $L = bitwise AND of $A and $B
ior $L, $A, $B # $L = bitwise (inclusive) OR of $A and $B
eor $L, $A, $B # $L = bitwise exclusive-OR of $A and $B
not $L, $A # $L = bitwise NOT of $A
add $L, $A, $B # $L = $A + $B
sub $L, $A, $B # $L = $A - $B
neg $L, $A # $L = -$A
lsl $L, $A, $B # $L = $A shifted to the left by $B bits
lsr $L, $A, $B # $L = $A shifted to the right by $B bits
asr $L, $A, $B # $L = $A (arithmetically) shifted to the right by $B bits
Here are the arithmetic and logic instructions that operate on registers and take immediate data:
mov $L, I # $L = I (copy the immediate data I to $L)
add $L, $A, I # $L = $A + I
sub $L, $A, I # $L = $A - I
and $L, $A, I # $L = bitwise AND of $A and I
ior $L, $A, I # $L = bitwise (inclusive) OR of $A and I
eor $L, $A, I # $L = bitwise exclusive OR of $A and I
lsl $L, $A, I # $L = $A shifted to the left by I bits
lsr $L, $A, I # $L = $A shifted to the right by I bits
asr $L, $A, I # $L = $A (arithmetically) shifted to the right by I bits
Here are the instructions for accessing (loading data from and storing data to) memory:
loa $L, $A # $L = [contents of memory word at address $A]
sto $L, $A # [contents of memory word at address $L] = $A
Here are the comparison instructions whose result controls what happens when a branch instruction is executed:
cmp $A, $B # compare $A (left) and $B (right)
cmp $A, I # compare $A (left) and I (right)
Here are the jump and branch instructions that take a register operand (the address where we jump or branch, if the branch condition is met):
jmp $A # jump to address $A
beq $A # ... if left == right (in the most recent comparison)
bne $A # ... if left != right
bgt $A # ... if left > right (signed)
blt $A # ... if left < right (signed)
bge $A # ... if left >= right (signed)
ble $A # ... if left <= right (signed)
bab $A # ... if left > right (unsigned)
bbw $A # ... if left < right (unsigned)
bae $A # ... if left >= right (unsigned)
bbe $A # ... if left <= right (unsigned)
Here are the jump and branch instructions that take an immediate operand (the address where we jump or branch, if the branch condition is met):
jmp I # jump to address I
beq I # ... if left == right (in the most recent comparison)
bne I # ... if left != right
bgt I # ... if left > right (signed)
blt I # ... if left < right (signed)
bge I # ... if left >= right (signed)
ble I # ... if left <= right (signed)
bab I # ... if left > right (unsigned)
bbw I # ... if left < right (unsigned)
bae I # ... if left >= right (unsigned)
bbe I # ... if left <= right (unsigned)
The trap instruction halts the execution temporarily so that you may observe the register contents and memory for debugging purposes when your program is running:
trp # trap (break out of execution for debugging)
Finally, the halt instruction stops the processor:
hlt # halt execution
All right. That was the entire armlet
instruction set.
Let us now outline some further conventions and hints.
When programming in assembly language, one of the recurrent curses is that
you run out of registers to hold the intermediate values in your computation.
In this case it is good to remember that the armlet
has 65536 words of
memory available. Just be careful not to overwrite either the code of your
program or its input data!
Perhaps for the exercises we can agree on a convention that words at addresses from 60000 to 65535 are free for your code to use as scratch memory as you like. In fact, a recommended approach is to set up a stack where you can push values and pop them back in the reverse order they were pushed. Here is some sample code towards this end:
# Example armlet assembly code that uses a stack to save and recover contents
# of registers
#
# In the start of our code we reserve one of the registers, say $7,
# to act as a __stack pointer__ that contains the address of the
# first free slot in the stack.
#
# Let us start the stack at address 65535, and grow the stack towards
# lower addresses.
#
mov $7, 65535 # set up a stack
#
# ... Now suppose we are deep in user code, and suddenly need three
# registers, but can spare none ... How to get free registers?
#
# We can push the contents of any three registers to the stack.
# Say, let us push $0, $1, and $2.
#
sto $7, $0 # push $0 into the stack
sub $7, $7, 1
sto $7, $1 # push $1 into the stack
sub $7, $7, 1
sto $7, $2 # push $2 into the stack
sub $7, $7, 1
#
# Observe that after each store ("sto") to the stack, we update the
# stack pointer $7 so that it always points to a free slot in the stack.
# The combination of these operations (sto & sub) is called a "push" to
# the stack.
#
# ... Now we can use $0, $1, $2 as we like since their values are
# safely stored in the stack ...
#
# Suppose now we need the values back. In this case we simply pop
# the values out, one by one, in reverse order.
#
add $7, $7, 1
loa $2, $7 # pop $2 from the stack
add $7, $7, 1
loa $1, $7 # pop $1 from the stack
add $7, $7, 1
loa $0, $7 # pop $0 from the stack
#
# Observe that to correctly unwind the sequence of pushes, we need to
# update the stack pointer first and only then load ("loa") from the stack.
# The combination of these operations (add & loa) is called a "pop" from
# the stack.
#
# ... We can now resume computation with the original values of $0, $1, $2
# as if nothing happened ...
#