The `armlet` data path

The `armlet` architecture overview

Let us now start developing the armlet architecture. To enable gate-by-gate construction using minilog, we will keep the design small. While most modern designs have a 32- or 64-bit word length, the armlet will have a 16-bit word length. That is, all arithmetic and other operations on data are carried out in basic units of 16 bits.

The architecture consists of two subsystems. Namely the armlet processor itself, and a memory module for storing data.

Armlet architecture overview with 128 KiB of memory on the left and 16-bit processor on the right.

Here our focus will be on developing the processor design. The exercises will investigate how to build a memory unit in sequential logic.

In essence, the armlet processor will consist of

internal state elements (the registers),

the external interface (inputs and outputs to units external to the processor, in particular the memory unit), and

the combinational logic joining the state elements (with feedback) and the external interface.

We will develop the armlet processor in multiple stages, at each stage making further additions to the design. To give a quick birds-eye view to the entire architecture, here is a diagram of the armlet at the level of subunits:

Armlet processor architecture with load completion unit on top, arithmetic logic unit in the middle, memory interface unit at the bottom and control and execution unit on the right.

The feedbacks occur from the outputs at the bottom to matching inputs at the top in the diagram. The inputs and outputs on both sides (left and right) of the diagram are the external interface of the processor, which includes the interface to the memory module.

Our first objective will be to build the computational heart of the processor, namely the arithmetic logic unit (ALU), without paying attention to external interfaces or programmability beyond a single step (clock tick) of computation. Once the ALU is ready, we will extend the design with a simple load/store interface between registers and memory.

The armlet is fully designed, built, and simulated in gate-level sequential logic, using package minilog. This is convenient from a pedagogical perspective in two respects. First, nothing is hidden. The entire processor design is about 500 lines of Scala code, and can be found in package armlet. We can load any part or all of the design to Trigger and play with it. (Indeed, our plan is to introduce the design first by play and then by work, through programming the armlet.)

Caveat. The armlet is a substantially simplified design compared with an actual ARM architecture, whereby our main motivation and ambition is to investigate the principles underlying a computer architecture from a programmer’s perspective, while avoiding all details and complexities of a physical implementation, including related design considerations such as pipelining, or parallelism-enabling designs such as multicore or multithreading. With this motivation in mind, in fact only the basic Von Neumann architecture remains, together with a reduced instruction set that is sufficient to run user code that interfaces with memory. Also the memory interface is substantially simplified without e.g. caching or virtualization. In general, processor microarchitecture is a rapidly evolving topic requiring substantial further study beyond the basic principles covered in the armlet design.

Registers and processor state

The internal state of a processor resides mostly in its registers. Each register typically holds one word of state.

The armlet has 8 general-purpose registers, which we will somewhat unimaginatively name as $0, $1, $2, $3, $4, $5, $6, and $7. Each register has 16 bits of state.

One may view each general-purpose register as a “variable that is physically built into the processor”.

In addition to the general-purpose registers, there are two special registers, the program counter and the processor status register, and some further state elements relevant from the perspective of control and program execution. These will be discussed and implemented into the design in Round 5: A programmable computer.

The arithmetic logic unit (ALU), first version

We now start building the combinational logic of the processor, one subunit at a time, and in multiple versions to increase the complexity of the design gradually, taking sufficient time to play to familiarize ourselves with the design.

From a user perspective, the arithmetic logic unit is the part of the processor that actually takes care of the computing, one elementary operation at a time.

The arithmetic logic unit is a combinational circuit that takes as data input the current values of all the 8 registers (depicted at the top in the diagram below), carries out an arithmetic or logic operation that is configured via control inputs (depicted on the right), and outputs the values of all the registers after the operation has been carried out.

In general an operation affects at most one register. The outputs are then given as input to further functional units along the processor data path, and eventually fed back to the input elements when the clock triggers.

The arithmetic logic unit with 8 register inputs on top (marked $0 to $7), 8 register outputs at the bottom and control buses for the operation, result and the operands on the right.

For example, one possible operation of the ALU is to

“store into the register $4 the sum of registers $0 and $3”,

or what is the same in slightly more succinct form,

add $4, $0, $3

The control inputs configure the ALU to effect the desired operation so that the inputs are transformed into the desired output. For example, if we configure add $4, $0, $3 via the control inputs (depicted on the right), then the ALU will indeed effect the configured operation:

The arithmetic logic unit with a visualization of add $4, $0, $3. In this case the control buses have: operation = "add" (or actually a binary value corresponding to a predetermined value), result = 100 (or 4 in decimal) and operands are 000 and 011 (or 0 and 3 in decimal). In the output register $4 the value is equal to the sum of inputs $0 and $3.

The armlet ALU supports the following eleven different operations, which can be viewed as a reduced set of the operations that an actual ARM processor supports:

nop                   # no operation
mov $L, $A            # $L = $A (copy the value of $A to $L)
and $L, $A, $B        # $L = bitwise AND of $A and $B
ior $L, $A, $B        # $L = bitwise (inclusive) OR of $A and $B
eor $L, $A, $B        # $L = bitwise exclusive-OR of $A and $B
not $L, $A            # $L = bitwise NOT of $A
add $L, $A, $B        # $L = $A + $B
sub $L, $A, $B        # $L = $A - $B
neg $L, $A            # $L = -$A
lsl $L, $A, $B        # $L = $A shifted to the left by $B bits
lsr $L, $A, $B        # $L = $A shifted to the right by $B bits
asr $L, $A, $B        # $L = $A (arithmetically) shifted to the right by $B bits

By convention, each operation is associated with a three-letter mnemonic such as “add” or “ior” whose meaning is explained on the right in the table above. The parameters “$L”, “$A”, and “$B” each may be an arbitrary register among the registers $0, $1, $2, $3, $4, $5, $6, and $7.

Now all that is needed is the combinational logic to implement these operations, with control inputs designating what operation is to be carried out, what register(s) are used as operands ($A and $B), and what register will receive the result ($L), if any.

If you were diligently solving the circuit-design exercises in Round 3: Combinational logic, this is where your work pays off. In fact, you have all the building blocks sufficient to build the ALU. The design of the ALU just amounts to combining these building blocks appropriately and making design decisions how to structure the control inputs to configure the ALU to compute what we want with data on the registers.

Extending the ALU – loading immediate data

The first version of our ALU already offers a fair selection of different operations and versatility in register selection, but it is lacking in one aspect. Namely, how does one load input to the registers? (Remember, the registers are inside the processor – in practice we must have some mechanism for loading data into the processor to start computing with the data.)

It would appear like a good idea to have such data-loading functionality via an immediate data input to the ALU, whose use is configured via the control lines. Observe the immediate data input immed_in in the following diagram:

The arithmetic logic unit extended with an immediate operand input.

We extend the specification of the ALU with the following operations that accept immediate data:

mov $L, I             # $L = I (copy the immediate data I to $L)
add $L, $A, I         # $L = $A + I
sub $L, $A, I         # $L = $A - I
and $L, $A, I         # $L = bitwise AND of $A and I
ior $L, $A, I         # $L = bitwise (inclusive) OR of $A and I
eor $L, $A, I         # $L = bitwise exclusive OR of $A and I
lsl $L, $A, I         # $L = $A shifted to the left by I bits
lsr $L, $A, I         # $L = $A shifted to the right by I bits
asr $L, $A, I         # $L = $A (arithmetically) shifted to the right by I bits

So far so good. We have now specified what we want the ALU to do. If you look at the armlet source code, the actual ALU in the final processor design is somewhat more intricate in its inputs and outputs, but the present specification is sufficient to start playing with the design, which is what we will do right now.

`Trigger` time with the ALU

To play with the ALU circuit compile and run the file launchALUTrigger.scala in the armlet source code, or copy and paste the following to an armlet-aware console:

import armlet.*
new ALUTrigger()

A Trigger window should pop up, with a view to the ALU circuit, and with feedbacks built from the output to the input elements. In particular, the Trigger view displays only the register inputs and the control inputs to configure the operation of the ALU. This is best illustrated by a few examples that we will review below.

The register inputs to the ALU are open to toggling, so we can toggle in some numbers. So let us toggle in some numbers to the registers:

Trigger UI with the ALU. Registers $0 to $7 on top, result and operand buses, operation code bus, immediate value enabling and immediate value input bus.

(Observe in the figure above that on the right there are convenience text fields that decode the watched bus in decimal, or in symbolic form. This is a convenience feature of Trigger for manipulating a watched data bus, not part of the processor design.)

We can now configure the operation of the ALU via the control inputs.

Let us set up the operation “add $4, $0, $3”:

Registers $0 to $3 storing numbers 123, 234, 345 and 456 respectfully and the control buses signifying operation "add $4, $0, $3".

Click on the clock button to execute the operation:

The outputs feed back to the inputs, and we observe that the ALU indeed performs precisely as instructed – the sum of $0 and $3 appears in $4. Next, let us set up the operation “mov $7, 54783”:

The trigger after the previous operation. Register $4 now has value 579. Also, the operation "mov $7, 54783" is now encoded in the control buses and immediate input.

After the clock is triggered, $7 has the immediate data:

The value 54783 has now moved to register $7.

Now you are free to play with and configure the ALU as you like. All the operations specified above are available – just play with the control inputs to configure the operation, and trigger the clock to execute it. Say, playing with the shift operations can be a mesmerizing experience. Alternate between, say,

lsl $0, $0, 1

and

lsr $0, $0, 1

to witness the bits in $0 walk left and right.

What we should take home from playing with the ALU is that it is just a piece of sequential logic, conceptually not much more complex than the simple accumulator-adder that we built at the start of this round.

The ALU in Scala (**)

The ALU may look like a formidable logic design in Trigger, at the level of Scala code the design is far less intimidating. The following is essentially the complete code it takes to build the combinational logic in the ALU, assuming builders for subunits such as adders and subtractors are available as subroutines. (Note however, that the actual ALU design in package armlet has some further functionality required by material that we will cover in Round 5: A programmable computer.)

// A helper table of the arithmetic-and-logic operations and their
// operator-builder functions:

val ops_ALU = immutable.ArraySeq(
    ("mov" , (a: Bus, b: Bus) => a),
    ("and" , (a: Bus, b: Bus) => a & b),
    ("ior" , (a: Bus, b: Bus) => a | b),
    ("eor" , (a: Bus, b: Bus) => a ^ b),
    ("not" , (a: Bus, b: Bus) => ~a),
    ("neg" , (a: Bus, b: Bus) => buildNeg(a)),
    ("add" , (a: Bus, b: Bus) => buildAdd(a,b)),
    ("sub" , (a: Bus, b: Bus) => buildSub(a,b)),
    ("lsl" , (a: Bus, b: Bus) => buildLeftS(a,b)),
    ("lsr" , (a: Bus, b: Bus) => buildRightS(a,b)),
    ("asr" , (a: Bus, b: Bus) => buildARightS(a,b)))

// Load completion unit

def buildLCU(reg_in:     Seq[Bus],
             read_in:    Bus,
             prev_l_idx: Bus,
             lcu_e:      Gate) =

  // Complete load if load enabled

  val l_select = buildDecoder(prev_l_idx) && lcu_e
  val reg_out = (reg_in zip
                 l_select).map(x => ((x._1 && !x._2)|(read_in && x._2)))

  reg_out
end buildLCU

// Arithmetic logic unit

def buildALU(reg_in:    Seq[Bus],
             l_idx:     Bus,
             a_idx:     Bus,
             b_idx:     Bus,
             immed_in:  Bus,
             imm_a_e:   Gate,
             imm_b_e:   Gate,
             opers:     Seq[(Gate,(Bus,Bus)=>Bus)],
             reset_e:   Gate) =

  // Decode and select $A
  val a_select = buildDecoder(a_idx)
  var ra = buildBusSelector(reg_in, a_select)

  // Decode and select $B
  val b_select = buildDecoder(b_idx)
  var rb = buildBusSelector(reg_in, b_select)

  // Left operand -- register or immediate?
  val opa = (ra && !imm_a_e) | (immed_in && imm_a_e)

  // Right operand -- register or immediate?
  val opb = (rb && !imm_b_e) | (immed_in && imm_b_e)

  // Run the operation builder function for each operation
  val results = opers.map(_._2(opa,opb))

  // Comparison of operands
  val cmp_sgn = buildSub(opa,opb)
  val f_eq    = !(opa ^ opb).reduceLeft(_ || _)     // opA == opB
  val f_ab    = buildAB(opa,opb)                    // opA > opB  (unsigned)
  val f_gt    = !f_eq && !cmp_sgn(cmp_sgn.length-1) // opA > opB  (signed)

  // Select result-of-operation based on enable-signals for each operation
  val res = (results zip
             opers.map(_._1)).map(x => x._1 && x._2).reduceLeft(_ | _)

  // Save result to register if there is an operation that is enabled
  val save_e = opers.map(_._1).reduceLeft(_ || _)
  val l_select = buildDecoder(l_idx) && save_e

  val reg_save = (reg_in zip
                  l_select).map(x => ((x._1 && !x._2)|(res && x._2)))

  // Zap all regs on reset
  val reg_out = reg_save.map(_ && !reset_e)

  // Output new values of registers & flags
  (reg_out,
   f_eq,f_ab,f_gt,
   opa)
end buildALU

This is essentially all it takes to build the combinational logic in the ALU. The Trigger interface in object ALUTrigger requires a few more lines of code that is available in package armlet.

Now that we have reviewed the basic functionality of the arithmetic logic unit and presented its implementation, our next task is to extend the data path (which so far consists only of the arithmetic logic unit) with an interface to a memory unit. This is achieved by two subunits that occur before and after the ALU.

Memory and the `armlet` (*)

A processor in most cases works with a lot more data than can be stored in the processor registers. This data must be made available to the processor by interfacing the processor with dedicated storage units (memory) that hold the data.

The armlet is a load/store architecture, which means that data must be loaded from memory to a register, and stored from a register to memory, using dedicated operations. These operations are implemented in the memory interface unit in the data path. But before designing the unit, let us specify the interface that the unit uses to access memory.

The armlet architecture has a 16-bit word length, and accordingly it can interface with up to $2^{16} = 65536$ words of memory that reside in a memory unit outside the processor. Each such word consists of 16 bits, for a total of 128 kibibytes of main memory.

Each word in the memory unit has a unique address in the range $0,1,\ldots,65535$ by which it can be identified and its contents manipulated.

The memory unit supports two basic operations: memory read and memory write. The memory unit has the following interface:

Memory unit with controls mem_read_e, mem_write_e and mem_addr (bus) on the right along with mem_data input and read_out on the left.

Read. A memory read is issued to the memory unit by giving true value to mem_read_e and giving the address of the word that is to be read as value to the bus mem_addr. The result of the read appears in the output bus read_out of the memory unit. This bus is then fed back from the memory unit to the processor so that when the clock triggers, the result of the read appears at the processor input bus read_in.

Write. A memory write is issued to the memory unit by giving true value to mem_write_e and giving the address of the word that is to be written as value to the bus mem_addr. The data that is to be written is given as value to the bus mem_data of the memory unit. When the clock triggers, the contents of the memory word at this address are updated with the given data.

Caveats. Our armlet simulation considerably simplifies an actual processor–memory interface. An actual (main) memory unit is considerably slower than the processor in terms of the clock speed. This is compensated in practice with (a) one or more levels of cache memory that has less storage capacity than main memory, but operates faster and carries several words of data in one data block (cache line), and (b) a high-bandwidth interface to the main memory utilizing increased data rates and/or multiple channels.

Memory interface unit

Let us now design the part of the armlet processor that is responsible for interacting and interfacing with the memory unit. The memory interface unit uses the registers and its own control input to issue operations (read or write) to the memory unit.

Our design situates the memory interface unit immediately after the arithmetic logic unit in the data path of the processor. That is, the register inputs of the memory interface unit are the register outputs of the arithmetic logic unit:

Memory interface unit with the registers $0 to $7 going through it, corresponding interface with the memory unit on the left, controls such as load and store enabling on the right and also interface for load control unit (lcu_e gate and C bus) on the bottom.

The memory interface unit implements two operations, load from memory and store to memory:

loa $L, $A            # $L = [contents of memory word at address $A]
sto $L, $A            # [contents of memory word at address $L] = $A

The load completion unit (*)

In addition to the memory interface unit, the armlet design has a unit that takes care of completing a load operation from memory, that is, actually storing the data that arrives from the memory unit into the designated register that is to receive the data. The load completion unit occurs before the arithmetic logic unit on the data path. That is, the load completion unit receives its register inputs from the actual input elements (flip-flops or latches), and gives its register outputs as input to the arithmetic logic unit. In this way the result of a memory read issued in the previous clock cycle gets stored in the designated register before the ALU operates on the register in the current clock cycle.

Load completion unit with the registers going through it, read_in bus for memory reads on the left, lcu_e and C controls from memory interface unit to select where (if at all) a load to registers is to be done.

The memory interface unit is responsible for controlling the load completion unit via the enable signal lcu_e and the index C of the register that is to receive the data from the memory read, which arrives via the input bus read_in.

Summary

The sequence of the load completion unit, the arithmetic logic unit, and the memory interface unit constitute the data path of the armlet processor. That is, the sequential logic that evolves the state of the registers and memory based on control inputs. We can summarize the structure of the data path with the following diagram:

Load completion unit, arithmetic logic unit and memory interface unit put together.

The data path has an interface to the memory unit (the outputs depicted on the left) and internal control feedback from the memory interface unit to the load completion unit (bottom-left outputs to top-left inputs). The data path is controlled by control inputs (inputs depicted on the right) that configure the arithmetic logic unit and the memory interface unit.

All in all, we observe that the data path is a rather versatile machine that can maintain a state (both in its registers and in a memory unit outside the processor proper) and evolve that state, one operation at a time. Yet, recalling our Trigger time with the ALU, we observe that it is rather cumbersome to configure the control inputs to effect a particular operation. This is an issue that needs our attention as designers, which is what we proceed to do next.

The armlet data path

The armlet architecture overview