From Scala to hardware (*)

Scala code loop calculating the sum of integers 1 to 100 000 000 and from this code an arrow pointing to a hex-string and an Intel CPU.

From Scala to bytecode for the Java Virtual Machine (*)

Our quest for Module I: The mystery of the computer, obtaining an understanding of the computer, is nearing a conclusion. Recall our motivation for this module, namely the brief fragment of Scala that took the sum of the first four billion positive integers.

Let us wrap a roughly similar fragment of code (without the timing printouts) inside a function and an object that we can compile:

object test {
  def main(args: Array[String]) : Unit = {
    var i = 1L
    var s = 0L
    while i <= 100000000L do {
      s = s + i
      i = i + 1
    }
  }
}

Note

If you want to try the examples in this section yourself you will need the Scala compiler, scalac installed on your computer. Alternatively you can build them with sbt or Visual Code, in which case you can find the .class files in the target directories.

Assuming that the above is saved to a file test.scala, we can now compile the code with scalac the Scala compiler (on the command line):

> scalac test.scala

This will produce the compiled class files that can be loaded/imported for execution on the Java Virtual Machine (JVM). Looking in the directory you will see three new files test$.class, test.class, and test.tasty. The .tasty file is specific to the Scala 3 compiler, and contains type information (compiling to Java byte code looses some high level type information; you can read more about it here), while the .class files represents the program as understood by a Java Virtual Machine.

(The reason there are two class files created from a single object is due to how Scala produces Java classes. One is for the class information, and the other for the object. We will be looking at test$.class which corresponds to the bulk of the program above, but you can use the same method to look inside the other file if you like.)

So, how do these class files look like, and how does the Java Virtual Machine operate?

Let us use the javap utility program to disassemble the bytecode (the instructions) for the JVM and inspect the compiled class file test$.class to see what it looks like, in more human-readable form. The following command will disassemble the file test$.class that we compiled to above:

javap -private -c test$ > test$.javap

(Note that if you do not have JDK in your terminal path the command javap might not be available.) The switch -private means show all members, and -c means disassemble. test$ is the name of the class (in this case, the file without the .class ending) we want to disassemble. The command will print out the result to the terminal, so the last part > test$.javap will redirect this text into a text file called test$.javap.

Here is the content of test$.javap:

Compiled from "test.scala"
public final class test$ implements java.io.Serializable {
  public static final test$ MODULE$;

  private test$();
    Code:
       0: aload_0
       1: invokespecial #13                 // Method java/lang/Object."<init>":()V
       4: return

  public static {};
    Code:
       0: new           #2                  // class test$
       3: dup
       4: invokespecial #16                 // Method "<init>":()V
       7: putstatic     #18                 // Field MODULE$:Ltest$;
      10: return

  private java.lang.Object writeReplace();
    Code:
       0: new           #22                 // class scala/runtime/ModuleSerializationProxy
       3: dup
       4: ldc           #2                  // class test$
       6: invokespecial #25                 // Method scala/runtime/ModuleSerializationProxy."<init>":(Ljava/lang/Class;)V
       9: areturn

  public void main(java.lang.String[]);
    Code:
       0: lconst_1
       1: lstore_2
       2: lconst_0
       3: lstore        4
       5: lload_2
       6: ldc2_w        #29                 // long 100000000l
       9: lcmp
      10: ifgt          26
      13: lload         4
      15: lload_2
      16: ladd
      17: lstore        4
      19: lload_2
      20: lconst_1
      21: ladd
      22: lstore_2
      23: goto          5
      26: return
}

The disassembled code appears, of course, like something that is really intended for a machine. Yet, given our hard-won knowledge of the armlet architecture, and our knowledge of the Scala source code, we should not be completely lost with the disassembler output. A helpful resource is this Wikipedia List of Java bytecode instructions.

So let us put some effort into reading the output. Take a careful look at the disassembled code above. Look at the mnemonics such as “lload”. We observe that there are

  • load instructions (e.g. “lload”),

  • constant-loading (that is, immediate-data-loading) instructions (e.g. “lconst_1” and “ldc2_w”),

  • store instructions (e.g. “lstore”),

  • comparison instructions (e.g. “lcmp”),

  • arithmetic instructions (e.g. “ladd”), and

  • control instructions for branching (e.g. “ifgt”) and jumping (e.g. “goto”).

In fact, after some reflection, it looks like the code is qualitatively not too far from armlet code!

From an architectural perspective there is one key difference between the JVM and the armlet. Namely, the armlet is a register machine that computes with register contents, whereas the JVM is a stack machine that computes with the contents of the topmost elements of an operand stack.

Let us decorate the disassembled code of function main a little bit to see what is going on. For ease of study, we also recall the source code again:

object test {
  def main(args: Array[String]) : Unit = {
    var i = 1L
    var s = 0L
    while i <= 100000000L do {
      s = s + i
      i = i + 1
    }
  }
}

Below is the line-by-line decorated disassembly. We display both the operation and the contents of the operand stack after each operation is executed. The stack top is always the rightmost element (if any). Here we go:

public void main(java.lang.String[]);
  Code:              // Operation                            Stack contents (-->top)
                     // -----------------------------------  -----------------------
   0:   lconst_1     // Push constant 1L to stack            1L
   1:   lstore_2     // Store stack top to local var 2 (=i)  (empty)
   2:   lconst_0     // Push constant 0L to stack            0L
   3:   lstore 4     // Store stack top to local var 4 (=s)  (empty)
   5:   lload_2      // Load local var 2 to stack            i
   6:   ldc2_w #29   // Load run-time constant #29           i 100000000L
   9:   lcmp         // Compare top elements of stack        (result of comparison)
   10:  ifgt 26      // Branch to 26 if i > 100000000L       (empty)
   13:  lload 4      // Load local var 4 to stack            s
   15:  lload_2      // Load local var 2 to stack            s i
   16:  ladd         // Add top elements                     s+i
   17:  lstore 4     // Store stack top to local var 4 (=s)  (empty)
   19:  lload_2      // Load local var 2 to stack            i
   20:  lconst_1     // Push constant 1L to stack            i 1L
   21:  ladd         // Add top elements                     i+1L
   22:  lstore_2     // Store stack top to local var 2 (=i)  (empty)
   23:  goto 5       // Jump to 5                            (empty)
   26:  return       // Return control to caller             (empty)

We observe that apart from the stack-based instead of register-based computations, the JVM and the armlet are quite close relatives. (In fact, save for the limited 16-bit word length of the armlet one machine can simulate the other with ease.)

A second example (*)

Let us strengthen this intuition by writing a selection sort in Scala (recall our armlet selection sort earlier) and disassembling the resulting JVM bytecode. Here we go:

object test2 {
  def selnsort(aa: Array[Int]) : Unit = {
    val n = aa.length
    var j = n-1
    while j > 0 do {
      var candmaxidx = 0
      var i = 1
      while i <= j do {
        if aa(i) > aa(candmaxidx) then {
          candmaxidx = i
        }
        i += 1
      }
      val t = aa(j)
      aa(j) = aa(candmaxidx)
      aa(candmaxidx) = t
      j -= 1
    }
  }

  def main(args: Array[String]) : Unit = {
    val n = 10
    val aa = new Array[Int](n)
    var i = 0
    while i < n do {
      aa(i) = n-i
      i += 1
    }
    selnsort(aa)
  }
}

And here is the javap-disassembled file test2$.class, which we have decorated with comments:

Compiled from "test2.scala"
public final class test2$ implements java.io.Serializable {
  public static final test2$ MODULE$;

  private test2$();
    Code:
       0: aload_0
       1: invokespecial #13                 // Method java/lang/Object."<init>":()V
       4: return

  public static {};
    Code:
       0: new           #2                  // class test2$
       3: dup
       4: invokespecial #16                 // Method "<init>":()V
       7: putstatic     #18                 // Field MODULE$:Ltest2$;
      10: return

  private java.lang.Object writeReplace();
    Code:
       0: new           #22                 // class scala/runtime/ModuleSerializationProxy
       3: dup
       4: ldc           #2                  // class test2$
       6: invokespecial #25                 // Method scala/runtime/ModuleSerializationProxy."<init>":(Ljava/lang/Class;)V
       9: areturn

  public void selnsort(int[]);
    Code:                         // Operation                            Stack contents (-->top)
    				  // -----------------------------------  -----------------------
       0: aload_1		  // Load local var 1 (=aa)               aa     
       1: arraylength		  // Get length of array                  aa.leng
       2: istore_2		  // Store stack top to local var 2 (=n)  (empty)
       3: iload_2		  // Load local var 2 (=n)                n      
       4: iconst_1		  // Push constant 1 to stack             n 1    
       5: isub			  // Subtract                             n-1    
       6: istore_3		  // Store to local var 3 (=j)            (empty)
       7: iload_3		  // Load local var 3 (=j)                j      
       8: iconst_0		  // Push constant 1 to stack             j 0    
       9: if_icmple     69	  // Branch to 69 if j <= 0               (empty)
      12: iconst_0		  // Push constant 1 to stack             0      
      13: istore        4	  // Store to local var 4 (=candmaxidx)   (empty)
      15: iconst_1		  // Push constant 1 to stack             1      
      16: istore        5	  // Store to local var 5 (=i)            (empty)
      18: iload         5	  // Load local var 5 (=i)                i      
      20: iload_3		  // Load local var 3 (=j)                i j    
      21: if_icmpgt     48	  // Branch to 48 if i > j                (empty)
      24: aload_1		  // Load local var 1 (=aa)               aa     
      25: iload         5	  // Load local var 5 (=i)                aa i   
      27: iaload		  // Load array element                   aa(i)  
      28: aload_1		  // Load local var 1 (=aa)               aa(i) aa
      29: iload         4	  // Load local var 4 (=candmaxidx)       aa(i) aa candmaxidx
      31: iaload		  // Load array element                   aa(i) aa(candmaxidx)
      32: if_icmple     39	  // Br. to 39 if aa(i)<=aa(candmaxidx)   (empty)
      35: iload         5	  // Load local var 5 (=i)                i      
      37: istore        4	  // Store to local var 4 (=candmaxidx)   (empty)
      39: iinc          5, 1	  // Increase local var 5 by 1 (=i)     
      42: goto          18	  // Jump to 18                           (empty)
      45: aload_1		  // Load local var 1 (=aa)               aa     
      46: iload_3		  // Load local var 3 (=j)                aa j   
      47: iaload		  // Load array element                   aa(j)  
      48: istore        6	  // Store to local var 6 (=t)            (empty)
      50: aload_1		  // Load local var 1 (=aa)               aa     
      51: iload_3		  // Load local var 3 (=j)                aa j   
      52: aload_1		  // Load local var 1 (=aa)               aa j aa
      53: iload         4	  // Load local var 4 (=candmaxidx)       aa j aa candmaxidx
      55: iaload		  // Load array element                   aa j aa(candmaxidx)
      56: iastore		  // Store array element                  (empty)
      57: aload_1		  // Load local var 1 (=aa)               aa     
      58: iload         4	  // Load local var 4 (=candmaxidx)       aa candmaxidx
      60: iload         6	  // Load local var 6 (=t)                aa candmaxidx
      62: iastore		  // Store array element                  (empty)
      63: iinc          3, -1	  // Increase local var 3 (=j) by 1       (empty)
      66: goto          7	  // Jump to 7                            (empty)
      69: return		  // Return to caller                     (empty) 

  public void main(java.lang.String[]);
    Code:			  
       0: bipush        10	  // Push constant 10 to stack            10         
       2: istore_2		  // Store to local var 2 (=n)            (empty)    
       3: iload_2		  // Load local var 2 (=n)                n          
       4: newarray       int	  // Create new int array (=aa)           aa         
       6: checkcast     #39       // Check type of stack top (class "[I") aa
       9: astore_3                // Store to local var 3 (=aa)           (empty) 
      10: iconst_0		  // Push constant 0 to stack             0       
      11: istore        4	  // Store to local var 4 (=i)            (empty) 
      13: iload         4	  // Load local var 4 (=i)                i       
      15: iload_2		  // Load local var 2 (=n)                i n     
      16: if_icmpge     33	  // Branch to 33 if i >= n               (empty) 
      19: aload_3		  // Load local var 3 (=aa)               aa      
      20: iload         4	  // Load local var 4 (=i)                aa i    
      22: iload_2		  // Load local var 2 (=n)                aa i n  
      23: iload         4	  // Load local var 4 (=i)                aa i n i
      25: isub			  // Subtract                             aa i n-i
      26: iastore		  // Store array element                  (empty) 
      27: iinc          4, 1	  // Increase local var 4 (=i) by 1       (empty)
      30: goto          13	  // Jump to 13                           (empty)
      33: aload_0		  // Load local var 0 (=test2)            test2   
      34: aload_3		  // Load local var 3 (=aa)               test2 aa
      35: invokevirtual #41       // Invoke method selnsort               selnsort(aa)
      38: return		  // Return to caller                     (empty)
}				  

Again we observe that the Scala compiler has turned what we wrote in Scala into an implementation of the insertion sort in bytecode.

Bytecode beyond our examples (**)

More generally, Chapter 6 of the JVM Specification contains a complete description of the instructions supported by the JVM. In particular, we observe that there are dedicated instructions to operate on operands of different types, which are indicated by the prefixes of the instruction mnemonics. For example, iload loads an Int, whereas lload loads a Long, and aload loads a reference to (the memory address of) an object.

At this point we encourage you to compile package armlet or some other more substantial program written in Scala. Then disassemble some class files. What you will discover in the output of the disassembler is bytecode that implements exactly what the programmer has instructed in Scala.

Thus, what we discover is that the Scala compiler is a program for turning program text written in the Scala programming language into bytecode for the Java Virtual Machine.

From bytecode to hardware (*)

An implementation of the Java Virtual Machine is a program that executes the compiled bytecode on actual hardware (or in yet another layer of virtualization/simulation).

Before looking at what a JVM implementation does to the bytecode, let us recall the original Scala program and its transformation to bytecode via the Scala compiler:

Scala code loop calculating the sum of integers 1 to 100 000 000 and the corresponding java bytecode.

Let us assume the JVM implementation prepares the bytecode for execution on actual physical hardware. For example, let us assume the target hardware is a 64-bit Intel microarchitecture.

The JVM implementation optimizes the bytecode and then transforms it to machine language of the target architecture. For clarity let us first display the machine language in symbolic form:

The same java bytecode and its corresponding machine language instructions. The initialization part is xorq %rax, %rax; incq %rax; xorq %rdx, %rdx. The loop L1 is addq %rax, %rdx; incq %rax; cmpq 100 000 001, %rax; jne L1.

Observe how the Intel 64 machine language again bears resemblance to our humble armlet machine language. Of course a physical platform operates in binary, so the actual machine language is not symbolic but binary, which we display in hexadecimal below:

The machine code in binary format: 4831C0; 48FFC0; 4831D2; 4801C2; 48FFC0; 483D01E1F505; 75F2;

So all in all, our original Scala program gets transformed into the following \(23\times 8=184\) -bit binary sequence:

Hex string 4831C048FFC04831D24801C248FFC0483D01E1F50575F2

This binary then gets executed by the actual physical processor (say, an Intel Core i7) at a rate of several billion instructions per second.