Collections

A prerequisite to becoming a productive programmer is to have a firm understading of the abstractions that are available to take care of the mundane tasks. This understanding ideally includes an understanding how these abstractions are implemented, in software, making efficient use of the hardware.

In programming terms a very frequent task is that we have a collection of objects of some type, and we must transform that collection in a more or less routine manner to yield a new collection with the desired objects and/or properties.

Collections, in Scala. In this round our focus is on the collection types available in Scala, as provided by the scala.collection package. This package offers many convenient classes and methods for storing and manipulating collections of objects; the most commonly used collection types are

Our further focus will be on functional programming style. Indeed, functional programming enables us to write many common transformations compactly and clearly, and furthermore is in effect required to conveniently express what we want to happen to the objects in our collections as they cascade through a sequence of transformations. (Recall cascading from Round 1: Warmup.)

Our aim is, however, not to exhaustively cover all the classes and methods in the scala.collection package in this round (indeed, that would be quite impossible). Rather we seek to review some of the central features of this package and leave the rest for further study. Here are some links for further, detailed studies:

Other programming languages (*). While our focus is on Scala, essentially every programming language comes with a “standard library” that provides implementations of the most commonly used abstractions and their supporting functionality. For example,

Java has the java.util package, and
C++ comes with the C++ standard library.

Example: ArrayBuffer

Let us start easy and recall ArrayBuffer collection class that we have encountered in the CS-A1110 Programming 1 course.

ArrayBuffer is an extension of the fixed-size Array provided by the Java Virtual Machine (which in turn relies on the hardware to supply the required memory capacity in consecutively-addressed memory words). Unlike Array, an ArrayBuffer allows the programmer to add new elements to the beginning and end, and also to remove elements:

scala> import scala.collection.mutable.ArrayBuffer
import scala.collection.mutable.ArrayBuffer

scala> val a = ArrayBuffer("first", "second")
a: scala.collection.mutable.ArrayBuffer[String] = ArrayBuffer(first, second)

scala> a.append("last")

scala> a.prepend("zeroth")

scala> a
res: scala.collection.mutable.ArrayBuffer[String] = ArrayBuffer(zeroth, first, second, last)

scala> a.remove(1)
res: String = first

scala> a
res: scala.collection.mutable.ArrayBuffer[String] = ArrayBuffer(zeroth, second, last)

So far so good, but how is this done, that is, implemented in software?

An ArrayBuffer a is in fact an object that internally uses a data structure consisting of

a fixed-size Array, let us call it array, to store the elements, and
an integer size0 telling how many of the elements in array are actually in use; that is, the length of the ArrayBuffer a itself.

That is, the internal array can be larger than the current length of a. When a new element is appended to the end of a, the algorithm implementing this functionality simply adds the new element in the next free position in array if not all the elements are in use. If the array is already full, then a new internal Array, that is twice as long as the old array, is allocated and the current contents (and the new element) are copied there.

A programmer who uses ArrayBuffers and other collections does not necessarily have to know the details of how the classes are implemented. It is perfectly possible to use the collection without having any idea how the collection is implemented. But it is important to understand that

the collection classes implement their functionality in software by using (sometimes quite non-trivial) data structures and algorithms on them, and
the chosen data structures and algorithms have an effect on how efficient the methods are.

For example, for an ArrayBuffer it is very efficient to append elements at the end: as the internal array always doubles in size, the allocation of and copying to the new array occurs rather seldomly. In contrast, prepending an element at the beginning is time-consuming because a new internal array is allocated and the elements copied to it every time the method is called.

We will study efficiency in more detail during the next rounds and then in the course CS-A1140 Data Structures and Algorithms.

Why so many collections?

But why do we have so many different types of collections? Furthermore, at the first sight, it seems that the scala.collection.mutable and scala.collection.immutable packages include a lot of highly similar classes/traits.

For example,

for Vector, List and ArrayBuffer we can write:

scala> import scala.collection.mutable.ArrayBuffer
import scala.collection.mutable.ArrayBuffer

scala> Vector(1,2,3).sum
res: Int = 6

scala> List(1,2,3).sum
res: Int = 6

scala> ArrayBuffer(1,2,3).sum
res: Int = 6

Similarly, for maps (also called association arrays or dictionaries)

scala> val m = Map("a"->2,"b"->3,"c"->3)
m: scala.collection.immutable.Map[String,Int] = Map(a -> 2, b -> 3, c -> 3)
scala> m("b")
res: Int = 3
scala> m.keys
res: Iterable[String] = Set(a, b, c)
scala> m.values
res: Iterable[Int] = MapLike(2, 3, 3)

we have at least the following classes/traits:

So what are the main differences between these classes?

Traits and concrete classes

Some of the collection classes are actually traits that describe the abstract interface (that is, the available methods) that is implemented in the concrete classes. For example,

scala.collection.immutable.Map is an abstract trait while
scala.collection.immutable.HashMap and scala.collection.immutable.ListMap are two concrete classes that implement the trait. They differ on the data structures and algorithms they use to provide the functionality described in the trait.

But observe that the scala.collection.immutable.Map trait has also a companion object that allows us to write, for instance

scala> import scala.collection.immutable._
import scala.collection.immutable._

scala> val m = Map("a"->1,"b"->2,"c"->3,"d"->4,"e"->5)
m: scala.collection.immutable.Map[String,Int] = Map(e -> 5, a -> 1, b -> 2, c -> 3, d -> 4)

The constructor (apply) method in the companion object creates an object of some concrete class that implements the trait. We can find out the actual class used by using the getClass method:

scala> m.getClass()
res: Class[_ <: scala.collection.immutable.Map[String,Int]] = class scala.collection.immutable.HashTrieMap

Differences in underlying implementation

Some of the concrete classes implement the same functionality (methods) but with different underlying data structures and algorithms, which in particular leads to differences in efficiency between the same methods in different classes.

For example, a ListBuffer “is” an ArrayBuffer but implemented with a different underlying data structure. This is reflected in the efficiency of the operations:

In a ListBuffer, adding elements to the beginning and to the end is very fast, while in ArrayBuffer adding elements to the beginning is slow.
On the other hand, indexed access to the elements (a(i)) is slow in ListBuffer but fast in ArrayBuffer.

For the most common collection classes and operations, the efficiency of the operations in the implementation can be found in this table. We will

come back to the question of efficiency in Round 7: Efficiency, and
implement a collection similar to Scala’s List class and some operations in it in Round 8: Recursion.

Data structures are studied in-depth in the course CS-A1140 Data Structures and Algorithms.

Mutable and immutable collections

As Scala supports both imperative and functional programming styles, the collections package also supports both mutable and immutable classes.

The collection classes in the scala.collection.mutable package are mutable, meaning that the internal state of objects of these classes can be modified after the creation of the object. For example, we may write

scala> val m1 = scala.collection.mutable.Map("a"->1, "b"-> 2)
m1: ... = Map(b -> 2, a -> 1)
scala> val m2 = m1
m2: ... = Map(b -> 2, a -> 1)
scala> m2("c") = 3
scala> m1
res: ... = Map(b -> 2, a -> 1, c -> 3)

The value m2 references to the mutable object m1, meaning that when we modify the state of the object through m2, the object seen through m1 changes as well (because it is the same object).

On the other hand, the collection classes in the scala.collection.immutable package are immutable, meaning that one cannot change the internal state of objects of these classes after the creation of the object. For example, let us experiment in the console:

scala> val m1 = scala.collection.immutable.Map("a"->1, "b"-> 2)
m1: ....immutable.Map[String,Int] = Map(a -> 1, b -> 2)
scala> val m2 = m1.updated("c", 3)
m2: ... = Map(a -> 1, b -> 2, c -> 3)
scala> m1
res: ... = Map(a -> 1, b -> 2)
scala> m2
res: ... = Map(a -> 1, b -> 2, c -> 3)

Now

the scala.collection.immutable.Map trait does not include the update method, implying that we cannot write a statement m2("c") = 3 like we did in the previous example; and
the closest corresponding method, updated, returns a new object that differs from the original object by one modified mapping (in our example, c -> 3). The original object (m1) stays unmodified, that is, it is immutable.

Based on the above discussion, one may have a doubt that immutable collection classes must be horribly slow because new, possibly large objects have to be created after every modification (even if very small, such as adding one element). Fortunately, this is not necessarily true: well-designed data structures can usually share substructure so that the required changes can be small. As a relatively easy example, consider the code:

scala> val l = List(5, 4, -7)
l: List[Int] = List(5, 4, -7)

scala> val m = 3 :: l
m: List[Int] = List(3, 5, 4, -7)

As we will learn in Round 8: Recursion, a List consists of smaller component objects that each store one element (or reference to one) and a reference to the next component object. Thus, adding to the beginning of the list is actually easy: just create a new component object and link its “next”reference to the first component object of the original list. Graphically, we may draw the situation in memory after making the new list m from l by adding 3 as the first element:

Linked list l with elements 5, 4 and -7. Variable l points to the first node with number 5 and a reference to the next node with number 4 etc. Also, another variable m that points to a node with number 3 and a reference to the aforementioned node with number 5. Thus m is a linked list 3, 5, 4, -7.