A prerequisite to becoming a productive programmer is to have a firm understading of the abstractions that are available to take care of the mundane tasks. This understanding ideally includes an understanding how these abstractions are implemented, in software, making efficient use of the hardware.
In programming terms a very frequent task is that we have a collection of objects of some type, and we must transform that collection in a more or less routine manner to yield a new collection with the desired objects and/or properties.
Collections, in Scala. In this round our focus is on the collection types available in Scala, as provided by the scala.collection package. This package offers many convenient classes and methods for storing and manipulating collections of objects; the most commonly used collection types are
Our further focus will be on functional programming style. Indeed, functional programming enables us to write many common transformations compactly and clearly, and furthermore is in effect required to conveniently express what we want to happen to the objects in our collections as they cascade through a sequence of transformations. (Recall cascading from Round 1: Warmup.)
Our aim is, however, not to exhaustively cover all the classes and methods in the scala.collection package in this round (indeed, that would be quite impossible). Rather we seek to review some of the central features of this package and leave the rest for further study. Here are some links for further, detailed studies:
Other programming languages (*). While our focus is on Scala, essentially every programming language comes with a “standard library” that provides implementations of the most commonly used abstractions and their supporting functionality. For example,
ArrayBuffer is an extension of the fixed-size Array provided by the Java Virtual Machine (which in turn relies on the hardware to supply the required memory capacity in consecutively-addressed memory words).
ArrayBuffer allows the programmer to add new elements
to the beginning and end, and also to remove elements:
scala> import scala.collection.mutable.ArrayBuffer import scala.collection.mutable.ArrayBuffer scala> val a = ArrayBuffer("first", "second") a: scala.collection.mutable.ArrayBuffer[String] = ArrayBuffer(first, second) scala> a.append("last") scala> a.prepend("zeroth") scala> a res: scala.collection.mutable.ArrayBuffer[String] = ArrayBuffer(zeroth, first, second, last) scala> a.remove(1) res: String = first scala> a res: scala.collection.mutable.ArrayBuffer[String] = ArrayBuffer(zeroth, second, last)
So far so good, but how is this done, that is, implemented in software?
a is in fact an object
that internally uses a data structure consisting of
Array, let us call it
array, to store the elements, and
size0telling how many of the elements in
arrayare actually in use; that is, the length of the
That is, the internal array can be larger than the current length of
When a new element is appended to the end of
the algorithm implementing this functionality
simply adds the new element in the next free position in
not all the elements are in use.
array is already full,
then a new internal
Array, that is twice as long as the old
is allocated and the current contents (and the new element) are copied there.
A programmer who uses
ArrayBuffers and other collections does not necessarily have to know the details of how the classes are implemented. It is perfectly possible to use the collection without having any idea how the collection is implemented. But it is important to understand that
the collection classes implement their functionality in software by using (sometimes quite non-trivial) data structures and algorithms on them, and
the chosen data structures and algorithms have an effect on how efficient the methods are.
ArrayBuffer it is very efficient to append elements at the end:
as the internal array always doubles in size,
the allocation of and copying to the new array occurs rather seldomly.
In contrast, prepending an element at the beginning is time-consuming
because a new internal array is allocated and the elements copied to it
every time the method is called.
We will study efficiency in more detail during the next rounds and then in the course CS-A1140 Data Structures and Algorithms.
Why so many collections?
But why do we have so many different types of collections? Furthermore, at the first sight, it seems that the scala.collection.mutable and scala.collection.immutable packages include a lot of highly similar classes/traits.
scala> import scala.collection.mutable.ArrayBuffer import scala.collection.mutable.ArrayBuffer scala> Vector(1,2,3).sum res: Int = 6 scala> List(1,2,3).sum res: Int = 6 scala> ArrayBuffer(1,2,3).sum res: Int = 6
Similarly, for maps (also called association arrays or dictionaries)
scala> val m = Map("a"->2,"b"->3,"c"->3) m: scala.collection.immutable.Map[String,Int] = Map(a -> 2, b -> 3, c -> 3) scala> m("b") res: Int = 3 scala> m.keys res: Iterable[String] = Set(a, b, c) scala> m.values res: Iterable[Int] = MapLike(2, 3, 3)
we have at least the following classes/traits:
So what are the main differences between these classes?
Traits and concrete classes
Some of the collection classes are actually traits that describe the abstract interface (that is, the available methods) that is implemented in the concrete classes. For example,
scala.collection.immutable.Map is an abstract trait while
scala.collection.immutable.HashMap and scala.collection.immutable.ListMap are two concrete classes that implement the trait. They differ on the data structures and algorithms they use to provide the functionality described in the trait.
scala> import scala.collection.immutable._ import scala.collection.immutable._ scala> val m = Map("a"->1,"b"->2,"c"->3,"d"->4,"e"->5) m: scala.collection.immutable.Map[String,Int] = Map(e -> 5, a -> 1, b -> 2, c -> 3, d -> 4)
The constructor (
apply) method in the companion object creates
an object of some concrete class that implements the trait.
We can find out the actual class used by using the
scala> m.getClass() res: Class[_ <: scala.collection.immutable.Map[String,Int]] = class scala.collection.immutable.HashTrieMap
Differences in underlying implementation
Some of the concrete classes implement the same functionality (methods) but with different underlying data structures and algorithms, which in particular leads to differences in efficiency between the same methods in different classes.
In a ListBuffer, adding elements to the beginning and to the end is very fast, while in ArrayBuffer adding elements to the beginning is slow.
On the other hand, indexed access to the elements (
a(i)) is slow in ListBuffer but fast in ArrayBuffer.
For the most common collection classes and operations, the efficiency of the operations in the implementation can be found in this table. We will
come back to the question of efficiency in Round 7: Efficiency, and
Data structures are studied in-depth in the course CS-A1140 Data Structures and Algorithms.
Mutable and immutable collections
As Scala supports both imperative and functional programming styles, the collections package also supports both mutable and immutable classes.
The collection classes in the scala.collection.mutable package are mutable, meaning that the internal state of objects of these classes can be modified after the creation of the object. For example, we may write
scala> val m1 = scala.collection.mutable.Map("a"->1, "b"-> 2) m1: ... = Map(b -> 2, a -> 1) scala> val m2 = m1 m2: ... = Map(b -> 2, a -> 1) scala> m2("c") = 3 scala> m1 res: ... = Map(b -> 2, a -> 1, c -> 3)
m2 references to the mutable object
meaning that when we modify the state of the object through
the object seen through
m1 changes as well (because it is the same object).
On the other hand, the collection classes in the scala.collection.immutable package are immutable, meaning that one cannot change the internal state of objects of these classes after the creation of the object. For example, let us experiment in the console:
scala> val m1 = scala.collection.immutable.Map("a"->1, "b"-> 2) m1: ....immutable.Map[String,Int] = Map(a -> 1, b -> 2) scala> val m2 = m1.updated("c", 3) m2: ... = Map(a -> 1, b -> 2, c -> 3) scala> m1 res: ... = Map(a -> 1, b -> 2) scala> m2 res: ... = Map(a -> 1, b -> 2, c -> 3)
scala.collection.immutable.Maptrait does not include the
updatemethod, implying that we cannot write a statement
m2("c") = 3like we did in the previous example; and
the closest corresponding method,
updated, returns a new object that differs from the original object by one modified mapping (in our example,
c -> 3). The original object (
m1) stays unmodified, that is, it is immutable.
Based on the above discussion, one may have a doubt that immutable collection classes must be horribly slow because new, possibly large objects have to be created after every modification (even if very small, such as adding one element). Fortunately, this is not necessarily true: well-designed data structures can usually share substructure so that the required changes can be small. As a relatively easy example, consider the code:
scala> val l = List(5, 4, -7) l: List[Int] = List(5, 4, -7) scala> val m = 3 :: l m: List[Int] = List(3, 5, 4, -7)
As we will learn in Round 8: Recursion, a
List consists of
smaller component objects that each store one element (or reference to one)
and a reference to the next component object.
Thus, adding to the beginning of the list is actually easy: just create a new component object and link its “next”reference to the first component object of the original list.
Graphically, we may draw the situation in memory after making the
l by adding 3 as the first element:
More on the topic:
Data structures in general: CS-A1140 Data Structures and Algorithms.
A presentation by Daniel Spiewak: Extreme Cleverness: Functional Data Structures in Scala.