Collections
A prerequisite to becoming a productive programmer is to have a firm understading of the abstractions that are available to take care of the mundane tasks. This understanding ideally includes an understanding how these abstractions are implemented, in software, making efficient use of the hardware.
In programming terms a very frequent task is that we have a collection of objects of some type, and we must transform that collection in a more or less routine manner to yield a new collection with the desired objects and/or properties.
Collections, in Scala. In this round our focus is on the collection types available in Scala, as provided by the scala.collection package. This package offers many convenient classes and methods for storing and manipulating collections of objects; the most commonly used collection types are
Our further focus will be on functional programming style. Indeed, functional programming enables us to write many common transformations compactly and clearly, and furthermore is in effect required to conveniently express what we want to happen to the objects in our collections as they cascade through a sequence of transformations. (Recall cascading from Round 1: Warmup.)
Our aim is, however, not to exhaustively cover all the classes and methods in the scala.collection package in this round (indeed, that would be quite impossible). Rather we seek to review some of the central features of this package and leave the rest for further study. Here are some links for further, detailed studies:
Other programming languages (*). While our focus is on Scala, essentially every programming language comes with a “standard library” that provides implementations of the most commonly used abstractions and their supporting functionality. For example,
Java has the java.util package, and
C++ comes with the C++ standard library.
Example: ArrayBuffer
Let us start easy and recall ArrayBuffer collection class that we have encountered in the CS-A1110 Programming 1 course.
ArrayBuffer
is an extension of the fixed-size Array provided by the Java Virtual Machine (which in turn relies on the hardware to supply the required memory capacity in consecutively-addressed memory words).
Unlike Array
, an ArrayBuffer
allows the programmer to add new elements
to the beginning and end, and also to remove elements:
scala> import scala.collection.mutable.ArrayBuffer
import scala.collection.mutable.ArrayBuffer
scala> val a = ArrayBuffer("first", "second")
a: scala.collection.mutable.ArrayBuffer[String] = ArrayBuffer(first, second)
scala> a.append("last")
scala> a.prepend("zeroth")
scala> a
res: scala.collection.mutable.ArrayBuffer[String] = ArrayBuffer(zeroth, first, second, last)
scala> a.remove(1)
res: String = first
scala> a
res: scala.collection.mutable.ArrayBuffer[String] = ArrayBuffer(zeroth, second, last)
So far so good, but how is this done, that is, implemented in software?
An ArrayBuffer
a
is in fact an object
that internally uses a data structure consisting of
a fixed-size
Array
, let us call itarray
, to store the elements, andan integer
size0
telling how many of the elements inarray
are actually in use; that is, the length of theArrayBuffer
a
itself.
That is, the internal array can be larger than the current length of a
.
When a new element is appended to the end of a
,
the algorithm implementing this functionality
simply adds the new element in the next free position in array
if
not all the elements are in use.
If the array
is already full,
then a new internal Array
, that is twice as long as the old array
,
is allocated and the current contents (and the new element) are copied there.
A programmer who uses ArrayBuffers
and other collections does not necessarily have to know the details of how the classes are implemented. It is perfectly possible to use the collection without having any idea how the collection is implemented. But it is important to understand that
the collection classes implement their functionality in software by using (sometimes quite non-trivial) data structures and algorithms on them, and
the chosen data structures and algorithms have an effect on how efficient the methods are.
For example,
for an ArrayBuffer
it is very efficient to append elements at the end:
as the internal array always doubles in size,
the allocation of and copying to the new array occurs rather seldomly.
In contrast, prepending an element at the beginning is time-consuming
because a new internal array is allocated and the elements copied to it
every time the method is called.
We will study efficiency in more detail during the next rounds and then in the course CS-A1140 Data Structures and Algorithms.
Why so many collections?
But why do we have so many different types of collections? Furthermore, at the first sight, it seems that the scala.collection.mutable and scala.collection.immutable packages include a lot of highly similar classes/traits.
For example,
for Vector, List and ArrayBuffer we can write:
scala> import scala.collection.mutable.ArrayBuffer import scala.collection.mutable.ArrayBuffer scala> Vector(1,2,3).sum res: Int = 6 scala> List(1,2,3).sum res: Int = 6 scala> ArrayBuffer(1,2,3).sum res: Int = 6
Similarly, for maps (also called association arrays or dictionaries)
scala> val m = Map("a"->2,"b"->3,"c"->3) m: scala.collection.immutable.Map[String,Int] = Map(a -> 2, b -> 3, c -> 3) scala> m("b") res: Int = 3 scala> m.keys res: Iterable[String] = Set(a, b, c) scala> m.values res: Iterable[Int] = MapLike(2, 3, 3)
we have at least the following classes/traits:
So what are the main differences between these classes?
Traits and concrete classes
Some of the collection classes are actually traits that describe the abstract interface (that is, the available methods) that is implemented in the concrete classes. For example,
scala.collection.immutable.Map is an abstract trait while
scala.collection.immutable.HashMap and scala.collection.immutable.ListMap are two concrete classes that implement the trait. They differ on the data structures and algorithms they use to provide the functionality described in the trait.
But observe that the scala.collection.immutable.Map trait has also a companion object that allows us to write, for instance
scala> import scala.collection.immutable._
import scala.collection.immutable._
scala> val m = Map("a"->1,"b"->2,"c"->3,"d"->4,"e"->5)
m: scala.collection.immutable.Map[String,Int] = Map(e -> 5, a -> 1, b -> 2, c -> 3, d -> 4)
The constructor (apply
) method in the companion object creates
an object of some concrete class that implements the trait.
We can find out the actual class used by using the getClass
method:
scala> m.getClass()
res: Class[_ <: scala.collection.immutable.Map[String,Int]] = class scala.collection.immutable.HashTrieMap
Differences in underlying implementation
Some of the concrete classes implement the same functionality (methods) but with different underlying data structures and algorithms, which in particular leads to differences in efficiency between the same methods in different classes.
For example, a ListBuffer “is” an ArrayBuffer but implemented with a different underlying data structure. This is reflected in the efficiency of the operations:
In a ListBuffer, adding elements to the beginning and to the end is very fast, while in ArrayBuffer adding elements to the beginning is slow.
On the other hand, indexed access to the elements (
a(i)
) is slow in ListBuffer but fast in ArrayBuffer.
For the most common collection classes and operations, the efficiency of the operations in the implementation can be found in this table. We will
come back to the question of efficiency in Round 7: Efficiency, and
implement a collection similar to Scala’s List class and some operations in it in Round 8: Recursion.
Data structures are studied in-depth in the course CS-A1140 Data Structures and Algorithms.
Mutable and immutable collections
As Scala supports both imperative and functional programming styles, the collections package also supports both mutable and immutable classes.
The collection classes in the scala.collection.mutable package are mutable, meaning that the internal state of objects of these classes can be modified after the creation of the object. For example, we may write
scala> val m1 = scala.collection.mutable.Map("a"->1, "b"-> 2)
m1: ... = Map(b -> 2, a -> 1)
scala> val m2 = m1
m2: ... = Map(b -> 2, a -> 1)
scala> m2("c") = 3
scala> m1
res: ... = Map(b -> 2, a -> 1, c -> 3)
The value m2
references to the mutable object m1
,
meaning that when we modify the state of the object through m2
,
the object seen through m1
changes as well (because it is the same object).
On the other hand, the collection classes in the scala.collection.immutable package are immutable, meaning that one cannot change the internal state of objects of these classes after the creation of the object. For example, let us experiment in the console:
scala> val m1 = scala.collection.immutable.Map("a"->1, "b"-> 2)
m1: ....immutable.Map[String,Int] = Map(a -> 1, b -> 2)
scala> val m2 = m1.updated("c", 3)
m2: ... = Map(a -> 1, b -> 2, c -> 3)
scala> m1
res: ... = Map(a -> 1, b -> 2)
scala> m2
res: ... = Map(a -> 1, b -> 2, c -> 3)
Now
the
scala.collection.immutable.Map
trait does not include theupdate
method, implying that we cannot write a statementm2("c") = 3
like we did in the previous example; andthe closest corresponding method,
updated
, returns a new object that differs from the original object by one modified mapping (in our example,c -> 3
). The original object (m1
) stays unmodified, that is, it is immutable.
Based on the above discussion, one may have a doubt that immutable collection classes must be horribly slow because new, possibly large objects have to be created after every modification (even if very small, such as adding one element). Fortunately, this is not necessarily true: well-designed data structures can usually share substructure so that the required changes can be small. As a relatively easy example, consider the code:
scala> val l = List(5, 4, -7)
l: List[Int] = List(5, 4, -7)
scala> val m = 3 :: l
m: List[Int] = List(3, 5, 4, -7)
As we will learn in Round 8: Recursion, a List
consists of
smaller component objects that each store one element (or reference to one)
and a reference to the next component object.
Thus, adding to the beginning of the list is actually easy: just create a new component object and link its “next”reference to the first component object of the original list.
Graphically, we may draw the situation in memory after making the
new list m
from l
by adding 3 as the first element:
More on the topic:
We study lists more in Round 8: Recursion, Section Linked lists.
Data structures in general: CS-A1140 Data Structures and Algorithms.
A presentation by Daniel Spiewak: Extreme Cleverness: Functional Data Structures in Scala.