# Indexing and search

Everyday experience hints that if we get to choose between
*disorder* and *order*, from the perspective of *efficiency* it may
be serendipitous to choose *order*.

Indeed, look at the two sequences of characters below:

Both sequences contain the same characters.

The first sequence is arguably *disordered* (or *unstructured*), whereas
the second sequence is arguably more *ordered* (or *structured*).

## Disorder and order, searching

Suppose we now need to *search* a sequence of characters
for a particular character, say, the character “`x`

”:

Does the character occur in the sequence?

If so, how many times, and in what precise positions?

Again everyday experience suggests that *order*
enables a more *efficient* search than *disorder*.

Let us, however, start with *disorder*.

## Linear search

Suppose we are faced with an indexed sequence of elements of some type `T`

.
Suppose furthermore that we know *absolutely nothing* about the sequence.
For all we know, there is no structure whatsoever in the sequence.

How efficiently can we search such a sequence for a particular element
(or *key*) of type `T`

?

One possibility is just to perform a *linear search*, that is,
we iterate through the sequence, one element at a time,
and see whether we can find the key `k`

:

```
def linearSearch[T](s: IndexedSeq[T], k: T): Boolean = {
var i = 0
while(i < s.length) {
if(s(i) == k) return true // found k at position i
i = i + 1
}
return false // no k in sequence
}
```

From the perspective of efficiency, if

testing elements for equality, and

indexed access to elements of the sequence

are both constant-time operations, then linear search takes \(\mathcal{O}(n)\) time for a sequence of length \(n\).

What is more, if we really know absolutely nothing about the sequence,
then *linear search is essentially the best we can do*.
That is, in the worst case we must access every element and test it
for equality against the key `k`

. Indeed, any element left unaccessed
or untested could have been the only element that is equal to the key,
so every element must be accessed and tested.

Note

The `scala.collection`

classes that integrate the scala.collection.Seq trait use linear search to implement
the method `find`

. The search is actually implemented in the
scala.collection.Iterator trait as

```
def find(p: A => Boolean): Option[A] = {
var res: Option[A] = None
while (res.isEmpty && hasNext) {
val e = next()
if (p(e)) res = Some(e)
}
res
}
```

Warning

Despite its name, linear search may take more than linear time if
it is not a constant-time operation to test two values of type `T`

for equality. This is the case, for example, if the values are
long strings or other complicated objects that may require
substantial effort to test for equality.

## Indexing and sorting

While linear search is the best we can do to search unstructured
or unknown data, of course there is nothing that forces the
data to *remain unstructured* once we have access to it.

Indeed, as soon as we have access, we should invest resources to
*introduce* structure to the data *to enable fast search*.
Once we *know* that the data has a particular structure, we can
take advantage of this structure when we search the data.

Note

Such an approach of investing resources to introduce structure
to enable subsequent fast search is called *indexing* the data.

Perhaps the most common form of indexing is that we **sort** the data
so that the keys appear in increasing (or decreasing) order.
Once we know that this is the case, then we can take advantage of the
order when we search the data.

## Binary search

Suppose we have a sequence \(s\) of elements that has been sorted to increasing order. Now suppose we must search whether a key \(k\) occurs in \(s\).

Consider the *middle* element \(m\) in \(s\). We observe the following three cases:

If \(k = m\), we have found the key \(k\) so we can stop immediately.

If \(k < m\), then the key \(k\) can appear only in

*the first half*of the sequence.If \(k > m\), then the key \(k\) can appear only in

*the second half*of the sequence.

In the latter two cases, we *recursively continue in the selected half* of
the sequence until either

we find the key \(k\), or

the subsequence under consideration is empty, in which case we conclude that the key \(k\) does not appear in \(s\).

For example, below we display the steps of
binary search for the key 19 in the ordered sequence
[1,4,4,7,7,8,11,19,21,23,24,30]. In each step, we highlight
with blue the current subsequence defined by the indices `start`

and `end`

:

That is, in the first step we start with the entire sequence and consider
its middle element 8 (let us agree that when the sequence has even length,
the middle element is the leftmost of the two available choices for
the middle element). As the sequence is sorted and 8 is less than the key 19,
we conclude that 19 can occur only in the second half of the sequence,
not including the middle element 8. We then repeat the same with the
subsequence consisting of the second half. Its middle element is 21,
which is greater than the key 19.
We thus next consider the first half of *this subsequence*,
whose middle element is 11. In the last step the subsequence consists
of one element, and this element is both the middle element and
is equal to the key 19. Thus we have discovered 19 in the sequence.
Observe that only the 4 elements (8, 21, 11 and 19) indexed by the middle
point index `mid`

in the figure are compared against the key 19.

As a second example, let us search for the key 6 in the same sorted sequence:

In the last step, there are no more elements to consider in the subsequence and we conclude that 6 does not appear in the sequence.

### Implementing binary search in Scala

Now that we understand how binary search operates, let us implement it in Scala.

First, we must make sure that our sequence is sorted and supports
constant-time indexed access to its elements. If this is not the case,
we prepare a sorted copy of the sequence with the `sorted`

method
and then (if necessary) transform the sorted collection into a type
that supports constant-time indexed access, for example, by calling its
`toIndexedSeq`

(or `toArray`

or `toVector`

) method.

As we are not modifying the sequence during search,
we need not create new sequence objects for the considered
subsequences but can simply track the start and end of the
current subsequence with two indices (`start`

and `end`

).

We can now implement binary search as a recursive function:

```
def binarySearch[T](s: IndexedSeq[T], k: T)(implicit ev: T=>Ordered[T]) : Boolean = {
//require(s.sliding(2).forall(p => p(0) <= p(1)), "s should be sorted")
def inner(start: Int, end: Int): Int = {
if(!(start < end)) start
else {
val mid = (start + end) / 2
val cmp = k compare s(mid)
if(cmp == 0) mid // k == s(mid)
else if(cmp < 0) inner(start, mid-1) // k < s(mid)
else inner(mid+1, end) // k > s(mid)
}
}
if(s.length == 0) false
else s(inner(0, s.length-1)) == k
}
```

Let us test the function with some sample inputs:

```
scala> val s = Vector(1,4,4,7,7,8,11,19,21,23,24,30)
s: scala.collection.immutable.Vector[Int] = Vector(1, 4, 4, 7, 7, 8, 11, 19, 21, 23, 24, 30)
scala> binarySearch(s, 19)
res: Boolean = true
scala> binarySearch(s, 6)
res: Boolean = false
```

Observe that the recursion takes place in the inner function `inner`

because
the recursive calls have to know the start and end of the current
subsequence, and these parameters are not available in the outer function
`binarySearch`

. Also, the first `require`

statement is not included but
rather is commented out because its evaluation is (for long sequences)
substantially slower than the rest of the function.
One can include it, for instance, in the testing phase of a larger project
so that illegal calls to the function with non-sorted sequences can be caught.

For a sorted sequence with \(n\) elements
with constant-time indexed access and `compare`

operator,
the run-time of `binarySearch`

is in \(\mathcal{O}(\log n)\):

The number of elements under consideration,

`end - start + 1`

, at least halves in each recursive callWe stop when

`start >= end`

(that is, when`end - start + 1 <= 1`

)How many halvings suffice until we have decreased the original \(n\)-element sequence into a one-element sequence? Denoting the number of halvings by \(h\), we have \(n/2^h \le 1\), which implies \(n \le 2^h\) and thus \(h \ge \log n\).

Under the previous assumptions, each halving takes constant time and thus the run-time is \(\mathcal{O}(\log n)\).

The number of calls to the `compare`

operator is
\(\mathcal{O}(\log n)\). This matters in particular if comparison is
an expensive operation, for example, when the elements are long strings
or other more complicated objects.

We have thus established that compared with linear search, binary search is a very fast way to search for a key in a sorted sequence.

If the sequence is initially not sorted but we have to sort it first
to enable binary search, then we should include the time to sort
into the analysis. The most efficient general purpose
comparison-based sorting algorithms, such as the sorting algorithm that is implemented in
the `sorted`

method, work in time \(\mathcal{O}(n \log n)\).
(Here we again assume comparisons are constant-time operations.)
Now, if the sequence is initially unstructured and we make only
one search after the sequence is sorted, the total run-time will be
\(\mathcal{O}(n \log n) + \mathcal{O}(\log n) = \mathcal{O}(n \log n)\),
which is in total slower than just running linear search on the
unstructured input without sorting. On the other hand, if we make
a large number of searches to the sequence then it pays off to first
sort and then run binary search. Similarly, if we can afford to invest
time to *preprocess* our input before any searches take place, and each
search needs to be executed as fast as possible (for example, to supply a smooth
and responsive user experience), then it again pays off to use binary search.

Note

Although binary search is conceptually quite straightforward, its implementations are surprisingly prone to subtle errors.

Note

The declaration `implicit ev: T=>Ordered[T]`

in the function says that
the class `T`

of the elements in the sequence `s`

must be such
that it can be converted to an Ordered[T] class
which provides the `compare`

method used in the function. Moreover that a function `ev`

(evidence) should be provided implicitly by Scala if such a conversion exists, so that the function can be called as `binarySearch(s,k)`

.

Note

Recursion is a natural way to implement binary search but we can also make an iterative implementation:

```
def binarySearchIter[T](s: IndexedSeq[T], k: T)(implicit ev: T=>Ordered[T]): Boolean = {
//require(s.sliding(2).forall(p => p(0) <= p(1)), "s should be sorted")
if(s.length == 0) return false
var start = 0
var end = s.length - 1
while(start < end) {
val mid = (start + end) / 2
val cmp = k compare s(mid)
if(cmp == 0) {start = mid; end = mid}
else if(cmp < 0) end = mid-1 // k < s(mid)
else start = mid+1 // k > s(mid)
}
return s(start) == k
}
```