An Extensive Examination of Data Structures
Part 1: An Introduction to Data Structures
Scott Mitchell
4GuysFromRolla.com
October 2003
Summary: This article kicks off a six-part series that focuses on important data structures and their use in application development. We'll examine both built-in data structures present in the .NET Framework, as well as essential data structures we'll have to build ourselves. This first installment focuses on defining what data structures are, how the efficiency of data structures is analyzed, and why this analysis is important. In this article, we'll also examine the Array and ArrayList, two of the most commonly used data structures present in the .NET Framework. (12 printed pages)
Contents
Analyzing the Performance of Data Structures
Everyone's Favorite Linear, Direct Access, Homogeneous Data Structure
The ArrayList: a Heterogeneous, Self-Redimensioning Array
Introduction
Welcome to the first in a six-part series on using data structures in .NET. Throughout this article series we will be examining a variety of data structures, some of which are included in the .NET Framework Base Class Library and others that we'll build ourselves. If you're unfamiliar with the term, data structures are abstract structures, or classes, that are used to organize data and provide various operations upon their data. The most common and likely well-known data structure is the array, which contains a contiguous collection of data items that can be accessed by an ordinal index.
Before jumping into the content for this article, let's first take a quick peek at the roadmap for this six-part article series, so that you can see what lies ahead. If there are any topics you think are missing from this outline, I invite you to e-mail me at mitchell@4guysfromrolla.com and share your thoughts. Space permitting, I'll be happy to add your suggestions to the appropriate installment or, if needed, add a seventh part to the series.
In this first part of the six-part series, we'll look at why data structures are important, and their effect on the performance of an algorithm. To determine a data structure's effect on performance, we'll need to examine how the various operations performed by a data structure can be rigorously analyzed. Finally, we'll turn our attention to two data structures present in the .NET Framework—the Array and ArrayList. Chances are you've used both of these data structures in past projects. In this article, we'll examine what operations they provide and the efficiency of these operations.
In Part 2, we'll explore the ArrayList class in more detail and examine its counterparts, the Queue class and Stack class. Like the ArrayList, both the Queue and Stack classes store a contiguous collection of data and are data structures available in the .NET Framework Base Class Library. However, unlike an ArrayList from which you can retrieve any data item, Queues and Stacks only allow data to be accessed in a predetermined sequential order. We'll examine some applications of Queues and Stacks, and see how to implement both of these classes by extending the ArrayList class. After examining Queues and Stacks, we'll look at HashTables, which allow for direct access like an ArrayList, but store data indexed by a string key.
While ArrayLists are ideal for directly accessing and storing contents, they are suboptimal candidates when the data needs to be searched. In Part 3, we'll examine the binary search tree data structure, which provides a much more efficient means for searching than the ArrayList. The .NET Framework does not include any built-in binary search tree data structures, so we will have to build our own.
The efficiency of searching a binary search trees is sensitive to the order with which the data was inserted into the tree. If the data was inserted in sorted or near-sorted order, the binary search tree loses virtually all of its efficiency advantages over the ArrayList. To combat this issue, in Part 4 we'll examine an interesting randomized data structure—the SkipList. SkipLists provide the efficiency of searching a binary search tree, but without the sensitivity to the order with which data is entered.
In Part 5 we'll turn our attention to data structures that can be used to represent graphs. A graph is a collection of nodes, with a set of edges connecting the various nodes. For example, a map can be visualized as a graph, with cities as nodes and the highways between them as edged between the nodes. Many real-world problems can be abstractly defined in terms of graphs, thereby making graphs an often-used data structure.
Finally, in Part 6 we'll look at data structures to represent sets and disjoint sets. A set is an unordered collection of items. Disjoint sets are a collection of sets that have no elements in common with one another. Both sets and disjoint sets have many uses in everyday programs, which we'll examine in detail in this final part.
Analyzing the Performance of Data Structures
When thinking about a particular application or programming problem, many developers (myself included) find themselves most interested in writing the algorithm to tackle the problem at hand, or adding cool features to the application to enhance the user's experience. Rarely, if ever, will you hear someone excited about what type of data structure they are using. However, the data structures used for a particular algorithm can greatly impact its performance. A common example is finding an element in a data structure. With an array, this process takes time proportional to the number of elements in the array. With binary search trees or SkipLists, the time required is sub-linear. When searching large amounts of data, the data structure chosen can make a difference in the application's performance that can be visibly measured in seconds or even minutes.
Since the data structure used by an algorithm can greatly affect the algorithm's performance, it is important that there exists a rigorous method by which to compare the efficiency of various data structures. What we, as developers utilizing a data structure, are primarily interested in is how the data structures performance changes as the amount of data stored increases. That is, for each new element stored by the data structure, how are the running times of the data structure's operations effected?
Consider a scenario in which you have a program that uses the System.IO.Directory.GetFiles(path) method to return the list of the files in a specified directory as a string array. Now, imagine that you wanted to search through the array to determine if an XML file existed in the list of files (namely one whose extension was .xml). One approach to do this would be to scan through the array and set some flag once an XML file was encountered. The code might look like so:
using System;
using System.Collections;
using System.IO;
public class MyClass
{
public static void Main()
{
string [] fs = Directory.GetFiles(@"C:\Inetpub\wwwroot");
bool foundXML = false;
int i = 0;
for (i = 0; i < fs.Length; i++)
if (String.Compare(Path.GetExtension(fs[i]), ".xml", true) == 0)
{
foundXML = true;
break;
}
if (foundXML)
Console.WriteLine("XML file found - " + fs[i]);
else
Console.WriteLine("No XML files found.");
}
}
Here we see that in the worst-case, when there is no XML file or the XML file is the last file in the list, we have to search through each element of the array exactly once. To analyze the array's efficiency at sorting, we must ask ourselves the following, "Assume that I have an array with n elements. If I add another element, so the array has n + 1 elements, what is the new running time?" (The term running time, despite its name, does not measure the absolute time it takes the program to run, but rather, it refers to the number of steps the program must perform to complete the given task at hand. When working with arrays, typically the steps considered are how many array accesses one needs to perform.) To search for a value in an array, we need to potentially visit every array value, so if we have n + 1 array elements, we might have to perform n + 1 checks. That is, the time it takes to search an array is linearly proportional to the number of elements in the array.
The sort of analysis described here is called asymptotic analysis, as it examines how the efficiency of a data structure changes as the data structure's size approaches infinity. The notation commonly used in asymptotic analysis is called big-Oh notation. The big-Oh notation to describe the performance of searching an array would be denoted as O(n). The large script O is where the terminology big-Oh notation comes from, and the n indicates that the number of steps required to search an array grows linearly as the size of the array grows.
A more methodical way of computing the asymptotic running time of a block of code is to do follow these simple steps:
1. Determine the steps that constitute the algorithm's running time. As aforementioned, with arrays, typically the steps considered are the read and write accesses to the array. For other data structures the steps might differ. Typically, you want to concern yourself with steps that involve the data structure itself, and not simple, atomic operations performed by the computer. That is, with the block of code above, I analyzed its running time by only counting how many times the array needs to be accessed, and did not bother worrying about the time for creating and initializing variables or the check to see if the two strings were equal.
2. Find the line(s) of code that perform the steps you are interested in counting. Put a 1 next to each of those lines.
3. For each line with a 1 next to it, see if it is in a loop. If so, change the 1 to 1 times the maximum number of repetitions the loop may perform. If you have two or more nested loops, continue the multiplication for each loop.
4. Find the largest single term you have written down. This is the running time.
Let's apply these steps to the block of code above. We've already identified that the steps we're interested in are the number of array accesses. Moving onto step 2 note that there are two lines on which the array, fs, is being accessed—as a parameter in the String.Compare() method and in the Console.WriteLine() method, so mark a 1 next to each line. Now, applying step 3 notice that the access to fs in the String.Compare() method occurs within a loop that runs at most n times (where n is the size of the array). So, scratch out the 1 in the loop and replace it with n. Finally, we see that the largest value is n, so the running time is denoted as O(n).
O(n), or linear-time, represents just one of a myriad of possible asymptotic running times. Others include O(log2 n), O(n log2 n), O(n2), O(2n), and so on. Without getting into the gory mathematical details of big-Oh, the lower the term inside the parenthesis for large values of n, the better the data structure's operation's performance. For example, an operation that runs in O(log n) is more efficient than one that runs in O(n) since log n < n.
Note In case you need a quick mathematics refresher, loga b = y is just another way to write ay = b. So, log2 4 = 2, since 22 = 4. Similarly, log2 8 = 3, since 23 = 8. Clearly, log2 n grows much slower than n alone, because when n = 8, log2 n = 3. In Part 3 we'll examine binary search trees whose search operation provides an O(log2 n) running time.
Throughout this article series, we'll be computing each new data structure and its operations asymptotic running time and comparing it to the running time for similar operations on other data structures.
Everyone's Favorite Linear, Direct Access, Homogeneous Data Structure—The Array
Arrays are one of the simplest and most widely used data structures in computer programs. Arrays in any programming language all share a few common properties:
· The contents of an array are stored in contiguous memory.
· All of the elements of an array must be of the same type; hence arrays are referred to as homogeneous data structures.
· Array elements can be directly accessed. (This is not necessarily the case for many other data structures. For example, in part 4 of this article series we'll examine a data structure called the SkipList. To access a particular element of a SkipList you must search through other elements until you find the element for which you're looking. With arrays, however, if you know you want to access the ith element, you can simply use one line of code: arrayName[i].)
The common operations performed on arrays are:
· Allocation
· Accessing
· Redimensioning
When an array is initially declared in C# it has a null value. That is, the following line of code simply creates a variable named booleanArray that equals null:
bool [] booleanArray;
Before we can begin to work with the array, we must allocate a specified number of elements. This is accomplished using the following syntax:
booleanArray = new bool[10];
Or more generically:
arrayName = new arrayType[allocationSize];
This allocates a contiguous block of memory in the CLR-managed heap large enough to hold the allocationSize number of arrayTypes. If arrayType is a value type, then allocationSize number of unboxed arrayType values are created. If arrayType is a reference type, then allocationSize number of arrayType references are created. (If you are unfamiliar with the difference between reference and value types and the managed heap versus the stack, check out Understanding .NET's Common Type System.)
To help hammer home how the .NET Framework stores the internals of an array, consider the following example:
bool [] booleanArray;
FileInfo [] files;
booleanArray = new bool[10];
files = new FileInfo[10];
Here, the booleanArray is an array of the value type System.Boolean, while the files array is an array of a reference type, System.IO.FileInfo. Figure 1 shows a depiction of the CLR-managed heap after these four lines of code have executed.
[url=http://msdn.microsoft.com/library/en-us/dv_vstechart/html/datastructures-fig01.gif]
Figure 1. The contents of an array are laid out contiguously in the managed heap.
The thing to keep in mind is that the ten elements in the files array are references to FileInfo instances. Figure 2 hammers home this point, showing the memory layout if we assign some of the values in the files array to FileInfo instances.
[url=http://msdn.microsoft.com/library/en-us/dv_vstechart/html/datastructures-fig02.gif]
Figure 2. The contents of an array are laid out contiguously in the managed heap.
All arrays in .NET provide allow their elements to both be read and written to. The syntax for accessing an array element is:
// Read an array element
bool b = booleanArray[7];
// Write to an array element
booleanArray[0] = false;
The running time of an array access is denoted O(1) because it is constant. That is, regardless of how many elements are stored in the array, it takes the same amount of time to lookup an element. This constant running time is possible solely because an array's elements are stored contiguously, hence a lookup just requires knowledge of the array's starting location in memory, the size of each array element, and the element to be indexed.
Realize that in managed code, array lookups are a bit more involved than this because with each array access the CLR checks to ensure that the index being requested is within the array's bounds. If the array index specified is out of bounds, an IndexOutOfRangeException is thrown. This check helps ensure that when stepping through an array we do not accidentally step past the last array index and into some other memory. This check, though, does not affect the running time of an array access because the time to perform such checks does not increase as the size of the array increases.
Note This index-bounds check comes at a slight cost of performance for applications that make a large number of array accesses. With a bit of unmanaged code, though, this index out of bounds check can be bypassed. For more information, refer to Chapter 14 of Applied Microsoft .NET Framework Programming by Jeffrey Richter.
When working with an array, you might need to change the number of elements it holds. To do so, you'll need to create a new array instance of the specified size and then copy over the contents of the old array into the new, resized array. This process is called redimensioning, and can be accomplished with the following code:
using System;
using System.Collections;
public class MyClass
{
public static void Main()
{
// Create an integer array with three elements
int [] fib = new int[3];
fib[0] = 1;
fib[1] = 1;
fib[2] = 2;
// Redimension message to a 10 element array
int [] temp = new int[10];
// Copy the fib array to temp
fib.CopyTo(temp, 0);
// Assign temp to fib
fib = temp;
}
}
After the last line of code, fib references a ten-element Int32 array. The elements 3 through 9 in the fib array will have the default Int32 value—0.
Arrays are excellent data structures to use when storing a collection of heterogeneous types that you only need to access directly. Searching an unsorted array has linear running time. While this is acceptable when working with small arrays, or when performing very few searches, if your application is storing large arrays that are searched frequently, there are a number of other data structures better suited for the job. We'll look at some such data structures in upcoming pieces of this article series. (Realize that if you are searching an array on some property and the array is sorted by that property, you can use an algorithm called binary search to search the array in O(log n) running time, which is on par with the search times for binary search trees. In fact, the Array class contains a static BinarySearch() method. For more information on this method, check out an earlier article of mine, Efficiently Searching a Sorted Array.
Note The .NET Framework allows for multi-dimensional arrays as well. Multi-dimensional arrays, like single-dimensional arrays, offer a constant running time for accessing elements. Recall that the running time to search through an n-element single dimensional array was denoted O(n). For an nxn two-dimensional array, the running time is denoted O(n2) because the search must check n2 elements. More generally, a k-dimensional array has a search running time of O(nk).
The ArrayList: a Heterogeneous, Self-Redimensioning Array
While arrays definitely have their time and place, arrays create some limitations on design because a single array can only store elements of one type (homogeneity), and when using arrays you must specifically allocate a certain number of elements. Oftentimes, though, developers want something more flexible—a simple collection of objects of potentially different types that can be easily managed without having to worry about allocation issues. The .NET Framework Base Class Library provides such a data structure called the System.Collections.ArrayList.
An example of the ArrayList in action can be seen in the code snippet below. Note that with the ArrayList any type can be added and no allocation step has to be performed, and in fact, elements of different types can be added to the ArrayList. Furthermore, at no point do we have to concern ourselves with redimensioning the ArrayList. All of this is handled behind the scenes for us.
ArrayList countDown = new ArrayList();
countDown.Add(5);
countDown.Add(4);
countDown.Add(3);
countDown.Add(2);
countDown.Add(1);
countDown.Add("blast off!");
countDown.Add(new ArrayList());
Behind the scenes the ArrayList uses a System.Array of type object. Since all types are derived either directly or indirectly from object, an object array can hold elements of any type. By default, an ArrayList creates a 16-element object array, although the precise size can be specified through a parameter in the constructor or the Capacity property. When adding an element thought the Add() method, the number of elements in the internal array is checked with the array's capacity. If adding the new element causes the count to exceed the capacity, the capacity is automatically doubled and the array is redimensioned.
The ArrayList, like the array, can be directly indexed using the same syntax:
// Read access
int x = (int) countDown[0];
string y = (string) countDown[5];
// Write access
countDown[1] = 5;
// ** WILL GENERATE AN ArgumentOutOfRange EXCEPTION **
countDown[7] = 5;
Since the ArrayList stores an array of objects, when reading the value from an ArrayList you need to explicitly cast it to the data type being stored in the specified location. Also, note that if you try to reference an ArrayList element greater than the ArrayList's size, a System.ArgumentOutOfRange exception will be thrown.
While the ArrayList provides added flexibility over the standard array, this flexibility comes at the cost of performance, especially when storing value types in an ArrayList. Recall that an array of a value type—such as a System.Int32, System.Double, System.Boolean, and so on—is stored contiguously in the managed heap in its unboxed form. The ArrayList's internal array, however, is an array of object references. Therefore, even if you have an ArrayList that stores nothing but value types, each ArrayList element is a reference to a boxed value type, as shown in Figure 3.
[url=http://msdn.microsoft.com/library/en-us/dv_vstechart/html/datastructures-fig03.gif]
Figure 3. The ArrayList contains a contiguous block of object references
The boxing and unboxing, along with the extra level of indirection that comes with using value types in an ArrayList, can hamper the performance of your application when using large ArrayLists with many reads and writes. As Figure 3 illustrates, the same memory layout occurs for reference types in both ArrayLists and arrays.
The ArrayList's self-redimensioning shouldn't cause any sort of performance degradation in comparison to an array. If you know the precise number of elements that need to be stored in the ArrayList, you can essentially turn off self-redimensioning by specifying the initial capacity in the ArrayList's constructor. If you don't know the precise size, even with an array you may have to redimension the array should the number of elements inserted exceed the array's size.
A classic computer science problem is determining how much new space to allocate when running out of a space in some buffer. One option when redimensioning an array is to allocate just one more element in the resized array. That is, if the array is initially allocated with five elements, before the sixth element is inserted, the array is redimensioned to six elements. Clearly, this approach conserves the most memory, but can become costly because each insert following the first redimensioning results in another redimensioning.
Another option, at the opposite end of the spectrum, is to redimension the array 100 times larger than its current size. That is, if an array is initially allocated with five elements, before the sixth element is inserted, the array is redimensioned to 500 elements. Clearly, this approach greatly reduces the number of redimensionings that needs to occur, but, if only a few more elements are added to the array then hundreds of array elements have been unused, resulting in wasted space.
A tried and true compromise to this problem is to simply double the existing size of the array when free space becomes exhausted. So, for an array initially allocated with five elements, adding a sixth element would cause the array to be redimensioned to a size of 10. This is precisely the approach the ArrayList class takes and, best of all, it is all performed for you.
The asymptotic running time of the ArrayList's operations are the same as those of the standard array. While the ArrayList does indeed have more overhead, especially when storing value types, the relationship between the number of elements in the ArrayList and the cost per operation is the same as the standard array.
Conclusion
This article started our discussion on data structures by identifying why studying data structures was important, and by providing a means of how to analyze the performance of data structures. This material is important to understand as being able to analyze the running times of various data structure operations is a useful tool when deciding what data structure to use for a particular programming problem.
After studying how to analyze data structures, we turned to examining two of the most common data structures in the .NET Framework Base Class Library—System.Array class and System.Collections.ArrayList. Arrays allow for a contiguous block of homogeneous types. Their main benefit is that they provide lightning-fast access to reading and writing array elements. Their weak point lies in searching arrays, as each and every element must potentially be visited (in an unsorted array).
The ArrayList provides a more flexible array-like data structure. Rather than enforcing homogeneous types, the ArrayList allows for heterogeneous types to be stored by using an array of objects. Furthermore, the ArrayList does not require explicit allocation and can gracefully grow as the more elements are added.
In the next part of this article series we'll turn our attention to the Stack and Queue classes. We'll also look at associative arrays, which are arrays indexed by a string key as opposed to an integer value. Associative arrays are provided in the .NET Framework Base Class Library through the Hashtable class.