分享
 
 
 

Data Structures with .NET - Part 1: An Introduction to Data Structures

王朝c#·作者佚名  2006-01-09
窄屏简体版  字體: |||超大  

An Extensive Examination of Data Structures

Part 1: An Introduction to Data Structures

Scott Mitchell

4GuysFromRolla.com

October 2003

Summary: This article kicks off a six-part series that focuses on important data structures and their use in application development. We'll examine both built-in data structures present in the .NET Framework, as well as essential data structures we'll have to build ourselves. This first installment focuses on defining what data structures are, how the efficiency of data structures is analyzed, and why this analysis is important. In this article, we'll also examine the Array and ArrayList, two of the most commonly used data structures present in the .NET Framework. (12 printed pages)

Contents

Introduction

Analyzing the Performance of Data Structures

Everyone's Favorite Linear, Direct Access, Homogeneous Data Structure

The ArrayList: a Heterogeneous, Self-Redimensioning Array

Conclusion

Introduction

Welcome to the first in a six-part series on using data structures in .NET. Throughout this article series we will be examining a variety of data structures, some of which are included in the .NET Framework Base Class Library and others that we'll build ourselves. If you're unfamiliar with the term, data structures are abstract structures, or classes, that are used to organize data and provide various operations upon their data. The most common and likely well-known data structure is the array, which contains a contiguous collection of data items that can be accessed by an ordinal index.

Before jumping into the content for this article, let's first take a quick peek at the roadmap for this six-part article series, so that you can see what lies ahead. If there are any topics you think are missing from this outline, I invite you to e-mail me at mitchell@4guysfromrolla.com and share your thoughts. Space permitting, I'll be happy to add your suggestions to the appropriate installment or, if needed, add a seventh part to the series.

In this first part of the six-part series, we'll look at why data structures are important, and their effect on the performance of an algorithm. To determine a data structure's effect on performance, we'll need to examine how the various operations performed by a data structure can be rigorously analyzed. Finally, we'll turn our attention to two data structures present in the .NET Framework—the Array and ArrayList. Chances are you've used both of these data structures in past projects. In this article, we'll examine what operations they provide and the efficiency of these operations.

In Part 2, we'll explore the ArrayList class in more detail and examine its counterparts, the Queue class and Stack class. Like the ArrayList, both the Queue and Stack classes store a contiguous collection of data and are data structures available in the .NET Framework Base Class Library. However, unlike an ArrayList from which you can retrieve any data item, Queues and Stacks only allow data to be accessed in a predetermined sequential order. We'll examine some applications of Queues and Stacks, and see how to implement both of these classes by extending the ArrayList class. After examining Queues and Stacks, we'll look at HashTables, which allow for direct access like an ArrayList, but store data indexed by a string key.

While ArrayLists are ideal for directly accessing and storing contents, they are suboptimal candidates when the data needs to be searched. In Part 3, we'll examine the binary search tree data structure, which provides a much more efficient means for searching than the ArrayList. The .NET Framework does not include any built-in binary search tree data structures, so we will have to build our own.

The efficiency of searching a binary search trees is sensitive to the order with which the data was inserted into the tree. If the data was inserted in sorted or near-sorted order, the binary search tree loses virtually all of its efficiency advantages over the ArrayList. To combat this issue, in Part 4 we'll examine an interesting randomized data structure—the SkipList. SkipLists provide the efficiency of searching a binary search tree, but without the sensitivity to the order with which data is entered.

In Part 5 we'll turn our attention to data structures that can be used to represent graphs. A graph is a collection of nodes, with a set of edges connecting the various nodes. For example, a map can be visualized as a graph, with cities as nodes and the highways between them as edged between the nodes. Many real-world problems can be abstractly defined in terms of graphs, thereby making graphs an often-used data structure.

Finally, in Part 6 we'll look at data structures to represent sets and disjoint sets. A set is an unordered collection of items. Disjoint sets are a collection of sets that have no elements in common with one another. Both sets and disjoint sets have many uses in everyday programs, which we'll examine in detail in this final part.

Analyzing the Performance of Data Structures

When thinking about a particular application or programming problem, many developers (myself included) find themselves most interested in writing the algorithm to tackle the problem at hand, or adding cool features to the application to enhance the user's experience. Rarely, if ever, will you hear someone excited about what type of data structure they are using. However, the data structures used for a particular algorithm can greatly impact its performance. A common example is finding an element in a data structure. With an array, this process takes time proportional to the number of elements in the array. With binary search trees or SkipLists, the time required is sub-linear. When searching large amounts of data, the data structure chosen can make a difference in the application's performance that can be visibly measured in seconds or even minutes.

Since the data structure used by an algorithm can greatly affect the algorithm's performance, it is important that there exists a rigorous method by which to compare the efficiency of various data structures. What we, as developers utilizing a data structure, are primarily interested in is how the data structures performance changes as the amount of data stored increases. That is, for each new element stored by the data structure, how are the running times of the data structure's operations effected?

Consider a scenario in which you have a program that uses the System.IO.Directory.GetFiles(path) method to return the list of the files in a specified directory as a string array. Now, imagine that you wanted to search through the array to determine if an XML file existed in the list of files (namely one whose extension was .xml). One approach to do this would be to scan through the array and set some flag once an XML file was encountered. The code might look like so:

using System;

using System.Collections;

using System.IO;

public class MyClass

{

public static void Main()

{

string [] fs = Directory.GetFiles(@"C:\Inetpub\wwwroot");

bool foundXML = false;

int i = 0;

for (i = 0; i < fs.Length; i++)

if (String.Compare(Path.GetExtension(fs[i]), ".xml", true) == 0)

{

foundXML = true;

break;

}

if (foundXML)

Console.WriteLine("XML file found - " + fs[i]);

else

Console.WriteLine("No XML files found.");

}

}

Here we see that in the worst-case, when there is no XML file or the XML file is the last file in the list, we have to search through each element of the array exactly once. To analyze the array's efficiency at sorting, we must ask ourselves the following, "Assume that I have an array with n elements. If I add another element, so the array has n + 1 elements, what is the new running time?" (The term running time, despite its name, does not measure the absolute time it takes the program to run, but rather, it refers to the number of steps the program must perform to complete the given task at hand. When working with arrays, typically the steps considered are how many array accesses one needs to perform.) To search for a value in an array, we need to potentially visit every array value, so if we have n + 1 array elements, we might have to perform n + 1 checks. That is, the time it takes to search an array is linearly proportional to the number of elements in the array.

The sort of analysis described here is called asymptotic analysis, as it examines how the efficiency of a data structure changes as the data structure's size approaches infinity. The notation commonly used in asymptotic analysis is called big-Oh notation. The big-Oh notation to describe the performance of searching an array would be denoted as O(n). The large script O is where the terminology big-Oh notation comes from, and the n indicates that the number of steps required to search an array grows linearly as the size of the array grows.

A more methodical way of computing the asymptotic running time of a block of code is to do follow these simple steps:

1. Determine the steps that constitute the algorithm's running time. As aforementioned, with arrays, typically the steps considered are the read and write accesses to the array. For other data structures the steps might differ. Typically, you want to concern yourself with steps that involve the data structure itself, and not simple, atomic operations performed by the computer. That is, with the block of code above, I analyzed its running time by only counting how many times the array needs to be accessed, and did not bother worrying about the time for creating and initializing variables or the check to see if the two strings were equal.

2. Find the line(s) of code that perform the steps you are interested in counting. Put a 1 next to each of those lines.

3. For each line with a 1 next to it, see if it is in a loop. If so, change the 1 to 1 times the maximum number of repetitions the loop may perform. If you have two or more nested loops, continue the multiplication for each loop.

4. Find the largest single term you have written down. This is the running time.

Let's apply these steps to the block of code above. We've already identified that the steps we're interested in are the number of array accesses. Moving onto step 2 note that there are two lines on which the array, fs, is being accessed—as a parameter in the String.Compare() method and in the Console.WriteLine() method, so mark a 1 next to each line. Now, applying step 3 notice that the access to fs in the String.Compare() method occurs within a loop that runs at most n times (where n is the size of the array). So, scratch out the 1 in the loop and replace it with n. Finally, we see that the largest value is n, so the running time is denoted as O(n).

O(n), or linear-time, represents just one of a myriad of possible asymptotic running times. Others include O(log2 n), O(n log2 n), O(n2), O(2n), and so on. Without getting into the gory mathematical details of big-Oh, the lower the term inside the parenthesis for large values of n, the better the data structure's operation's performance. For example, an operation that runs in O(log n) is more efficient than one that runs in O(n) since log n < n.

Note In case you need a quick mathematics refresher, loga b = y is just another way to write ay = b. So, log2 4 = 2, since 22 = 4. Similarly, log2 8 = 3, since 23 = 8. Clearly, log2 n grows much slower than n alone, because when n = 8, log2 n = 3. In Part 3 we'll examine binary search trees whose search operation provides an O(log2 n) running time.

Throughout this article series, we'll be computing each new data structure and its operations asymptotic running time and comparing it to the running time for similar operations on other data structures.

Everyone's Favorite Linear, Direct Access, Homogeneous Data Structure—The Array

Arrays are one of the simplest and most widely used data structures in computer programs. Arrays in any programming language all share a few common properties:

· The contents of an array are stored in contiguous memory.

· All of the elements of an array must be of the same type; hence arrays are referred to as homogeneous data structures.

· Array elements can be directly accessed. (This is not necessarily the case for many other data structures. For example, in part 4 of this article series we'll examine a data structure called the SkipList. To access a particular element of a SkipList you must search through other elements until you find the element for which you're looking. With arrays, however, if you know you want to access the ith element, you can simply use one line of code: arrayName[i].)

The common operations performed on arrays are:

· Allocation

· Accessing

· Redimensioning

When an array is initially declared in C# it has a null value. That is, the following line of code simply creates a variable named booleanArray that equals null:

bool [] booleanArray;

Before we can begin to work with the array, we must allocate a specified number of elements. This is accomplished using the following syntax:

booleanArray = new bool[10];

Or more generically:

arrayName = new arrayType[allocationSize];

This allocates a contiguous block of memory in the CLR-managed heap large enough to hold the allocationSize number of arrayTypes. If arrayType is a value type, then allocationSize number of unboxed arrayType values are created. If arrayType is a reference type, then allocationSize number of arrayType references are created. (If you are unfamiliar with the difference between reference and value types and the managed heap versus the stack, check out Understanding .NET's Common Type System.)

To help hammer home how the .NET Framework stores the internals of an array, consider the following example:

bool [] booleanArray;

FileInfo [] files;

booleanArray = new bool[10];

files = new FileInfo[10];

Here, the booleanArray is an array of the value type System.Boolean, while the files array is an array of a reference type, System.IO.FileInfo. Figure 1 shows a depiction of the CLR-managed heap after these four lines of code have executed.

[url=http://msdn.microsoft.com/library/en-us/dv_vstechart/html/datastructures-fig01.gif]

Figure 1. The contents of an array are laid out contiguously in the managed heap.

The thing to keep in mind is that the ten elements in the files array are references to FileInfo instances. Figure 2 hammers home this point, showing the memory layout if we assign some of the values in the files array to FileInfo instances.

[url=http://msdn.microsoft.com/library/en-us/dv_vstechart/html/datastructures-fig02.gif]

Figure 2. The contents of an array are laid out contiguously in the managed heap.

All arrays in .NET provide allow their elements to both be read and written to. The syntax for accessing an array element is:

// Read an array element

bool b = booleanArray[7];

// Write to an array element

booleanArray[0] = false;

The running time of an array access is denoted O(1) because it is constant. That is, regardless of how many elements are stored in the array, it takes the same amount of time to lookup an element. This constant running time is possible solely because an array's elements are stored contiguously, hence a lookup just requires knowledge of the array's starting location in memory, the size of each array element, and the element to be indexed.

Realize that in managed code, array lookups are a bit more involved than this because with each array access the CLR checks to ensure that the index being requested is within the array's bounds. If the array index specified is out of bounds, an IndexOutOfRangeException is thrown. This check helps ensure that when stepping through an array we do not accidentally step past the last array index and into some other memory. This check, though, does not affect the running time of an array access because the time to perform such checks does not increase as the size of the array increases.

Note This index-bounds check comes at a slight cost of performance for applications that make a large number of array accesses. With a bit of unmanaged code, though, this index out of bounds check can be bypassed. For more information, refer to Chapter 14 of Applied Microsoft .NET Framework Programming by Jeffrey Richter.

When working with an array, you might need to change the number of elements it holds. To do so, you'll need to create a new array instance of the specified size and then copy over the contents of the old array into the new, resized array. This process is called redimensioning, and can be accomplished with the following code:

using System;

using System.Collections;

public class MyClass

{

public static void Main()

{

// Create an integer array with three elements

int [] fib = new int[3];

fib[0] = 1;

fib[1] = 1;

fib[2] = 2;

// Redimension message to a 10 element array

int [] temp = new int[10];

// Copy the fib array to temp

fib.CopyTo(temp, 0);

// Assign temp to fib

fib = temp;

}

}

After the last line of code, fib references a ten-element Int32 array. The elements 3 through 9 in the fib array will have the default Int32 value—0.

Arrays are excellent data structures to use when storing a collection of heterogeneous types that you only need to access directly. Searching an unsorted array has linear running time. While this is acceptable when working with small arrays, or when performing very few searches, if your application is storing large arrays that are searched frequently, there are a number of other data structures better suited for the job. We'll look at some such data structures in upcoming pieces of this article series. (Realize that if you are searching an array on some property and the array is sorted by that property, you can use an algorithm called binary search to search the array in O(log n) running time, which is on par with the search times for binary search trees. In fact, the Array class contains a static BinarySearch() method. For more information on this method, check out an earlier article of mine, Efficiently Searching a Sorted Array.

Note The .NET Framework allows for multi-dimensional arrays as well. Multi-dimensional arrays, like single-dimensional arrays, offer a constant running time for accessing elements. Recall that the running time to search through an n-element single dimensional array was denoted O(n). For an nxn two-dimensional array, the running time is denoted O(n2) because the search must check n2 elements. More generally, a k-dimensional array has a search running time of O(nk).

The ArrayList: a Heterogeneous, Self-Redimensioning Array

While arrays definitely have their time and place, arrays create some limitations on design because a single array can only store elements of one type (homogeneity), and when using arrays you must specifically allocate a certain number of elements. Oftentimes, though, developers want something more flexible—a simple collection of objects of potentially different types that can be easily managed without having to worry about allocation issues. The .NET Framework Base Class Library provides such a data structure called the System.Collections.ArrayList.

An example of the ArrayList in action can be seen in the code snippet below. Note that with the ArrayList any type can be added and no allocation step has to be performed, and in fact, elements of different types can be added to the ArrayList. Furthermore, at no point do we have to concern ourselves with redimensioning the ArrayList. All of this is handled behind the scenes for us.

ArrayList countDown = new ArrayList();

countDown.Add(5);

countDown.Add(4);

countDown.Add(3);

countDown.Add(2);

countDown.Add(1);

countDown.Add("blast off!");

countDown.Add(new ArrayList());

Behind the scenes the ArrayList uses a System.Array of type object. Since all types are derived either directly or indirectly from object, an object array can hold elements of any type. By default, an ArrayList creates a 16-element object array, although the precise size can be specified through a parameter in the constructor or the Capacity property. When adding an element thought the Add() method, the number of elements in the internal array is checked with the array's capacity. If adding the new element causes the count to exceed the capacity, the capacity is automatically doubled and the array is redimensioned.

The ArrayList, like the array, can be directly indexed using the same syntax:

// Read access

int x = (int) countDown[0];

string y = (string) countDown[5];

// Write access

countDown[1] = 5;

// ** WILL GENERATE AN ArgumentOutOfRange EXCEPTION **

countDown[7] = 5;

Since the ArrayList stores an array of objects, when reading the value from an ArrayList you need to explicitly cast it to the data type being stored in the specified location. Also, note that if you try to reference an ArrayList element greater than the ArrayList's size, a System.ArgumentOutOfRange exception will be thrown.

While the ArrayList provides added flexibility over the standard array, this flexibility comes at the cost of performance, especially when storing value types in an ArrayList. Recall that an array of a value type—such as a System.Int32, System.Double, System.Boolean, and so on—is stored contiguously in the managed heap in its unboxed form. The ArrayList's internal array, however, is an array of object references. Therefore, even if you have an ArrayList that stores nothing but value types, each ArrayList element is a reference to a boxed value type, as shown in Figure 3.

[url=http://msdn.microsoft.com/library/en-us/dv_vstechart/html/datastructures-fig03.gif]

Figure 3. The ArrayList contains a contiguous block of object references

The boxing and unboxing, along with the extra level of indirection that comes with using value types in an ArrayList, can hamper the performance of your application when using large ArrayLists with many reads and writes. As Figure 3 illustrates, the same memory layout occurs for reference types in both ArrayLists and arrays.

The ArrayList's self-redimensioning shouldn't cause any sort of performance degradation in comparison to an array. If you know the precise number of elements that need to be stored in the ArrayList, you can essentially turn off self-redimensioning by specifying the initial capacity in the ArrayList's constructor. If you don't know the precise size, even with an array you may have to redimension the array should the number of elements inserted exceed the array's size.

A classic computer science problem is determining how much new space to allocate when running out of a space in some buffer. One option when redimensioning an array is to allocate just one more element in the resized array. That is, if the array is initially allocated with five elements, before the sixth element is inserted, the array is redimensioned to six elements. Clearly, this approach conserves the most memory, but can become costly because each insert following the first redimensioning results in another redimensioning.

Another option, at the opposite end of the spectrum, is to redimension the array 100 times larger than its current size. That is, if an array is initially allocated with five elements, before the sixth element is inserted, the array is redimensioned to 500 elements. Clearly, this approach greatly reduces the number of redimensionings that needs to occur, but, if only a few more elements are added to the array then hundreds of array elements have been unused, resulting in wasted space.

A tried and true compromise to this problem is to simply double the existing size of the array when free space becomes exhausted. So, for an array initially allocated with five elements, adding a sixth element would cause the array to be redimensioned to a size of 10. This is precisely the approach the ArrayList class takes and, best of all, it is all performed for you.

The asymptotic running time of the ArrayList's operations are the same as those of the standard array. While the ArrayList does indeed have more overhead, especially when storing value types, the relationship between the number of elements in the ArrayList and the cost per operation is the same as the standard array.

Conclusion

This article started our discussion on data structures by identifying why studying data structures was important, and by providing a means of how to analyze the performance of data structures. This material is important to understand as being able to analyze the running times of various data structure operations is a useful tool when deciding what data structure to use for a particular programming problem.

After studying how to analyze data structures, we turned to examining two of the most common data structures in the .NET Framework Base Class Library—System.Array class and System.Collections.ArrayList. Arrays allow for a contiguous block of homogeneous types. Their main benefit is that they provide lightning-fast access to reading and writing array elements. Their weak point lies in searching arrays, as each and every element must potentially be visited (in an unsorted array).

The ArrayList provides a more flexible array-like data structure. Rather than enforcing homogeneous types, the ArrayList allows for heterogeneous types to be stored by using an array of objects. Furthermore, the ArrayList does not require explicit allocation and can gracefully grow as the more elements are added.

In the next part of this article series we'll turn our attention to the Stack and Queue classes. We'll also look at associative arrays, which are arrays indexed by a string key as opposed to an integer value. Associative arrays are provided in the .NET Framework Base Class Library through the Hashtable class.

 
 
 
免责声明:本文为网络用户发布,其观点仅代表作者个人观点,与本站无关,本站仅提供信息存储服务。文中陈述内容未经本站证实,其真实性、完整性、及时性本站不作任何保证或承诺,请读者仅作参考,并请自行核实相关内容。
2023年上半年GDP全球前十五强
 百态   2023-10-24
美众议院议长启动对拜登的弹劾调查
 百态   2023-09-13
上海、济南、武汉等多地出现不明坠落物
 探索   2023-09-06
印度或要将国名改为“巴拉特”
 百态   2023-09-06
男子为女友送行,买票不登机被捕
 百态   2023-08-20
手机地震预警功能怎么开?
 干货   2023-08-06
女子4年卖2套房花700多万做美容:不但没变美脸,面部还出现变形
 百态   2023-08-04
住户一楼被水淹 还冲来8头猪
 百态   2023-07-31
女子体内爬出大量瓜子状活虫
 百态   2023-07-25
地球连续35年收到神秘规律性信号,网友:不要回答!
 探索   2023-07-21
全球镓价格本周大涨27%
 探索   2023-07-09
钱都流向了那些不缺钱的人,苦都留给了能吃苦的人
 探索   2023-07-02
倩女手游刀客魅者强控制(强混乱强眩晕强睡眠)和对应控制抗性的关系
 百态   2020-08-20
美国5月9日最新疫情:美国确诊人数突破131万
 百态   2020-05-09
荷兰政府宣布将集体辞职
 干货   2020-04-30
倩女幽魂手游师徒任务情义春秋猜成语答案逍遥观:鹏程万里
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案神机营:射石饮羽
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案昆仑山:拔刀相助
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案天工阁:鬼斧神工
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案丝路古道:单枪匹马
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案镇郊荒野:与虎谋皮
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案镇郊荒野:李代桃僵
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案镇郊荒野:指鹿为马
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案金陵:小鸟依人
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案金陵:千金买邻
 干货   2019-11-12
 
推荐阅读
 
 
 
>>返回首頁<<
 
靜靜地坐在廢墟上,四周的荒凉一望無際,忽然覺得,淒涼也很美
© 2005- 王朝網路 版權所有