Inside MSXML Performance
Chris Lovett
Microsoft Corporation
February 21, 2000
Download the source code for this article (1.17MB)
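Contents
Metrics
MSXML Features
Working Set
Megabytes per Second
Attributes vs. Elements
First DOM Walk Working Set Delta
Up-Front createNode
Walking vs. selectSingleNode
Save
Namespaces
Free-Threaded Documents
Delayed Memory Cleanup
Virtual Memory
IDispatch
Scripting
The Dreaded "//" Operator
Prune the Search Tree
Cross-Threading Models
Summary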
I definitely got the message from your online comments that we need more "novice-level" material and some real XML applications. However, this article was already in the pipeline, and it is intended for the advanced XML developer. (After all, this column is called "Extreme XML"!) Accordingly, this article assumes you are familiar with XML and the Microsoft XML Parser (MSXML) in particular. See the MSDN XML Developer's Center for more information.
So, you're designing your XML-based Web application and you need to know what kind of performance to expect from your XML server. Obviously, this depends a lot on what processing you plan to do. It is hard to generalize, because there are so many variables—such as the size of the XML documents, the amount of script code required to process the documents, the amount of output generated, and so on.
For example, major variables that can affect the performance of MSXML include:
· The kind of XML data
· The ratio of tags to text
· The ratio of attributes to elements
· The amount of discarded white space
To illustrate some of these variables, I'll use four sample data files. A snippet from each file is shown below:
Ado.xml
This sample file is a persisted ADO Recordset object and is extremely attribute-heavy. Each attribute value is short, with little wasted white space, making it a data-dense document.
<rsSchema:row au_id='267-41-2394' au_lname='O&apos;Leary' au_fname='Michael'
phone='408 286-2428' address='22 Cleveland Av. #14' city='San Jose' state='CA'
zip='95128' contract='True' name='systypes' id='4' uid='1' type='S ' userstat='0'
sysstat='113' indexdel='0' schema_ver='1' refdate='1900-01-01T00:00:00'
crdate='1996-04-03T03:38:57.387000000' version='0' deltrig='0' instrig='0'
updtrig='0' seltrig='0' category='0' cache='0'/>
Hamlet.xml
This sample file consists of Shakespeare's play "Hamlet." The file is a well-balanced combination of text and element markup, with no attributes.
<SCENE><TITLE>SCENE I. Elsinore. A platform before the castle.</TITLE>
<STAGEDIR>FRANCISCO at his post. Enter to him BERNARDO</STAGEDIR>
<SPEECH>
<SPEAKER>BERNARDO</SPEAKER>
<LINE>Who's there?</LINE>
</SPEECH>
Ot.xml
This sample file consists of the entire Old Testament. Each tag is only one or two characters, which reduces the tag-to-text ratio.
<book>
<bktlong>The First Book of Moses, Called GENESIS.</bktlong>
<bktshort>Genesis</bktshort>
<chapter><chtitle>Chapter 1</chtitle>
<v><vn>1</vn><p>In the beginning God created the heaven and the earth.</p></v>
...
Northwind.xml
This sample file contains a portion of the Northwind database that ships with Microsoft Access. It uses elements instead of attributes, has a high tag-to-text ratio, and contains a lot of extra white space.
<OrderIDs>
<Item>
<OrderID> 10326</OrderID>
<OrderDate> 11/10/94</OrderDate>
<ShipAddress> C/ Araquil, 67</ShipAddress>
</Item>
...
Another major factor is whether the original file is stored as UCS-2. For most XML documents in English, UTF-8 is half the size of UCS-2 because the Latin characters compress down to a single byte in UTF-8. But this is not true for all languages. For some Asian languages, UTF-8 is actually larger than UCS-2, because it can expand to three bytes per character in the worst case. To be fair, the best format to use for measuring performance is UCS-2 on disk so that the numbers are more globally meaningful.
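The size difference between the two encodings is easy to check directly. The following is a minimal Python sketch, not part of the article's download; the helper name `encoded_sizes` and the Japanese sample string are hypothetical, and UTF-16 little-endian without a byte-order mark stands in for UCS-2, which is equivalent for characters in the Basic Multilingual Plane.

```python
# Compare the byte size of the same markup in UTF-8 and in UCS-2.
# Assumption: "utf-16-le" (no byte-order mark) stands in for UCS-2; the two
# agree for every character in the Basic Multilingual Plane.

def encoded_sizes(text):
    """Return (utf8_bytes, ucs2_bytes) for the given text."""
    utf8_bytes = len(text.encode("utf-8"))
    ucs2_bytes = len(text.encode("utf-16-le"))  # 2 bytes per BMP character
    return utf8_bytes, ucs2_bytes

# A line taken from the Hamlet.xml snippet above.
english = "<LINE>Who's there?</LINE>"
# Hypothetical sample with Japanese element names and text.
japanese = "<行>そこにいるのは誰だ</行>"

for label, sample in (("English", english), ("Japanese", japanese)):
    utf8_bytes, ucs2_bytes = encoded_sizes(sample)
    print(f"{label}: UTF-8 = {utf8_bytes} bytes, UCS-2 = {ucs2_bytes} bytes")
```

For the all-ASCII sample, UTF-8 is exactly half the UCS-2 size. For the Japanese sample, each kana or kanji character takes three bytes in UTF-8 against two in UCS-2, so UTF-8 comes out larger.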
The following table shows the UCS-2 file sizes, number of unique names, number of elements and attributes, number of text nodes, and amount of text content (in Unicode characters) for each of our sample files. It also shows a "tagginess factor," which is the percentage of the file size taken up by element and attribute name characters.
Sample          File size (bytes)   Unique names   Elements and attributes   Text nodes   Text content (characters)   Tagginess (percentage)
Ado.xml         2,171,812           53             63,722                    61,462       3,890                       18.7
Hamlet.xml      559,260             17             6,637                     5,472        170,545                     5.9
Ot.xml          7,663,624           12             71,417                    47,302       3,236,900                   1.4
Northwind.xml   488,140             12             3,680                     2,761        31,155                      6.0
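One plausible way to compute a tagginess figure for a document is sketched below in Python. The counting rule is an assumption, since the article does not spell it out: every element name and attribute name is counted once per occurrence (end tags are not counted again), at two bytes per character, and the total is divided by the UCS-2 size of the file. Python's `xml.etree.ElementTree` stands in for MSXML here, the function name `tagginess` is hypothetical, and the sketch may not reproduce the table's numbers exactly.

```python
import xml.etree.ElementTree as ET

def tagginess(xml_text):
    """Percentage of the UCS-2 file size taken up by element and attribute names.

    Assumed counting rule: each element or attribute name counts once per
    occurrence, at 2 bytes per character. Documents that use namespaces would
    need extra handling, because ElementTree expands prefixed names.
    """
    root = ET.fromstring(xml_text)
    name_chars = 0
    for elem in root.iter():
        name_chars += len(elem.tag)                       # element name
        name_chars += sum(len(a) for a in elem.attrib)    # attribute names
    file_bytes = len(xml_text.encode("utf-16-le"))        # UCS-2 size, no BOM
    return 100.0 * (name_chars * 2) / file_bytes

# A tiny attribute-heavy document, loosely modeled on the Ado.xml sample.
sample = "<row id='1' city='San Jose'/>"
print(f"tagginess: {tagginess(sample):.1f}%")
```

An attribute-heavy row like this one scores high because the names make up a large share of a short line, which matches the pattern the table shows for Ado.xml relative to the text-heavy samples.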
The number of unique names is interesting because MSXML "atomizes" element and attribute names, meaning it creates only one string object for each unique name and points to that object from each element or attribute that shares the same name. This is important because the names of elements and attributes are typically highly repetitive. For example, the Ado.xml sample actually contains 63,722 element and attribute names, which consume a total of 407,148 bytes of the overall file size. This is a tag-to-file size ratio of over 18 percent! But among all these names there are only 53 unique ones. So instead of using 407 KB of memory to store them, MSXML can store them in just a few kilobytes.
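The idea behind atomization can be illustrated with a small name table. This is a minimal Python sketch of the general technique (string interning through a dictionary), not MSXML's actual implementation; the class name `NameTable` and its methods are hypothetical.

```python
class NameTable:
    """Maps each distinct name to a single shared string object."""

    def __init__(self):
        self._atoms = {}

    def atomize(self, name):
        # Return the stored object for this name, storing it on first sight.
        return self._atoms.setdefault(name, name)

    def __len__(self):
        return len(self._atoms)


table = NameTable()

# Build the names at run time so each occurrence starts as its own string
# object, much as a parser would produce them while scanning the input.
raw_names = ["".join(["au_", suffix]) for suffix in ("id", "lname", "id", "id")]
atoms = [table.atomize(name) for name in raw_names]

print(len(raw_names), "names parsed,", len(table), "unique names stored")
```

Every node that shares a name then holds a reference to the same object, so memory for names grows with the number of unique names (53 for Ado.xml) rather than with the number of occurrences (63,722). A side benefit of this design is that two atomized names can be compared by identity instead of character by character.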