MSXML Features
Next, let's examine some important scenarios associated with the Document Object Model (DOM), including loading, saving, walking a DOM tree, and creating a new DOM tree in memory.
DOM
The MSXML Document Object Model ("Microsoft.XMLDOM," CLSID_DOMDocument, IID_IXMLDOMDocument) is the starting point for all XML processing within the MSXML parser. The fastest way to load an XML document is to use the default "rental" threading model (which means the DOM document can be used by only one thread at a time; it doesn't matter which thread) with validateOnParse, resolveExternals, and preserveWhiteSpace all disabled:
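// Create the DOM document and turn off the optional work that slows loading.
var doc = new ActiveXObject("Microsoft.XMLDOM");
doc.validateOnParse = false;    // skip validation while parsing
doc.resolveExternals = false;   // do not resolve external DTDs or entities
doc.preserveWhiteSpace = false; // do not keep white space between elements
doc.load("test.xml");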
Working Set
When using the DOM, the first metric to consider is the working set. Memory is used to load Msxml.dll and the other .dll files on which it depends. Some of these other .dll files are "delay loaded," which means the working set won't be affected until that .dll is used. MSXML is a COM DLL, so you typically use the standard COM APIs (CoInitialize and CoCreateInstance) to create a new XML document object. The minimum working set for a simple Visual C++ 6.0 command-line application that uses COM is about one megabyte. (This includes the following .dll files: Ntdll.dll, Kernel32.dll, Ole32.dll, Rpcrt4.dll, Advapi32.dll, Gdi32.dll, User32.dll, and Oleaut32.dll.) The first call to CoCreateInstance for an IXMLDOMDocument object loads Msxml.dll and Shlwapi.dll, which adds another 745 KB on top of this. Once all the .dll files are loaded, each new IXMLDOMDocument object takes only about 8 KB.
The memory used by the XML data loaded into an XML document is anywhere from one to four times the size of the XML file on disk, depending on the "tagginess" of the data being loaded and whether the file was already in a Unicode format on disk. The following is a very rough formula for estimating the memory required for a given XML document:
ws = 32(n+t) + 12t + 50u + 2w;
The following table describes the parts of the formula:
Part    Description
ws      The working set in bytes.
n       The number of element and attribute nodes in the tree. Each element, attribute, attribute value, and text content has one node (for example, <element attribute = "value">text</element> = four nodes).
t       The number of text nodes.
u       The number of unique element and attribute names.
w       The number of Unicode characters in text content (including attribute values). Note that loading single-byte ANSI text into memory results in twice the number, because all text is stored as Unicode characters, which are two bytes each.
This assumes you do not set the preserveWhiteSpace flag; when you do, more nodes are created to preserve the white space between elements, using more memory.
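As a minimal sketch, the estimate above can be written as a small function. The function name estimateWorkingSet and the node counts are hypothetical and chosen only for illustration; they are not measurements of any sample file. The sketch assumes a JavaScript runtime that provides console.log (for example, Node.js) and that preserveWhiteSpace is left off, as the formula requires.

```javascript
// Rough working-set estimate, following the article's formula:
//   ws = 32(n + t) + 12t + 50u + 2w
// n: element and attribute nodes    t: text nodes
// u: unique element/attribute names w: Unicode characters of text content
function estimateWorkingSet(n, t, u, w) {
  return 32 * (n + t) + 12 * t + 50 * u + 2 * w;
}

// Hypothetical counts, for illustration only.
var counts = { n: 1000, t: 600, u: 20, w: 15000 };
var bytes = estimateWorkingSet(counts.n, counts.t, counts.u, counts.w);

// 32 * 1600 + 12 * 600 + 50 * 20 + 2 * 15000 = 89400
console.log(bytes + " bytes"); // prints "89400 bytes"
```

Because the formula is described as very rough, the result is best read as an order-of-magnitude figure rather than a prediction.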
For the sample data above, we see the following working set numbers (not including the initial startup working set):
Sample          Working set (bytes)   Ratio to file size
Ado.xml         4,689,920             2.16
Hamlet.xml      704,512               1.25
Ot.xml          10,720,000            1.39
Northwind.xml   249,856               0.51
An element-heavy XML document containing a lot of white space between elements and stored in Unicode can actually be smaller in memory than on disk. Files that have a more balanced ratio of elements to text content, such as Hamlet.xml and Ot.xml, end up at about 1.25 to 1.5 times the UCS-2 file size when in memory. Files that are very data-dense, such as Ado.xml, end up more than twice the disk-file size when loaded into memory.
Megabytes Per Second
For the megabytes-per-second metric, I loaded each sample file 10 times in a loop on a Pentium II 450-MHz dual-processor computer running Windows 2000, measured the load times, and averaged the results.
Sample          Load time (ms)   MB/second   Nodes/second
Ado.xml         677              3.2         184,909
Hamlet.xml      104              5.3         116,432
Ot.xml          1,063            7.2         111,682
Northwind.xml   62               7.8         103,887
The table also shows nodes per second. Notice that this figure moves in the opposite direction from megabytes per second: the more nodes the parser has to process per buffer of input data, the lower the absolute throughput. Conversely, the more compact the nodes are (as in Ado.xml), the higher the nodes per second.
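The measurement described in this section (load a file repeatedly, average the times, then derive the two rates) could be sketched roughly as follows. The helper names averageLoadTime and throughput are hypothetical, the input numbers are made up for illustration, and treating one megabyte as 1,000,000 bytes is an assumption, since the article does not say which definition it used.

```javascript
// Calls loadFn `runs` times and returns the average elapsed milliseconds.
// Under Windows Script Host, loadFn might wrap a synchronous
// doc.load("test.xml") call; here it is just any function.
function averageLoadTime(loadFn, runs) {
  var total = 0;
  for (var i = 0; i < runs; i++) {
    var start = Date.now();
    loadFn();
    total += Date.now() - start;
  }
  return total / runs;
}

// Converts an average load time into the two throughput metrics.
// Assumption: 1 MB = 1,000,000 bytes.
function throughput(fileBytes, nodeCount, avgMs) {
  var seconds = avgMs / 1000;
  return {
    mbPerSecond: fileBytes / 1000000 / seconds,
    nodesPerSecond: nodeCount / seconds
  };
}

// Hypothetical file: 2,000,000 bytes, 100,000 nodes, 500 ms average load.
var result = throughput(2000000, 100000, 500);
console.log(result.mbPerSecond, result.nodesPerSecond); // prints "4 200000"
```

Keeping the timing loop separate from the rate calculation makes it easy to see why a file with many small nodes can score high on nodes per second while scoring low on megabytes per second.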
Attributes vs. Elements
You could conclude from this that attribute-heavy formats (such as that of Ado.xml) deliver more data per second than element-heavy formats. But this should not be the reason for you to switch everything to attributes. There are many other factors to consider in the decision to use attributes versus elements.