RFC1951的部分翻译及原文(2/2)

3. Detailed specification

3.1. Overall conventions In the diagrams below, a box like this:

下面的图形表示一个字节：

+---+

| | <-- the vertical bars might be missing

+---+

下面的图形表示若干字节：

represents one byte; a box like this:

+==============+

| |

+==============+

represents a variable number of bytes.

计算机中所存贮的字节并不存在“位顺序”，因为字节本身被看作是一个单元。

但是，当一个字节被看作是一个0到255之间的整数时，就会有一些最重要的或是最不重

要的位。通常我们会将一个字节中最重要的位写在左边，将几个字节中，最重要的字节

写在左边。在图表中，我们将一个字节中的各位标上序号：位0表示最不重要的位等等：

Bytes stored within a computer do not have a "bit order", since

they are always treated as a unit. However, a byte considered as

an integer between 0 and 255 does have a most- and least-

significant bit, and since we write numbers with the most-

significant digit on the left, we also write bytes with the most-

significant bit on the left. In the diagrams below, we number the

bits of a byte so that bit 0 is the least-significant bit, i.e.,

the bits are numbered:

+--------+

|76543210|

+--------+

在计算机中，一个数可能占用几个字节。这里所说的多字节数据都是将不重要的

部分存贮在低地址的字节中，如520被保存为：

Within a computer, a number may occupy multiple bytes. All

multi-byte numbers in the format described here are stored with

the least-significant byte first (at the lower memory address).

For example, the decimal number 520 is stored as:

0 1

+--------+--------+

|00001000|00000010|

+--------+--------+

^ ^

| |

| + more significant byte = 2 x 256

+ less significant byte = 8

3.1.1. Packing into bytes

这篇文档中没有指明在位连续的传输媒介中，字节中的位顺序。因为这

里所描述的最终数据格式是字节，而不是位。但是，我们设计了下面的压缩块格式，其中

的数据元素是不定长的位序列，而不是字节。因此我们必须指明如何将这些位组合成字节

序列。

This document does not address the issue of the order in which

bits of a byte are transmitted on a bit-sequential medium,

since the final data format described here is byte- rather than

bit-oriented. However, we describe the compressed block format

in below, as a sequence of data elements of various bit

lengths, not a sequence of bytes. We must therefore specify

how to pack these data elements into bytes to form the final

compressed byte sequence:

数据元素(位)被打包到字节中时，是以逐渐增加的位序号排列的。

如：以最不重要的位开始。

数据元素(除了Huffman编码外)以最不重要的位开始。

Huffman 编码以最重要的位开始。

* Data elements are packed into bytes in order of

increasing bit number within the byte, i.e., starting

with the least-significant bit of the byte.

* Data elements other than Huffman codes are packed

starting with the least-significant bit of the data

element.

* Huffman codes are packed starting with the most-

significant bit of the code.

换句话说，如果想要以字节序列的方式打印压缩数据，从最右边的空白

开始向最左边进行。将每字节中最重要的位放在左边。

In other words, if one were to print out the compressed data as

a sequence of bytes, starting with the first byte at the

*right* margin and proceeding to the *left*, with the most-

significant bit of each byte on the left as usual, one would be

able to parse the result from right to left, with fixed-width

elements in the correct MSB-to-LSB order and Huffman codes in

bit-reversed order (i.e., with the first bit of the code in the

relative LSB position).

3.2. Compressed block format 压缩的块结构

3.2.1. Synopsis of prefix and Huffman coding

前缀编码方式描述了一种可以从位序列预知的字母表。每个符号有一个

编码，其特点是不同的符号有不同的长度的位序列。但是解析器可以明白的解析出编码的

内容。

Prefix coding represents symbols from an apriori known

alphabet by bit sequences (codes), one code for each symbol, in

a manner such that different symbols may be represented by bit

sequences of different lengths, but a parser can always parse

an encoded string unambiguously symbol-by-symbol.

我们根据一个二进制树来进行前缀编码。两边的非叶子结点依次的标上

0,1，而叶子结点则一个一人的标记上字母表中的字符。这样，每个字符的编码就是一个

从根结点到叶子结点的0,1的序列了，如：

We define a prefix code in terms of a binary tree in which the

two edges descending from each non-leaf node are labeled 0 and

1 and in which the leaf nodes correspond one-for-one with (are

labeled with) the symbols of the alphabet; then the code for a

symbol is the sequence of 0's and 1's on the edges leading from

the root to the leaf labeled with that symbol. For example:

/\ Symbol Code

0 1 ------ ----

/ \ A 00

/\ B B 1

0 1 C 011

/ \ D 010

A / 0 1

/ D C

解码器可以解码输入的内容：让输入的流从这个树的根结点走到叶子

结点，每次的前进都选择与流中下一位相同的一边。

A parser can decode the next symbol from an encoded input

stream by walking down the tree from the root, at each step

choosing the edge corresponding to the next input bit.

给出一个已知使用频率的字母表，Huffman规则允许解释成一个最佳的

前缀编码，这样的编码被称作Huffman 编码。

Given an alphabet with known symbol frequencies, the Huffman

algorithm allows the construction of an optimal prefix code

(one which represents strings with those symbol frequencies

using the fewest bits of any possible prefix codes for that

alphabet). Such a code is called a Huffman code. (See

reference [1] in Chapter 5, references for additional

information on Huffman codes.)

注意，在"deflate"格式中，Huffman 编码对于不同的字母表来说

不可以超过最大的编码长度。这个规定使得通过使用频率来计算编码长度的算法变得复杂

了。详细内容见第五节。

Note that in the "deflate" format, the Huffman codes for the

various alphabets must not exceed certain maximum code lengths.

This constraint complicates the algorithm for computing code

lengths from symbol frequencies. Again, see Chapter 5,

references for details.

3.2.2. Use of Huffman coding in the "deflate" format 在"deflate"格式中使用Huffman编码

在"deflate"格式中，每一个字母表使用的Huffman编码方式都有两条附加规则：

The Huffman codes used for each alphabet in the "deflate"

format have two additional rules:

所有指定了位长度的编码都有相邻值。而且与他们所代表的符号有

相同的顺序。

* All codes of a given bit length have lexicographically

consecutive values, in the same order as the symbols

they represent;

较短的编码在索引时要先于长的编码。

* Shorter codes lexicographically precede longer codes.

我们可以依照这些规则来重新编码上面的例子，假定字母表中的

顺序是ABCD：

We could recode the example above to follow this rule as

follows, assuming that the order of the alphabet is ABCD:

Symbol Code

------ ----

A 10

B 0

C 110

D 111

0在10的前面，10在11X的前面。110和111是连续的。

I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are

lexicographically consecutive.

有了这个规则，我们可以通过按顺序给定字母表中每一个符号的位长度来

为一个字母表定义Huffman编码。在我们的例子中，编码由位长度的序列所完全定义。下面

的算法将编码生成为整数，规定为从最重要的位读向最不重要的位。编码长度为

tree[I].len，编码则存在于tree[I].Code中。

Given this rule, we can define the Huffman code for an alphabet

just by giving the bit lengths of the codes for each symbol of

the alphabet in order; this is sufficient to determine the

actual codes. In our example, the code is completely defined

by the sequence of bit lengths (2, 1, 3, 3). The following

algorithm generates the codes as integers, intended to be read

from most- to least-significant bit. The code lengths are

initially in tree[I].Len; the codes are produced in

tree[I].Code.

为每一个编码长度计算编码的数量。使bl_count[N]为长度为N(N>=1)

的编码的数量。

1) Count the number of codes for each code length. Let

bl_count[N] be the number of codes of length N, N >= 1.

在每一个编码长度中找出最小的数值。

2) Find the numerical value of the smallest code for each

code length:

code = 0;

bl_count[0] = 0;

for (bits = 1; bits <= MAX_BITS; bits++) {

code = (code + bl_count[bits-1]) << 1;

next_code[bits] = code;

}

将数值赋给所有的编码：以第二步中找到的(next_code[bits])为基础，为相同长度

的所有编码连续的赋值。不要为永远不使用的编码(一位的0)赋值。

3) Assign numerical values to all codes, using consecutive

values for all codes of the same length with the base

values determined at step 2. Codes that are never used

(which have a bit length of zero) must not be assigned a

value.

for (n = 0; n <= max_code; n++) {

len = tree[n].Len;

if (len != 0) {

tree[n].Code = next_code[len];

next_code[len]++;

}

Example:

Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,

3, 2, 4, 4). After step 1, we have:

N bl_count[N]

- -----------

2 1

3 5

4 2

Step 2 computes the following next_code values:

N next_code[N]

- ------------

1 0

2 0

3 2

4 14

Step 3 produces the following code values:

Symbol Length Code

------ ------ ----

A 3 010

B 3 011

C 3 100

D 3 101

E 3 110

F 2 00

G 4 1110

H 4 1111

3.2.3. Details of block format 块结构的详细资料

每个压缩数据块都以三个头位开始：

Each block of compressed data begins with 3 header bits

containing the following data:

first bit BFINAL 第一位

next 2 bits BTYPE 下二位

注意，头位不需要从一个字节的边界开始，因为“块”不需要占据完整

的若干字节。

Note that the header bits do not necessarily begin on a byte

boundary, since a block does not necessarily occupy an integral

number of bytes.

只有当前块是最后一块的时候，才设置BFINAL位。

BFINAL is set if and only if this is the last block of the data

set.

BTYPE指定了块数据是如何压缩的：

BTYPE specifies how the data are compressed, as follows:

00 - no compression 非压缩

01 - compressed with fixed Huffman codes 用固定的Huffman编码来压缩。

10 - compressed with dynamic Huffman codes 用动态的Huffman编码来压缩。

11 - reserved (error) 保留(错误)

两种压缩方案的唯一不同是：如何定义Huffman编码。

The only difference between the two compressed cases is how the

Huffman codes for the literal/length and distance alphabets are

defined.

在所有的情况下，解码的方案如下：

In all cases, the decoding algorithm for the actual data is as

follows:

从输入流读入块头部

if 非压缩存贮

跳过当前处理的字节中剩余的位

读入LEN和NLEN(见下一节)

向输出拷贝LEN字节。

else

if 动态的Huffman编码

读入编码树

loop(直到块的结尾)

从输入流解码出文字/长度值

if value < 256

向输出拷贝value(文字字节)

else

if value = 块的结尾(256)

break;(跳出loop)

else (value = 257到285)

从输入流解码出距离(distance)

在输出流中，向后(已经解码的部分)跳过distance

字节，拷贝当前的length字节

到输出流中。

end loop

while 不是最后一块时

read block header from input stream.

if stored with no compression

skip any remaining bits in current partially

processed byte

read LEN and NLEN (see next section)

copy LEN bytes of data to output

otherwise

if compressed with dynamic Huffman codes

read representation of code trees (see

subsection below)

loop (until end of block code recognized)

decode literal/length value from input stream

if value < 256

copy value (literal byte) to output stream

otherwise

if value = end of block (256)

break from loop

otherwise (value = 257..285)

decode distance from input stream

move backwards distance bytes in the output

stream, and copy length bytes from this

position to the output stream.

end loop

while not last block

注意，一个复制的参考串可能指向前一个块，如：向后的distnace可能

穿过一个或几个块边界。但是distance不可能越过输出串的开始。(一个使用预先设定的

字母表的应用程序可能会忽略输出流中的一部分，distance可以指出这部分)还要注意，

被参考的串可能会与当前位置重合，如：如果最后的两个字节解码后是X，Y，一个串提

及：<length = 5,distance = 2>,则向输出中增加X,Y,X,Y,X。

Note that a duplicated string reference may refer to a string

in a previous block; i.e., the backward distance may cross one

or more block boundaries. However a distance cannot refer past

the beginning of the output stream. (An application using a

preset dictionary might discard part of the output stream; a

distance can refer to that part of the output stream anyway)

Note also that the referenced string may overlap the current

position; for example, if the last 2 bytes decoded have values

X and Y, a string reference with <length = 5, distance = 2>

adds X,Y,X,Y,X to the output stream.

现在，我们按顺序说明每种压缩方法。

We now specify each compression method in turn.

3.2.4. Non-compressed blocks (BTYPE=00) 非压缩的块

忽略直到下一个字节边界的所有位，这块中的其它内容包含如下信息：

Any bits of input up to the next byte boundary are ignored.

The rest of the block consists of the following information:

0 1 2 3 4...

+---+---+---+---+================================+

| LEN | NLEN |... LEN bytes of literal data...|

+---+---+---+---+================================+

LEN是块中的字节数。NLEN是LEN的补码。

LEN is the number of data bytes in the block. NLEN is the

one's complement of LEN.

3.2.5. Compressed blocks (length and distance codes)

被压缩的块(length,distance的编码)

如前所述，用"deflate"格式所编码的数据块包含一系列的记号来表示

三种概念清楚的字母表：文字字节，值在0至255之间；或<length,distance>对。

length表示3至258之间，distnace在1至32768之间。实际上，文字和长度字母表被合并

到一个字母表中,从0至285。其中的0至255表示文字字节，256表示块的结束，257至285

表示length，见下表(length可能和几个存在于符号编码后面的额外位联合在一起，表示

不同的含义：)

As noted above, encoded data blocks in the "deflate" format

consist of sequences of symbols drawn from three conceptually

distinct alphabets: either literal bytes, from the alphabet of

byte values (0..255), or <length, backward distance> pairs,

where the length is drawn from (3..258) and the distance is

drawn from (1..32,768). In fact, the literal and length

alphabets are merged into a single alphabet (0..285), where

values 0..255 represent literal bytes, the value 256 indicates

end-of-block, and values 257..285 represent length codes

(possibly in conjunction with extra bits following the symbol

code) as follows:

Extra Extra Extra

Code Bits Length(s) Code Bits Lengths Code Bits Length(s)

---- ---- ------ ---- ---- ------- ---- ---- -------

257 0 3 267 1 15,16 277 4 67-82

258 0 4 268 1 17,18 278 4 83-98

259 0 5 269 2 19-22 279 4 99-114

260 0 6 270 2 23-26 280 4 115-130

261 0 7 271 2 27-30 281 5 131-162

262 0 8 272 2 31-34 282 5 163-194

263 0 9 273 3 35-42 283 5 195-226

264 0 10 274 3 43-50 284 5 227-257

265 1 11,12 275 3 51-58 285 0 258

266 1 13,14 276 3 59-66

额外的位应该被声明成整数，最重要的位在前面，如：1110代表14。

The extra bits should be interpreted as a machine integer

stored with the most-significant bit first, e.g., bits 1110

represent the value 14.

Extra Extra Extra

Code Bits Dist Code Bits Dist Code Bits Distance

---- ---- ---- ---- ---- ------ ---- ---- --------

0 0 1 10 4 33-48 20 9 1025-1536

1 0 2 11 4 49-64 21 9 1537-2048

2 0 3 12 5 65-96 22 10 2049-3072

3 0 4 13 5 97-128 23 10 3073-4096

4 1 5,6 14 6 129-192 24 11 4097-6144

5 1 7,8 15 6 193-256 25 11 6145-8192

6 2 9-12 16 7 257-384 26 12 8193-12288

7 2 13-16 17 7 385-512 27 12 12289-16384

8 3 17-24 18 8 513-768 28 13 16385-24576

9 3 25-32 19 8 769-1024 29 13 24577-32768

3.2.6. Compression with fixed Huffman codes (BTYPE=01)

用固定的Huffman编码压缩

两个字母表的Huffman编码是固定的，没有在数据中明确的指出。

文字/length的Huffman编码长度如下：

The Huffman codes for the two alphabets are fixed, and are not

represented explicitly in the data. The Huffman code lengths

for the literal/length alphabet are:

Lit Value Bits Codes

--------- ---- -----

0 - 143 8 00110000 through

10111111

144 - 255 9 110010000 through

111111111

256 - 279 7 0000000 through

0010111

280 - 287 8 11000000 through

11000111

上面所述的编码长度足以产生实际的编码。文字/length值286至287

永远不会真正的出现在压缩数据中，但是也存在于编码结构中。

The code lengths are sufficient to generate the actual codes,

as described above; we show the codes in the table for added

clarity. Literal/length values 286-287 will never actually

occur in the compressed data, but participate in the code

construction.

distance的代码 0-31被表示成5位的编码以及可能存在的附加位，

如上面3.2.5节中的图表所示。注意距离代码30-31不会真正出现在压缩数据中。

Distance codes 0-31 are represented by (fixed-length) 5-bit

codes, with possible additional bits as shown in the table

shown in Paragraph 3.2.5, above. Note that distance codes 30-

31 will never actually occur in the compressed data.

3.2.7. Compression with dynamic Huffman codes (BTYPE=10)

用动态Huffman编码压缩

一个块中的Huffman 编码(对两个字母表的编码)存在于头位的后面，

实际压缩内容的前面。第一个是文字/length 编码，第二个是distance编码。每个

编码都被一系列编码长度所定义，见3.2.2节。为了更加简洁，编码长度本身也是经过

了Huffman编码压缩的(这句话十分重要，以前就是由于没有透彻理解这一点才会看不

懂后面的内容--注)。字母表的编码长度如下：

The Huffman codes for the two alphabets appear in the block

immediately after the header bits and before the actual

compressed data, first the literal/length code and then the

distance code. Each code is defined by a sequence of code

lengths, as discussed in Paragraph 3.2.2, above. For even

greater compactness, the code length sequences themselves are

compressed using a Huffman code. The alphabet for code lengths

is as follows:

0 - 15: 表示编码长度为0 - 15

16: 复制前面的编码长度3 - 6次，

下两位指出了复制的次数

(0 = 3, ... , 3 = 6)

如:编码： 8, 16 (+2 bits 11),16 (+2 bits 10)

会被扩展为：

12个长度为8 的编码 (12 = 1 + 6 + 5)

17: 重复3 - 10 次长度为0

(下三位指出了重复次数)

18: 重复11 - 138 次长度为0

(下7位指出了重复次数)

0 - 15: Represent code lengths of 0 - 15

16: Copy the previous code length 3 - 6 times.

The next 2 bits indicate repeat length

(0 = 3, ... , 3 = 6)

Example: Codes 8, 16 (+2 bits 11),

16 (+2 bits 10) will expand to

12 code lengths of 8 (1 + 6 + 5)

17: Repeat a code length of 0 for 3 - 10 times.

(3 bits of length)

18: Repeat a code length of 0 for 11 - 138 times

(7 bits of length)

编码长度为0表示文字/length或distance不会出现在块中，也就不存在

于Huffman编码中。如果只使用一个distance编码长度，则编码时，使用一位，而不是0

位。

在这种情况下，只有一个编码长度1，和一个不使用的编码。0位的distance编码表示没

有

distance编码(所有的内容都是文字的)

A code length of 0 indicates that the corresponding symbol in

the literal/length or distance alphabet will not occur in the

block, and should not participate in the Huffman code

construction algorithm given earlier. If only one distance

code is used, it is encoded using one bit, not zero bits; in

this case there is a single code length of one, with one unused

code. One distance code of zero bits means that there are no

distance codes used at all (the data is all literals).

现在我们可以定义块的格式了(在某一块中，除了最前面的三个标志位之外，

就是如下的内容了--注)：

We can now define the format of the block:

5 Bits: HLIT, # 文字/length编码 - 257 (见3.2.5节。

由于使用了LZ77算法，所以除了文字之外，还要对<length,

distance>对进行编码。文字和长度字母表被合并到一个字

母表中,从0至285。其中的0至255表示文字字节，256表示块

的结束，257至285表示length。但由于不一定要使用所有的

length, 所以，这里的5位表示本块中所能用到的最大的

值--是指0到285间的一个值--减去257之后的值----注)

5 Bits: HDIST, # of Distance 编码 - 1 (同理，这里保存

的是3.2.5节中描述的在本块中可能用到的distance的最

大值-1----注)

4 Bits: HCLEN, # of "编码长度"的编码 - 4 (其实这4

位中保存的是所能用到的编码长度的种类 - 4，这里的“编

码长度”是指从0到18的值，其含义在3.2.7节中所定义。换

句话说，有下面一个数组：16, 17, 18, 0, 8, 7, 9, 6,

10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15，HCLEN+4的值就

表示了在本块的编码时，都用到了本数组中前几项的长度值，

而其它的长度值则不被使用争----注)

5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)

5 Bits: HDIST, # of Distance codes - 1 (1 - 32)

4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19)

(HCLEN + 4) x 3 bits: 编码长度表中的编码长度按照如下顺序排列(关键的

地方在这里。这里3位一组，表示相应位置的长度值在Huffman

编码中有几位。这句话不好理解，详细解释一下：在前面说过

了，HCLEN+4表示会使用多少种不同的编码长度值，至于都使

用哪些长度，则查看下面的数组，其前面的HCLEN+4项中所示

的值就是本块编码中所要使用的编码长度。于是，在下一部分

中就要分别指出每个字符的编码长度了 ---- 在Huffman解码

时，只要知道了每一个字符的编码长度，就可以无误的构造出

编码时使用的二叉树了。但是目前的问题是，怎样表示出每个

字符用几位编码。为了获得更大的压缩，Gzip采用了另一个

huffman树来编码不同的长度值。就在本部分中，每三位表示

一个值X，这个值X就是指要用X位来编码数组中相应位置的那

个长度值。----注)

(HCLEN + 4) x 3 bits: code lengths for the code length

alphabet given just above, in the order: 16, 17, 18,

0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15

这些编码的长度被表示成3位的整数(0-7)，如前所述，一个编码

的长度为0，意味着相应的符号(文字/length或distance编码长度不被使用)

These code lengths are interpreted as 3-bit integers

(0-7); as above, a code length of 0 means the

corresponding symbol (literal/length or distance code

length) is not used.

用Huffman编码的HLIT+257个编码长度。(这一部分解码后是HLIT+257个值，每

个值分别依次代表从0到HLIT+257-1的编码长度。至于解码的方法

则是用从前一部分中所得到的Huffman码。----注)

HLIT + 257 code lengths for the literal/length alphabet,

encoded using the code length Huffman code

用Huffman编码的HDIST+1的编码长度。(这一部分解码后是HDIST+1个值，依次

代表<LENGTH，DISTENCE>中的DISTENCE从0到HDIST的编码长度。

----注)

HDIST + 1 code lengths for the distance alphabet,

encoded using the code length Huffman code

实际的压缩数据，使用Huffman编码的文字/length、distance。(用上面两部分

中所列举的编码长度，可以构造出两个Huffman树，再用它们对

这部分中的压缩数据进行解码。以上所说的编码长度均是指用

多少位对数据进行编码----注)

The actual compressed data of the block,

encoded using the literal/length and distance Huffman

codes

最后写入256的Huffman编码,这样解码时，解出值256，就会知道

已经到块尾了。

The literal/length symbol 256 (end of data),

encoded using the literal/length Huffman code

The code length repeat codes can cross from HLIT + 257 to the

HDIST + 1 code lengths. In other words, all code lengths form

a single sequence of HLIT + HDIST + 258 values.

3.3. Compliance

A compressor may limit further the ranges of values specified in

the previous section and still be compliant; for example, it may

limit the range of backward pointers to some value smaller than

32K. Similarly, a compressor may limit the size of blocks so that

a compressible block fits in memory.

A compliant decompressor must accept the full range of possible

values defined in the previous section, and must accept blocks of

arbitrary size.