3. Detailed specification
3.1. Overall conventions In the diagrams below, a box like this:
下面的图形表示一个字节:
+---+
| | <-- the vertical bars might be missing
+---+
下面的图形表示若干字节:
represents one byte; a box like this:
+==============+
| |
+==============+
represents a variable number of bytes.
计算机中所存贮的字节并不存在“位顺序”,因为字节本身被看作是一个单元。
但是,当一个字节被看作是一个0到255之间的整数时,就会有一些最重要的或是最不重
要的位。通常我们会将一个字节中最重要的位写在左边,将几个字节中,最重要的字节
写在左边。在图表中,我们将一个字节中的各位标上序号:位0表示最不重要的位等等:
Bytes stored within a computer do not have a "bit order", since
they are always treated as a unit. However, a byte considered as
an integer between 0 and 255 does have a most- and least-
significant bit, and since we write numbers with the most-
significant digit on the left, we also write bytes with the most-
significant bit on the left. In the diagrams below, we number the
bits of a byte so that bit 0 is the least-significant bit, i.e.,
the bits are numbered:
+--------+
|76543210|
+--------+
在计算机中,一个数可能占用几个字节。这里所说的多字节数据都是将不重要的
部分存贮在低地址的字节中,如520被保存为:
Within a computer, a number may occupy multiple bytes. All
multi-byte numbers in the format described here are stored with
the least-significant byte first (at the lower memory address).
For example, the decimal number 520 is stored as:
0 1
+--------+--------+
|00001000|00000010|
+--------+--------+
^ ^
| |
| + more significant byte = 2 x 256
+ less significant byte = 8
3.1.1. Packing into bytes
这篇文档中没有指明在位连续的传输媒介中,字节中的位顺序。因为这
里所描述的最终数据格式是字节,而不是位。但是,我们设计了下面的压缩块格式,其中
的数据元素是不定长的位序列,而不是字节。因此我们必须指明如何将这些位组合成字节
序列。
This document does not address the issue of the order in which
bits of a byte are transmitted on a bit-sequential medium,
since the final data format described here is byte- rather than
bit-oriented. However, we describe the compressed block format
in below, as a sequence of data elements of various bit
lengths, not a sequence of bytes. We must therefore specify
how to pack these data elements into bytes to form the final
compressed byte sequence:
数据元素(位)被打包到字节中时,是以逐渐增加的位序号排列的。
如:以最不重要的位开始。
数据元素(除了Huffman编码外)以最不重要的位开始。
Huffman 编码以最重要的位开始。
* Data elements are packed into bytes in order of
increasing bit number within the byte, i.e., starting
with the least-significant bit of the byte.
* Data elements other than Huffman codes are packed
starting with the least-significant bit of the data
element.
* Huffman codes are packed starting with the most-
significant bit of the code.
换句话说,如果想要以字节序列的方式打印压缩数据,从最右边的空白
开始向最左边进行。将每字节中最重要的位放在左边。
In other words, if one were to print out the compressed data as
a sequence of bytes, starting with the first byte at the
*right* margin and proceeding to the *left*, with the most-
significant bit of each byte on the left as usual, one would be
able to parse the result from right to left, with fixed-width
elements in the correct MSB-to-LSB order and Huffman codes in
bit-reversed order (i.e., with the first bit of the code in the
relative LSB position).
3.2. Compressed block format 压缩的块结构
3.2.1. Synopsis of prefix and Huffman coding
前缀编码方式描述了一种可以从位序列预知的字母表。每个符号有一个
编码,其特点是不同的符号有不同的长度的位序列。但是解析器可以明白的解析出编码的
内容。
Prefix coding represents symbols from an apriori known
alphabet by bit sequences (codes), one code for each symbol, in
a manner such that different symbols may be represented by bit
sequences of different lengths, but a parser can always parse
an encoded string unambiguously symbol-by-symbol.
我们根据一个二进制树来进行前缀编码。两边的非叶子结点依次的标上
0,1,而叶子结点则一个一人的标记上字母表中的字符。这样,每个字符的编码就是一个
从根结点到叶子结点的0,1的序列了,如:
We define a prefix code in terms of a binary tree in which the
two edges descending from each non-leaf node are labeled 0 and
1 and in which the leaf nodes correspond one-for-one with (are
labeled with) the symbols of the alphabet; then the code for a
symbol is the sequence of 0's and 1's on the edges leading from
the root to the leaf labeled with that symbol. For example:
/\ Symbol Code
0 1 ------ ----
/ \ A 00
/\ B B 1
0 1 C 011
/ \ D 010
A / 0 1
/ D C
解码器可以解码输入的内容:让输入的流从这个树的根结点走到叶子
结点,每次的前进都选择与流中下一位相同的一边。
A parser can decode the next symbol from an encoded input
stream by walking down the tree from the root, at each step
choosing the edge corresponding to the next input bit.
给出一个已知使用频率的字母表,Huffman规则允许解释成一个最佳的
前缀编码,这样的编码被称作Huffman 编码。
Given an alphabet with known symbol frequencies, the Huffman
algorithm allows the construction of an optimal prefix code
(one which represents strings with those symbol frequencies
using the fewest bits of any possible prefix codes for that
alphabet). Such a code is called a Huffman code. (See
reference [1] in Chapter 5, references for additional
information on Huffman codes.)
注意,在"deflate"格式中,Huffman 编码对于不同的字母表来说
不可以超过最大的编码长度。这个规定使得通过使用频率来计算编码长度的算法变得复杂
了。详细内容见第五节。
Note that in the "deflate" format, the Huffman codes for the
various alphabets must not exceed certain maximum code lengths.
This constraint complicates the algorithm for computing code
lengths from symbol frequencies. Again, see Chapter 5,
references for details.
3.2.2. Use of Huffman coding in the "deflate" format 在"deflate"格式中使用Huffman编码
在"deflate"格式中,每一个字母表使用的Huffman编码方式都有两条附加规则:
The Huffman codes used for each alphabet in the "deflate"
format have two additional rules:
所有指定了位长度的编码都有相邻值。而且与他们所代表的符号有
相同的顺序。
* All codes of a given bit length have lexicographically
consecutive values, in the same order as the symbols
they represent;
较短的编码在索引时要先于长的编码。
* Shorter codes lexicographically precede longer codes.
我们可以依照这些规则来重新编码上面的例子,假定字母表中的
顺序是ABCD:
We could recode the example above to follow this rule as
follows, assuming that the order of the alphabet is ABCD:
Symbol Code
------ ----
A 10
B 0
C 110
D 111
0在10的前面,10在11X的前面。110和111是连续的。
I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
lexicographically consecutive.
有了这个规则,我们可以通过按顺序给定字母表中每一个符号的位长度来
为一个字母表定义Huffman编码。在我们的例子中,编码由位长度的序列所完全定义。下面
的算法将编码生成为整数,规定为从最重要的位读向最不重要的位。编码长度为
tree[I].len,编码则存在于tree[I].Code中。
Given this rule, we can define the Huffman code for an alphabet
just by giving the bit lengths of the codes for each symbol of
the alphabet in order; this is sufficient to determine the
actual codes. In our example, the code is completely defined
by the sequence of bit lengths (2, 1, 3, 3). The following
algorithm generates the codes as integers, intended to be read
from most- to least-significant bit. The code lengths are
initially in tree[I].Len; the codes are produced in
tree[I].Code.
为每一个编码长度计算编码的数量。使bl_count[N]为长度为N(N>=1)
的编码的数量。
1) Count the number of codes for each code length. Let
bl_count[N] be the number of codes of length N, N >= 1.
在每一个编码长度中找出最小的数值。
2) Find the numerical value of the smallest code for each
code length:
code = 0;
bl_count[0] = 0;
for (bits = 1; bits <= MAX_BITS; bits++) {
code = (code + bl_count[bits-1]) << 1;
next_code[bits] = code;
}
将数值赋给所有的编码:以第二步中找到的(next_code[bits])为基础,为相同长度
的所有编码连续的赋值。不要为永远不使用的编码(一位的0)赋值。
3) Assign numerical values to all codes, using consecutive
values for all codes of the same length with the base
values determined at step 2. Codes that are never used
(which have a bit length of zero) must not be assigned a
value.
for (n = 0; n <= max_code; n++) {
len = tree[n].Len;
if (len != 0) {
tree[n].Code = next_code[len];
next_code[len]++;
}
}
Example:
Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
3, 2, 4, 4). After step 1, we have:
N bl_count[N]
- -----------
2 1
3 5
4 2
Step 2 computes the following next_code values:
N next_code[N]
- ------------
1 0
2 0
3 2
4 14
Step 3 produces the following code values:
Symbol Length Code
------ ------ ----
A 3 010
B 3 011
C 3 100
D 3 101
E 3 110
F 2 00
G 4 1110
H 4 1111
3.2.3. Details of block format 块结构的详细资料
每个压缩数据块都以三个头位开始:
Each block of compressed data begins with 3 header bits
containing the following data:
first bit BFINAL 第一位
next 2 bits BTYPE 下二位
注意,头位不需要从一个字节的边界开始,因为“块”不需要占据完整
的若干字节。
Note that the header bits do not necessarily begin on a byte
boundary, since a block does not necessarily occupy an integral
number of bytes.
只有当前块是最后一块的时候,才设置BFINAL位。
BFINAL is set if and only if this is the last block of the data
set.
BTYPE指定了块数据是如何压缩的:
BTYPE specifies how the data are compressed, as follows:
00 - no compression 非压缩
01 - compressed with fixed Huffman codes 用固定的Huffman编码来压缩。
10 - compressed with dynamic Huffman codes 用动态的Huffman编码来压缩。
11 - reserved (error) 保留(错误)
两种压缩方案的唯一不同是:如何定义Huffman编码。
The only difference between the two compressed cases is how the
Huffman codes for the literal/length and distance alphabets are
defined.
在所有的情况下,解码的方案如下:
In all cases, the decoding algorithm for the actual data is as
follows:
do
从输入流读入块头部
if 非压缩存贮
跳过当前处理的字节中剩余的位
读入LEN和NLEN(见下一节)
向输出拷贝LEN字节。
else
if 动态的Huffman编码
读入编码树
loop(直到块的结尾)
从输入流解码出文字/长度值
if value < 256
向输出拷贝value(文字字节)
else
if value = 块的结尾(256)
break;(跳出loop)
else (value = 257到285)
从输入流解码出距离(distance)
在输出流中,向后(已经解码的部分)跳过distance
字节,拷贝当前的length字节
到输出流中。
end loop
while 不是最后一块时
do
read block header from input stream.
if stored with no compression
skip any remaining bits in current partially
processed byte
read LEN and NLEN (see next section)
copy LEN bytes of data to output
otherwise
if compressed with dynamic Huffman codes
read representation of code trees (see
subsection below)
loop (until end of block code recognized)
decode literal/length value from input stream
if value < 256
copy value (literal byte) to output stream
otherwise
if value = end of block (256)
break from loop
otherwise (value = 257..285)
decode distance from input stream
move backwards distance bytes in the output
stream, and copy length bytes from this
position to the output stream.
end loop
while not last block
注意,一个复制的参考串可能指向前一个块,如:向后的distnace可能
穿过一个或几个块边界。但是distance不可能越过输出串的开始。(一个使用预先设定的
字母表的应用程序可能会忽略输出流中的一部分,distance可以指出这部分)还要注意,
被参考的串可能会与当前位置重合,如:如果最后的两个字节解码后是X,Y,一个串提
及:<length = 5,distance = 2>,则向输出中增加X,Y,X,Y,X。
Note that a duplicated string reference may refer to a string
in a previous block; i.e., the backward distance may cross one
or more block boundaries. However a distance cannot refer past
the beginning of the output stream. (An application using a
preset dictionary might discard part of the output stream; a
distance can refer to that part of the output stream anyway)
Note also that the referenced string may overlap the current
position; for example, if the last 2 bytes decoded have values
X and Y, a string reference with <length = 5, distance = 2>
adds X,Y,X,Y,X to the output stream.
现在,我们按顺序说明每种压缩方法。
We now specify each compression method in turn.
3.2.4. Non-compressed blocks (BTYPE=00) 非压缩的块
忽略直到下一个字节边界的所有位,这块中的其它内容包含如下信息:
Any bits of input up to the next byte boundary are ignored.
The rest of the block consists of the following information:
0 1 2 3 4...
+---+---+---+---+================================+
| LEN | NLEN |... LEN bytes of literal data...|
+---+---+---+---+================================+
LEN是块中的字节数。NLEN是LEN的补码。
LEN is the number of data bytes in the block. NLEN is the
one's complement of LEN.
3.2.5. Compressed blocks (length and distance codes)
被压缩的块(length,distance的编码)
如前所述,用"deflate"格式所编码的数据块包含一系列的记号来表示
三种概念清楚的字母表:文字字节,值在0至255之间;或<length,distance>对。
length表示3至258之间,distnace在1至32768之间。实际上,文字和长度字母表被合并
到一个字母表中,从0至285。其中的0至255表示文字字节,256表示块的结束,257至285
表示length,见下表(length可能和几个存在于符号编码后面的额外位联合在一起,表示
不同的含义:)
As noted above, encoded data blocks in the "deflate" format
consist of sequences of symbols drawn from three conceptually
distinct alphabets: either literal bytes, from the alphabet of
byte values (0..255), or <length, backward distance> pairs,
where the length is drawn from (3..258) and the distance is
drawn from (1..32,768). In fact, the literal and length
alphabets are merged into a single alphabet (0..285), where
values 0..255 represent literal bytes, the value 256 indicates
end-of-block, and values 257..285 represent length codes
(possibly in conjunction with extra bits following the symbol
code) as follows:
Extra Extra Extra
Code Bits Length(s) Code Bits Lengths Code Bits Length(s)
---- ---- ------ ---- ---- ------- ---- ---- -------
257 0 3 267 1 15,16 277 4 67-82
258 0 4 268 1 17,18 278 4 83-98
259 0 5 269 2 19-22 279 4 99-114
260 0 6 270 2 23-26 280 4 115-130
261 0 7 271 2 27-30 281 5 131-162
262 0 8 272 2 31-34 282 5 163-194
263 0 9 273 3 35-42 283 5 195-226
264 0 10 274 3 43-50 284 5 227-257
265 1 11,12 275 3 51-58 285 0 258
266 1 13,14 276 3 59-66
额外的位应该被声明成整数,最重要的位在前面,如:1110代表14。
The extra bits should be interpreted as a machine integer
stored with the most-significant bit first, e.g., bits 1110
represent the value 14.
Extra Extra Extra
Code Bits Dist Code Bits Dist Code Bits Distance
---- ---- ---- ---- ---- ------ ---- ---- --------
0 0 1 10 4 33-48 20 9 1025-1536
1 0 2 11 4 49-64 21 9 1537-2048
2 0 3 12 5 65-96 22 10 2049-3072
3 0 4 13 5 97-128 23 10 3073-4096
4 1 5,6 14 6 129-192 24 11 4097-6144
5 1 7,8 15 6 193-256 25 11 6145-8192
6 2 9-12 16 7 257-384 26 12 8193-12288
7 2 13-16 17 7 385-512 27 12 12289-16384
8 3 17-24 18 8 513-768 28 13 16385-24576
9 3 25-32 19 8 769-1024 29 13 24577-32768
3.2.6. Compression with fixed Huffman codes (BTYPE=01)
用固定的Huffman编码压缩
两个字母表的Huffman编码是固定的,没有在数据中明确的指出。
文字/length的Huffman编码长度如下:
The Huffman codes for the two alphabets are fixed, and are not
represented explicitly in the data. The Huffman code lengths
for the literal/length alphabet are:
Lit Value Bits Codes
--------- ---- -----
0 - 143 8 00110000 through
10111111
144 - 255 9 110010000 through
111111111
256 - 279 7 0000000 through
0010111
280 - 287 8 11000000 through
11000111
上面所述的编码长度足以产生实际的编码。文字/length值286至287
永远不会真正的出现在压缩数据中,但是也存在于编码结构中。
The code lengths are sufficient to generate the actual codes,
as described above; we show the codes in the table for added
clarity. Literal/length values 286-287 will never actually
occur in the compressed data, but participate in the code
construction.
distance的代码 0-31被表示成5位的编码以及可能存在的附加位,
如上面3.2.5节中的图表所示。注意距离代码30-31不会真正出现在压缩数据中。
Distance codes 0-31 are represented by (fixed-length) 5-bit
codes, with possible additional bits as shown in the table
shown in Paragraph 3.2.5, above. Note that distance codes 30-
31 will never actually occur in the compressed data.
3.2.7. Compression with dynamic Huffman codes (BTYPE=10)
用动态Huffman编码压缩
一个块中的Huffman 编码(对两个字母表的编码)存在于头位的后面,
实际压缩内容的前面。第一个是文字/length 编码,第二个是distance编码。每个
编码都被一系列编码长度所定义,见3.2.2节。为了更加简洁,编码长度本身也是经过
了Huffman编码压缩的(这句话十分重要,以前就是由于没有透彻理解这一点才会看不
懂后面的内容--注)。字母表的编码长度如下:
The Huffman codes for the two alphabets appear in the block
immediately after the header bits and before the actual
compressed data, first the literal/length code and then the
distance code. Each code is defined by a sequence of code
lengths, as discussed in Paragraph 3.2.2, above. For even
greater compactness, the code length sequences themselves are
compressed using a Huffman code. The alphabet for code lengths
is as follows:
0 - 15: 表示编码长度为0 - 15
16: 复制前面的编码长度3 - 6次,
下两位指出了复制的次数
(0 = 3, ... , 3 = 6)
如:编码: 8, 16 (+2 bits 11),16 (+2 bits 10)
会被扩展为:
12个长度为8 的编码 (12 = 1 + 6 + 5)
17: 重复3 - 10 次长度为0
(下三位指出了重复次数)
18: 重复11 - 138 次长度为0
(下7位指出了重复次数)
0 - 15: Represent code lengths of 0 - 15
16: Copy the previous code length 3 - 6 times.
The next 2 bits indicate repeat length
(0 = 3, ... , 3 = 6)
Example: Codes 8, 16 (+2 bits 11),
16 (+2 bits 10) will expand to
12 code lengths of 8 (1 + 6 + 5)
17: Repeat a code length of 0 for 3 - 10 times.
(3 bits of length)
18: Repeat a code length of 0 for 11 - 138 times
(7 bits of length)
编码长度为0表示文字/length或distance不会出现在块中,也就不存在
于Huffman编码中。如果只使用一个distance编码长度,则编码时,使用一位,而不是0
位。
在这种情况下,只有一个编码长度1,和一个不使用的编码。0位的distance编码表示没
有
distance编码(所有的内容都是文字的)
A code length of 0 indicates that the corresponding symbol in
the literal/length or distance alphabet will not occur in the
block, and should not participate in the Huffman code
construction algorithm given earlier. If only one distance
code is used, it is encoded using one bit, not zero bits; in
this case there is a single code length of one, with one unused
code. One distance code of zero bits means that there are no
distance codes used at all (the data is all literals).
现在我们可以定义块的格式了(在某一块中,除了最前面的三个标志位之外,
就是如下的内容了--注):
We can now define the format of the block:
5 Bits: HLIT, # 文字/length编码 - 257 (见3.2.5节。
由于使用了LZ77算法,所以除了文字之外,还要对<length,
distance>对进行编码。文字和长度字母表被合并到一个字
母表中,从0至285。其中的0至255表示文字字节,256表示块
的结束,257至285表示length。但由于不一定要使用所有的
length, 所以,这里的5位表示本块中所能用到的最大的
值--是指0到285间的一个值--减去257之后的值----注)
5 Bits: HDIST, # of Distance 编码 - 1 (同理,这里保存
的是3.2.5节中描述的在本块中可能用到的distance的最
大值-1----注)
4 Bits: HCLEN, # of "编码长度"的编码 - 4 (其实这4
位中保存的是所能用到的编码长度的种类 - 4,这里的“编
码长度”是指从0到18的值,其含义在3.2.7节中所定义。换
句话说,有下面一个数组:16, 17, 18, 0, 8, 7, 9, 6,
10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15,HCLEN+4的值就
表示了在本块的编码时,都用到了本数组中前几项的长度值,
而其它的长度值则不被使用争----注)
5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
5 Bits: HDIST, # of Distance codes - 1 (1 - 32)
4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19)
(HCLEN + 4) x 3 bits: 编码长度表中的编码长度按照如下顺序排列(关键的
地方在这里。这里3位一组,表示相应位置的长度值在Huffman
编码中有几位。这句话不好理解,详细解释一下:在前面说过
了,HCLEN+4表示会使用多少种不同的编码长度值,至于都使
用哪些长度,则查看下面的数组,其前面的HCLEN+4项中所示
的值就是本块编码中所要使用的编码长度。于是,在下一部分
中就要分别指出每个字符的编码长度了 ---- 在Huffman解码
时,只要知道了每一个字符的编码长度,就可以无误的构造出
编码时使用的二叉树了。但是目前的问题是,怎样表示出每个
字符用几位编码。为了获得更大的压缩,Gzip采用了另一个
huffman树来编码不同的长度值。就在本部分中,每三位表示
一个值X,这个值X就是指要用X位来编码数组中相应位置的那
个长度值。----注)
(HCLEN + 4) x 3 bits: code lengths for the code length
alphabet given just above, in the order: 16, 17, 18,
0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
这些编码的长度被表示成3位的整数(0-7),如前所述,一个编码
的长度为0,意味着相应的符号(文字/length或distance编码长度不被使用)
These code lengths are interpreted as 3-bit integers
(0-7); as above, a code length of 0 means the
corresponding symbol (literal/length or distance code
length) is not used.
用Huffman编码的HLIT+257个编码长度。(这一部分解码后是HLIT+257个值,每
个值分别依次代表从0到HLIT+257-1的编码长度。至于解码的方法
则是用从前一部分中所得到的Huffman码。----注)
HLIT + 257 code lengths for the literal/length alphabet,
encoded using the code length Huffman code
用Huffman编码的HDIST+1的编码长度。(这一部分解码后是HDIST+1个值,依次
代表<LENGTH,DISTENCE>中的DISTENCE从0到HDIST的编码长度。
----注)
HDIST + 1 code lengths for the distance alphabet,
encoded using the code length Huffman code
实际的压缩数据,使用Huffman编码的文字/length、distance。(用上面两部分
中所列举的编码长度,可以构造出两个Huffman树,再用它们对
这部分中的压缩数据进行解码。以上所说的编码长度均是指用
多少位对数据进行编码----注)
The actual compressed data of the block,
encoded using the literal/length and distance Huffman
codes
最后写入256的Huffman编码,这样解码时,解出值256,就会知道
已经到块尾了。
The literal/length symbol 256 (end of data),
encoded using the literal/length Huffman code
The code length repeat codes can cross from HLIT + 257 to the
HDIST + 1 code lengths. In other words, all code lengths form
a single sequence of HLIT + HDIST + 258 values.
3.3. Compliance
A compressor may limit further the ranges of values specified in
the previous section and still be compliant; for example, it may
limit the range of backward pointers to some value smaller than
32K. Similarly, a compressor may limit the size of blocks so that
a compressible block fits in memory.
A compliant decompressor must accept the full range of possible
values defined in the previous section, and must accept blocks of
arbitrary size.