The following is only an excerpt from RFC 1952; for the remainder, refer to the original document.
2. Detailed specification
2.1. Overall conventions
In the diagrams below, a box like this represents one byte:
+---+
|   | <-- the vertical bars might be missing
+---+
A box like this represents a variable number of bytes:
+==============+
|              |
+==============+
Bytes stored within a computer do not have a "bit order", since
they are always treated as a unit. However, a byte considered as
an integer between 0 and 255 does have a most- and least-
significant bit, and since we write numbers with the most-
significant digit on the left, we also write bytes with the most-
significant bit on the left. In the diagrams below, we number the
bits of a byte so that bit 0 is the least-significant bit, i.e.,
the bits are numbered:
+--------+
|76543210|
+--------+
This document does not address the issue of the order in which
bits of a byte are transmitted on a bit-sequential medium, since
the data format described here is byte- rather than bit-oriented.
Within a computer, a number may occupy multiple bytes. All
multi-byte numbers in the format described here are stored with
the least-significant byte first (at the lower memory address).
For example, the decimal number 520 is stored as:
    0        1
+--------+--------+
|00001000|00000010|
+--------+--------+
 ^        ^
 |        |
 |        + more significant byte = 2 x 256
 + less significant byte = 8
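As a quick check of the byte order described above, the following minimal sketch packs 520 as a little-endian 16-bit value with Python's standard `struct` module. The choice of Python and of a 16-bit width is an assumption made for illustration; the format itself only fixes the byte order.

```python
import struct

# "<H" means little-endian unsigned 16-bit: least-significant byte first.
encoded = struct.pack("<H", 520)

# The first byte holds 8, the second holds 2 (2 x 256 = 512, and 512 + 8 = 520).
print(list(encoded))
print(int.from_bytes(encoded, "little"))
```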
2.2. File format
A gzip file consists of a series of "members" (compressed data
sets). The format of each member is specified in the following
section. The members simply appear one after another in the file,
with no additional information before, between, or after them.
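Because members are simply placed one after another, two independently compressed members can be joined by plain byte concatenation. The sketch below illustrates this with Python's standard `gzip` module, which is assumed here only as a convenient producer and consumer of members.

```python
import gzip

# Each call produces one complete member (header, compressed blocks, trailer).
first = gzip.compress(b"hello, ")
second = gzip.compress(b"world")

# A file made of both members back to back, with nothing in between.
combined = first + second

# gzip.decompress reads every member and joins the uncompressed data.
print(gzip.decompress(combined))
```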
2.3. Member format
Each member has the following structure:
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
(if FLG.FEXTRA set)
+---+---+=================================+
| XLEN  |...XLEN bytes of "extra field"...| (more-->)
+---+---+=================================+
(if FLG.FNAME set)
+=========================================+
|...original file name, zero-terminated...| (more-->)
+=========================================+
(if FLG.FCOMMENT set)
+===================================+
|...file comment, zero-terminated...| (more-->)
+===================================+
(if FLG.FHCRC set)
+---+---+
| CRC16 |
+---+---+
+=======================+
|...compressed blocks...| (more-->)
+=======================+
  0   1   2   3   4   5   6   7
+---+---+---+---+---+---+---+---+
|     CRC32     |     ISIZE     |
+---+---+---+---+---+---+---+---+
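The fixed 10-byte header and the 8-byte trailer shown above map directly onto two little-endian struct layouts. The following is a minimal sketch, not a full parser: the helper name `read_fixed_fields` is hypothetical, and it assumes the input holds exactly one member, so that the trailer is the last 8 bytes.

```python
import gzip
import struct

# Fixed header: ID1, ID2, CM, FLG (1 byte each), MTIME (4 bytes), XFL, OS.
FIXED_HEADER = struct.Struct("<BBBBIBB")
# Trailer: CRC32 and ISIZE, 4 bytes each.
TRAILER = struct.Struct("<II")


def read_fixed_fields(member: bytes) -> dict:
    """Decode the fixed header and the trailer of a single gzip member."""
    id1, id2, cm, flg, mtime, xfl, os_code = FIXED_HEADER.unpack_from(member, 0)
    crc32, isize = TRAILER.unpack_from(member, len(member) - TRAILER.size)
    return {
        "ID1": id1, "ID2": id2, "CM": cm, "FLG": flg,
        "MTIME": mtime, "XFL": xfl, "OS": os_code,
        "CRC32": crc32, "ISIZE": isize,
    }


# mtime=0 stores "no time stamp available" in the MTIME field.
member = gzip.compress(b"example data", mtime=0)
fields = read_fixed_fields(member)
print(fields)
```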
2.3.1. Member header and trailer
ID1 (IDentification 1)
ID2 (IDentification 2)
These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139
(0x8b, \213), to identify the file as being in gzip format.
CM (Compression Method)
This identifies the compression method used in the file. CM
= 0-7 are reserved. CM = 8 denotes the "deflate"
compression method, which is the one customarily used by
gzip and which is documented elsewhere.
FLG (FLaGs)
This flag byte is divided into individual bits as follows:
bit 0 FTEXT
bit 1 FHCRC
bit 2 FEXTRA
bit 3 FNAME
bit 4 FCOMMENT
bit 5 reserved
bit 6 reserved
bit 7 reserved
If FTEXT is set, the file is probably ASCII text. This is
an optional indication, which the compressor may set by
checking a small amount of the input data to see whether any
non-ASCII characters are present. In case of doubt, FTEXT
is cleared, indicating binary data. For systems which have
different file formats for ASCII text and binary data, the
decompressor can use FTEXT to choose the appropriate format.
We deliberately do not specify the algorithm used to set
this bit, since a compressor always has the option of
leaving it cleared and a decompressor always has the option
of ignoring it and letting some other program handle issues
of data conversion.
If FHCRC is set, a CRC16 for the gzip header is present,
immediately before the compressed data. The CRC16 consists
of the two least significant bytes of the CRC32 for all
bytes of the gzip header up to and not including the CRC16.
[The FHCRC bit was never set by versions of gzip up to
1.2.4, even though it was documented with a different
meaning in gzip 1.2.4.]
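The CRC16 rule above can be sketched as follows: take the CRC-32 of every header byte before the CRC16 field and keep its two least significant bytes. This sketch assumes that `zlib.crc32` from the Python standard library computes the same CRC-32 that gzip uses; the helper name `header_crc16` is hypothetical. It assembles a member by hand with only FHCRC set and lets zlib decode it.

```python
import struct
import zlib


def header_crc16(header: bytes) -> int:
    """Low 16 bits of the CRC-32 of all header bytes preceding the CRC16 field."""
    return zlib.crc32(header) & 0xFFFF


data = b"header crc demo"

# Fixed header with only FHCRC (bit 1) set, MTIME = 0, XFL = 0, OS = 255 (unknown).
header = bytes([31, 139, 8, 0x02, 0, 0, 0, 0, 0, 255])

# Negative wbits selects raw deflate output with no zlib or gzip wrapper.
deflater = zlib.compressobj(wbits=-15)
blocks = deflater.compress(data) + deflater.flush()

trailer = struct.pack("<II", zlib.crc32(data), len(data) % 2**32)
member = header + struct.pack("<H", header_crc16(header)) + blocks + trailer

# wbits=31 asks zlib to expect a gzip wrapper around the deflate data.
print(zlib.decompress(member, wbits=31))
```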
If FEXTRA is set, optional extra fields are present, as
described in a following section.
If FNAME is set, an original file name is present,
terminated by a zero byte. The name must consist of ISO
8859-1 (LATIN-1) characters; on operating systems using
EBCDIC or any other character set for file names, the name
must be translated to the ISO LATIN-1 character set. This
is the original name of the file being compressed, with any
directory components removed, and, if the file being
compressed is on a file system with case insensitive names,
forced to lower case. There is no original file name if the
data was compressed from a source other than a named file;
for example, if the source was stdin on a Unix system, there
is no file name.
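Reading the original file name amounts to scanning for the terminating zero byte and decoding the bytes before it as ISO 8859-1. The helper below is a hypothetical sketch of that step; it takes the buffer and the offset at which the name starts, and returns the decoded name together with the offset of the next field.

```python
from typing import Tuple


def read_zero_terminated(buf: bytes, pos: int) -> Tuple[str, int]:
    """Decode a zero-terminated ISO 8859-1 string starting at pos."""
    end = buf.index(b"\x00", pos)  # raises ValueError if no terminator exists
    return buf[pos:end].decode("latin-1"), end + 1


# Two padding bytes, then a Latin-1 name, its terminator, and following data.
sample = b"\x00\x00caf\xe9.txt\x00rest"
name, next_pos = read_zero_terminated(sample, 2)
print(name, next_pos)
```

The same helper applies unchanged to the FCOMMENT field, which uses the same character set and terminator.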
If FCOMMENT is set, a zero-terminated file comment is
present. This comment is not interpreted; it is only
intended for human consumption. The comment must consist of
ISO 8859-1 (LATIN-1) characters. Line breaks should be
denoted by a single line feed character (10 decimal).
Reserved FLG bits must be zero.
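The FLG bit layout and the reserved-bit rule can be expressed as a small flag type. This is a sketch only; the names `Flg`, `RESERVED_MASK` and `decode_flg` are hypothetical and chosen for illustration.

```python
from enum import IntFlag


class Flg(IntFlag):
    FTEXT = 1 << 0
    FHCRC = 1 << 1
    FEXTRA = 1 << 2
    FNAME = 1 << 3
    FCOMMENT = 1 << 4


RESERVED_MASK = 0b1110_0000  # bits 5, 6 and 7


def decode_flg(flg: int) -> Flg:
    """Return the set flags, rejecting any byte with a reserved bit set."""
    if flg & RESERVED_MASK:
        raise ValueError(f"reserved FLG bits set: {flg:#04x}")
    return Flg(flg)


flags = decode_flg(0x0C)  # bits 2 and 3: FEXTRA and FNAME
print(Flg.FEXTRA in flags, Flg.FNAME in flags, Flg.FTEXT in flags)
```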
MTIME (Modification TIME)
This gives the most recent modification time of the original
file being compressed. The time is in Unix format, i.e.,
seconds since 00:00:00 GMT, Jan. 1, 1970. (Note that this
may cause problems for MS-DOS and other systems that use
local rather than Universal time.) If the compressed data
did not come from a file, MTIME is set to the time at which
compression started. MTIME = 0 means no time stamp is
available.
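Since MTIME counts seconds from the Unix epoch in universal time, it converts directly to a timezone-aware timestamp. The sketch below uses a hypothetical helper, `mtime_to_datetime`, and maps MTIME = 0 to `None` to represent "no time stamp available".

```python
from datetime import datetime, timezone
from typing import Optional


def mtime_to_datetime(mtime: int) -> Optional[datetime]:
    """Convert an MTIME value to a UTC datetime, or None when MTIME is 0."""
    if mtime == 0:
        return None  # no time stamp is available
    return datetime.fromtimestamp(mtime, tz=timezone.utc)


print(mtime_to_datetime(0))
print(mtime_to_datetime(86400))  # one day after the epoch
```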
XFL (eXtra FLags)
These flags are available for use by specific compression
methods. The "deflate" method (CM = 8) sets these flags as
follows:
XFL = 2 - compressor used maximum compression,
slowest algorithm
XFL = 4 - compressor used fastest algorithm
OS (Operating System)
This identifies the type of file system on which compression
took place. This may be useful in determining end-of-line
convention for text files. The currently defined values are
as follows:
0 - FAT filesystem (MS-DOS, OS/2, NT/Win32)
1 - Amiga
2 - VMS (or OpenVMS)
3 - Unix
4 - VM/CMS
5 - Atari TOS
6 - HPFS filesystem (OS/2, NT)
7 - Macintosh
8 - Z-System
9 - CP/M
10 - TOPS-20
11 - NTFS filesystem (NT)
12 - QDOS
13 - Acorn RISCOS
255 - unknown
XLEN (eXtra LENgth)
If FLG.FEXTRA is set, this gives the length of the optional
extra field. See below for details.
CRC32 (CRC-32)
This contains a Cyclic Redundancy Check value of the
uncompressed data computed according to CRC-32 algorithm
used in the ISO 3309 standard and in section 8.1.1.6.2 of
ITU-T recommendation V.42. (See http://www.iso.ch for
ordering ISO documents. See gopher://info.itu.ch for an
online version of ITU-T V.42.)
ISIZE (Input SIZE)
This contains the size of the original (uncompressed) input
data modulo 2^32.
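Both trailer fields can be recomputed from the uncompressed data and compared with the stored values. The sketch below assumes that `zlib.crc32` implements the CRC-32 referred to above and that the input holds a single member; the helper name `trailer_matches` is hypothetical.

```python
import gzip
import struct
import zlib


def trailer_matches(member: bytes, data: bytes) -> bool:
    """Check the stored CRC32 and ISIZE against the uncompressed data."""
    stored_crc32, stored_isize = struct.unpack("<II", member[-8:])
    return (stored_crc32 == zlib.crc32(data)
            and stored_isize == len(data) % 2**32)


data = b"trailer check" * 100
member = gzip.compress(data)
print(trailer_matches(member, data))
```

Because ISIZE is stored modulo 2^32, it cannot by itself distinguish inputs whose sizes differ by a multiple of 4 GiB.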
2.3.1.1. Extra field
If the FLG.FEXTRA bit is set, an "extra field" is present in
the header, with total length XLEN bytes. It consists of a
series of subfields, each of the form:
+---+---+---+---+==================================+
|SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
+---+---+---+---+==================================+
SI1 and SI2 provide a subfield ID, typically two ASCII letters
with some mnemonic value. Jean-Loup Gailly
<gzip@prep.ai.mit.edu> is maintaining a registry of subfield
IDs; please send him any subfield ID you wish to use. Subfield
IDs with SI2 = 0 are reserved for future use. The following
IDs are currently defined:
SI1        SI2        Data
---------- ---------- ----
0x41 ('A') 0x70 ('P') Apollo file type information
LEN gives the length of the subfield data, excluding the 4
initial bytes.
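Walking the extra field means repeatedly reading a 4-byte subfield header (SI1, SI2, and a little-endian 16-bit LEN) followed by LEN bytes of data. The following is a minimal sketch with a hypothetical helper, `parse_extra_field`; the subfield IDs "Zz" and "Qq" in the example are invented for illustration and are not registered IDs.

```python
import struct
from typing import List, Tuple


def parse_extra_field(extra: bytes) -> List[Tuple[bytes, bytes]]:
    """Split the XLEN bytes of an extra field into (ID, data) pairs."""
    subfields = []
    pos = 0
    while pos < len(extra):
        if pos + 4 > len(extra):
            raise ValueError("truncated subfield header")
        si1, si2, length = struct.unpack_from("<BBH", extra, pos)
        pos += 4  # LEN excludes these 4 initial bytes
        data = extra[pos:pos + length]
        if len(data) != length:
            raise ValueError("truncated subfield data")
        subfields.append((bytes([si1, si2]), data))
        pos += length
    return subfields


# Two made-up subfields: "Zz" with 3 data bytes and "Qq" with none.
extra = b"Zz" + struct.pack("<H", 3) + b"abc" + b"Qq" + struct.pack("<H", 0)
print(parse_extra_field(extra))
```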
2.3.1.2. Compliance
A compliant compressor must produce files with correct ID1,
ID2, CM, CRC32, and ISIZE, but may set all the other fields in
the fixed-length part of the header to default values (255 for
OS, 0 for all others). The compressor must set all reserved
bits to zero.
A compliant decompressor must check ID1, ID2, and CM, and
provide an error indication if any of these have incorrect
values. It must examine FEXTRA/XLEN, FNAME, FCOMMENT and FHCRC
at least so it can skip over the optional fields if they are
present. It need not examine any other part of the header or
trailer; in particular, a decompressor may ignore FTEXT and OS
and always produce binary output, and still be compliant. A
compliant decompressor must give an error indication if any
reserved bit is non-zero, since such a bit could indicate the
presence of a new field that would cause subsequent data to be
interpreted incorrectly.
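The decompressor checks listed above can be sketched as a function that validates ID1, ID2, CM and the reserved bits, then skips each optional field that is present and returns the offset of the compressed blocks. This is an illustrative sketch, not a complete decompressor: the helper name `skip_header` is hypothetical, it treats any CM other than 8 as an error (an assumption, since only "deflate" is handled here), and it does not verify the CRC16, CRC32 or ISIZE values.

```python
import gzip
import struct
import zlib

FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10
RESERVED = 0xE0  # FLG bits 5, 6 and 7


def skip_header(member: bytes) -> int:
    """Validate the header and return the offset of the compressed blocks."""
    if len(member) < 10:
        raise ValueError("truncated header")
    id1, id2, cm, flg = member[0], member[1], member[2], member[3]
    if (id1, id2) != (31, 139):
        raise ValueError("not a gzip member")
    if cm != 8:
        raise ValueError("unsupported compression method")
    if flg & RESERVED:
        raise ValueError("reserved FLG bit set")
    pos = 10
    if flg & FEXTRA:
        (xlen,) = struct.unpack_from("<H", member, pos)
        pos += 2 + xlen
    if flg & FNAME:
        pos = member.index(b"\x00", pos) + 1
    if flg & FCOMMENT:
        pos = member.index(b"\x00", pos) + 1
    if flg & FHCRC:
        pos += 2
    return pos


member = gzip.compress(b"payload")
start = skip_header(member)
# Negative wbits: decode raw deflate data; the trailer is left in unused_data.
print(zlib.decompressobj(wbits=-15).decompress(member[start:]))
```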