Big-Endian and Little-Endian Issues in Java
1. Solving Endian Problems: A Summary
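Everything in a Java binary file is stored in big-endian form, most significant byte first, which is sometimes called network order. This is good news: if you use only Java, all files are handled the same way on every platform (Mac, PC, Solaris and so on). You can freely exchange binary data electronically over the Internet, or on floppy disk, without worrying about endianness. The trouble starts when you exchange data files with programs that are not written in Java. Such programs, typically written in C on a PC, often use little-endian order. Some platforms use big-endian byte order internally (Mac, IBM 390); others use little-endian byte order (Intel). Java hides the endian issue from the user.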
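As a small illustration of the big-endian claim, the sketch below writes one int through java.io.DataOutputStream into an in-memory buffer and prints the resulting bytes. The class name BigEndianDemo and its helper intToBytes are invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; the name is not part of any library.
class BigEndianDemo {
    // Serialise one int with DataOutputStream and return the raw bytes.
    static byte[] intToBytes(int value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(value);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // The most significant byte (01) comes out first.
        for (byte b : intToBytes(0x01020304)) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```

The bytes appear in the order 01 02 03 04 whatever CPU the JVM runs on, which is why Java-to-Java exchange needs no conversion.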
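In a binary file there are no separators between fields; the file is binary, not human-readable ASCII. If the data you want to read is not in the standard format, it was usually prepared by a non-Java program. You have four options:
1). Rewrite the output program that produces the input file. It can directly output a big-endian binary DataOutputStream format or a character DataOutputStream format.
2). Write a separate translation program that reads and rearranges the bytes. It can be written in any language.
3). Read the data as bytes and rearrange them on the fly.
4). The simplest way is to use my LEDataInputStream, LEDataOutputStream and LERandomAccessFile classes, which mimic DataInputStream, DataOutputStream and RandomAccessFile but work with little-endian byte streams. You can read about LEDataStream. You can download the code and source free. You can get help from the File I/O Amanuensis to show you how to use the classes. Just tell it you have little-endian binary data.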
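Option 3 can be sketched with standard JDK classes that postdate the original LE classes: java.nio.ByteBuffer (Java 1.4 onward) can be told to decode in little-endian order, and Integer.reverseBytes (Java 5 onward) swaps an int that was read big-endian. The class OnTheFlyDemo and its method names are made up for this illustration; this is one possible approach, not the author's LEDataInputStream.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class showing two ways to rearrange bytes on the fly.
class OnTheFlyDemo {
    // Interpret the first four bytes as a little-endian int using java.nio.
    static int viaByteBuffer(byte[] raw) {
        return ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    // Same result: read big-endian (the ByteBuffer default), then swap the bytes.
    static int viaReverseBytes(byte[] raw) {
        int bigEndian = ByteBuffer.wrap(raw).getInt();
        return Integer.reverseBytes(bigEndian);
    }

    public static void main(String[] args) {
        // 0x01020304 as a little-endian program would store it: low byte first.
        byte[] raw = {0x04, 0x03, 0x02, 0x01};
        System.out.printf("%08x %08x%n", viaByteBuffer(raw), viaReverseBytes(raw));
    }
}
```

Both calls recover 0x01020304 from the same four bytes.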
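2. You May Not Even Have a Problem
Many Java newcomers coming from C assume they need to know whether the platform they run on is big-endian or little-endian internally. In Java this is not an issue. Furthermore, without resorting to native classes, you cannot find out how values are stored. Java has no struct I/O and no unions or any of the other endian-sensitive language constructs.
You only need to think about endianness when communicating with legacy C/C++ applications. The following code produces the same result on big-endian and little-endian machines: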
// take 16-bit short apart into two 8-bit bytes.
short x = (short) 0xabcd; // cast required: 0xabcd is an int literal outside the short range
byte high = (byte) (x >>> 8);
byte low = (byte) x; /* cast implies & 0xff */
System.out.println("x=" + x + " high=" + high + " low=" + low);
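For the curious, JDKs from 1.4 onward do report the platform's byte order through java.nio.ByteOrder.nativeOrder(). The sketch below prints it next to the same split, to show that the arithmetic does not depend on it. SplitShortDemo, highByte and lowByte are names invented for this example.

```java
import java.nio.ByteOrder;

// Hypothetical demo class: shifts and casts work on values, not on memory layout.
class SplitShortDemo {
    // Upper 8 bits of the short.
    static byte highByte(short x) {
        return (byte) (x >>> 8);
    }

    // Lower 8 bits of the short; the cast discards everything else.
    static byte lowByte(short x) {
        return (byte) x;
    }

    public static void main(String[] args) {
        short x = (short) 0xabcd;
        // This line varies by machine; the next one does not.
        System.out.println("native order: " + ByteOrder.nativeOrder());
        System.out.printf("high=%02x low=%02x%n", highByte(x), lowByte(x));
    }
}
```

On any machine the second line shows high=ab low=cd.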
3. Reading Little-Endian Binary Files
The most common problem is dealing with files stored in little-endian format.
I had to implement routines parallel to those in java.io.DataInputStream which reads raw binary, in my LEDataInputStream and LEDataOutputStream classes. Don't confuse this with the io.DataInput human-readable character-based file-interchange format.
If you wanted to do it yourself, without the overhead of the full LEDataInputStream and LEDataOutputStream classes, here is the basic technique:
Presuming your integers are in 2's complement little-endian format, shorts are pretty easy to handle:
short readShortLittleEndian()
{
    // 2 bytes
    int low = readByte() & 0xff;
    int high = readByte() & 0xff;
    return (short) (high << 8 | low);
}
Or if you want to get clever and puzzle your readers, you can avoid one mask since the high bits will later be shaved off by conversion back to short.
short readShortLittleEndian()
{
    // 2 bytes
    int low = readByte() & 0xff;
    int high = readByte();
    // avoid masking here
    return (short) (high << 8 | low);
}
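To see why only the high-byte mask is optional, here is a small sketch that combines two bytes both ways. MaskDemo, combine and combineUnmasked are hypothetical names; the second method is wrong on purpose.

```java
// Hypothetical demo class comparing masked and unmasked byte assembly.
class MaskDemo {
    // Correct: low byte masked, high byte left sign-extended.
    // The cast to short throws away the sign-extension bits of high.
    static short combine(byte low, byte high) {
        return (short) (high << 8 | (low & 0xff));
    }

    // Buggy on purpose: a negative low byte widens to 0xFFFFFFxx,
    // and the OR then floods the high byte with ones.
    static short combineUnmasked(byte low, byte high) {
        return (short) (high << 8 | low);
    }

    public static void main(String[] args) {
        byte low = (byte) 0xcd;
        byte high = (byte) 0x12;
        System.out.printf("%04x %04x%n", combine(low, high), combineUnmasked(low, high));
    }
}
```

With low = 0xcd and high = 0x12 the masked version yields 12cd, while the unmasked one yields ffcd. The mask on the low byte is therefore never optional.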
Longs are a little more complicated:
long readLongLittleEndian()
{
    // 8 bytes
    long accum = 0;
    for ( int shiftBy = 0; shiftBy < 64; shiftBy += 8 )
    {
        // must cast to long or shift done modulo 32
        accum |= (long) (readByte() & 0xff) << shiftBy;
    }
    return accum;
}
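The comment about casting to long can be checked directly. In Java, shifting an int uses only the low five bits of the shift distance, so a shift by 40 behaves like a shift by 8. The sketch below contrasts the two; ShiftDemo and its method names are invented for this example.

```java
// Hypothetical demo class: int shifts wrap the distance modulo 32, long shifts modulo 64.
class ShiftDemo {
    // Shift performed in int arithmetic, then widened to long.
    static long shiftAsInt(int b, int shiftBy) {
        return b << shiftBy;
    }

    // Operand widened to long first, so the full 64-bit shift is available.
    static long shiftAsLong(int b, int shiftBy) {
        return (long) b << shiftBy;
    }

    public static void main(String[] args) {
        System.out.printf("%x %x%n", shiftAsInt(0xab, 40), shiftAsLong(0xab, 40));
    }
}
```

The int version lands the byte at bits 8..15 (ab00) instead of bits 40..47 (ab0000000000), which would silently corrupt the upper half of every long.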
In a similar way we handle char and int.
char readCharLittleEndian()
{
    // 2 bytes
    int low = readByte() & 0xff;
    int high = readByte();
    return (char) (high << 8 | low);
}
int readIntLittleEndian()
{
    // 4 bytes
    int accum = 0;
    for ( int shiftBy = 0; shiftBy < 32; shiftBy += 8 )
    {
        accum |= (readByte() & 0xff) << shiftBy;
    }
    return accum;
}
Floating point is a little trickier. Presuming your data is in IEEE little-endian format, you need something like this:
double readDoubleLittleEndian()
{
    // 8 bytes
    long accum = 0;
    for ( int shiftBy = 0; shiftBy < 64; shiftBy += 8 )
    {
        // must cast to long or shift done modulo 32
        accum |= (long) (readByte() & 0xff) << shiftBy;
    }
    return Double.longBitsToDouble(accum);
}
float readFloatLittleEndian()
{
    // 4 bytes
    int accum = 0;
    for ( int shiftBy = 0; shiftBy < 32; shiftBy += 8 )
    {
        accum |= (readByte() & 0xff) << shiftBy;
    }
    return Float.intBitsToFloat(accum);
}
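A self-contained round trip may help in checking the floating-point technique. In the sketch below, java.nio.ByteBuffer produces little-endian test bytes and the shift-and-accumulate loop reads them back from a ByteArrayInputStream. The class LittleEndianDoubleDemo and the helper littleEndianBytes are assumptions made for the example, and end-of-stream handling is deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class: round-trips a double through little-endian bytes.
class LittleEndianDoubleDemo {
    // Assemble eight little-endian bytes into a long, then reinterpret the bits.
    // Note: no end-of-stream check; read() returning -1 is not detected here.
    static double readDoubleLittleEndian(ByteArrayInputStream in) {
        long accum = 0;
        for (int shiftBy = 0; shiftBy < 64; shiftBy += 8) {
            accum |= (long) (in.read() & 0xff) << shiftBy;
        }
        return Double.longBitsToDouble(accum);
    }

    // Produce IEEE little-endian test data, as a C program on Intel might write it.
    static byte[] littleEndianBytes(double value) {
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putDouble(value).array();
    }

    public static void main(String[] args) {
        byte[] raw = littleEndianBytes(Math.PI);
        double back = readDoubleLittleEndian(new ByteArrayInputStream(raw));
        System.out.println(back == Math.PI);
    }
}
```

Because the bits are moved unchanged, the value read back compares equal to the one written.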
You don't need a readByteLittleEndian since the code would be identical to readByte, though you might create one just for consistency:
byte readByteLittleEndian()
{
    // 1 byte
    return readByte();
}
4. History
In Gulliver's Travels the Lilliputians liked to break their eggs on the small end and the Blefuscudians on the big end. They fought wars over this. There is a computer analogy. Should numbers be stored most or least significant byte first? This is sometimes referred to as byte sex.
Those in the big-endian camp (most significant byte stored first) include the Java VM virtual computer, the Java binary file format, the IBM 360 and follow-on mainframes such as the 390, and the Motorola 68K and most mainframes. The Power PC is endian-agnostic.
Blefuscudians (big-endians) assert this is the way God intended integers to be stored, most important part first. At an assembler level fields of mixed positive integers and text can be sorted as if it were one big text field key. Real programmers read hex dumps, and big-endian is a lot easier to comprehend.
In the little-endian camp (least significant byte first) are the Intel 8080, 8086, 80286, Pentium and follow-ons, and the MOS Technology 6502 popularised by the Apple ][.
Lilliputians (little-endians) assert that putting the low order part first is more natural because when you do arithmetic manually, you start at the least significant part and work toward the most significant part. This ordering makes writing multi-precision arithmetic easier since you work up not down. It made implementing 8-bit microprocessors easier. At the assembler level (not in Java) it also lets you cheat and pass the address of a 32-bit positive int to a routine expecting only a 16-bit parameter and still have it work. Real programmers read hex dumps, and little-endian is more of a stimulating challenge.
If a machine is word addressable, with no finer addressing supported, the concept of endianness means nothing since words are fetched from RAM in parallel, both ends first.
5. What Sex Is Your CPU?
Byte Sex Endianness of CPUs
CPU | Endianness | Notes
MOS Technology 6502; AMD Duron, Athlon, Thunderbird | little | The 6502 was used in the Apple ][; the Duron, Athlon and Thunderbird run Windows 95/98/ME/NT/2000/XP.
Apple ][ 6502 | little |
Apple Mac 68000 | big | Uses Motorola 68000.
Apple Power PC | big | CPU is bisexual but stays big in the Mac OS.
Burroughs 1700, 1800, 1900 | ? | Bit addressable. Used different interpreter firmware instruction sets for each language.
Burroughs 7800 | ? | Algol machine.
CDC LGP-30 | word-addressable only, hence no endianness | 31½ bit words. Low order bit must be 0 on the drum, but can be 1 in the accumulator.
CDC 3300, 6600 | word-addressable | ?
DEC PDP, Vax | little |
IBM 360, 370, 380, 390 | big |
IBM 7044, 7090 | word-addressable | 36 bits.
IBM AS-400 | big | ?
Power PC | either | The endian-agnostic Power PCs have a foot in both camps. They are bisexual, but the OS usually imposes one convention or the other, e.g. Mac PowerPCs are big-endian.
Intel 8080, 8086, 80286, 80386, 80486, Pentium I, II, III, IV | little | Chips used in PCs.
Intel 8051 | big |
MIPS R4000, R5000, R10000 | big | Used in Silicon Graphics IRIX.
Motorola 6800, 6809, 680x0, 68HC11 | big | Early Macs used the 68000. Amiga.
NCR 8500 | big |
NCR Century | big |
Sun Sparc and UltraSparc | big | Sun's Solaris. Normally used as big-endian, but also supports operating in little-endian mode, including switching endianness under program control for particular loads and stores.
Univac 1100 | word-addressable | 36-bit words.
Univac 90/30 | big | IBM 370 clone.
Zilog Z80 | little | Used in CPM machines.
If you know the endianness of other CPUs/OSes/platforms please email me at roedy@mindprod.com.
In theory data can have two different byte sexes but CPUs can have four. Let us give thanks, in this world of mixed left and right hand drive, that there are not real CPUs with all four sexes to contend with.
The Four Possible Byte Sexes for CPUs
Which Byte Is Stored in the Lower-Numbered Address? | Which Byte Is Addressed? | Used In
LSB | LSB | Intel, AMD, Power PC, DEC.
LSB | MSB | None that I know of.
MSB | LSB | Perhaps one of the old word mark architecture machines.
MSB | MSB | Mac, IBM 390, Power PC.
You can get an updated copy of this page from http://mindprod.com/endian.html