Compressing and Decompressing Data using Java APIs
by Qusay H. Mahmoud
with contributions from Konstantin Kladko
February 2002
Many sources of information contain redundant data or data that adds little to the stored information. This results in tremendous amounts of data being transferred between client and server applications or computers in general. The obvious solution to the problems of data storage and information transfer is to install additional storage devices and expand existing communication facilities. Doing so, however, increases an organization's operating costs. One way to reduce the burden of data storage and information transfer is to represent data with a more efficient code. This article presents a brief introduction to data compression and decompression, and shows how to compress and decompress data, efficiently and conveniently, from within your Java applications using the java.util.zip package.
While it is possible to compress and decompress data using tools such as WinZip, gzip, and Java ARchive (or jar), these tools are used as standalone applications. It is possible to invoke these tools from your Java applications, but this is not a straightforward approach and not an efficient solution. This is especially true if you wish to compress and decompress data on the fly (before transferring it to a remote machine for example). This article:
Gives you a brief overview of data compression
Describes the java.util.zip package
Shows how to use this package to compress and decompress data
Shows how to compress and decompress serialized objects to save disk space
Shows how to compress and decompress data on the fly to improve the performance of client/server applications
Overview of Data Compression
The simplest type of redundancy in a file is the repetition of characters. For example, consider the following string:
BBBBHHDDXXXXKKKKWWZZZZ
This string can be encoded more compactly by replacing each repeated string of characters by a single instance of the repeated character and a number that represents the number of times it is repeated. The earlier string can be encoded as follows:
4B2H2D4X4K2W4Z
Here "4B" means four B's, and 2H means two H's, and so on. Compressing a string in this way is called run-length encoding.
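The encoding step can be sketched in a few lines of Java. This is only an illustration of the idea: the class name RunLengthSketch is hypothetical, and the sketch assumes the input contains no digits, since a digit in the data could not be told apart from a count.

```java
// Run-length encoding sketch: each run of identical characters becomes
// "<count><character>". Assumes the input itself contains no digits.
class RunLengthSketch {
    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int run = 1;
            // extend the run while the next character is the same
            while (i + run < input.length() && input.charAt(i + run) == c) {
                run++;
            }
            out.append(run).append(c);
            i += run;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints 4B2H2D4X4K2W4Z
        System.out.println(encode("BBBBHHDDXXXXKKKKWWZZZZ"));
    }
}
```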
As another example, consider the storage of a rectangular image. As a single color bitmapped image, it can be stored as shown in Figure 1.
Figure 1: A bitmap with information for run-length encoding
Another approach might be to store the image as a graphics metafile:
Rectangle 11, 3, 20, 5
This says that the rectangle starts at coordinate (11, 3) and is 20 pixels wide and 5 pixels high.
The rectangular image can be compressed with run-length encoding by counting identical bits as follows:
0,40
0,40
0,10 1,20 0,10
0,10 1,1 0,18 1,1 0,10
0,10 1,1 0,18 1,1 0,10
0,10 1,1 0,18 1,1 0,10
0,10 1,20 0,10
0,40
The first line above says that the first line of the bitmap consists of 40 0's. The third line says that the third line of the bitmap consists of 10 0's followed by 20 1's followed by 10 more 0's, and so on for the other lines.
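To see that nothing is lost, an encoded row can be expanded back into pixels. The following sketch assumes each row is written as space-separated "bit,count" pairs, as in the third row above; the class name BitmapRowSketch is made up for this illustration.

```java
// Expands one run-length encoded bitmap row, such as "0,10 1,20 0,10",
// back into a string of '0' and '1' pixels.
class BitmapRowSketch {
    static String decodeRow(String encoded) {
        StringBuilder row = new StringBuilder();
        for (String pair : encoded.trim().split("\\s+")) {
            String[] parts = pair.split(",");
            char bit = parts[0].charAt(0);            // the pixel value
            int count = Integer.parseInt(parts[1]);   // how many times it repeats
            for (int i = 0; i < count; i++) {
                row.append(bit);
            }
        }
        return row.toString();
    }

    public static void main(String[] args) {
        String row = decodeRow("0,10 1,20 0,10");
        System.out.println(row.length() + " pixels: " + row);
    }
}
```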
Note that run-length encoding requires separate representations for the file and its encoded version. Therefore, this method cannot work for all files. Other compression techniques include variable-length encoding (of which Huffman coding is the best-known example), and many others. For more information, there are many books available on data and image compression techniques.
There are many benefits to data compression. The main advantage of it, however, is to reduce storage requirements. Also, for data communications, the transfer of compressed data over a medium results in an increase in the rate of information transfer. Note that data compression can be implemented on existing hardware by software or through the use of special hardware devices that incorporate compression techniques. Figure 2 shows a basic data-compression block diagram.
Figure 2: Data-compression block diagram
ZIP vs. GZIP
If you are working on Windows, you might be familiar with the WinZip tool, which is used to create a compressed archive and to extract files from a compressed archive. On UNIX, however, things are done a bit differently. The tar command is used to create an archive (not compressed) and another program (gzip or compress) is used to compress the archive.
Tools such as WinZip and PKZIP act as both an archiver and a compressor. They compress files and store them in an archive. On the other hand, gzip does not archive files. Therefore, on UNIX, the tar command is usually used to create an archive then the gzip command is used to compress the archived file.
The java.util.zip Package
Java provides the java.util.zip package for zip-compatible data compression. It provides classes that allow you to read, create, and modify ZIP and GZIP file formats. It also provides utility classes for computing checksums of arbitrary input streams that can be used to validate input data. This package provides one interface, fourteen classes, and two exception classes as shown in Table 1.
Table 1: The java.util.zip package
Item | Type | Description
Checksum | Interface | Represents a data checksum. Implemented by the classes Adler32 and CRC32
Adler32 | Class | Used to compute the Adler32 checksum of a data stream
CheckedInputStream | Class | An input stream that maintains the checksum of the data being read
CheckedOutputStream | Class | An output stream that maintains the checksum of the data being written
CRC32 | Class | Used to compute the CRC32 checksum of a data stream
Deflater | Class | Supports general compression using the ZLIB compression library
DeflaterOutputStream | Class | An output stream filter for compressing data in the deflate compression format
GZIPInputStream | Class | An input stream filter for reading compressed data in the GZIP file format
GZIPOutputStream | Class | An output stream filter for writing compressed data in the GZIP file format
Inflater | Class | Supports general decompression using the ZLIB compression library
InflaterInputStream | Class | An input stream filter for decompressing data in the deflate compression format
ZipEntry | Class | Represents a ZIP file entry
ZipFile | Class | Used to read entries from a ZIP file
ZipInputStream | Class | An input stream filter for reading files in the ZIP file format
ZipOutputStream | Class | An output stream filter for writing files in the ZIP file format
DataFormatException | Exception Class | Thrown to signal a data format error
ZipException | Exception Class | Thrown to signal a zip error
Note: The ZLIB compression library was initially developed as part of the Portable Network Graphics (PNG) standard, which is not protected by patents.
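The Deflater and Inflater classes listed in Table 1 can also be used directly on byte arrays, without any file or stream format around them. The following is a minimal sketch of a round trip; the class name DeflateSketch and the 1024-byte working buffer are arbitrary choices made for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class DeflateSketch {
    // Compresses a byte array with the default compression level.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();                       // no more input will follow
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();                          // release native resources
        return out.toByteArray();
    }

    // Restores the original bytes from the output of compress().
    static byte[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                throw new DataFormatException("truncated or unsupported input");
            }
            out.write(buffer, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] input = new byte[2200];
        Arrays.fill(input, (byte) 'B');          // highly repetitive sample data
        byte[] packed = compress(input);
        System.out.println(input.length + " bytes compressed to " + packed.length);
        System.out.println("round trip ok: " + Arrays.equals(input, decompress(packed)));
    }
}
```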
Decompressing and Extracting Data from a ZIP file
The java.util.zip package provides classes for data compression and decompression. Decompressing a ZIP file is a matter of reading data from an input stream. The java.util.zip package provides a ZipInputStream class for reading ZIP files. A ZipInputStream can be created just like any other input stream. For example, the following segment of code can be used to create an input stream for reading data from a ZIP file format:
FileInputStream fis = new FileInputStream("figs.zip");
ZipInputStream zin = new
ZipInputStream(new BufferedInputStream(fis));
Once a ZIP input stream is opened, you can read the zip entries using the getNextEntry method, which returns a ZipEntry object. If the end-of-file is reached, getNextEntry returns null:
ZipEntry entry;
while((entry = zin.getNextEntry()) != null) {
    // extract data
    // open output streams
}
Now, it is time to set up a decompressed output stream, which can be done as follows:
int BUFFER = 2048;
FileOutputStream fos = new
FileOutputStream(entry.getName());
BufferedOutputStream dest = new
BufferedOutputStream(fos, BUFFER);
Note: In this segment of code we have used a BufferedOutputStream instead of a ZipOutputStream. ZipOutputStream and GZIPOutputStream use an internal buffer size of 512 bytes. Using a BufferedOutputStream is justified only when its buffer is much larger than 512 bytes (in this example it is set to 2048). ZipOutputStream does not let you set the buffer size; GZIPOutputStream, however, accepts the internal buffer size as a constructor argument.
In this segment of code, a file output stream is created using the entry's name, which can be retrieved using the entry.getName method. Source zipped data is then read and written to the decompressed stream:
int count;
byte data[] = new byte[BUFFER];
while ((count = zin.read(data, 0, BUFFER)) != -1) {
    dest.write(data, 0, count);
}
And finally, close the input and output streams:
dest.flush();
dest.close();
zin.close();
The source program in Code Sample 1 shows how to decompress and extract files from a ZIP archive. To test this sample, compile the class and run it by passing a compressed file in ZIP format:
prompt> java UnZip somefile.zip
Note that somefile.zip could be a ZIP archive created using any ZIP-compatible tool, such as WinZip.
Code Sample 1: UnZip.java
import java.io.*;
import java.util.zip.*;

public class UnZip {
    // static, so that it can be used from the static main method
    static final int BUFFER = 2048;

    public static void main (String argv[]) {
        try {
            BufferedOutputStream dest = null;
            FileInputStream fis = new FileInputStream(argv[0]);
            ZipInputStream zis = new ZipInputStream(new BufferedInputStream(fis));
            ZipEntry entry;
            // getNextEntry returns null at the end of the archive
            while((entry = zis.getNextEntry()) != null) {
                System.out.println("Extracting: " +entry);
                int count;
                byte data[] = new byte[BUFFER];
                // write the files to the disk
                FileOutputStream fos = new FileOutputStream(entry.getName());
                dest = new BufferedOutputStream(fos, BUFFER);
                while ((count = zis.read(data, 0, BUFFER)) != -1) {
                    dest.write(data, 0, count);
                }
                dest.flush();
                dest.close();
            }
            zis.close();
        } catch(Exception e) {
            e.printStackTrace();
        }
    }
}
It is important to note that the ZipInputStream class reads ZIP files sequentially. The class ZipFile, however, reads the contents of a ZIP file using a random access file internally so that the entries of the ZIP file do not have to be read sequentially.
Note: Another fundamental difference between ZipInputStream and ZipFile is in terms of caching. Zip entries are not cached when the file is read using a combination of ZipInputStream and FileInputStream. However, if the file is opened using ZipFile(fileName) then it is cached internally, so if ZipFile(fileName) is called again the file is opened only once. The cached value is used on the second open. If you work on UNIX, it is worth noting that all zip files opened using ZipFile are memory mapped, and therefore the performance of ZipFile is superior to ZipInputStream. If the contents of the same zip file, however, are to be frequently changed and reloaded during program execution, then using ZipInputStream is preferred.
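As a small illustration of this random access, the sketch below looks up a single entry by name with ZipFile.getEntry instead of walking through the entries that come before it. It first writes a two-entry archive to a temporary file so that it runs on its own. The class name, the entry names, and the use of try-with-resources (which assumes Java 7 or later) are choices made for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class ZipFileLookupSketch {
    // Writes a small two-entry archive so that the example is self-contained.
    static File createSampleZip() throws IOException {
        File zip = File.createTempFile("sample", ".zip");
        zip.deleteOnExit();
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip))) {
            out.putNextEntry(new ZipEntry("first.txt"));
            out.write("first entry".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("second.txt"));
            out.write("second entry".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return zip;
    }

    // Reads one named entry directly and returns its contents as text.
    static String readEntry(File zip, String name) throws IOException {
        try (ZipFile zipFile = new ZipFile(zip)) {
            ZipEntry entry = zipFile.getEntry(name);
            if (entry == null) {
                throw new FileNotFoundException(name + " not found in archive");
            }
            try (InputStream in = zipFile.getInputStream(entry)) {
                ByteArrayOutputStream contents = new ByteArrayOutputStream();
                byte[] data = new byte[2048];
                int count;
                while ((count = in.read(data, 0, data.length)) != -1) {
                    contents.write(data, 0, count);
                }
                return new String(contents.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readEntry(createSampleZip(), "second.txt"));
    }
}
```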
This is how a ZIP file can be decompressed using the ZipFile class:
Create a ZipFile object by specifying the ZIP file to be read either as a String filename or as a File object:
ZipFile zipfile = new ZipFile("figs.zip");
Use the entries method, which returns an Enumeration object, to loop through all the ZipEntry objects of the file:
Enumeration e = zipfile.entries();
while(e.hasMoreElements()) {
    entry = (ZipEntry) e.nextElement();
    // read contents and save them
}
Read the contents of a specific ZipEntry within the ZIP file by passing the ZipEntry to getInputStream, which will return an InputStream object from which you can read the entry's contents:
is = new BufferedInputStream(zipfile.getInputStream(entry));
Retrieve the entry's filename and create an output stream to save it:
byte data[] = new byte[BUFFER];
FileOutputStream fos = new FileOutputStream(entry.getName());
dest = new BufferedOutputStream(fos, BUFFER);
while ((count = is.read(data, 0, BUFFER)) != -1) {
    dest.write(data, 0, count);
}
Finally, close all input and output streams:
dest.flush();
dest.close();
is.close();
The complete source program is shown in Code Sample 2. Again, to test this class, compile it and run it by passing a file in a ZIP format as an argument:
prompt> java UnZip2 somefile.zip
Code Sample 2: UnZip2.java
import java.io.*;
import java.util.*;
import java.util.zip.*;

public class UnZip2 {
    static final int BUFFER = 2048;

    public static void main (String argv[]) {
        try {
            BufferedOutputStream dest = null;
            BufferedInputStream is = null;
            ZipEntry entry;
            ZipFile zipfile = new ZipFile(argv[0]);
            Enumeration e = zipfile.entries();
            while(e.hasMoreElements()) {
                entry = (ZipEntry) e.nextElement();
                System.out.println("Extracting: " +entry);
                // open a stream on this entry's contents
                is = new BufferedInputStream(zipfile.getInputStream(entry));
                int count;
                byte data[] = new byte[BUFFER];
                FileOutputStream fos = new FileOutputStream(entry.getName());
                dest = new BufferedOutputStream(fos, BUFFER);
                while ((count = is.read(data, 0, BUFFER)) != -1) {
                    dest.write(data, 0, count);
                }
                dest.flush();
                dest.close();
                is.close();
            }
        } catch(Exception e) {
            e.printStackTrace();
        }
    }
}
Compressing and Archiving Data in a ZIP File
The ZipOutputStream can be used to compress data to a ZIP file. The ZipOutputStream writes data to an output stream in a ZIP format. There are a number of steps involved in creating a ZIP file.
The first step is to create a ZipOutputStream object, to which we pass the output stream of the file we wish to write to. Here is how you create a ZIP file entitled "myfigs.zip":
FileOutputStream dest = new FileOutputStream("myfigs.zip");
ZipOutputStream out = new ZipOutputStream(new BufferedOutputStream(dest));
Once the target zip output stream is created, the next step is to open the source data file. In this example, source data files are those files in the current directory. The list method of the File class is used to get a list of files in the current directory:
File f = new File(".");
String files[] = f.list();
for (int i=0; i<files.length; i++) {
    System.out.println("Adding: "+files[i]);
    FileInputStream fi = new FileInputStream(files[i]);
    // create zip entry
    // add entries to ZIP file
}
Note: This code sample is capable of compressing all files in the current directory. It doesn't handle subdirectories. As an exercise, you may want to modify Code Sample 3 to handle subdirectories.
Create a zip entry for each file that is read:
ZipEntry entry = new ZipEntry(files[i]);
Before you can write data to the ZIP output stream, you must first put the zip entry object using the putNextEntry method:
out.putNextEntry(entry);
Write the data to the ZIP file:
// origin is a BufferedInputStream wrapped around fi, as in Code Sample 3
int count;
while((count = origin.read(data, 0, BUFFER)) != -1) {
    out.write(data, 0, count);
}
Finally, you close the input and output streams:
origin.close();
out.close();
The complete source program is shown in Code Sample 3.
Code Sample 3: Zip.java
import java.io.*;
import java.util.zip.*;

public class Zip {
    static final int BUFFER = 2048;

    public static void main (String argv[]) {
        try {
            BufferedInputStream origin = null;
            FileOutputStream dest = new FileOutputStream("c:\\zip\\myfigs.zip");
            ZipOutputStream out = new ZipOutputStream(new BufferedOutputStream(dest));
            //out.setMethod(ZipOutputStream.DEFLATED);
            byte data[] = new byte[BUFFER];
            // get a list of files from current directory
            File f = new File(".");
            String files[] = f.list();
            for (int i=0; i<files.length; i++) {
                System.out.println("Adding: "+files[i]);
                FileInputStream fi = new FileInputStream(files[i]);
                origin = new BufferedInputStream(fi, BUFFER);
                // start a new entry, then copy the file's bytes into it
                ZipEntry entry = new ZipEntry(files[i]);
                out.putNextEntry(entry);
                int count;
                while((count = origin.read(data, 0, BUFFER)) != -1) {
                    out.write(data, 0, count);
                }
                origin.close();
            }
            out.close();
        } catch(Exception e) {
            e.printStackTrace();
        }
    }
}
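One possible approach to the subdirectory exercise mentioned earlier is to walk the directory tree recursively and store each file under its relative path. The following is a sketch rather than a finished tool: the class and method names are hypothetical, it assumes Java 7 or later for try-with-resources, and it assumes the target archive is written outside the directory being archived so that the archive does not try to include itself.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipTreeSketch {
    static final int BUFFER = 2048;

    // Adds 'file' to the archive under 'entryName'; directories are walked
    // recursively. Entry names use '/' as the separator inside the archive.
    static void add(ZipOutputStream out, File file, String entryName) throws IOException {
        if (file.isDirectory()) {
            File[] children = file.listFiles();
            if (children == null) {
                return; // directory could not be read; skip it
            }
            for (File child : children) {
                add(out, child, entryName + "/" + child.getName());
            }
            return;
        }
        out.putNextEntry(new ZipEntry(entryName));
        try (InputStream origin = new BufferedInputStream(new FileInputStream(file), BUFFER)) {
            byte[] data = new byte[BUFFER];
            int count;
            while ((count = origin.read(data, 0, BUFFER)) != -1) {
                out.write(data, 0, count);
            }
        }
        out.closeEntry();
    }

    // Archives everything below 'root' into 'target' (assumed to lie outside 'root').
    static void zipDirectory(File root, File target) throws IOException {
        File[] children = root.listFiles();
        if (children == null) {
            throw new IOException("not a readable directory: " + root);
        }
        try (ZipOutputStream out = new ZipOutputStream(
                new BufferedOutputStream(new FileOutputStream(target)))) {
            for (File child : children) {
                add(out, child, child.getName());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // usage: java ZipTreeSketch <directory> <archive.zip>
        zipDirectory(new File(args[0]), new File(args[1]));
    }
}
```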
Note: Entries can be added to a ZIP file either in a compressed (DEFLATED) or uncompressed (STORED) form. The setMethod method can be used to set the method of storage. For example, to set the method to DEFLATED (compressed) use: out.setMethod(ZipOutputStream.DEFLATED) and to set it to STORED (not compressed) use: out.setMethod(ZipOutputStream.STORED).
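One detail to keep in mind with STORED: ZipOutputStream expects the size and CRC-32 of a stored entry to be set before putNextEntry is called, and throws a ZipException otherwise. The sketch below writes a single stored entry into an in-memory archive; the class name and the entry name are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class StoredEntrySketch {
    // Writes 'data' as one uncompressed (STORED) entry and returns the archive bytes.
    static byte[] zipStored(String name, byte[] data) throws IOException {
        // A STORED entry needs its CRC-32 and sizes before it is written.
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);

        ZipEntry entry = new ZipEntry(name);
        entry.setMethod(ZipEntry.STORED);
        entry.setSize(data.length);
        entry.setCompressedSize(data.length);   // same as the size: nothing is compressed
        entry.setCrc(crc.getValue());

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ZipOutputStream out = new ZipOutputStream(bytes);
        out.putNextEntry(entry);
        out.write(data, 0, data.length);
        out.closeEntry();
        out.close();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "stored without compression".getBytes(StandardCharsets.UTF_8);
        byte[] zip = zipStored("note.txt", data);
        System.out.println(data.length + " bytes of data, " + zip.length + " bytes of archive");
    }
}
```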
ZIP File Properties
The ZipEntry class describes a compressed file stored in a ZIP file. The various methods contained in this class can be used to set and get pieces of information about the entry. The ZipEntry class is used by the ZipFile and ZipInputStream to read ZIP files, and the ZipOutputStream to write ZIP files. Some of the most useful methods available in the ZipEntry class are shown, along with a description, in Table 2.
Table 2: Some useful methods from the ZipEntry class
Method Signature | Description
public String getComment() | Returns the comment string for the entry, null if none
public long getCompressedSize() | Returns the compressed size of the entry, -1 if not known
public int getMethod() | Returns the compression method of the entry, -1 if not specified
public String getName() | Returns the name of the entry
public long getSize() | Returns the uncompressed size of the entry, -1 if unknown
public long getTime() | Returns the modification time of the entry, -1 if not specified
public void setComment(String c) | Sets the optional comment string for the entry
public void setMethod(int method) | Sets the compression method for the entry
public void setSize(long size) | Sets the uncompressed size of the entry
public void setTime(long time) | Sets the modification time of the entry
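The sketch below shows a few of these methods in use. It writes a one-entry archive to a temporary file and then prints the properties of each entry as reported by ZipFile. The class name, the entry name, and the output format are made up for this illustration, and try-with-resources assumes Java 7 or later.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class EntryInfoSketch {
    // Writes a one-entry archive so that the example is self-contained.
    static File createSampleZip(byte[] content) throws IOException {
        File zip = File.createTempFile("entries", ".zip");
        zip.deleteOnExit();
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip))) {
            ZipEntry entry = new ZipEntry("notes.txt");
            entry.setComment("sample entry");
            out.putNextEntry(entry);
            out.write(content, 0, content.length);
            out.closeEntry();
        }
        return zip;
    }

    // Summarizes an entry using the getters from Table 2.
    static String describe(ZipEntry entry) {
        String method;
        if (entry.getMethod() == ZipEntry.STORED) {
            method = "STORED";
        } else if (entry.getMethod() == ZipEntry.DEFLATED) {
            method = "DEFLATED";
        } else {
            method = "unspecified";
        }
        return entry.getName() + ": " + entry.getSize() + " bytes, "
            + entry.getCompressedSize() + " compressed, " + method;
    }

    public static void main(String[] args) throws IOException {
        byte[] content = new byte[1000];
        Arrays.fill(content, (byte) 'a');
        try (ZipFile zipFile = new ZipFile(createSampleZip(content))) {
            Enumeration<? extends ZipEntry> e = zipFile.entries();
            while (e.hasMoreElements()) {
                System.out.println(describe(e.nextElement()));
            }
        }
    }
}
```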
Checksums
Some of the other important classes in the java.util.zip package are the Adler32 and CRC32 classes, which implement the java.util.zip.Checksum interface and compute the checksums required for data compression. An Adler32 checksum can be computed much faster than a CRC32 checksum and is almost as reliable. The getValue method can be used to obtain the current value of the checksum. The reset method can be used to reset the checksum to its initial value.
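Because both classes implement Checksum, the same code can drive either algorithm. A minimal sketch follows; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

class ChecksumSketch {
    // Computes a checksum over a byte array with whichever algorithm is passed in.
    static long checksum(Checksum algorithm, byte[] data) {
        algorithm.reset();                        // start from the initial value
        algorithm.update(data, 0, data.length);
        return algorithm.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        // CRC-32 of "123456789" is the standard check value cbf43926
        System.out.println("CRC32:   " + Long.toHexString(checksum(new CRC32(), data)));
        System.out.println("Adler32: " + Long.toHexString(checksum(new Adler32(), data)));
    }
}
```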
Checksums can be used to detect corrupted files or messages. For example, suppose you want to create a ZIP file then transfer it to a remote machine. Once it is at the remote machine, using the checksum you can check whether the file got corrupted during the transmission. To demonstrate how to create checksums, we modify Code Sample 1 and Code Sample 3 to use CheckedInputStream and CheckedOutputStream as shown in Code Sample 4 and Code Sample 5.
Code Sample 4: Zip.java
import java.io.*;
import java.util.zip.*;

public class Zip {
    static final int BUFFER = 2048;

    public static void main (String argv[]) {
        try {
            BufferedInputStream origin = null;
            FileOutputStream dest = new FileOutputStream("c:\\zip\\myfigs.zip");
            // every byte written to the file also updates the Adler32 checksum
            CheckedOutputStream checksum = new CheckedOutputStream(dest, new Adler32());
            ZipOutputStream out = new ZipOutputStream(new BufferedOutputStream(checksum));
            //out.setMethod(ZipOutputStream.DEFLATED);
            byte data[] = new byte[BUFFER];
            // get a list of files from current directory
            File f = new File(".");
            String files[] = f.list();
            for (int i=0; i<files.length; i++) {
                System.out.println("Adding: "+files[i]);
                FileInputStream fi = new FileInputStream(files[i]);
                origin = new BufferedInputStream(fi, BUFFER);
                ZipEntry entry = new ZipEntry(files[i]);
                out.putNextEntry(entry);
                int count;
                while((count = origin.read(data, 0, BUFFER)) != -1) {
                    out.write(data, 0, count);
                }
                origin.close();
            }
            out.close();
            System.out.println("checksum: "+checksum.getChecksum().getValue());
        } catch(Exception e) {
            e.printStackTrace();
        }
    }
}
Code Sample 5: UnZip.java
import java.io.*;
import java.util.zip.*;

public class UnZip {
    public static void main (String argv[]) {
        try {
            final int BUFFER = 2048;
            BufferedOutputStream dest = null;
            FileInputStream fis = new FileInputStream(argv[0]);
            // every byte read from the file also updates the Adler32 checksum
            CheckedInputStream checksum = new CheckedInputStream(fis, new Adler32());
            ZipInputStream zis = new ZipInputStream(new BufferedInputStream(checksum));
            ZipEntry entry;
            while((entry = zis.getNextEntry()) != null) {
                System.out.println("Extracting: " +entry);
                int count;
                byte data[] = new byte[BUFFER];
                // write the files to the disk
                FileOutputStream fos = new FileOutputStream(entry.getName());
                dest = new BufferedOutputStream(fos, BUFFER);
                while ((count = zis.read(data, 0, BUFFER)) != -1) {
                    dest.write(data, 0, count);
                }
                dest.flush();
                dest.close();
            }
            zis.close();
            System.out.println("Checksum: "+checksum.getChecksum().getValue());
        } catch(Exception e) {
            e.printStackTrace();
        }
    }
}
To test Code Sample 4 and 5, compile the classes and then run the Zip class to create a ZIP archive (a checksum value will be calculated and printed on the screen for your information) and then run the UnZip class to decompress the archive (a checksum value will be printed on the console). The two values must be exactly the same, otherwise the file is corrupted. Checksums are very useful in validating data. For example, you can create a ZIP file and send it to your friend along with a checksum. Your friend unzips the file and compares the checksum with the one you provided. If they are the same, your friend knows that the file was not corrupted along the way. Note that a checksum such as Adler32 guards against accidental corruption only; it does not prove who created the file.
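The comparison only works if both programs pass exactly the same bytes through the checksum. A variation, sketched here as an assumption rather than as part of the samples above, is to checksum the finished archive file from its first byte to its last, so that the value does not depend on how much of the file the ZIP reader happens to consume. The class and method names are hypothetical, and try-with-resources assumes Java 7 or later.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.Adler32;
import java.util.zip.CheckedInputStream;

class FileChecksumSketch {
    // Reads the whole file through a CheckedInputStream and returns its Adler32 value.
    static long adler32Of(File file) throws IOException {
        try (CheckedInputStream in = new CheckedInputStream(
                new BufferedInputStream(new FileInputStream(file)), new Adler32())) {
            byte[] data = new byte[2048];
            while (in.read(data, 0, data.length) != -1) {
                // nothing to do: the stream updates the checksum as bytes pass through
            }
            return in.getChecksum().getValue();
        }
    }

    public static void main(String[] args) throws IOException {
        // usage: java FileChecksumSketch somefile.zip
        System.out.println("Checksum: " + adler32Of(new File(args[0])));
    }
}
```

Running this on the sender's copy and on the receiver's copy of the same archive gives two values that can be compared directly.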
Compressing Objects
We have seen how to compress data available in file form and add it to an archive. But what if the data you wish to compress is not available in a file? Assume for example, that you are transferring large objects over sockets. To improve the performance of your application, you may want to compress the objects before sending them across the network and uncompress them at the destination. As another example, let's say you want to save objects on the disk in compressed format. The ZIP format, which is record-based, is not really suitable for this job. The GZIP is more appropriate as it operates on a single stream of data.
Now, let's see an example of how to compress objects before writing them on disk and how to decompress them after reading them from the disk. Code Sample 6 is a simple class that implements the Serializable interface to signal the JVM that we wish to serialize instances of this class.
Code Sample 6: Employee.java
import java.io.*;

public class Employee implements Serializable {
    String name;
    int age;
    int salary;

    public Employee(String name, int age, int salary) {
        this.name = name;
        this.age = age;
        this.salary = salary;
    }

    public void print() {
        System.out.println("Record for: "+name);
        System.out.println("Name: "+name);
        System.out.println("Age: "+age);
        System.out.println("Salary: "+salary);
    }
}
Now, write another class that creates a couple of objects from the Employee class. Code Sample 7 creates two objects (sarah and sam) of the Employee class, then saves their state in a file in a compressed format.
Code Sample 7: SaveEmployee.java
import java.io.*;
import java.util.zip.*;

public class SaveEmployee {
    public static void main(String argv[]) throws Exception {
        // create some objects
        Employee sarah = new Employee("S. Jordan", 28, 56000);
        Employee sam = new Employee("S. McDonald", 29, 58000);
        // serialize the objects sarah and sam
        FileOutputStream fos = new FileOutputStream("db");
        GZIPOutputStream gz = new GZIPOutputStream(fos);
        ObjectOutputStream oos = new ObjectOutputStream(gz);
        oos.writeObject(sarah);
        oos.writeObject(sam);
        oos.flush();
        oos.close();
        fos.close();
    }
}
Now, the ReadEmployee class shown in Code Sample 8 is used to reconstruct the state of the two objects. Once the state has been constructed the print method is invoked on them.
Code Sample 8: ReadEmployee.java
import java.io.*;
import java.util.zip.*;

public class ReadEmployee {
    public static void main(String argv[]) throws Exception {
        //deserialize objects sarah and sam
        FileInputStream fis = new FileInputStream("db");
        GZIPInputStream gs = new GZIPInputStream(fis);
        ObjectInputStream ois = new ObjectInputStream(gs);
        Employee sarah = (Employee) ois.readObject();
        Employee sam = (Employee) ois.readObject();
        //print the records after reconstruction of state
        sarah.print();
        sam.print();
        ois.close();
        fis.close();
    }
}
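To get a feel for how much space this saves, the same stream chain can be pointed at a byte array instead of a file. The sketch below serializes a list of strings with and without GZIP and compares the sizes. The class name, the helper methods, and the sample data are made up for this illustration; actual savings depend on the objects being written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipObjectSketch {
    // Serializes an object to bytes, optionally through a GZIPOutputStream.
    static byte[] serialize(Serializable obj, boolean compress) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        OutputStream target = compress ? new GZIPOutputStream(bytes) : bytes;
        try (ObjectOutputStream oos = new ObjectOutputStream(target)) {
            oos.writeObject(obj);
        } // closing the object stream also finishes the GZIP stream
        return bytes.toByteArray();
    }

    // Rebuilds an object from bytes produced by serialize(obj, true).
    static Object deserializeCompressed(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<String> records = new ArrayList<String>();
        for (int i = 0; i < 1000; i++) {
            records.add("employee record number " + i);
        }
        byte[] plain = serialize(records, false);
        byte[] packed = serialize(records, true);
        System.out.println("plain: " + plain.length + " bytes, gzip: " + packed.length + " bytes");
        System.out.println("round trip ok: " + records.equals(deserializeCompressed(packed)));
    }
}
```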
The same idea can be used to compress large objects that are sent over sockets. The following segment of code shows how to write objects in a compressed format, from the server to the client:
// write to client
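GZIPOutputStream gzipout = new GZIPOutputStream(socket.getOutputStream());
ObjectOutputStream oos = new ObjectOutputStream(gzipout);
oos.writeObject(obj);
oos.flush();      // push buffered object data into the GZIP stream
gzipout.finish(); // write the remaining compressed data and the GZIP trailer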
And, the following segment of code shows how to decompress the objects at the client side once received from the server:
// read from server
Socket socket = new Socket(remoteServerIP, PORT);
GZIPInputStream gzipin = new
GZIPInputStream(socket.getInputStream());
ObjectInputStream ois = new ObjectInputStream(gzipin);
Object o = ois.readObject();
What about JAR Files?
The Java ARchive (JAR) format is based on the standard ZIP file format with an optional manifest file. If you wish to create JAR files or extract files from a JAR file from within your Java applications, use the java.util.jar package, which provides classes for reading and writing JAR files. Using the classes provided by the java.util.jar package is very similar to using the classes provided by the java.util.zip package as described in this article. Therefore, you should be able to adapt much of the code in this article if you wish to use the java.util.jar package.
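As a rough illustration of how close the two packages are, the sketch below builds a small JAR in memory with JarOutputStream and reads it back with JarInputStream. The class name, the entry name, and the manifest contents are made up for this example, and try-with-resources assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

class JarSketch {
    // Builds a JAR with a minimal manifest and one entry, and returns its bytes.
    static byte[] createJar() throws IOException {
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream out = new JarOutputStream(bytes, manifest)) {
            out.putNextEntry(new JarEntry("hello.txt"));
            out.write("hello".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Lists the entry names as returned by getNextJarEntry.
    static List<String> listEntries(byte[] jar) throws IOException {
        List<String> names = new ArrayList<String>();
        try (JarInputStream in = new JarInputStream(new ByteArrayInputStream(jar))) {
            JarEntry entry;
            while ((entry = in.getNextJarEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Returns the Manifest-Version attribute, or null if there is no manifest.
    static String manifestVersion(byte[] jar) throws IOException {
        try (JarInputStream in = new JarInputStream(new ByteArrayInputStream(jar))) {
            Manifest manifest = in.getManifest();
            return manifest == null ? null
                : manifest.getMainAttributes().getValue(Attributes.Name.MANIFEST_VERSION);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] jar = createJar();
        System.out.println("entries: " + listEntries(jar));
        System.out.println("Manifest-Version: " + manifestVersion(jar));
    }
}
```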
Conclusion
This article discussed the APIs that you can use to compress and decompress data from within your applications, with code samples throughout the article to show how to use the java.util.zip package to compress and decompress data. Now you have the tools to utilize data compression and decompression in your applications.
The article also shows how to compress and decompress data on the fly in order to reduce network traffic and improve the performance of your client/server applications. Compressing data on the fly, however, improves the performance of client/server applications only when the objects being compressed are more than a couple of hundred bytes. You would not be able to observe improvement in performance if the objects being compressed and transferred are simple String objects, for example.
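The effect on small payloads is easy to observe, because the GZIP format adds a fixed header and trailer to whatever it compresses. The following sketch compares a five-byte string with a long repetitive buffer; the class name and sample data are made up for this illustration, and try-with-resources assumes Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPOutputStream;

class GzipOverheadSketch {
    // Returns the number of bytes produced by gzipping 'data'.
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(data, 0, data.length);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        byte[] small = "hello".getBytes(StandardCharsets.US_ASCII);
        byte[] large = new byte[10000];
        Arrays.fill(large, (byte) 'a');
        // the small payload grows, the large repetitive one shrinks
        System.out.println(small.length + " bytes -> " + gzipSize(small));
        System.out.println(large.length + " bytes -> " + gzipSize(large));
    }
}
```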
For more information
Transporting Objects over Sockets
About the Author
Qusay H. Mahmoud provides Java consulting and training services. He has published dozens of articles on Java, and is the author of Distributed Programming with Java (Manning Publications, 1999) and Learning Wireless Java (O'Reilly, 2002).