Scanning Text With java.util.Scanner

J2SE 5.0 adds classes and methods that can make every day tasks easier to perform. In this tip you will see how the newly added java.util.Scanner class makes it easier to read and parse strings and primitive types using regular expressions.

Before the J2SE 5.0 release, you probably wrote code such as the following TextReader class to read text from a file:

import java.io.BufferedReader;

import java.io.FileReader;

import java.io.IOException;

import java.io.File;

public class TextReader {

private static void readFile(String fileName) {

try {

File file = new File(fileName);

FileReader reader = new FileReader(file);

BufferedReader in = new BufferedReader(reader);

String string;

while ((string = in.readLine()) != null) {

System.out.println(string);

}

in.close();

} catch (IOException e) {

e.printStackTrace();

}

public static void main(String[] args) {

if (args.length != 1) {

System.err.println("usage: java TextReader " + "file location");

System.exit(0);

}

readFile(args[0]);

}

The basic approach in classes like this is to create a File object that corresponds to the actual file on the hard drive. The class then creates a FileReader associated with the file and then a BufferedReader from the FileReader. It then uses the BufferedFile reader to read the file one line at a time.

To view the TextReader class in action, you need to create a document for the class to read and parse. To create the document, save the following two lines of text in a file named TextSample.txt in the same directory as TextReader:

Here is a small text file that you will

use to test java.util.scanner.

Compile TextReader. Then run it by entering the following:

java TextReader TextSample.txt

You should see the original file echoed back to you in standard output.

You can simplify the code in TextReader by using java.util.Scanner, a class that parses primitive types and strings:

import java.io.File;

import java.io.FileNotFoundException;

import java.util.Scanner;

public class TextScanner {

private static void readFile(String fileName) {

try {

File file = new File(fileName);

Scanner scanner = new Scanner(file);

while (scanner.hasNext()) {

System.out.println(scanner.next());

}

scanner.close();

} catch (FileNotFoundException e) {

e.printStackTrace();

}

public static void main(String[] args) {

if (args.length != 1) {

System.err.println("usage: java TextScanner1" + "file location");

System.exit(0);

}

readFile(args[0]);

}

Compile TextScanner. Then run it as follows:

java TextScanner TextSample.txt

You should get the following output:

Here

small

text

file

that

you

will

use

test

java.util.scanner.

TextScanner creates a Scanner object from the File. The Scanner breaks the contents of the File into tokens using a delimiter pattern, By default the delimiter pattern is whitespace. TextScanner then calls the hasNext() method in Scanner. This method returns true if another token exists in the Scanner's input, which is the case until it reaches the end of the file. The next() method returns a String that represents the next token. So until it reaches the end of the file, TextScanner prints the String returned by next() on a separate line.

You can change the delimeter that is used to tokenize the input, through the useDelimiter method of Scanner. You can pass in a String or a java.util.regex.Pattern to the method. See the JavaDocs page for Pattern for information on what patterns are appropriate. For example, you can read the input one line at a time by using the newline character (\n) as a delimiter. Here is the revised readFile() method for TextScanner that uses a newline character as the delimiter:

private static void readFile(String fileName) {

try {

Scanner scanner = new Scanner(new File(fileName));

scanner.useDelimiter(System.getProperty("line.separator"));

while (scanner.hasNext())

System.out.println(scanner.next());

scanner.close();

} catch (FileNotFoundException e) {

e.printStackTrace();

}

Note that there are other options for detecting the end of a line. You could, for example, test for lines that end with a newline character or that end with a carriage return and a newline character. You can do that using the regular expression "\r\n|\n". The JavaDocs for java.util.regex.Pattern shows other possible line terminators, so a more complete check would use the expression "\r\n|[\r\n\u2028\u2029\u0085]". You can also use the hasNextLine() and nextLine() methods from the Scanner class. In any case, with the revised TextScanner, the output should match the contents and layout of TextSample.txt. In other words, you should see the following:

Here is a small text file that you will

use to test java.util.scanner.

A simple change of the pattern in the delimiter used by the Scanner gives you a great deal of power and flexibility. For example, if you specify the following delimiter:

scanner.useDelimiter("\\z");

it reads in the entire file at once. This is similar to the trick suggested by Pat Niemeyer in his java.net blog. You can read in the entire contents of a web page without creating several intermediate objects. The code for the following class, WebPageScanner, reads in the current contents of the java.net homepage:

import java.net.URL;

import java.net.URLConnection;

import java.io.IOException;

import java.util.Scanner;

public class WebPageScanner {

public static void main(String[] args) {

try {

URLConnection connection = new URL("http://java.net").openConnection();

String text = new Scanner(connection.getInputStream()).useDelimiter("\\Z").next();

} catch (IOException e) {

e.printStackTrace();

}

You can handle more than Strings with the Scanner class. You can also use Scanner to parse data that consists of primitives. To illustrate this, save the following three lines in a file named Employee.data (in the same directory as TextSample):

Joe, 38, true

Kay, 27, true

Lou, 33, false

You could still treat this as one large String and perform the conversions after parsing the String. Instead, you can parse this file in two steps. This is illustrated in the following class, DataScanner:

import java.util.Scanner;

import java.io.File;

import java.io.FileNotFoundException;

public class DataScanner {

private static void readFile(String fileName) {

try {

Scanner scanner = new Scanner(new File(fileName));

scanner.useDelimiter(System.getProperty("line.separator"));

while (scanner.hasNext()) {

parseLine(scanner.next());

}

scanner.close();

} catch (FileNotFoundException e) {

e.printStackTrace();

}

private static void parseLine(String line) {

Scanner lineScanner = new Scanner(line);

lineScanner.useDelimiter("\\s*,\\s*");

String name = lineScanner.next();

int age = lineScanner.nextInt();

boolean isCertified = lineScanner.nextBoolean();

System.out.println("It is " + isCertified + " that " + name + ", age " + age + ", is certified.");

}

public static void main(String[] args) {

if (args.length != 1) {

System.err.println("usage: java TextScanner2" + "file location");

System.exit(0);

}

readFile(args[0]);

}

The outer Scanner object in DataScanner reads a file, one line at a time. The readFile() method passes each line to a second scanner. The second scanner parses the comma delimited data and discards the whitespace on either side of the comma. There are variants of the hasNext() and next() methods which enable you to test whether or not the next token is of a specified type and to attempt to treat the next token as an instance of that type. For example, nextBoolean() attempts to treat the next token as a boolean and tries to match it to either the String "true" or the String "false". If the match cannot be made, a java.util.InputMismatchException is thrown. The parseLine() method of DataScanner shows how each line is parsed into a String, an int, and a boolean.

Compile DataScanner. Then run it as follows:

java DataScanner Employee.data

You should get the following output:

It is true that Joe, age 38, is certified.

It is true that Kay, age 27, is certified.

It is false that Lou, age 33, is certified.

You might be tempted to use just the comma as a delimiter. In other words you might try this:

lineScanner.useDelimiter(",");

This will result in an InputMismatchException. That's because an extra space will be included in the token that you are trying to convert to a boolean, and the space does not match either "true" or "false". As is the case with all applications of regular expressions, the underlying power requires that you take extra care in constructing your patterns.

For more information on Scanner, see the formal documentation.