Example 5-6. Example XML output from CSV parser
<?xml version="1.0" encoding="UTF-8"?>
<csvFile>
  <line>
    <value>Burke</value>
    <value>Eric</value>
    <value>M</value>
  </line>
  <line>
    <value>Burke</value>
    <value>Jennifer</value>
    <value>L</value>
  </line>
  <line>
    <value>Burke</value>
    <value>Aidan</value>
    <value>G</value>
  </line>
</csvFile>
One enhancement would be to design the CSV parser so that it accepts
a list of meaningful column names as parameters and uses them as
element names in the generated XML. Another option would be to write
an XSLT stylesheet that transforms this initial output into another
form of XML that uses meaningful column names. To keep the code
example manageable, these features were omitted from this
implementation. The CSV file format does, however, have some
complexities that must be considered. For example, fields that contain
commas must be surrounded with quotes:
"Consultant,Author,Teacher",Burke,Eric,M
Teacher,Burke,Jennifer,L
None,Burke,Aidan,G
To further complicate matters, fields may also contain quote
characters ("). In this case, each quote is doubled, much in the same
way you use a double backslash (\\) in Java to represent a single
backslash. In the following example, the first column contains one
quote character, so the entire field is quoted and the embedded quote
is doubled:
"test""quote",Teacher,Burke,Jennifer,L
This would be interpreted as:
test"quote,Teacher,Burke,Jennifer,L
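The quoting rules above can be sketched as a small state machine. This is an illustrative alternative, not the parser developed in this chapter: the class name CsvLineSketch and the method splitLine are hypothetical, and unlike CSVXMLReader this sketch does not trim whitespace around commas.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, shown only to illustrate the CSV quoting rules.
final class CsvLineSketch {

    // Splits one CSV line into fields, honoring quoted fields and doubled quotes.
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        // a doubled quote inside a quoted field is one literal quote
                        current.append('"');
                        i++;
                    } else {
                        // a lone quote closes the quoted field
                        inQuotes = false;
                    }
                } else {
                    // commas inside quotes are ordinary characters
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                // an unquoted comma ends the current field
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }
}
```

Calling CsvLineSketch.splitLine on the quoted line shown above yields five fields, the first of which is test"quote.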
Example 5-7. CSVXMLReader.java
package com.oreilly.javaxslt.util;

import java.io.*;
import java.net.URL;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

/**
 * A utility class that parses a Comma Separated Values (CSV) file
 * and outputs its contents using SAX2 events. The format of CSV that
 * this class reads is identical to the export format for Microsoft
 * Excel. For simple values, the CSV file may look like this:
 * <pre>
 * a,b,c
 * d,e,f
 * </pre>
 * Quotes are used as delimiters when the values contain commas:
 * <pre>
 * a,"b,c",d
 * e,"f,g","h,i"
 * </pre>
 * Quotes are doubled when the values themselves contain quotes. This
 * parser also trims spaces around commas.
 *
 * @author Eric M. Burke
 */
public class CSVXMLReader extends AbstractXMLReader {
    // an empty attribute for use with SAX
    private static final Attributes EMPTY_ATTR = new AttributesImpl( );

    /**
     * Parse a CSV file. SAX events are delivered to the ContentHandler
     * that was registered via <code>setContentHandler</code>.
     *
     * @param input the comma separated values file to parse.
     */
    public void parse(InputSource input) throws IOException,
            SAXException {
        // if no handler is registered to receive events, don't bother
        // to parse the CSV file
        ContentHandler ch = getContentHandler( );
        if (ch == null) {
            return;
        }

        // convert the InputSource into a BufferedReader
        BufferedReader br = null;
        if (input.getCharacterStream( ) != null) {
            br = new BufferedReader(input.getCharacterStream( ));
        } else if (input.getByteStream( ) != null) {
            br = new BufferedReader(new InputStreamReader(
                    input.getByteStream( )));
        } else if (input.getSystemId( ) != null) {
            java.net.URL url = new URL(input.getSystemId( ));
            br = new BufferedReader(new InputStreamReader(url.openStream( )));
        } else {
            throw new SAXException("Invalid InputSource object");
        }

        ch.startDocument( );

        // emit <csvFile>
        ch.startElement("","","csvFile",EMPTY_ATTR);

        // read each line of the file until EOF is reached
        String curLine = null;
        while ((curLine = br.readLine( )) != null) {
            curLine = curLine.trim( );
            if (curLine.length( ) > 0) {
                // create the <line> element
                ch.startElement("","","line",EMPTY_ATTR);
                // output data from this line
                parseLine(curLine, ch);
                // close the </line> element
                ch.endElement("","","line");
            }
        }

        // emit </csvFile>
        ch.endElement("","","csvFile");
        ch.endDocument( );
    }

    // Break an individual line into tokens. This is a recursive function
    // that extracts the first token, then recursively parses the
    // remainder of the line.
    private void parseLine(String curLine, ContentHandler ch)
            throws IOException, SAXException {
        String firstToken = null;
        String remainderOfLine = null;
        int commaIndex = locateFirstDelimiter(curLine);
        if (commaIndex > -1) {
            firstToken = curLine.substring(0, commaIndex).trim( );
            remainderOfLine = curLine.substring(commaIndex+1).trim( );
        } else {
            // no commas, so the entire line is the token
            firstToken = curLine;
        }

        // remove redundant quotes
        firstToken = cleanupQuotes(firstToken);

        // emit the <value> element
        ch.startElement("","","value",EMPTY_ATTR);
        ch.characters(firstToken.toCharArray(), 0, firstToken.length( ));
        ch.endElement("","","value");

        // recursively process the remainder of the line
        if (remainderOfLine != null) {
            parseLine(remainderOfLine, ch);
        }
    }

    // locate the position of the comma, taking into account that
    // a quoted token may contain ignorable commas.
    private int locateFirstDelimiter(String curLine) {
        if (curLine.startsWith("\"")) {
            boolean inQuote = true;
            int numChars = curLine.length( );
            for (int i=1; i<numChars; i++) {
                char curChar = curLine.charAt(i);
                if (curChar == '"') {
                    inQuote = !inQuote;
                } else if (curChar == ',' && !inQuote) {
                    return i;
                }
            }
            return -1;
        } else {
            return curLine.indexOf(',');
        }
    }

    // remove quotes around a token, as well as pairs of quotes
    // within a token.
    private String cleanupQuotes(String token) {
        StringBuffer buf = new StringBuffer( );
        int length = token.length( );
        int curIndex = 0;

        // skip the enclosing quotes, if present
        if (token.startsWith("\"") && token.endsWith("\"")) {
            curIndex = 1;
            length--;
        }

        boolean oneQuoteFound = false;
        boolean twoQuotesFound = false;

        while (curIndex < length) {
            char curChar = token.charAt(curIndex);
            if (curChar == '"') {
                // a quote that directly follows another quote completes a pair
                twoQuotesFound = oneQuoteFound;
                oneQuoteFound = true;
            } else {
                oneQuoteFound = false;
                twoQuotesFound = false;
            }

            // the second quote of a pair is dropped from the output
            if (twoQuotesFound) {
                twoQuotesFound = false;
                oneQuoteFound = false;
                curIndex++;
                continue;
            }

            buf.append(curChar);
            curIndex++;
        }

        return buf.toString( );
    }
}
CSVXMLReader is a subclass of
AbstractXMLReader, so it must provide an
implementation of the abstract parse method:
public void parse(InputSource input) throws IOException,
        SAXException {
    // if no handler is registered to receive events, don't bother
    // to parse the CSV file
    ContentHandler ch = getContentHandler( );
    if (ch == null) {
        return;
    }
The first thing this method does is check for the existence of a SAX
ContentHandler. The base class,
AbstractXMLReader, provides access to this object,
which is responsible for listening to the SAX events. In our example,
an instance of JAXP's TransformerHandler is
used as the SAX ContentHandler implementation. If
this handler is not registered, our parse method
simply returns because nobody is registered to listen to the events.
In a real SAX parser, the XML would be parsed anyway, which provides
an opportunity to check for errors in the XML data. Choosing to
return immediately was merely a performance optimization selected for
this class.
The SAX InputSource parameter allows our custom
parser to locate the CSV file. Since an
InputSource has many options for reading its data,
parsers must check each potential source in the order shown here:
// convert the InputSource into a BufferedReader
BufferedReader br = null;
if (input.getCharacterStream( ) != null) {
    br = new BufferedReader(input.getCharacterStream( ));
} else if (input.getByteStream( ) != null) {
    br = new BufferedReader(new InputStreamReader(
            input.getByteStream( )));
} else if (input.getSystemId( ) != null) {
    java.net.URL url = new URL(input.getSystemId( ));
    br = new BufferedReader(new InputStreamReader(url.openStream( )));
} else {
    throw new SAXException("Invalid InputSource object");
}
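The lookup order described above can also be sketched as a standalone helper. This is a hedged variation rather than the code used in CSVXMLReader: the class InputSourceSketch and the method openReader are hypothetical names, the sketch honors InputSource.getEncoding() when one is set, and the UTF-8 fallback is an assumption made for this example (the listing relies on the platform default encoding). It also assumes that the system ID is an absolute URI.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper illustrating the character stream, byte stream,
// system ID lookup order.
final class InputSourceSketch {

    static BufferedReader openReader(InputSource input)
            throws IOException, SAXException {
        // 1. a character stream already carries decoded text
        if (input.getCharacterStream() != null) {
            return new BufferedReader(input.getCharacterStream());
        }

        // assumption: fall back to UTF-8 when no encoding was declared
        Charset charset = (input.getEncoding() != null)
                ? Charset.forName(input.getEncoding())
                : StandardCharsets.UTF_8;

        // 2. a byte stream must be decoded
        InputStream bytes = input.getByteStream();

        // 3. otherwise open the system ID (assumed to be an absolute URI)
        if (bytes == null && input.getSystemId() != null) {
            bytes = URI.create(input.getSystemId()).toURL().openStream();
        }

        if (bytes == null) {
            throw new SAXException("Invalid InputSource object");
        }
        return new BufferedReader(new InputStreamReader(bytes, charset));
    }
}
```

Passing an InputSource that wraps a StringReader returns a reader over that text directly, while an empty InputSource is rejected with a SAXException, mirroring the final else branch above.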
Assuming that our InputSource was valid, we can
now begin parsing the CSV file and emitting SAX events. The first
step is to notify the ContentHandler that a new
document has begun:
ch.startDocument( );
// emit <csvFile>
ch.startElement("","","csvFile",EMPTY_ATTR);
The XSLT processor interprets this to mean the following:
<?xml version="1.0" encoding="UTF-8"?>
<csvFile>
Our parser ignores many SAX2 features, most notably XML namespaces.
This is why the first two arguments to the startElement and
endElement methods of ContentHandler (the namespace URI and the
local name) are empty strings; only the third argument, the qualified
name, carries the element name. The EMPTY_ATTR constant indicates
that this XML element does not have any attributes.
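To see how such events turn into XML text, the following sketch fires the same kind of calls at a JAXP TransformerHandler that performs an identity transformation and serializes to a string. It is a minimal illustration under stated assumptions: the class SaxEventDemo and the method emitSample are hypothetical, the sample values are hardcoded rather than read from a CSV file, and the cast assumes that the default TransformerFactory is a SAXTransformerFactory, as it is in the JDK's built-in implementation.

```java
import java.io.StringWriter;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo: hand-written SAX events serialized by an identity transformer.
final class SaxEventDemo {
    private static final Attributes EMPTY_ATTR = new AttributesImpl();

    static String emitSample()
            throws SAXException, TransformerConfigurationException {
        // assumption: the default factory supports SAX (true for the JDK's own)
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        handler.startDocument();
        // namespace URI and local name are empty; the qualified name is used
        handler.startElement("", "", "csvFile", EMPTY_ATTR);
        handler.startElement("", "", "line", EMPTY_ATTR);
        for (String value : new String[] {"Burke", "Eric", "M"}) {
            handler.startElement("", "", "value", EMPTY_ATTR);
            handler.characters(value.toCharArray(), 0, value.length());
            handler.endElement("", "", "value");
        }
        handler.endElement("", "", "line");
        handler.endElement("", "", "csvFile");
        handler.endDocument();

        return out.toString();
    }
}
```

The returned string contains a csvFile element wrapping a single line element with three value children. The exact XML declaration and whitespace depend on the transformer implementation in use.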
The CSV file itself is very straightforward, so we merely loop over
every line in the file, emitting SAX events as we read each line. The
parseLine method is a private helper method that
does the actual CSV parsing:
// read each line of the file until EOF is reached
String curLine = null;
while ((curLine = br.readLine( )) != null) {
    curLine = curLine.trim( );
    if (curLine.length( ) > 0) {
        // create the <line> element
        ch.startElement("","","line",EMPTY_ATTR);
        parseLine(curLine, ch);
        ch.endElement("","","line");
    }
}
And finally, we must indicate that the parsing is complete:
// emit </csvFile>
ch.endElement("","","csvFile");
ch.endDocument( );
The remaining methods in CSVXMLReader are not
discussed in detail here because they handle only mundane parsing
tasks: breaking each line of the CSV file into tokens, locating
commas, and cleaning up quotes. One thing worth noting is
the code that emits text, such as the following:
<value>Some Text Here</value>
SAX parsers use the characters method on
ContentHandler to represent text, which has this
signature:
public void characters(char[] ch, int start, int length)
Although this method could have been designed to take a
String, using an array allows SAX parsers to
preallocate a large character array and then reuse that buffer
repeatedly. This is why an implementation of
ContentHandler cannot simply assume that the
entire ch array contains meaningful data. Instead,
it must read only the specified number of characters beginning at the
start position.
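On the receiving side, this contract might be honored as in the following sketch. The class TextCollectingHandler is a hypothetical example, not part of the chapter's code; it extends the standard DefaultHandler and copies only the indicated slice of the buffer.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that accumulates character data correctly.
final class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // copy only the slice [start, start + length); the rest of the
        // array may hold stale data from earlier events
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }
}
```

If a parser passes a buffer holding xxHelloyy with a start of 2 and a length of 5, getText() returns Hello. A handler that converted the whole array to a String would incorrectly include the surrounding characters.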
Our parser uses a relatively straightforward approach, simply
converting a String to a character array and
passing that as a parameter to the characters
method:
// emit the <value>text</value> element
ch.startElement("","","value",EMPTY_ATTR);
ch.characters(firstToken.toCharArray(), 0, firstToken.length( ));
ch.endElement("","","value");