
Stream Parsing with REXML

James Britt, December 15, 2001

Introduction

There are two dominant ways of working with XML documents: as a tree (e.g., the W3C XML DOM) and as an event stream (e.g., SAX). There are various arguments for and against each of these, depending on what it is you need to do. The DOM is basically a good choice when dealing with document data, such as what might come from a word processor. SAX is better suited for handling highly-structured data, such as rows from a database. There are, of course, forms of XML that fall somewhere in between. Often the choice is not obvious, and extraneous details may be influential. (Kendall Grant Clark has written a very interesting article on the choice between SAX and DOM, at http://www.xml.com/pub/a/2001/11/14/dom-sax.html)

When transforming XML into, say, HTML, the typical approach is to read the source XML into a DOM, then apply an XSLT style sheet. However, there may be reasons to want an event-based approach to this. One is that the amount of data you need to transform is so large that it could not be held in memory. Or, you may be writing code that will be maintained by someone else, and you want another alternative to having them learn XSLT. Another reason may be that no suitable XSLT processor is available for your language of choice. In my own case, it was that last reason that prompted me to explore event streams for XML transformation.

What I will describe here is how to use REXML’s stream parser to create HTML based on some regularly-structured XML. In a sense it’s like a poor-man’s XSLT. Nonetheless, it does the job, and provides a good example of event-based programming.

The Problem Space

The rubyxml.com home page shows three columns of data: XML-related RAA entries, Ruby-specific XML news, and general XML news. The data for each of them come from XML files (though this may change for the Ruby XML news items). The XML has a periodic structure: there’s a root element, possibly some header stuff, then some number of identically-structured child elements. For example, the RAA data look like this:

<raa>

<package>

<product_download>http://www.jin.gr.jp/~nahi/Ruby/SOAP4R/soap4r-1.3.8.tar.gz</product_download>

<product_status>usable</product_status>

<product_version>1.3.8</product_version>

<product_license>GPL</product_license>

<product_name>SOAP4R</product_name>

<product_homepage> http://www.jin.gr.jp/~nahi/Ruby/SOAP4R/wiki.cgi?cmd=view;name=top

</product_homepage>

<product_description>"SOAP4R" is a Ruby library program to handle Simple Object Access Protocol (SOAP) 1.1 (W3C Note).

For more details, see

http://www.jin.gr.jp/~nahi/Ruby/SOAP4R/RELEASE.en.html

or http://www.jin.gr.jp/~nahi/Ruby/SOAP4R/RELEASE.ja.html

</product_description>

<owner_email></owner_email>

<owner_name>NaHi</owner_name>

<owner_id>-NaHi</owner_id>

<category_major>Library</category_major>

<category_minor>XML</category_minor>

<update>2001-10-04</update>

</package>

<package>

<product_download>http://www.chadfowler.com/ruby/rss/rss-ruby-0.9.1.tar.gz</product_download>

<product_status>usable/needs error handling</product_status>

<product_version>0.9.1</product_version>

<product_license>Ruby's</product_license>

<product_name>Ruby/RSS</product_name>

<product_homepage>

http://www.chadfowler.com/index.cgi?mode=max&amp;cat=ruby

</product_homepage>

<product_description>

An object oriented Ruby library for parsing, creating, downloading, and

caching RSS (http://my.netscape.com/publish/help/mnn20/quickstart.html.

</product_description>

<owner_email>cfowler@chadfowler.com</owner_email>

<owner_name>Chad Fowler</owner_name>

<owner_id>cfowler@chadfowler.com-Chad Fowler</owner_id>

<category_major>Library</category_major>

<category_minor>XML</category_minor>

<update>2001-02-18</update>

</package>

<!-- More packages ... -->

</raa>

The data come from an XML-RPC call; I have some code that makes the call and writes out the results as an XML file. The markup specifics aren’t important; the main thing is that this sort of XML is uniformly structured, and as such is a good candidate for SAX (the Simple API for XML). A SAX parser reads an XML source and raises an event each time it encounters something of interest. "Something of interest" means, for example, the start and end of elements, or the presence of a processing instruction.

There is no formal SAX specification. SAX evolved from discussions within the XML developer community on the xml-dev mailing list. It was first written in Java, and it is the implementation that serves as the spec. Nonetheless, SAX parsers have been written for other programming languages, such as Python, Perl, and Visual Basic. These versions provide essentially the same classes and methods, with allowances for various language quirks.

There is currently no SAX parser for Ruby, but REXML does include a stream parser whose API is suitably SAX-like. Like a SAX parser, it allows us to read an XML source, and have methods called at key events.

Why SAX?

If you’ve spent much time working with XML, you may be thinking that this problem could be solved using XSLT. A simple template could grab the data and emit the HTML needed. So why not? Earlier I said that my event-driven template tool was like a poor-man’s XSLT. In fact, this approach can outshine XSLT when the amount of XML to process is large. Imagine that, instead of creating some simple HTML from a relatively small file, we needed to transform several megabytes of data into CSV. While the XSLT style sheet would perhaps be quite simple, the application would need sufficient memory to hold the source data and the resulting source tree. This may not be practical.

Stream-parsing XML is very useful in another situation. Imagine you are going to receive this multi-meg data file, and have reason to believe that, out of the bazillion elements, at least one of them will contain something that breaks the XML well-formed criteria. A proper XML parser is obligated to stop processing at the first such error. This would be a catastrophe in a file this large; fixing all errors and re-parsing the entire file just won’t do.

However, we can run the file through a stream parser and construct temporary in-memory XML documents from the passing data. The application can check that each subsection is well-formed before passing it on. If any errors are encountered, the faulty text can be written to a log file, and the parsing can resume. In this way, you can get all of the good data, while isolating the bad.
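A minimal sketch of the checking step, assuming the subsections have already been cut out of the stream as separate strings (the splitting itself is not shown). The helper name partition_records is made up for this example; the sketch relies only on REXML::Document.new raising REXML::ParseException when given markup that is not well-formed.

```ruby
require "rexml/document"

# Hypothetical helper: parse each candidate record on its own, keeping the
# well-formed ones and setting aside the ones REXML rejects.
def partition_records(records)
  good = []
  bad = []
  records.each do |record|
    begin
      good << REXML::Document.new(record)
    rescue REXML::ParseException => e
      # A real application would write this to a log file instead.
      bad << { text: record, error: e.message }
    end
  end
  [good, bad]
end

records = [
  "<package><product_name>SOAP4R</product_name></package>",
  "<package><product_name>Broken</package>", # missing end tag
  "<package><product_name>Ruby/RSS</product_name></package>"
]

good, bad = partition_records(records)
puts "well-formed: #{good.size}, rejected: #{bad.size}"
```

Because each record is parsed in isolation, one faulty record costs only that record rather than the whole file.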

On a more immediate practical matter, Ruby does not have a suitable XSLT processor I can use. Nonetheless, I want a process that can read an XML source and transform it. If later I decide XSLT is a better option, I at least won’t need to change how I get my data, and the general process flow is the same, minimizing code changes.

What I came up with was to use an event stream to populate an HTML template. It would grab the RAA data and repeatedly "fill in the blanks" in the template. The final set of processed data would then be inserted into a main template. The process happens in two steps: parse the event stream into a temporary internal structure, then interpolate that structure into the templates. (The internal structure isn’t essential in the current version, but it could be useful if I later decide to do some post-parsing processing, such as sorting, on the acquired data.)

From Events to Responses

The REXML stream parser requires two objects: a Source and a StreamListener. A Source is pretty much what you might expect: it’s where the XML comes from. REXML has a Source class, and provides two ways to instantiate one. First, there is the SourceFactory::create_from method. It accepts either a String or an IO object. You would use it like this:

xmlfile = File.new "raa.xml"

src = SourceFactory.create_from xmlfile

Another way is to call new on the Source class:

src = Source.new "<doc>My XML string</doc>"

This only works with a String.

A StreamListener complements the Source: it is what will respond to the events triggered by the Source. There is no factory or base class for this. To create one you have to write a class that implements the REXML::StreamListener interface. A listener class must be able to handle all of the StreamListener methods. If you do not provide all of the method definitions then you must define the method_missing method.
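As a small illustration of the listener contract, here is a sketch written against a current REXML release, where REXML::StreamListener is a module of do-nothing callbacks that can be mixed in. That is an assumption about the installed version; the release described in this article may behave differently. The class name TagLogger is invented for the example.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener that simply records the events it receives.
class TagLogger
  include REXML::StreamListener # supplies empty defaults for unused events
  attr_reader :events

  def initialize
    @events = []
  end

  # Called at each opening tag.
  def tag_start(name, attrs)
    @events << "start:#{name}"
  end

  # Called at each closing tag.
  def tag_end(name)
    @events << "end:#{name}"
  end

  # Called for character data; whitespace-only text is skipped here.
  def text(text)
    @events << "text:#{text}" unless text.strip.empty?
  end
end

logger = TagLogger.new
REXML::Document.parse_stream("<doc><item>My XML string</item></doc>", logger)
puts logger.events
```

Mixing in the module means only the callbacks of interest need to be written, which sidesteps the method_missing fallback.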

The ItemBuilderListener

What I wrote was a simple StreamListener class with a few extra methods. It reads in an XML file with RAA information, building up a hash table for each RAA item. Each item hash then goes into an array. When the parsing is done, the data is interpolated into a set of templates: one template defines each item, the other template acts as a container.

class ItemBuilderListener

As the source XML is parsed, the class maintains internal state in the form of an array of hashes.

@itemHash = Hash.new()

@itemArray = Array.new()

Two flags are used as a simple state machine. The code will only store the XML when it is parsing item data contained by the designated root element. By default these flags are set to false.

@insideRoot = false

@insideItem = false

The class also holds the element names for the root and item elements.

@root = nil

@item = nil

If the class is parsing item data, each item child element will become a hash item in the current hash. However, that element may itself contain child elements.

For example, assume the item element to process is named "package". Then, each immediate child of "package" will become a hash item. Now, imagine the parser comes across this XML:

<package>

<name>Foo</name>

<date>01/01/2002</date>

<description>Look for <i>child</i> elements.</description>

</package>

When the stream parser encounters the package element, it examines the element name and sees that this is the item element. A new hash object is created; as each child element is encountered, hash entries are created (for name, date, and description). The textual content is added as the hash item entry:

name => Foo

date => 01/01/2002

However, as the parser moves through the content of the description element, it will encounter additional elements. We want to store this as part of the description hash entry, so the listener class watches to see how "far" it is from the item element. It stores this information in two variables. One tells the class how far (or deep) the current element is from the designated item element. The other variable tracks the current hash entry so that all subsequent data is appended to the right place.

@depthFromItem = 0

@currentProperty = ""

Finally, there are variables to store templates to be used when building the output.

@itemTmpl = ""

@rootTmpl = ""

Interpolation is simply a matter of replacing each instance of a template variable with the corresponding hash entry. Template variables follow the Ruby syntax for embedding variables in strings:

This is where a #{variable} goes

A method is defined to iterate through a hash, using the hash key as the base for the template variable it is substituting:

def interpolateItem iHash

res = String.new(@itemTmpl)

iHash.each{ |k,v|

var = "\#{" + k + "}"

res.gsub!(var, v.strip)

}

res

end

Another method is provided to emit the completed interpolation. It walks through the array of hashes, interpolating each one. The combined set is then swapped into the root template:

def get_interpolation

s = ""

@itemArray.each{ |h|

s << interpolateItem(h)

}

res = String.new(@rootTmpl)

var = "\#{" + @root + "}"

res.gsub!(var, s)

res

end
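The two interpolation steps can be tried outside the class. The following is a sketch, not the listener's own code: fill_template is a hypothetical stand-alone function, and the templates and data are invented. It uses the block form of gsub! so that a backslash in a value is inserted literally rather than being read as a back-reference.

```ruby
# Replace each #{key} placeholder in the template with the matching value.
def fill_template(template, values)
  result = template.dup
  values.each do |key, value|
    placeholder = "\#{" + key + "}"
    # Block form: the replacement text is used exactly as given.
    result.gsub!(placeholder) { value.strip }
  end
  result
end

item_tmpl = '<li>#{product_name} (#{product_version})</li>'
root_tmpl = '<ul>#{raa}</ul>'

items = [
  { "product_name" => "SOAP4R",   "product_version" => " 1.3.8 " },
  { "product_name" => "Ruby/RSS", "product_version" => "0.9.1" }
]

# First fill the item template once per hash, then drop the combined
# result into the root template.
body = items.map { |item| fill_template(item_tmpl, item) }.join
puts fill_template(root_tmpl, { "raa" => body })
# prints <ul><li>SOAP4R (1.3.8)</li><li>Ruby/RSS (0.9.1)</li></ul>
```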

As the stream parser accumulates data, it stores each completed item hash by adding it to the global array:

def pushItem iHash

newHash = iHash.dup

@itemArray.push newHash

interpolateItem iHash

end

A new ItemBuilderListener is created by calling new with the names of the root and item elements, plus the root and item templates. As a small nicety, the method allows File objects, rather than literal Strings, to be passed in for the templates.

def initialize root, item, rootTmpl, itemTmplHtml

@root = root

@item = item

@itemHash = Hash.new(nil)

@itemArray = Array.new()

if itemTmplHtml.kind_of? File

@itemTmpl = itemTmplHtml.readlines.join "\r"

elsif itemTmplHtml.kind_of? String

@itemTmpl = itemTmplHtml

else

raise "Bad item template argument!"

end

if rootTmpl.kind_of? File

@rootTmpl = rootTmpl.readlines.join "\r"

elsif rootTmpl.kind_of? String

@rootTmpl = rootTmpl

else

raise "Bad root template argument!"

end

end
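The File-or-String check appears twice in initialize, so it could be factored out. The sketch below shows one way to do that; load_template is a hypothetical helper, not part of the listener as written, and it tests for a read method rather than the File class so that other IO-like objects (a StringIO here) are accepted too.

```ruby
require "stringio"

# Hypothetical helper: return template text from a String or any readable IO.
def load_template(source)
  if source.respond_to?(:read)
    source.read
  elsif source.kind_of?(String)
    source
  else
    raise ArgumentError, "Bad template argument!"
  end
end

puts load_template('<div>#{raa}</div>')
puts load_template(StringIO.new('<div>#{raa}</div>'))
```

Both calls print the same template text, so the rest of the class never needs to know where a template came from.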

Now we come to the essential code. Whenever the stream parser encounters the beginning of an element, it will call tag_start, passing in the name of the element and an attributes object. The listener class examines this information, checking and perhaps altering its internal state as it acts on the data.

def tag_start name, attrs

if name == @root

@insideRoot = true

elsif name == @item

@depthFromItem = 0

@insideItem = true

If the name of the element matches the designated root, then the insideRoot flag is set to true. Likewise, if the element name matches the item name, then the class state changes to insideItem. Further, if this is an item element, then the depth from the item is, of course, zero.

elsif @insideItem

@depthFromItem += 1

If this is not the root nor the item element, then the class checks if parsing is already occurring inside an item element. If so, then the depth is incremented.

If item parsing is in progress, and this is an immediate child element (i.e., depth from item is one), then the class sets the current property to the current element name. It also clears out any leftover values from the corresponding hash item:

if @depthFromItem == 1

@currentProperty = name

@itemHash[@currentProperty] = ""

Finally, if item parsing is in progress, and this element is deeper than an immediate child, the start tag is appended to the current property hash item. If there are any attributes, they must be reconstructed as well before being added to the hash value.

elsif @depthFromItem > 1

@itemHash[@currentProperty] += "<#{name}"

attrs.each{|a|

@itemHash[@currentProperty] += " " + attr_to_s(a)

}

@itemHash[@currentProperty] += ">"

end

end

end

When a closing tag is encountered, a reverse process takes place. If this was the root element, then parsing of the root has ended, and the class state is updated.

def tag_end name

if name == @root

@insideRoot = false

Likewise, if this is the closing tag for an item element, then insideItem is set back to false, and the current hash table is added to the global array:

elsif name == @item

@insideItem = false

pushItem @itemHash

@itemHash.clear

On the other hand, if parsing is still happening inside an item, then the code checks the depth from the item element. If it’s greater than one, then the element was deeper than an immediate child, so a closing tag has to be constructed and added to the current property hash:

elsif @insideItem

if @depthFromItem > 1

@itemHash[@currentProperty] += "</#{name}>"

end

The depth is then decreased by one

@depthFromItem -= 1

end

end

When any text is encountered, the code must look at its current state to see if the content is to be added to the hash value of the current property:

def text text

if @insideItem && @depthFromItem > 0

@itemHash[@currentProperty] += text

end

end

Finally, a small helper method is used to ensure that attributes are correctly reconstructed from the attributes object passed into tag_start. The REXML attribute class provides its own method for this, but relying on it might put this application at risk of breaking, should that method change. This helper method ensures that any quote characters in the attribute’s value are encoded, and that single quotes are used:

def attr_to_s attr

val = attr[1].gsub(/"/, "&quot;").gsub(/'/, "&apos;")

attr[0] + "='#{val}'"

end

end

That ends the StreamListener class. It’s very task-specific, ignoring various markup such as processing instructions or comments. Still, it serves a very practical purpose, and handling the other parts of an XML document works basically the same way. So, with a listener, the next step is to get a Source to feed it.
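To watch the depth tracking at work on the package example from earlier, here is a condensed sketch of the same idea. It is not the ItemBuilderListener itself: the class name DepthSketchListener is invented, attributes on nested elements are ignored, the templates are left out, and it assumes a current REXML release where REXML::StreamListener can be mixed in.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical condensed listener: collects each item element into a hash,
# re-serializing any markup nested deeper than an immediate child.
class DepthSketchListener
  include REXML::StreamListener
  attr_reader :items

  def initialize(item_name)
    @item_name = item_name
    @items = []
    @current = {}
    @inside_item = false
    @depth = 0
    @property = nil
  end

  def tag_start(name, attrs)
    if name == @item_name
      @inside_item = true
      @depth = 0
    elsif @inside_item
      @depth += 1
      if @depth == 1
        @property = name                  # immediate child: new hash entry
        @current[name] = String.new
      else
        @current[@property] << "<#{name}>" # deeper: rebuild the start tag
      end
    end
  end

  def tag_end(name)
    if name == @item_name
      @inside_item = false
      @items << @current
      @current = {}
    elsif @inside_item
      @current[@property] << "</#{name}>" if @depth > 1
      @depth -= 1
    end
  end

  def text(text)
    @current[@property] << text if @inside_item && @depth > 0
  end
end

xml = "<raa><package><name>Foo</name><date>01/01/2002</date>" \
      "<description>Look for <i>child</i> elements.</description></package></raa>"
listener = DepthSketchListener.new("package")
REXML::Document.parse_stream(xml, listener)
puts listener.items.first["description"]
# prints Look for <i>child</i> elements.
```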

Picking a Source

Stream parsing is provided via a Document class method:

Document.parse_stream( source, listener )

We’ve seen the listener class; let’s see what the options are for a source. The parse_stream method requires an object that responds to the methods exposed by the Source class. You can create a Source object yourself, and pass it to the stream parser, or you can give the stream parser an object that can be converted to a Source object.

The most recent version of REXML allows one to call parse_stream passing in a Source, a String, or a File object. It looks at what it has been given, and will convert a String or a File into an appropriate Source class before moving on. This allows one to write more natural code:

# Assumes we already have a listener ...

xmlfile = File.open "raa.xml"

Document.parse_stream xmlfile, mylistener
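The following sketch feeds the same XML to parse_stream twice, once as a String and once as an IO object, to show that the listener sees the same events either way. A StringIO stands in for the File so the example needs no file on disk; TagCounter is a hypothetical listener, and the sketch assumes a current REXML release.

```ruby
require "rexml/document"
require "rexml/streamlistener"
require "stringio"

# Hypothetical listener that just counts start tags.
class TagCounter
  include REXML::StreamListener
  attr_reader :count

  def initialize
    @count = 0
  end

  def tag_start(name, attrs)
    @count += 1
  end
end

xml = "<raa><package><product_name>SOAP4R</product_name></package></raa>"

from_string = TagCounter.new
REXML::Document.parse_stream(xml, from_string)

from_io = TagCounter.new
REXML::Document.parse_stream(StringIO.new(xml), from_io)

puts "String source: #{from_string.count} start tags"
puts "IO source:     #{from_io.count} start tags"
```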

My current setup has the RAA data in a text file. To run it through a stream parser I would just need to create the File object, plus the listener. Let’s take a look at a Ruby program that uses the ItemBuilderListener class, and Document.parse_stream, to take the RAA data and convert it into some spiffy HTML.

We saw above that the listener class needs two templates to do its job. Here then are those templates. The first one is the root template (raaTmpl.html):

<div>

#{raa}

</div>

Pretty simple. It will just wrap a div element around the collection of RAA items. Now here’s the item template (itemTmpl.html):

<table class="rssitem_table" cellSpacing="1"

cellPadding="3" width="160" bgColor="#003366" border="0">

<tbody>

<tr>

<td class="rssitem_top" bgColor="#ffc66">

<a href="#{product_homepage}">#{product_name}</a><br>

Status: <i>#{product_status}</i><br>

Version: <i>#{product_version}</i><br>

Updated: <i>#{update}</i><br>

</font>

</td>

</tr>

<tr>

<td class="rssitem_bottom">

<div class="rssitem_bottom_font" >#{product_description}</div>

</td>

</tr>

</tbody>

</table>

It creates a table for each RAA entry, displaying a subset of the data available in the source XML.

We can now write the small app to put this all together:

$:.push(".")

require "rexml/document"

require "ItemBuilderListener"

begin

itemTmpl = File.new("itemTmpl.html")

mainTmpl = File.new("raaTmpl.html")

mylistener = ItemBuilderListener.new "raa", "package", mainTmpl, itemTmpl

xmlfile = File.new "raa3.xml"

begin

REXML::Document.parse_stream xmlfile, mylistener

rescue Exception

puts "Error: #{$!}\n"

end

print mylistener.get_interpolation

end

The code creates File objects for the templates, and passes them to the listener constructor, along with the names of the root and item elements (here “raa”, and “package”). Another File is used to provide an XML source (“raa3.xml”).

The listener and the source are then passed to the stream parser. The results are then emitted by calling get_interpolation.

Use the Source: Creating a Custom Source Class

Although parse_stream allows you to pass in a File or String in place of a Source object, it ultimately uses a Source object. But what is a Source? If you look at the code in source.rb, you’ll find that a Source class exposes these attributes and methods:

@buffer

Read-only attribute holding some part of the source XML.

@line

Read-only attribute indicating the current line number of the XML source.

initialize

Basis for the new method, takes one argument

scan

Like the scan method of the String class. Takes one or two arguments. The first is a RegExp pattern, the second is a Boolean telling the method to consume the source text already scanned. This defaults to false.

match

Like the =~ method of the String class. Takes one or two arguments. The first is a RegExp pattern, the second is a Boolean telling the method to consume the source text already matched. This defaults to false.

empty?

Method indicating if there is any more text to process.

current_line

The current line number being processed.

encoding

What encoding the XML uses (e.g. UTF-8, UTF-16)

utf8_enc

Modifies the character encoding.

It’s the job of a Source class to provide the stream parser with the means to pull more data from the source, and to examine the data for markup. This is incredibly handy, because it means the parser is not concerned with the underlying implementation, only the methods the object responds to.

Files and Strings are fairly obvious candidates for XML sources, and with a little reflection you could think up a few more. For me, one that quickly came to mind was a database query. Running an SQL query and getting the results back as an XML string is becoming more commonplace; the Ruby DBI library even includes a method to do just that.

Now, you could make your SQL call, format the results in XML, and simply pass the string to the parser. But, for me at least, that exposes too much of how the process works. My site currently pulls the RAA XML from a file, but I may one day prefer to get it from a database. It would be nice if I could simply create a database Source class and swap that for the current file-based Source further up the code chain.

Note: One of the touted benefits of stream parsing is that your code does not have to manage a growing in-memory image of all the XML. The ItemBuilderListener class currently does hold the data, but it would not be too hard to extend it to interpolate the data on the fly, and immediately write the results to a file or socket. Then, in principle, it could handle arbitrarily large XML sources. Similarly, a good database-derived Source should not rely on collecting all of the data up front. It should implement the required Source methods, but (optionally) pull the data from the database only as the parser needs it. On the other hand, to make the code portable, the database calls should be as driver-agnostic as possible; it should not require a particular database. I chose the DBI library to improve portability, but at the cost of performance. There is no guarantee how the underlying driver is managing the query and result set. The example serves to demonstrate how to create a potentially useful custom Source class, but if you need the best performance you should write a class based on a specific database driver, using the specific API to its best advantage.

From SQL to Source

A Source class works by populating a string buffer and running a regular expression over it. If you look at the code for the IOSource Source class you’ll see that the code works by pulling blocks of text, 500 bytes each, from the file while the parser works its way through the XML. My DBSource class does something very similar. Where the IOSource version uses the File#read method, my class uses fetch_many to retrieve some number of rows. The row retrieval is wrapped in a method that takes the record sets and converts them into XML before returning the data.

The code for my source class is similar to the code for IOSource Source class. The reason is that, as an IOSource was used to fill in for a String, I wanted a database query to fill in as a String. I felt it would be easier to go through the IOSource Source class and add/modify code so that the new source would behave the same.

The DBSource class derives from the REXML Source class. This allows the code to make calls to super and reuse methods implemented in the base class. The initialize method takes a single argument, presumed to be a DBI::StatementHandle object. Note that the code does not have any require statements for either DBI or REXML. These are not needed, as the main program where you would use a DBSource class would already have the necessary require statements.

class DBSource < REXML::Source

def initialize dbhArg

@firstRead = true # Have we read any rows before?

@thisMany = 5 # How many rows to retrieve at a time

@currLine = 0 # Where are we?

@eof = false # More data left?

@dbh = dbhArg # The DBI statement handle

The method sets some instance variables, and assigns the statement handle. It then calls super to execute code in the base class initialize method. The base class gives us a string buffer to hold the text being processed, and also tracks the encoding of the source XML.

super readRows(@thisMany)

readRows is a private method that takes a specified number of record sets, converts them to XML, and returns the text.

A variable is then set to track whether the source XML needs to be converted to UTF8 before being processed:

@to_utf = (@encoding == "UTF16" or @encoding == "UNILE")

end

The current_line method simply returns the current value of the @currLine instance variable.

def current_line

@currLine

end

A Source class has two methods used to search chunks of text for markup. The first is scan. The first parameter is a regular expression; the second is a Boolean value specifying if the source XML should be discarded after being parsed.

def scan pattern, consume=false

mtchdata = super

The method uses super to retrieve the MatchData object returned when String#scan is called. In this case, the base method calls @buffer.scan(pattern). It will return nil if the buffer is empty. The code for the base version of scan is brief. That’s because it is working with a String; it has all of the text in one place. However, as with the IOSource class, the DBSource scan needs to replenish the buffer so long as there is more data to process. Since the tasks are essentially the same, I took the code for IOSource, and modified it to use readRows when replenishing the buffer.

if mtchdata.size == 0

until @buffer =~ pattern or empty?

begin

s = readRows(@thisMany)

s = utf8_enc(s) if s and @to_utf

@buffer << s

rescue

@eof = true

@dbh.finish

end

end

mtchdata = super

end

mtchdata

end

The code loops while grabbing more XML; it breaks when either the buffer contains a RegExp match, or the XML source is depleted.

The match method is very similar. It, too, was taken from the IOSource code, and is a sort of complement to the scan method. Where scan mimics the String method of the same name (and which takes a RegExp as a parameter), match fills in for the RegExp match method (which takes a String parameter).

And as with scan, the method performs a first pass over the buffer, then loops to replenish the working buffer.

def match(pattern, consume=false)

mtchdata = pattern.match @buffer

@buffer = $' if consume and mtchdata

while !mtchdata and !empty?

begin

s = readRows(@thisMany)

if s.length == 0

@eof = true

else

s = utf8_enc(s) if s and @to_utf

@buffer << s

mtchdata = pattern.match @buffer

@buffer = $' if consume and mtchdata

end

rescue

@eof = true

end

end

mtchdata

end
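The replenish-and-retry loop shared by scan and match can be sketched without REXML or DBI. In the following, ChunkedBuffer is a hypothetical stand-in for a Source, and an array of strings stands in for the batches of rows; it is meant only to show the technique, not the REXML Source API.

```ruby
# Hypothetical stand-in for a Source: refills a buffer from chunks on demand.
class ChunkedBuffer
  def initialize(chunks)
    @chunks = chunks.each # external enumerator over the pending chunks
    @buffer = String.new
    @eof = false
  end

  # Return the MatchData for pattern, pulling in more chunks until the
  # pattern matches or the input runs out. When consume is true, the
  # buffer is cut down to the text after the match.
  def match(pattern, consume = false)
    md = pattern.match(@buffer)
    until md || @eof
      begin
        @buffer << @chunks.next
        md = pattern.match(@buffer)
      rescue StopIteration
        @eof = true
      end
    end
    @buffer = md.post_match if consume && md
    md
  end

  # True only when the input is exhausted and nothing but whitespace remains.
  def empty?
    @eof && @buffer.strip.empty?
  end
end

# The name element is split across two chunks on purpose.
src = ChunkedBuffer.new(["<row><na", "me>Foo</name>", "</row>"])
md = src.match(/<name>(.*?)<\/name>/, true)
puts md[1]      # prints Foo
puts src.empty? # prints false
```

MatchData#post_match returns the same text as the $' global, without depending on which match ran last.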

The last public method is empty?, and it returns true if the @eof variable is true and the buffer is depleted.

def empty?

@eof && @buffer.strip.empty?

end

Some private methods are defined to help things along. First, since XML has rules governing special text, the text pulled from the database is munged, with certain characters replaced by entity references:

private

def textconv(str)

str = str.to_s.gsub("&", "&amp;")

str = str.gsub("'", "&apos;")

str = str.gsub("\"", "&quot;")

str = str.gsub("<", "&lt;")

str.gsub(">", "&gt;")

end

Where the IOSource could use the IO#read method to grab more data, the database source uses readRows. It takes a single parameter, which specifies how many rows to read.

def readRows(cnt)

xml = ""

if @firstRead

xml = "<source>"

@firstRead = false

end

If this is the first time the method is called, then the XML emitted needs to include a one-time beginning tag for the root element. (For the sake of simplicity I’ve hard-coded the element names used for the root, and for each row. You may prefer to have these as parameters passed to initialize.)

The code tries to retrieve the specified number of rows, incrementing the line number by the number of rows actually returned. If there’s an error, the code just bails, setting @eof to true and releasing the statement handle:

begin

rows = @dbh.fetch_many(cnt)

@currLine += @dbh.rows

rescue

@eof = true

@dbh.finish

return ""

end

If all went well, the code iterates over the rows returned, converting the data into XML. It uses the field names to create elements for the data:

if rows && !@eof

rows.each{ |ro|

xml << "<row>"

begin

ro.each_with_name do |val, name|

xml << " <#{name}>" + textconv(val) + "</#{name}>\n"

end

rescue

@eof = true

@dbh.finish

return xml

end

xml << "</row>"

}

Again, if there is an error, the code tries to clean up and get out.

If no rows could be fetched then the code closes up the XML stream by emitting the end root tag:

else

xml << "</source>"

@eof = true

@dbh.finish

end

Finally, whatever XML has been constructed is returned:

xml

end

end
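The row-to-XML conversion can be tried without a database. In this sketch plain hashes stand in for DBI rows, which is an assumption made so the example runs anywhere; rows_to_xml and escape_xml are hypothetical names, and the first_batch and last_batch flags take the place of the @firstRead and @eof bookkeeping.

```ruby
# Escape the five characters XML treats specially. The ampersand goes
# first so the entities added by the later steps are not escaped again.
def escape_xml(value)
  value.to_s
       .gsub("&", "&amp;")
       .gsub("'", "&apos;")
       .gsub('"', "&quot;")
       .gsub("<", "&lt;")
       .gsub(">", "&gt;")
end

# rows: an array of hashes mapping field names to values.
def rows_to_xml(rows, first_batch: false, last_batch: false)
  xml = String.new
  xml << "<source>" if first_batch   # one-time root start tag
  rows.each do |row|
    xml << "<row>"
    row.each { |name, val| xml << "<#{name}>#{escape_xml(val)}</#{name}>" }
    xml << "</row>"
  end
  xml << "</source>" if last_batch   # one-time root end tag
  xml
end

rows = [{ "title" => "SAX & DOM", "author" => "K. Clark" }]
puts rows_to_xml(rows, first_batch: true, last_batch: true)
# prints <source><row><title>SAX &amp; DOM</title><author>K. Clark</author></row></source>
```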

Putting it Together

Now we can rework the first application to use a DBSource source and pipe the data into some templates using the stream parser.

require "rexml/document"

require "ItemBuilderListener"

require "dbsource"

require "dbi"

include REXML

This version needs to require dbsource and dbi to create the needed classes. It begins by connecting to a MySQL database where article information is stored. A DBI statement handle is created by executing a SQL query to pull back all of the articles. This statement handle is then used to construct a DBSource instance:

begin

dbh = DBI.connect("DBI:Mysql:rubyxml_stuff", "user", "passwd")

sth = dbh.execute("SELECT * FROM articles")

dbSrc = DBSource.new sth

Two File objects are created that point to the needed templates:

rowTmpl = File.new("rowTmpl.html")

mainTmpl = File.new("sourceTmpl.html")

We use the same listener class as before, but tell it to look for different root and item elements:

mylistener = ItemBuilderListener.new "source", "row", mainTmpl, rowTmpl

The code uses the REXML Document stream parser to join data and templates in commingled bliss; the results are then retrieved using get_interpolation:

Document.parse_stream(dbSrc, mylistener)

print mylistener.get_interpolation

end

I’m not showing the templates and results here, partly to save space, and partly because they won’t do you much good unless you have the same table structure I use. They’re like the first set of templates, though: they consist of some basic HTML, with #{variables} embedded. The variable names match the names of the fields in the database table. Then, once you have a Source class producing XML, the parser and the listener perform as they would with a file-based Source.

Summary

This article explored using an XML event stream to drive a data transformation process. We saw how to use the stream parser from REXML, which works with StreamListeners and Sources. We looked at a StreamListener class that builds an output string by reading in an XML file and repeatedly populating a simple template. We also saw that the REXML stream parser can work with any class that implements the Source API, and wrote a Source class that wraps a database statement handle.

If you were not familiar with event-based XML processing, I hope this article piqued your interest and gave you enough information to explore on your own. Source code may be downloaded here. As always, please send any comments, questions, or corrections to jbritt@rubyxml.com

Additional Resources

SAX

The SAX Project homepage: http://www.saxproject.org

REXML

The REXML homepage: http://www.germane-software.com/~ser/Software/rexml/

"DOM and SAX Are Dead, Long Live DOM and SAX"

Article by Kendall Grant Clark: http://www.xml.com/pub/a/2001/11/14/dom-sax.html

 