Stream Parsing with REXML
James Britt, December 15, 2001
Introduction
There are two dominant ways of
working with XML documents: as a tree (e.g., the W3C
XML DOM) and as an event stream (e.g., SAX). There are
various arguments for and against each of these,
depending on what it is you need to do. The DOM is
basically a good choice when dealing with document
data, such as what might come from a word processor. SAX is
better suited for handling highly-structured data, such
as rows from a database. There are, of course, forms
of XML that fall somewhere in-between. Often the
choice is not obvious, and extraneous details may be
influential. (Kendall Grant Clark has written a very
interesting article on the choice between SAX and DOM,
at
http://www.xml.com/pub/a/2001/11/14/dom-sax.html)
When transforming XML into, say,
HTML, the typical approach is to read the source XML
into a DOM, then apply an XSLT style sheet. However,
there may be reasons to want an event-based approach to
this. One is that the amount of data you need to
transform is so large that it could not be held in
memory. Or, you may be writing code that will be
maintained by someone else, and you want another
alternative to having them learn XSLT. Another reason
may be that no suitable XSLT processor is available for
your language of choice. In my own case, it was that
last reason that prompted me to explore event streams
for XML transformation.
What I will describe here is how
to use REXML’s stream parser to create HTML based
on some regularly-structured XML. In a sense
it’s like a poor-man’s XSLT. Nonetheless,
it does the job, and provides a good example of
event-based programming.
The Problem Space
The rubyxml.com home page shows
three columns of data: XML-related RAA entries,
Ruby-specific XML news, and general XML news. The data
for each of them come from XML files (though this may
change for the Ruby XML news items). The XML has a
periodic structure: there’s a root element,
possibly some header stuff, then some number of
identically-structured child elements. For example,
the RAA data look like this:
<raa>
  <package>
    <product_download>http://www.jin.gr.jp/~nahi/Ruby/SOAP4R/soap4r-1.3.8.tar.gz</product_download>
    <product_status>usable</product_status>
    <product_version>1.3.8</product_version>
    <product_license>GPL</product_license>
    <product_name>SOAP4R</product_name>
    <product_homepage>
      http://www.jin.gr.jp/~nahi/Ruby/SOAP4R/wiki.cgi?cmd=view;name=top
    </product_homepage>
    <product_description>"SOAP4R" is a Ruby
      library program to handle Simple Object Access
      Protocol (SOAP) 1.1 (W3C Note).
      For more details, see
      http://www.jin.gr.jp/~nahi/Ruby/SOAP4R/RELEASE.en.html
      or
      http://www.jin.gr.jp/~nahi/Ruby/SOAP4R/RELEASE.ja.html
    </product_description>
    <owner_email></owner_email>
    <owner_name>NaHi</owner_name>
    <owner_id>-NaHi</owner_id>
    <category_major>Library</category_major>
    <category_minor>XML</category_minor>
    <update>2001-10-04</update>
  </package>
  <package>
    <product_download>http://www.chadfowler.com/ruby/rss/rss-ruby-0.9.1.tar.gz</product_download>
    <product_status>usable/needs error handling</product_status>
    <product_version>0.9.1</product_version>
    <product_license>Ruby's</product_license>
    <product_name>Ruby/RSS</product_name>
    <product_homepage>
      http://www.chadfowler.com/index.cgi?mode=max&amp;cat=ruby
    </product_homepage>
    <product_description>
      An object oriented Ruby library for parsing, creating,
      downloading, and caching RSS
      (http://my.netscape.com/publish/help/mnn20/quickstart.html.
    </product_description>
    <owner_email>cfowler@chadfowler.com</owner_email>
    <owner_name>Chad Fowler</owner_name>
    <owner_id>cfowler@chadfowler.com-Chad Fowler</owner_id>
    <category_major>Library</category_major>
    <category_minor>XML</category_minor>
    <update>2001-02-18</update>
  </package>
  <!-- More packages ... -->
</raa>
The data come from an XML-RPC
call; I have some code that makes the call and writes
out the results as an XML file. The markup specifics
aren’t important; the main thing is that this
sort of XML is uniformly structured, and as such is a
good candidate for SAX (Simple API for XML). A SAX parser reads an XML source and
raises an event each time it encounters something of
interest. "Something of interest" means,
for example, the start and end of elements, or the
presence of a processing instruction.
There is no formal SAX
specification. SAX evolved from discussions within the
XML developer community on the
xml-dev mailing list. It was first written in
Java, and it is that Java implementation that serves as the
spec. Nonetheless, SAX parsers have been written for
other programming languages, such as Python, Perl, and
Visual Basic. These versions provide essentially the
same classes and methods, with allowances for various
language quirks.
There is currently no SAX parser
for Ruby, but REXML does include a stream parser whose
API is suitably SAX-like. Like a SAX parser, it allows
us to read an XML source, and have methods called at
key events.
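To make the idea of "methods called at key events" concrete, here is a minimal sketch. It assumes a current REXML, where REXML::StreamListener is a module supplying empty default handlers; the EventRecorder class and the one-line sample XML are made up for illustration.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener that simply records each event it is told about.
class EventRecorder
  include REXML::StreamListener   # no-op defaults for events we ignore

  attr_reader :events

  def initialize
    @events = []
  end

  # Called when the parser sees a start tag.
  def tag_start(name, attrs)
    @events << [:start, name]
  end

  # Called when the parser sees an end tag.
  def tag_end(name)
    @events << [:end, name]
  end

  # Called for character data between tags.
  def text(content)
    @events << [:text, content]
  end
end

xml = "<raa><package><product_name>SOAP4R</product_name></package></raa>"
recorder = EventRecorder.new
REXML::Document.parse_stream(xml, recorder)
recorder.events.each { |event| p event }
```

Nothing is kept in memory except what the listener chooses to keep, which is the whole appeal of the approach.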
Why SAX?
If you've spent much time
working with XML, you could solve this problem using
XSLT. A simple template could grab the data and emit
the HTML needed. So why not? Earlier I said that my
event-driven template tool was like a poor-man's
XSLT. In fact, this approach can outshine XSLT when
the amount of XML to process is large. Imagine that,
instead of creating some simple HTML from a relatively
small file, we needed to transform several megabytes of
data into CSV. While the XSLT style sheet would
perhaps be quite simple, the application would need
sufficient memory to hold both the source tree and the
result tree. This may not be practical.
Stream-parsing XML is very useful
in another situation. Imagine you are going to receive
this multi-meg data file, and have reason to believe
that, out of the bazillion elements, at least one of
them will contain something that violates XML's
well-formedness rules. A proper XML parser is obligated
to stop processing at the first such error. This would
be a catastrophe in a file this large; fixing all
errors and re-parsing the entire file just won't
do.
However, we can run the file
through a stream parser and construct temporary
in-memory XML documents from the passing data. The
application can check that each subsection is
well-formed before passing it on. If any errors are
encountered, the faulty text can be written to a log
file, and the parsing can resume. In this way, you can
get all of the good data, while isolating the bad.
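As a rough illustration of isolating bad records, here is a simplified stand-in for that process. Rather than assembling the temporary documents from parser events, it cuts the input into records with a regular expression and checks each one on its own; it assumes records never nest. The sort_records helper and the sample data are hypothetical.

```ruby
require "rexml/document"

# Split raw text into records, returning [good_names, bad_chunks].
# Assumption: <package> records never nest, so a non-greedy regex can
# isolate each one before it is checked on its own.
def sort_records(raw)
  good = []
  bad  = []
  raw.scan(%r{<package>.*?</package>}m).each do |chunk|
    begin
      doc = REXML::Document.new(chunk)   # raises on ill-formed markup
      good << doc.root.elements["name"].text
    rescue REXML::ParseException
      bad << chunk                       # would go to a log file in practice
    end
  end
  [good, bad]
end

# Hypothetical input: the second record is missing a </name> end tag.
raw = <<~XML
  <package><name>Foo</name></package>
  <package><name>Bar</package>
  <package><name>Baz</name></package>
XML

good, bad = sort_records(raw)
puts "good: #{good.inspect}"
puts "bad:  #{bad.length} record(s) set aside"
```

One broken record costs you that record only, not the rest of the file.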
On a more immediate practical
matter, Ruby does not have a suitable XSLT processor I
can use. Nonetheless, I want a process that can read
an XML source and transform it. If later I decide XSLT
is a better option, I at least won’t need to
change how I get my data, and the general process flow
is the same, minimizing code changes.
What I came up with was to use an
event stream to populate an HTML template. It would
grab the RAA data and repeatedly "fill in the
blanks" in the template. The final set of
processed data would then be inserted into a main
template. The process happens in two steps: parse the
event stream into a temporary internal structure, then
interpolate that structure into the templates. (The
internal structure isn't essential in the current
version, but it could be useful if I later decide to do
some post-parsing processing, such as sorting, on the
acquired data.)
From Events to Responses
The REXML stream parser requires
two objects: a Source and a
StreamListener.
A Source is pretty much what you might expect:
it’s where the XML comes from. REXML has a
Source class, and provides two ways to
instantiate one. First, there is the SourceFactory::create_from
method. It accepts either a String or
an IO object. You
would use it like this:
xmlfile = File.new "raa.xml"
src = SourceFactory.create_from xmlfile
Another way is to call
new on the Source
class:
src = Source.new "<doc>My XML string</doc>"
This only works with a String.
A StreamListener complements the
Source: it is what will respond to the events
triggered by the Source.
There is no factory or base class for this. To create
one you have to write a class that implements the
REXML::StreamListener interface. A listener
class must be able to handle all of the
StreamListener methods. If you do not provide
all of the method definitions then you must define the
method_missing method.
The ItemBuilderListener
What I wrote was a simple
StreamListener class with a few extra methods.
It reads in an XML file with RAA information, building
up a hash table for each RAA item. Each item hash
then goes into an array. When the parsing is done, the
data is interpolated into a set of templates: one
template defines each item, the other template acts as
a container.
class ItemBuilderListener
As the source XML is parsed, the
class maintains internal state in the form of an array
of hashes.
  @itemHash = Hash.new()
  @itemArray = Array.new()
Two flags are used as a simple
state machine. The code will only store the XML when
it is parsing item data contained by the designated
root element. By default these flags are set to
false.
  @insideRoot = false
  @insideItem = false
The class also holds the element
names for the root and item elements.
If the class is parsing item data,
each item child element will become a hash item in
the current hash. However, that element may itself contain
child elements.
For example, assume the item
element to process is named "package". Then,
each immediate child of "package" will become a
hash item. Now, imagine the parser comes across this
XML:
<package>
  <name>Foo</name>
  <date>01/01/2002</date>
  <description>Look for <i>child</i> elements.</description>
</package>
When the stream parser encounters
the package element, it examines the element name and
sees that this is the item element. A new hash object
is created; as each child element is encountered, hash
entries are created (for name, date, and description).
The textual content is added as the hash item
entry:
name => Foo
date => 01/01/2002
However, as the parser moves
through the content of the description element, it will
encounter additional elements. We want to store this as
part of the description hash entry, so the listener
class watches to see how "far" it is from the
item element. It stores this information in two
variables. One tells the class how far (or deep) the
current element is from the designated item element.
The other variable tracks the current hash entry so
that all subsequent data is appended to the right
place.
  @depthFromItem = 0
  @currentProperty = ""
Finally, there are variables to
store templates to be used when building the
output.
  @itemTmpl = ""
  @rootTmpl = ""
Interpolation is simply a matter
of replacing each instance of a template variable with
the corresponding hash entry. Template variables
follow the Ruby syntax for embedding variables in
strings:
This is where a #{variable} goes
A method is defined to iterate
through a hash, using the hash key as the base for the
template variable it is substituting:
  def interpolateItem iHash
    res = String.new(@itemTmpl)
    iHash.each{ |k,v|
      var = "\#{" + k + "}"
      res.gsub!(var, v.strip)
    }
    res
  end
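The same substitution can be tried outside the listener class. This is a small standalone sketch, not the class method itself: the interpolate function, the template text, and the sample hash are made up for illustration, and the block form of gsub! is my own choice so that backslashes in a value are not read as back-references.

```ruby
# Fill "#{key}" placeholders in a template from a hash of values.
def interpolate(template, values)
  result = template.dup
  values.each do |key, value|
    placeholder = "\#{" + key + "}"   # the literal text #{key}
    # Block form: the replacement is inserted exactly as given.
    result.gsub!(placeholder) { value.strip }
  end
  result
end

# Single quotes keep Ruby from interpolating the placeholders itself.
item_tmpl = '<a href="#{product_homepage}">#{product_name}</a> (#{product_version})'
item = {
  "product_name"     => "SOAP4R",
  "product_version"  => "1.3.8",
  "product_homepage" => "\n http://www.jin.gr.jp/~nahi/Ruby/SOAP4R/wiki.cgi?cmd=view;name=top\n"
}
puts interpolate(item_tmpl, item)
```

The strip call matters because text pulled from the XML often carries the line breaks and indentation that surrounded it in the source file.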
Another method is provided to emit
the completed interpolation. It walks through the
array of hashes, interpolating each one. The combined
set is then swapped into the root template:
  def get_interpolation
    s = ""
    @itemArray.each{ |h|
      s << interpolateItem(h)
    }
    res = String.new(@rootTmpl)
    var = "\#{" + @root + "}"
    res.gsub!(var, s)
    res
  end
As the stream parser accumulates
data, it stores each completed item hash by adding it
to the global array:
  def pushItem iHash
    newHash = iHash.dup
    @itemArray.push newHash
    interpolateItem iHash
  end
A new
ItemBuilderListener is created by calling
new with the names of the root and item
elements, plus the root and item templates. As a small
nicety, the method allows
File objects, rather than literal
Strings, to be passed in for the templates.
  def initialize root, item, rootTmpl, itemTmplHtml
    @root = root
    @item = item
    @itemHash = Hash.new(nil)
    @itemArray = Array.new()
    if itemTmplHtml.kind_of? File
      @itemTmpl = itemTmplHtml.readlines.join "\r"
    elsif itemTmplHtml.kind_of? String
      @itemTmpl = itemTmplHtml
    else
      raise "Bad item template argument!"
    end
    if rootTmpl.kind_of? File
      @rootTmpl = rootTmpl.readlines.join "\r"
    elsif rootTmpl.kind_of? String
      @rootTmpl = rootTmpl
    else
      raise "Bad root template argument!"
    end
  end
Now we come to the essential
code. Whenever the stream parser encounters the
beginning of an element, it will call
tag_start, passing in the name of the element
and an
attributes object. The listener class examines
this information, checking and perhaps altering its
internal state as it acts on the data.
  def tag_start name, attrs
    if name == @root
      @insideRoot = true
    elsif name == @item
      @depthFromItem = 0
      @insideItem = true
If the name of the element matches
the designated root, then the
insideRoot flag is set to true. Likewise, if
the element name matches the item name, then the class
state changes to
insideItem. Further, if this is an item
element, then the depth from the item is, of course,
zero.
    elsif @insideItem
      @depthFromItem += 1
If this is not the root nor the
item element, then the class checks if parsing is
already occurring inside an item element. If so, then
the depth is incremented.
If item parsing is in progress,
and this is an immediate child element (i.e., depth
from item is one), then the class sets the
current property to the current element name. It also
clears out any leftover values from the corresponding
hash item:
      if @depthFromItem == 1
        @currentProperty = name
        @itemHash[@currentProperty] = ""
Finally, if item parsing is in
progress, and this element is deeper than an immediate
child, the start tag is appended to the current
property hash item. If there are any attributes, they
must be reconstructed as well before being added to the
hash value.
      elsif @depthFromItem > 1
        @itemHash[@currentProperty] += "<#{name}"
        attrs.each{|a|
          @itemHash[@currentProperty] += " " + attr_to_s(a)
        }
        @itemHash[@currentProperty] += ">"
      end
    end
  end
When a closing tag is encountered,
a reverse process takes place. If this was the root
element, then parsing of the root has ended, and the
class state is updated.
  def tag_end name
    if name == @root
      @insideRoot = false
Likewise, if this is the closing
tag for an item element, then
insideItem is set back to false, and the current
hash table is added to the global array:
    elsif name == @item
      @insideItem = false
      pushItem @itemHash
      @itemHash.clear
On the other hand, if parsing is
still happening inside an item, then the code checks
the depth from the item element. If it's greater
than one, then the element was deeper than an immediate
child, so a closing tag has to be constructed and added
to the current property hash:
    elsif @insideItem
      if @depthFromItem > 1
        @itemHash[@currentProperty] += "</#{name}>"
      end
The depth is then decreased by
one:
      @depthFromItem -= 1
    end
  end
When any text is encountered, the
code must look at its current state to see if the
content is to be added to the hash value of the current
property:
  def text text
    if @insideItem && @depthFromItem > 0
      @itemHash[@currentProperty] += text
    end
  end
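Seen in one piece, the depth-tracking logic of tag_start, tag_end, and text looks roughly like this. It is a trimmed-down sketch, not the full ItemBuilderListener: it skips the root flag, the templates, and attribute handling, and it assumes a current REXML where REXML::StreamListener can be included as a module. The DepthTrackingListener name is made up.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: one hash per item element, with any markup nested
# inside a property re-serialized into that property's value.
class DepthTrackingListener
  include REXML::StreamListener

  attr_reader :items

  def initialize(item_name)
    @item_name = item_name
    @items     = []
    @current   = nil   # hash being filled, or nil when outside an item
    @depth     = 0     # distance from the item element
    @property  = nil   # key that text is currently appended to
  end

  def tag_start(name, attrs)
    if name == @item_name
      @current = {}
      @depth   = 0
    elsif @current
      @depth += 1
      if @depth == 1
        @property = name
        @current[@property] = String.new
      else
        @current[@property] << "<#{name}>"   # attributes omitted for brevity
      end
    end
  end

  def tag_end(name)
    if name == @item_name
      @items << @current
      @current = nil
    elsif @current
      @current[@property] << "</#{name}>" if @depth > 1
      @depth -= 1
    end
  end

  def text(content)
    @current[@property] << content if @current && @depth > 0
  end
end

xml = "<raa><package><name>Foo</name>" \
      "<description>Look for <i>child</i> elements.</description>" \
      "</package></raa>"

listener = DepthTrackingListener.new("package")
REXML::Document.parse_stream(xml, listener)
p listener.items
```

The description entry ends up holding the inline i element as markup, because the start and end tags at depth two are written back into the string as they go by.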
Finally, a small helper method is
used to ensure that attributes are correctly
reconstructed from the attributes object passed into
tag_start.
The REXML
attribute class provides its own method for
this, but relying on it might put this application at risk
of breaking, should that method change. This helper
method ensures that any single-quote characters in the
attribute's value are encoded, and that single
quotes are used as delimiters:
  def attr_to_s attr
    val = attr[1].gsub(/'/, "&apos;")
    attr[0] + "='#{val}'"
  end
end
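Here is a standalone take on the attribute helper, to show what it produces. It assumes each attribute arrives as a [name, value] pair, which is what iterating over a Hash yields. Escaping the ampersand and the less-than sign as well is my addition, not part of the listener above.

```ruby
# Serialize one [name, value] pair as name='value', escaping characters
# that would otherwise break a single-quoted attribute.
def attr_to_s(attr)
  name, value = attr
  escaped = value.gsub("&", "&amp;").gsub("<", "&lt;").gsub("'", "&apos;")
  "#{name}='#{escaped}'"
end

# Hypothetical attributes, as they might be handed to tag_start.
attrs = { "href" => "index.cgi?a=1&b=2", "title" => "Ruby's RSS" }
puts attrs.map { |pair| attr_to_s(pair) }.join(" ")
```

The ampersand has to be handled first; otherwise the entity references produced by the later substitutions would themselves be escaped.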
That ends the
StreamListener class. It’s very
task-specific, ignoring various markup such as
processing instructions or comments. Still, it serves
a very practical purpose, and handling the other parts
of an XML document works basically the same way. So,
with a listener, the next step is to get a
Source to feed it.
Picking a Source
Stream parsing is provided via a
Document class method:
Document.parse_stream( source, listener )
We’ve seen the listener
class; let’s see what the options are for a
source. The
parse_stream method requires an object that
responds to the methods exposed by the
Source class. You can create a
Source object yourself, and pass it to the
stream parser, or you can give the stream parser an
object that can be converted to a
Source object.
The most recent version of REXML
allows one to call
parse_stream passing in a
Source, a
String, or a
File object. It looks at what it has been given,
and will convert a
String or a
File into an appropriate
Source object before moving on. This allows one
to write more natural code:
# Assumes we already have a listener ...
xmlfile = File.open "raa.xml"
Document.parse_stream xmlfile, mylistener
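A quick way to see that the kind of source does not matter to the listener is to feed the same XML in twice, once as a String and once as an IO. In this sketch a StringIO stands in for a File so that no file on disk is needed; the TagCounter class is hypothetical.

```ruby
require "rexml/document"
require "rexml/streamlistener"
require "stringio"

# Hypothetical listener that only counts start tags by element name.
class TagCounter
  include REXML::StreamListener

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  def tag_start(name, attrs)
    @counts[name] += 1
  end
end

xml = "<raa><package/><package/><package/></raa>"

from_string = TagCounter.new
REXML::Document.parse_stream(xml, from_string)

from_io = TagCounter.new
REXML::Document.parse_stream(StringIO.new(xml), from_io)   # stands in for a File

p from_string.counts
p from_io.counts
```

Both listeners should end up with identical counts.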
My current setup has the RAA data
in a text file. To run it through a stream parser I
would just need to create the
File object, plus the listener. Let’s
take a look at a Ruby program that uses the
ItemBuilderListener class and
Document.parse_stream to take the RAA data and
convert it into some spiffy HTML.
We saw above that the listener
class needs two templates to do its job. Here then are
those templates. The first one is the root template
(raaTmpl.html):
Pretty simple. It will just wrap a
div element around the collection of RAA items. Now
here’s the item template (itemTmpl.html):
<table class="rssitem_table" cellSpacing="1"
       cellPadding="3" width="160"
       bgColor="#003366" border="0">
  <tbody>
    <tr>
      <td class="rssitem_top" bgColor="#ffc66">
        <a href="#{product_homepage}">#{product_name}</a><br>
        Status: <i>#{product_status}</i><br>
        Version: <i>#{product_version}</i><br>
        Updated: <i>#{update}</i><br>
        </font>
      </td>
    </tr>
    <tr>
      <td class="rssitem_bottom">
        <div class="rssitem_bottom_font">#{product_description}</div>
      </td>
    </tr>
  </tbody>
</table>
It creates a table for each RAA
entry, displaying a subset of the data available in the
source XML.
We can now write the small app to
put this all together:
$:.push(".")
require "rexml/document"
require "ItemBuilderListener"
begin
  itemTmpl = File.new("itemTmpl.html")
  mainTmpl = File.new("raaTmpl.html")
  mylistener = ItemBuilderListener.new "raa", "package", mainTmpl, itemTmpl
  xmlfile = File.new "raa3.xml"
  begin
    REXML::Document.parse_stream xmlfile, mylistener
  rescue Exception
    puts "Error: #{$!}\n"
  end
  print mylistener.get_interpolation
end
The code creates
File objects for the templates, and passes them
to the listener constructor, along with the names of
the root and item elements (here “raa”, and
“package”). Another
File is used to provide an XML source
(“raa3.xml”).
The listener and the source are
then passed to the stream parser. The results are then
emitted by calling
get_interpolation.
Use the Source: Creating a Custom Source Class
Although
parse_stream allows you to pass in a
File or
String in place of a
Source object, it ultimately uses a
Source object. But what is a Source?
If you look at the code in
source.rb, you’ll find that a
Source class exposes these attributes and
methods:
| @buffer | Read-only attribute holding some part of the source XML. |
| @line | Read-only attribute indicating the current line number of the XML source. |
| initialize | Basis for the new method; takes one argument. |
| scan | Like the scan method of the String class. Takes one or two arguments. The first is a RegExp pattern; the second is a Boolean telling the method to consume the source text already scanned. This defaults to false. |
| match | Like the =~ method of the String class. Takes one or two arguments. The first is a RegExp pattern; the second is a Boolean telling the method to consume the source text already matched. This defaults to false. |
| empty? | Method indicating if there is any more text to process. |
| current_line | The current line number being processed. |
| encoding | What encoding the XML uses (e.g. UTF-8, UTF-16). |
| utf8_enc | Modifies the character encoding. |
It’s the job of a
Source class to provide the stream parser with
the means to pull more data from the source, and to
examine the data for markup. This is incredibly
handy, because it means the parser is not concerned
with the underlying implementation, only the methods
the object responds to.
Files and Strings are fairly
obvious candidates for XML sources, and with a little
reflection you could think up a few more. For me, one
that quickly came to mind was a database query.
Running an SQL query and getting the results back as
an XML string is becoming more commonplace; the Ruby
DBI library even includes a method to do just that.
Now, you could make your SQL call,
format the results in XML, and simply pass the string
to the parser. But, for me at least, that exposes too
much of how the process works. My site currently pulls
the RAA XML from a file, but I may one day prefer to
get it from a database. It would be nice if I could
simply create a database
Source class and swap that for the current
file-based
Source further up the code chain.
Note: One of the
touted benefits of stream parsing is that your code
does not have to manage a growing in-memory image of
all the XML. The
ItemBuilderListener class currently does hold the data,
but it would not be too hard to extend it to
interpolate the data on the fly, and immediately write
the results to a file or socket. Then, in principle,
it could handle arbitrarily large XML sources.
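Here is a rough sketch of what interpolating on the fly might look like. It is not the ItemBuilderListener. It assumes each item has only flat child elements (no nested markup), renders an item as soon as its end tag arrives, and writes it straight to an IO. The StreamingItemListener class, the template, and the sample XML are all hypothetical.

```ruby
require "rexml/document"
require "rexml/streamlistener"
require "stringio"

# Hypothetical listener: only one item is ever held in memory.
class StreamingItemListener
  include REXML::StreamListener

  def initialize(item_name, template, out)
    @item_name = item_name
    @template  = template
    @out       = out      # anything that responds to <<
    @item      = nil
    @property  = nil
  end

  def tag_start(name, attrs)
    if name == @item_name
      @item = {}
    elsif @item
      @property = name
      @item[name] = String.new
    end
  end

  def text(content)
    @item[@property] << content if @item && @property
  end

  def tag_end(name)
    if name == @item_name
      # Replace each #{key} with the matching value, then discard the item.
      @out << @template.gsub(/\#\{(\w+)\}/) { @item.fetch($1, "").strip }
      @item = nil
    else
      @property = nil
    end
  end
end

xml = "<raa><package><name>Foo</name><version>1.0</version></package>" \
      "<package><name>Bar</name><version>2.1</version></package></raa>"
tmpl = '<li>#{name} #{version}</li>' + "\n"

out = StringIO.new   # stands in for a file or socket
REXML::Document.parse_stream(xml, StreamingItemListener.new("package", tmpl, out))
print out.string
```

The trade-off is that anything needing the whole data set, such as sorting, is no longer possible inside the listener.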
Similarly, a good database-derived
Source should not rely on collecting all of the
data up front. It should implement the required
Source methods, but (optionally) pull the data
from the database only as the parser needs it. On the
other hand, to make the code portable, the database
calls should be as driver-agnostic as possible; it
should not require a particular database. I chose the
DBI library to improve portability, but at the cost of
performance. There is no guarantee how the underlying
driver is managing the query and result set. The
example serves to demonstrate how to create a
potentially useful custom Source class, but if you need
the best performance you should write a class based on
a specific database driver, using the specific API to
its best advantage.
From SQL to Source
A
Source class works by populating a string buffer
and running a regular expression over it. If you look
at the code for the
IOSource class you’ll see that the code
works by pulling blocks of text, 500 bytes each, from
the file while the parser works its way through the
XML. My
DBSource class does something very similar.
Where the
IOSource version uses the
File#read method, my class uses
fetch_many to retrieve some number of rows. The
row retrieval is wrapped in a method that takes the
record sets and converts them into XML before returning
the data.
The code for my source class is
similar to the code for the
IOSource class. The reason is that, just as an
IOSource was used to fill in for a
String, I wanted a database query to fill in for
a
String. I felt it would be easier to go through
the
IOSource class and add/modify code so that the new
source would behave the same.
The
DBSource class derives from the REXML
Source class. This allows the code to make
calls to
super and reuse methods implemented in the base
class. The
initialize method takes a single argument,
presumed to be a
DBI::StatementHandle object. Note that the code
does not have any
require statements for either DBI or REXML.
These are not needed, as the main program where you
would use a
DBSource class would already have the necessary
require statements.
class DBSource < REXML::Source
  def initialize dbhArg
    @firstRead = true   # Have we read any rows before?
    @thisMany  = 5      # How many rows to retrieve at a time
    @currLine  = 0      # Where are we?
    @eof       = false  # More data left?
    @dbh       = dbhArg # The DBI statement handle
The method sets some instance
variables, and assigns the statement handle. It then
calls super to execute code in the base class
initialize method. The base class gives us a string
buffer to hold the text being processed, and also
tracks the encoding of the source XML.
    super readRows(@thisMany)
readRows is a private method that takes a
specified number of rows, converts them to XML,
and returns the text.
A variable is then set to track
whether the source XML needs to be converted to UTF-8
before being processed:
    @to_utf = (@encoding == "UTF16" or @encoding == "UNILE")
  end
The
current_line method simply returns the current
value of the
@currLine instance variable.
  def current_line
    @currLine
  end
A
Source class has two methods used to search
chunks of text for markup. The first is
scan. The first parameter is a regular
expression; the second is a Boolean value specifying if
the source XML should be discarded after being parsed.
  def scan pattern, consume=false
    mtchdata = super
The method uses
super to retrieve the array of matches returned when
String#scan is called. In this case, the base
method calls
@buffer.scan(pattern). The result will be
empty if nothing in the buffer matches. The code for the
base version of scan is brief. That’s because it
is working with a
String; it has all of the text in one place.
However, as with the
IOSource class, the
DBSource scan needs to replenish the buffer so
long as there is more data to process. Since the tasks
are essentially the same, I took the code for
IOSource, and modified it to use
readRows when replenishing the buffer.
    if mtchdata.size == 0
      until @buffer =~ pattern or empty?
        begin
          s = readRows(@thisMany)
          s = utf8_enc(s) if s and @to_utf
          @buffer << s
        rescue
          @eof = true
          @dbh.finish
        end
      end
      mtchdata = super
    end
    mtchdata
  end
The code loops while grabbing more
XML; it breaks when either the buffer contains a
RegExp match, or the XML source is
depleted.
The
match method is very similar. It, too, was
taken from the
IOSource code, and is a sort of complement to
the
scan method. Where
scan mimics the
String method of the same name (and which takes
a
RegExp as a parameter),
match fills in for the
RegExp
match method (which takes a
String parameter).
And as with scan, the method makes
a first pass over the buffer, then loops to
replenish the working buffer.
  def match(pattern, consume=false)
    mtchdata = pattern.match @buffer
    @buffer = $' if consume and mtchdata
    while !mtchdata and !empty?
      begin
        s = readRows(@thisMany)
        if s.length == 0
          @eof = true
        else
          s = utf8_enc(s) if s and @to_utf
          @buffer << s
          mtchdata = pattern.match @buffer
          @buffer = $' if consume and mtchdata
        end
      rescue
        @eof = true
      end
    end
    mtchdata
  end
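The refill-until-match loop is easier to follow outside REXML. The class below is a hypothetical stand-in for a Source, not the DBSource itself: it pulls text from a list of chunks only when a match needs more input, and stops once the chunks run out.

```ruby
# Hypothetical stand-in for a Source: a small buffer that is topped up
# from an enumerator of text chunks only when a match requires it.
class RefillingBuffer
  attr_reader :buffer

  def initialize(chunks)
    @chunks = chunks.each    # external enumerator over the chunks
    @buffer = String.new
    @eof    = false
  end

  # True once the input is exhausted and nothing is left in the buffer.
  def empty?
    @eof && @buffer.empty?
  end

  # Return MatchData for pattern, reading more chunks until it matches or
  # the input runs out. With consume, drop everything up to the match end.
  def match(pattern, consume = false)
    md = pattern.match(@buffer)
    until md || @eof
      begin
        @buffer << @chunks.next
        md = pattern.match(@buffer)
      rescue StopIteration
        @eof = true
      end
    end
    @buffer = md.post_match if consume && md
    md
  end
end

# The element is split across chunk boundaries on purpose.
src = RefillingBuffer.new(["<row><na", "me>Foo</name", "></row>"])
md  = src.match(/<name>(.*?)<\/name>/, true)
puts md[1]
puts src.buffer
```

With consume set, whatever follows the match stays in the buffer for the next call.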
The last public method is
empty?, which returns
true if the
@eof variable is
true and the buffer is depleted.
  def empty?
    @eof && @buffer.strip.empty?
  end
Some private methods are defined
to help things along. First, since XML has rules
governing special text, the text pulled from the
database is munged, with certain characters replaced by
entity references:
  private
  def textconv(str)
    str = str.to_s.gsub("&", "&amp;")
    str = str.gsub("'", "&apos;")
    str = str.gsub("\"", "&quot;")
    str = str.gsub("<", "&lt;")
    str.gsub(">", "&gt;")
  end
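For comparison, the same escaping can be done in a single pass by handing gsub a hash of replacements. This is an alternative sketch, not the method used in DBSource; xml_escape and XML_ESCAPES are made-up names.

```ruby
# Map each special character to its XML entity reference.
XML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;",
  '"' => "&quot;", "'" => "&apos;"
}.freeze

# Escape a value for use as XML text. to_s lets nil and numbers through.
def xml_escape(value)
  value.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

puts xml_escape(%q{Tom & "Jerry" <tom@example.com>})
```

Because every character is visited once, the ordering problem of the chained version (the ampersand must be replaced first) does not arise.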
Where the
IOSource could use the
IO#read method to grab more data, the database
source uses
readRows. It takes a single parameter, which
specifies how many rows to read.
  def readRows(cnt)
    xml = ""
    if @firstRead
      xml = "<source>"
      @firstRead = false
    end
If this is the first time the
method is called, then the XML emitted needs to include
a one-time beginning tag for the root element. (For
the sake of simplicity I’ve hard-coded the
element names used for the root, and for each row.
You may prefer to have these as parameters passed to
initialize.)
The code tries to retrieve the
specified number of rows, incrementing the line number
by the number of rows actually returned. If
there’s an error, the code just bails, setting
@eof to true and releasing the statement
handle:
    begin
      rows = @dbh.fetch_many(cnt)
      @currLine += @dbh.rows
    rescue
      @eof = true
      @dbh.finish
      return ""
    end
If all went well, the code
iterates over the rows returned, converting the data
into XML. It uses the field names to create elements
for the data:
    if rows && !@eof
      rows.each{ |ro|
        xml << "<row>"
        begin
          ro.each_with_name do |val, name|
            xml << " <#{name}>" + textconv(val) + "</#{name}>\n"
          end
        rescue
          @eof = true
          @dbh.finish
          return xml
        end
        xml << "</row>"
      }
Again, if there is an error, the
code tries to clean up and get out.
If no rows could be fetched then
the code closes up the XML stream by emitting the end
root tag:
    else
      xml << "</source>"
      @eof = true
      @dbh.finish
    end
Finally, whatever XML has been
constructed is returned:
    xml
  end
end
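To try the chunking idea without a database, the sketch below produces the same kind of XML from an in-memory array. The array of hashes stands in for what a statement handle would return and each_slice stands in for fetch_many. The xml_chunks and escape helpers and the sample rows are hypothetical, and the escaping is a simplified one that covers text content only.

```ruby
# Hypothetical rows, standing in for a database result set.
ROWS = [
  { "id" => 1, "title" => "Ruby & XML" },
  { "id" => 2, "title" => "Stream parsing" },
  { "id" => 3, "title" => "Templates" }
].freeze

# Simplified escaping for text content.
def escape(text)
  text.to_s.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
end

# Build the XML in pieces: the root start tag rides along with the first
# batch, and the root end tag is a final piece once the rows run out.
def xml_chunks(rows, batch_size)
  chunks = []
  rows.each_slice(batch_size) do |batch|
    xml = String.new(chunks.empty? ? "<source>" : "")
    batch.each do |row|
      xml << "<row>"
      row.each { |name, value| xml << "<#{name}>#{escape(value)}</#{name}>" }
      xml << "</row>"
    end
    chunks << xml
  end
  chunks << (chunks.empty? ? "<source></source>" : "</source>")
end

xml_chunks(ROWS, 2).each { |chunk| puts chunk }
```

Joined together, the pieces form one well-formed document, which is what the parser sees as it asks the source for more text.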
Putting it Together
Now we can rework the first
application to use a
DBSource source and pipe the data into some
templates using the stream parser.
require "rexml/document"
require "ItemBuilderListener"
require "dbsource"
require "dbi"
include REXML
This version needs to
require
dbsource and
dbi to create the needed classes. It begins by
connecting to a MySQL database where article
information is stored. A DBI statement handle is
created by executing a SQL query to pull back all of
the articles. This statement handle is then used to
construct a
DBSource instance:
begin
  dbh = DBI.connect("DBI:Mysql:rubyxml_stuff", "user", "passwd")
  sth = dbh.execute("SELECT * FROM articles")
  dbSrc = DBSource.new sth
Two
File objects are created that point to the
needed templates:
  rowTmpl = File.new("rowTmpl.html")
  mainTmpl = File.new("sourceTmpl.html")
We use the same listener class as
before, but tell it to look for different root and item
elements:
  mylistener = ItemBuilderListener.new "source", "row", mainTmpl, rowTmpl
The code uses the REXML
Document stream parser to join data and
templates in commingled bliss; the results are then
retrieved using
get_interpolation:
  Document.parse_stream(dbSrc, mylistener)
  print mylistener.get_interpolation
end
I’m not showing the
templates and results here, partly to save space, and
partly because they won’t do you much good unless
you have the same table structure I use. They’re
like the first set of templates, though: they consist
of some basic HTML, with
#{variables} embedded. The variable names match
the names of the fields in the database table. Then,
once you have a
Source class producing XML, the parser and the
listener perform as they would with a file-based
Source.
Summary
This article explored using an XML
event stream to drive a data transformation process.
We saw how to use the stream parser from REXML, which
works with
StreamListeners and
Sources. We looked at a
StreamListener class that builds an output
string by reading in an XML file and repeatedly
populating a simple template. We also saw that the
REXML stream parser can work with any class that
implements the
Source API, and wrote a
Source class that wraps a database statement
handle.
If you were not familiar with
event-based XML processing, I hope this article piqued
your interest and gave you enough information to
explore on your own. Source code may be downloaded
here. As always, please send any comments,
questions, or corrections to jbritt@rubyxml.com
Additional Resources
SAX
The SAX Project homepage: http://www.saxproject.org
REXML
The REXML homepage:
http://www.germane-software.com/~ser/Software/rexml/
"DOM and SAX Are Dead, Long Live DOM and
SAX"
Article by Kendall Grant Clark:
http://www.xml.com/pub/a/2001/11/14/dom-sax.html