MySQL & Ruby Electric XML
James Britt, October 31, 2001
Introduction
When I started rubyxml.com I wanted to
use, well, Ruby and XML to build the site.
In some cases it made sense: the RAA (Ruby application archive)
information is available via an XML-RPC call, and the XML news is from the
xmlhack.com RSS feed. I thought about
using XML for the main articles, and so I started writing the article
information as an RDF file. It occurred
to me, though, that as time went on the front page would not be able to list
all of the available stories. There
would be too many items for a single page, so only the most recent five or ten
items should be displayed. I figured
that a database holding the articles (or at least the RDF data) would be a good
approach. Each entry would have a time
stamp, and the items could be dynamically selected based on the article
dates. It would also make searching for
past articles much easier.
On the other hand, I liked the idea of having an RDF/RSS
file available for anyone interested in tracking what was on the site that
day. I decided that I would continue to
load the main page from an RDF file, but have that file automatically created
from a table in a MySQL database.
This article will describe how to use REXML with MySQL to
create an XML file. I’ll show how to
create a class that provides a simple interface for querying a database and
getting the data back as XML.
MySQL
This site is called rubyxml.com, not rubysql.com, so I’m not going to say too much about
SQL or MySQL other than what’s needed to get us going. I’ve gotten the Ruby MySQL package
(available at http://www.tmtm.org/mysql/ruby/)
running on RedHat Linux 6, but have not yet attempted it on Windows. As a general rule I prefer to avoid packages
that require compiling native code, and was frankly amazed how easy it was to
get MySQL going with Ruby. In fact, I
managed to compile the code on my home PC, and uploaded the final binaries to
the production server (where I have rather limited, roundabout compilation
access) where they pretty much just worked.
Getting MySQL up, running, and playing nicely with Ruby is
not that difficult if you follow the instructions in the readme
file. Here are some hints, though, if
you have problems: First, the make file required to build the MySQL
shared object is created by running extconfig.rb. This script accepts command-line parameters,
and you may need to pass in the paths to your ‘lib’ and ‘include’ directories that the resulting make file knows where to find
required code. If make
complains, though, use ‘locate’ or ‘find’ to see where the unfound objects are,
adjust the parameters to extconfig.rb,
and run it again to regenerate the make file. Also, you may have to add additional path information to your LD_LIBRARY_PATH environment variable so that the
Ruby msql.so code can find libmysqlclient.so.
The Ruby MySQL readme
file list the classes and methods provided in the package, but does not explain
what they actually do. This is no big
deal, though, as the method names map nicely to the API as defined in the
standard MySQL documentation, available on the Web at www.mysql.org.
Ruby Electric XML
Ruby Electric XML (REXML, at http://www.germane-software.com/~ser/Software/rexml/)
is a Ruby package for creating and manipulating XML documents. Unlike libgdome-ruby
or XMLParser, REXML does not require additional
native-code bindings. REXML was written
by Sean Russell, and it was inspired by the Electric XML Java™ package (http://www.themindelectric.com/products/xml/xml.html).
It allows one to create, access, and modify XML documents in
a way that should be fairly intuitive to even beginner Ruby developers. While
an early version was based on a Java™ package, Sean made some changes to the
API to make it more Ruby-like. So, where the Java™ Electric XML provides an
easier-than-the-DOM API, REXML gives us an even-more-easier-than-the-DOM
API.
I highly recommend that you read the documentation for
REXML, including the overview, the tutorial, and the API docs. These come bundled with the package
download. What I'll try to cover here
are details of a specific task, enough to get one started using REXML.
REXML came about form Sean's desire to have a simple, “Ruby
way” to work with XML. The W3C, after
presenting XML, created an API specification for the XML document object model,
or DOM. The spec basically described
the DOM as an ordered set of "node" objects. There was a definition for a base node
interface, and the various DOM parts (elements, attributes, processing
instructions, etc.) implemented the interface in ways appropriate for each
particular type. This was in some ways
a nice, clean abstraction ("everything's a node"), but in practice it
could often be confusing. Yes,
everything was a node, but everything was also its own special brand of
node. So, some nodes had children, some
didn't. Some had attribute lists, some
didn't.
To be fair, the W3C was trying to design a spec that was not
bound to the quirks of any particular language. It would have perhaps been unfair to describe things in a way
that were easy to implement in, say, a weakly-typed, OO language like Ruby that
would have tripped up less flexible languages (though they did provide language
bindings for ECMA Script and Java™).
Having a single API available regardless of the particular
implementation should have made things easier for developers to learn. But the DOM API itself is not always easy to
get one's head around. For example, to
add a new element to a document you first have to use that document object as
an element factory by calling yourDOM.createElement(elementName). Adding text to the element requires creating
a new Text node object, which is then appended to the element object. Similarly, methods for managing attributes
are not directly part of the element object.
It seemed like everything took at least three steps.
I've spent fair amount of time writing code around XML,
using, at different times, VB, Perl, Java, and JavaScript. While a common API was convenient, it
obscured an interesting point: depending on what the code was supposed to do, I
often wasn't interested in the XML DOM per se, just XML. For example, a great deal of the code was
designed to construct XML strings and move them around. Different bits of code would snarf out
content, other code would simply aggregate other document chunks. I rarely cared about DTDs or validation, or
entity references, or PIs. These just
never played a role in the application. So why did I have to wrestle with an
API intended to manage all of that?
On one job I did get to use a custom XML parser that was
lightning quick. It had a stripped down
API designed to do just what was needed in way that let you get the job done
with a minimum of fuss. This didn't
please some purists who felt obligated to follow every recommendation of a
vendor consortium. It did please those who needed to get a job done.
REXML, and Ruby in general, are intended to work much the
same way: let the work get done without getting in the way. The REXML API lets you create an element as
easy as this:
el =
Element.new("someElementName")
el.text = "My new element."
Wasn't that easy?
Creating documents is just as easy:
doc =
Document.new("<myDoc/>")
Installing REXML is also easy. Since the package does not require any native code compilation,
you simply need to download the source archive (http://www.germane-software.com/~ser/Software/rexml/),
extract the files, and run the install.rb installation
script. If for some reason you have a
problem, you can just copy the rexml directory over to
your Ruby lib directory (e.g. /lib/ruby/site_ruby/1.6/rexml).
Some REXML Basics
In the W3C DOM world there's an über-node off which
all other nodes are spawned. Nodes, in
general, only exist in the context of other nodes. In the REXML world there are
various classes representing the different types of DOM nodes; any of these can
be created simply by calling the appropriate new. In a sense it’s a bit like Tinker Toys in
that you can create the parts you need as you need them, then snap things
together to construct a document.
There a few ways you can create a document. Perhaps the
simplest way is to call Document.new with either a String or IO (e.g. File)
object:
require "rexml/document"
include REXML
doc
= Document.new("<doc/>")
or
require "rexml/document"
include REXML
xmlfile = File.new(
"someWellFormedXmlFile.xml" )
doc
= Document.new( xmlfile )
Note that the use of ‘include REXML’
is what allows us to refer to Document rather than REXML::Document.
Once we have a Document we can start
adding content. It’s easy to add a new
element to the Document’s root element:
doc = Document.new("<doc/>")
# - > <doc/>
el
= Element.new "example"
# - > <example/>
rt
= doc.root
rt << el # - > doc now has <doc><example/></doc>
In this example, we first create a new, sparse
document. Next we create a new Element object. Finally, we get a reference to the
document’s root element, and append to it the newly created Element.
If we wanted our new element to have some content, we could
have added it just as easily:
doc
= Document.new("<doc/>") # - > <doc/>
el
= Element.new "example"
# - > <example/>
el.text = "foobar" # - >
<example>foobar</example>
rt
= doc.root
rt << el # - > doc now has <doc><example>foobar</example></doc>
Likewise, we can also add attributes:
doc = Document.new("<doc/>")
# - > <doc/>
el =
Element.new "example" # - >
<example/>
el.text = "foobar" # - > <example>foobar</example>
el.attributes["date"] =
"20011021" # - > <example
date='20011021'>foobar</example>
rt =
doc.root
rt << el # - > doc now has <doc><example date='20011021'>foobar</example></doc>
Once we have our document, we can get to the actual XML like
this:
mystr = ""
doc.write mystr # appends XML to mystr
puts mystr
or
saveTo = File.new("saveme.xml",
"w") # create File
doc.write saveTo # Append XML to file.
By the way, the write method uses the << operator. If you are writing to a
string, the XML will be appended to whatever the variable currently
holds.
Hooking Up with MySQL
We know enough about REXML to get started doing some
database work. The plan is to connect
to a database, query a table, and iterate over the result set, creating an XML
string of the resulting data. We’ll do
this by creating a class to wrap the details of database connection and XML
building.
Our file will be called sqlxml.rb,
and it begins with calls to require some modules:
require "mysql"
require "rexml/document"
include REXML
The file will define a single class, named SqlXml.
There are a number of attributes that will be used to set the database,
user name and password to connect to the database, the name of the host machine
where the database lives:
class SqlXml
attr_accessor :host, :user, :passwd, :db
We also expose an attribute for defining the name to use for
the resulting document’s root element, the name of the element that holds each
row of data, and an attribute for getting at the resulting XML:
attr_accessor :rootEl,
:rowEl
attr_reader :xml
The initialization method is used to set up the required
information to locate and connect to the database, as well as initializing
other variables for use later:
def initialize h, u, pw, d, rtEl, roEl
@host = h
@user = u
@passwd = pw
@db = d
@table = nil
@rootEl = rtEl
@rowEl = roEl
@xml = ""
@doc = Document.new "<#{@rootEl}/>"
end
In general, before putting any data into an XML document,
one needs to be sure it doesn’t contain any “problem” characters. Some characters have special meaning to an
XML parser, so their occurrence must be restricted. For example the & (ampersand)
character is reserved to indicate the start of an entity or character
reference. These are an XML way to
indicate either content substitution (entity) or to refer to a specific UNICODE
character (character reference). For
example,   would refer to the character for
a non-breaking space; © is the entity reference
for the copyright symbol. In an XML document we can indicate a literal
ampersand by using the entity reference: & Note
that in XML, your entities must be defined before being used, with these
exceptions: & (ampersand), < (“less than”), >
(“greater than”), "e;(quote), and '
(apostrophe). Character
references do not require prior declaration, as they use explicit numerical
values to indicate a particular UNICODE character.
So, unless it is part of a known entity, any instance of an
ampersand must be encoded with &. Similarly, the
< (“less than”) character is reserved for the
start of tag markup. Whenever an XML
parser encounters this character it expects to see a tag, and will throw an
error if this is not the case. So, as
with the ampersand, we need to replace the “less-than” character with the
entity reference <.
As it happens, though, REXML will handle this for us. When adding text to a document, REXML
automatically encodes the &, <,
and >
characters. (It ignores the ampersand
if it appears as part of &, <, or >.)
Suppose we create a document, as follows:
doc
= Document.new("<doc/>")
el
= Element.new "foo"
We then ad some text containing a “problem” character:
REXML converts the ampersand character into the
corresponding entity reference:
rt =
doc.root
rt << el # - > doc now has <doc><foo>Drum
& Bass</foo></doc>
However, if your character data includes certain entity
references they will not altered, as this alternative version shows:
el.text = "Drum &
Bass" # Text contains built-in
entity reference
rt =
doc.root
rt << el # - > doc now has <doc><foo>Drum
& Bass</foo></doc>
Important note:
As of when this article was written, REXML version 1.1a1 did not
recognize either ' or " as acceptable entity
references that can be safely ignored.
So, any occurrence of these in the document content data will have the leading
ampersand converted into &. '
and " will be converted to &apos; and &quot;. A
consequence is that you cannot safely put table data into an attribute value
without mangling any occurrence of the apostrophe character. REXML creates
attributes using single-quote marks (i.e., the
apostrophe); that same character
cannot also appear in the attribute value. Apostrophes, therefore, must be
encoded inside attribute values, but will be double-encoded once passed through
REXML.
We’re now ready to write the main method that actually
fetches the data as an XML document. It
will take two arguments: An SQL query string, and a string with the name of the
root element:
def
fetchData(q, rtel)
@rootEl = rtel
@doc = Document.new "<#{@rootEl}/>"
rt = @doc.root
The method begins by setting up the document, creating a
minimal document with a single, empty, root element. It initializes the variable rt
with the document root, and then tries to connect to the database:
$my = Mysql.connect(@host, @user, @passwd)
Once a connection is made, the database is selected and the
SQL executed:
$my.select_db(@db)
resultSet = $my.query(q)
The resultset is stored in resultSet,
over which the code iterates while it builds the XML. resultSet
holds a set of hashes; each hash retrieved is a row, with the column name as
the key.
res.each_hash{ |h|
item = Element.new @rowEl
Each row of data will be stored in element named from the
value of @rowEl.
The field names will become child elements, with the retrieved data as
content:
So, for each column in the row, an element is created using
the key:
The element’s content is the field’s data:
The new element is appended to the “item” element as a child
element:
At the end of each row of data, the “row” element as
appending as a child to the document’s root element:
At the end of the method, the document’s XML is “written” to
a string variable, and returned as the method’s value:
Trying it Out
Of course, we want to see this work, so at the end of class
file we’ll put a simple test. It makes
some assumptions about where the database is, how to log on, etc. Change these values to match your own
setup.
$user = "foo"
$pwd
= "bar"
$host = "localhost"
$db
= "someDb"
$rtElName = "data"
$roElName = "row"
$table
= "Articles"
begin
$q =
" SELECT * FROM #{$table} "
xmlsql = SqlXml.new $host, $user, $pwd, $db, $rtElName, $roElName
puts xmlsql.fetchData $q
end
When you run this, you should get output formatted like
this:
<data>
<row>
<Active>1</Active>
<Title>Greetings</Title>
<CreationDate>20011018213129</CreationDate>
<RecordID>4</RecordID>
<Author>James</Author>
<Body>This site has been set up to provide a focal point for all
things related to Ruby &amp; XML. </Body>
<Deleted>0</Deleted>
<Synopsis>This site has been set up to provide a focal point for
all things related to Ruby + XML.
The site is constructed using Ruby and XML:
It uses a combination of the ruby-tmpl package
(http://sourceforge.net/projects/ruby-tmpl), NQXML, Ruby XML-RPC, and XS</Synopsis>
<Category>Administrivia</Category>
</row>
<row>
<Active>1</Active>
<Title>Ruby objects <-> XML: anyone working on
this?</Title>
<CreationDate>20011018212406</CreationDate>
<RecordID>3</RecordID>
<Author></Author>
<Body>A thread was started (Sept 15, 01) in the ruby-talk mailing
list about serializing Ruby objects as XML. See message 21218 in the ruby-talk
archives. </Body>
<Deleted>0</Deleted>
<Synopsis>A thread was started (Sept 15, 01) in the ruby-talk
mailing list about serializing Ruby objects as XML. See message 21218 in the
ruby-talk archives.</Synopsis>
<Category>Ruby-talk</Category>
</row>
</data>
Of course, the actual element names depend on the field
names in your table.
RSS?
Now, at the beginning I mentioned that I wanted to create an
RSS file from the table data. I must
now confess that, while this intent is true, it has not in fact been
realized. But I felt I would be remiss
if I just ended this article while falling short of that goal.
The problem (or challenge, if you like) is that an RSS file
consists of two parts: There is a one-time “header” section that provides some
details about the RSS source:
<rss version="0.91">
<channel>
<title>RAA updates</title>
<link>http://www.rubyxml.com </link>
<description>Ruby+XML RAA information </description>
<language>us-en</language>
<copyright>Copyright 2001, James Britt</copyright>
<managingEditor></managingEditor>
<webMaster></webMaster>
<image>
<title>Ruby XML</title>
<url>http://www.rubyxml.com/images/logo01.jpg</url>
<width>126</width>
<height>105</height>
<description> Ruby/XML developer information </description>
</image>
<!-- remaining RSS file ... -->
This is followed by some number of item entries:
<item>
<title>
sablot
</title>
<category>XML</category>
<link></link>
<description>
This is the experimental version of a
Ruby interface to
Sablotron, the XSLT processor developed
by Ginger Alliance.
</description>
</item>
The sqlxml code shown so far is
quite capable of creating the set of items, provided the table field names
match the item elements. We would just
need to name the row element “item”, and
create a document with a root element
of … what? Well, it doesn’t
matter, since we can use the item data as a source of elements to insert into a
final RSS document.
RSS files have a root element of rss,
immediately followed by a channel element, which contains
the header data. So, we would first
create a main RSS text file, like this:
<rss version="0.91">
<channel>
<title>rubyxml news</title>
<link>http://www.rubyxml.com </link>
<description>Ruby+XML developer information </description>
<language>us-en</language>
<copyright>Copyright 2001, James Britt</copyright>
<managingEditor>Arthur Gordon Pym</managingEditor>
<webMaster>E.B. White</webMaster>
<image>
<title>Ruby XML</title>
<url>http://www.rubyxml.com/images/logo01.jpg</url>
<width>126</width>
<height>105</height>
<description> RubyXML developer information </description>
</image>
</channel>
</rss>
We’ll call this file rssheader.xml.
Next, we’ll write a
small program that uses this file to create a new REXML Document,
grabs the item data from the database, and inserts each item into the main
REXML document.
The code requires our SqlMml
class, so that part goes first. Next,
we begin the main body of code by defining a SQL query to get the data:
require
'sqlxml'
begin
$q
= " SELECT * FROM articles
WHERE Active = 1 ORDER BY
CreationDate"
We create a new SqlXml object by
specifying the various database and XML parameters:
xmlsql = SqlXml.new "localhost", "root", nil,
"rubyxml_com", "channel", "item"
The item data XML is then retrieved, and used to create a
new Document object:
itemData = xmlsql.fetchData $q
itemDoc = Document.new itemData
We create out main document by passing an IO object to the
Document initialize method:
header = File.new "rssheader.xml"
mainDoc = Document.new header
Now we need to take each item and add it to the main
document. First, we get the insertion
point by selecting the channel element of the main document. REXML provides some XPath power so that we can use path notation to
retrieve an element:
ch =
mainDoc.elements["rss/channel"]
Then we iterate over each child element in the item XML, and add it as a child of the main
document’s channel element:
itemDoc.root.each_element{ |el|
ch.add_element el
}
Finally, we emit the accumulated XML:
xml
= ""
mainDoc.write xml
puts xml
end
Or, we could instead write the XML to a file:
xml
= File.new "rss.xml", "w"
mainDoc.write xml
Summary
This article only scratched the surface of REXML. We saw the
basics: creating a document, dynamically adding elements and data, turning the
document into an XML string. Hopefully
it’s enough to get you started so that you can explore the full API. I want to thank Sean Russell for his work,
and for pointing out some errors in an early version of this article. Any
remaining errors are solely my responsibility.
You can download the example code from here.
Please send any comments, questions, or corrections to jbritt@rubyxml.com