Return to the main page

MySQL & Ruby Electric XML

James Britt, October 31, 2001

 

 

Introduction

When I started rubyxml.com I wanted to use, well, Ruby and XML to build the site.  In some cases it made sense: the RAA (Ruby application archive) information is available via an XML-RPC call, and the XML news is from the xmlhack.com RSS feed.  I thought about using XML for the main articles, and so I started writing the article information as an RDF file.  It occurred to me, though, that as time went on the front page would not be able to list all of the available stories.  There would be too many items for a single page, so only the most recent five or ten items should be displayed.  I figured that a database holding the articles (or at least the RDF data) would be a good approach.  Each entry would have a time stamp, and the items could be dynamically selected based on the article dates.  It would also make searching for past articles much easier.

 

On the other hand, I liked the idea of having an RDF/RSS file available for anyone interested in tracking what was on the site that day.  I decided that I would continue to load the main page from an RDF file, but have that file automatically created from a table in a MySQL database. 

 

This article will describe how to use REXML with MySQL to create an XML file.  I’ll show how to create a class that provides a simple interface for querying a database and getting the data back as XML.

 

MySQL

This site is called rubyxml.com, not rubysql.com, so I’m not going to say too much about SQL or MySQL other than what’s needed to get us going.  I’ve gotten the Ruby MySQL package (available at http://www.tmtm.org/mysql/ruby/) running on RedHat Linux 6, but have not yet attempted it on Windows.  As a general rule I prefer to avoid packages that require compiling native code, and was frankly amazed how easy it was to get MySQL going with Ruby.  In fact, I managed to compile the code on my home PC, and uploaded the final binaries to the production server (where I have rather limited, roundabout compilation access) where they pretty much just worked.

 

Getting MySQL up, running, and playing nicely with Ruby is not that difficult if you follow the instructions in the readme file.  Here are some hints, though, if you have problems: First, the make file required to build the MySQL shared object is created by running extconfig.rb.  This script accepts command-line parameters, and you may need to pass in the paths to your ‘lib’ and ‘include’ directories so that the resulting make file knows where to find required code.  If make complains, though, use ‘locate’ or ‘find’ to see where the unfound objects are, adjust the parameters to extconfig.rb, and run it again to regenerate the make file.   Also, you may have to add additional path information to your LD_LIBRARY_PATH environment variable so that the Ruby msql.so code can find libmysqlclient.so.

 

The Ruby MySQL readme file list the classes and methods provided in the package, but does not explain what they actually do.  This is no big deal, though, as the method names map nicely to the API as defined in the standard MySQL documentation, available on the Web at www.mysql.org.

Ruby Electric XML

Ruby Electric XML (REXML, at http://www.germane-software.com/~ser/Software/rexml/) is a Ruby package for creating and manipulating XML documents.  Unlike libgdome-ruby or XMLParser, REXML does not require additional native-code bindings.  REXML was written by Sean Russell, and it was inspired by the Electric XML Java™ package (http://www.themindelectric.com/products/xml/xml.html). 

 

It allows one to create, access, and modify XML documents in a way that should be fairly intuitive to even beginner Ruby developers. While an early version was based on a Java™ package, Sean made some changes to the API to make it more Ruby-like. So, where the Java™ Electric XML provides an easier-than-the-DOM API, REXML gives us an even-more-easier-than-the-DOM API.

 

I highly recommend that you read the documentation for REXML, including the overview, the tutorial, and the API docs.  These come bundled with the package download.  What I'll try to cover here are details of a specific task, enough to get one started using REXML.

 

REXML came about form Sean's desire to have a simple, “Ruby way” to work with XML.  The W3C, after presenting XML, created an API specification for the XML document object model, or DOM.  The spec basically described the DOM as an ordered set of "node" objects.  There was a definition for a base node interface, and the various DOM parts (elements, attributes, processing instructions, etc.) implemented the interface in ways appropriate for each particular type.  This was in some ways a nice, clean abstraction ("everything's a node"), but in practice it could often be confusing.  Yes, everything was a node, but everything was also its own special brand of node.  So, some nodes had children, some didn't.  Some had attribute lists, some didn't. 

 

To be fair, the W3C was trying to design a spec that was not bound to the quirks of any particular language.  It would have perhaps been unfair to describe things in a way that were easy to implement in, say, a weakly-typed, OO language like Ruby that would have tripped up less flexible languages (though they did provide language bindings for ECMA Script and Java™).

 

Having a single API available regardless of the particular implementation should have made things easier for developers to learn.  But the DOM API itself is not always easy to get one's head around.  For example, to add a new element to a document you first have to use that document object as an element factory by calling yourDOM.createElement(elementName).  Adding text to the element requires creating a new Text node object, which is then appended to the element object.  Similarly, methods for managing attributes are not directly part of the element object.  It seemed like everything took at least three steps.

 

I've spent fair amount of time writing code around XML, using, at different times, VB, Perl, Java, and JavaScript.  While a common API was convenient, it obscured an interesting point: depending on what the code was supposed to do, I often wasn't interested in the XML DOM per se, just XML.  For example, a great deal of the code was designed to construct XML strings and move them around.  Different bits of code would snarf out content, other code would simply aggregate other document chunks.  I rarely cared about DTDs or validation, or entity references, or PIs.  These just never played a role in the application. So why did I have to wrestle with an API intended to manage all of that?

 

On one job I did get to use a custom XML parser that was lightning quick.  It had a stripped down API designed to do just what was needed in way that let you get the job done with a minimum of fuss.  This didn't please some purists who felt obligated to follow every recommendation of a vendor consortium. It did please those who needed to get a job done.

 

REXML, and Ruby in general, are intended to work much the same way: let the work get done without getting in the way.  The REXML API lets you create an element as easy as this:

 

el = Element.new("someElementName")

el.text = "My new element."

 

Wasn't that easy?  Creating documents is just as easy:

 

doc = Document.new("<myDoc/>")

 

Installing REXML is also easy.  Since the package does not require any native code compilation, you simply need to download the source archive (http://www.germane-software.com/~ser/Software/rexml/), extract the files, and run the install.rb installation script.  If for some reason you have a problem, you can just copy the rexml directory over to your Ruby lib directory (e.g. /lib/ruby/site_ruby/1.6/rexml).

Some REXML Basics

In the W3C DOM world there's an über-node off which all other nodes are spawned.  Nodes, in general, only exist in the context of other nodes. In the REXML world there are various classes representing the different types of DOM nodes; any of these can be created simply by calling the appropriate new.  In a sense it’s a bit like Tinker Toys in that you can create the parts you need as you need them, then snap things together to construct a document.

 

There a few ways you can create a document. Perhaps the simplest way is to call Document.new with either a String or IO  (e.g. File) object:

 

require "rexml/document"

include REXML

doc  = Document.new("<doc/>")

 

or

 

require "rexml/document"

include REXML

xmlfile = File.new( "someWellFormedXmlFile.xml" )

doc     = Document.new( xmlfile )

 

Note that the use of ‘include REXML’ is what allows us to refer to Document rather than REXML::Document.

 

Once we have a Document we can start adding content.  It’s easy to add a new element to the Document’s root element:

 

doc = Document.new("<doc/>") # - > <doc/>

el  = Element.new "example"  # - > <example/>

rt  = doc.root                

rt << el                      # - > doc now has <doc><example/></doc>

 

In this example, we first create a new, sparse document.  Next we create a new Element object. Finally, we get a reference to the document’s root element, and append to it the newly created Element

 

If we wanted our new element to have some content, we could have added it just as easily:

 

doc     = Document.new("<doc/>") # - > <doc/>

el      = Element.new "example"  # - > <example/>

el.text = "foobar"               # - > <example>foobar</example>

rt      = doc.root                 

rt << el              # - > doc now has <doc><example>foobar</example></doc>

 

Likewise, we can also add attributes:

 

doc = Document.new("<doc/>") # - > <doc/>

el  = Element.new "example"  # - > <example/>

el.text = "foobar"           # - > <example>foobar</example>

el.attributes["date"] = "20011021" # - > <example date='20011021'>foobar</example>

rt  = doc.root                

rt << el    # - > doc now has <doc><example date='20011021'>foobar</example></doc>

 

Once we have our document, we can get to the actual XML like this:

 

mystr = ""

doc.write mystr  # appends XML to mystr

puts mystr

 

or

 

saveTo = File.new("saveme.xml", "w")  # create File

doc.write saveTo                      # Append XML to file.

 

By the way, the write method uses the << operator. If you are writing to a string, the XML will be appended to whatever the variable currently holds.

Hooking Up with MySQL

We know enough about REXML to get started doing some database work.  The plan is to connect to a database, query a table, and iterate over the result set, creating an XML string of the resulting data.  We’ll do this by creating a class to wrap the details of database connection and XML building. 

 

Our file will be called sqlxml.rb, and it begins with calls to require some modules:

 

require "mysql"

require "rexml/document"

include REXML

 

The file will define a single class, named SqlXml.  There are a number of attributes that will be used to set the database, user name and password to connect to the database, the name of the host machine where the database lives:

 

class SqlXml

 attr_accessor :host, :user, :passwd, :db

 

We also expose an attribute for defining the name to use for the resulting document’s root element, the name of the element that holds each row of data, and an attribute for getting at the resulting XML:

 

 attr_accessor  :rootEl, :rowEl

 attr_reader :xml

 

The initialization method is used to set up the required information to locate and connect to the database, as well as initializing other variables for use later:

 

def initialize h, u, pw, d, rtEl, roEl

    @host = h

    @user = u

    @passwd = pw

    @db = d

    @table = nil

    @rootEl = rtEl

    @rowEl = roEl

    @xml = ""

    @doc = Document.new "<#{@rootEl}/>"

  end

 

In general, before putting any data into an XML document, one needs to be sure it doesn’t contain any “problem” characters.  Some characters have special meaning to an XML parser, so their occurrence must be restricted.  For example the & (ampersand) character is reserved to indicate the start of an entity or character reference.  These are an XML way to indicate either content substitution (entity) or to refer to a specific UNICODE character (character reference).  For example, &#160; would refer to the character for a non-breaking space; &copy; is the entity reference for the copyright symbol. In an XML document we can indicate a literal ampersand by using the entity reference: &amp; Note that in XML, your entities must be defined before being used, with these exceptions: &amp; (ampersand), &lt; (“less than”), &gt; (“greater than”), &quote;(quote), and  &apos; (apostrophe).  Character references do not require prior declaration, as they use explicit numerical values to indicate a particular UNICODE character. 

 

So, unless it is part of a known entity, any instance of an ampersand must be encoded with &amp;. Similarly, the < (“less than”) character is reserved for the start of tag markup.  Whenever an XML parser encounters this character it expects to see a tag, and will throw an error if this is not the case.  So, as with the ampersand, we need to replace the “less-than” character with the entity reference &lt;.

 

As it happens, though, REXML will handle this for us.  When adding text to a document, REXML automatically encodes the &, <, and > characters.  (It ignores the ampersand if it appears as part of &amp;, &lt;, or &gt;.)

 

Suppose we create a document, as follows:

 

doc     = Document.new("<doc/>")

el      = Element.new "foo" 

 

We then ad some text containing a “problem” character:

 

el.text = "Drum & Bass"               

 

REXML converts the ampersand character into the corresponding entity reference:

 

rt  = doc.root                

rt << el              # - > doc now has <doc><foo>Drum &amp; Bass</foo></doc>

 

However, if your character data includes certain entity references they will not altered, as this alternative version shows:

 

el.text = "Drum &amp; Bass"    # Text contains built-in entity reference

rt  = doc.root                

rt << el              # - > doc now has <doc><foo>Drum &amp; Bass</foo></doc>

 

 

Important note:  As of when this article was written, REXML version 1.1a1 did not recognize either &apos; or &quot; as acceptable entity references that can be safely ignored.  So, any occurrence of these in the document content data will have the leading ampersand converted into &amp;.  &apos; and &quot; will be converted to &amp;apos; and &amp;quot;. A consequence is that you cannot safely put table data into an attribute value without mangling any occurrence of the apostrophe character. REXML creates attributes using single-quote marks (i.e., the  apostrophe);  that same character cannot also appear in the attribute value. Apostrophes, therefore, must be encoded inside attribute values, but will be double-encoded once passed through REXML.  

 

We’re now ready to write the main method that actually fetches the data as an XML document.  It will take two arguments: An SQL query string, and a string with the name of the root element:

 

  def fetchData(q, rtel)

    @rootEl = rtel

    @doc = Document.new "<#{@rootEl}/>"

    rt = @doc.root

 

The method begins by setting up the document, creating a minimal document with a single, empty, root element.  It initializes the variable rt with the document root, and then tries to connect to the database:

 

    $my = Mysql.connect(@host, @user, @passwd)

 

Once a connection is made, the database is selected and the SQL executed:

 

    $my.select_db(@db)

    resultSet = $my.query(q)

 

The resultset is stored in resultSet, over which the code iterates while it builds the XML. resultSet holds a set of hashes; each hash retrieved is a row, with the column name as the key.

 

    res.each_hash{ |h|

      item = Element.new @rowEl

 

Each row of data will be stored in element named from the value of @rowEl.  The field names will become child elements, with the retrieved data as content:

 

      h.each {|key, value|

 

So, for each column in the row, an element is created using the key:

 

        col = Element.new key

 

The element’s content is the field’s data:

 

        col.text = value

 

The new element is appended to the “item” element as a child element:

 

        item.elements << col

      }

 

At the end of each row of data, the “row” element as appending as a child to the document’s root element:

 

      rt.elements << item

    }

 

At the end of the method, the document’s XML is “written” to a string variable, and returned as the method’s value:

 

    @doc.write @xml

    @xml

  end

 

Trying it Out

Of course, we want to see this work, so at the end of class file we’ll put a simple test.  It makes some assumptions about where the database is, how to log on, etc.   Change these values to match your own setup.

 

$user = "foo"

$pwd  = "bar"

$host = "localhost"

$db   = "someDb"

$rtElName = "data"

$roElName = "row"

$table  = "Articles"

begin

 $q = " SELECT * FROM #{$table} "

  xmlsql = SqlXml.new $host, $user, $pwd, $db, $rtElName, $roElName

  puts xmlsql.fetchData $q

end

 

When you run this, you should get output formatted like this:

 

<data>

  <row>

    <Active>1</Active>

    <Title>Greetings</Title>

    <CreationDate>20011018213129</CreationDate>

    <RecordID>4</RecordID>

    <Author>James</Author>

    <Body>This site has been set up to provide a focal point for all things related to Ruby &amp;amp; XML. </Body>

    <Deleted>0</Deleted>

    <Synopsis>This site has been set up to provide a focal point for all things related to Ruby +  XML.

The site is constructed using Ruby and XML: It uses a combination of the ruby-tmpl package (http://sourceforge.net/projects/ruby-tmpl), NQXML, Ruby XML-RPC, and XS</Synopsis>

    <Category>Administrivia</Category>

  </row>

  <row>

    <Active>1</Active>

    <Title>Ruby objects &lt;-&gt; XML: anyone working on this?</Title>

    <CreationDate>20011018212406</CreationDate>

    <RecordID>3</RecordID>

    <Author></Author>

    <Body>A thread was started (Sept 15, 01) in the ruby-talk mailing list about serializing Ruby objects as XML. See message 21218 in the ruby-talk archives. </Body>

    <Deleted>0</Deleted>

    <Synopsis>A thread was started (Sept 15, 01) in the ruby-talk mailing list about serializing Ruby objects as XML. See message 21218 in the ruby-talk archives.</Synopsis>

    <Category>Ruby-talk</Category>

  </row>

</data>

 

Of course, the actual element names depend on the field names in your table.

RSS?

Now, at the beginning I mentioned that I wanted to create an RSS file from the table data.  I must now confess that, while this intent is true, it has not in fact been realized.  But I felt I would be remiss if I just ended this article while falling short of that goal.

 

The problem (or challenge, if you like) is that an RSS file consists of two parts: There is a one-time “header” section that provides some details about the RSS source:

 

 <rss version="0.91">

  <channel>

   <title>RAA updates</title>

    <link>http://www.rubyxml.com </link>

    <description>Ruby+XML RAA information </description>

    <language>us-en</language>

    <copyright>Copyright 2001, James Britt</copyright>

    <managingEditor></managingEditor>

    <webMaster></webMaster>

    <image>

      <title>Ruby XML</title>

      <url>http://www.rubyxml.com/images/logo01.jpg</url>

      <width>126</width>

      <height>105</height>

      <description> Ruby/XML developer information </description>

    </image>

 

    <!-- remaining RSS file ... -->

 

This is followed by some number of item entries:

 

<item>

   <title>

     sablot

   </title>

   <category>XML</category>

   <link></link>

   <description>

     This is the experimental version of a Ruby interface to

     Sablotron, the XSLT processor developed by Ginger Alliance.

   </description>

  </item>

 

The sqlxml code shown so far is quite capable of creating the set of items, provided the table field names match the item elements.  We would just need to name the row element “item”,  and create a document with a root element  of … what?  Well, it doesn’t matter, since we can use the item data as a source of elements to insert into a final RSS document.

 

RSS files have a root element of rss, immediately followed by a channel element, which contains the header data.  So, we would first create a main RSS text file, like this:

 

<rss version="0.91">

 <channel>

  <title>rubyxml news</title>

  <link>http://www.rubyxml.com </link>

  <description>Ruby+XML developer information </description>

  <language>us-en</language>

  <copyright>Copyright 2001, James Britt</copyright>

  <managingEditor>Arthur Gordon Pym</managingEditor>

  <webMaster>E.B. White</webMaster>

  <image>

   <title>Ruby XML</title>

   <url>http://www.rubyxml.com/images/logo01.jpg</url>

   <width>126</width>

   <height>105</height>

   <description> RubyXML developer information </description>

  </image>

 </channel>

</rss>

 

We’ll call this file rssheader.xml.

 

Next,  we’ll write a small program that uses this file to create a new REXML Document, grabs the item data from the database, and inserts each item into the main REXML document.

 

 

The code requires our SqlMml class, so that part goes first.  Next, we begin the main body of code by defining a SQL query to get the data:

 

require  'sqlxml'

 

begin

  $q = " SELECT * FROM articles  WHERE  Active = 1 ORDER BY CreationDate"

 

We create a new SqlXml object by specifying the various database and XML parameters:

 

  xmlsql = SqlXml.new "localhost", "root", nil, "rubyxml_com", "channel", "item"

 

The item data XML is then retrieved, and used to create a new Document object:

 

  itemData = xmlsql.fetchData $q

  itemDoc = Document.new itemData

 

We create out main document by passing an IO object to the Document initialize method:

 

  header = File.new "rssheader.xml"

  mainDoc = Document.new header

 

Now we need to take each item and add it to the main document.  First, we get the insertion point by selecting the channel element of the main document.  REXML provides some XPath power so that we can use path notation to retrieve an element:

 

  ch  = mainDoc.elements["rss/channel"]

 

Then we iterate over each child element in the item XML, and add it as a child of the main document’s channel element:

 

  itemDoc.root.each_element{ |el|

     ch.add_element el

  }

 

Finally, we emit the accumulated XML:

 

  xml = ""

  mainDoc.write xml

  puts xml

end

 

Or, we could instead write the XML to a file:

 

  xml = File.new "rss.xml", "w"

  mainDoc.write xml

 Summary

This article only scratched the surface of REXML. We saw the basics: creating a document, dynamically adding elements and data, turning the document into an XML string.  Hopefully it’s enough to get you started so that you can explore the full API.  I want to thank Sean Russell for his work, and for pointing out some errors in an early version of this article. Any remaining errors are solely my responsibility. 

 

You can download the example code from here.

 

Please send any comments, questions, or corrections to jbritt@rubyxml.com

 

 
Please send questions or comments to webmaster@rubyxml.com