Wednesday 30 November 2011

Parsing XML using SAX

SAX, the Simple API for XML, is another take on parsing XML files. As opposed to the DOM approach, a SAX parser does not reconstruct the entire document tree in memory; instead, it operates in an event-driven manner. The SAX API allows clients to register callback functions for the items they are interested in (which can be text nodes, element nodes, processing instructions or comments). When the SAX parser encounters one of these items, an event is triggered and the corresponding callback function (if one is registered) is called, allowing the client application to take action. Another event is triggered when the end of such an item is encountered, and the matching callback, if any, is called.

Another difference between SAX and DOM is that there is no formal specification for SAX from a standards body such as the W3C; it is a de facto standard that originated as a Java API. Therefore the API is not guaranteed to be the same for all parsers. In practice, differences between implementations can be significant, requiring users to familiarize themselves with the particular flavor of SAX which their parser employs.

For example, Microsoft’s SAX parser is exposed through the ISAXXMLReader interface (http://msdn.microsoft.com/en-us/library/aa924174.aspx); to receive callbacks, the client implements ISAXContentHandler, whose startElement method is called for each opening tag. Fortunately, implementations outside the COM ugliness are also available for C++; one of the most popular choices is Xerces (http://xerces.apache.org/index.html), by Apache. A handler for it can look like this:
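// Called by the parser for every opening tag in the document.
void foobar_sax_handler::startElement(const XMLCh* const name,
                                      AttributeList& attributes)
{
    // transcode() allocates a new string in the local code page;
    // it must be freed with XMLString::release().
    char* element_name = XMLString::transcode(name);
    cout << "Element encountered: " << element_name << endl;
    XMLString::release(&element_name);
}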

This assumes that we have defined a class named foobar_sax_handler, which inherits from HandlerBase and overrides its startElement method. We can register an instance of this class to handle XML events with the setDocumentHandler method of the SAXParser class. Our customized parser can then be invoked using the parse( xml_file ) method of the SAXParser class, where xml_file represents the name of an XML file.

When the parse method is called, the startElement method will be called for each element, printing its name. Of course, inside this method we can perform more specific actions, for example printing or processing only certain elements.

SAX bindings are also available for other languages; for example, jssaxparser (http://code.google.com/p/jssaxparser/) is for JavaScript, and a SAX parser is also available for Java (javax.xml.parsers.SAXParser). The API varies between all these implementations, but the same principles apply; if one is familiar with using SAX in one programming language, switching to a different one will be relatively easy.
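
To illustrate how the same callback principle carries over to yet another language, here is a minimal sketch using Python's standard xml.sax module. The class name ElementLogger, the helper log_events and the sample document are made up for this illustration; only ContentHandler and parseString come from the library.

```python
import xml.sax


class ElementLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records one event per opening/closing tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Called by the parser for every opening tag.
        self.events.append(("start", name))

    def endElement(self, name):
        # Called by the parser for every closing tag.
        self.events.append(("end", name))


def log_events(xml_text):
    """Parse xml_text and return the list of recorded events."""
    handler = ElementLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


if __name__ == "__main__":
    # Made-up sample document.
    sample = "<movies><movie><title>Apocalypto</title></movie></movies>"
    for kind, name in log_events(sample):
        print(kind, name)
```

Note that the document tree is never built: the handler only sees a stream of events, and it is up to the client to keep whatever state it needs.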

Tuesday 29 November 2011

The Document Object Model


The DOM is an interface which provides a way of accessing and modifying structured documents, such as XML, XHTML and SVG. The DOM is usually accessed via its public API, in a platform- and language-independent manner.

The DOM has a troubled history: the early browser market was very competitive, which made the companies involved reluctant to cooperate on developing and implementing standards. Netscape and Microsoft eventually collaborated on a standard for the scripting language (ECMAScript, published by Ecma in 1997), and the W3C followed with a standard DOM, DOM Level 1, in late 1998.

The concept of a DOM is used by browsers, which expose an API to enable JavaScript code to access and modify the DOM. For example, the following code will find the element which has the ‘foo’ id and make its background colour green:
document.getElementById('foo').style.backgroundColor = "green";

This example illustrates an important point about the DOM: instead of being a monolithic specification, it is divided into several separate documents, each describing a specific area. DOM Level 2 comprises a ‘Core’ (http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/), and the other recommendations extend it. For example, the Level 2 HTML DOM extends some components which are part of the ‘Core’ specification, specializing them according to the needs of HTML. The HTML DOM ‘HTMLDocument’ interface (http://www.w3.org/TR/2003/REC-DOM-Level-2-HTML-20030109/html.html#ID-26809268), for example, is derived from the core ‘Document’ interface, and similarly the ‘HTMLElement’ interface is derived from the core ‘Element’ interface.

Such extensions of the ‘Core’ DOM allow for a wider range of applications, and therefore more uses. The aforementioned HTML DOM allows for HTML-specific manipulation, for example hiding an element with a certain ID. The DOM Level 2 Style specification allows scripts to dynamically access and update the content of stylesheets, thereby affecting how the document looks.
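
The ‘Core’ interfaces are not tied to browsers or to JavaScript. As a rough sketch, assuming Python's standard xml.dom.minidom module (a lightweight DOM implementation), the same kind of access and modification looks like this. The function rename_title and the sample document are invented for the illustration.

```python
from xml.dom import minidom


def rename_title(xml_text, new_title):
    """Parse a document into a DOM tree, modify it, and serialize it again."""
    doc = minidom.parseString(xml_text)
    # getElementsByTagName is part of the Core 'Document' interface.
    title = doc.getElementsByTagName("title")[0]
    # The text inside <title> is itself a node in the tree.
    title.firstChild.data = new_title
    # setAttribute is part of the Core 'Element' interface.
    title.setAttribute("edited", "yes")
    return doc.documentElement.toxml()


if __name__ == "__main__":
    # Made-up sample document.
    print(rename_title("<movie><title>Old</title></movie>", "New"))
```

Unlike a SAX handler, this code can navigate freely in the tree, because the whole document is held in memory.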

SVG Images


Scalable Vector Graphics (SVG) relies on text for describing an image. It also supports the inclusion of raster data, but the mainstay of SVG is the textual description of graphic data. For example, the following SVG document contains a red circle with a black outline:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<svg
   xmlns="http://www.w3.org/2000/svg"
   version="1.1"
   width="100"
   height="100"
   id="svg2">

  <g
     transform="translate(0,-952.36218)"
     id="layer1">
    <path
       d="m 90,50 a 40,40 0 1 1 -80,0 40,40 0 1 1 80,0 z"
       transform="translate(0,952.36218)"
       id="path2987"
       style="fill:#ff0000;
              fill-rule:evenodd;
              stroke:#000000;
              stroke-width:1px;
              stroke-linecap:butt;
              stroke-linejoin:miter;
              stroke-opacity:1"
    />
  </g>
</svg>

As is obvious from the document above, SVG is an application of XML: valid SVG documents are also well-formed XML. It also showcases how graphical data is stored inside the SVG file: as XML elements, with graphical properties described via attributes. The circle is represented by the path element; its ‘d’ attribute is short for ‘Path Data’ and describes its shape (http://www.w3.org/TR/SVG11/paths.html#PathData). Other properties of the circle, such as its fill colour and stroke width/colour, are described via the style attribute, as visible in the above example.
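
Because an SVG file is ordinary XML, any XML library can read it. The sketch below uses Python's standard xml.etree.ElementTree to pull the path data and the style properties out of a trimmed-down copy of the circle. The trimming (no transforms, fewer style properties), the explicit SVG namespace declaration and the helper name path_properties are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Trimmed-down copy of the circle above (assumption: transforms and most
# style properties are omitted for brevity).
SVG_TEXT = """<svg xmlns="http://www.w3.org/2000/svg"
     version="1.1" width="100" height="100">
  <path d="m 90,50 a 40,40 0 1 1 -80,0 40,40 0 1 1 80,0 z"
        style="fill:#ff0000;stroke:#000000;stroke-width:1px"/>
</svg>"""


def path_properties(svg_text):
    """Return the path data and the style properties of the first path."""
    root = ET.fromstring(svg_text)
    path = root.find(SVG_NS + "path")
    # The style attribute is a ';'-separated list of 'name:value' pairs.
    style = dict(
        item.strip().split(":", 1)
        for item in path.get("style").split(";")
        if item.strip()
    )
    return path.get("d"), style


if __name__ == "__main__":
    data, style = path_properties(SVG_TEXT)
    print(data)
    print(style)
```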

The SVG specification has multiple profiles; for example, the SVG Mobile Recommendation ( http://www.w3.org/TR/SVGMobile/ ) defines the SVG Tiny and SVG Basic mobile profiles. SVG Tiny is targeted at highly restricted mobile devices, while SVG Basic is meant to be suitable for higher level mobile devices.

One of the major issues concerning SVG is its slow adoption by browsers. Although the most recent versions of all major browsers have at least some level of support for SVG images, some older but still widely used versions do not support it; Internet Explorer has fully supported the SVG Basic specification only since version 9, released on March 14th 2011.

XLink


XLink is the XML way of tackling the issue of creating links between documents. It is very similar to its HTML counterpart, the <a> tag. But having a dedicated element for handling linking is not in XML’s spirit (it would impose unnecessary restrictions that would hamper flexibility); therefore this functionality is implemented in a separate namespace (xmlns:xlink="http://www.w3.org/1999/xlink") and applied to arbitrary elements by setting attributes from that namespace.

XLinks can be used for the role fulfilled by <a> in HTML, enabling the user to navigate to a different document; but they are more versatile, and have a number of attributes that control their behaviour.

An XLink can be of two types: simple and extended. Simple XLinks have capabilities similar to those of traditional HTML links: they provide a uni-directional connection between two resources.

Example of simple XLink usage:
<?xml version="1.0"?>
<links xmlns:xlink="http://www.w3.org/1999/xlink">

    <blog xlink:type="simple"
        xlink:href="http://mihairotaru.blogspot.com">Ram's blog</blog>

    <blog xlink:type="simple"
        xlink:href="http://native-dev.blogspot.com">Second blog</blog>

</links>
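
Since the linking information lives in namespaced attributes rather than in a dedicated element, a consumer has to look for those attributes on any element. A minimal sketch of this, using Python's standard xml.etree.ElementTree, could look as follows; the helper name simple_links is made up for the example.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# The example document from above.
LINKS = """<?xml version="1.0"?>
<links xmlns:xlink="http://www.w3.org/1999/xlink">
    <blog xlink:type="simple"
        xlink:href="http://mihairotaru.blogspot.com">Ram's blog</blog>
    <blog xlink:type="simple"
        xlink:href="http://native-dev.blogspot.com">Second blog</blog>
</links>"""


def simple_links(xml_text):
    """Return (text, href) pairs for every element carrying a simple XLink."""
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter():
        # ElementTree spells namespaced attributes as '{namespace}name'.
        if element.get("{%s}type" % XLINK_NS) == "simple":
            links.append((element.text, element.get("{%s}href" % XLINK_NS)))
    return links


if __name__ == "__main__":
    for text, href in simple_links(LINKS):
        print(text, "->", href)
```

The code never mentions the ‘blog’ element name: any element can act as a link, which is exactly the flexibility the namespace-based design is after.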


Monday 28 November 2011

Transforming XML documents with XSLT

The XSLT transformation pipeline involves four elements: the source XML document(s), the XSLT stylesheet(s), an XSLT processor (template processing engine), and the resulting document(s).

The output of an XSLT transformation can range from plain text files to PDF files (strictly speaking, an XSLT ‘transformation’ does not modify the source XML document; it uses it as input for creating other documents). This is due to XSLT’s powerful templating mechanism and to the versatile XSL Formatting Objects, which a formatter can render to formats such as PDF.

Here’s an XSL stylesheet which will create an XML file based on movies.xml, and add an element which will represent the ‘value’ of the movie, that is, its rating divided by its price:
 1 <?xml version="1.0" encoding="ISO-8859-1"?>
 2 <xsl:stylesheet version="1.0"
 3     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 4
 5 <xsl:output method="xml" version="1.0" indent="yes"/>
 6     <xsl:template match="/">
 7         <movies>
 8         <xsl:for-each select="movies/movie">
 9             <movie>
10             <xsl:copy-of select="./*"/>
11             <value><xsl:value-of select="rating div price"/></value>
12             </movie>
13         </xsl:for-each>
14         </movies>
15     </xsl:template>
16 </xsl:stylesheet>

Notes
-         this solution is not ideal, since some element names are hard-coded ( ‘movies’ and ‘movie’ ); these should be somehow deduced
-         xsl:for-each is used on L08 to select each ‘movie’ node in turn ( in the order they appear in the original XML file, i.e. ‘document order’ )
-         xsl:copy-of is used on L10 to create copies of all the child elements of the current node. This statement will copy the ‘title’, ‘year’, etc. elements for each ‘movie’ element.
-         on L11, the ‘value’ element is created. xsl:value-of is used to insert the value resulting from dividing the value of the ‘rating’ element by the value of the ‘price’ element of the current node.

Microsoft’s msxsl tool (http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=21714 ) can be used to transform the original XML file into a new one, with the added <value> element:
 msxsl movies.xml transform.xsl -o new_movies.xml

Sorting is accomplished by using the xsl:sort element (http://www.w3.org/TR/xslt#sorting). xsl:sort can only appear as a child of an xsl:apply-templates or an xsl:for-each element. To sort the list of movies by title, we can simply add a line after L08 in the previous XSLT stylesheet:
9         <xsl:sort select="title"/>

The resulting XML document will have its ‘movie’ elements sorted in the alphabetical order of their ‘title’ elements. So, ‘Apocalypto’ will appear first, and ‘Year One’ last.
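
For readers without an XSLT processor at hand, the two steps described above (adding a ‘value’ element and sorting by title) can be approximated in plain Python with the standard xml.etree.ElementTree module. This is not XSLT, only a sketch of the equivalent logic. Since movies.xml is not reproduced in this post, the sample records below, including their ratings and prices, are invented, and add_value_and_sort is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for movies.xml; the ratings and prices are invented.
MOVIES = """<movies>
  <movie><title>Year One</title><rating>5</rating><price>10</price></movie>
  <movie><title>Apocalypto</title><rating>8</rating><price>4</price></movie>
</movies>"""


def add_value_and_sort(xml_text):
    """Add a <value> child to every movie, then sort the movies by title."""
    root = ET.fromstring(xml_text)
    for movie in root.findall("movie"):
        rating = float(movie.findtext("rating"))
        price = float(movie.findtext("price"))
        # Counterpart of <xsl:value-of select="rating div price"/>.
        value = ET.SubElement(movie, "value")
        value.text = str(rating / price)
    # Counterpart of <xsl:sort select="title"/>.
    root[:] = sorted(root, key=lambda movie: movie.findtext("title"))
    return root


if __name__ == "__main__":
    print(ET.tostring(add_value_and_sort(MOVIES), encoding="unicode"))
```

The XSLT version remains the more declarative of the two: it states what the output should look like, while the Python code spells out how to build it.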

Sunday 27 November 2011

Styling XML documents with XSL

This blog post will describe the process of employing XSLT stylesheets for styling an XML document. It is very similar to using CSS – the stylesheet must be created, and the XML document should be told about the stylesheet. But that’s where the similarities stop; the two methods are very different in their approach.

Central to XSLT-based styling is the concept of a `template` - it is XSLT’s equivalent for CSS rules. XSLT will try to match the template’s pattern with the XML document; when it finds a match, it will perform the transformations described in the template body.

Here’s an XSL stylesheet which would display the movies.xml file as a table:


<?xml version="1.0"?>
<xsl:stylesheet 
version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<xsl:template match="/">
<html>
    <body style="font:12px Georgia;">
        <table style="border: 1px outset; background-color: lightgray;">
            <!--Generate table headings-->
            <tr bgcolor="#9acd32">
                <th>Title</th>
                <th>Year</th>
                <th>Rating</th>
            </tr>

            <!--Generate table rows-->
            <xsl:for-each select="movies/movie">
                <tr>
                    <td><xsl:value-of select="title"/></td>
                    <td><xsl:value-of select="year"/></td>
                    <td><xsl:value-of select="rating"/></td>
                </tr>
            </xsl:for-each>
        </table>
    </body>
</html>
</xsl:template>
</xsl:stylesheet>

The XML file needs to know about this XSL stylesheet, so the following line needs to be added as the second line of the XML file (right after the XML declaration), assuming the stylesheet is named ‘movies-xsl-style.xsl’:
<?xml-stylesheet type="text/xsl" href="movies-xsl-style.xsl"?>

This is how the XML file was rendered by Firefox 8.0:

Which is exactly how I wanted it to look. This example illustrates the advantages and potential drawbacks of using XSL for styling XML documents.

Among the drawbacks, the most obvious one is the learning curve: although XSL stylesheets are XML documents themselves, they have numerous tags with a specific role (more precisely, those inside the http://www.w3.org/1999/XSL/Transform namespace), which means that it is still a technology that one needs to spend a considerable amount of time to learn. It can be considered a fully-fledged programming language, with loops, conditionals and variables.

Browser support for XSLT 2.0 is also quite limited; however, this doesn’t diminish its usefulness on the server side. Instead of serving the XML + XSL documents and relying on the client’s browser to perform the transformations, the transformations could be run on the server (using tools such as Saxon or AltovaXML), generating XHTML files which would then be served to the clients.

One of the most important advantages of using XSL to style XML documents is the fact that it respects the principle of keeping the data and presentation separate, while providing the same styling capabilities as with CSS stylesheets, and much more powerful constructs.