Friday 16 December 2011

XML Spy


XML Spy is described by the company developing it as "an advanced XML editor for modeling, editing, transforming, and debugging XML-related technologies." From this description, it is obvious that the use cases for this program are numerous, and it would be impossible to provide an in-depth review without spending a lot of time using it and having at least some familiarity with the technologies it works with. 

This review will focus on how well XML Spy supports XML document authoring. Lately, I have accumulated quite a few XML files, because I write my blog entries for this module as XML files and then transform them into HTML with an XSLT template. I then post the HTML code to my blog, and paste it into the Word document which is my report. These files provide sufficient material to compare XML Spy with my current setup - the GVim editor, with a number of plugins to aid with XML editing, and the XML Star command-line utilities, which I use from within GVim to perform tasks such as validation and automatic escaping of characters like < and &. 

XML Spy is not free software; to evaluate the program, you are required to give an email address to receive a key used to activate it. After receiving the confirmation email, I proceeded to download the software. The first thing I found out about it during this process is that it is not lightweight; the installation package weighs in at a hefty 55MB; installed, it takes up 250MB, and during installation it requires at least double that amount for storing temporary files. 

After going through the throes of installing this beast, I fired it up, eager to put it to the test. The first impression I got was that the interface was bloated and outdated. Many of the buttons are too small, and the editor has a lot of windows open by default. Fortunately, they can be closed - but the size of the buttons cannot be increased, which makes them a little harder to click. These are small grievances, but they do contribute significantly to the overall impression the program makes. 

The first thing I did after taking a close look at the user interface and menu options was to open a couple of my XML documents. I opened combine.xsl and one of the posts for the blog, post_35.xml. After browsing the contents of the files and making small changes for a while, to see how the editor handles such tasks, I came away with the impression that the editor is too wasteful with screen space, and has too many little details which make it harder to concentrate on the code. 

I was curious to see how XML Spy treats illegal characters in XML files - in the editor I use, they are simply highlighted in red. But when I tried to type illegal characters in XML Spy, it simply ignored the keystrokes. I think that is a bad way of handling such issues - even though the character is illegal, the editor should not make decisions for the user without asking for permission, or at least notifying the user. When I tried typing such characters later, the editor accepted the keystrokes; I'm not sure why. When typing an ampersand, the editor shows a pop-up menu with the available entities - such as &amp; - but if the user keeps typing, the menu does not update itself to reflect the new keystrokes. 
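The escaping of characters like < and & discussed here can also be done programmatically. The following is a minimal sketch using Python's standard library rather than either editor; the sample string is made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

# Made-up text containing characters that are illegal in XML character data
raw = "if a < b && b > c"

# escape() replaces &, < and > with the corresponding entity references
safe = escape(raw)
print(safe)

# unescape() reverses the substitution, so the round trip is lossless
assert unescape(safe) == raw
```

The escaped string can be placed inside an element without breaking well-formedness.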

XML Spy does offer convenient ways of checking for validity and well-formedness, from the XML menu; it also has keyboard shortcuts for these probably quite common operations - to check for well-formedness, all the user needs to do is press F7. For validity testing, the keyboard shortcut is F8. 

The error messages XML Spy produces when it encounters incorrect syntax can be confusing. For example, after removing the > character which closed a tag, I pressed F7. XML Spy came up with the following error:

Although the message is correct, it could be more succinct; for example, here is the output generated when validating the same document with the xmlstar command line utility:

What I did like about XML Spy, however, is that it automatically checked the xsl file for validity after I fixed the error and saved it. I wasn't expecting that, but it was helpful, since I had actually forgotten to check for validity/well-formedness before saving. However, when I made some small changes to my other open document - post_35.xml - and saved it, no validity test was performed; XML Spy does not check the validity of documents which do not declare a DTD/Schema. It could still have performed a well-formedness test, but it did not do so automatically. 

XML Spy has very user-friendly XPath tools. The "XML -> Evaluate XPath..." menu option brings up the XPath window, shown below. XPath expressions can be written in the textbox, and the results are printed in the window below. This window is very convenient, and also features auto-completion of both the elements in the document and XPath functions. For example, the value of the 'title' element can be accessed like this: 


Pressing "Enter" prints the value of the title in the "Output" window - in this case, ".doc versus .docx". Clicking the result moves the window to the corresponding element and highlights it. Another powerful feature is that XML Spy allows users to select the version of XPath used, and supports both XPath 1.0 and XPath 2.0. 
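The same kind of lookup can be tried outside XML Spy. The sketch below uses Python's ElementTree, which supports only a limited subset of XPath. The document structure and the path expression are assumptions made for illustration; the real post_35.xml may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical blog post; only the 'title' element name comes from the review.
POST = """<post>
  <title>.doc versus .docx</title>
  <body><p>Some text</p></body>
</post>"""

root = ET.fromstring(POST)

# "./title" selects the 'title' child of the root element
title = root.findtext("./title")
print(title)
```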

Although XML Spy is quite heavy, not portable ( only available on Windows ) and not free, it also provides a very capable XML editing environment. However, students such as myself are probably not the target audience of this package; it's costly and complex. Freely available tools do a good enough job for basic XML authoring scenarios, while IDEs such as XML Spy can provide the editing power when it is really needed.

.doc versus .docx

Ever since Microsoft launched Office 2007, a new headache has plagued office workers all around the world. The culprit is the new default format used by Microsoft's office suite, .docx. This format is intended to supersede the .doc format, which Microsoft perhaps perceived as obsolete.

The transition could have been smoother, if not for the fact that Microsoft Office versions older than 2007 cannot open the new format in their default configuration. It is possible for Office 2003 users to open .docx documents, after installing the Compatibility Kit, but they will be warned that the document might not be displayed as it would in later Office versions, and that some elements not supported by the older Word might not be present. This can lead to users making changes to a document, only to find out later that the document looks different, or has elements missing, when viewed with a different version of the Office suite.

I decided to dissect a .docx file, and get a glimpse of its internal workings. I had a .docx, which was a BBC article I downloaded and converted into a .docx file for printing as part of an assignment. I changed its extension to .zip, and indeed, the archive contains a number of folders, and a [Content_Types].xml file at its root.

The [Content_Types].xml describes the contents of the other .xml files, scattered throughout the folder hierarchy. For the 'xml' extension, by default the ContentType is set to "application/xml", but certain files override this setting via Override elements. For example, "/word/document.xml" has the "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" content type.
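The Default and Override entries described above can be read without renaming the file, since a .docx is a zip archive. This is a minimal sketch using Python's standard library; the package built in memory is a hypothetical stand-in for a real .docx, holding only the two entries mentioned in the text.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by [Content_Types].xml
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"


def content_types(docx_file):
    """Return (defaults, overrides) read from a package's [Content_Types].xml."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("[Content_Types].xml"))
    defaults = {e.get("Extension"): e.get("ContentType")
                for e in root.iter(CT_NS + "Default")}
    overrides = {e.get("PartName"): e.get("ContentType")
                 for e in root.iter(CT_NS + "Override")}
    return defaults, overrides


# Hypothetical stand-in for a real .docx, built in memory so the sketch runs anywhere
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml"
            ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("[Content_Types].xml", SAMPLE)

defaults, overrides = content_types(buffer)
print(defaults["xml"])
print(overrides["/word/document.xml"])
```

Passing the path of a real .docx to content_types() instead of the in-memory buffer should list the entries of that document.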

The _rels folder for this document contains a single file, .rels - which doesn't have an xml extension, but is xml because it has the xml declaration as the first line, and is a well-formed xml document. Its root element is Relationships, which contains a number of Relationship elements, each having the Id, Type and Target attributes. I couldn't figure out exactly how these are used.

The docProps folder is a bit more interesting; it has two documents inside it, core.xml and app.xml. The core.xml is quite simple, and it is easy to see that it contains various meta-data about the document ( coreProperties, as the root element is named ).

I found out a number of interesting things - for example, the document had a dc:creator element ( dc is the namespace alias for "http://purl.org/dc/elements/1.1/" ), which has the value of my Middlesex University ID used for logging in to computers on the campus; I did not know Word automatically stores this information inside the file itself. I also noted that this file stores the dates the file was created and modified - therefore, .docx documents store this information independent of the file system. The file also contains information about who last modified the file.
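The metadata mentioned above can be pulled out with a few lines of code. This is a sketch only: the sample core.xml below is hand-written with placeholder values ( the user ID and dates are invented ), and the element names other than dc:creator are the ones I would expect in such a file rather than a copy of my document.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map; 'dc' is the namespace quoted in the text above
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hand-written sample with invented values, standing in for docProps/core.xml
CORE_XML = """<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:creator>ab123</dc:creator>
  <cp:lastModifiedBy>ab123</cp:lastModifiedBy>
  <dcterms:created>2011-12-01T10:00:00Z</dcterms:created>
  <dcterms:modified>2011-12-02T11:30:00Z</dcterms:modified>
</cp:coreProperties>"""


def core_properties(xml_text):
    """Collect author and timestamp metadata from a core.xml document."""
    root = ET.fromstring(xml_text)
    wanted = {
        "creator": "dc:creator",
        "last_modified_by": "cp:lastModifiedBy",
        "created": "dcterms:created",
        "modified": "dcterms:modified",
    }
    return {name: root.findtext(path, namespaces=NS)
            for name, path in wanted.items()}


props = core_properties(CORE_XML)
for name, value in props.items():
    print(name, "=", value)
```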

The app.xml file is a bit of a mixed bag - I was expecting it to contain information about the version of Microsoft Word the .docx document was created with - which it did, inside the AppVersion element - but it also contains metadata about the document's contents; such as the number of paragraphs, pages, words and characters. In addition, it has a CompanyName element, which was bestowed the value 'Middlesex University'.

The documents inside the word folder contain the document data. The roles of most of the files inside it are not hard to guess; for example, the fontTable.xml file stores information about each of the fonts used in the document. The settings.xml file contains settings such as the zoom level set when the document was last edited, and the decimal symbol used. Images and other media will be stored inside the 'media' folder; in this case, it contained an image named image1.gif. This hierarchy resembles an XHTML webpage - the document written in a markup language, accompanied by resources ( such as images ) and stylesheets.

The document.xml file is at the core of the .docx folder hierarchy, containing the text and layout information. All the other .xml files are used to describe certain aspects of this file, and to support Word in deciding how to process or display it.


I then saved the file as a simple .doc; after which I renamed it to .zip and tried to open it, but the archiving software gave me an error. I then opened the document with a hex editor ( 010 Editor ), and the reason became clear: the .doc file is a binary format. Some of the text was readable while browsing the hex data, but most of the file was difficult to make sense of.
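The difference between the two formats is already visible in the first few bytes of the file, without a hex editor. The sketch below relies on two facts not stated above: zip archives begin with the bytes "PK" followed by 03 04, and binary .doc files are normally stored in the OLE compound file container, which begins with D0 CF 11 E0 A1 B1 1A E1. The function name is my own.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"                          # start of a zip local file header
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")      # OLE compound file signature


def sniff_word_format(header: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if header.startswith(ZIP_MAGIC):
        return "zip container (.docx-style)"
    if header.startswith(OLE_MAGIC):
        return "OLE compound file (.doc-style binary)"
    return "unknown"


# Build a tiny zip in memory as a stand-in for a .docx
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<document/>")

docx_like = sniff_word_format(buffer.getvalue()[:8])
doc_like = sniff_word_format(OLE_MAGIC + b"\x00" * 8)
print(docx_like)
print(doc_like)
```

For a file on disk, reading the first 8 bytes in binary mode and passing them to sniff_word_format() gives the same check.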

Is AJAX both a success and a failure?


Since its emergence as a mainstream technology, AJAX has proven its usefulness. AJAX technologies are now employed by companies such as Microsoft and Google, and are used by millions of web pages. 

AJAX does, however, have drawbacks which take away from its shine. One of the main concerns with AJAX technologies is their impact on usability - certain screen readers and other accessibility devices might not respond well to dynamically updated web pages. 

Important issues emerge when a fast internet connection is not available. This can lead to websites being very difficult to use, because AJAX is often used for functionality which requires responsiveness - such as responding to the keys the user types in a search box, by updating a list of matching items. On a slow internet connection, this might lead to serious usability issues. 

Another problem can arise due to the fact that AJAX relies on JavaScript. But many users don't like JavaScript, and all modern browsers provide some mechanism of preventing JavaScript code from executing. If the user has JavaScript disabled, the website will not be dynamically updated. Although improbable, it is also possible that some users visit the web page with browsers which do not support JavaScript - in which case, again, dynamic content will not be updated. 

On the server side, employing AJAX technologies can lead to a very high number of requests, which the servers might find difficult to cope with. But companies usually take this into account, and purchase hardware which can handle the increased load, so this is rarely an issue.

Wednesday 14 December 2011

Is there something wrong with XML?


It seems somewhat inappropriate to state that there is something wrong with XML; it is, after all, just a technology designed for simple purposes: to provide an extensible markup language, which could be used on its own, or be used to create other, specialized markup languages. The burgeoning scene of XML-based languages is a testament to XML's success, and in that sense it has fulfilled its purpose. 

But that is not to say that XML comes without drawbacks. One of the most important issues with XML pertains to its verbosity - XML files can become quite large. This is inherent to XML, and is one of the trade-offs consciously made by its designers; integrity was deemed more important than file size. This issue can be mitigated to some degree by using compression - an XML file is, after all, just text, so it lends itself to compression algorithms. However, this negates one of XML's advantages, since if the archive is damaged it would hardly be possible to salvage any of the data. 
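How well XML compresses is easy to check. The sketch below generates a made-up, repetitive catalogue document and compresses it with gzip from Python's standard library; the element names and the data are invented, and the actual ratio will depend on the document.

```python
import gzip

# Invented, highly repetitive XML: the same tags wrap every record
records = "".join(
    f'<item id="{i}"><name>Item {i}</name><price>{i}.00</price></item>'
    for i in range(1000)
)
xml_bytes = f"<catalog>{records}</catalog>".encode("utf-8")

compressed = gzip.compress(xml_bytes)
print("original bytes:  ", len(xml_bytes))
print("compressed bytes:", len(compressed))

# Decompression restores the document exactly
assert gzip.decompress(compressed) == xml_bytes
```

The repeated tag names are exactly what makes the file verbose, and also what makes it compress well.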

Another issue consists of the relative complexity of an XML parser, as compared to alternatives available under certain conditions. For example, XML is often being used for storing configuration settings for applications. The application then has to include an XML parser, which could significantly increase the size of the application. This is less of an issue with large applications, such as Visual Studio or Microsoft Office, but simpler means of storing configuration files - such as .ini or .conf files - are available for smaller applications. 
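To make the comparison concrete, here is a small sketch that stores the same two settings as an .ini file and as XML, and reads both back with Python's standard library. The section, element names and values are invented for illustration.

```python
import configparser
import xml.etree.ElementTree as ET

# The same invented settings in both formats
INI_TEXT = """[display]
width = 1280
height = 720
"""

XML_TEXT = """<config>
  <display>
    <width>1280</width>
    <height>720</height>
  </display>
</config>"""

# .ini: flat sections with key = value pairs
ini = configparser.ConfigParser()
ini.read_string(INI_TEXT)
ini_size = (ini.getint("display", "width"), ini.getint("display", "height"))

# XML: the values have to be located in a tree
root = ET.fromstring(XML_TEXT)
xml_size = (int(root.findtext("display/width")),
            int(root.findtext("display/height")))

print(ini_size, xml_size)
```

Both approaches yield the same values; the difference is in how much machinery has to sit behind the call that reads the file.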

XML files can also be cumbersome to author without a specialized editor, again due to the explicit nature of XML. This is not a concern when small changes need to be made - for example, to change the resolution for a game in its config.xml. But authoring can be an extremely tedious process when large chunks of XML need to be added, or when a document is created from scratch. This problem is compounded by the fact that XML documents must be well-formed, and in many cases valid; small mistakes can render a document unacceptable to parsers due to well-formedness or validity issues. In practice, this can make XML documents 'read-only' for humans.
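How little it takes to break a document can be shown with a short well-formedness check. This is a sketch using Python's ElementTree parser; the config fragment is invented, and the only difference between the two strings is a single missing > character.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return (True, "") if text parses as XML, else (False, parser message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""


GOOD = "<config><resolution>1920x1080</resolution></config>"
BAD = "<config><resolution>1920x1080</resolution</config>"  # one '>' missing

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

The parser rejects the whole second document, even though a human reader can easily tell what was meant.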

XML Security


XML is becoming increasingly common, and its security is therefore becoming increasingly important. Given that XML is frequently used in e-commerce, businesses need ways to ensure that stored data is secure, and to prevent unauthorized access. This is especially important due to the textual nature of XML - a simple text editor can be used to view the contents and extract sensitive data.

A number of security-related technologies have emerged, one of the most important ones being XACML - the eXtensible Access Control Markup Language. XACML allows for controlling access to information via rules and policies. Other significant initiatives are XML signature and XML encryption. 

The concept of digital signatures is not new, and mature standards have been in place for quite some time. But existing technologies only allow signing individual files. Given the hierarchical nature of XML, more granularity is highly desirable - in other words, the ability to sign portions of an XML document. This is the role fulfilled by XML signature. The high-level algorithm is quite simple - the element to be signed is hashed, and the hash is stored inside a DigestValue element. A second digest ( usually a hash ) is then produced from the SignedInfo element, which contains the DigestValue, and is cryptographically signed. The resulting signature value is stored inside the Signature element. 
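The two-step idea described above can be sketched in a few lines. This is not a standards-compliant XML signature: the element names are simplified and carry no namespace, a shared-secret HMAC stands in for a public-key signature, and the sample fragment and key are invented. It only illustrates hashing the element first, then signing the structure that holds the hash. It needs Python 3.8 or later for ET.canonicalize.

```python
import base64
import hashlib
import hmac
import xml.etree.ElementTree as ET


def digest_value(element_xml):
    """Hash the canonical form of an XML fragment (step 1)."""
    canonical = ET.canonicalize(element_xml)
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def _mac_over(signed_info, key):
    """Compute the HMAC over the canonical form of the SignedInfo element."""
    canonical = ET.canonicalize(ET.tostring(signed_info, encoding="unicode"))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).digest()


def sign(element_xml, key):
    """Build a simplified Signature element for one fragment (step 2)."""
    signature = ET.Element("Signature")
    signed_info = ET.SubElement(signature, "SignedInfo")
    ET.SubElement(signed_info, "DigestValue").text = digest_value(element_xml)
    value = base64.b64encode(_mac_over(signed_info, key)).decode("ascii")
    ET.SubElement(signature, "SignatureValue").text = value
    return signature


def verify(element_xml, signature, key):
    """Check both the stored digest and the signature over SignedInfo."""
    signed_info = signature.find("SignedInfo")
    if signed_info.findtext("DigestValue") != digest_value(element_xml):
        return False
    actual = base64.b64decode(signature.findtext("SignatureValue"))
    return hmac.compare_digest(_mac_over(signed_info, key), actual)


# Invented fragment and key: only this part of a larger document is signed
PAYMENT = "<payment><amount>25.00</amount></payment>"
TAMPERED = "<payment><amount>99.00</amount></payment>"
KEY = b"demo-shared-secret"

sig = sign(PAYMENT, KEY)
print(verify(PAYMENT, sig, KEY))
print(verify(TAMPERED, sig, KEY))
```

Canonicalizing before hashing means that insignificant differences in how the fragment is written, such as attribute order, do not change the digest.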

XML encryption operates on similar principles to XML signatures - it allows the encryption of XML documents with a high granularity, enabling the encryption of portions of a document. An additional advantage is that different encryption keys and algorithms can be used for each encrypted portion, allowing precise control over who can use which portions of the document. 

XML encryption is especially important in a business environment. For example, a delivery company might define an XML document type with one element holding the client and billing information, and another element describing the contents of the delivered package. The driver will have the key for viewing the client information, while the client might be sent, via email, the key needed to view the information about the contents and check whether anything is missing.