I Stopped Parsing XML Thanks To XML::Rabbit

A while back, Robin Smidsrød sent me a link to a series of articles starting with Implementing WWW::LastFM. He'd just released a new CPAN module to simplify extracting information from XML documents and wanted to show it off.

I put off looking at it, and then I looked at it, and I was impressed.

Then I had to write ETL code for clinical reports from the US FDA (a regulatory agency which, among other things, requires specific controlled studies before approving drugs to be sold to the public within the US). The ~135,000 reports in my dataset range from a few dozen lines of XML to hundreds of kilobytes, and they all roughly follow the same format. (Some fields are apparently optional, and some fields have started to appear over the years.)

I looked at four example files and had very nearly resigned myself to using an XML parser to extract data when I remembered Robin's articles and his module XML::Rabbit. (You may have opinions about non-technical names, but I remembered the name, and that's what matters.)

XML::Rabbit is something like an object-XML mapper which uses Moose and XPath. In other words, you declare the attributes of a class as mappings of names to XPath expressions.

Consider an XML document which corresponds to a record like NCT00210366:

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="89160">
  <!-- This xml conforms to an XML Schema at:
    http://clinicaltrials.gov/ct2/html/images/info/public.xsd
 and an XML DTD at:
    http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
  <required_header>
    <download_date>Information obtained from ClinicalTrials.gov on October 25, 2012</download_date>
    <link_text>Link to the current ClinicalTrials.gov record.</link_text>
    <url>http://clinicaltrials.gov/show/NCT00210366</url>
  </required_header>
  <id_info>
    <org_study_id>IELSG21</org_study_id>
    <nct_id>NCT00210366</nct_id>
  </id_info>
  <brief_title>Salvage Therapy With Idarubicin in Relapsing CNS Lymphoma</brief_title>
  <official_title>Salvage Therapy With Idarubicin in Immunocompetent Patients With Relapsed or Refractory Primary Central Nervous System Lymphomas</official_title>
  <sponsors>
    <lead_sponsor>
      <agency>International Extranodal Lymphoma Study Group (IELSG)</agency>
      <agency_class>Other</agency_class>
    </lead_sponsor>
  </sponsors>
  <source>International Extranodal Lymphoma Study Group (IELSG)</source>
  <oversight_info>
    <authority>Italy: Ministry of Health</authority>
  </oversight_info>
  <brief_summary>
    <textblock>
      The main objective of the trial is to assess the therapeutic activity of idarubicin as
      salvage treatment in patients with recurrent or progressive lymphoma in the central nervous
      system.
    </textblock>
  </brief_summary>
  <overall_status>Terminated</overall_status>
  <why_stopped>
    due to slow accrual
  </why_stopped>
  <start_date>November 2004</start_date>
  <primary_completion_date type="Actual">July 2010</primary_completion_date>
  <phase>Phase 2</phase>
  <study_type>Interventional</study_type>
  <study_design>Allocation:  Non-Randomized, Endpoint Classification:  Safety/Efficacy Study, Intervention Model:  Single Group Assignment, Masking:  Open Label, Primary Purpose:  Treatment</study_design>
  <primary_outcome>
    <measure>objective response to treatment</measure>
  </primary_outcome>
  <secondary_outcome>
    <measure>duration of response</measure>
  </secondary_outcome>
  <secondary_outcome>
    <measure>overall survival</measure>
  </secondary_outcome>
  <secondary_outcome>
    <measure>acute side effects of idarubicin</measure>
  </secondary_outcome>
  <enrollment type="Anticipated">25</enrollment>
  <condition>Lymphoma, B-Cell</condition>
  <intervention>
    <intervention_type>Drug</intervention_type>
    <intervention_name>Idarubicin</intervention_name>
  </intervention>
  <eligibility>
    <criteria>
      <textblock>
        Inclusion Criteria:

          -  Histological or cytological diagnosis of non-Hodgkin's lymphoma

          -  Disease exclusively localised into the CNS at first diagnosis and failure

          -  Progressive or recurrent disease

          -  Previous treatment with HDMTX containing CHT and/or RT

          -  Presence of at least one target lesion, bidimensionally measurable

          -  Age 18 - 75 years

          -  ECOG performance status &lt; 3 (Appendix 1).

          -  No known HIV disease or immunodeficiency

          -  HBsAg-negative and Ab anti-HCV-negative patients.

          -  Adequate bone marrow function (plt &gt; 100000 mm3, Hb &gt; 9 g/dl, ANC &gt; 2.000 mm3)

          -  Adequate renal function (serum creatinine &lt; 2 times UNL)

          -  Adequate hepatic function (SGOT/SGPT &lt; 3 times UNL, bilirubin and alkaline
             phosphatase &lt; 2 times UNL)

          -  Adequate cardiac function (VEF ≥ 50%)

          -  Absence of any psycological, familial, sociological or geographical condition
             potentially hampering compliance with the study protocol and follow-up schedule

          -  Non-pregnant and non-lactating status for female patients. Adequate contraceptive
             measures during study participation for sexually active patients of childbearing
             potential.

          -  No previous or concurrent malignancies at other sites with the exception of
             surgically cured carcinoma in-site of the cervix and basal or squamous cell carcinoma
             of the skin and of other neoplasms without evidence of disease since at least 5
             years.

          -  No concurrent treatment with other experimental drugs.

          -  Informed consent signed by the patient before registration
      </textblock>
    </criteria>
    <gender>Both</gender>
    <minimum_age>18 Years</minimum_age>
    <maximum_age>75 Years</maximum_age>
    <healthy_volunteers>No</healthy_volunteers>
  </eligibility>
  <overall_official>
    <last_name>Andres JM Ferreri, MD</last_name>
    <role>Study Chair</role>
    <affiliation>San Raffaele Hospital - HSR Servizio di radiochemioterapia</affiliation>
  </overall_official>
  <location>
    <facility>
      <name>Servizio Radiochemioterapia - Ospedale San Raffaele</name>
      <address>
        <city>Milan</city>
        <zip>20132</zip>
        <country>Italy</country>
      </address>
    </facility>
  </location>
  <location_countries>
    <country>Italy</country>
  </location_countries>
  <link>
    <url>http://www.ielsg.org</url>
    <description>Click here for more information about this study</description>
  </link>
  <verification_date>July 2010</verification_date>
  <lastchanged_date>July 29, 2010</lastchanged_date>
  <firstreceived_date>September 13, 2005</firstreceived_date>
  <responsible_party>
    <name_title>International Extranodal Lymphoma Study Group</name_title>
    <organization>IELSG</organization>
  </responsible_party>
  <is_fda_regulated>No</is_fda_regulated>
  <has_expanded_access>No</has_expanded_access>
  <condition_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm  -->
    <mesh_term>Lymphoma</mesh_term>
    <mesh_term>Lymphoma, B-Cell</mesh_term>
  </condition_browse>
  <intervention_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm  -->
    <mesh_term>Idarubicin</mesh_term>
  </intervention_browse>
  <!-- Results have not yet been released for this study                              -->
</clinical_study>

At a minimum, I might need to extract the NCT number, the start date, the completion date, and the long title from this document. In XML::Rabbit, that's as simple as:

package Study;

use strict;
use warnings;

use XML::Rabbit::Root;

has_xpath_value long_title           => './official_title';
has_xpath_value nct                  => './id_info/nct_id';
has_xpath_value start_date           => './start_date';
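# Not every record has <completion_date>; the sample above carries only
# <primary_completion_date>.
has_xpath_value completion_date      => './completion_date';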

finalize_class();

1;

That's it.

Seriously, that's it. To parse a document, write my $study = Study->new( xml => $xml ); then read the attributes through their accessors.
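
As a rough sketch of what that looks like end to end, the script below defines the same Study class inline and parses a cut-down copy of the record above. The trimmed XML string and the variable names are only illustrative; the xml constructor argument and the accessors are the ones already described.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

package Study;

use XML::Rabbit::Root;

has_xpath_value long_title      => './official_title';
has_xpath_value nct             => './id_info/nct_id';
has_xpath_value start_date      => './start_date';
has_xpath_value completion_date => './completion_date';

finalize_class();

package main;

# A cut-down copy of the record shown earlier (illustrative only).
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="89160">
  <id_info>
    <org_study_id>IELSG21</org_study_id>
    <nct_id>NCT00210366</nct_id>
  </id_info>
  <official_title>Salvage Therapy With Idarubicin in Immunocompetent Patients With Relapsed or Refractory Primary Central Nervous System Lymphomas</official_title>
  <start_date>November 2004</start_date>
  <primary_completion_date type="Actual">July 2010</primary_completion_date>
</clinical_study>
XML

my $study = Study->new( xml => $xml );

# Each XPath query runs when its accessor is called.
printf "%s started %s\n", $study->nct, $study->start_date;
print  $study->long_title, "\n";
```

Because this trimmed record has no completion_date element, that accessor has nothing to match here; the other three return the text of their elements.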

It gets better. I also need to extract contacts from the study so I can associate them with user accounts. XML::Rabbit can create nested objects too:

package Study;

use strict;
use warnings;

use XML::Rabbit::Root;

has_xpath_value long_title      => './official_title';
has_xpath_value nct             => './id_info/nct_id';
has_xpath_value start_date      => './start_date';
has_xpath_value completion_date => './completion_date';

has_xpath_object_list contacts  => './overall_contact|./overall_contact_backup'
                                => 'Study::Contact';

finalize_class();

package Study::Contact;

use strict;
use warnings;

use XML::Rabbit;

has_xpath_value email     => './email';
has_xpath_value last_name => './last_name';
has_xpath_value role      => './role';

finalize_class();

1;

For every node matching the XPath expression provided to contacts, XML::Rabbit will attempt to create a Study::Contact object and associate it with the parent Study object.
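
Here is a small sketch of walking that list. The sample record above has no overall_contact element, so the XML fragment below is made up: the NCT number, names, and email addresses are placeholders, not real data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

package Study;

use XML::Rabbit::Root;

has_xpath_value       nct      => './id_info/nct_id';
has_xpath_object_list contacts => './overall_contact|./overall_contact_backup'
                               => 'Study::Contact';

finalize_class();

package Study::Contact;

use XML::Rabbit;    # nested classes use XML::Rabbit, not XML::Rabbit::Root

has_xpath_value email     => './email';
has_xpath_value last_name => './last_name';

finalize_class();

package main;

# Hypothetical fragment: the contact names and addresses are invented.
my $xml = <<'XML';
<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <overall_contact>
    <last_name>Jane Example</last_name>
    <email>jane@example.org</email>
  </overall_contact>
  <overall_contact_backup>
    <last_name>John Example</last_name>
    <email>john@example.org</email>
  </overall_contact_backup>
</clinical_study>
XML

my $study = Study->new( xml => $xml );

# The list accessor returns an array reference of Study::Contact objects.
for my $contact ( @{ $study->contacts } ) {
    printf "%s <%s>\n", $contact->last_name, $contact->email;
}
```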

I truly appreciate that these documents are plain old Moose objects. Anything you can do with Moose you can do with these classes—including adding methods (or applying roles or...) to add behavior to the objects.

Updating the parser classes to add a new element to extract takes thirty seconds, and that includes launching a new instance of my text editor. I added a couple of methods to report documents with missing fields, and that took about ten minutes (it took an order of magnitude longer to examine the documents than it did to write the code).
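
Those reporting methods aren't shown here, so the following is only one hypothetical shape such a method might take. The name missing_fields and the list of expected fields are invented for illustration, and the check assumes a has_xpath_value accessor yields an empty (or undefined) value when its XPath matches nothing.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

package Study;

use XML::Rabbit::Root;

has_xpath_value long_title      => './official_title';
has_xpath_value nct             => './id_info/nct_id';
has_xpath_value start_date      => './start_date';
has_xpath_value completion_date => './completion_date';

# Hypothetical helper, not from the original post: returns the names of
# the listed accessors whose value is undefined or empty.
sub missing_fields {
    my ($self) = @_;
    my @expected = qw( long_title nct start_date completion_date );
    return grep {
        my $value = $self->$_;
        !defined $value || $value eq '';
    } @expected;
}

finalize_class();

package main;

# A deliberately sparse record, to exercise the helper.
my $study = Study->new( xml => <<'XML' );
<clinical_study>
  <id_info><nct_id>NCT00210366</nct_id></id_info>
  <start_date>November 2004</start_date>
</clinical_study>
XML

my @missing = $study->missing_fields;
print "Missing: @missing\n" if @missing;
```

Since the classes are ordinary Moose classes, the helper is just a regular sub in the package; the same check could equally live in a role applied to several parser classes.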

I'm still not sold on XML as a data interchange format, but the combination of XPath and Moose has certainly made my job much, much easier. I almost wish I'd thought of this trick.

6 Comments

Robin Smidsrød | October 30, 2012 12:53 AM

Great writeup! Glad you continue to enjoy my module.
I just wanted to give a few notes on your implementation to avoid any confusion.

1. When you use XML::Rabbit::Root, it automatically uses strict and warnings on the caller, same as Moose, so those two lines are redundant.

2. You don't need 1; at the end of the file, because finalize_class() always returns true. I did this to simplify the boilerplate code of each file. I'm aware it violates a Perl::Critic rule, but I choose to ignore it.

3. Most importantly, nested classes should NOT use XML::Rabbit::Root, they should only use XML::Rabbit. I'm actually a bit surprised your code for Study::Contact actually works. I guess I should probably come up with a test to disallow that kind of usage. I think if you included an XPath query in Study::Contact that went outside <overall_contact/>, it wouldn't resolve the data. If you had used XML::Rabbit instead, it would contain a reference to the root of the XML data structure and you could make references to it with an XPath query that doesn't start with './'.

Anonymous | October 30, 2012 1:00 AM

See XML::Pastor/Corinna for declarative OO XML toolkits that support writing.

oylenshpeegul | October 30, 2012 3:43 PM

Don't you want to keep the "use XML::Rabbit::Root" on package Study? I think Robin's comment 3 only applies to package Study::Contact.

Also, I'm guessing it should be "has_xpath_value last_name => './last_name'" rather than ./email again.

Eric Johnson (kablamo) | November 1, 2012 5:53 AM

Very cool!

I had been thinking about this xpath approach as well. My thoughts were inspired by Web::Scraper which uses a similar approach to parsing web pages. I was wondering if Web::Scraper can also parse XML but maybe I don't have to care anymore.

Toby Inkster | November 1, 2012 10:31 AM

Also perhaps of interest XML::LibXML::Augment which approaches the same problem from a different direction. XLA takes advantage of the fact that the DOM is already a set of objects, albeit with boring, generic methods like getChildrenByTagName, and allows you to supplement the DOM with domain-specific methods.

So you'd create subclasses of XML::LibXML::Element for studies and contacts, with whatever methods you like, and then use XML::LibXML::Augment to associate those subclasses with particular namespaces and element names.

hesco | November 11, 2012 7:25 AM

I wonder if instead of
overall_contact

you actually meant:
overall_official

The code related to parsing contacts does not seem
to be supported by the sample xml you provide.
