A while back, Robin Smidsrød sent me a link to a series of articles starting with Implementing WWW::LastFM. He'd just released a new CPAN module to simplify extracting information from XML documents and wanted to show it off.
I put off looking at it, and then I looked at it, and I was impressed.
Then I had to write ETL code for clinical reports from the US FDA (a regulatory agency which, among other things, requires specific controlled studies before approving drugs to be sold to the public within the US). The ~135,000 reports in my dataset range from a few dozen lines of XML to hundreds of kilobytes, and they all roughly follow the same format. (Some fields are apparently optional, and some fields have started to appear over the years.)
I looked at four example files and very nearly resigned myself to extracting the data with a plain XML parser, and then I remembered Robin's articles and his module XML::Rabbit. (You may have opinions about non-technical names, but I remembered the name, and that's what matters.)
XML::Rabbit is something like an object-XML mapper which uses
Moose and XPath. In other words, you declare the attributes of a class as
mappings of names to XPath expressions.
Consider an XML document which corresponds to a record like NCT00210366:
    <?xml version="1.0" encoding="UTF-8"?>
    <clinical_study rank="89160">
      <!-- This xml conforms to an XML Schema at:
        http://clinicaltrials.gov/ct2/html/images/info/public.xsd
        and an XML DTD at:
        http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
      <required_header>
        <download_date>Information obtained from ClinicalTrials.gov on October 25, 2012</download_date>
        <link_text>Link to the current ClinicalTrials.gov record.</link_text>
        <url>http://clinicaltrials.gov/show/NCT00210366</url>
      </required_header>
      <id_info>
        <org_study_id>IELSG21</org_study_id>
        <nct_id>NCT00210366</nct_id>
      </id_info>
      <brief_title>Salvage Therapy With Idarubicin in Relapsing CNS Lymphoma</brief_title>
      <official_title>Salvage Therapy With Idarubicin in Immunocompetent Patients With Relapsed or Refractory Primary Central Nervous System Lymphomas</official_title>
      <sponsors>
        <lead_sponsor>
          <agency>International Extranodal Lymphoma Study Group (IELSG)</agency>
          <agency_class>Other</agency_class>
        </lead_sponsor>
      </sponsors>
      <source>International Extranodal Lymphoma Study Group (IELSG)</source>
      <oversight_info>
        <authority>Italy: Ministry of Health</authority>
      </oversight_info>
      <brief_summary>
        <textblock>
          The main objective of the trial is to assess the therapeutic activity of idarubicin as
          salvage treatment in patients with recurrent or progressive lymphoma in the central
          nervous system.
        </textblock>
      </brief_summary>
      <overall_status>Terminated</overall_status>
      <why_stopped>
        due to slow accrual
      </why_stopped>
      <start_date>November 2004</start_date>
      <primary_completion_date type="Actual">July 2010</primary_completion_date>
      <phase>Phase 2</phase>
      <study_type>Interventional</study_type>
      <study_design>Allocation: Non-Randomized, Endpoint Classification: Safety/Efficacy Study, Intervention Model: Single Group Assignment, Masking: Open Label, Primary Purpose: Treatment</study_design>
      <primary_outcome>
        <measure>objective response to treatment</measure>
      </primary_outcome>
      <secondary_outcome>
        <measure>duration of response</measure>
      </secondary_outcome>
      <secondary_outcome>
        <measure>overall survival</measure>
      </secondary_outcome>
      <secondary_outcome>
        <measure>acute side effects of idarubicin</measure>
      </secondary_outcome>
      <enrollment type="Anticipated">25</enrollment>
      <condition>Lymphoma, B-Cell</condition>
      <intervention>
        <intervention_type>Drug</intervention_type>
        <intervention_name>Idarubicin</intervention_name>
      </intervention>
      <eligibility>
        <criteria>
          <textblock>
            Inclusion Criteria:
            - Histological or cytological diagnosis of non-Hodgkin's lymphoma
            - Disease exclusively localised into the CNS at first diagnosis and failure
            - Progressive or recurrent disease
            - Previous treatment with HDMTX containing CHT and/or RT
            - Presence of at least one target lesion, bidimensionally measurable
            - Age 18 - 75 years
            - ECOG performance status &lt; 3 (Appendix 1).
            - No known HIV disease or immunodeficiency
            - HBsAg-negative and Ab anti-HCV-negative patients.
            - Adequate bone marrow function (plt > 100000 mm3, Hb > 9 g/dl, ANC > 2.000 mm3)
            - Adequate renal function (serum creatinine &lt; 2 times UNL)
            - Adequate hepatic function (SGOT/SGPT &lt; 3 times UNL, bilirubin and alkaline phosphatase &lt; 2 times UNL)
            - Adequate cardiac function (VEF ≥ 50%)
            - Absence of any psycological, familial, sociological or geographical condition potentially hampering compliance with the study protocol and follow-up schedule
            - Non-pregnant and non-lactating status for female patients. Adequate contraceptive measures during study participation for sexually active patients of childbearing potential.
            - No previous or concurrent malignancies at other sites with the exception of surgically cured carcinoma in-site of the cervix and basal or squamous cell carcinoma of the skin and of other neoplasms without evidence of disease since at least 5 years.
            - No concurrent treatment with other experimental drugs.
            - Informed consent signed by the patient before registration
          </textblock>
        </criteria>
        <gender>Both</gender>
        <minimum_age>18 Years</minimum_age>
        <maximum_age>75 Years</maximum_age>
        <healthy_volunteers>No</healthy_volunteers>
      </eligibility>
      <overall_official>
        <last_name>Andres JM Ferreri, MD</last_name>
        <role>Study Chair</role>
        <affiliation>San Raffaele Hospital - HSR Servizio di radiochemioterapia</affiliation>
      </overall_official>
      <location>
        <facility>
          <name>Servizio Radiochemioterapia - Ospedale San Raffaele</name>
          <address>
            <city>Milan</city>
            <zip>20132</zip>
            <country>Italy</country>
          </address>
        </facility>
      </location>
      <location_countries>
        <country>Italy</country>
      </location_countries>
      <link>
        <url>http://www.ielsg.org</url>
        <description>Click here for more information about this study</description>
      </link>
      <verification_date>July 2010</verification_date>
      <lastchanged_date>July 29, 2010</lastchanged_date>
      <firstreceived_date>September 13, 2005</firstreceived_date>
      <responsible_party>
        <name_title>International Extranodal Lymphoma Study Group</name_title>
        <organization>IELSG</organization>
      </responsible_party>
      <is_fda_regulated>No</is_fda_regulated>
      <has_expanded_access>No</has_expanded_access>
      <condition_browse>
        <!-- CAUTION: The following MeSH terms are assigned with an imperfect algorithm -->
        <mesh_term>Lymphoma</mesh_term>
        <mesh_term>Lymphoma, B-Cell</mesh_term>
      </condition_browse>
      <intervention_browse>
        <!-- CAUTION: The following MeSH terms are assigned with an imperfect algorithm -->
        <mesh_term>Idarubicin</mesh_term>
      </intervention_browse>
      <!-- Results have not yet been released for this study -->
    </clinical_study>
At a minimum, I might need to extract the NCT number, the start date, the
completion date, and the long title from this document. In
XML::Rabbit, that's as simple as:
    package Study;

    use strict;
    use warnings;

    use XML::Rabbit::Root;

    has_xpath_value long_title      => './official_title';
    has_xpath_value nct             => './id_info/nct_id';
    has_xpath_value start_date      => './start_date';
    has_xpath_value completion_date => './completion_date';

    finalize_class();

    1;
Seriously, that's it. Parse a document with my $study = Study->new( xml => $xml ); and read each value through the accessor of the same name, such as $study->nct.
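Here is a minimal sketch of that round trip, assuming XML::Rabbit is installed from CPAN. The class is a trimmed copy of Study, and the XML is a cut-down version of the record above rather than a full report.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Trimmed copy of the Study class: same XPath expressions, fewer fields.
package Study {
    use XML::Rabbit::Root;

    has_xpath_value long_title => './official_title';
    has_xpath_value nct        => './id_info/nct_id';
    has_xpath_value start_date => './start_date';

    finalize_class();
}

package main;

# Cut-down sample record; a real report carries many more elements.
my $xml = <<'XML';
<clinical_study rank="89160">
  <id_info>
    <org_study_id>IELSG21</org_study_id>
    <nct_id>NCT00210366</nct_id>
  </id_info>
  <official_title>Salvage Therapy With Idarubicin in Immunocompetent Patients With Relapsed or Refractory Primary Central Nervous System Lymphomas</official_title>
  <start_date>November 2004</start_date>
</clinical_study>
XML

# Parse the string; each declared name becomes a read accessor.
my $study = Study->new( xml => $xml );

printf "%s started %s\n", $study->nct, $study->start_date;
print $study->long_title, "\n";
```

When the reports live on disk, the constructor also accepts file => $path in place of xml => $xml.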
It gets better. I also need to extract contacts from the study so I can
associate them with user accounts.
XML::Rabbit can create nested objects:
    package Study;

    use strict;
    use warnings;

    use XML::Rabbit::Root;

    has_xpath_value long_title      => './official_title';
    has_xpath_value nct             => './id_info/nct_id';
    has_xpath_value start_date      => './start_date';
    has_xpath_value completion_date => './completion_date';

    has_xpath_object_list contacts =>
        './overall_contact|./overall_contact_backup' => 'Study::Contact';

    finalize_class();

    package Study::Contact;

    use strict;
    use warnings;

    use XML::Rabbit;

    has_xpath_value email     => './email';
    has_xpath_value last_name => './last_name';
    has_xpath_value role      => './role';

    finalize_class();

    1;
For every node matching the XPath expression provided to has_xpath_object_list, XML::Rabbit will attempt to create a Study::Contact item and associate it with the parent Study object.
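To make the nesting concrete, here is a small sketch of walking the resulting list. It assumes the contacts accessor returns an array reference of Study::Contact objects. The sample record above has no overall_contact element, so the two contacts in this snippet (names and example.org addresses) are invented purely for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Child class: XPath expressions are relative to each matched contact node.
package Study::Contact {
    use XML::Rabbit;

    has_xpath_value email     => './email';
    has_xpath_value last_name => './last_name';

    finalize_class();
}

# Root class, trimmed to the fields this sketch needs.
package Study {
    use XML::Rabbit::Root;

    has_xpath_value nct => './id_info/nct_id';
    has_xpath_object_list contacts =>
        './overall_contact|./overall_contact_backup' => 'Study::Contact';

    finalize_class();
}

package main;

# Invented contact data, not taken from any real ClinicalTrials.gov record.
my $xml = <<'XML';
<clinical_study>
  <id_info><nct_id>NCT00210366</nct_id></id_info>
  <overall_contact>
    <last_name>Jane Doe</last_name>
    <email>jane.doe@example.org</email>
  </overall_contact>
  <overall_contact_backup>
    <last_name>John Roe</last_name>
    <email>john.roe@example.org</email>
  </overall_contact_backup>
</clinical_study>
XML

my $study = Study->new( xml => $xml );

# Assumption: the list accessor hands back an array reference.
for my $contact ( @{ $study->contacts } ) {
    printf "%s <%s>\n", $contact->last_name, $contact->email;
}
```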
I truly appreciate that these documents are plain old Moose objects. Anything you can do with Moose you can do with these classes—including adding methods (or applying roles or...) to add behavior to the objects.
Updating the parser classes to add a new element to extract takes thirty seconds, and that includes launching a new instance of my text editor. I added a couple of methods to report documents with missing fields and that took about ten minutes (it took an order of magnitude longer to examine the documents than it did to write my code).
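A missing-field report along those lines might look like the sketch below. The method name missing_fields and the choice of required fields are hypothetical, not the code I actually wrote. It also assumes that an XPath expression which matches nothing produces an empty or undefined value rather than an exception, so the check treats both cases as missing.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

package Study {
    use XML::Rabbit::Root;

    has_xpath_value nct             => './id_info/nct_id';
    has_xpath_value start_date      => './start_date';
    has_xpath_value completion_date => './completion_date';

    # Hypothetical helper: names of the declared fields with no usable value.
    sub missing_fields {
        my ($self) = @_;
        my @required = qw( nct start_date completion_date );

        # Treat both undef and the empty string as "missing".
        return grep { !length( $self->$_ // '' ) } @required;
    }

    finalize_class();
}

package main;

# Like the sample record, this snippet has no completion_date element.
my $xml = <<'XML';
<clinical_study>
  <id_info><nct_id>NCT00210366</nct_id></id_info>
  <start_date>November 2004</start_date>
  <primary_completion_date type="Actual">July 2010</primary_completion_date>
</clinical_study>
XML

my $study   = Study->new( xml => $xml );
my @missing = $study->missing_fields;

printf "%s is missing: %s\n", $study->nct, join( ', ', @missing ) if @missing;
```

Because the class is ordinary Moose underneath, the helper is just a regular method. The same check could live in a role shared by several parser classes.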
I'm still not sold on XML as a data interchange format, but the combination of XPath and Moose has certainly made my job much, much easier. I almost wish I'd thought of this trick.