In organizations, XML format is used to store data for multiple applications. To derive value from XML data, you need to be able to ingest that data into Hadoop and run analytics on it. Fortunately, this task is very easy using Big SQL in BigInsights 3.0.  BigInsights 3.0 includes an all-new Big SQL (referred to here as Big SQL 3.0) and access to the prior version of Big SQL (referred to here as Big SQL 1.0).  For more information see Which version of Big SQL should you use?

Download BigInsights QuickStart Edition

Download the free BigInsights Quick Start Edition to try this tutorial yourself.

Set up BigInsights for ingesting XML data

Download the following JAR files and register them in BigInsights. To do so, follow the steps:

    1. Download the following JAR files:
      hivexmlserde-1.0.0.0.jar (hive-xml SerDe)
      hivexmlserde-vtd-1.0.0.0.jar (vtd-xml parser)
      vtd-xml-2.11.jar (dependency of vtd-xml parser)
    2. Copy the three JAR files to the following directories. You will need to copy all the three JAR into each of the three following listed directories so that each component has a path set to use them.
      Location Component
      /$BIGINSIGHTS_HOME/hive/lib Hive
      /$BIGINSIGHTS_HOME/bigsql/bigsql1/userlib/ Big SQL 1.0
      /$BIGINSIGHTS_HOME/bigsql/userlib/ Big SQL 3.0
    3. Restart the cluster by using the scripts available in $BIGINSIGHTS_HOME/bin
./stop-all (Allow time to stop – probably 5-10 minutes)
./start-all

To ingest XML documents, the Hadoop ecosystem needs to understand XML format and how to interpret it. To achieve that, we need to use a hive-xml SerDe (Serializer Deserializer) so that Hive can understand XML while reading it. To enhance XML parsing, we will be using vtd-xml-parser instead of default Java DOM XML parser. Based on my testing, vtd-xml parser performs five times faster than Java DOM XML parser.

(To do minimal XML processing in Hadoop, you do not need hivexmlserde-vtd and vtd-xml. But doing the processing this way is slow in comparison. Considering the situation of Big Data, where we process multiple terrabytes of data, vtd parser enhances the performance considerably.)

Input data

Download the sample XML data set from ClinicalTrials.gov. (This site provides a registry and results database of publicly and privately supported clinical studies of human participants conducted around the world.) This data set has 500 XML files, which will be used for this exercise. To make sure that you have the expected XML files, sample XML content from this data set is shown below:XML pic

 

 

  1. Download and save the files as /tmp/clinicaltrial.zip.
  2. Create a temp directory and unzip the zip file into the temp directory
mkdir /tmp/clinicaltrial
unzip /tmp/clinicaltrial.zip –d /tmp/clinicaltrial
ls /tmp/clinicaltrial|wc –l

(This should return 500, there are 500 XML files in this directory)

hadoop fs –mkdir /user/biadmin/clinical_study_xml

(This creates the directory clinical_study_xml in the /user/biadmin directory)

hadoop fs –copyFromLocal /tmp/clinicaltrial/*.xml /user/biadmin/clinical_study_xml

(This copies all the XML files in /tmp/clinicaltrial directory into BigInsights)

Ingest to Big SQL 1.0

Create the following SQL commands to create a clinical_study_xml table in Big SQL 1.0. Copy the SQL code below into a file named clinical_study_xml.sql. This SQL statement creates a table with 10 fields with the corresponding XPath highlighted above.

[code language=”SQL”] drop table if exists clinical_study_xml;

create external table clinical_study_xml

url varchar(128),

org_study_id varchar(20),

secondary_id ARRAY<varchar(20)>,

nct_id varchar(20),

lead_sponsor STRUCT<agency:varchar(60), agency_class:varchar(20)>,

phase ARRAY<varchar(20)>,

study_type varchar(20),

enrollment_type varchar(20),

enrollment BIGINT,

condition ARRAY<varchar(20)>)

row format serde ‘com.ibm.spss.hive.serde2.xml.XmlSerDe’

with serdeproperties (

"xml.processor.class"="com.ximpleware.hive.serde2.xml.vtd.XmlProcessor",

"column.xpath.url" = "/clinical_study/required_header/url/text()",

"column.xpath.org_study_id" = "/clinical_study/id_info/org_study_id/text()",

"column.xpath.secondary_id" = "/clinical_study/id_info/secondary_id/text()",

"column.xpath.nct_id" = "/clinical_study/id_info/nct_id/text()",

"column.xpath.lead_sponsor" = "/clinical_study/sponsors/lead_sponsor/*",

"column.xpath.phase" = "/clinical_study/phase/text()",

"column.xpath.study_type" = "/clinical_study/study_type/text()",

"column.xpath.enrollment_type" = "/clinical_study/enrollment/@type",

"column.xpath.enrollment" = "/clinical_study/enrollment/text()",

"column.xpath.condition" = "/clinical_study/condition/text()"
)

stored as

inputformat ‘com.ibm.spss.hive.serde2.xml.XmlInputFormat’

outputformat ‘org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat’

location ‘/user/biadmin/clinical_study_xml’

tblproperties (

"xmlinput.start" = "<clinical_study",

"xmlinput.end" = "</clinical_study>"
);[/code]

Execute the following command to create the table in Big SQL 1.0.

$JSQSH_HOME/bin/jsqsh –user=biadmin –password=biadmin –input-file=clinical_study_xml.sql

Note that you need to replace the user and password values for your situation. The above command creates the table with the external table located at /user/biadmin/clinical_study_xml
Execute the following SQL statements to explore the table inside JSqsh:

$JSQSH_HOME/bin/jsqsh
\connect bigsql1

[biadmin] 1> select count(*) from clinical_study_xml;

+-----+

||

+-----+

| 500 |

+-----+

1 row in results(first row: 2.60s; total: 2.60s)

[bivm.ibm.com][biadmin] 1> select * from clinical_study_xml limit 2;

+------------+------------+------------+------------+------------+-----------+------------+------------+------+------------+

| url| org_study_ | secondary_ | nct_id| lead_spons | phase| study_type | enrollment | enro | condition|

|| id| id|| or||| _type| llme ||

|||||||||nt ||

+------------+------------+------------+------------+------------+-----------+------------+------------+------+------------+

| http://cli | P60| [P60AR0206 | NCT0000041 | {Boston| [N/A]| Interventi | Actual|242 | [Rheumatoi |

| nicaltrial | AR20613| 13,| 6| University || onal||| d|

| s.gov/show || NIAMS-012, || , Other}||||| Arthritis, |

| /NCT000004 || EEHRS||||||| Systemic|

| 16|| Project 3] ||||||| Lupus|

|||||||||| Erythemato |

|||||||||| sus,|

|||||||| || Osteoarthr |

|||||||||| itis,|

|||||||||| Knee,|

|||||||||| Ankylosing |

|||||||||| Spondyliti |

|||||||||| s]|

| http://cli | P60| [P60AR0363 | NCT0000041 | {Brigham| [Phase 2] | Interventi | [NULL]|150 ||

| nicaltrial | AR36308| 08]| 7| and|| onal||| Lupus|

| s.gov/show | NIAMS-014||| Women's||||| Erythemato |

| /NCT000004 |||| Other}||||| sus]|

| 17||||||||||

+------------+------------+------------+------------+------------+-----------+------------+------------+------+------------+

2 rows in results(first row: 0.31s; total: 0.32s)

[biadmin] 1> select enrollment_type, count(*) "Count" from clinical_study_xml group by enrollment_type;

+-----------------+-------+

| enrollment_type | Count |

+-----------------+-------+

| [NULL]|67 |

| Anticipated|226 |

| Actual|207 |

+-----------------+-------+

3 rows in results(first row: 2.61s; total: 2.61s)

[biadmin] 1> select enrollment from clinical_study_xml limit 5;

+------------+

| enrollment |

+------------+

|242 |

|150 |

|350 |

|350 |

|154 |

+------------+

5 rows in results(first row: 0.40s; total: 0.40s)

[biadmin] 1> select lead_sponsor from clinical_study_xml limit 5; +-------------------------------------------------+ | lead_sponsor| +-------------------------------------------------+ | {Boston University, Other}| | {Brigham and Women's Hospital, Other}| | {New York University School of Medicine, Other} | | {New York University School of Medicine, Other} | | {New York University School of Medicine, Other} | +-------------------------------------------------+ 5 rows in results(first row: 0.32s; total: 0.32s)

[biadmin] 1> select lead_sponsor.agency, lead_sponsor.agency_class from clinical_study_xml limit 5; +----------------------------------------+--------------+ | agency| agency_class | +----------------------------------------+--------------+ | Boston University| Other| | Brigham and Women's Hospital| Other| | New York University School of Medicine | Other| | New York University School of Medicine | Other| | New York University School of Medicine | Other| +----------------------------------------+--------------+ 5 rows in results(first row: 0.33s; total: 0.33s)

Ingest to Big SQL 3.0

Creating a table in Big SQL 3.0 requires a little syntactical difference from Big SQL 1.0. Create the following SQL commands to create a clinical_study_xml_3 table in Big SQL 3.0. Copy the following SQL code into a file named clinical_study_xml_3.sql.

[code language=”SQL”]drop table if exists clinical_study_xml_3;

create hadoop table clinical_study_xml_3
(
url VARCHAR(1000),
org_study_id VARCHAR(1000),
nct_id VARCHAR(100),
study_type VARCHAR(1000),
enrollment_type VARCHAR(100),
enrollment BIGINT
)
row format serde ‘com.ibm.spss.hive.serde2.xml.XmlSerDe’
with serdeproperties (
‘column.xpath.url’ = ‘/clinical_study/required_header/url/text()’,
‘column.xpath.org_study_id’ = ‘/clinical_study/id_info/org_study_id/text()’,
‘column.xpath.nct_id’ = ‘/clinical_study/id_info/nct_id/text()’,
‘column.xpath.study_type’ = ‘/clinical_study/study_type/text()’,
‘column.xpath.enrollment_type’ = ‘/clinical_study/enrollment/@type’,
‘column.xpath.enrollment’ = ‘/clinical_study/enrollment/text()’
)
stored as
inputformat ‘com.ibm.spss.hive.serde2.xml.XmlInputFormat’
outputformat ‘org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat’
location ‘/user/biadmin/clinical_study_xml’
tblproperties (
‘xmlinput.start’ = ‘<clinical_study’,
‘xmlinput.end’ = ‘</clinical_study>’
);[/code]

Execute the following command to create the table in Big SQL 3.0.

$JSQSH_HOME/bin/jsqsh – user=biadmin –password=biadmin –input-file=clinical_study_xml_3.sql

Note that you need to replace the user and password values for your situation. The command above creates the table with external table located at /user/biadmin/clinical_study_xml

Execute the following SQL statements to explore the table inside JSqsh:

$JSQSH_HOME/bin/jsqsh
\connect bigsql

Password: ********

WARN [State:]

[Code: 0]: Statement processing was successful.. SQLCODE=0, SQLSTATE=, DRIVER=3.67.33
[bigdata.ibm.com][bigsql] 1> select count(*) from clinical_study_xml_3;

+-----+

|1 |

+-----+

| 500 |

+-----+

1 row in results(first row: 10.57s; total: 10.57s)

[bigdata.ibm.com][bigsql] 1> select * from clinical_study_xml_3 fetch first 2 rows only;

+--------------------------------------------+---------------+-------------+----------------+-----------------+------------+

| URL| ORG_STUDY_ID| NCT_ID| STUDY_TYPE| ENROLLMENT_TYPE | ENROLLMENT |

+--------------------------------------------+---------------+-------------+----------------+-----------------+------------+

| http://clinicaltrials.gov/show/NCT01858792 | 115030| NCT01858792 | Observational| Actual|1 |

| http://clinicaltrials.gov/show/NCT01861561 | 125/2556(EC2) | NCT01861561 | Interventional | Anticipated|70 |

+--------------------------------------------+---------------+-------------+----------------+-----------------+------------+

2 rows in results(first row: 9.36s; total: 9.36s

[bigdata.ibm.com][bigsql] 1> select enrollment_type, count(*) "COUNT" from clinical_stu dy_xml_3 group by enrollment_type; +-----------------+-------+ | ENROLLMENT_TYPE | Count | +-----------------+-------+ | Actual|207 | | Anticipated|226 | | [NULL]|67 | +-----------------+-------+ 3 rows in results(first row: 9.30s; total: 9.30s)

Conclusion

This article has demonstrated how easily XML data can be ingested into BigInsights 3.0. The article also demonstrates how you can query the ingested XML data in Big SQL 1.0 and 3.0. We also showed queries for accessing Hive data types like UNION and STRUCT with Big SQL 1.0. Big SQL 3.0 will support complex data types (STRUCT, ARRAY) in the next release.

Note: This Article has been jointly authored by Vinayak Agrawal(vagrawal@us.ibm.com) and P S Aravind(psaravind@us.ibm.com)

1 comment on"Processing XML data in BigInsights 3.0"

Join The Discussion

Your email address will not be published. Required fields are marked *