Many systems capture XML data in Hadoop for analytical processing. When XML documents are large and have complex nested structures, processing them repeatedly is inefficient: parsing XML is CPU intensive, and storing XML in its native form wastes space. The problem is compounded in the Big Data space, where millions of such documents must be processed and analyzed within a reasonable time. This talk proposes an efficient method that leverages the Avro storage and communication format, which is flexible, compact, and built specifically for Hadoop environments to model complex data structures. XML documents can be parsed and converted into Avro format on load, then accessed via Hive using a SQL-like interface, Java MapReduce, or Pig. A concrete use case validates this approach, along with variations of the method and their relative trade-offs.
1. Efficient processing of large and complex XML documents in Hadoop
Sujoe Bose
Senior Principal, Sabre Holdings
June, 2013
2. Presentation Outline
§ Motivation
§ ETL vs. ELT
§ Avro Format
§ Mapping from XML to Avro
§ Interfaces to access Avro
§ Performance and Storage considerations
§ Other types of storage/processing formats
confidential
3. You will learn about …
§ A method to store and process complex XML data in Hadoop as Avro files
§ Interfaces to access and analyze data in Avro from Hive, Java and Pig
§ Variations of the method and their relative trade-offs in storage and processing
4. Motivation
§ Prevalence of XML and its derivatives
– Spurred by Web Services and SOA
– Preferred communication format until newer formats entered
– Data and logs represented in XML
§ XML – metadata combined with data
– Flexibility vs. Complexity
§ Could be arbitrarily nested and large
§ Volumes of documents – Big Data
5. Challenges
§ Parsing XML is CPU intensive
§ Certain parsers/parsing methods result in more memory consumption
§ Repeated parsing for each query
§ Large and deeply nested XML documents make the problem worse
§ Presence of tags in data results in high I/O due to storage size
§ Special handling of optional fields
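To make the tag-overhead point concrete, here is a minimal sketch that measures how much of a small FIXML-like document is markup rather than data. The sample document and its field names are invented for illustration, not taken from the talk's dataset:

```python
# Illustrative only: measure how much of an XML document is markup
# (tags, attribute names, quotes) versus actual data values.
import xml.etree.ElementTree as ET

doc = (
    '<Order Acct="ACC1001">'
    '<Instrmt Sym="IBM"/>'
    '<OrdQty Qty="150"/>'
    '<Px>12.5</Px>'
    '</Order>'
)

root = ET.fromstring(doc)
# Collect the raw data values: attribute values and element text.
values = []
for elem in root.iter():
    values.extend(elem.attrib.values())
    if elem.text and elem.text.strip():
        values.append(elem.text.strip())

data_bytes = sum(len(v) for v in values)
total_bytes = len(doc)
overhead = 1 - data_bytes / total_bytes
print(f"data: {data_bytes} bytes of {total_bytes} ({overhead:.0%} markup)")
```

Even on this tiny example the markup dominates the payload, which is the I/O cost the slide refers to; real FIXML documents with deep nesting fare worse.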
6. ETL vs. ELT
§ Hadoop generally built for EL – T
– aka Schema-on-Read
– Load as-is
– Transform on Access/Query
§ Compare with Data Warehouse ETL
– aka Schema-on-Write
– Transform and Load
– Queries are a lot simpler
– Transformation and cleansing done a priori
7. Mix of ETL and ELT
ELT:
§ Generally better in flexibility
§ More suitable for simpler and well-defined formats
§ More applicable for experimentation
§ XML data parsed on demand for every query
ETL:
§ Generally better in performance
§ More suitable when substantial cleansing and reformatting is needed
§ Repetitive queries and production workloads
§ XML data pre-parsed to minimize resource usage
11. XML Pre-parsing
§ Nested Elements and Attributes
§ Representation of parsed XML structure
§ Enter Avro!
12. Avro
§ Data serialization system
§ Specifically designed for Hadoop, but also used in other environments
§ Rich data structures: Arrays, Records, Maps, etc.
§ Compact, fast, binary data format
§ Metadata stored at file level – not record level
§ Splittable – ideal for MapReduce
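As a concrete illustration of those rich data structures, here is a hypothetical Avro schema (Avro schemas are JSON documents) for a simplified securities order. The field names are invented for illustration; the talk's actual FIXML mapping is richer. Note how an optional field maps naturally to a union with "null":

```python
# Hypothetical Avro schema for a simplified securities order.
# Field names are illustrative, not the talk's actual FIXML mapping.
import json

order_schema = json.loads("""
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "Acct", "type": "string"},
    {"name": "Sym",  "type": "string"},
    {"name": "Qty",  "type": ["null", "long"], "default": null},
    {"name": "Legs", "type": {"type": "array", "items": {
        "type": "record",
        "name": "Leg",
        "fields": [
            {"name": "Side", "type": "string"},
            {"name": "Px",   "type": "double"}
        ]}}}
  ]
}
""")

print([f["name"] for f in order_schema["fields"]])
```

Because this schema travels with the Avro file (stored once at file level, per the bullet above), each record carries only its values, not its field names.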
13. Avro APIs
§ Generic Objects and Pre-generated Objects
– Easy API including simple gets and puts
§ APIs in several languages
– Java
– C#
– C/C++
– Python
– Ruby
14. Use-case
§ FIXML – Financial Information eXchange
– http://www.fixprotocol.org/specifications/
§ XML Database Benchmark
– http://tpox.sourceforge.net/
§ Provides sample data for benchmarking
§ Data Generator for generating large and predictable datasets
15. FIXML
§ XML Data Generator
– http://tpox.sourceforge.net/tpoxdata.htm
§ Order: buy and sell orders of securities
16. Simple mapping

XML                                          Avro     Pig
Elements with repeated nested elements       Array    Bag
Elements with attributes and text elements   Record   Tuple
Attributes and text elements                 Field    Field
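The mapping above can be applied mechanically: repeated child elements become arrays, elements with attributes or children become records, and lone attributes or text become plain fields. A minimal, hypothetical converter along those lines (the recursion strategy and sample document are illustrative, not the talk's actual tool):

```python
# Illustrative XML-to-nested-structure converter following the mapping
# above: repeated elements -> array (list), element with attributes or
# children -> record (dict), attribute or text -> plain field.
import xml.etree.ElementTree as ET
from collections import defaultdict

def xml_to_record(elem):
    children = defaultdict(list)
    for child in elem:
        children[child.tag].append(xml_to_record(child))
    if not children and not elem.attrib:
        return (elem.text or "").strip()      # leaf -> plain field
    record = dict(elem.attrib)                # attributes -> fields
    for tag, items in children.items():
        # single occurrence -> nested record/field, repeated -> array
        record[tag] = items[0] if len(items) == 1 else items
    if elem.text and elem.text.strip():
        record["_text"] = elem.text.strip()
    return record

doc = ('<Order Acct="ACC1001">'
       '<Leg Side="Buy" Px="12.5"/><Leg Side="Sell" Px="13.0"/>'
       '<Sym>IBM</Sym></Order>')
rec = xml_to_record(ET.fromstring(doc))
print(rec)
```

One caveat this sketch exposes: without a schema, a single occurrence of an element is ambiguous (record vs. one-element array), which is one reason to drive the conversion from an XML Schema rather than from instance documents.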
19. Avro – Access Methods
§ Direct support for access from Hive (using SerDe)

CREATE EXTERNAL TABLE <TableName>
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'location-of-avro-files'
TBLPROPERTIES ('avro.schema.url'='location-of-schema-file.avsc')

§ Access via Pig – AvroStorage
§ Avro API – Java MapReduce
20. Test Data
§ Base Securities Order file: 500,000 records
§ Replicated for volume
– 15x – 7.5 million records
– 30x – 15 million records
– 45x – 22.5 million records
– 60x – 30 million records
– 75x – 37.5 million records
24. Test Environment
§ 18 nodes
§ Node configuration:
– 12 cores per node
– 48 GB memory
– 36 TB with 12 disks of 3 TB each
§ CDH 4.1.2
25. Sample Query
§ Security Orders per Account

order_records = LOAD '$AVRO_INPUT' USING AVRO_LOAD AS (
    ------- Pig schema goes here ----------
);
order_projection = FOREACH order_records GENERATE
    Order.Acct AS Account,
    Order.OrdQty.Qty AS Quantity;
order_group = GROUP order_projection BY Account;
order_count = FOREACH order_group GENERATE group, SUM(order_projection.Quantity);
STORE order_count INTO '$PIG_OUTPUT' USING PigStorage(',');
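For readers unfamiliar with Pig Latin, the GROUP/SUM pipeline above is a plain group-by aggregation: total ordered quantity per account. The same logic in Python, on invented sample records:

```python
# The Pig script above is a group-by-sum: total ordered quantity per
# account. Equivalent aggregation in plain Python on invented records.
from collections import defaultdict

orders = [
    {"Acct": "ACC1001", "Qty": 100},
    {"Acct": "ACC1002", "Qty": 250},
    {"Acct": "ACC1001", "Qty": 50},
]

qty_per_account = defaultdict(int)
for order in orders:
    # Mirrors GROUP order_projection BY Account; SUM(...Quantity)
    qty_per_account[order["Acct"]] += order["Qty"]

print(dict(qty_per_account))   # {'ACC1001': 150, 'ACC1002': 250}
```

The difference in the pre-parsed approach is only where the records come from: the Avro loader hands Pig ready-made tuples, while the on-demand variant must re-parse the XML to produce them on every run.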
26. Run Types
§ Pre-parsed approach:
– XML to Avro materialization: xml-to-avro
• XML to Avro is run only once on the data
– Avro to Pig via UDF: avro-to-pig
§ Parse on demand
– XML parsing using Pig UDF: xml-to-pig
27. Run time in seconds
[Chart comparing run times: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), analysis on parsed XML (Avro to Pig)]
28. CPU Usage Comparison
[Chart comparing CPU usage: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), analysis on parsed XML (Avro to Pig)]
29. Memory Usage Comparison: Total Memory Used (GB)
[Chart comparing memory usage: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), analysis on parsed XML (Avro to Pig)]
30. Results
§ Analysis on pre-parsed data compared to raw XML:
– Runtime reduced by more than 50%
– Memory and CPU consumption reduced by about 50%
§ Pre-parsing stage takes more resources and time than on-demand parsing
§ Repetitive queries will benefit from one-time pre-parsing
31. Caveats
§ Not all fields were extracted from the XML input (optional elements)
§ Challenge in keeping up with versions/changes of the XML
§ Performance numbers can depend on the type of data and the mapping used
32. Alternatives
§ Formats other than Avro may be more suitable
§ Record Columnar formats (RC Files & ORC Files)
§ Trevni: a column file format supporting Avro
§ Parquet: another columnar storage format for Hadoop
33. Motivation for Columnar Format
§ MapReduce capability
§ Column projections reduce I/O
§ Column compression due to similarity of data further reduces I/O
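A toy sketch of why projections cut I/O: with a row-oriented layout, a query touching one column still reads every row in full, while a columnar layout reads only that column's bytes. The record shape and sizes below are illustrative, not measurements from the talk:

```python
# Toy comparison of bytes "read" for a single-column projection under
# row-oriented vs column-oriented layout. Sizes are illustrative.
rows = [{"Acct": "ACC%04d" % i, "Sym": "IBM", "Qty": i} for i in range(1000)]

def size(value):
    return len(str(value))

# Row layout: projecting Qty still scans every full row.
row_bytes = sum(size(v) for r in rows for v in r.values())
# Column layout: only the Qty column is read.
col_bytes = sum(size(r["Qty"]) for r in rows)

print(row_bytes, col_bytes)
```

Compression compounds this: values within one column (all account IDs, all symbols) are far more self-similar than interleaved row data, so per-column compression shrinks the scanned bytes further still.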
34. Summary
§ Materialized version well-suited for repeated queries
§ For ad-hoc/experimental queries, parse-on-demand is better
§ Mapping from XML to Avro can be automated
§ Hive, Pig and MapReduce interfaces to access Avro files
§ Relative trade-offs between flexibility and performance/storage