A Scalable Data Transformation Framework using Hadoop Ecosystem
1. A Scalable Data Transformation Framework @ Penton using the Hadoop Ecosystem
Raj Nair, Director, Data Platform
Kiru Pakkirisamy, CTO
2. AGENDA
• About Penton and Serendio Inc
• Data Processing at Penton
• PoC Use Case
• Functional Aspects of the Use Case
• Big Data Architecture, Design and Implementation
• Lessons Learned
• Conclusion
• Questions
3. About Penton
• Professional information services company
• Provide actionable information to five core markets
Agriculture, Transportation, Natural Products, Infrastructure, Industrial Design & Manufacturing
Success Stories
• EquipmentWatch.com: Prices, Specs, Costs, Rental
• Govalytics.com: Analytics around Gov't capital spending down to the county level
• SourceESB: Vertical Directory, electronic parts
• NextTrend.com: Identify new product trends in the natural products industry
6. What got us thinking?
• Business units process data in silos
• Heavy ETL
– Hours to process, in some cases days
• Not even using all the data we want
• Not logging what we needed to
• Can’t scale for future requirements
7. The Data Processing Pipeline
[Diagram: assembly-line processing through the Data Processing Pipeline yields new features, new insights and new products, i.e. business value.]
8. Penton examples
• Daily Inventory data, ingested throughout the day
(tens of thousands of parts)
• Auction and survey data gathered daily
• Aviation Fleet data, varying frequency
Pipeline stages: Ingest/store → Clean/validate → Apply business rules → Map → Analyze → Report → Distribute
Slow Extract, Transform and Load = frustration + missed business SLAs
Won't scale for the future
Various data formats, mostly unstructured
9. What were our options?
• Adopt the Hadoop ecosystem (HBase, Drools)
– M/R: ideal for batch processing
– Flexible storage
– NoSQL: scale, usability and flexibility
• Expand RDBMS options (Oracle, SQL Server)
– Expensive
– Complex
11. Primary Use Case
• Daily model data – upload and map
– Ingest data, build buckets
– Map data (batch and interactive)
– Build Aggregates (dynamic)
Issue: Mapping time
16. Current Design
• Survey data is loaded as CSV files
• Data needs to be scrubbed/mapped
• All CSV rows are loaded into one table
• Once scrubbed/mapped, data is loaded into the main tables
• Not all rows are loaded; some may be used in the future
17. Key Pain Points
• The CSV data table continues to grow
• The large size of the table impacts operations on the rows of a single file
• CSV data could grow rapidly in the future
18. Criteria for New Design
• Ability to store an individual file and manipulate it easily
– No joins/relationships across CSV files
• Solution should have good integration with the RDBMS
• Could possibly host the complete application in the future
• Technology stack should ideally have advanced analytics capabilities
A NoSQL model would allow an individual file to be quickly retrieved, addressed and manipulated
20. Solution Architecture
[Architecture diagram: the existing business applications and a data upload UI make API calls to a REST API exposing CSV and rule management (Drools) endpoints; HBase on Hadoop HDFS stores the survey CSV files; MR jobs are launched for batch processing; accepted data is inserted and updates are pushed to the current Oracle schema, which remains the master database of products/parts.]
• Use HBase as the store for CSV files
• Data manipulation APIs exposed through a REST layer
• Drools for rule-based data scrubbing
• Operations on individual files in the UI through HBase Get/Put
• Operations on all/groups of files using MR jobs
21. HBase Schema Design
• One CSV row per HBase row, or
• One CSV file per HBase row
– One CSV cell per column qualifier (simple; development started with this approach)
– One CSV row per column qualifier (more performant approach)
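A rough sketch of the "one file per HBase row, one CSV row per column qualifier" layout with the classic (0.94-era) client API; the table name csv_files and the column family d are assumptions, not from the deck:

    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CsvFileStore {
        private static final byte[] CF_DATA = Bytes.toBytes("d");   // assumed mapping-data column family

        /** Store a whole CSV file as one HBase row: one CSV line per column qualifier. */
        public static void storeCsvFile(byte[] rowKey, List<String> csvLines) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "csv_files");            // assumed table name
            try {
                Put put = new Put(rowKey);
                for (int i = 0; i < csvLines.size(); i++) {
                    // qualifier = zero-padded line number, value = the raw CSV row text
                    put.add(CF_DATA, Bytes.toBytes(String.format("%08d", i)),
                            Bytes.toBytes(csvLines.get(i)));
                }
                table.put(put);
            } finally {
                table.close();
            }
        }
    }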
22. HBase Rowkey Design
• Row Key
– Composite
• Created Date (YYYYMMDD)
• User
• FileType
• GUID
• Salting for better region splitting
– One byte
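A minimal sketch of assembling such a salted composite key; deriving the one-byte salt from a hash of the GUID and using "|" as a separator are assumptions about the actual scheme:

    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeys {
        private static final int SALT_BUCKETS = 16;   // assumed number of salt buckets

        /** Composite key: [1-byte salt][createdDate YYYYMMDD]|[user]|[fileType]|[guid]. */
        public static byte[] buildRowKey(String createdDate, String user, String fileType, String guid) {
            byte salt = (byte) ((guid.hashCode() & 0x7fffffff) % SALT_BUCKETS);  // spreads writes across regions
            byte[] rest = Bytes.toBytes(createdDate + "|" + user + "|" + fileType + "|" + guid);
            byte[] key = new byte[1 + rest.length];
            key[0] = salt;
            System.arraycopy(rest, 0, key, 1, rest.length);
            return key;
        }
    }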
23. HBase Column Family Design
• Column Family
– Data separated from Metadata into two or more
column families
– One cf for mapping data (more later)
– One cf for analytics data (used by analytics
coprocessors)
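A minimal table-creation sketch with the mapping data and analytics data kept in separate column families; the family names d and a (and the table name) are assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateCsvTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTableDescriptor desc = new HTableDescriptor("csv_files");    // assumed table name
            desc.addFamily(new HColumnDescriptor("d"));   // cf for mapping data
            desc.addFamily(new HColumnDescriptor("a"));   // cf for analytics data (read by coprocessors)
            admin.createTable(desc);
            admin.close();
        }
    }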
24. M/R Jobs
• Jobs
– Scrubbing
– Mapping
– Export
• Schedule
– Manually from UI
– On schedule using Oozie
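As an illustration of how one of these jobs (say, the mapping job) could be wired up with TableMapReduceUtil; the mapper logic, table name and marker column are placeholders, not the actual Penton code:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;

    public class MappingJobDriver {

        /** Placeholder mapper: the real job would apply the Drools-driven mapping rules here. */
        public static class MappingMapper extends TableMapper<ImmutableBytesWritable, Put> {
            @Override
            protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
                    throws IOException, InterruptedException {
                Put put = new Put(rowKey.get());
                // ... derive mapped cells from 'row' ...; here we just stamp the row as processed
                put.add(Bytes.toBytes("d"), Bytes.toBytes("mapped"), Bytes.toBytes("true"));
                ctx.write(rowKey, put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "csv-mapping");        // Hadoop 1.x-style job setup
            job.setJarByClass(MappingJobDriver.class);

            Scan scan = new Scan();
            scan.setCaching(500);                          // scan rows in larger batches
            scan.setCacheBlocks(false);                    // don't fill the block cache from an MR scan

            TableMapReduceUtil.initTableMapperJob("csv_files", scan, MappingMapper.class,
                    ImmutableBytesWritable.class, Put.class, job);
            TableMapReduceUtil.initTableReducerJob("csv_files", null, job);   // identity write-back of Puts

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }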
25. Sqoop Jobs
• One time
– FileDetailExport (current CSV)
– RuleImport (all current rules)
• Periodic
– Lookup Table Data import
• Manufacture
• Model
• State
• Country
• Currency
• Condition
• Participant
26. Application Integration - REST
• Hide the HBase/Java APIs from the rest of the application
• Language independence for the PHP front-end
• REST APIs for
– CSV Management
– Drools Rule Management
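The deck does not name the REST framework; purely as a hypothetical sketch, a JAX-RS style endpoint that keeps the HBase client hidden behind an HTTP contract for the PHP front-end (the path, DAO and method names below are made up):

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;

    @Path("/csv")                                          // hypothetical CSV management endpoint
    public class CsvResource {

        /** Hypothetical DAO contract that wraps the HBase Get/Put calls. */
        public interface CsvStore {
            String fetchAsJson(String fileId);
        }

        private final CsvStore csvStore;

        public CsvResource(CsvStore csvStore) {
            this.csvStore = csvStore;
        }

        @GET
        @Path("/{fileId}")
        @Produces(MediaType.APPLICATION_JSON)
        public String getFile(@PathParam("fileId") String fileId) {
            // The PHP UI only sees this HTTP contract, never the HBase/Java client API
            return csvStore.fetchAsJson(fileId);
        }
    }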
28. Performance Benefits
• Mapping
– 20,000 CSV files, 20 million records
– Time taken: 1/3rd of the RDBMS processing time
• Metrics
– < 10 secs (vs. an Oracle Materialized View)
• Upload a file
– < 10 secs
• Delete a file
– < 10 secs
29. HBase Tuning
• Heap size for
– RegionServer
– MapReduce tasks
• Table compression
– SNAPPY for the column family holding CSV data
• Table data caching
– IN_MEMORY for lookup tables
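The two tuning settings above can be expressed on the column family descriptors roughly like this; family names are assumed, and the Compression import path shown is the pre-0.96 one matching the HBase version discussed here:

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.io.hfile.Compression;   // moves to io.compress in later HBase releases

    public class TuningExample {
        public static void main(String[] args) {
            HColumnDescriptor csvData = new HColumnDescriptor("d");        // cf holding bulky CSV payloads
            csvData.setCompressionType(Compression.Algorithm.SNAPPY);      // table compression

            HColumnDescriptor lookup = new HColumnDescriptor("l");         // cf for small lookup tables
            lookup.setInMemory(true);                                      // keep lookup data cached
        }
    }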
30. Application Design Challenges
• Pagination – implemented using the intermediate REST layer and scan.setStartRow (see the sketch after this list)
• Translating SQL queries
– Used Scan/Filter and Java (especially in coprocessors)
– No secondary indexes, so used FuzzyRowFilter
– Maybe something like Phoenix would have helped
• Some issues in mixed mode. Want to move to 0.96.0 for better/individual column family flushing, but would need to 'port' coprocessors (to protobuf)
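A sketch of the setStartRow-based paging, assuming the REST layer returns the last row key of each page so the next request can resume from it; PageFilter and the page-size handling here are illustrative, not necessarily what was built:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PageFilter;

    public class CsvFilePager {

        /** Fetch one page of rows, resuming from the key the previous page ended on. */
        public static List<Result> nextPage(HTable table, byte[] startRow, int pageSize) throws Exception {
            Scan scan = new Scan();
            if (startRow != null) {
                // resume from the previous page's last key (caller appends a 0x00 byte to avoid repeating it)
                scan.setStartRow(startRow);
            }
            scan.setFilter(new PageFilter(pageSize));     // caps rows per region, not globally

            List<Result> page = new ArrayList<Result>();
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    page.add(r);
                    if (page.size() >= pageSize) {        // enforce the global page size here
                        break;
                    }
                }
            } finally {
                scanner.close();
            }
            return page;
        }
    }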
31. HBase Value Proposition
• Better UI response for CSV file operations: operations within a file (map, import, reject, etc.) no longer depend on overall database size
• Relieves load on the RDBMS: no more CSV data tables
• Scales out batch processing performance on the cheap (vs. a vertical RDBMS upgrade)
• Redundant store for CSV files
• Versioning to track data cleansing
32. Roadmap
• Benchmark with 0.96
• Retire Coprocessors in favor of Phoenix (?)
• Lookup Data tables are small. Need to find a better alternative
than HBase
• Design the UI for a more Big Data appropriate model
– Search-oriented paradigm, rather than exploratory/paginative
– Add REST endpoints to support such a UI
34. Conclusion
• The PoC demonstrated
– the value of the Hadoop ecosystem
– co-existence of Big Data technologies with current solutions
– that adoption can significantly improve scale
– new skill requirements