2. Syncsort Confidential and Proprietary - do not copy or distribute
Agenda
Hadoop Evolution
Use Cases
The Hadoop Ecosystem, from open source to vendor solutions
Tooling, implementation and skillset challenges
Real-World Case Studies
Future of Hadoop
Q&A
Our Guest – Chida from OpenOsmium
20+ years of Enterprise Application Development Experience Focused on Big
Data & Cloud
Founder of Big Data Solution Provider – OpenOsmium
DC Tech Community Organizer of Meetups
– Google Developer Group, Tech Breakfast, NoVA Hadoop User Group
Open Source, Big Data and Cloud Advocate
703-568-7426, chida@openosmium.com
Evolution of Hadoop – Data Volumes are Growing
Evolution of Hadoop – Key Events
Timeline: 2000, 2004, 2008, 2010, 2012, 2013, Next?
Search Engine Problem @ Google
3 White Papers: GFS, MapReduce, BigTable
– MapReduce: Simplified Data Processing on Large Clusters
Yahoo!: HDFS, MapReduce, HBase
Cloudera, MapR, Hortonworks
Hadoop 2.0
Why Hadoop As a Data Management Platform?
The Reliability of a Mainframe, The
Massive Performance at Scale of an
MPP appliance, The Storage
Capacity of a SAN, All at a
Disruptively Low Price Point
The Economics of Data
Cost of managing 1TB of data
Mainframe: $20,000 – $100,000
EDW: $15,000 – $80,000
Hadoop: $250 – $2,000
Scalability
Performance
Reliability
Agility
Skills Supply
But there’s more…
Hadoop - The Big Picture
Unified computation
provided by
MapReduce
distributed computing
framework
Unified storage
provided by
distributed file
system called HDFS
Commodity
Hardware
Hardware contains
bunch of disks and
cores
Physical
Logical
Storage
Computation
MapReduce – Football Stadium Analogy
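The analogy on this slide is an image, so what follows is an assumption about its content: each stadium section counts its own fans (map), and the per-section tallies are then added up by team (reduce). A minimal single-process Python sketch of that idea, with hypothetical data and function names; a real Hadoop job spreads these steps across the cluster.

```python
from collections import defaultdict

# Hypothetical input: each stadium section holds a list of fans, tagged by team.
sections = {
    "north": ["home", "home", "away"],
    "south": ["away", "home"],
    "east": ["home"],
}

def map_section(fans):
    """Map step: one section emits a (team, 1) pair per fan."""
    return [(team, 1) for team in fans]

def shuffle(pairs):
    """Shuffle step: group all emitted values by key (team)."""
    grouped = defaultdict(list)
    for team, count in pairs:
        grouped[team].append(count)
    return grouped

def reduce_counts(grouped):
    """Reduce step: sum the values collected for each team."""
    return {team: sum(counts) for team, counts in grouped.items()}

def count_fans(all_sections):
    """Run map over every section, then shuffle and reduce the results."""
    pairs = [pair for fans in all_sections.values() for pair in map_section(fans)]
    return reduce_counts(shuffle(pairs))

totals = count_fans(sections)
print(totals)
```

Each section can be counted independently of the others, which is what lets the map step run in parallel on separate nodes.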
Hadoop Use Cases
Data Lake
Offload Mainframe Data
& Batch Workloads
Machine Data
Cyber Security
Fraud Detection
Offload ELT from Data Warehouse
Clickstream / Weblogs, EMR
Social Media Data
Geospatial Analysis
Video and Audio Analytics
Real-Time Processing
Predictive Analytics
Unstructured Data
Active Archive
Multi-media
Leverage “Dark Data”
Sentiment Analysis
Enterprise Data Hub
Hadoop Use Cases
A Roadmap for Hadoop Success
– Offload batch & ELT workloads from
data warehouse and mainframe
systems into Hadoop
– Develop an active archive, shed
light on dark data
– Build your Enterprise Data Hub
(Data Lake!)
– Leverage new data sources
– Extend BI with data discovery &
exploration
– Deliver next-generation analytics
Sample Use Case: Offload
Phase I: Identify
• Identify data & workloads most suitable for offload
• Focus on those that will deliver maximum savings & performance
Phase II: Offload
• Access and move virtually any data to Hadoop with one tool
• Easily replicate existing workloads in Hadoop using a graphical user interface
Phase III: Optimize & Secure
• Deploy and optimize the new environment
• Manage & secure all your data with business class tools
Phase 2: Deliver ‘Next-generation’ Applications
Advanced – ‘Next-gen’ – Applications for Hadoop
– Semi-structured data analytics
• Clickstream/Weblog, Electronic Medical Records
– Unstructured data analytics
• video, audio, documents, text, social
• Predictive modeling
– Geospatial analysis
– Real-Time Processing
Use Cases Across Industries
Vertical Refine Explore Enrich
Retail & Web
• Log Analysis/Site Optimization
• Loyalty Program Optimization
• Brand and Sentiment Analysis
• Market basket analysis
• Dynamic Pricing
• Session & Content Optimization
• Product recommendation
Telco • Customer profiling • Equipment failure prediction • Location based advertising
Government • Threat Identification • Person of Interest Discovery • Mission work
Finance
• Risk Modeling & Fraud Identification
• Trade Performance Analytics
• Surveillance and Fraud Detection
• Customer Risk Analysis
• Real-time upsell, cross sales marketing offers
Energy
• Smart Grid: Production Optimization
• Grid Failure Prevention
• Smart Meters
• Individual Power Grid
Manufacturing • Supply Chain Optimization • Customer Churn Analysis
• Dynamic Delivery
• Replacement parts
Healthcare
• Electronic Medical Records (EMPI)
• Clinical decision support
• Clinical Trials Analysis
• Insurance Premium Determination
IMPLEMENTATION & SKILLSET
CHALLENGES
Overview of Hadoop Challenges
Hardware??
Skills??
Training??
Rapid change of Hadoop Ecosystem?
Example 1 - ETL in Hadoop
COLLECT (HARD)
• FS Shell Put Command
• Flume
• Sqoop
PROCESS (HARDER)
• Operations: Sort, Join, Aggregate, Copy, Merge
• Tools: Pig, HiveQL, Java
DISTRIBUTE (HARD)
• Sqoop
• FS Shell Get Command
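To make the PROCESS stage concrete, here is a minimal sketch of join, aggregate and sort over in-memory records. It is plain Python standing in for what Pig, HiveQL or Java MapReduce would do at scale; the record layouts and function names are hypothetical and not taken from the slides.

```python
from collections import defaultdict

# Hypothetical records, as if already collected into the cluster.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 40.0},
    {"order_id": 2, "customer_id": "c2", "amount": 15.5},
    {"order_id": 3, "customer_id": "c1", "amount": 4.5},
]
customers = {"c1": "East", "c2": "West"}  # customer_id -> region

def join_orders(order_rows, customer_regions):
    """Join: attach each order's region via the customer_id key."""
    return [dict(o, region=customer_regions[o["customer_id"]]) for o in order_rows]

def aggregate_by_region(rows):
    """Aggregate: total order amount per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

def sort_desc(totals):
    """Sort: regions ordered by total amount, largest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

result = sort_desc(aggregate_by_region(join_orders(orders, customers)))
print(result)
```

Even this toy version needs three hand-written functions, which hints at why the slide rates the hand-coded PROCESS stage as the hardest of the three.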
Images: http://monkeestv.tripod.com/BatMonkee/
Perception: Just Call the Mainframe Guy…
Example 2 – Mainframe Data Ingestion
Reality
Example 2 – Mainframe Data Ingestion
Every Change = Time, Cost
Call MF Guy (repeated for every change):
– SMS Compression
– DB Tables, Flat Files
– Filtering, Reformatting
– Copy, Sort, Join, Aggregation
– EBCDIC to ASCII
– Cobol copybooks
Image: bottletales.com
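As a rough illustration of why the "EBCDIC to ASCII" and "Cobol copybooks" steps need mainframe knowledge, the sketch below decodes one fixed-width record. The two-field layout is a hypothetical stand-in for a copybook, not one from the talk; `cp037` is the EBCDIC (US/Canada) codec shipped with Python, and the packed-decimal routine assumes the common COMP-3 convention of a trailing sign nibble (0xD for negative).

```python
# Hypothetical layout standing in for a COBOL copybook:
#   NAME     PIC X(5)          -> bytes 0-4, EBCDIC text
#   BALANCE  PIC S9(5) COMP-3  -> bytes 5-7, packed decimal

def unpack_comp3(raw):
    """Decode packed decimal: two digits per byte, sign in the last nibble."""
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)       # last byte holds one digit...
    sign_nibble = raw[-1] & 0x0F      # ...and the sign
    value = int("".join(str(d) for d in digits))
    return -value if sign_nibble == 0x0D else value

def parse_record(record):
    """Slice one record by the assumed layout and convert each field."""
    name_raw = record[0:5]
    balance_raw = record[5:8]
    return {
        "name": name_raw.decode("cp037").rstrip(),
        "balance": unpack_comp3(balance_raw),
    }

# "SMITH" encoded as EBCDIC, followed by +12345 as packed decimal (12 34 5C).
sample = "SMITH".encode("cp037") + bytes([0x12, 0x34, 0x5C])
parsed = parse_record(sample)
print(parsed)
```

A plain byte-for-byte character conversion would corrupt the packed field, which is why the record layout has to be known before any conversion happens.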
Big Data Team
Senior Linux/Unix Admin → Hadoop Administrators
Infrastructure Engineers
Java Developers → Hadoop Developers
Object Oriented Developers → Hadoop Developers
Data Analysts
Functional Users → Hadoop Analytics Users
Project Managers!
Chief Data Officer
Executive Management
Enterprise Adoption Approach
Agile
Ideal Use Case for the company
Proof-of-concept or Pilot
Tech Heavy
Aware of Available Options – Many..
Work with Solution Architects
Infrastructure Analysis
Security Options
Testing.. Testing..
Integrating with current Stack
Cost.. Cost..
Promises Vs Reality
THE HADOOP ECOSYSTEMS –
FROM OPEN SOURCE TO VENDOR TOOLS
Vendor Landscape
Distributions / Platforms
Data Integration/ETL
Search
Document Store
Database / Data Warehouse
Social Operational
XML Database
Graphs
Understanding Mainframe Data at Major US Bank
Customer hit a wall after months of manual effort migrating Mainframe data
Before: Manual Effort (86-page copybook: ? weeks)
• Difficult to find data errors. No Mainframe application logic that matches Copybook
• Large and complex Copybooks
• Depends on Mainframe team to provide data
• Very manual-intensive; inadequate documentation
• Not scalable. Only a few Java + Mainframe experts could do the work
After: DMX-h + CDH (86-page copybook: 4 hrs)
• Easy to validate Copybooks and find data errors
• Ability to pull data directly from Mainframe without relying on Mainframe team
• No coding. No scripting. Easier to document, maintain & reuse
• Enables developers with a broader set of skills to build complex migration jobs.
Social Security Administration
The Challenge:
– The SSA has an expensive problem with fraudulent claims for benefits,
and they need more and better data to prevent and punish that fraud.
The Office of the Inspector General for the SSA reports that:
– “Nationally, in Fiscal Year 2011, there were more than 103,000
allegations of Social Security fraud, with more than 7,000 criminal
investigations resulting in 1,374 convictions and more than $410 million
in recoveries, fines, restitution, judgments, settlements, and savings.”
Why Hadoop?
– Data Processing Time – a job that took 30 hrs on the mainframe completed in
2 hrs on the PoC cluster
– Accuracy – Obituary data from social media is likely more accurate than the
current death file
Optimizing the EDW at Large Teradata Customer
• Offload ELT processing from Teradata into
CDH using DMX-h
• Implement flexible architecture for staging
and change data capture
• Ability to pull data directly from Mainframe
• No coding. Easier to maintain & reuse
• Enable developers with a broader set of skills
to build complex ETL workflows
Elapsed Time (min): HiveQL 360 min, DMX-h 15 min
Development Effort (Weeks): DMX-h 4 man-weeks, HiveQL 12 man-weeks
Impact on Loans Application Project:
Cut development time to 1/3 (from 12 to 4 man-weeks)
Reduced complexity. From 140 HiveQL scripts to
12 DMX-h graphical jobs
Eliminated need for Java user defined functions
24x faster!
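The headline numbers follow directly from the chart values. A quick check in Python, using only figures quoted on this slide (the variable names are illustrative):

```python
# Figures quoted on the slide.
hiveql_minutes = 360
dmxh_minutes = 15
hiveql_weeks = 12
dmxh_weeks = 4
hiveql_scripts = 140
dmxh_jobs = 12

# Elapsed-time speedup of DMX-h over HiveQL.
speedup = hiveql_minutes / dmxh_minutes
# DMX-h development effort as a fraction of the HiveQL effort.
effort_ratio = dmxh_weeks / hiveql_weeks
# Average number of HiveQL scripts replaced by one graphical job.
scripts_per_job = hiveql_scripts / dmxh_jobs

print(f"Speedup: {speedup:.0f}x")
print(f"Development effort: {effort_ratio:.0%} of the HiveQL effort")
print(f"About {scripts_per_job:.1f} HiveQL scripts per DMX-h job")
```

The 360 and 15 minute figures give the 24x speedup, and 4 of 12 man-weeks means development effort falls to one third of the HiveQL figure.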
Video - Placemeter
http://vimeo.com/69091237
What to do next
No one is impartial, but it’s still worth talking to:
– Vendors
– Industry Analysts
– Industry Peers
– People at Meetups
– Practitioners like Chida