Progressive Insurance is well known for its innovative use of data to better serve its customers, and for the important role that Hortonworks Data Platform has played in that transformation. However, as with most things worth doing, the path to the Data Lake was not without its challenges. In this session, I’ll share our top use cases for Hadoop, including telematics and display ads; how a skills shortage turned supporting these applications into a nightmare; and how – and why – we now use Syncsort DMX-h to accelerate enterprise adoption by making it faster and easier to populate the data lake, and keep it up to date, with data from across the enterprise. I’ll discuss the different approaches we tried, the benefits of using a tool vs. open source, and how we created our Hadoop Ingestor app using Syncsort DMX-h.
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to Increase Enterprise Adoption
1. Journey to the Data Lake:
How Progressive Paved a Faster, Smoother
Path to Increase Enterprise Adoption
Krishna Potluri – Big Data Tech Lead, Progressive
Tendü Yoğurtçu (@TenduYogurtcu) – CTO, Syncsort
2. Data Liberation, Integrity & Integration for Next-Generation Analytics
Trusted Industry Leadership
Marquee global customer base of leaders and emerging businesses across all major industries
We provide unique data management solutions and expertise to over 2,500 large enterprises worldwide, with an unmatched focus on customer success & value
Best Quality, Top Performance, Lower Costs
Our proven software efficiently delivers all critical enterprise data assets, with the highest integrity, to Big Data environments on premise or in the cloud
Highly Acclaimed & Award Winning
• Data Quality “Leader” in Gartner Magic Quadrant
• IT World Awards® 2016 “Innovations in IT” Gold Winner
• Database Trends & Applications “Companies That Matter Most in Data”
3. Data Access & Transformation
• Mainframe Access & Integration for Application Data
• Mainframe Access & Integration for Machine Data
• High-Performance ETL
Data Infrastructure Optimization
• Enterprise Data Warehouse Optimization
• Application Modernization
• Mainframe Optimization
Data Quality
• Big Data Quality & Integration
• Data Enrichment & Validation
• Data Governance
• Customer 360
4. Syncsort + Hortonworks Solution
Enabling the Enterprise Data Lake
Fast. Secure. Enterprise Grade.
ETL Onboarding:
• Free up your budget: dramatically reduce EDW costs and avoid continual upgrades
• Get fresher data: from right-time to real-time
• Keep all data as long as you want: from weeks to months, years and beyond
• Optimize your data warehouse: faster database queries for faster analysis
• Blend new data, fast: from structured to unstructured, from mainframe to the Cloud
Mainframe Access & Integration:
• Deliver powerful new insights: combine mainframe data with Big Data
• Securely access mainframe data when you need it: directly access and translate mainframe data
• Reduce storage costs: go from $100K/TB to $2K/TB by migrating data to HDFS
• Use your MIPS wisely: save an average of up to $7K per MIPS when offloading batch workloads to Hadoop
5. Syncsort + Hortonworks Reference Architecture
• Apache Ambari Integration
• Deploy DMX-h across cluster
• Monitor DMX-h jobs
• Process in MapReduce or Spark
• Source relational and non-relational data (including mainframes)
• Syncsort DMX-h & HDF – batch and real time
• Out-of-the-box integration, interoperability &
certifications
• Kerberos-secured clusters
• Apache Sentry/Ranger security certified
• Early beta, release certification
• Metadata lineage export from DMX
• Atlas integration
6. Our Strategy: Simplify Big Data Integration
• Deploy on premise or in the cloud
• Choose among multiple execution frameworks – Hadoop, Spark, Spark 2.0,
Linux, Unix, Windows
• Integrate streaming and batch data with a single data pipeline for
innovative applications, like IoT
• Future-proof applications to avoid rewriting jobs when taking advantage of innovations in new execution frameworks
• Access and integrate ALL enterprise data sources – including mainframe –
for advanced analytics
7. Journey to the Data Lake:
How Progressive Paved a Faster, Smoother
Path to Increase Enterprise Adoption
Krishna Potluri
Big Data Tech Lead
12. 1. Custom ETL in Pig and Hive (a sketch follows this list):
• Tied up resources
• Maintenance and knowledge transfer issues
2. Ingestion projects:
• No consistency in code
• Reconciliation and Balancing Issues
• Each developer had their own pattern
3. Projects were queued up, and again we had support issues
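For a sense of what the hand-rolled approach looked like, here is a hypothetical per-table Hive script of the kind each developer wrote in their own style (table names, columns, and paths are invented for illustration):

```sql
-- One developer's ad-hoc pattern: land a delimited extract in a staging
-- table, then rewrite it as ORC. Every table had its own slight variant.
CREATE TABLE staging_claims (claim_id BIGINT, status STRING, premium DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

LOAD DATA INPATH '/landing/claims/2016-06-28' INTO TABLE staging_claims;

CREATE TABLE claims STORED AS ORC AS SELECT * FROM staging_claims;
```

Multiply this by hundreds of tables and dozens of developers, each with a different variant, and the maintenance and reconciliation problems above follow.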
13. Ingestion/ETL Tool Selection
Requirements:
• Industrial strength
• Runs natively on Hadoop
• Supports all data sources
• Usable by non-Java programmers
• Security/Cloud/IDE
• Pricing/Stability
• Enterprise support
• Ability to customize frameworks
• Keeps pace with Hortonworks
• Fits with internal Build and Elevate process
Tools evaluated:
• Syncsort
• Talend
• Informatica
• Actian
• CDAP
• Cascading
Selection: Syncsort
15. Syncsort Implementation
[Architecture diagram: a Hortonworks cluster with a name node, secondary name node, and data nodes, with DMX-h deployed on every data node; edge nodes run the DMX-h daemon. Behind a firewall, developer machines run the DMX-h client for interactive work, and a build machine deploys jobs from TFS to the non-prod and prod environments.]
16. Sqoop vs. Syncsort DMX-h

                            Sqoop                  Syncsort DMX-h
Connectivity                Database driver JARs   DataDirect
In-memory transformations   Not supported          Supported
Parallelism                 Supported              Supported
Edge-node ingestion         Not supported          Supported
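For comparison, a minimal Sqoop import of a single table looks roughly like this (the connection string, credentials, table, and paths are hypothetical); each table needs its own invocation, plus a matching JDBC driver JAR on Sqoop's classpath:

```bash
# Minimal Sqoop import of one table into HDFS (illustrative values only).
# Requires the matching JDBC driver JAR on Sqoop's classpath.
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=policy" \
  --username etl_user -P \
  --table claims \
  --target-dir /data/raw/claims \
  --split-by claim_id \
  --num-mappers 4
```

The --split-by/--num-mappers pair is what gives Sqoop its parallelism; in-memory transformations, however, have to happen in a separate Hive or Pig step afterward, which is the gap the table above highlights.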
17. Choosing the Right Interface for the Job

                   Syncsort DMX-h DTL   Syncsort DMX-h GUI
Complexity         High                 Low
Development Time   High                 Medium
Flexibility        High                 Medium
Reuse              High                 Medium
19. CDC Patterns – Syncsort DMX-h DTL
[Data-flow diagram: source data arrives via one of three extract patterns – full extract (no CDC), incremental extract, or CDC on the source – into a Hive staging area (text format, 7-day rolling partitions). DMX-h DTL jobs then apply inserts/updates/deletes (I/U/D) to a daily-partitioned raw history table in Hive (ORC), and inserts/updates (I/U) to a current-view target table in Hive (ORC).]
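As a rough sketch of the current-view half of this pattern, with plain HiveQL standing in for the DMX-h DTL jobs (table and column names such as raw_history_claims and change_op are hypothetical):

```sql
-- Illustrative HiveQL only: rebuild the current view by keeping the newest
-- version of each business key from the daily-partitioned raw history,
-- and dropping rows whose last change was a delete.
INSERT OVERWRITE TABLE current_claims
SELECT claim_id, status, premium, change_ts
FROM (
  SELECT h.*,
         ROW_NUMBER() OVER (PARTITION BY claim_id
                            ORDER BY change_ts DESC) AS rn
  FROM raw_history_claims h
) latest
WHERE rn = 1
  AND change_op <> 'D';  -- change_op is the I/U/D flag from the CDC extract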
23. Hadoop Ingestor + Syncsort DMX-h (Benefits)
• One code base for ingestion
• Choice of CDC patterns
• Every table is ingested the same way
• Jobs don’t need to be recoded for new frameworks – DMX-h shields us
• Every table is reconciled and balanced (see the sketch below)
• Capture metrics around run time and row counts
• Ingestion time remains the same no matter the table count
“Ingestion has gone from days to hours, and Progressive IT has a single entry point and code base for Data Lake ingestion.”
– Ingestion Team
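As a sketch of what "reconciled and balanced" can mean in practice (the ingestion_audit table and its columns are hypothetical, not Progressive's actual schema), a balancing check compares the row count captured at extract time against the count loaded into Hive:

```sql
-- Hypothetical balancing query: flag any table/run where the row count
-- loaded into Hive does not match the count captured at extract time.
SELECT table_name,
       source_row_count,
       target_row_count,
       source_row_count - target_row_count AS delta
FROM ingestion_audit
WHERE run_date = '2016-06-28'
  AND source_row_count <> target_row_count;
```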
25. Learn More!
Visit us at Booth #1102
Get a demo & pick up your Data Liberator t-shirt!
Editor's Notes
We focus on two main use cases:
ETL or ELT onboarding – moving data processing from expensive platforms like Teradata to Hadoop
One of our biggest use cases is what we call mainframe access and integration
Syncsort/Hortonworks reference architecture: DMX-h is deployed by Ambari on every node, handling data movement and transformation in MapReduce or Spark.
Syncsort’s data integration has always delivered the ability to process large data volumes in less time, with fewer resources. However, performance and efficiency are just our starting points. It became apparent in speaking with our customers a few years ago – particularly when “Big Data” and Hadoop took off – that they were facing new challenges with a common theme of complexity.
The rapid evolution of Big Data technologies presents several challenges on its own:
New technologies require new specialized skills that continue to be in short supply – and are very expensive if you can find them
Because execution frameworks continue to be improved, customers don’t want to feel locked in. They don’t want to have to redevelop all their jobs if they want to take advantage of innovative new frameworks. A great example is MapReduce v1 to MapReduce v2 to Spark.
New sources and types of data – streaming sources being a recent example – add to the complexity as well, bringing both connectivity and skills challenges.
Many of our customers are large enterprises that still rely significantly on the mainframe – and these companies found it very difficult to bring the mainframe into the rest of their Big Data integration strategy.
Organizations were building data lakes but then struggling to fill them. We heard from customers and partners alike that ingesting data from all of their enterprise sources – mainframe, data warehouse, etc. – into Hadoop was a big problem to solve.
So, our product strategy has focused not only on delivering data integration products with exceptional performance, efficiency, and lower TCO, but also on simplifying the data integration process for all enterprise data sources and across all platforms – Linux, Unix, Windows, Hadoop, Spark – on premise or in the cloud.