Lessons Learned from Migration of a Large-analytics Platform from MPP Databases to Hadoop YARN
1. Hadoop Summit 2014 – Srinivas Nimmagadda & Roopesh Varier
Experiences in migration of large analytics platform from MPP database to Hadoop YARN
Srinivas Nimmagadda, Technical Director, CPE
Roopesh Varier, Director, CPE
2. Agenda
1 Introduction
2 Big Data Needs
3 MPP Platform and Challenges
4 New Platform based on Hadoop/YARN
5 Lessons learned during transition to Hadoop
3. Overview
• Symantec Cloud Platform Engineering (CPE)
– Build consolidated cloud infrastructure and platform services for next-generation, data-powered Symantec applications.
– Open source components as building blocks
• Hadoop and OpenStack
• Bridge capability gaps and contribute back
• A big data platform for batch and stream analytics, integrated with OpenStack.
– Security, multi-tenancy, and reliability
• Using large-scale data analytics for security and data management workloads
– Analytics: reputation-based security, Managed Security Services, fraud detection, dial-home application logs
4. Big Data Challenge
• Hundreds of millions of users
• Billions of files
– File good or not?
• Millions of URLs
– URL safe or not?
• Hundreds of thousands of applications
– Stable or crashed?
• Constant feed of information
– Real time
– Across the globe
– From our applications and appliances
5. Value from Volume
• Volume of data
– Multi-petabyte historical datasets
– Multi-terabyte daily incremental datasets
– Wide variety of input data formats
– How do we manage?
• Variety of workloads
– ETL jobs
– Batch applications
– Interactive ad-hoc analysis
• How do we extract value from volume in near real time?
6. Agenda
1 Introduction
2 Big Data Needs
3 MPP Platform and Challenges
4 New Platform based on Hadoop/YARN
5 Lessons learned during transition to Hadoop
7. MPP Platform
[Diagram labels: Data Sources; Raw Data Store; ETL Cluster; Platform Services; MPP DB Engine; Applications (Batch, Interactive)]
8. Legacy MPP Analytics Solution
• Custom Platform Services
– Task/Job management (DAG based, fault-tolerant)
– Functional and performance monitoring
– Automatic data lifecycle management
– Inter-cluster data transfers
– Cluster tenancy management
• ETL cluster
• RDS (raw data store) on NAS
• MPP (Massively Parallel Processing) DB engine at the core
9. Key Challenges
• Scalability
– Supporting rapid data growth
– No support for heterogeneous hardware
• Operational costs
– OpEx and software licenses
• Supporting new use models
– Not Only SQL patterns in analytics (columnar storage, search, streaming)
• Cluster operational challenges
– Limited resource management (limits/quotas, utilization throttling)
– Load balancing across multi-mode and multi-tenant workloads
– Integrated secure tenancy services
– HA and DR
10. Agenda
1 Introduction
2 Big Data Needs
3 MPP Platform and Challenges
4 New Platform based on Hadoop/YARN
5 Lessons learned during transition to Hadoop
11. MPP Platform
[Diagram labels: Data Sources; Raw Data Store; ETL Cluster; Platform Services; MPP DB Engine; Applications (Batch, Interactive)]
12. MPP to Big Data Platform
[Diagram: numbered transition steps overlaid on the MPP platform]
1: Commodity Hardware
2: Hadoop Cluster
3: HDFS
4: YARN
5: DAG: Oozie
6: DistCP, Falcon
7: YARN/HDFS
[Other diagram labels: Data Sources; Raw Data Store; ETL Cluster; ETL; Platform Services; Job Management; State Transfer; Tenancy Guard; MPP DB Engine; Applications (Batch, Interactive)]
13. Big Data Platform
[Diagram: numbered layers of the multi-tenant platform]
1: Cluster Infrastructure
2: Hadoop 2.x Stack
3: HDFS
4: YARN
5: Oozie
[Other diagram labels: Multi-tenant; Data Sources; Applications (Batch, Interactive); Raw Data Store; ETL Jobs; Batch; Interactive; Ad-hoc workloads; Role-based provisioning; Unified Logging; API]
14. Agenda
1 Introduction
2 Big Data Needs
3 MPP Platform and Challenges
4 New Platform based on Hadoop/YARN
5 Lessons learned during transition to Hadoop
15. Cluster Build Experiences
• Node selection
– Single node SKU, use commodity hardware components
– Memory will be cheap, keep expansion options open
– Spindle : Core : LAN network ratio (1 : 2.5 : 1.5 Gb/s)
• Balance mixed workloads using YARN
– Large clusters are better for effective resource utilization
– Balance between ETL, batch, and interactive jobs with YARN
• Platform features and best practices
– Central monitoring, log aggregation, and alerting metrics (ELK stack)
– Role-based automated deployment of OS and Hadoop configuration
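The node-selection ratio above can be turned into a quick sizing check. The sketch below assumes the ratio reads as disk spindles : CPU cores : LAN bandwidth in Gb/s, which the slide does not spell out; the 12-spindle node is a hypothetical example, not a configuration from the talk.

```python
# Assumption: the slide's ratio is spindles : cores : LAN Gb/s = 1 : 2.5 : 1.5.
SPINDLE_CORE_LAN_RATIO = (1.0, 2.5, 1.5)


def balanced_node(spindles: int) -> dict:
    """Scale the assumed ratio to a node with the given number of disk spindles."""
    per_spindle, cores, lan_gbps = SPINDLE_CORE_LAN_RATIO
    return {
        "spindles": spindles,
        "cores": spindles * cores / per_spindle,
        "lan_gbps": spindles * lan_gbps / per_spindle,
    }


if __name__ == "__main__":
    # Hypothetical 12-disk node, used only to illustrate the arithmetic.
    print(balanced_node(12))
```

Under this reading, a node that deviates far from the computed core count or network bandwidth is likely to be bottlenecked on one resource while the others sit idle.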
16. Journey to Hadoop
• Goals
– Open source platform
– Scalable distributed processing
• Existing app base built around SQL
• Many technology choices in Hadoop ecosystem
– Technology choices: distributed query engines vs. fast MR
– Evaluation with multi-PB data sets using 15 of our representative workloads
• e.g., complex joins (data shuffle), queries with variety of data
– Criteria: scale, functionality, stability, performance, integration with the rest of the open source ecosystem
– Hive was the only technology able to scale and provide easy migration from our SQL workloads.
– With Tez we had an acceptable performance trade-off.
17. RDS and ETL Process
• Platform features for ETL
– File ingestion and job management APIs
– Secure tenancy, replication
• Conversion of 5 GB log file (.gz to .bz2)
1. Single node outside Hadoop: ~28 mins
2. In Hadoop, single mapper, parallel read and write approach: ~5 mins
• A parallel RDS and ETL using YARN
– Source file ingested from remote location
– Converted to bz2 and stored in HDFS Raw Data Store (passive data)
– Data is transformed and loaded into Hive (active data in ORC format)
– Mix "active" and "passive" datasets in HDFS
→ Use YARN for managing ETL
[Diagram labels: API; NN (NameNode); DN (DataNode) ×6; 1: Local .gz->bz2; 2: MR based .gz->bz2]
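For reference, the single-node conversion (approach 1 on this slide) can be sketched with the Python standard library. This is an illustrative sketch only: the talk does not show its conversion code, the function name and chunk size are assumptions, and it does not reproduce the parallel read/write mapper of approach 2. One likely reason for the target format is that bzip2 files can be split across mappers in Hadoop, while gzip files cannot.

```python
import bz2
import gzip
import shutil


def gz_to_bz2(src_path: str, dst_path: str, chunk_size: int = 1024 * 1024) -> None:
    """Stream-decompress a .gz file and recompress it as .bz2.

    Data is copied in fixed-size chunks, so the file is never held
    fully in memory. The chunk size is an arbitrary illustrative default.
    """
    with gzip.open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

A call such as `gz_to_bz2("app.log.gz", "app.log.bz2")` (hypothetical file names) runs serially on one core, which is consistent with the slide's point that the single-node path is the slow one.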
18. Large Cluster YARN Performance Modeling
• Multi-mode:
– ETL jobs: guaranteed throughput, window computing
– Ad-hoc queries: low latency, fast execution
– Batch analytics applications: throughput
• Multi-level
– Departments/Projects, Users
• How do we model and use YARN for the above workloads?
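One way to explore the multi-mode, multi-level question is to write a candidate queue tree down as data and check it before deploying. The sketch below assumes Capacity Scheduler-style percentage capacities, where the children of each parent queue must sum to 100; the slides do not say which scheduler was used, and every queue name and percentage here is hypothetical.

```python
# Hypothetical queue tree: names and percentages are illustrative only.
QUEUES = {
    "root": {"capacity": 100, "children": {
        "etl": {"capacity": 40},
        "batch": {"capacity": 40, "children": {
            "project_a": {"capacity": 60},
            "project_b": {"capacity": 40},
        }},
        "adhoc": {"capacity": 20},
    }},
}


def validate(name: str, queue: dict, path: str = "") -> list:
    """Return error messages for parents whose child capacities do not sum to 100."""
    full = f"{path}.{name}" if path else name
    errors = []
    children = queue.get("children", {})
    if children:
        total = sum(child["capacity"] for child in children.values())
        if total != 100:
            errors.append(f"{full}: child capacities sum to {total}, expected 100")
        for child_name, child in children.items():
            errors.extend(validate(child_name, child, full))
    return errors


def absolute_share(path: str) -> float:
    """Percentage of the whole cluster guaranteed to the queue at a dotted path."""
    parts = path.split(".")
    node = QUEUES[parts[0]]
    share = float(node["capacity"])
    for part in parts[1:]:
        node = node["children"][part]
        share = share * node["capacity"] / 100
    return share


if __name__ == "__main__":
    print(validate("root", QUEUES["root"]))
    print(absolute_share("root.batch.project_a"))
```

Multiplying capacities down the tree shows how much of the cluster a project queue is actually guaranteed, which is the number that matters when comparing an ETL throughput guarantee against ad-hoc latency needs.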
22. YARN Queue Model - 3
[Diagram labels: Root Queue; ETL Queue; Batch Queue; Ad-hoc; Project Queues; Jobs]
Step 3: Run jobs, iterate through models, and pick the optimal one
Metrics: Cluster Utilization, Avg Wait Time, Throughput (jobs)
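Step 3 amounts to comparing candidate queue models on the three metrics named on the slide. The sketch below shows one possible comparison, assuming a simple weighted score over normalized metrics; the scoring rule, the weights, and all names are assumptions for illustration, not the method used in the talk.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    """Measurements from running a test workload under one queue model."""
    name: str
    utilization: float           # cluster utilization as a fraction, 0..1
    avg_wait_s: float            # average job wait time in seconds
    throughput_jobs_hr: float    # completed jobs per hour


def pick_optimal(runs, weights=(1.0, 1.0, 1.0)):
    """Pick the run with the best weighted score.

    Assumed scoring: higher utilization and throughput are better,
    lower wait time is better; each metric is scaled to roughly 0..1.
    """
    w_util, w_wait, w_tp = weights
    # Fall back to 1.0 so an all-zero metric cannot divide by zero.
    max_wait = max(r.avg_wait_s for r in runs) or 1.0
    max_tp = max(r.throughput_jobs_hr for r in runs) or 1.0

    def score(run):
        return (w_util * run.utilization
                + w_wait * (1.0 - run.avg_wait_s / max_wait)
                + w_tp * run.throughput_jobs_hr / max_tp)

    return max(runs, key=score)
```

Changing the weights is how a team would express that, for example, ad-hoc wait time matters more than raw throughput for its tenants.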
23. Right Balance
• Optimal solution is about the right balance
– Cluster infrastructure
– Use the right software stack from the Hadoop ecosystem
– Data management
– Application design and workload balancing with YARN
– Good tools for monitoring and management
• Approach
– Start small and iterate faster
– When in doubt, experiment and get data to make decisions.
– Keep customer use cases in perspective.
24. Summary
– Incremental transition from MPP to Big Data
– A journey towards open source distributed computing
– Uniform Computing!
• Infrastructure building blocks
• Single large YARN cluster for variety of compute and storage loads
– Open source: use and contribute
• Work with community to address gaps
– Share your ideas