PayPal Data Lake Journey | 2017-Oct | San Diego | Teradata Edge of Next
Gimel [http://www.gimel.io] is a big data processing library, open sourced by PayPal.
https://www.youtube.com/watch?v=52PdNno_9cU&t=3s
Gimel empowers analysts, scientists, and data engineers alike to access a variety of big data and traditional data stores with just SQL or a single line of code (the Unified Data API).
This is possible because the catalog of technical properties is abstracted away from users, combined with the rich collection of data store connectors available in the Gimel library.
A catalog provider can be Hive, user-supplied (at runtime), or UDC.
In addition, PayPal recently open sourced UDC [Unified Data Catalog], which can host and serve the technical metadata of data stores and objects. Visit http://www.unifieddatacatalog.io to experience it first-hand.
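As an illustration of the one-line access pattern, here is a minimal sketch following the examples in Gimel's documentation (the dataset name is hypothetical, and exact package/class names may vary by Gimel version):

import org.apache.spark.sql.SparkSession
import com.paypal.gimel.DataSet

val spark = SparkSession.builder().appName("gimel-demo").enableHiveSupport().getOrCreate()

// The catalog resolves the storage type, hosts, and connector properties;
// none of that is hard-coded in the application.
val dataSet = DataSet(spark)
val df = dataSet.read("pcatalog.kafka_dataset")   // hypothetical catalog dataset name
df.show(10)

The same logical dataset name keeps working when the underlying store is migrated or its connector is upgraded, because the binding lives in the catalog rather than in application code.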
Speaker Bio
Teradata
• Over 12 years of experience building data warehouse solutions
• 9+ years of experience working with Teradata-based warehouse applications
2016
• Engineering: foundational big data applications for PayPal's Data Lake
• Apache Spark-based ingestion framework, data quality framework, dedup & snapshot processes for transactional datasets
2017
• Big Data Platform Engineering: building Gimel, PayPal's data platform
• Gimel powers the Data Lake vision & many other big data engineering use cases at PayPal
Agenda
• PayPal at Scale
• Business
• Technology
• PayPal - Data Lake
• Compute
• Challenges
• Solution
• Data Access
• Challenges
• Solution
• PayPal Gimel - Bigger Picture
• PayPal Gimel – Next Steps
PayPal Scale | Business
• One of the world's largest internet payment companies
• The PayPal platform includes Braintree, Venmo, Paydiant, PayPal Credit, and Xoom

2016 Full Year Results
• $354B total payment volume, up 28% YoY
• $102B mobile payment volume, up 55% YoY
• 6.1B total transactions, up 24% YoY
• 2.0B mobile transactions, up 43% YoY

Q3 2017 Results
• $3.24B revenue, up 22% YoY
• 218M active customer accounts, up 14% YoY
• 1.9B payment transactions, up 26% YoY
PayPal Scale | Technology
• 40,000+ YARN jobs per day
• 15+ Hadoop clusters
• 70+ PB to 95+ PB of data
• 6+ compute engines: supported today are Hive, Pig, MR, Spark, Beam, and Presto; others, such as Flink, are emerging
Data Lake | Recap | Complexities

Cross-Property Integration
• PayPal, Braintree, Venmo, Xoom, Credit, TIO, Paydiant, and more to come in the future
• Each property has a unique set of infrastructure & storages
• Integration & data pipelines involve multiple storages such as Oracle, MySQL, Kafka, and cloud-based systems
• Multiple clusters, each with a purpose

Compute / Processing Capabilities
• Large-scale / complex data systems
• Variety of use cases: stream, batch, interactive, ML
• Multiple compute options: Hive, Spark, Flink, Beam, Presto, MapReduce
• Data & operational security!
Data Lake | Recap | Complexities

Data Catalog
• The puzzle: awareness - in what form, and where, is data available?
• "Storage-specific" dataset creation
• Dataset onboarding happens in various ways
• Data assets lack "centralized" monitoring capabilities

Data Engineering
• Multiple storages: cloud, NoSQL, RDBMS, messaging, in-memory
• Hundreds of APIs to access different storages - SQL & programmatic (Python, Scala, Java, ...)
• Complexity grows many-fold across different compute engines: Spark, Hive, Flink
• Time to develop and time to market suffer through "learning curves"
Data Lake Services | 2016
[Architecture diagram: "DMR-PAZ | Core Data Highway" - Kafka-fed ingest/transform services over metadata, storage, and compute layers; capabilities include event partitioning, certification, configuration, quality, logging, and alerting/monitoring; a self-serving portal exposes SQL and streaming access; the result is a customer-centric, single source of truth.]
Key Features
• Configuration/Metadata driven apps
• Spark based framework
• Self-Service Portal
• Data Quality Monitoring
• Metrics Collection
• Operational Dashboards
Benefits
• Scalability
• User Experience
• High Speed Data Ingestion
• Converged data sources (PayPal, Braintree, etc.)
• Secure and Governed Data
Quick Recap - Architecture: Simple, Governed, Unified.
Data Lake Services | 2016 | Recap of Challenges

Technology & Skillset
• Emerging/new technologies, e.g. Spark, Flink, Beam, Kylo, ...
• Talent & learning curve

User Experience
• Data engineers
• Analysts
• Scientists

Data Lineage
• The most common challenge!

Data Governance
• Security
• Metadata
• Data quality & metrics

Use Cases
• Rapid growth rate of new business use cases
• Onboarding

Data Models & Best Practices
• Data layout
• Data models

Cross-Property Data Integration
• PayPal, Credit, Braintree, Venmo, Paydiant, TIO, Xoom

Best Practices & Common Frameworks
• Logging
• Monitoring
• Alerting
• Auditing
• Operational dashboards

Administration
• Host management
• Software management
Spark on YARN | Typically
[Diagram: the typical deployment - jobs are submitted from batch edge nodes and interactive edge nodes onto the YARN cluster.]
Data Lake | Challenges | Administrators
• Need extensive support and maintenance for CLI
• Need to deploy entire stack of software
• Need to sync configurations across systems
• Need extensive testing of jobs before any upgrade
Data Lake | Challenges | Developers
• No REST-friendly API
• No low-latency / sub-second execution
• No cache sharing across jobs
• No modularity or easy restartability
Data Lake | Challenges | Interactive Use Cases
Analysts/Scientists
• No easy way to build interactive applications
• No multi-tenancy support or private workspaces
• No direct Spark SQL execution
• No Kerberos integration
Data Lake | Challenges | Ops & Security
• Different ways of executing jobs and inconsistent coding standards
• No uniform logging, monitoring, and alerting
• Limited audit and control
• No statement-level history or metrics
Data Lake | Compute Platform
[Diagram: a grid of Livy job servers fronting the cluster - batch jobs and interactive sessions (including Sparkling Water) are submitted through the Livy API; supporting pieces include NAS storage, log indexing & search, metrics collection, and the Spark History Server.]
Administrators
• Less maintenance on CLI
• Deploy the software stack only on the job server
• Configurations in one place
• Easy platform/software upgrades

Developers
• REST-friendly and Docker-friendly
• Low-latency / sub-second execution
• Cache sharing across jobs
• Modularity and easy restartability

Analysts/Scientists
• User-friendly interactive applications
• Multi-tenancy and private workspaces
• Direct Spark SQL execution
• Kerberos support

Operations/Security
• Standardized coding and unified execution
• Uniform logging, monitoring, and alerting
• Fine-grained audit
• Complete statement-level history and metrics
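For illustration, a minimal sketch of the REST-friendly execution model against a stock Apache Livy endpoint (the host, port, and hard-coded session id are hypothetical; real code would parse the session id from the create-session response):

import java.net.{HttpURLConnection, URL}

// POST a JSON payload to a Livy endpoint and return the raw response body.
def post(url: String, json: String): String = {
  val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  conn.getOutputStream.write(json.getBytes("UTF-8"))
  scala.io.Source.fromInputStream(conn.getInputStream).mkString
}

val livy = "http://livy-host:8998"   // hypothetical Livy job server

// 1. Create an interactive Spark session (POST /sessions).
println(post(s"$livy/sessions", """{"kind": "spark"}"""))

// 2. Run a statement in that session (POST /sessions/{id}/statements).
println(post(s"$livy/sessions/0/statements",
  """{"code": "spark.range(10).count()"}"""))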
Data Platform | Data Access | Current State
[Code screenshot: Spark read from Elasticsearch using the native connector.]
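The screenshot itself is not recoverable here; as a stand-in, a minimal sketch of what such a native read typically looked like with the elasticsearch-hadoop connector (node address, port, and index name are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("es-read-demo").getOrCreate()

// Native read: the connector format, nodes, port, and index are all
// hard-coded into the application.
val esDf = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-host")   // hypothetical Elasticsearch host
  .option("es.port", "9200")
  .load("myindex/mytype")          // hypothetical index/type
esDf.show(10)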
Data Platform | Data Access | Current State
[Code screenshots: native Spark reads from HBase, Elasticsearch, Aerospike, and Druid - a different connector API and configuration for each store.]
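For contrast with the Elasticsearch example above, a sketch of a native HBase read using the Spark-HBase connector (SHC); the table, columns, and catalog JSON are hypothetical examples:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-read-demo").getOrCreate()

// SHC makes the application embed a JSON "catalog" that maps HBase
// column families/qualifiers onto DataFrame columns.
val catalog = """{
  "table": {"namespace": "default", "name": "mytable"},
  "rowkey": "key",
  "columns": {
    "id":   {"cf": "rowkey", "col": "key",  "type": "string"},
    "name": {"cf": "cf1",    "col": "name", "type": "string"}
  }
}"""

val hbaseDf = spark.read
  .option("catalog", catalog)
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
hbaseDf.show(10)

Each store brings its own API shape, options, and dependency set; this per-connector sprawl is exactly what the Data API is meant to absorb.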
Data Access | Bigger Picture | Today
[Diagram: every compute engine (MapReduce, Hive, Pig, Spark, Flink, Beam, Presto) is wired point-to-point to every storage: HDFS, GCP, Alluxio, HBase, BigQuery, Aerospike, Cassandra, Druid, Elastic, Dataflow, Sherlock, Kafka, Teradata, MySQL.]
Data Platform | Today
Data Application Lifecycle - Current State: every change forces the full cycle.
• Onboarding big data apps → Learn, Code, Optimize, Build, Deploy, Run
• Compute engine changed → Learn, Code, Optimize, Build, Deploy, Run
• Compute version upgraded → Learn, Code, Optimize, Build, Deploy, Run
• Storage API changed → Learn, Code, Optimize, Build, Deploy, Run
• Storage connector upgraded → Learn, Code, Optimize, Build, Deploy, Run
• Storage hosts migrated → Learn, Code, Optimize, Build, Deploy, Run
• Storage changed → Learn, Code, Optimize, Build, Deploy, Run
• ********************* → Learn, Code, Optimize, Build, Deploy, Run
Data Platform | Data Access | With Data API
[Diagram: all compute engines (MapReduce, Hive, Pig, Spark, Flink, Beam, Presto) reach every storage (HDFS, GCP, Alluxio, HBase, BigQuery, Aerospike, Cassandra, Druid, Elastic, Dataflow, Sherlock, Kafka, Teradata, MySQL) through a single Data API (com.paypal.pcatalog.Dataset) backed by a central catalog.]
Data Platform | Objective
• Data API
• Unified Data Access API
• Access from any Compute Engine
• Access any Storage
• Data Catalog
• Global catalog - explore everything in one place
• Onboard in one place
• Dataset management - track/approve onboarding requests
• Dataset Stats
Data Platform | Data Access | Data API
With the Data API, each of the four storage-specific reads above (HBase, Elasticsearch, Aerospike, Druid) collapses into the same one-line call. ✔
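A sketch of that contrast, following Gimel's documented read pattern (the catalog dataset names are hypothetical):

import org.apache.spark.sql.SparkSession
import com.paypal.gimel.DataSet

val spark = SparkSession.builder().appName("data-api-demo").enableHiveSupport().getOrCreate()
val dataSet = DataSet(spark)

// One uniform call per store; all connector wiring comes from the catalog.
val hbaseDf   = dataSet.read("pcatalog.hbase_dataset")
val elasticDf = dataSet.read("pcatalog.elastic_dataset")
val aeroDf    = dataSet.read("pcatalog.aerospike_dataset")
val druidDf   = dataSet.read("pcatalog.druid_dataset")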
Data Platform | Data Access | Gimel SQL
With Gimel SQL, each of the four reads above is expressed as a plain SQL statement against the catalog dataset name. ✔
Data Platform | Express Everything as Gimel SQL

Batch SQL:
%%gimel-batch
insert into pcatalog.hdfs_dataset partition(yyyy,mm,dd,hh,mi)
select kafka_ds.*, gimel_load_id
, substr(commit_timestamp,1,4) as yyyy
, substr(commit_timestamp,6,2) as mm
, substr(commit_timestamp,9,2) as dd
, substr(commit_timestamp,12,2) as hh
, case when cast(substr(commit_timestamp,15,2) as INT) <= 30 then "00" else "30" end as mi
from pcatalog.kafka_dataset kafka_ds;

Streaming SQL:
%%gimel-stream
insert into pcatalog.hdfs_dataset partition(yyyy,mm,dd,hh,mi)
select kafka_ds.*, gimel_load_id
, substr(commit_timestamp,1,4) as yyyy
, substr(commit_timestamp,6,2) as mm
, substr(commit_timestamp,9,2) as dd
, substr(commit_timestamp,12,2) as hh
, case when cast(substr(commit_timestamp,15,2) as INT) <= 30 then "00" else "30" end as mi
from pcatalog.kafka_dataset kafka_ds;

All kinds of Hive SQL supported!
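A minimal sketch of executing such a statement from a Spark application, using the SQL entry point shown in Gimel's documentation (class, method, and dataset names are assumptions and may vary by version):

import org.apache.spark.sql.SparkSession
import com.paypal.gimel.sql.GimelQueryProcessor

val spark = SparkSession.builder().appName("gimel-sql-demo").enableHiveSupport().getOrCreate()

// Any catalog-resolvable SQL; hypothetical dataset name.
val gsql = "select * from pcatalog.kafka_dataset limit 10"

// Batch execution; a %%gimel-stream statement would go through the
// streaming counterpart (executeStream in Gimel's docs).
GimelQueryProcessor.executeBatch(gsql, spark)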
Data Platform | Data Application Lifecycle | With Data API
• Onboarding big data apps → Learn, Code, Optimize, Build, Deploy, Run
• Compute engine changed → Run
• Compute version upgraded → Run
• Storage API changed → Run
• Storage connector upgraded → Run
• Storage hosts migrated → Run
• Storage changed → Run
• ********************* → Run
Data Platform | Data API | Unlocking Additional Benefits
• Explorer
  • View sample data
  • Add capacity reservations
  • Approval and tracking
• Discovery
  • Auto-discovery of datasets and importing into PCatalog
• Dashboards
  • Operations dashboard: monitor dataset metrics, statistics, trend visualizations, and load/refresh status
  • Admin dashboard: monitor dataset growth/space/capacity statistics and take action (Drop/Stop/Hold)
• Alerts
  • User alerts for dataset refresh delays and quality issues
  • Admin alerts for dataset capacity and access violations
• Query/BI Integration
  • Integrate the catalog with Jupyter and Zeppelin, and query any dataset
• Out-of-the-box Benefits
  • Connector dependencies maintained by BDPE
  • Change a component version in one place, e.g. the HBase version
  • Easy cluster upgrades
  • Storage abstracted from users - just Prod, QA, and Dev
  • Add a new compute engine to the stack easily
  • Add a new storage system to the stack easily
• Time to Develop
• Time to Market
Data Platform: Plan (2017 ========== 2018 ==========)

POC
• Analysis
• Evaluation
• Show & Tell

Phase-1
• Compute engines: Spark, Hive, Presto
• Storage: Teradata, HDFS, HBase, Elastic Search, Kafka, Aerospike, Sherlock
• Catalog: Web UI, one-time onboarding, Catalog API

Phase-2
• Compute engines: Spark, Hive, Presto
• Storage: Druid, Cassandra
• Catalog: Web UI, auto-onboarding, Catalog API

Phase-3
• Compute engines: Pig/MR, Beam, Flink
• Storage: Alluxio, Oracle, MySQL
• Catalog: metrics/trend/status, discovery/alerts, onboarding approval, Jira integration, Rally integration, storage integration

Phase-4
• Compute engine: BigQuery
• Storage: GS FS, BigQuery, BigTable
• Catalog: enhancements, improvements

Phase-5
• Deployment
• Continuous onboarding
• Post-production support
• Maintenance

Open Source – Data API
Thank You!
Rate This Session #0348 with the PARTNERS Mobile App
Questions/Comments
Email: dmohanakumarchan@paypal.com
LinkedIn: https://www.linkedin.com/in/deepakmc/
References
Google search link used for logos of open source technologies:
https://www.google.com/search?q=big+data+stack+icons&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiwnevLztzWAhVHsFQKHZp7CukQ_AUICigB&biw=1369&bih=739
PayPal Q3 2017 & 2016 full year results:
https://www.paypal.com/us/webapps/mpp/about
Data Platform: Process Flow
• Sample portal: http://pcatalog.weebly.com
[Flow diagram. Onboarding: (1) the requestor fills in metadata and submits an approval request, (2) the approver approves the request (or it is auto-approved), (3) the Create Dataset API creates the dataset metadata in PCatalog, (4) the catalog entry is created on the storage. Access: (1) a user/developer submits a job, (2) the compute layer (Data API) fetches the dataset metadata from PCatalog over its REST API, (3) the data on the storage is accessed.]
Data Platform | Today
Data Access - other challenges:
• Data access tied to compute and data store versions
• Hard to find available datasets
• Storage-specific dataset creation results in duplication and increased latency
• No audit trail for dataset access
• No standards for onboarding datasets for others to discover
• No statistics on dataset usage and access trends