SlideShare a Scribd company logo
1 of 44
© 2015 Teradata
Deepak Mohanakumar Chandramouli, Big Data Platform Eng
PayPal Data Lake journey
2
Speaker Bio
Teradata
• Over 12 years of experience building Data Warehouse Solutions.
• 9+ years of experience working with Teradata based Warehouse Applications
2016
• Engineering - Foundational Big Data Applications for PayPal’s Data Lake
• Apache Spark Based Ingestion Framework, Data Quality Framework, Dedup & Snapshot processes for Transactional Datasets.
2017
• Big Data Platform Engineering: to build PayPal’s Gimel - Data Platform
• Gimel powers Data Lake vision & many other Big Data Engineering Use-cases in PayPal Inc.
3
Agenda
• PayPal at Scale
• Business
• Technology
• PayPal - Datalake
• Compute
• Challenges
• Solution
• Data Access
• Challenges
• Solution
• PayPal Gimel - Bigger Picture
• PayPal Gimel – Next Steps
Business &
Technology
PayPal at Scale
5
PayPal Scale | Business
$354B
Total Payment
up 28% YoY
$102B
Mobile Payment
up 55% YoY
6.1B
Total Transactions
up 24% YoY
2.0B
Mobile Transactions
up 43% YoY
• One of the world’s largest internet payment companies
• PayPal platform includes Braintree, Venmo, Paydiant, PP Credit and Xoom
$3.24B
Revenue up 22% YoY
218M
Active Customer Accounts
up 14% YoY
1.9B
Payment Transactions
up 26% YoY
Q3 2017 Results
2016 Full Year Results
6
PayPal Scale | Technology
40,000+
Yarn Jobs Per Day
15+
Hadoop Clusters
70+PB
Data
95+ PB
Data
5+
Compute
6+ compute
Supported - Hive,Pig, MR, Spark, Beam, Presto
Others - Flink
Bigger Picture
Complexities
PayPal - Datalake
8
Data Lake | Bigger Picture
9
Data Lake | Recap | Complexities
• PayPal, Braintree, Venmo, Xoom, Credit, Tio, Paydient, More to Come in future
• Each Property – Unique set of infrastructure & Storages
• Integration & Data Pipelines - involves Multiple Storages such as Oracle, MySQL, Kafka,
Cloud Bases systems.
• Multiple Clusters – Each has a purpose
• Large Scale / Complex Data Systems
• Variety of Use Cases - Stream, Batch, Interactive, ML
• Multiple Compute Options – Hive, Spark, Flink, Beam, Presto, Map-Reduce
• Data & Operational Security !
Cross Property Integration
Compute / Processing Capabilities
10
Data Lake | Recap | Complexities
• Puzzle
• Awareness - in what form & where data is available?
• “Storage-specific” dataset creation
• Dataset onboarding – various ways
• Data assets – Lacking “Centralized” monitoring capabilities
• Multiple Storages – Cloud, No-SQL, RDBMS, Messaging, In-Memory
• 100’s of APIs to access different storages – SQL & Programmatic (Python, Scala, Java…)
• Complexity – Many-fold if one considers different compute engines – spark, hive, flink
• Time to develop and Time to Market – goes through ”learning curves”
Data Engineering
Data Catalog
2016
Beginning of the Data Lake Services
12
Customer-
Centric,
Single Source
of Truth
Produce
OIS
Metadata
Event Partition Certify
Configuration Quality Logging
Alerting/
Monitoring
Metadata
Data
Storage
Compute
Ingest/
Transform
Services
DMR-PAZ | Core Data Highway
Kafka
Self-Serving
Portal
SQL
Streaming
Data Lake Services |2016
Key Features
• Configuration/Metadata driven apps
• Spark based framework
• Self-Service Portal
• Data Quality Monitoring
• Metrics Collection
• Operational Dashboards
Benefits
• Scalability
• User Experience
• High Speed Data Ingestion
• Converged Data Sources (PayPal, Braintree
etc.)
• Secure and Governed Data
Quick Recap - ArchitectureSimple, Governed, Unified.
13
Data Lake Services |2016 | Recap Challenges
Technology & Skillset
• Emerging/New technologies. Example - Spark, Flink, Beam, Kylo, …
• Talent & Learning Curve
Users Experience
• Data Engineers
• Analysts
• Scientists
Data Lineage
• Most common Challenge !
Data Governance
• Security
• Metadata
• Data Quality & Metrics
Use cases
• Rapid growth rate of new business
use cases
• Onboarding
Data Models & Best Practices
• Data Layout
• Data Models
Data across Property - Integration
• PayPal, Credit, Braintree,
Venmo, Paydiant, TIO, Xoom
Best Practices & Common Frameworks
• Logging
• Monitoring
• Alerting
• Auditing
• Operational Dashboards
Administration
• Hosts Management
• Software Management
Compute
Challenges Faced with…
15
Spark on Yarn| Typically
Batch
Edge Nodes
Interactive
Edge Nodes
Job
Deployment
16
Data Lake | Challenges | Administrators
• Need extensive support and maintenance for CLI
• Need to deploy entire stack of software
• Need to sync configurations across systems
• Need extensive testing of jobs before any upgrade
Batch
Edge Nodes
Interactive
Edge Nodes
Job
Administrators
17
Data Lake | Challenges | Developers
Batch
Edge Nodes
Interactive
Edge Nodes
Job
Developers
• No REST-friendly API
• No low-latency/sub-seconds execution
• No cache sharing across jobs
• No modularity and easy-restartability
18
Data Lake | Challenges | Interactive Usecases
Batch
Edge Nodes
Interactive
Edge Nodes
Job
Analysts/Scientists
• No easy way of interactive applications
• No multi-tenancy support and private workspace
• No direct spark sql execution
• No Kerberos integration
19
Data Lake | Challenges | Ops & Security
• Different ways of jobs execution and coding standards
• No uniform logging, monitoring and alerting
• Limited audit and control
• No statement level history or metrics
Batch
Edge Nodes
Interactive
Edge Nodes
Job
Operations/Security
Gimel
Compute Platform
Solution !
21
Data Lake | Compute Platform
Batch
Job
LIVY GRID
Job Server
Livy
API
NAS
Batch
Interactive Interactive
In InIn
In
Interactive
Interactive
Sparkling
Water
Interactive
Interactive
Log
Log
Indexing
Search
Metrics
History Server
Administrators
 Less maintenance on CLI
 Deploy software stack only on Job Server
 Configurations at one place
 Easy platform/software upgrade
Developers
 REST-friendly and Docker-friendly
 Low-latency/sub-seconds execution
 Sharing cache across jobs
 Modularity and easy restartability
Analysts/Scientists
 User friendly interactive applications
 Multi-tenancy and Private workspace
 Direct spark sql execution
 Kerberos Support
Operations/Security
 Standardized coding and unified execution
 Uniformed logging, monitoring and alerting
 Fine-grained audit
 Complete statement level history and metrics
Data Access
Challenges Faced with…
23
Data Platform | Data Access | Current State
Spark Read From Hbase
24
Data Platform | Data Access | Current State
Spark Read From Elastic Search
25
Data Platform | Data Access | Current State
Spark Read From Hbase Spark Read From Elastic Search
Spark Read From AeroSpike Spark Read From Druid
26
Data Access| Bigger Picture | Today
MAP
REDUCE
HIVE PIG SPARK FLINK BEAM PRESTO
HDFS GCP
ALLUXI
O
HBASE BIGQUERY
AEROSPIK
E
CASSANDR
A DRUID
ELASTI
C
DATAFLOW
SHERLOCK KAFKA
TERADAT
A MYSQL
BIGQUERY
27
Data Platform | Today
Learn Code Optimize Build Deploy RunOnboarding Big Data Apps
Learn Code Optimize Build Deploy RunCompute Engine Changed
Learn Code Optimize Build Deploy RunCompute Version Upgraded
Learn Code Optimize Build Deploy RunStorage API Changed
Learn Code Optimize Build Deploy RunStorage Connector Upgraded
Learn Code Optimize Build Deploy RunStorage Hosts Migrated
Learn Code Optimize Build Deploy RunStorage Changed
Learn Code Optimize Build Deploy Run*********************
Data Application Lifecycle - Current State
Gimel
Data Platform
Solution !
29
com.paypal.pcatalog.Dataset
Data Platform | Data Access | With Data API
MAPREDUCE HIVE PIG SPARK FLINK BEAM PRESTO
CENTRAL
CATALOG
HDFS GCP ALLUXIO HBASE
BIGQUER
Y
AEROSPIKE CASSANDRA DRUID ELASTIC
DATAFLOW
SHERLOC
K
KAFKA
TERADAT
A
MYSQL
BIGQUERY
DATA
API
30
Data Platform | Objective
• Data API
• Unified Data Access API
• Access from any Compute Engine
• Access any Storage
• Data Catalog
• Global Catalog - Explore at one place
• Onboard at one place
• Dataset Management- Track/Approve Onboard Request
• Dataset Stats
31
Data Platform | Data Access | Data API
Spark Read From Hbase Spark Read From Elastic Search
Spark Read From AeroSpike Spark Read From Druid
With Data API
✔
32
Data Platform | Data Access | Gimel SQL
Spark Read From Hbase Spark Read From Elastic Search
Spark Read From AeroSpike Spark Read From Druid
With Gimel SQL
✔
33
Data Platform | Express everything as Gimel SQL
Batch SQL
Streaming
SQL
%%gimel-batch
insert into pcatalog.hdfs_dataset
partition(yyyy,mm,dd,hh,mi)
select kafka_ds.*,gimel_load_id
,substr(commit_timestamp,1,4) as yyyy
,substr(commit_timestamp,6,2) as mm
,substr(commit_timestamp,9,2) as dd
,substr(commit_timestamp,12,2) as hh
,case when cast(substr(commit_timestamp,15,2) as INT) <= 30 then "00" else
"30" end as mi
from pcatalog.kafka_dataset kafka_ds;
%%gimel-stream
insert into pcatalog.hdfs_dataset
partition(yyyy,mm,dd,hh,mi)
select kafka_ds.*,gimel_load_id
,substr(commit_timestamp,1,4) as yyyy
,substr(commit_timestamp,6,2) as mm
,substr(commit_timestamp,9,2) as dd
,substr(commit_timestamp,12,2) as hh
,case when cast(substr(commit_timestamp,15,2) as INT) <= 30 then "00" else
"30" end as mi
from pcatalog.kafka_dataset kafka_ds;
All Kind of Hive SQLs Supported !
34
Data Platform | Data Application Lifecycle | With Data API
Learn Code Optimize Build Deploy RunOnboarding Big Data Apps
RunCompute Engine Changed
Compute Version Upgraded
Storage API Changed
Storage Connector Upgraded
Storage Hosts Migrated
Storage Changed
*********************
Run
Run
Run
Run
Run
Run
35
Data Platform | Data API | Unlocking Additional Benefits
• Explorer
• View sample data
• Add capacity reservations
• Approval and Tracking
• Discovery
• Auto-discovery of datasets and importing into
PCatalog
• Dashboard
• Operation dashboard: Metrics, statistics, trends,
refresh status
• Admin dashboard: Capacity/growth/space statistics
and action
• Alert
• User alerts to notify user datasets refresh delay and
quality issue
• Admin alerts to notify datasets capacity and access
violation
• Query/BI Integration
• Integrate catalog with Jupyter, Zeppelin and query any
dataset
• Out-of-the-box Benefits
• Connector dependencies maintained by BDPE
• Change component version at one place. E.g. Hbase
version
• Easy cluster upgrade
• Abstract storage from user. Just Prod, QA and Dev
• Add new compute engine to stack easily
• Add new storage system to stack easily
• Time to Develop
• Time to Market
PayPal Gimel
Data Platform
+
Compute Platform
Presenting ….
37
PayPal GIMEL
AnalyticsDataPlatform
38
POC Phase-1 Phase-2 Phase-3 Phase-4 Phase-5
Data Platform: Plan
Compute Engine:
 Spark
 Hive
 Presto
Storage:
 TERADATA
 HDFS
 Hbase
 Elastic Search
 Kafka
 Aerospike
 Sherlock
Catalog:
 Web UI
 1Time Onboard
 Catalog API
Compute Engine:
 Spark
 Hive
 Presto
Storage:
 Druid
 Cassandra
Catalog:
 Web UI
 Auto-Onboard
 Catalog API
Compute Engine:
 Pig/MR
 Beam
 Flink
Storage:
 Alluxio
 Oracle
 Mysql
Catalog:
 Metrics/Trend/Status
 Discovery/Alerts
 Onboard Approval
 Jira Integration
 Rally Integration
 Storage Integration
Compute Engine:
 Big Query
Storage:
 GS FS
 Big Query
 Big Table
Catalog:
 Enhancements
 Improvements
 Deployment
 Cont. Onboarding
 Post-Prod Support
 Maintenance
 Analysis
 Evaluation
 Show & Tell
2017 ========== 2018 ==========
Open Source – Data API
Thank You!
Rate This Session #
with the PARTNERS Mobile App
0348
https://www.linkedin.com/in/deepakmc/LinkedIn:
Questions/Comments
Email: dmohanakumarchan@paypal.com
Appendix
41
References
Approver
Google Search Links Used for Logos of open source technologies :
https://www.google.com/search?q=big+data+stack+icons&source=lnms&tbm=isch&sa=X&ved=0ahUKE
wiwnevLztzWAhVHsFQKHZp7CukQ_AUICigB&biw=1369&bih=739
PayPal Q3 2017 & 2016 Full Year Results :
https://www.paypal.com/us/webapps/mpp/about
42
Data Platform: Process Flow
• Sample Portal: http://pcatalog.weebly.com
Onboard
Fill Meta &
Submit Approval
Request
Requestor
Approver
Approved Create
Dataset API
PCatalog
Storage
User/Developer
Submit Job Create Dataset Meta
on PCatalog
Create Catalog
on Storage
Compute
(Data API)
Data
Get Dataset Meta
Access
1 2
3
4
5
6
1 2
3
4
Auto-Approve2
RESTAPI
43
Data Platform | Today
• Data Access – Other challenges
Data access tied
to compute and
data store
versions
Hard to find
available
data sets
Storage-specific
dataset creation
results in duplication
and increased
latency
No audit trail
for dataset
access
No standards for
on-boarding data
sets for others to
discover
No statistics on
data set usage
and access
trends
44
POC Phase-1 Phase-2 Phase-3 Phase-4 Phase-5
Data Platform: Plan Proposal
Compute Engine:
 Spark
 Hive
 Presto
Storage:
 HDFS
 Hbase
 Elastic Search
 Kafka
Catalog:
 Web UI
 1Time Onboard
 Catalog API
Compute Engine:
 Spark
 Hive
 Presto
Storage:
 Aerospike
 Druid
 Sherlock
 Cassandra
Catalog:
 Web UI
 Auto-Onboard
 Catalog API
Compute Engine:
 Pig/MR
 Beam
 Flink
Storage:
 Alluxio
 Teradata
 Oracle
 Mysql
Catalog:
 Metrics/Trend/Status
 Discovery/Alerts
 Onboard Approval
 Jira Integration
 Rally Integration
 Storage Integration
Compute Engine:
 Big Query
Storage:
 GS FS
 Big Query
 Big Table
Catalog:
 Enhancements
 Improvements
 Deployment
 Cont. Onboarding
 Post-Prod Support
 Maintenance
 Analysis
 Evaluation
 Show & Tell

More Related Content

What's hot

Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...DataWorks Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database Systemconfluent
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
Airflow Clustering and High Availability
Airflow Clustering and High AvailabilityAirflow Clustering and High Availability
Airflow Clustering and High AvailabilityRobert Sanders
 
Snowflake Company Presentation
Snowflake Company PresentationSnowflake Company Presentation
Snowflake Company PresentationAndrewJiang18
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
 
Service Mesh - Observability
Service Mesh - ObservabilityService Mesh - Observability
Service Mesh - ObservabilityAraf Karsh Hamid
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...confluent
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
Event Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQEvent Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQAraf Karsh Hamid
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data MeshLibbySchulze
 

What's hot (20)

Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Airflow Clustering and High Availability
Airflow Clustering and High AvailabilityAirflow Clustering and High Availability
Airflow Clustering and High Availability
 
Snowflake Company Presentation
Snowflake Company PresentationSnowflake Company Presentation
Snowflake Company Presentation
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Service Mesh - Observability
Service Mesh - ObservabilityService Mesh - Observability
Service Mesh - Observability
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Event Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQEvent Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQ
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 

Similar to PayPal datalake journey | teradata - edge of next | san diego | 2017 october | Gimel

How Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsHow Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsCloudera, Inc.
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
From Data to Services at the Speed of Business
From Data to Services at the Speed of BusinessFrom Data to Services at the Speed of Business
From Data to Services at the Speed of BusinessAli Hodroj
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoGimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoRomit Mehta
 
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...HostedbyConfluent
 
OPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures
OPEN'17_4_Postgres: The Centerpiece for Modernising IT InfrastructuresOPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures
OPEN'17_4_Postgres: The Centerpiece for Modernising IT InfrastructuresKangaroot
 
Solving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalSolving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalAvere Systems
 
QCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformQCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformDeepak Chandramouli
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsScyllaDB
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table NotesTimothy Spann
 
Enabling Key Business Advantage from Big Data through Advanced Ingest Process...
Enabling Key Business Advantage from Big Data through Advanced Ingest Process...Enabling Key Business Advantage from Big Data through Advanced Ingest Process...
Enabling Key Business Advantage from Big Data through Advanced Ingest Process...StampedeCon
 
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®confluent
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryDataWorks Summit/Hadoop Summit
 
Lecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsLecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsAbhishekKumarAgrahar2
 
Accelerate and modernize your data pipelines
Accelerate and modernize your data pipelinesAccelerate and modernize your data pipelines
Accelerate and modernize your data pipelinesPaul Van Siclen
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshIanFurlong4
 
StreamCentral Technical Overview
StreamCentral Technical OverviewStreamCentral Technical Overview
StreamCentral Technical OverviewRaheel Retiwalla
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeDATAVERSITY
 

Similar to PayPal datalake journey | teradata - edge of next | san diego | 2017 october | Gimel (20)

How Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsHow Data Drives Business at Choice Hotels
How Data Drives Business at Choice Hotels
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
From Data to Services at the Speed of Business
From Data to Services at the Speed of BusinessFrom Data to Services at the Speed of Business
From Data to Services at the Speed of Business
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoGimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
 
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
 
OPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures
OPEN'17_4_Postgres: The Centerpiece for Modernising IT InfrastructuresOPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures
OPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures
 
Solving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalSolving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute final
 
QCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformQCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic Platform
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table Notes
 
Enabling Key Business Advantage from Big Data through Advanced Ingest Process...
Enabling Key Business Advantage from Big Data through Advanced Ingest Process...Enabling Key Business Advantage from Big Data through Advanced Ingest Process...
Enabling Key Business Advantage from Big Data through Advanced Ingest Process...
 
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive Industry
 
Lecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsLecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in details
 
Accelerate and modernize your data pipelines
Accelerate and modernize your data pipelinesAccelerate and modernize your data pipelines
Accelerate and modernize your data pipelines
 
Scale By The Bay | 2020 | Gimel
Scale By The Bay | 2020 | GimelScale By The Bay | 2020 | Gimel
Scale By The Bay | 2020 | Gimel
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
 
StreamCentral Technical Overview
StreamCentral Technical OverviewStreamCentral Technical Overview
StreamCentral Technical Overview
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data Lake
 

Recently uploaded

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Recently uploaded (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

PayPal datalake journey | teradata - edge of next | san diego | 2017 october | Gimel

  • 1. © 2015 Teradata Deepak Mohanakumar Chandramouli, Big Data Platform Eng PayPal Data Lake journey
  • 2. 2 Speaker Bio Teradata • Over 12 years of experience building Data Warehouse Solutions. • 9+ years of experience working with Teradata based Warehouse Applications 2016 • Engineering - Foundational Big Data Applications for PayPal’s Data Lake • Apache Spark Based Ingestion Framework, Data Quality Framework, Dedup & Snapshot processes for Transactional Datasets. 2017 • Big Data Platform Engineering: to build PayPal’s Gimel - Data Platform • Gimel powers Data Lake vision & many other Big Data Engineering Use-cases in PayPal Inc.
  • 3. 3 Agenda • PayPal at Scale • Business • Technology • PayPal - Datalake • Compute • Challenges • Solution • Data Access • Challenges • Solution • PayPal Gimel - Bigger Picture • PayPal Gimel – Next Steps
  • 5. 5 PayPal Scale | Business $354B Total Payment up 28% YoY $102B Mobile Payment up 55% YoY 6.1B Total Transactions up 24% YoY 2.0B Mobile Transactions up 43% YoY • One of the world’s largest internet payment companies • PayPal platform includes Braintree, Venmo, Paydiant, PP Credit and Xoom $3.24B Revenue up 22% YoY 218M Active Customer Accounts up 14% YoY 1.9B Payment Transactions up 26% YoY Q3 2017 Results 2016 Full Year Results
  • 6. 6 PayPal Scale | Technology 40,000+ Yarn Jobs Per Day 15+ Hadoop Clusters 70+PB Data 95+ PB Data 5+ Compute 6+ compute Supported - Hive,Pig, MR, Spark, Beam, Presto Others - Flink
  • 8. 8 Data Lake | Bigger Picture
  • 9. 9 Data Lake | Recap | Complexities • PayPal, Braintree, Venmo, Xoom, Credit, Tio, Paydient, More to Come in future • Each Property – Unique set of infrastructure & Storages • Integration & Data Pipelines - involves Multiple Storages such as Oracle, MySQL, Kafka, Cloud Bases systems. • Multiple Clusters – Each has a purpose • Large Scale / Complex Data Systems • Variety of Use Cases - Stream, Batch, Interactive, ML • Multiple Compute Options – Hive, Spark, Flink, Beam, Presto, Map-Reduce • Data & Operational Security ! Cross Property Integration Compute / Processing Capabilities
  • 10. 10 Data Lake | Recap | Complexities • Puzzle • Awareness - in what form & where data is available? • “Storage-specific” dataset creation • Dataset onboarding – various ways • Data assets – Lacking “Centralized” monitoring capabilities • Multiple Storages – Cloud, No-SQL, RDBMS, Messaging, In-Memory • 100’s of APIs to access different storages – SQL & Programmatic (Python, Scala, Java…) • Complexity – Many-fold if one considers different compute engines – spark, hive, flink • Time to develop and Time to Market – goes through ”learning curves” Data Engineering Data Catalog
  • 11. 2016 Beginning of the Data Lake Services
  • 12. 12 Customer- Centric, Single Source of Truth Produce OIS Metadata Event Partition Certify Configuration Quality Logging Alerting/ Monitoring Metadata Data Storage Compute Ingest/ Transform Services DMR-PAZ | Core Data Highway Kafka Self-Serving Portal SQL Streaming Data Lake Services |2016 Key Features • Configuration/Metadata driven apps • Spark based framework • Self-Service Portal • Data Quality Monitoring • Metrics Collection • Operational Dashboards Benefits • Scalability • User Experience • High Speed Data Ingestion • Converged Data Sources (PayPal, Braintree etc.) • Secure and Governed Data Quick Recap - ArchitectureSimple, Governed, Unified.
  • 13. 13 Data Lake Services |2016 | Recap Challenges Technology & Skillset • Emerging/New technologies. Example - Spark, Flink, Beam, Kylo, … • Talent & Learning Curve Users Experience • Data Engineers • Analysts • Scientists Data Lineage • Most common Challenge ! Data Governance • Security • Metadata • Data Quality & Metrics Use cases • Rapid growth rate of new business use cases • Onboarding Data Models & Best Practices • Data Layout • Data Models Data across Property - Integration • PayPal, Credit, Braintree, Venmo, Paydiant, TIO, Xoom Best Practices & Common Frameworks • Logging • Monitoring • Alerting • Auditing • Operational Dashboards Administration • Hosts Management • Software Management
  • 15. 15 Spark on Yarn| Typically Batch Edge Nodes Interactive Edge Nodes Job Deployment
  • 16. 16 Data Lake | Challenges | Administrators • Need extensive support and maintenance for CLI • Need to deploy entire stack of software • Need to sync configurations across systems • Need extensive testing of jobs before any upgrade Batch Edge Nodes Interactive Edge Nodes Job Administrators
  • 17. 17 Data Lake | Challenges | Developers Batch Edge Nodes Interactive Edge Nodes Job Developers • No REST-friendly API • No low-latency/sub-seconds execution • No cache sharing across jobs • No modularity and easy-restartability
  • 18. 18 Data Lake | Challenges | Interactive Usecases Batch Edge Nodes Interactive Edge Nodes Job Analysts/Scientists • No easy way of interactive applications • No multi-tenancy support and private workspace • No direct spark sql execution • No Kerberos integration
  • 19. 19 Data Lake | Challenges | Ops & Security • Different ways of jobs execution and coding standards • No uniform logging, monitoring and alerting • Limited audit and control • No statement level history or metrics Batch Edge Nodes Interactive Edge Nodes Job Operations/Security
  • 21. 21 Data Lake | Compute Platform Batch Job LIVY GRID Job Server Livy API NAS Batch Interactive Interactive In InIn In Interactive Interactive Sparkling Water Interactive Interactive Log Log Indexing Search Metrics History Server Administrators  Less maintenance on CLI  Deploy software stack only on Job Server  Configurations at one place  Easy platform/software upgrade Developers  REST-friendly and Docker-friendly  Low-latency/sub-seconds execution  Sharing cache across jobs  Modularity and easy restartability Analysts/Scientists  User friendly interactive applications  Multi-tenancy and Private workspace  Direct spark sql execution  Kerberos Support Operations/Security  Standardized coding and unified execution  Uniformed logging, monitoring and alerting  Fine-grained audit  Complete statement level history and metrics
  • 23. 23 Data Platform | Data Access | Current State Spark Read From Hbase
  • 24. 24 Data Platform | Data Access | Current State Spark Read From Elastic Search
  • 25. 25 Data Platform | Data Access | Current State Spark Read From Hbase Spark Read From Elastic Search Spark Read From AeroSpike Spark Read From Druid
  • 26. 26 Data Access| Bigger Picture | Today MAP REDUCE HIVE PIG SPARK FLINK BEAM PRESTO HDFS GCP ALLUXI O HBASE BIGQUERY AEROSPIK E CASSANDR A DRUID ELASTI C DATAFLOW SHERLOCK KAFKA TERADAT A MYSQL BIGQUERY
  • 27. 27 Data Platform | Today Learn Code Optimize Build Deploy RunOnboarding Big Data Apps Learn Code Optimize Build Deploy RunCompute Engine Changed Learn Code Optimize Build Deploy RunCompute Version Upgraded Learn Code Optimize Build Deploy RunStorage API Changed Learn Code Optimize Build Deploy RunStorage Connector Upgraded Learn Code Optimize Build Deploy RunStorage Hosts Migrated Learn Code Optimize Build Deploy RunStorage Changed Learn Code Optimize Build Deploy Run********************* Data Application Lifecycle - Current State
  • 29. 29 com.paypal.pcatalog.Dataset Data Platform | Data Access | With Data API MAPREDUCE HIVE PIG SPARK FLINK BEAM PRESTO CENTRAL CATALOG HDFS GCP ALLUXIO HBASE BIGQUER Y AEROSPIKE CASSANDRA DRUID ELASTIC DATAFLOW SHERLOC K KAFKA TERADAT A MYSQL BIGQUERY DATA API
  • 30. 30 Data Platform | Objective • Data API • Unified Data Access API • Access from any Compute Engine • Access any Storage • Data Catalog • Global Catalog - Explore at one place • Onboard at one place • Dataset Management- Track/Approve Onboard Request • Dataset Stats
  • 31. 31 Data Platform | Data Access | Data API Spark Read From Hbase Spark Read From Elastic Search Spark Read From AeroSpike Spark Read From Druid With Data API ✔
  • 32. 32 Data Platform | Data Access | Gimel SQL Spark Read From Hbase Spark Read From Elastic Search Spark Read From AeroSpike Spark Read From Druid With Gimel SQL ✔
  • 33. 33 Data Platform | Express everything as Gimel SQL Batch SQL Streaming SQL %%gimel-batch insert into pcatalog.hdfs_dataset partition(yyyy,mm,dd,hh,mi) select kafka_ds.*,gimel_load_id ,substr(commit_timestamp,1,4) as yyyy ,substr(commit_timestamp,6,2) as mm ,substr(commit_timestamp,9,2) as dd ,substr(commit_timestamp,12,2) as hh ,case when cast(substr(commit_timestamp,15,2) as INT) <= 30 then "00" else "30" end as mi from pcatalog.kafka_dataset kafka_ds; %%gimel-stream insert into pcatalog.hdfs_dataset partition(yyyy,mm,dd,hh,mi) select kafka_ds.*,gimel_load_id ,substr(commit_timestamp,1,4) as yyyy ,substr(commit_timestamp,6,2) as mm ,substr(commit_timestamp,9,2) as dd ,substr(commit_timestamp,12,2) as hh ,case when cast(substr(commit_timestamp,15,2) as INT) <= 30 then "00" else "30" end as mi from pcatalog.kafka_dataset kafka_ds; All Kind of Hive SQLs Supported !
  • 34. 34 Data Platform | Data Application Lifecycle | With Data API Learn Code Optimize Build Deploy RunOnboarding Big Data Apps RunCompute Engine Changed Compute Version Upgraded Storage API Changed Storage Connector Upgraded Storage Hosts Migrated Storage Changed ********************* Run Run Run Run Run Run
  • 35. 35 Data Platform | Data API | Unlocking Additional Benefits • Explorer • View sample data • Add capacity reservations • Approval and Tracking • Discovery • Auto-discovery of datasets and importing into PCatalog • Dashboard • Operation dashboard: Metrics, statistics, trends, refresh status • Admin dashboard: Capacity/growth/space statistics and action • Alert • User alerts to notify user datasets refresh delay and quality issue • Admin alerts to notify datasets capacity and access violation • Query/BI Integration • Integrate catalog with Jupyter, Zeppelin and query any dataset • Out-of-the-box Benefits • Connector dependencies maintained by BDPE • Change component version at one place. E.g. Hbase version • Easy cluster upgrade • Abstract storage from user. Just Prod, QA and Dev • Add new compute engine to stack easily • Add new storage system to stack easily • Time to Develop • Time to Market
  • 36. PayPal Gimel Data Platform + Compute Platform Presenting ….
  • 38. 38 POC Phase-1 Phase-2 Phase-3 Phase-4 Phase-5 Data Platform: Plan Compute Engine:  Spark  Hive  Presto Storage:  TERADATA  HDFS  Hbase  Elastic Search  Kafka  Aerospike  Sherlock Catalog:  Web UI  1Time Onboard  Catalog API Compute Engine:  Spark  Hive  Presto Storage:  Druid  Cassandra Catalog:  Web UI  Auto-Onboard  Catalog API Compute Engine:  Pig/MR  Beam  Flink Storage:  Alluxio  Oracle  Mysql Catalog:  Metrics/Trend/Status  Discovery/Alerts  Onboard Approval  Jira Integration  Rally Integration  Storage Integration Compute Engine:  Big Query Storage:  GS FS  Big Query  Big Table Catalog:  Enhancements  Improvements  Deployment  Cont. Onboarding  Post-Prod Support  Maintenance  Analysis  Evaluation  Show & Tell 2017 ========== 2018 ========== Open Source – Data API
  • 39. Thank You! Rate This Session # with the PARTNERS Mobile App 0348 https://www.linkedin.com/in/deepakmc/LinkedIn: Questions/Comments Email: dmohanakumarchan@paypal.com
  • 41. 41 References Approver Google Search Links Used for Logos of open source technologies : https://www.google.com/search?q=big+data+stack+icons&source=lnms&tbm=isch&sa=X&ved=0ahUKE wiwnevLztzWAhVHsFQKHZp7CukQ_AUICigB&biw=1369&bih=739 PayPal Q3 2017 & 2016 Full Year Results : https://www.paypal.com/us/webapps/mpp/about
  • 42. 42 Data Platform: Process Flow • Sample Portal: http://pcatalog.weebly.com Onboard Fill Meta & Submit Approval Request Requestor Approver Approved Create Dataset API PCatalog Storage User/Developer Submit Job Create Dataset Meta on PCatalog Create Catalog on Storage Compute (Data API) Data Get Dataset Meta Access 1 2 3 4 5 6 1 2 3 4 Auto-Approve2 RESTAPI
  • 43. 43 Data Platform | Today • Data Access – Other challenges Data access tied to compute and data store versions Hard to find available data sets Storage-specific dataset creation results in duplication and increased latency No audit trail for dataset access No standards for on-boarding data sets for others to discover No statistics on data set usage and access trends
  • 44. 44 POC Phase-1 Phase-2 Phase-3 Phase-4 Phase-5 Data Platform: Plan Proposal Compute Engine:  Spark  Hive  Presto Storage:  HDFS  Hbase  Elastic Search  Kafka Catalog:  Web UI  1Time Onboard  Catalog API Compute Engine:  Spark  Hive  Presto Storage:  Aerospike  Druid  Sherlock  Cassandra Catalog:  Web UI  Auto-Onboard  Catalog API Compute Engine:  Pig/MR  Beam  Flink Storage:  Alluxio  Teradata  Oracle  Mysql Catalog:  Metrics/Trend/Status  Discovery/Alerts  Onboard Approval  Jira Integration  Rally Integration  Storage Integration Compute Engine:  Big Query Storage:  GS FS  Big Query  Big Table Catalog:  Enhancements  Improvements  Deployment  Cont. Onboarding  Post-Prod Support  Maintenance  Analysis  Evaluation  Show & Tell

Editor's Notes

  1. Remove this
  2. Remove this
  3. Additional Benefits Explorer View sample data Add capacity reservations Approval and Tracking Discovery Auto-discovery of datasets and importing into PCatalog Dashboard Operation dashboard: Monitor datasets metrics, statistics, trends visualization and load/refresh status Admin dashboard: Monitor datasets growth/space/capacity statistics and take action (Drop/Stop/Hold) Alert User alerts to notify user datasets refresh delay and quality issue Admin alerts to notify datasets capacity and access violation Query/BI Integration User alerts to notify user datasets refresh delay and quality issue
  4. Additional Benefits Explorer View sample data Add capacity reservations Approval and Tracking Discovery Auto-discovery of datasets and importing into PCatalog Dashboard Operation dashboard: Monitor datasets metrics, statistics, trends visualization and load/refresh status Admin dashboard: Monitor datasets growth/space/capacity statistics and take action (Drop/Stop/Hold) Alert User alerts to notify user datasets refresh delay and quality issue Admin alerts to notify datasets capacity and access violation Query/BI Integration User alerts to notify user datasets refresh delay and quality issue