Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando

Agenda
• Introduction
• PayPal Scale
• Data Scientist Challenges
• Gimel – Data Access Simplified
• PayPal Notebooks – Analytics For Everyone
• Open Source
• Q&A
©2018 PayPal Inc. Confidential and proprietary.

Romit Mehta
Product manager, data processing products at PayPal
20 years in data and analytics across networking, semiconductors, telecom, security and fintech industries
Data warehouse developer, BI program manager, data product manager
romehta@paypal.com
https://www.linkedin.com/in/romit-mehta

PayPal Customers, Transactions and Growth
From: PayPal’s Q3 2018 Investor Update

PayPal Data & Analytics Ecosystem

Volume, Velocity, Variety
• Over 250 PB of data
• Operate across many zones and regions
• Compute choice
• 5,000+ analytics users
• One of the largest deployments of Oracle, Aerospike, Teradata and Hortonworks
• 250,000+ batch jobs a day
• 8,000+ replications
• Polyglot datastores

Dataset Challenges
• Data access tied to compute and data store versions
• Hard to find available data sets
• Storage-specific dataset creation results in duplication and increased latency
• No audit trail for dataset access
• No standards for onboarding data sets for others to discover
• No statistics on data set usage and access trends

Application development challenges
Onboarding Big Data Apps: Learn → Code → Optimize → Build → Deploy → Run
Compute Engine Changed: Learn → Code → Optimize → Build → Deploy → Run
Compute Version Upgraded: Learn → Code → Optimize → Build → Deploy → Run
Storage API Changed: Learn → Code → Optimize → Build → Deploy → Run
Storage Connector Upgraded: Learn → Code → Optimize → Build → Deploy → Run
Storage Hosts Migrated: Learn → Code → Optimize → Build → Deploy → Run
Storage Changed: Learn → Code → Optimize → Build → Deploy → Run
*********************: Learn → Code → Optimize → Build → Deploy → Run

Analytics lifecycle challenges
Objective: Reduce time to market for our end customers by reducing data latency, simplifying access, increasing discoverability of data sets and streamlining development.
Data Latency
Before: Latency: Hours to days; Onboarding: Weeks
Now: Latency: Near real-time; Onboarding: Minutes
Data Access
Before: Discoverability: Minimal; Data Access: Fragmented
Now: Discoverability: 100% of data sets cataloged instantly; Data Access: Unified API and SQL access to all data
Development
Before: CLI-based interactive access; edge-node-based development
Now: Access to near real-time data; REST-based job servers
Days → Sec
[Architecture diagram. Labels: Consumption (PayPal Notebooks, Gimel SDK, BI Tools, SQL Clients); Gimel; Unified Data Catalog; xDiscovery; Metadata Services; Control Plane; Streaming Services; Consumer; Custom Router; CDH; Sources; Data Access; Processing]

Data processing ecosystem
• Elastic compute: Intelligent compute including dynamic environments
• Hybrid dataset: Intelligent data persistence
• In-memory store and cache: Reduce connection flood on underlying systems
• Unified Data Catalog: Find datasets across data stores
• Cross-cluster Data API: Eliminate ad-hoc data movement
• Self-service notebooks deployment: DevOps for analysts
[Platform diagram]
Users: Developer, Data scientist, Analyst, Operator
Layers: User Experience and Access; Gimel Data Platform; Compute Framework and APIs
Components: Gimel SDK, Notebooks, UDC, Data API, R Studio, BI tools
Infrastructure services leveraged for elasticity and redundancy: Multi-DC, Public cloud, Predictive resource allocation
Other labels: Logging, Monitoring, Alerting, Security, Application Lifecycle Management

We’re not in SQL-land anymore
Spark Read From HBase

More stores? More complexity!
Spark Read From HBase
Spark Read From Elasticsearch
Spark Read From Aerospike
Spark Read From Druid

Gimel Data API
With Data API
✔
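
The idea behind a single data API can be sketched in a few lines. This is only an illustration, not Gimel's actual API: the names `CATALOG`, `READERS`, `read`, the dataset names, and the in-memory "stores" are all hypothetical. A catalog entry says where a dataset physically lives, and one `read` call dispatches to the right store-specific reader.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical in-memory stand-ins for two physical stores. Real connectors
# would talk to HBase, Elasticsearch, Aerospike, Druid and so on.
FAKE_HBASE: Dict[str, List[Row]] = {
    "ns:users": [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}],
}
FAKE_ELASTIC: Dict[str, List[Row]] = {
    "orders-index": [{"order_id": 10, "user_id": 1}],
}


def read_hbase(props: Dict[str, str]) -> List[Row]:
    """Store-specific reader: looks up rows by table name."""
    return list(FAKE_HBASE[props["table"]])


def read_elastic(props: Dict[str, str]) -> List[Row]:
    """Store-specific reader: looks up rows by index name."""
    return list(FAKE_ELASTIC[props["index"]])


# One reader per store type (assumed registry layout).
READERS: Dict[str, Callable[[Dict[str, str]], List[Row]]] = {
    "hbase": read_hbase,
    "elasticsearch": read_elastic,
}

# Logical dataset name -> physical properties (assumed catalog layout).
CATALOG: Dict[str, Dict[str, str]] = {
    "udc.users": {"store": "hbase", "table": "ns:users"},
    "udc.orders": {"store": "elasticsearch", "index": "orders-index"},
}


def read(dataset_name: str) -> List[Row]:
    """Single entry point: the caller never names the underlying store."""
    entry = CATALOG[dataset_name]
    return READERS[entry["store"]](entry)


users = read("udc.users")
orders = read("udc.orders")
```

The calling code looks the same for both datasets; only the catalog entry knows which store is involved.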

SQL everywhere
With Data API
✔
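
A rough sketch of what "SQL everywhere" means in practice: datasets that live in different stores are exposed as tables to one SQL engine, so a single query can join them. Here Python's built-in `sqlite3` stands in for the SQL engine, and the `STORES` and `SCHEMAS` dictionaries and the `run_sql` helper are hypothetical names invented for this example, not part of Gimel.

```python
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical data, pretending each dataset lives in a different store.
STORES: Dict[str, List[Tuple]] = {
    "users": [(1, "ann"), (2, "bob")],                      # e.g. a key-value store
    "orders": [(10, 1, 25.0), (11, 1, 5.0), (12, 2, 7.5)],  # e.g. a search index
}
SCHEMAS: Dict[str, str] = {
    "users": "id INTEGER, name TEXT",
    "orders": "order_id INTEGER, user_id INTEGER, amount REAL",
}


def run_sql(query: str) -> List[Tuple]:
    """Expose every dataset as a table, then run one SQL query over all of them."""
    conn = sqlite3.connect(":memory:")
    try:
        for name, rows in STORES.items():
            conn.execute(f"CREATE TABLE {name} ({SCHEMAS[name]})")
            marks = ", ".join("?" for _ in rows[0])
            conn.executemany(f"INSERT INTO {name} VALUES ({marks})", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# One query joins data that originated in two different "stores".
totals = run_sql(
    "SELECT u.name, SUM(o.amount) "
    "FROM users u JOIN orders o ON o.user_id = u.id "
    "GROUP BY u.name ORDER BY u.name"
)
```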

New data development lifecycle
Onboarding Big Data Apps: Learn → Code → Optimize → Build → Deploy → Run
Compute Engine Changed: Run
Compute Version Upgraded: Run
Storage API Changed: Run
Storage Connector Upgraded: Run
Storage Hosts Migrated: Run
Storage Changed: Run
*********************: Run
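
Why a storage change can shrink to just "Run" is easier to see with a toy example. In this sketch (all names are hypothetical and the stores are plain dictionaries, not real connectors), the application only knows a logical dataset name. When the data moves to another store, only the catalog entry is edited and the application function is re-run unchanged.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]

# Two hypothetical stores holding the same data before and after a migration.
OLD_STORE: Dict[str, List[Row]] = {"events": [{"id": 1}, {"id": 2}, {"id": 3}]}
NEW_STORE: Dict[str, List[Row]] = {"events_v2": [{"id": 1}, {"id": 2}, {"id": 3}]}

# Store-specific readers, hidden behind the logical name.
READERS: Dict[str, Callable[[Dict[str, str]], List[Row]]] = {
    "old_store": lambda props: OLD_STORE[props["table"]],
    "new_store": lambda props: NEW_STORE[props["collection"]],
}

# Catalog entry as it looks before the migration (assumed layout).
catalog: Dict[str, Dict[str, str]] = {
    "udc.events": {"store": "old_store", "table": "events"},
}


def read(name: str) -> List[Row]:
    """Resolve the logical name through the catalog at call time."""
    entry = catalog[name]
    return READERS[entry["store"]](entry)


def app_count_events() -> int:
    """Application code: refers only to the logical dataset name."""
    return len(read("udc.events"))


before = app_count_events()

# "Storage Changed": an operator repoints the catalog entry. No app code edits.
catalog["udc.events"] = {"store": "new_store", "collection": "events_v2"}

after = app_count_events()
```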

Gimel – a powerful enabler
• Single unified data API to access any data store
• SQL capabilities against any data store
• Switch between interactive, batch and streaming modes
• Centralized metadata catalog (Unified Data Catalog) to abstract the physical complexities of accessing data
• Open sourced: gimel.io and UnifiedDataCatalog.io
• Integrated with Jupyter notebooks through GSQL
• Dataset browser in notebooks powered by Unified Data Catalog
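
To make the catalog idea concrete, here is a minimal sketch of what a centralized metadata catalog could offer a dataset browser: keyword search across stores, plus an access log that yields usage statistics. The `Catalog` class, its methods, and the dataset names are assumptions made for illustration and do not reflect the Unified Data Catalog's real data model or API.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Catalog:
    """Toy catalog: dataset name -> store type, plus an access audit log."""

    datasets: Dict[str, str] = field(default_factory=dict)
    audit_log: List[Tuple[str, str]] = field(default_factory=list)

    def register(self, name: str, store: str) -> None:
        """Onboard a dataset so that others can discover it."""
        self.datasets[name] = store

    def search(self, keyword: str) -> List[str]:
        """Case-insensitive substring search across all stores."""
        needle = keyword.lower()
        return sorted(n for n in self.datasets if needle in n.lower())

    def record_access(self, user: str, name: str) -> None:
        """Append to the audit trail; unknown datasets are rejected."""
        if name not in self.datasets:
            raise KeyError(f"unknown dataset: {name}")
        self.audit_log.append((user, name))

    def usage(self) -> Counter:
        """Access counts per dataset, derived from the audit trail."""
        return Counter(name for _, name in self.audit_log)


catalog = Catalog()
catalog.register("payments.transactions", "teradata")
catalog.register("risk.transactions_scored", "hbase")
catalog.register("marketing.campaigns", "hive")

matches = catalog.search("transactions")
catalog.record_access("analyst_a", "payments.transactions")
catalog.record_access("analyst_b", "payments.transactions")
top = catalog.usage().most_common(1)
```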

From Jupyter to PayPal Notebooks
[Timeline graphic]
Milestones: Zeppelin, individual use (2016); Jupyter deployed; PayPal Notebooks Beta; PayPal Notebooks Generally Available; PayPal Notebooks Today
Dates shown: Q3 2016, Q3 2017, Feb 2018
User counts shown: ~50 users, ~100 users, ~1,500 users
Languages: SQL, Spark/PySpark, Python, R

Notebooks deployed as a platform
Highly available JupyterHub
• Grid of JupyterHub hosts
• Highly available and distributed
GPU integration
• Enable deep learning through notebooks
• Distributed TensorFlow training enabled with dynamic GPU resource management
Standalone Docker
• Container image with all PPExtensions
• Required to deploy across various security zones at PayPal
• Foundation for open sourcing PPExtensions
SSO + 2FA integration
Kerberos + LDAP integration

PPExtensions: PPMagics
%hive, %teradata
• Query data from Hive (or Teradata)
• Insert data using csv/dataframes
• Publish to Tableau
%run, %run_pipeline
• Run any notebook from another notebook
• Run multiple notebooks in parallel
• Execute a pipeline of notebooks
%csv
• Run SQL on csv files
%presto
• Query data from Presto
• Publish to Tableau
%sts
• Query data from Spark Thrift Server
• Includes progress bar for SQL execution
Collaboration, Publishing, Deployment
GitHub sharing
• Push notebook to common org-wide repo
• View full fidelity notebook on GitHub
• Share link to notebook instead of .ipynb file or code snippets
Tableau publishing
• Seamlessly publish to Tableau
• Download TDE to use Tableau Desktop, or directly publish as a data source
Project collaboration & ML
• Share notebook to personal and team repos
• Resolve conflicts between remote and local notebooks with nbdime
• Integrated TensorFlow for distributed model training
• Enabled TensorFlow with GPU
Deployment/scheduling
• Integrate with Airflow
• Set up frequency, alerts, optionally push to GitHub after every run
• Add Celery executor for scalability

Data is truly democratized
PayPal Notebooks
• The next-generation development experience
• Rich, interactive data exploration and analysis
• Support for over 40 languages including SQL, Python, Scala and R
• Big data and machine learning support built-in
• Built for PayPal: Integrated with SSO+2FA, Kerberos, GitHub, Secure File Transfer, Tableau, Gimel
Gimel
• Single unified data API to access any data store
• SQL capabilities against any data store
• Switch between interactive, batch and streaming modes
• Centralized metadata catalog to abstract the physical complexities of accessing data
• Can be used in standalone or cluster mode
More information:
ppextensions.io
gimel.io
UnifiedDataCatalog.io

Gimel & PPExtensions Open Sourced
Open Source
Home
ppextensions.io
gimel.io
Google Groups
groups.google.com/d/forum/ppextensions
groups.google.com/d/forum/gimel-dev
Slack
ppextensions.slack.com [Invite link]
gimel-dev.slack.com [Invite link]
Install
pip install ppextensions
try.gimel.io
