This is my presentation at TDWI Leadership Summit. It talks about how products like Gimel, Unified Data Catalog and PayPal Notebooks help improve data scientist productivity and enable machine learning at scale at PayPal.
7. Volume, Velocity, Variety
Over 250PB of data
Operate across many zones and regions
Compute choice
5,000+ analytics users
One of the largest deployments of Oracle, Aerospike, Teradata and Hortonworks
250,000+ batch jobs a day
8,000+ replications
Polyglot Datastores
9. Dataset Challenges
Data access tied to compute and data store versions
Hard to find available data sets
Storage-specific dataset creation results in duplication and increased latency
No audit trail for dataset access
No standards for onboarding data sets for others to discover
No statistics on data set usage and access trends
Datasets
11. Analytics lifecycle challenges
Objective: Reduce time to market for our end customers by reducing data latency, simplifying access, increasing discoverability of data sets and streamlining development.
Data Latency
Before: Latency: Hours to days; Onboarding: Weeks
Now: Latency: Near real-time; Onboarding: Minutes
Data Access
Before: Discoverability: Minimal; Data Access: Fragmented
Now: Discoverability: 100% data sets cataloged instantly; Data Access: Unified API and SQL; Access to all data
Development
Before: CLI based interactive access; Edge-nodes based development
Now: Access to near real-time data; REST-based job servers
Days → Sec
Architecture diagram (labels only): Consumption; xDiscovery; Metadata Services; Unified Data Catalog; Gimel; PayPal Notebooks; Gimel SDK; BI Tools; SQL Clients; Control Plane; Streaming Services; Consumer; Custom; Router; CDH; Sources; Data Access; Processing
13. Data processing ecosystem
Elastic compute: Intelligent compute including dynamic environments
Hybrid dataset: Intelligent data persistence
In-memory store and cache: Reduce connection flood on underlying systems
Unified Data Catalog: Find datasets across data stores
Cross-cluster Data API: Eliminate ad-hoc data movement
Self-service notebooks deployment: DevOps for analysts
Developer, Data scientist, Analyst, Operator
User Experience and Access: Gimel SDK, Notebooks, R Studio, BI tools
Gimel Data Platform: UDC, Data API
Compute Framework and APIs
Application Lifecycle Management
Logging, Monitoring, Alerting, Security
Infrastructure services leveraged for elasticity and redundancy: Multi-DC, Public cloud, Predictive resource allocation
14. We’re not in SQL-land anymore
Spark Read From HBase
15. More stores? More complexity!
Spark Read From HBase; Spark Read From Elasticsearch
Spark Read From Aerospike; Spark Read From Druid
16. Gimel Data API
Spark Read From HBase; Spark Read From Elasticsearch
Spark Read From Aerospike; Spark Read From Druid
With Data API
✔
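The idea on this slide, one read call regardless of the underlying store, can be sketched roughly as follows. This is a toy illustration in plain Python, not Gimel's actual API: the `DataAPI` class, the `CatalogEntry` record, the dataset names and the in-memory "stores" are all hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]
Reader = Callable[[Dict[str, str]], List[Row]]


@dataclass
class CatalogEntry:
    """Hypothetical catalog record: which store type holds a dataset, and where."""
    storage_type: str
    properties: Dict[str, str] = field(default_factory=dict)


class DataAPI:
    """Toy unified reader: resolve a logical dataset name, then delegate to a connector."""

    def __init__(self, catalog: Dict[str, CatalogEntry]) -> None:
        self._catalog = catalog
        self._readers: Dict[str, Reader] = {}

    def register(self, storage_type: str, reader: Reader) -> None:
        # One connector per store type; callers never see these directly.
        self._readers[storage_type] = reader

    def read(self, dataset: str) -> List[Row]:
        entry = self._catalog[dataset]
        return self._readers[entry.storage_type](entry.properties)


# In-memory stand-ins for real stores (assumption: real connectors would go here).
FAKE_HBASE = {"users": [{"id": 1, "name": "ann"}]}
FAKE_ELASTIC = {"clicks": [{"id": 1, "page": "/home"}]}


def read_hbase(props: Dict[str, str]) -> List[Row]:
    return list(FAKE_HBASE[props["table"]])


def read_elastic(props: Dict[str, str]) -> List[Row]:
    return list(FAKE_ELASTIC[props["index"]])


catalog = {
    "demo.hbase.users": CatalogEntry("hbase", {"table": "users"}),
    "demo.elastic.clicks": CatalogEntry("elasticsearch", {"index": "clicks"}),
}
api = DataAPI(catalog)
api.register("hbase", read_hbase)
api.register("elasticsearch", read_elastic)

# The calling code looks the same for both stores.
users = api.read("demo.hbase.users")
clicks = api.read("demo.elastic.clicks")
```

The design choice worth noting is that store-specific details live in the catalog entry and the connector, so the caller only ever handles a logical dataset name.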
17. SQL everywhere
Spark Read From HBase; Spark Read From Elasticsearch
Spark Read From Aerospike; Spark Read From Druid
With Data API
✔
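A rough sketch of what "SQL everywhere" means in practice: rows that originate in different stores are exposed as tables, and one SQL statement joins them. The example below uses Python's built-in SQLite as a stand-in query engine purely for illustration; it is not how Gimel executes SQL, and the `run_sql` helper and the sample rows are assumptions made up for this sketch.

```python
import sqlite3
from typing import Dict, List, Tuple


def run_sql(datasets: Dict[str, List[dict]], query: str) -> List[Tuple]:
    """Load each (non-empty) dataset into an in-memory SQLite table, then run one query."""
    conn = sqlite3.connect(":memory:")
    try:
        for table, rows in datasets.items():
            columns = list(rows[0].keys())
            col_list = ", ".join(f'"{c}"' for c in columns)
            placeholders = ", ".join("?" for _ in columns)
            conn.execute(f'CREATE TABLE "{table}" ({col_list})')
            conn.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})',
                [tuple(row[c] for c in columns) for row in rows],
            )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Pretend these rows came from two different stores (hypothetical sample data).
rows_from_hbase = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
rows_from_elasticsearch = [
    {"user_id": 1, "page": "/home"},
    {"user_id": 1, "page": "/pay"},
    {"user_id": 2, "page": "/home"},
]

result = run_sql(
    {"users": rows_from_hbase, "clicks": rows_from_elasticsearch},
    "SELECT u.name, COUNT(*) FROM users u "
    "JOIN clicks c ON c.user_id = u.id "
    "GROUP BY u.name ORDER BY u.name",
)
```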
18. New data development lifecycle
Onboarding Big Data Apps: Learn, Code, Optimize, Build, Deploy, Run
Compute Engine Changed: Run
Compute Version Upgraded: Run
Storage API Changed: Run
Storage Connector Upgraded: Run
Storage Hosts Migrated: Run
Storage Changed: Run
*********************: Run
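One way to read this slide is that application code only needs to be re-run, not rewritten, when the storage behind a dataset changes. A minimal sketch of that property, assuming a catalog that maps a logical dataset name to whatever currently serves it (the dataset name, both reader functions and the sample rows are hypothetical):

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for two physical stores holding the same data.
def read_from_old_store() -> List[dict]:
    return [{"txn": 1, "amount": 10.0}, {"txn": 2, "amount": 5.5}]


def read_from_new_store() -> List[dict]:
    return [{"txn": 1, "amount": 10.0}, {"txn": 2, "amount": 5.5}]


# The catalog is the only place that knows where the dataset lives.
catalog: Dict[str, Callable[[], List[dict]]] = {
    "payments.txns": read_from_old_store,
}


def total_amount(dataset: str) -> float:
    """Application code: depends on the logical dataset name only."""
    return sum(row["amount"] for row in catalog[dataset]())


before = total_amount("payments.txns")

# "Storage Changed": repoint the catalog entry; total_amount is untouched.
catalog["payments.txns"] = read_from_new_store

after = total_amount("payments.txns")
```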
19. Gimel – a powerful enabler
Single unified data API to access any data store
SQL capabilities against any data store
Switch between interactive, batch and streaming modes
Centralized metadata catalog (Unified Data Catalog) to abstract the physical complexities of accessing data
Open sourced: gimel.io and UnifiedDataCatalog.io
Integrated with Jupyter notebooks through GSQL
Dataset browser in notebooks powered by Unified Data Catalog
21. From Jupyter to PayPal Notebooks
Jupyter deployed
PayPal Notebooks Beta
PayPal Notebooks Generally Available
PayPal Notebooks Today
Q3 2016 Q3 2017
~50 users
Feb 2018
~100 users
~1,500 users
SQL, Spark/PySpark, Python, R
2016
Zeppelin
Individual use
22. Notebooks deployed as a platform
Highly available JupyterHub
• Grid of JupyterHub hosts
• Highly available and distributed
GPU integration
• Enable deep learning through notebooks
• Distributed TensorFlow training enabled with dynamic GPU resource management
Standalone Docker
• Container image with all PPExtensions
• Required to deploy across various security zones at PayPal
• Foundation for open sourcing PPExtensions
SSO + 2FA integration
Kerberos + LDAP integration
23. PPExtensions: PPMagics
%hive, %teradata
• Query data from Hive (or Teradata)
• Insert data using CSV/dataframes
• Publish to Tableau
%run, %run_pipeline
• Run any notebook from another notebook
• Run multiple notebooks in parallel
• Execute a pipeline of notebooks
%csv
• Run SQL on CSV files
%presto
• Query data from Presto
• Publish to Tableau
%sts
• Query data from Spark Thrift Server
• Includes progress bar for SQL execution
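The difference between running notebooks in parallel and running them as a pipeline can be sketched with plain Python callables standing in for notebooks. This is only an illustration of the two execution patterns, not how PPExtensions implements `%run` or `%run_pipeline`; the function names and the parameter-passing convention below are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Params = Dict[str, object]
Notebook = Callable[[Params], Params]  # stand-in: a notebook takes and returns parameters


def run_parallel(notebooks: List[Notebook], params: Params) -> List[Params]:
    """Run independent notebooks concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=len(notebooks)) as pool:
        return list(pool.map(lambda nb: nb(dict(params)), notebooks))


def run_pipeline(notebooks: List[Notebook], params: Params) -> Params:
    """Run notebooks one after another, feeding each output into the next."""
    state = dict(params)
    for notebook in notebooks:
        state = notebook(state)
    return state


# Three toy "notebooks" (hypothetical).
def extract(p: Params) -> Params:
    return {**p, "rows": [1, 2, 3]}


def transform(p: Params) -> Params:
    return {**p, "total": sum(p["rows"])}


def report(p: Params) -> Params:
    return {**p, "message": f"total={p['total']}"}


final_state = run_pipeline([extract, transform, report], {"run_id": 1})
parallel_results = run_parallel([extract, extract], {"run_id": 1})
```

In the pipeline case each step depends on the previous step's output, so the order matters; in the parallel case every notebook gets its own copy of the starting parameters.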
24. Collaboration, Publishing, Deployment
GitHub sharing
• Push notebook to common org-wide repo
• View full fidelity notebook on GitHub
• Share link to notebook instead of .ipynb file or code snippets
Tableau publishing
• Seamlessly publish to Tableau
• Download TDE to use Tableau Desktop, or directly publish as a data source
Project collaboration & ML
• Share notebook to personal and team repos
• Resolve conflicts between remote and local notebooks with nbdime
• Integrated TensorFlow for distributed model training
• Enabled TensorFlow with GPU
Deployment/scheduling
• Integrate with Airflow
• Set up frequency, alerts, optionally push to GitHub after every run
• Add Celery executor for scalability
26. Data is truly democratized
PayPal Notebooks
• The next-generation development experience
• Rich, interactive data exploration and analysis
• Support for over 40 languages including SQL, Python, Scala and R
• Big data and machine learning support built-in
• Built for PayPal: Integrated with SSO+2FA, Kerberos, GitHub, Secure File Transfer, Tableau, Gimel
Gimel
• Single unified data API to access any data store
• SQL capabilities against any data store
• Switch between interactive, batch and streaming modes
• Centralized metadata catalog to abstract the physical complexities of accessing data
• Can be used in standalone or cluster mode
More information:
ppextensions.io
gimel.io
UnifiedDataCatalog.io