2. Dato Confidential
Hello my name is
Neel Kishan
Technical Sales Lead
(former neuroscientist, GPU programmer,
Eagle Scout, Chicago sports fan)
neel@dato.com
Let’s Schedule a Time to Talk:
https://calendly.com/dato-neel
3. Dato Confidential
We empower developers to
create intelligent applications with
real-time machine learning services
quickly and easily.
Intelligent
Applications
Dato
Platform
GraphLab
Create
Dato
Predictive
Services
Machine
Learning
Lifecycle
4. Dato Confidential
Teams have found ways to build
intelligent applications…
Recommenders
Lead Scoring
Churn Prediction
Multi-channel Targeting
Auto-Summarization
Fraud detection
Intrusion Detection
Demand Forecasting
Data Matching
Failure Prediction
5. Dato Confidential
Why do these projects take so long?
• Lengthy code rewrites for scalable production services
• Mundane tasks to integrate libraries, transform data to
specific formats, fill in missing values, etc.
• Many tools are just slow
6. Dato Confidential
Challenges for developing intelligent apps
• Algorithm-centric APIs create confusion and a steep
learning curve
• Understanding models has been a craft passed only
through tribal knowledge
• Production services are hard to maintain and manage
7. Dato Confidential
Intuitive APIs
Easy to learn with smart defaults so your first application comes together fast
Deploy instantly as REST
Eliminates the lengthy rewrites to integrate and serve live, at scale
Integrated libraries for any data
Deep learning, graphs, text, and images on a common scalable data structure, eliminating the
glue code and context switching
Dato Machine Learning
Built to rapidly deliver intelligent applications
9. Dato Confidential
The Dato Machine Learning Platform
Deploy
Models
Feedback
GraphLab Create &
Dato Distributed
Train
Develop
Experiments
Dato Predictive Services
Serve
(REST API)
Monitor
on your infrastructure:
GraphLab Create &
Dato Distributed
• Creating models
• Data engineering
• Evaluation &
Visualization
Predictive Services
• Serving models
• Live experimentation
• Model management
10. Dato Confidential
Scalable Data Structures for Machine Learning
[Figure: example tables with columns User/Com., Title/Body, and User/Disc.]
SFrame - on-disk, columnar & partitioned table
SGraph – graph structure composed of multiple tables
TimeSeries – table with a time index
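To make "columnar & partitioned" concrete, here is a minimal pure-Python sketch of the idea. It is not the SFrame implementation: the `ColumnarTable` class, its `partition_rows` parameter, and the sample data are all hypothetical, and everything stays in memory rather than on disk.

```python
from typing import Dict, Iterator, List


class ColumnarTable:
    """Toy column-oriented table split into fixed-size row partitions.

    Hypothetical illustration only; not the SFrame API.
    """

    def __init__(self, columns: Dict[str, list], partition_rows: int = 2):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.num_rows = lengths.pop() if lengths else 0
        self.partition_rows = partition_rows
        # Each partition keeps every column as its own contiguous list,
        # so a scan of one column never touches the others.
        self.partitions: List[Dict[str, list]] = [
            {name: values[start:start + partition_rows]
             for name, values in columns.items()}
            for start in range(0, self.num_rows, partition_rows)
        ]

    def column(self, name: str) -> Iterator:
        """Stream a single column, one partition at a time."""
        for part in self.partitions:
            yield from part[name]


table = ColumnarTable(
    {"user": ["ann", "bob", "cat"], "rating": [4, 5, 3]},
    partition_rows=2,
)
mean_rating = sum(table.column("rating")) / table.num_rows
```

Computing `mean_rating` reads only the `rating` lists, which is the access pattern a column store is designed for.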
11. Dato Confidential
High performance machine learning
[Figure: test error (%) vs. time (hr); annotation: H2O.ai, 10 machines/80 cores]
recommenders | deep learning & images | graph analytics
Faster algorithms accelerate teams
Fails to complete on other systems!
12. Dato Confidential
Intuitive API – Easily create a live machine learning service
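import graphlab as gl

# Load the ratings data into an SFrame
data = gl.SFrame.read_csv('my_data.csv')

# Create a recommender; the toolkit selects the model automatically
model = gl.recommender.create(
    data,
    user_id='user',
    item_id='movie',
    target='rating')
recommendations = model.recommend(k=5)

# Load a cluster and add the model as a live service
cluster = gl.deploy.load('s3://path')
cluster.add('servicename', model)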
Create a Recommender
5 lines of code
Toolkit w/auto selection
Deploy in minutes
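For readers without GraphLab Create installed, the sketch below shows roughly what a top-k recommendation call produces, using a simple item co-occurrence score in plain Python. This is an illustrative stand-in, not the algorithm the toolkit selects: the functions `build_cooccurrence` and `recommend` and the sample ratings are hypothetical, and the rating values themselves are ignored.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(ratings):
    """Count how often two items were rated by the same user."""
    items_by_user = defaultdict(set)
    for user, item, _rating in ratings:
        items_by_user[user].add(item)
    counts = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return items_by_user, counts


def recommend(user, items_by_user, counts, k=5):
    """Return up to k unseen items, ranked by co-occurrence with seen items."""
    seen = items_by_user.get(user, set())
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _score in ranked[:k]]


# Hypothetical (user, movie, rating) rows, mirroring the columns used above.
ratings = [
    ("ann", "Alien", 5), ("ann", "Brazil", 4),
    ("bob", "Alien", 4), ("bob", "Brazil", 5), ("bob", "Casablanca", 3),
    ("cat", "Brazil", 2), ("cat", "Dune", 5),
]
items_by_user, counts = build_cooccurrence(ratings)
ann_recs = recommend("ann", items_by_user, counts, k=5)
```

A production recommender would also weight by rating and handle new users; the point here is only the shape of the input and output.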
13. Dato Confidential
Dato Machine Learning Toolkits
Applications
• recommender
• sentiment_analysis
• churn_predictor
• data_matching
• pattern_mining
• anomaly_detection
Fundamentals
• regression
• classifier
• nearest_neighbors
• clustering
• deeplearning
• text_analytics
• graph_analytics
Utilities
• model_parameter_search
• cross_validation
• evaluation
• comparison
• feature_engineering
Join us April 7th for a webinar on Deep Learning: Image Similarity and Beyond
18. Dato Confidential
Dato is becoming the backbone of intelligent applications for 80+ customers
• Commercialization of Carnegie Mellon ML Project founded by Professor
Carlos Guestrin in 2013
• Vibrant user community numbering 40,000+ from Coursera and open
source projects
• Major customers in retail, finance, media, and software
20. Dato Confidential
Machine Learning Deployment Options
Dato Predictive Services
Batch write of predictions
Embedded process or script
Export (e.g. PMML)
21. Dato Confidential
Pricing
• Subscription license
which includes support
and upgrades
• Licensed by user for
Create & by machine for
production use
• Training & technical
services also available
24. Dato Confidential
Quantifying the value – Fastest to Production & Reduced Operational Cost
Built a 90% accurate sentiment analyzer for hotel reviews after 30 minutes of trying Dato’s
GraphLab Create
Created an efficient pipeline (40 minutes in Dato vs. 33 days in R) with a 46% lift in accuracy
“[Dato’s] GraphLab Create™ gives us easy access to some of the most advanced machine
learning and this lets us iterate on our ideas faster”
Simplify the process to develop and deploy internal services for SalesForce PDS and adjacent teams
Reduced the hundreds of tools to manage, the complexity of the solution, and development time
Achieved in 2 days with Dato’s GraphLab Create what took 2 weeks in R
Dropped concept to deployment from months to minutes
Replace a heuristic-heavy job ranking system to improve job search relevance
Developed in weeks with significant increase in clickthrough after years of no growth
25. Dato Confidential
Fraud Detection and Security
“Merchant intelligence for safer, more profitable commerce.”
Others like Alan & G2 Web Services:
WHO: Alan Krumholz, Principal Data Scientist
INSPIRATION: Score merchants based on their web presence and actions to help their banking customers identify fraudulent merchants.
VALUE: Accelerate business decisions, reducing manual intervention required and minimizing false positives.
OUTCOME: Achieved in 2 days with GraphLab Create what took two weeks in R. Dropped deployment from months to minutes.
Customer Success Story
26. Dato Confidential
Data Matching
Customer Success Story
“Fast, free, thorough home search.”
Others like Nick & Zillow:
WHO: Nicholas McClure, Senior Data Scientist
INSPIRATION: Build a service that matches property listings across many inbound data feeds and collapses them into a single most accurate listing.
VALUE: Data & listing quality is critical to Zillow’s core product.
OUTCOME: Created an efficient pipeline (40 minutes in GLC vs. 33 days in R) with much higher accuracy (95%, up from 65%).
27. Dato Confidential
Recommenders
Customer Success Story
They are the site for “Advice and support on pregnancy and parenting.”
Others like Shelley & BabyCenter:
WHO: Shelley Klopp, DBA & Chief Architect
INSPIRATION: Build and deploy their first recommender to increase session engagement by recommending relevant content
VALUE: Initial model increased average session by multiple page views
OUTCOME: First prototype built in < 1 week. Ongoing model experimentation is increasing engagement.
28. Dato Confidential
Sentiment and Text Analysis
Customer Success Story
“Get hired. Love your job.”
Others like Marcos and Glassdoor:
WHO: Marcos Sainz, Lead Machine Learning Engineer
INSPIRATION: Replace a heuristic-heavy job ranking system with an ML-driven system to improve job search relevance
VALUE: More relevant jobs led to happier users and higher clickthrough
OUTCOME: Concept to production in weeks
29. Dato Confidential
Image analytics and Deep features
Customer Success Story
“Smart waste management.”
Others like Ben & Compology:
WHO: Ben Chehebar, Co-founder/Lead of Product
INSPIRATION: Use machine learning to predict how full dumpsters are.
VALUE: This allows them to augment their human classification using Mechanical Turk and to scale their operations.
OUTCOME: Concept to deployed service in less than a month, with accuracy as good as or better than the humans.
Scalable, performant production services
Disconnect between DS and Eng
I have struggled to present this. It is really difficult to explain what this is.
Only recently did I figure out the reason.
It is not one thing.
It is really three or four things.
- A Python API, heavily Pandas-inspired. It does a ton of stuff. It also has a rather nice scalable graph data structure to go with it.
- A physical storage layer. It is a heavily compressed column store with type-specific compression routines, especially aggressive for numeric types. It comes with a file system abstraction (for C++ people: fstream, general_fstream) that can read from many places.
- A special “cache” filesystem, which is basically an “in-memory file” that dumps to disk when memory gets full. This is how we get compressed in-memory performance.
- And I am not even talking about our graph data structure either. But talk to me if you want to hear more.
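The two storage ideas in these notes can be sketched with the Python standard library. This is an analogy, not the actual C++ storage layer: delta encoding plus zlib stands in for the unspecified "type-specific compression routines", and `tempfile.SpooledTemporaryFile` stands in for the "cache" filesystem, since it holds data in memory and rolls over to disk once `max_size` is exceeded. The helper names and the sample column are hypothetical.

```python
import struct
import tempfile
import zlib


def encode_int_column(values):
    """Delta-encode an integer column, then compress the packed deltas."""
    if values:
        # First entry is the starting value; the rest are differences.
        deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    else:
        deltas = []
    raw = struct.pack(f"<{len(deltas)}q", *deltas)  # 8-byte signed ints
    return zlib.compress(raw)


def decode_int_column(blob):
    """Invert encode_int_column: decompress, then take a running sum."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 8}q", raw)
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


def open_cache_file(memory_limit_bytes):
    """An "in-memory file" that spills to disk past memory_limit_bytes."""
    return tempfile.SpooledTemporaryFile(max_size=memory_limit_bytes)


# Hypothetical numeric column: evenly spaced values compress very well
# once they are reduced to a run of identical deltas.
timestamps = list(range(1_000_000, 1_010_000, 7))
blob = encode_int_column(timestamps)

with open_cache_file(memory_limit_bytes=1 << 20) as cache:
    cache.write(blob)
    cache.seek(0)
    restored = decode_int_column(cache.read())
```

Because the compressed blob is what sits in the cache file, more of the column fits in memory before anything spills to disk, which is the effect the notes describe as compressed in-memory performance.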