½ S L using to turn
into
Semi-Supervised Learning on
Hadoop to understand user
behaviors
Hadoop Summit Amsterdam
April 2-3, 2014
Florian Douetteau
@fdouetteau
www.dataiku.com
Data Science
Studio
Motivation
• CxO
– Page Views, Unique Visitors, Dollars, Subscriptions
• Editor / Product Manager
– Time Spent, Comments
• Users
– Content
What matters on a website?
Key Usage Metrics
• Publisher
– Time Spent on Page
– Number of pages seen
– Number of comments
– Move to Subscription
• Search Engine
– Click on first hits / re-click
– Rephrasing ratio
– Will come back tomorrow
– Click on Advertising
• Online Game
– Time spent in the game
– Level Progress
– In-App Purchase
The Quest for the Missing Proxy
• Publisher
– Time Spent on Page
– Number of pages seen
– Number of comments
– User Satisfaction
– Move to Subscription
• Search Engine
– Click on first hits / re-click
– Rephrasing ratio
– User Satisfaction
– Will come back tomorrow
– Click on Advertising
• Online Game
– Time spent in the game
– Level Progress
– User Satisfaction
– In-App Purchase
USER
Question
How to measure and drive user satisfaction on a large website with very diverse usage patterns?
The Problem
Newcomers from Google News
People coming from Twitter and Facebook posts
People coming to the website almost every day
People who love to comment
Foreigners
Robots
People fond of the sport section only
…
BEHAVIOUR DIVERSITY
AVERAGED METRICS WOULD HIDE IMPORTANT VARIATIONS ON SPECIFIC SEGMENTS
SubProblem 1: Hard Segments
• Segment users by number of visits per month
– > 20 days per month -> Engaged Users
• Segment by converted or not
• Segment by country
SubProblem 2: Hard Metrics
• Newspaper
Time Spent on the website
-> log(Number of page views) + Number of actions
• Search engine
Click Ratio
• E-Commerce
-> Conversion Ratio
Limits
Hard Segments
-> MISSING PART OF THE REALITY
Hard Metrics
-> ARGUMENTS BETWEEN TEAMS
Semi-Supervised Learning
Training Data -> Learning -> Model
All Labeled Data -> Supervised Learning -> Model
All Unlabeled Data -> Unsupervised Learning -> Model
Some Labeled Data + Lots of Unlabeled Data -> Semi-Supervised Learning -> Model
½ SL – Natural Language Processing
I hope I’ll enjoy Amsterdam, and not only because of Hadoop
Je pense bien passer du bon temps à Amsterdam, et pas seulement grâce à Hadoop
Statistical Knowledge
-> Text Structure (Unsupervised)
Aligned Corpus (Supervised)
½ SL Applied to Web Sessions
Lots of customer sessions
Not much concrete customer feedback
Subscription
Semi-Supervised Learning
3 Approaches
• Generative models, e.g. Gaussian fits
– Assume all data fits a Gaussian distribution with parameter X
– Find the X that best fits the distribution of both labeled and unlabeled data
• Fits with costs
– Supervised learning with a cost function that captures a distance between points, related to the structure of the unlabeled data
• Ad hoc: combine unsupervised, then supervised
Clustering+Supervised in practice
Unlabeled training data points in grey
Labeled training data points in color
Supervised Learning Only
½ SL : Fit to the underlying structure
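A minimal sketch of the ad hoc approach (cluster first, then use the few labels), assuming cluster ids have already been computed by some clustering step. The function name `propagate_labels` and the majority-vote rule are illustrative assumptions, not the method shown in the talk.

```python
from collections import Counter, defaultdict


def propagate_labels(cluster_ids, known_labels):
    """Give every point the majority label of its cluster.

    cluster_ids: cluster id of each point (output of any clustering step).
    known_labels: {point index: label} for the few labeled points.
    Points in clusters without any labeled member get None.
    """
    votes = defaultdict(Counter)
    for index, label in known_labels.items():
        votes[cluster_ids[index]][label] += 1

    propagated = []
    for cluster in cluster_ids:
        if votes[cluster]:
            propagated.append(votes[cluster].most_common(1)[0][0])
        else:
            propagated.append(None)
    return propagated


# Six points in three clusters, only three of them labeled (toy data).
cluster_ids = [0, 0, 0, 1, 1, 2]
known_labels = {0: "reader", 1: "reader", 3: "bot"}
all_labels = propagate_labels(cluster_ids, known_labels)
```

The unlabeled points inherit structure from the clustering, which is the sense in which the fit follows the underlying structure of the data.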
Our Approach
1. (Lots of) data preparation to build meaningful user sessions
2. Cluster sessions, and have end users validate/tag those clusters
3. Create predictive user satisfaction metrics
4. Follow those metrics!
Data Prep: Overview
RAW DATA
Step 1: Build Sessions (Pig)
Step 2: Parse IP/Time/.. (Custom Python)
Step 3: Parse Sequences (Hive or custom Python)
Step 4: Build user-level stats (Hive)
READY FOR ML
Step 1. Build Session
• Use Hive ( Or Pig)
• Group into “Session”
• Depending on the variable
– IP, Device -> Select only one per log
– URL, Event -> Create an ordered array that represents the sequence of events in the session
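The slides do this step in Hive or Pig. As a rough Python sketch of the same grouping logic: the 30-minute inactivity timeout and the field names (`visitor_id`, `ts`, `ip`, `device`, `url`) are assumptions made for illustration.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the slides


def build_sessions(log_rows, gap=SESSION_GAP_SECONDS):
    """Group raw log rows into sessions, one ordered URL list per session."""
    by_visitor = defaultdict(list)
    for row in log_rows:
        by_visitor[row["visitor_id"]].append(row)

    sessions = []
    for visitor_id, rows in by_visitor.items():
        rows.sort(key=lambda r: r["ts"])
        current = None
        for row in rows:
            # Start a new session on the first hit or after a long pause.
            if current is None or row["ts"] - current["last_ts"] > gap:
                current = {
                    "visitor_id": visitor_id,
                    "ip": row["ip"],          # single value kept per session
                    "device": row["device"],  # single value kept per session
                    "start_ts": row["ts"],
                    "last_ts": row["ts"],
                    "urls": [],               # ordered sequence of events
                }
                sessions.append(current)
            current["urls"].append(row["url"])
            current["last_ts"] = row["ts"]
    return sessions


rows = [
    {"visitor_id": "a", "ts": 0, "ip": "1.2.3.4", "device": "mobile", "url": "/home"},
    {"visitor_id": "a", "ts": 60, "ip": "1.2.3.4", "device": "mobile", "url": "/products"},
    {"visitor_id": "a", "ts": 4000, "ip": "1.2.3.4", "device": "mobile", "url": "/blog"},
    {"visitor_id": "b", "ts": 10, "ip": "5.6.7.8", "device": "desktop", "url": "/home"},
]
sessions = build_sessions(rows)
```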
Step 2: Basic Features
• IP Address -> Location, City
• User-Agent -> Device
• Timestamp -> User Time -> Day or night?
Python + Hadoop Streaming
Option 1 / Option 2
Extracted data: ORIGINAL columns plus NEW columns
NEW columns: Country from IP, Device from User-Agent, Hour from Country & Time
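A sketch of what the per-line logic of such a streaming script could look like. The lookup tables, the tab-separated input layout (ip, user agent, epoch seconds), and the day/night thresholds are all hypothetical; a real job would rely on a GeoIP database and a proper user-agent parser.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup tables for illustration only.
COUNTRY_BY_IP_PREFIX = {"81.": "FR", "17.": "US"}
UTC_OFFSET_HOURS = {"FR": 1, "US": -5}  # fixed offsets, ignoring daylight saving


def country_from_ip(ip):
    for prefix, country in COUNTRY_BY_IP_PREFIX.items():
        if ip.startswith(prefix):
            return country
    return "unknown"


def device_from_user_agent(user_agent):
    ua = user_agent.lower()
    if "bot" in ua or "spider" in ua:
        return "robot"
    if "ipad" in ua or "tablet" in ua:  # checked before "mobile" on purpose
        return "tablet"
    if "mobile" in ua or "iphone" in ua or "android" in ua:
        return "mobile"
    return "desktop"


def local_hour(utc_epoch_seconds, country):
    """Hour of day in the user's country, from a UTC timestamp."""
    offset = UTC_OFFSET_HOURS.get(country, 0)
    utc = datetime.fromtimestamp(utc_epoch_seconds, tz=timezone.utc)
    return (utc + timedelta(hours=offset)).hour


def is_night(hour):
    return hour >= 22 or hour < 6  # assumed definition of "night"


def enrich(line):
    """Append the new columns to one tab-separated log line."""
    ip, user_agent, epoch = line.rstrip("\n").split("\t")
    country = country_from_ip(ip)
    hour = local_hour(int(epoch), country)
    new_columns = [
        country,
        device_from_user_agent(user_agent),
        str(hour),
        "night" if is_night(hour) else "day",
    ]
    return "\t".join([ip, user_agent, epoch] + new_columns)
```

With Hadoop Streaming, the script would apply `enrich` to each line read from standard input and print the result.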
Step 3: Session Signals
• Simple Signals
– Number of Page Views
– Time Spent …..
– Etc…
• Limitation
-> These might not help much to differentiate behaviours
More Elaborate: N-Grams Model
Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
DNA | Base Pair | …ATTAGCAT.. | A,T,T,A,.. | AT,TT,TA,AG,.. | ATT,TTA,TAG,..
NLP (character) | Character | ..some like it hot… | s,o,m,e,l,i,k,e,.. | so,om,me,e_,_l,li,.. | som,ome,me_,_li,lik,..
NLP (word) | Word | ..some like it hot… | some,like,it | some-like,like-it | some-like-it, like-it-hot
N-Grams Model For Sessions
Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Web Sessions | Page View | [/home, /products, /trynow, /blog] | /home, /products, /trynow, /blog | /home-/products, /products-/trynow, /trynow-/blog | /home-/products-/trynow, /products-/trynow-/blog
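The web-session row can be reproduced with a few lines of Python. This is only a sketch; the helper name `session_ngrams` and the "-" separator are choices made here to match the table.

```python
def session_ngrams(pages, n):
    """Return the n-grams of an ordered list of page views, joined with '-'."""
    return ["-".join(pages[i:i + n]) for i in range(len(pages) - n + 1)]


session = ["/home", "/products", "/trynow", "/blog"]
unigrams = session_ngrams(session, 1)
bigrams = session_ngrams(session, 2)
trigrams = session_ngrams(session, 3)
```

A session shorter than n simply yields no n-grams.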
Session N-Grams Analytics
Campaign / URL / Event | Detailed Token | Simple Token
utm=google_search | google-search-my-site | google-search
/home | home | home
/search?q=baseball | search-baseball | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player-comming | sport
/search?q=Mick+JONES | search-mick+jones | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player/comming | sport
/politics/home | politics-home | politics
Important Tricks:
• Incorporate the first referrer / marketing campaign as FIRST TOKEN
• Build two levels of tokens: detailed, and category only
N-Grams Fine Grain N-Grams Coarse Grain
How To In Practice
• Hive query using the n-grams UDF
• Compute the LLR (Log-Likelihood Ratio) metrics
• Keep the most frequent n-grams of each type (detailed / non-detailed) as features for the session
• Hint: set the frequency limit so that more than 90% of sessions can be described by a non-detailed n-gram
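For reference, a sketch of the log-likelihood ratio (the G² statistic) on a 2x2 contingency table. Which counts go into the table is an assumption here: for a bigram "A-B", k11 could be the number of times A is followed by B, k12 A followed by something else, k21 something else followed by B, and k22 everything else.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) of a 2x2 contingency table of counts."""
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    cells = (
        (k11, row_sums[0], col_sums[0]),
        (k12, row_sums[0], col_sums[1]),
        (k21, row_sums[1], col_sums[0]),
        (k22, row_sums[1], col_sums[1]),
    )
    g = 0.0
    for observed, row_sum, col_sum in cells:
        if observed > 0:  # a zero cell contributes nothing
            expected = row_sum * col_sum / total
            g += observed * math.log(observed / expected)
    return 2.0 * g


independent = llr(10, 10, 10, 10)    # counts match independence exactly
weak_link = llr(60, 45, 45, 60)
strong_link = llr(100, 5, 5, 100)
```

A score near zero means the two tokens co-occur about as often as chance predicts; larger scores flag n-grams worth keeping as features.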
Step 4. Cohort-like data
• Per cookie compute metrics
– Nb. Days since first visit
– Nb visits in the last 30 days
– Average session time
– …
• Reintegrate this information
• Easily achieved with a HiveQL query
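The slides do this with a HiveQL query. The same aggregation, sketched in Python for readability: the field names (`cookie`, `start_ts`, `duration`) and the use of epoch seconds are assumptions.

```python
from collections import defaultdict

DAY = 86400  # seconds


def cookie_metrics(sessions, now):
    """Compute cohort-like metrics per cookie from a list of sessions."""
    grouped = defaultdict(list)
    for session in sessions:
        grouped[session["cookie"]].append(session)

    metrics = {}
    for cookie, items in grouped.items():
        first_visit = min(s["start_ts"] for s in items)
        recent = [s for s in items if now - s["start_ts"] <= 30 * DAY]
        metrics[cookie] = {
            "days_since_first_visit": (now - first_visit) // DAY,
            "visits_last_30_days": len(recent),
            "avg_session_seconds": sum(s["duration"] for s in items) / len(items),
        }
    return metrics


sessions = [
    {"cookie": "c1", "start_ts": 10 * DAY, "duration": 60},
    {"cookie": "c1", "start_ts": 95 * DAY, "duration": 120},
    {"cookie": "c2", "start_ts": 99 * DAY, "duration": 30},
]
metrics = cookie_metrics(sessions, now=100 * DAY)

# "Reintegrate": attach the per-cookie metrics to every session row.
enriched = [dict(s, **metrics[s["cookie"]]) for s in sessions]
```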
Machine Learning for HDFS Data
Tool | Kind | Algorithms for clustering | Simplicity | TRAIN set size
Apache Mahout | MapReduce | ~ 10 available | Expert | TERABYTES
Python (Scikit+Pandas+…) | Out for training / In for apply | ~ 20 available (including bi-clustering) | Medium | (10GB) 1 SERVER RAM
H2O | Separate Cluster | 1 (kMeans) | Medium | (100GB – 1TB) CLUSTER RAM
Open Source R + Hadoop | Varies | Varies | Varies | Varies
Open Source R + Pattern (Cascading) | Out for training / In for apply | > 3 | Medium | (1GB) 1 Server RAM in R
Spark + MLlib | Separate Cluster | 1 | Medium | (100GB – 1TB) CLUSTER RAM
How big is our data here?
RAW DATA (5 TB)
Step 1: Build Sessions
Step 2: Parse IP/Time/..
Step 3: Parse Sequences
Step 4: Build user-level stats
READY FOR ML (10 GB)
Uncompressed data size, for 1 year's worth of logs on a website with 10 million unique visitors per month
Clustering With Scikit on HDFS
1. Use Pydoop to get the data onto the training server
2. Use pandas to read the data and transform it to numerical features
3. KMeans().fit()
4. IPython to draw some graphs
5. Enjoy
or
Session Data
Clustering
Clustering & Cluster Sampling
Take a balanced number of samples
in each cluster, close to the centroid
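One way to sketch this sampling step, assuming cluster labels and centroids are already available from the clustering step. The function name `sample_near_centroids` and the Euclidean distance are assumptions.

```python
import math
from collections import defaultdict


def sample_near_centroids(points, labels, centroids, per_cluster):
    """Pick the same number of points in each cluster, closest to its centroid.

    Returns {cluster label: [indices of the selected points]}.
    """
    by_cluster = defaultdict(list)
    for index, (point, label) in enumerate(zip(points, labels)):
        distance = math.dist(point, centroids[label])
        by_cluster[label].append((distance, index))

    return {
        label: [index for _, index in sorted(members)[:per_cluster]]
        for label, members in by_cluster.items()
    }


points = [(0, 0), (1, 0), (5, 5), (10, 10), (10, 11), (20, 20)]
labels = [0, 0, 0, 1, 1, 1]
centroids = {0: (0, 0), 1: (10, 10)}
to_label = sample_near_centroids(points, labels, centroids, per_cluster=2)
```

The selected sessions are the ones handed to people for labelling, so each cluster gets equal attention regardless of its size.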
Labelling
Session timeline: 0’00, 0’12, 1’04, 1’45, 3’02
Visualizing Sessions
Search for a
specific Topic
Labelling
I can guess what this guy was
doing !!!
Labelling
Search for a specific Topic
Newcomer from Google News
Foreigner Discovering The Site
Fan that loves to comment
Home Page Wanderer
Dark Bot (Competitor?)
What if ?
Search for a specific Topic
Newcomer from Google News
Foreigner Discovering The Site
Fan that loves to comment
Home Page Wanderer
Dark Bot (Competitor?)
Supervised Learning
Search for a specific Topic
Newcomer from Google News
Foreigner Discovering The Site
Fan that loves to comment
Home Page Wanderer
Dark Bot (Competitor?)
Independently from the clusters, use the labeled examples to classify each session into the predefined segments
Supervised Learning : e.g. in python
• Load the data and the label in
python (Pandas)
• Fit the labeled sessions against
a model
• Save the model in HDFS
(python pickle)
• Run the model against all the
data (Hadoop Streaming)
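The four steps above can be sketched end to end with the standard library only. A tiny nearest-centroid classifier stands in for whatever model was actually used, the two features (page views, comments) are invented for the example, and the model is pickled to bytes rather than to HDFS.

```python
import math
import pickle


def fit_centroids(features, labels):
    """Train a nearest-centroid model: one mean vector per label."""
    sums, counts = {}, {}
    for vector, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, vector):
    """Return the label whose centroid is closest to the vector."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# 1. Load the labeled sessions (toy data: [page views, comments]).
features = [[2, 0], [3, 0], [10, 5], [12, 7]]
labels = [
    "Home Page Wanderer",
    "Home Page Wanderer",
    "Fan that loves to comment",
    "Fan that loves to comment",
]

# 2. Fit the labeled sessions against a model.
model = fit_centroids(features, labels)

# 3. Save the model (in the talk, the pickle is stored in HDFS).
model_bytes = pickle.dumps(model)

# 4. Reload it and score all sessions, as a streaming job would do per line.
restored = pickle.loads(model_bytes)
predictions = [predict(restored, session) for session in [[1, 0], [9, 4]]]
```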
We’ve got a tool to help you
do that in Data Science Studio
He’s called the Doctor and he’s
fun to use !
Compute Metrics Per Segments
Search for a specific Topic
Newcomer from Google News
Foreigner Discovering The Site
Fan that loves to comment
Home Page Wanderer
Dark Bot (Competitor?)
0.3€ per session
0.23€ acquisition costs
13k sessions
1.3€ per session
0.23€ acquisition costs
938k sessions
938k sessions
0.3€ per session
0.23€ acquisition costs
738k sessions
0.83€ per session
0.73€ acquisition costs
68k sessions
0.3€ per session
1.23€ acquisition costs
1k sessions
0€ per session
0€ acquisition costs
User Satisfaction Metrics
• Future-Based Metrics
– Will the user most likely subscribe/pay in the future?
• Expressed-Opinion
– Does the user appear satisfied, judging from their behaviour?
Opinion-Based Training For User Satisfaction
User feedback as “labels” to build a model of satisfaction
“Predict” a satisfaction score for sessions without feedback
Session Data (100 Million Sessions) + Feedback (10,000 feedback entries) -> Scored Sessions
HYPOTHESIS: IF TWO USERS HAVE SIMILAR NAVIGATION PATTERNS, THEY HAVE SIMILAR USER SATISFACTION LEVELS
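The hypothesis translates directly into a nearest-neighbour estimate: score a session with the average feedback of the most similar sessions that did leave feedback. This is a sketch of that idea only; the feature vectors, the value of k, and the function name are assumptions, and the talk does not say which model was used.

```python
import math


def predict_satisfaction(session_vector, feedback, k=3):
    """Average the scores of the k feedback sessions closest to this one.

    feedback: list of (feature vector, satisfaction score) pairs.
    """
    nearest = sorted(
        feedback, key=lambda item: math.dist(session_vector, item[0])
    )[:k]
    return sum(score for _, score in nearest) / len(nearest)


# Toy feedback set: ([page views, comments], satisfaction score).
feedback = [
    ((1, 0), 0.2),
    ((2, 0), 0.4),
    ((10, 5), 0.9),
    ((11, 6), 0.8),
    ((12, 7), 1.0),
]
engaged_score = predict_satisfaction((11, 6), feedback, k=3)
bouncer_score = predict_satisfaction((1, 0), feedback, k=2)
```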
Compute Metrics Per Segments
Search for a specific Topic
Newcomer from Google News
Foreigner Discovering The Site
Fan that loves to comment
Home Page Wanderer
Dark Bot (Competitor?)
0.3€ per session
0.23€ acquisition costs
13k sessions
1.3€ per session
0.23€ acquisition costs
938k sessions
938k sessions
0.3€ per session
0.23€ acquisition costs
738k sessions
0.83€ per session
0.73€ acquisition costs
68k sessions
0.3€ per session
1.23€ acquisition costs
1k sessions
0€ per session
0€ acquisition costs
SATISFACTION SCORE 0.87
SATISFACTION SCORE 0.37
SATISFACTION SCORE 0.28
SATISFACTION SCORE 0.12
SATISFACTION SCORE 0.28
SATISFACTION SCORE 0.12
Data in Time: Smoothing
In Red: The base metric
In Blue: The smoothed metric
RAW DATA MAY VARY A LOT FROM DAY TO DAY
IT WILL SCARE PEOPLE
Exponential Smoothing In Hive
SELECT segment,
moving_avg(day, satisfaction, 15, 1.52, 15, DATEDIFF('2014-01-15', '2014-01-01'))
FROM
stats
GROUP BY segment
These factors determine
whether you smooth a lot
or not, and over how many days
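The exact semantics of the `moving_avg` UDF and its parameters are not given in the slides. As a reference point, simple exponential smoothing with a single factor `alpha` (an assumed parameter, between 0 and 1) looks like this:

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing of a daily series.

    alpha close to 1 follows the raw data; alpha close to 0 smooths a lot.
    """
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(value)  # the first point has no history
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


daily_satisfaction = [1.0, 0.0, 1.0]
smoothed = exponential_smoothing(daily_satisfaction, alpha=0.5)
```

Applied per segment, this produces the calmer curve that is then tracked over time.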
Final : Follow Smoothed Satisfaction
Search for a specific Topic
Newcomer from Google News
Foreigner Discovering The Site
Fan that loves to comment
Home Page Wanderer
Dark Bot (Competitor?)
Follow Satisfaction Metric Per Segment
Damn, our latest release has diverging effects on segments
Thank You !
Florian Douetteau
@fdouetteau
Questions now or later:
florian.douetteau@dataiku.com
dataiku.com

 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystem
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 

Último

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 

Último (20)

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 

Dataiku hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

  • 1. ½ S L using to turn into
  • 2. Semi-Supervised Learning on Hadoop to understand user behaviors Hadoop Summit Amsterdam 2-3 Avril 2014
  • 4. Motivation • CxO – Page Views, Unique Visitors, Dollars, Subscriptions • Editor / Product Manager – Time Spent, Comments • Users – Content. What matters on a web site?
  • 5. Key Usage Metrics • Publisher – Time Spent on Page – Number of pages seen – Number of comments – Move to Subscription • Search Engine – Click on first hits / re-click – Rephrasing ratio – Will come back tomorrow – Click on Advertising • Online Game – Time spent in the game – Level Progress – In-App Purchase
  • 6. The Quest for the Missing Proxy • Publisher – Time Spent on Page – Number of pages seen – Number of comments – User Satisfaction – Move to Subscription • Search Engine – Click on first hits / re-click – Rephrasing ratio – User Satisfaction – Will come back tomorrow – Click on Advertising • Online Game – Time spent in the game – Level Progress – User Satisfaction – In-App Purchase
  • 7. Question: How to measure and drive user satisfaction on a large web site with very diverse usage patterns?
  • 8. The Problem: newcomers from Google News; people coming from Twitter and Facebook posts; people coming to the website almost every day; people who love to comment; foreigners; robots; people fond of the sport section only; … BEHAVIOUR DIVERSITY: AVERAGED METRICS WOULD HIDE IMPORTANT VARIATION ON SPECIFIC SEGMENTS
  • 9. SubProblem 1: Hard Segments • Segment users per number of visits per month – > 20 days per month -> Engaged Users • Segment per transformed or not • Segment per country
  • 10. Subproblem 2: Hard Metrics • Newspaper – Time spent on the website – log(Number of page views) + Number of actions • Search engine – Click ratio • E-Commerce – Transformation ratio
  • 11. Limits • Hard Segments → MISSING PART OF THE REALITY • Hard Metrics → ARGUING BETWEEN TEAMS
  • 12. Semi-Supervised Learning (training data → model) • Supervised Learning: all labeled data • Unsupervised Learning: all unlabeled data • Semi-Supervised Learning: some labeled data, lots of unlabeled data
  • 13. ½ SL – Natural Language Processing I hope I’ll enjoy Amsterdam, and not only because of Hadoop Je pense bien passer du bon temps à Amsterdam, et pas seulement grâce à Hadoop Statistical Knowledge  Text Structure (Unsupervised) Aligned Corpus (Supervised)
  • 14. ½ SL Applied to Web Sessions Lots of customer sessions Not so many concrete customer feedbacks Subscription
  • 15. Semi-Supervised Learning: 3 Approaches • Generative models, e.g. Gaussian fits – Assume all data follows a Gaussian distribution with parameter X – Find the X that best fits the distribution of both labeled and unlabeled data • Fits with costs – Supervised learning with a cost function that captures a distance between points, derived from the structure of the unlabeled data • Ad hoc: combine unsupervised, then supervised
  • 16. Clustering+Supervised in practice Unlabeled training data points in grey Labeled training data points in color
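The "cluster first, then use the few labels" idea can be sketched in a few lines. This is only an illustration under assumptions: the clustering is taken as already done, the function name `propagate_labels` is hypothetical, and a simple majority vote per cluster stands in for whatever rule the talk actually used.

```python
from collections import Counter, defaultdict


def propagate_labels(cluster_ids, known_labels):
    """Give every point the majority label of its cluster.

    cluster_ids:  list with one cluster id per point (output of any clustering).
    known_labels: dict {point index: label} for the few labeled points.
    Points in clusters that contain no labeled point get None.
    """
    votes = defaultdict(Counter)
    for index, label in known_labels.items():
        votes[cluster_ids[index]][label] += 1
    cluster_label = {
        cluster: counter.most_common(1)[0][0] for cluster, counter in votes.items()
    }
    return [cluster_label.get(cluster) for cluster in cluster_ids]


# Toy data (assumed): six sessions in three clusters, three of them labeled.
cluster_ids = [0, 0, 0, 1, 1, 2]
known_labels = {0: "fan", 1: "fan", 3: "bot"}
propagated = propagate_labels(cluster_ids, known_labels)
```

Cluster 2 has no labeled member, so its point stays unlabeled; in practice that is a hint to label a few more sessions from that cluster.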
  • 18. ½ SL : Fit to the underlying structure
  • 19. Our Approach 1. (Lots of) data preparation to build meaningful user sessions 2. Cluster sessions, and have end users validate/tag those clusters 3. Create predictive user satisfaction metrics 4. Follow those metrics!
  • 20. Data Prep: Overview RAW DATA → Step 1: Build Sessions [Pig] → Step 2: Parse IP/Time/.. [Custom Python (or )] → Step 3: Parse Sequences [Hive or custom Python] → Step 4: Build user-level stats [Hive] → READY FOR ML
  • 21. Step 1. Build Sessions • Use Hive (or Pig) • Group into “Sessions” • Depending on the variable – IP, Device → Select only one per log – URL, Event → Create an ordered array that represents the sequence of events in the session
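The grouping step is done in Hive or Pig in the talk; the logic itself can be sketched in plain Python. The 30-minute inactivity threshold, the tuple layout and the name `build_sessions` are assumptions for illustration, not taken from the slides.

```python
from collections import defaultdict

# Assumption: a session ends after 30 minutes of inactivity.
SESSION_GAP_SECONDS = 30 * 60


def build_sessions(events, gap=SESSION_GAP_SECONDS):
    """Group (cookie, timestamp, ip, url) events into sessions.

    Per session we keep a single IP (the first one seen) and an
    ordered list of URLs, as described on the slide.
    """
    by_cookie = defaultdict(list)
    for cookie, ts, ip, url in events:
        by_cookie[cookie].append((ts, ip, url))

    sessions = []
    for cookie, rows in by_cookie.items():
        rows.sort(key=lambda row: row[0])  # order events by time
        current = None
        last_ts = None
        for ts, ip, url in rows:
            # Start a new session on the first event or after a long pause.
            if current is None or ts - last_ts > gap:
                current = {"cookie": cookie, "start": ts, "ip": ip, "urls": []}
                sessions.append(current)
            current["urls"].append(url)
            last_ts = ts
    return sessions


events = [
    ("c1", 0, "1.2.3.4", "/home"),
    ("c1", 60, "1.2.3.4", "/products"),
    ("c1", 4000, "1.2.3.4", "/blog"),  # more than 30 minutes later: new session
    ("c2", 10, "5.6.7.8", "/home"),
]
sessions = build_sessions(events)
```

On a cluster the same logic would run as a group-by on the cookie followed by this per-user loop.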
  • 22. Step 2: Basic Features • IP Address → Location, City • User-Agent → Device • Timestamp → User Time → Day or night? Python + Hadoop Streaming Option 1 Option 2
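A minimal sketch of two of these features, device from User-Agent and local hour from timestamp and country. Everything here is assumed for illustration: the keyword rules, the fixed UTC offsets (no daylight saving), and the 7:00 to 21:00 definition of "day". IP-to-country lookup needs a GeoIP database and is left out.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup table: fixed UTC offset in hours per country.
UTC_OFFSET_BY_COUNTRY = {"FR": 1, "NL": 1, "US": -5, "JP": 9}


def device_from_user_agent(user_agent):
    """Very rough device classification based on keywords."""
    ua = user_agent.lower()
    if "bot" in ua or "spider" in ua or "crawler" in ua:
        return "robot"
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if "mobile" in ua or "iphone" in ua or "android" in ua:
        return "mobile"
    return "desktop"


def local_hour(epoch_seconds, country):
    """Hour of day in the user's (approximate) local time."""
    offset = UTC_OFFSET_BY_COUNTRY.get(country, 0)
    utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return (utc + timedelta(hours=offset)).hour


def day_or_night(hour):
    # Assumption: "day" means 07:00 to 20:59 local time.
    return "day" if 7 <= hour < 21 else "night"
```

With Hadoop Streaming, functions like these would be called on each line read from standard input, with the new columns appended to the output line.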
  • 23. Extracted Data: three ORIGINAL columns plus three NEW columns: Country (from IP), Device (from User-Agent), Hour (from Country & Time)
  • 24. Step 3: Session Signals • Simple signals – Number of page views – Time spent – Etc. • Limitation → These might not help much to differentiate behaviours
  • 25. More Elaborate: N-Grams Model
    Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
    Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
    DNA | Base Pair | …ATTAGCAT.. | A,T,T,A | AT,TT,TA,AG, | ATT,TTA,TAG,..
    NLP (character) | Character | ..some like it hot… | s,o,m,e,l,i,t.. | so,om,el,li,it | som,ome,me_,_li,lik,..
    NLP (word) | Word | ..some like it hot… | some,like,it | some-like,like-it | some-like-it, like-it-hot
  • 26. N-Grams Model For Sessions (the table from the previous slide, with one added row)
    Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
    Web Sessions | Page View | [/home, /products, /trynow, /blog] | /home, /products, /trynow, /blog | /home-/products, /products-/trynow, /trynow-/blog | /home-/products-/trynow, /products-/trynow-/blog
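The web-session row can be reproduced with a short helper. This is a plain-Python sketch; the function name `ngrams` and the "-" separator are choices made here for illustration, not the Hive UDF used in the talk.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list, joined with '-'."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


session = ["/home", "/products", "/trynow", "/blog"]

one_grams = ngrams(session, 1)
two_grams = ngrams(session, 2)
three_grams = ngrams(session, 3)
```

A session shorter than n simply yields no n-gram, so short sessions contribute only to the lower-order features.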
  • 27. Session N-Grams Analytics
    Campaign / URL / Event | Detailed Token (fine-grain n-grams) | Simple Token (coarse-grain n-grams)
    utm=google_search | google-search-my-site | google-search
    /home | home | home
    /search?q=baseball | search-baseball | search
    click=www.nfl.com | click-nfl | click
    /sport/new-player-com.. | sport/new-player-comming | sport
    /search?q=Mick+JONES | search-mick+jones | search
    click=www.nfl.com | click-nfl | click
    /sport/new-player-com.. | sport/new-player/comming | sport
    /politics/home | politics-home | politics
    Important tricks: • Incorporate the first referrer / marketing campaign as FIRST TOKEN • Build two levels of tokens: detailed, and category only
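A possible way to derive the two token levels from a URL or click event is sketched below. The rules are reverse-engineered from a few rows of the table (search, click and plain paths) and are assumptions; campaign tokens such as `utm=...` would need their own mapping and are not handled. The name `tokens_for` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def tokens_for(event):
    """Return (detailed_token, simple_token) for one URL or click event."""
    if event.startswith("click="):
        parts = event[len("click="):].split(".")
        # Drop a leading "www": www.nfl.com -> nfl
        if len(parts) > 1 and parts[0] == "www":
            parts = parts[1:]
        return "click-" + parts[0], "click"

    url = urlsplit(event)
    segments = [segment for segment in url.path.split("/") if segment]
    if not segments:
        return "home", "home"

    category = segments[0]
    query = parse_qs(url.query)
    if "q" in query:
        # Search pages: keep the (lowercased) search terms in the detailed token.
        terms = query["q"][0].lower().replace(" ", "+")
        return category + "-" + terms, category
    return "-".join(segments), category
```

Feeding both token streams to the n-gram step gives fine-grain and coarse-grain n-grams for the same session.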
  • 28. How To In Practice • Hive query using the n-grams UDF • Compute the LLR (log-likelihood ratio) metrics • Keep the most frequent n-grams of each type (detailed / non-detailed) as features for the session • Hint: set the frequency limit so that > 90% of sessions can be described by a non-detailed n-gram
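Assuming LLR here refers to the log-likelihood ratio (Dunning's G² statistic), it can be computed from a 2×2 table of counts. The slides do not say which two populations are compared; the sketch below assumes "sessions of one group" versus "all other sessions", with and without a given n-gram.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: group sessions containing the n-gram
    k12: group sessions without it
    k21: other sessions containing the n-gram
    k22: other sessions without it
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing
                expected = rows[i] * cols[j] / total
                g += o * math.log(o / expected)
    return 2.0 * g
```

A value near zero means the n-gram is equally common in both populations; large values flag n-grams that characterise one group.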
  • 29. Step 4. Cohort-like data • Per cookie compute metrics – Nb. Days since first visit – Nb visits in the last 30 days – Average session time – … • Reintegrate this information • Easily achieved with a HiveQL query
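The per-cookie metrics are computed with HiveQL in the talk; the same aggregation is sketched here in Python. The input layout `(cookie, start_ts, duration_seconds)` and the function name `cookie_stats` are assumptions.

```python
from collections import defaultdict

DAY = 86400  # seconds


def cookie_stats(sessions, now):
    """Compute cohort-like metrics per cookie.

    sessions: iterable of (cookie, start_ts, duration_seconds).
    now:      reference timestamp (seconds) for "days since" computations.
    """
    grouped = defaultdict(list)
    for cookie, start, duration in sessions:
        grouped[cookie].append((start, duration))

    stats = {}
    for cookie, rows in grouped.items():
        first_visit = min(start for start, _ in rows)
        recent = [start for start, _ in rows if now - start <= 30 * DAY]
        stats[cookie] = {
            "days_since_first_visit": (now - first_visit) // DAY,
            "visits_last_30_days": len(recent),
            "avg_session_seconds": sum(d for _, d in rows) / len(rows),
        }
    return stats


sessions = [("c1", 0, 100), ("c1", 40 * DAY, 300), ("c2", 45 * DAY, 60)]
stats = cookie_stats(sessions, now=50 * DAY)
```

"Reintegrating" the information then amounts to joining these per-cookie values back onto each session row.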
  • 30. Machine Learning for HDFS Data
    Tool | Kind | Algorithms for clustering | Simplicity | TRAIN set size
    Apache Mahout | MapReduce | ~10 available | Expert | TERABYTES
    Python (Scikit+Pandas+…) | Out for training / In for apply | ~20 available (including bi-clustering) | Medium | (10GB) 1 SERVER RAM
    H2O | Separate Cluster | 1 (kMeans) | Medium | (100GB – 1TB) CLUSTER RAM
    Open Source R + Hadoop | Varies | Varies | Varies | Varies
    Open Source R + Pattern (Cascading) | Out for training / In for apply | > 3 | Medium | (1GB) 1 Server RAM in R
    Spark + MLLib | Separate Cluster | 1 | Medium | (100GB – 1TB) CLUSTER RAM
  • 31. How big is our data here? Uncompressed data size, for 1 year's worth of logs on a website with 10 million unique visitors per month: RAW DATA (5 TB) → Step 1: Build Sessions → Step 2: Parse IP/Time/.. → Step 3: Parse Sequences → Step 4: Build user-level stats → READY FOR ML (10 GB)
  • 32. Clustering With Scikit on HDFS 1. Use Pydoop to get data on train server 2. Use pandas to read data and transform it to numerical features 3. KMeans().fit() 4. IPython to draw some graphs 5. Enjoy or
  • 35. Clustering & Cluster Sampling Take a balanced number of samples in each cluster, close to the centroid
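Picking a balanced sample close to each centroid can be sketched as follows. Cluster assignments and centroids are assumed to come from a previous k-means run; the function name `sample_near_centroids` and the toy data are illustrative only.

```python
import math
from collections import defaultdict


def sample_near_centroids(points, labels, centroids, per_cluster):
    """Return, for each cluster, the indices of the points nearest its centroid.

    points:    list of coordinate tuples.
    labels:    cluster id of each point (index into `centroids`).
    centroids: list of coordinate tuples, one per cluster.
    """
    by_cluster = defaultdict(list)
    for index, (point, label) in enumerate(zip(points, labels)):
        distance = math.dist(point, centroids[label])
        by_cluster[label].append((distance, index))
    return {
        label: [index for _, index in sorted(members)[:per_cluster]]
        for label, members in by_cluster.items()
    }


points = [(0, 0), (1, 1), (5, 5), (10, 10), (11, 11), (20, 20)]
labels = [0, 0, 0, 1, 1, 1]
centroids = [(1.0, 1.0), (11.0, 11.0)]
to_label = sample_near_centroids(points, labels, centroids, per_cluster=2)
```

Taking the same number of sessions from every cluster keeps small but interesting segments visible to the people doing the labelling.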
  • 36. Labelling: visualizing sessions on a timeline (0’00, 0’12, 1’04, 1’45, 3’02). Label: Search for a specific Topic. I can guess what this guy was doing!
  • 37. Labelling: Search for a specific Topic; Newcomer from Google News; Foreigner Discovering The Site; Fan that loves to comment; Home Page Wanderer; Dark Bot (Competitor?)
  • 38. What if? Search for a specific Topic; Newcomer from Google News; Foreigner Discovering The Site; Fan that loves to comment; Home Page Wanderer; Dark Bot (Competitor?)
  • 39. Supervised Learning: Search for a specific Topic; Newcomer from Google News; Foreigner Discovering The Site; Fan that loves to comment; Home Page Wanderer; Dark Bot (Competitor?). Independently from the clusters, use the labeled examples to classify each session into the predefined segments
  • 40. Supervised Learning : e.g. in python • Load the data and the label in python (Pandas) • Fit the labeled sessions against a model • Save the model in HDFS (python pickle) • Run the model against all the data (Hadoop Streaming) We’ve got a tool to help you do that in Data Science Studio He’s called the Doctor and he’s fun to use !
  • 41. Compute Metrics Per Segments Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?) 0.3€ per session 0.23€ acquisition costs 13k sessions 1.3€ per session 0.23€ acquisition costs 938k sessions 938k sessions 0.3€ per session 0.23€ acquisition costs 738k sessions 0.83€ per session 0.73€ acquisition costs 68k sessions 0.3€ per session 1.23€ acquisition costs 1k sessions 0€ per session 0€ acquisition costs
  • 42. User Satisfaction Metrics • Future-based metrics – Will the user most likely subscribe/pay in the future? • Expressed opinion – Does the user seem satisfied, judging from their behaviour?
  • 43. Opinion-Based Training For User Satisfaction • User feedbacks as “labels” to build a model on satisfaction • “Predict” a satisfaction score for non-trained sessions • Session Data (100 million sessions) + Feedbacks (10,000 feedbacks) → Scored Sessions • HYPOTHESIS: IF TWO USERS HAVE SIMILAR NAVIGATION PATTERNS, THEY HAVE SIMILAR USER SATISFACTION LEVELS
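The hypothesis "similar navigation, similar satisfaction" maps naturally onto a nearest-neighbour predictor. The slides do not name the model that was used, so k-nearest-neighbours is a stand-in chosen here for illustration; the feature vectors, scores and the name `predict_satisfaction` are made up.

```python
import math


def predict_satisfaction(session, labeled, k=3):
    """Average the feedback scores of the k labeled sessions closest to `session`.

    session: feature vector (tuple of floats) of an unlabeled session.
    labeled: list of (feature_vector, satisfaction_score) from user feedback.
    """
    nearest = sorted(labeled, key=lambda item: math.dist(session, item[0]))[:k]
    return sum(score for _, score in nearest) / len(nearest)


# Toy feedback data (assumed): two behaviour groups with different satisfaction.
labeled = [
    ((1.0, 0.0), 0.9),
    ((1.0, 0.1), 0.8),
    ((0.0, 1.0), 0.2),
    ((0.1, 1.0), 0.1),
]
happy_like = predict_satisfaction((0.9, 0.0), labeled, k=2)
unhappy_like = predict_satisfaction((0.0, 0.9), labeled, k=2)
```

With millions of sessions and only thousands of feedbacks, any regression model trained on the labeled sessions and applied to all the others plays the same role.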
  • 44. Compute Metrics Per Segments Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?) 0.3€ per session 0.23€ acquisition costs 13k sessions 1.3€ per session 0.23€ acquisition costs 938k sessions 938k sessions 0.3€ per session 0.23€ acquisition costs 738k sessions 0.83€ per session 0.73€ acquisition costs 68k sessions 0.3€ per session 1.23€ acquisition costs 1k sessions 0€ per session 0€ acquisition costs SATISFACTION SCORE 0.87 SATISFACTION SCORE 0.37 SATISFACTION SCORE 0.28 SATISFACTION SCORE 0.12 SATISFACTION SCORE 0.28 SATISFACTION SCORE 0.12
  • 45. Data in Time: Smoothing • In red: the base metric • In blue: the smoothed metric • RAW DATA MAY VARY A LOT FROM DAY TO DAY; IT WILL SCARE PEOPLE
  • 46. Exponential Smoothing In Hive SELECT segment, moving_avg(day, satisfaction, 15, 1.52, 15, DATEDIFF('2014-01-15', '2014-01-01')) FROM stats GROUP BY segment • These factors determine whether you smooth a lot or not, and over how many days
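For readers without the Hive UDF at hand, basic exponential smoothing can be sketched in Python. This is the textbook recurrence s_t = α·x_t + (1 − α)·s_{t−1}; the parameter `alpha` is an assumption and does not correspond one-to-one to the arguments of `moving_avg` in the query.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing of a daily metric.

    alpha close to 1 follows the raw data closely;
    alpha close to 0 smooths heavily.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(value)  # the first point is taken as-is
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


daily_satisfaction = [1.0, 0.0, 1.0]  # toy values for one segment
smooth = exponential_smoothing(daily_satisfaction, alpha=0.5)
```

Running it once per segment mirrors the GROUP BY segment in the Hive query.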
  • 47. Final: Follow Smoothed Satisfaction • Segments: Search for a specific Topic; Newcomer from Google News; Foreigner Discovering The Site; Fan that loves to comment; Home Page Wanderer; Dark Bot (Competitor?) • Follow the satisfaction metric per segment • "Damn, our latest release has diverging effects on segments"
  • 48. Thank You ! Florian Douetteau @fdouetteau Questions now or later: florian.douetteau@dataiku.com dataiku.com