© 2014 MapR Technologies
Who I am
Ted Dunning, Chief Applications Architect, MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
Apache Mahout https://mahout.apache.org/
Twitter @ApacheMahout
22 June 2014 Big Data Everywhere Conference #DataIsrael
http://www.wired.com/wiredenterprise/2012/12/mahout/
Recommendation: Widely Used Machine Learning
Example: Open source Apache Mahout used in production
Recommendations
– Data: interactions between people taking action (users) and items
• Data used to train recommendation model
– Goal is to suggest additional interactions
– Example applications: movie, music or map-based restaurant choices;
suggesting sale items for e-stores or via cash-register receipts
Google maps: restaurant recommendations
Google maps: tech recommendations
Tutorial Part 1:
How recommendation works,
or “I want a pony”…
First question:
Are you using the right data?
Recommendation
Behavior of a crowd
helps us understand
what individuals will do
Recommendations
Alice got an apple and a puppy
Charles got a bicycle
Bob got an apple
What else would Bob like? A puppy!
You get the idea of how
recommenders can work…
By the way, like me, Bob also
wants a pony…
Recommendations
What if everybody (Alice, Bob, Charles) gets a pony?
What else would you recommend for new user Amelia?
If everybody gets a pony, it’s not a very good indicator of what else to predict...
Problems with Raw Co-occurrence
• Very popular items co-occur with everything, which is why it’s not very
helpful to know that everybody wants a pony…
– Examples: Welcome document; Elevator music
• Very widespread occurrence is not useful for generating indicators
for recommendation
– Unless you want to offer an item that is constantly desired, such as
razor blades (or ponies)
• What we want is anomalous co-occurrence
– This is the source of interesting indicators of preference on which to
base recommendation
Overview: Get Useful Indicators from Behaviors
1. Use log files to build history matrix of users x items
– Remember: this history of interactions will be sparse compared to all
potential combinations
2. Transform to a co-occurrence matrix of items x items
3. Look for useful indicators by identifying anomalous co-occurrences to
make an indicator matrix
– Log Likelihood Ratio (LLR) can help judge which co-occurrences
can be used with confidence as indicators of preference
– ItemSimilarityJob in Apache Mahout uses LLR
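The first two steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy log, the user and item names, and the helper functions `build_history` and `cooccurrence` are hypothetical and are not part of Mahout.

```python
from collections import defaultdict
from itertools import combinations


def build_history(log_pairs):
    """Group (user, item) log records into one set of items per user."""
    history = defaultdict(set)
    for user, item in log_pairs:
        history[user].add(item)
    return dict(history)


def cooccurrence(history):
    """Count, for each unordered item pair, how many users touched both items."""
    counts = defaultdict(int)
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical toy log in the spirit of the slides.
log = [
    ("alice", "apple"), ("alice", "puppy"), ("alice", "pony"),
    ("bob", "apple"), ("bob", "pony"),
    ("charles", "bicycle"), ("charles", "pony"),
]
history = build_history(log)   # sparse users x items history
pairs = cooccurrence(history)  # items x items co-occurrence counts
```

Storing only the observed pairs keeps the history sparse, which matters because most user and item combinations never occur. Step 3 then filters these raw counts down to the anomalous ones.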
Apache Mahout: Overview
• Open source Apache project http://mahout.apache.org/
• Mahout version 0.9 was released Feb 2014; includes Scala
– Summary 0.9 blog at http://bit.ly/1rirUUL
• Library of scalable algorithms for machine learning
– Some run on Apache Hadoop distributions; others do not require Hadoop
– Some can be run at small scale
– Some are run in parallel; others are sequential
• Includes the following main areas:
– Clustering & related techniques
– Classification
– Recommendation
– Mahout Math Library
Log Files
[Figure: log entries for users Alice, Bob, Charles, Alice, Bob, Charles, Alice]
Log Files
Users: u1, u3, u2, u1, u3, u2, u1
Items: t1, t4, t3, t2, t3, t3, t1
History Matrix: Users x Items
Alice
Bob
Charles
✔ ✔ ✔
✔ ✔
✔ ✔
Co-Occurrence Matrix: Items x Items
[Figure: items x items co-occurrence matrix of counts]
How do you tell which co-occurrences are useful?
Co-Occurrence Matrix: Items x Items
[Figure: items x items co-occurrence matrix of counts]
Use LLR test to turn co-occurrence into indicators…
Co-occurrence Binary Matrix
[Figure: 2 x 2 binary matrix with rows and columns labelled “1” and “not 1”]
Which one is the anomalous co-occurrence?
        A      not A
B       13     1000
not B   1000   100,000

        A      not A
B       1      0
not B   0      10,000

        A      not A
B       10     0
not B   0      100,000

        A      not A
B       1      0
not B   0      2
Which one is the anomalous co-occurrence?
        A      not A
B       13     1000
not B   1000   100,000
Score: 0.90

        A      not A
B       1      0
not B   0      10,000
Score: 4.52

        A      not A
B       10     0
not B   0      100,000
Score: 14.3

        A      not A
B       1      0
not B   0      2
Score: 1.95
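The four scores on this slide appear to match the square root of the log-likelihood ratio (G²) computed on each 2 x 2 table. The sketch below is a minimal plain-Python version of that test, written under that assumption. The function names `llr` and `root_llr` are illustrative, and Mahout's own implementation may differ in details such as sign handling.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2 x 2 contingency table of counts.

    k11 = A and B together, k12 = B without A,
    k21 = A without B,      k22 = neither.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero count contributes nothing to the sum
            expected = rows[i] * cols[j] / n
            total += k * math.log(k / expected)
    return 2.0 * total


def root_llr(k11, k12, k21, k22):
    """Square root of G^2, clamped at zero to absorb rounding error."""
    return math.sqrt(max(llr(k11, k12, k21, k22), 0.0))


scores = {
    "13 / 1000 / 1000 / 100,000": root_llr(13, 1000, 1000, 100000),
    "1 / 0 / 0 / 10,000": root_llr(1, 0, 0, 10000),
    "10 / 0 / 0 / 100,000": root_llr(10, 0, 0, 100000),
    "1 / 0 / 0 / 2": root_llr(1, 0, 0, 2),
}
```

The table with the largest raw co-occurrence count (13) gets the lowest score, because A and B are both common there and 13 is close to what chance alone would produce. The table with 10 co-occurrences and no exceptions among 100,000 others is the anomalous one.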
Co-Occurrence Matrix: Items x Items
[Figure: items x items co-occurrence matrix of counts]
Recap: use LLR test to turn co-occurrence into indicators
Indicator Matrix: Anomalous Co-Occurrence
Result: each marked row shows the indicators of what to recommend…
[Figure: indicator matrix with anomalous co-occurrences marked ✔]
Indicator Matrix: Anomalous Co-Occurrence
[Figure: indicator matrix with anomalous co-occurrences marked ✔]
Why not pony + other item?
How will you deliver
recommendations to users?
Seeking Simplicity
Innovation:
Exploit search technology to deploy
your recommendation system
But first, a look at how
search works…
Apache Solr/Apache Lucene
• Apache Solr/Lucene is a powerful open-source search
engine used for flexible, heavily indexed queries over data
such as
– Full text, geographical data, statistically weighted data
• Lucene
– Provides core retrieval
– Is a low-level library
• Solr
– Is a web-based wrapper around Lucene
– Is easy to integrate because you talk to it via a web-interface
• URL http://host machine:8888
LucidWorks
• Enterprise platform and collection of applications based on
Apache Solr/Lucene
– Wrapper around Solr
– A free version ships with MapR
• LucidWorks leaves Solr exposed but makes Solr
administration much easier, which in turn makes it easier to
use Lucene
• URL http://host machine:8989
Relationship of Solr/Lucene/LucidWorks
[Figure: diagram with LucidWorks, Solr, Lucene, Index and Data Source, linked by Query/Response arrows]
Other Options: Elasticsearch
• Apache Lucene library is at the heart of several approaches
– It can also be used on its own
• Elasticsearch is a different web interface for Lucene
– Real-time search and analytics
– Open source (not an Apache project)
– Big advantage is less accumulated cruft
http://www.elasticsearch.com/
What is a Document?
• Data is stored in collections made up of documents
• Documents contain fields that can be
– Indexed
• Makes the field searchable; you don’t have to index every field
– Stored
• If you want Solr to return content, have Solr store the content
• Not all data of interest must be stored: it can be accessed via a stored URL (good for very
large data sets)
– Multi-value
• Body field can contain more than one type of data
– Faceted
• A way to refine a search or to gather statistics
• Example: data for country could be returned faceted as “37 from US, 23 UK, 7 Japan”
Fields and How to Set Them Up
• Lucene is mostly used for text
– Text has to be tokenized
• Also supports other types
– Long, string, keywords, comma separated
• Field properties such as stored, indexed and faceted can also be
defined
• The defaults usually aren’t so great
Example of Facetted Search
The indexed fields “Area” and “Gender” have been
facetted to provide counts for the results
Field-specific Searches Using Lucene Syntax
• Documents often have title, author, keywords and body text
• General syntax is
field:(term1 term2)
• Alternatives
field:(term1 OR term2)
field:("term1 term2"~5)
field:(term1 AND term2)
• The default field and default interpretations work very well for text
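Query strings in this syntax are easy to assemble in code. The helper below is a hypothetical sketch, not part of Lucene or Solr: `field_query` only formats strings and does no escaping of Lucene special characters, which real input would need.

```python
def field_query(field, terms, operator=None, slop=None):
    """Build a field-specific Lucene query string.

    operator: None for the default interpretation, or "OR" / "AND".
    slop:     if given, build a proximity phrase query instead.
    """
    if slop is not None:
        phrase = " ".join(terms)
        return f'{field}:("{phrase}"~{slop})'
    joiner = f" {operator} " if operator else " "
    return f"{field}:({joiner.join(terms)})"


default_q = field_query("title", ["call", "wild"])
or_q = field_query("title", ["call", "wild"], operator="OR")
near_q = field_query("title", ["call", "wild"], slop=5)
```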
Send Data to Solr (not LucidWorks)
• LucidWorks has lots of spiders that can use file extension or
mime type to trigger certain file parsers
– Works best with web-ish sources and mime types
– Also includes MapR and MapR high volume indexers
• At modest volumes, or for updates, it is more common to use JSON
format
{"id":"book_314", "title":{"set":"The Call of the Wild"}}
• Use REST interface to send update files
curl http://localhost:8983/solr/update \
  -H 'Content-type:application/json' --data-binary @file.json
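The same update can be prepared from Python with only the standard library. This is a sketch under assumptions: the URL is the one shown on the slide, the helper `build_update_request` is hypothetical, and the documents are wrapped in a JSON list on the assumption that the update handler expects a list of documents. The request is built but not sent, so the code runs without a Solr instance.

```python
import json
import urllib.request

# URL copied from the slide; adjust host, port and path for your installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON POST request carrying document updates."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = [{"id": "book_314", "title": {"set": "The Call of the Wild"}}]
request = build_update_request(docs)

# To send it, a running Solr instance is required:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```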
Back to recommendation:
How do you abuse search to make
recommendation easy?
Collection of Documents: Insert Meta-Data
[Figure: item meta-data flows into the search technology]
Ingest easily via NFS
Document for “puppy”
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
From Indicator Matrix to New Indicator Field
Solr document for “puppy”
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)
Note: data for the indicator field is added directly to the meta-data for a document in the Apache
Solr or Elasticsearch index. You don’t need to create a separate index for the indicators.
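A minimal sketch of this step follows, assuming the item documents are held as plain dictionaries before indexing. The helper names are hypothetical. The query form, which searches the indicator field with the IDs of items a user has recently interacted with, is an assumption about how the field would be used and is not shown on this slide.

```python
def add_indicators(doc, indicators):
    """Return a copy of an item document with an 'indicators' field added."""
    updated = dict(doc)
    updated["indicators"] = list(indicators)
    return updated


def recommendation_query(recent_items, field="indicators"):
    """Turn a user's recent item IDs into a query against the indicator field."""
    return f"{field}:({' '.join(recent_items)})"


puppy = {
    "id": "t4",
    "title": "puppy",
    "desc": "The sweetest little puppy ever.",
    "keywords": ["puppy", "dog", "pet"],
}
puppy_doc = add_indicators(puppy, ["t1"])   # t1 is the indicator from the slide
query = recommendation_query(["t1", "t3"])  # hypothetical recent history
```

Because the indicators live in the same document as the title and keywords, one search request can combine a recommendation query with ordinary text or facet constraints.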
Let’s look at a real example:
we built a music recommender
User activity: Listens to classic jazz hit “Take the A Train”
System delivers recommendations based on activity
Let’s look inside the music
recommender…
Music Meta Data for Search Document Collections
• MusicBrainz data
• Data includes Artist ID, MusicBrainz ID, Name, Group/Person, From (geo
locations) and Gender as seen in this sample
Sample User Behavior Histories: Music Log Files
Fields: Time, Event type, User ID, Artist ID, Track ID
13 START 10113 2182654281
23 BEACON 10113 2182654281
24 START 10113 79600611935028
34 BEACON 10113 79600611935028
44 BEACON 10113 79600611935028
54 BEACON 10113 79600611935028
64 BEACON 10113 79600611935028
74 BEACON 10113 79600611935028
84 BEACON 10113 79600611935028
94 BEACON 10113 79600611935028
104 BEACON 10113 79600611935028
109 FINISH 10113 79600611935028
111 START 10113 58999912011972
121 BEACON 10113 58999912011972
Sample Music Log Files
Artist ID for jazz
musician Duke Ellington
What has user 119
done here in the
highlighted lines?
Internals of a Recommendation Engine
Lucene Documents for Music Recommendation
id 1710
mbid 592a3b6d-c42b-4567-99c9-ecf63bd66499
name Chuck Berry
area United States
gender Male
indicator_artists 386685,875994,637954,3418,1344,789739,1460, …

id 541902
mbid 983d4f8f-473e-4091-8394-415c105c4656
name Charlie Winston
area United Kingdom
gender None
indicator_artists 997727,815,830794,59588,900,2591,1344,696268, …

Notice that data from the indicator matrix of the trained Mahout recommender
model has been added to the indicator field in documents of the artists
collection
Offline Analysis
[Figure: offline analysis. Labels: users’ history log files; analysis using Mahout; indicators; item meta-data; search technology]
Real-time recommendations using MapR data platform
[Figure: MapR cluster architecture. Labels: log files; Mahout analysis; Pig; Python (use Python directly via NFS); item meta-data (ingest easily via NFS); search technology; web tier; new user history; recommendations]
A Quick Simplification
• Users who do $h$
• Also do $Ah$
$A^T (A h)$: user-centric recommendations
$(A^T A) h$: item-centric recommendations
Architectural Advantage
$A^T (A h)$: user-centric recommendations
$(A^T A) h$: item-centric recommendations
With the first design, you have to do the real-time computation first (in
parentheses). There is no way to pre-compute, so it is less efficient and slower.
With the second design, you can pre-compute offline (overnight)
things that change slowly. Only the smaller computation for the new user
vector ($h$) is done in real time, so response is very fast.
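Both groupings give the same result because matrix multiplication is associative; only the cost profile differs. The sketch below checks this on a tiny hypothetical history matrix in plain Python. The matrix values and helper names are illustrative only.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(x, y):
    """Matrix-matrix product."""
    cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


# Hypothetical history matrix A: rows = users, columns = items.
A = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]
h = [1, 0, 0]  # new user's history: interacted with item 0 only

At = transpose(A)

# User-centric: every request touches the full history matrix twice.
user_centric = matvec(At, matvec(A, h))

# Item-centric: the items x items matrix is computed once, offline ...
item_item = matmul(At, A)
# ... and each request needs only one small matrix-vector product.
item_centric = matvec(item_item, h)
```

In a deployment along the lines of this deck, the offline product would also be filtered (for example with LLR) before being stored as indicators; this sketch keeps raw counts for simplicity.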
Tutorial Part 2:
How to make recommendation better
Going Further: Multi-Modal Recommendation
For example
• Users enter queries (A)
– (actor = user, item=query)
• Users view videos (B)
– (actor = user, item=video)
• $A^T A$ gives query recommendation
– “did you mean to ask for”
• $B^T B$ gives video recommendation
– “you might like these videos”
The punch-line
• $B^T A$ recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
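A small sketch of the cross-occurrence product, using hypothetical data: three users, two queries and three videos. It uses raw counts in plain Python; a production system would presumably filter these counts with LLR as described in Part 1.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(x, y):
    """Matrix-matrix product."""
    cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


# Hypothetical behaviour matrices; rows are the same three users in both.
A = [  # users x queries
    [1, 0],
    [1, 0],
    [0, 1],
]
B = [  # users x videos
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]

cross = matmul(transpose(B), A)    # B^T A: videos x queries
query_vector = [1, 0]              # the user has just entered query 0
video_scores = matvec(cross, query_vector)
```

Video 0 scores highest for query 0 because two users who entered that query also watched it. Nothing about the video's title or description is consulted.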
Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres de paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
Hypothetical Example
• Want a navigational ontology?
• Just put labels on a web page with traffic
– This gives A = users x label clicks
• Remember viewing history
– This gives B = users x items
• Cross recommend
– $B^T A$ = label to item mapping
• After several users click, results are whatever users think they
should be
Nice. But can we
do better?
Symmetry Gives Cross Recommendations
$(A^T A) h$: conventional recommendations with off-line learning
$(B^T A) h$: cross recommendations
[Figure: matrix $A$ with rows = users and columns = things]
[Figure: matrix $\begin{bmatrix} A_1 & A_2 \end{bmatrix}$ with rows = users; columns split into thing type 1 and thing type 2]
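$$
\begin{bmatrix} A_1 & A_2 \end{bmatrix}^T \begin{bmatrix} A_1 & A_2 \end{bmatrix}
= \begin{bmatrix} A_1^T \\ A_2^T \end{bmatrix} \begin{bmatrix} A_1 & A_2 \end{bmatrix}
= \begin{bmatrix} A_1^T A_1 & A_1^T A_2 \\ A_2^T A_1 & A_2^T A_2 \end{bmatrix}
$$

$$
\begin{bmatrix} r_1 \\ r_2 \end{bmatrix}
= \begin{bmatrix} A_1^T A_1 & A_1^T A_2 \\ A_2^T A_1 & A_2^T A_2 \end{bmatrix}
\begin{bmatrix} h_1 \\ h_2 \end{bmatrix}
$$

$$
r_1 = \begin{bmatrix} A_1^T A_1 & A_1^T A_2 \end{bmatrix}
\begin{bmatrix} h_1 \\ h_2 \end{bmatrix}
$$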
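Multiplying out the last block product makes the two contributions explicit. This is only the expansion of the expression for $r_1$, with the interpretation of each term taken from the earlier slides:

```latex
r_1 = A_1^T A_1 \, h_1 + A_1^T A_2 \, h_2
```

The first term is the conventional recommendation of type-1 things from type-1 history $h_1$. The second term is the cross recommendation of type-1 things from type-2 history $h_2$.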
Bonus Round:
When worse is better
The Real Issues After First Production
• Exploration
• Diversity
• Speed
• Not the last fraction of a percent
Result Dithering
• Dithering is used to re-order recommendation results
– Re-ordering is done randomly
• Dithering is guaranteed to make off-line performance worse
• Dithering also has a near perfect record of making actual performance
much better
“Made more difference than any other change”
Why Dithering Works
[Figure: loop connecting the real-time recommender, log files and overnight training]
Why Use Dithering?
Simple Dithering Algorithm
• Synthetic score from log rank plus Gaussian
$s = \log r + N(0, \log \epsilon)$
• Pick noise scale to provide desired level of mixing
$\Delta r / r \propto \epsilon$
• Typically $\epsilon \in [1.5, 3]$
• Also… use $\lfloor t/T \rfloor$ as seed
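The algorithm above fits in a short function. The sketch below is one possible reading, with assumptions stated plainly: $\log \epsilon$ is treated as the standard deviation of the Gaussian noise, lower synthetic scores sort first, and the function name `dither` and the default period of 600 seconds are hypothetical.

```python
import math
import random


def dither(ranked_items, epsilon=2.0, t=0.0, period=600.0):
    """Re-order a ranked list using noisy log-rank scores.

    ranked_items: items in their original order, best first.
    epsilon:      noise scale; 1.0 leaves the order unchanged.
    t, period:    floor(t / period) seeds the generator, so a user sees a
                  stable ordering within each time window.
    """
    rng = random.Random(math.floor(t / period))
    sigma = math.log(epsilon)  # assumption: log(epsilon) is the std deviation
    scored = []
    for rank, item in enumerate(ranked_items, start=1):
        score = math.log(rank) + rng.gauss(0.0, sigma)
        scored.append((score, rank, item))
    scored.sort()  # ties fall back to the original rank
    return [item for _, _, item in scored]


items = list(range(1, 21))
first_view = dither(items, epsilon=2.0, t=100.0)
same_window = dither(items, epsilon=2.0, t=500.0)
next_window = dither(items, epsilon=2.0, t=700.0)
```

Because the noise is added to the log of the rank, the top few results move only a little while deeper results are mixed much more freely, which is how items from the second page occasionally get exposure.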
Example … ε = 2
1 2 8 3 9 15 7 6
1 8 14 15 3 2 22 10
1 3 8 2 10 5 7 4
1 2 10 7 3 8 6 14
1 5 33 15 2 9 11 29
1 2 7 3 5 4 19 6
1 3 5 23 9 7 4 2
2 4 11 8 3 1 44 9
2 3 1 4 6 7 8 33
3 4 1 2 10 11 15 14
11 1 2 4 5 7 3 14
1 8 7 3 22 11 2 33
Lesson:
Exploration is good
Thank you
Ted Dunning, Chief Applications Architect, MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
Apache Mahout https://mahout.apache.org/
Twitter @ApacheMahout
22 June 2014 Big Data Everywhere Conference #DataIsrael

Mais conteúdo relacionado

Mais procurados

Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Scaling big-data-mining-infra2
Scaling big-data-mining-infra2Scaling big-data-mining-infra2
Scaling big-data-mining-infra2Chris Huang
 
Approaching real-time-hadoop
Approaching real-time-hadoopApproaching real-time-hadoop
Approaching real-time-hadoopChris Huang
 
Hbase status quo apache-con europe - nov 2012
Hbase status quo   apache-con europe - nov 2012Hbase status quo   apache-con europe - nov 2012
Hbase status quo apache-con europe - nov 2012Chris Huang
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksDataWorks Summit
 
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeDataWorks Summit
 
Hadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confHadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confSujee Maniyam
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsMilind Bhandarkar
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphDataWorks Summit
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterMilind Bhandarkar
 
Real World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionReal World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitMilind Bhandarkar
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesCarol McDonald
 
Best Practices for Protecting Sensitive Data Across the Big Data Platform
Best Practices for Protecting Sensitive Data Across the Big Data PlatformBest Practices for Protecting Sensitive Data Across the Big Data Platform
Best Practices for Protecting Sensitive Data Across the Big Data PlatformMapR Technologies
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data ScientistsDataWorks Summit
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Carol McDonald
 

Mais procurados (20)

Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Scaling big-data-mining-infra2
Scaling big-data-mining-infra2Scaling big-data-mining-infra2
Scaling big-data-mining-infra2
 
Approaching real-time-hadoop
Approaching real-time-hadoopApproaching real-time-hadoop
Approaching real-time-hadoop
 
Hbase status quo apache-con europe - nov 2012
Hbase status quo   apache-con europe - nov 2012Hbase status quo   apache-con europe - nov 2012
Hbase status quo apache-con europe - nov 2012
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About Time
 
Hadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confHadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA conf
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
 
Real World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionReal World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in Production
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision Trees
 
Best Practices for Protecting Sensitive Data Across the Big Data Platform
Best Practices for Protecting Sensitive Data Across the Big Data PlatformBest Practices for Protecting Sensitive Data Across the Big Data Platform
Best Practices for Protecting Sensitive Data Across the Big Data Platform
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1
 
Apache drill
Apache drillApache drill
Apache drill
 

Destaque

Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
HBase backups and performance on MapR
HBase backups and performance on MapRHBase backups and performance on MapR
HBase backups and performance on MapRlohitvijayarenu
 
Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012MapR Technologies
 
Apache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL ReferencesApache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL ReferencesMapR Technologies
 
Spark & Hadoop at Production at Scale
Spark & Hadoop at Production at ScaleSpark & Hadoop at Production at Scale
Spark & Hadoop at Production at ScaleMapR Technologies
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR Technologies
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation TechnTed Dunning
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRDouglas Bernardini
 
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッションApache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッションMapR Technologies Japan
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7Ted Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityMapR Technologies
 
Spark Application for Time Series Analysis
Spark Application for Time Series AnalysisSpark Application for Time Series Analysis
Spark Application for Time Series AnalysisMapR Technologies
 
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)Spark Summit
 
Hazelcast Essentials
Hazelcast EssentialsHazelcast Essentials
Hazelcast EssentialsRahul Gupta
 
How Spark is Enabling the New Wave of Converged Applications
How Spark is Enabling  the New Wave of Converged ApplicationsHow Spark is Enabling  the New Wave of Converged Applications
How Spark is Enabling the New Wave of Converged ApplicationsMapR Technologies
 

Destaque (20)

Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
HBase backups and performance on MapR
HBase backups and performance on MapRHBase backups and performance on MapR
HBase backups and performance on MapR
 
Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Apache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL ReferencesApache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL References
 
Spark & Hadoop at Production at Scale
Spark & Hadoop at Production at ScaleSpark & Hadoop at Production at Scale
Spark & Hadoop at Production at Scale
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document Database
 
Big Data Paris
Big Data ParisBig Data Paris
Big Data Paris
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation Techn
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッションApache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
 
MapR & Skytree:
MapR & Skytree: MapR & Skytree:
MapR & Skytree:
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and Security
 
Spark Application for Time Series Analysis
Spark Application for Time Series AnalysisSpark Application for Time Series Analysis
Spark Application for Time Series Analysis
 
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
 
Hazelcast Essentials
Hazelcast EssentialsHazelcast Essentials
Hazelcast Essentials
 
How Spark is Enabling the New Wave of Converged Applications
How Spark is Enabling  the New Wave of Converged ApplicationsHow Spark is Enabling  the New Wave of Converged Applications
How Spark is Enabling the New Wave of Converged Applications
 

Practical Machine Learning: Innovations in Recommendation Workshop

  • 1. © 2014 MapR Technologies
  • 2. Who I am: Ted Dunning, Chief Applications Architect, MapR Technologies. Email: tdunning@mapr.com, tdunning@apache.org. Twitter: @Ted_Dunning. Apache Mahout: https://mahout.apache.org/ (Twitter: @ApacheMahout). 22 June 2014, Big Data Everywhere Conference, #DataIsrael
  • 3. Recommendation: Widely Used Machine Learning. Example: Open source Apache Mahout used in production (http://www.wired.com/wiredenterprise/2012/12/mahout/)
  • 4. Recommendations
      – Data: interactions between people taking action (users) and items
          • Data used to train recommendation model
      – Goal is to suggest additional interactions
      – Example applications: movie, music or map-based restaurant choices; suggesting sale items for e-stores or via cash-register receipts
  • 5. Google Maps: restaurant recommendations
  • 6. Google Maps: tech recommendations
  • 7. Tutorial Part 1: How recommendation works, or “I want a pony”…
  • 8. First question: Are you using the right data?
  • 9. Recommendation: Behavior of a crowd helps us understand what individuals will do
  • 10. Recommendations. Alice: Alice got an apple and a puppy
  • 11. Recommendations. Alice: Alice got an apple and a puppy. Charles: Charles got a bicycle
  • 12. Recommendations. Alice: Alice got an apple and a puppy. Charles: Charles got a bicycle. Bob: Bob got an apple
  • 13. Recommendations. Alice: Alice got an apple and a puppy. Charles: Charles got a bicycle. Bob: What else would Bob like?
  • 14. Recommendations. Alice: Alice got an apple and a puppy. Charles: Charles got a bicycle. Bob: A puppy!
  • 15. You get the idea of how recommenders can work…
  • 16. By the way, like me, Bob also wants a pony…
  • 17. Recommendations (Alice, Bob, Charles, Amelia): What if everybody gets a pony? What else would you recommend for new user Amelia?
  • 18. Recommendations (Alice, Bob, Charles, Amelia): If everybody gets a pony, it’s not a very good indicator of what else to predict...
  • 19. Problems with Raw Co-occurrence
      • Very popular items co-occur with everything, which is why it’s not very helpful to know that everybody wants a pony…
          – Examples: Welcome document; Elevator music
      • Very widespread occurrence is not interesting for generating indicators for recommendation
          – Unless you want to offer an item that is constantly desired, such as razor blades (or ponies)
      • What we want is anomalous co-occurrence
          – This is the source of interesting indicators of preference on which to base recommendation
  • 20. Overview: Get Useful Indicators from Behaviors
      1. Use log files to build history matrix of users x items
          – Remember: this history of interactions will be sparse compared to all potential combinations
      2. Transform to a co-occurrence matrix of items x items
      3. Look for useful indicators by identifying anomalous co-occurrences to make an indicator matrix
          – Log Likelihood Ratio (LLR) can be helpful to judge which co-occurrences can with confidence be used as indicators of preference
          – ItemSimilarityJob in Apache Mahout uses LLR
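The first two steps of the overview (history matrix, then item-by-item co-occurrence) can be sketched in a few lines of Python. This is only an illustrative sketch: the `history` dictionary is made-up data loosely based on the Alice/Bob/Charles slides, and the `cooccurrence` helper is a hypothetical name, not part of Mahout.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(history):
    """Count, for every pair of items, how many users interacted with both.

    `history` maps a user to the items that user interacted with
    (a sparse users x items history matrix).
    """
    counts = defaultdict(int)
    for items in history.values():
        # Each unordered pair of distinct items in one user's history
        # co-occurs once for that user.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical data: everybody gets a pony, as in the slides.
history = {
    "Alice": ["apple", "puppy", "pony"],
    "Bob": ["apple", "pony"],
    "Charles": ["bicycle", "pony"],
}
cooc = cooccurrence(history)
for pair, n in sorted(cooc.items()):
    print(pair, n)
```

In this toy data the pony co-occurs with every other item, which is the raw co-occurrence problem the slides describe: the counts alone do not say which pairs are interesting.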
  • 21. Apache Mahout: Overview
      • Open source Apache project http://mahout.apache.org/
      • Mahout version is 0.9, released Feb 2014; includes Scala
          – Summary of 0.9 on the blog at http://bit.ly/1rirUUL
      • Library of scalable algorithms for machine learning
          – Some run on Apache Hadoop distributions; others do not require Hadoop
          – Some can be run at small scale
          – Some are run in parallel; others are sequential
      • Includes the following main areas:
          – Clustering & related techniques
          – Classification
          – Recommendation
          – Mahout Math Library
  • 22. Log Files [Figure: log entries for Alice, Bob, Charles, Alice, Bob, Charles, Alice]
  • 23. Log Files [Figure: log entries shown as IDs: u1 u3 u2 u1 u3 u2 u1 and t1 t4 t3 t2 t3 t3 t1]
  • 24. History Matrix: Users x Items [Figure: rows Alice, Bob, Charles; check marks show user-item interactions]
  • 25. Co-Occurrence Matrix: Items x Items [Figure: matrix of co-occurrence counts]. How do you tell which co-occurrences are useful?
  • 26. Co-Occurrence Matrix: Items x Items [Figure: matrix of co-occurrence counts]. Use LLR test to turn co-occurrence into indicators…
  • 27. Co-occurrence Binary Matrix [Figure: 2 x 2 table with "not" rows and columns]
  • 28. Which one is the anomalous co-occurrence? (four 2 x 2 tables; columns A / not A, rows B / not B)
      – B: 13, 1000; not B: 1000, 100,000
      – B: 1, 0; not B: 0, 10,000
      – B: 10, 0; not B: 0, 100,000
      – B: 1, 0; not B: 0, 2
  • 29. Which one is the anomalous co-occurrence? (same four tables, now with scores)
      – B: 13, 1000; not B: 1000, 100,000 (score 0.90)
      – B: 1, 0; not B: 0, 10,000 (score 4.52)
      – B: 10, 0; not B: 0, 100,000 (score 14.3)
      – B: 1, 0; not B: 0, 2 (score 1.95)
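The four scores can be reproduced with a short calculation. The sketch below uses the entropy form of the log-likelihood ratio for a 2 x 2 table and then takes the square root. Reading the slide's numbers as root-LLR values is an assumption, supported only by the fact that the numbers match. The function names are hypothetical, and this is not claimed to be Mahout's own code.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is taken as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2 x 2 contingency table.

    k11: A and B together, k12: B without A,
    k21: A without B,      k22: neither.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    # Unsigned square root of the LLR score.
    return math.sqrt(llr(k11, k12, k21, k22))


tables = [
    (13, 1000, 1000, 100000),
    (1, 0, 0, 10000),
    (10, 0, 0, 100000),
    (1, 0, 0, 2),
]
for t in tables:
    print(t, round(root_llr(*t), 2))
```

The table with 13 co-occurrences scores lowest even though its raw count is highest, because both items are individually common. The table with 10 co-occurrences and no occurrences of either item alone scores highest, which is the anomalous case.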
  • 30. Co-Occurrence Matrix: Items x Items [Figure: matrix of co-occurrence counts]. Recap: Use LLR test to turn co-occurrence into indicators
  • 31. Indicator Matrix: Anomalous Co-Occurrence [Figure: indicator matrix with two check marks]. Result: Each marked row shows the indicators of what to recommend…
  • 32. Indicator Matrix: Anomalous Co-Occurrence [Figure: indicator matrix with two check marks]. Why not pony + other item?
  • 33. How will you deliver recommendations to users?
  • 34. Seeking Simplicity. Innovation: Exploit search technology to deploy your recommendation system
  • 35. But first, a look at how search works…
  • 36. © 2014 MapR Technologies 39 Apache Solr/Apache Lucene • Apache Solr/Lucene is an open-source powerful search engine used for flexible, heavily indexed queries including data such as – Full text, geographical data, statistically weighted data • Lucene – Provides core retrieval – Is a low-level library • Solr – Is a web-based wrapper around Lucene – Is easy to integrate because you talk to it via a web-interface • URL http://host machine:8888
  • 37. © 2014 MapR Technologies 40 LucidWorks • Enterprise platform and collection of applications based on Apache Solr/Lucene – Wrapper around Solr – A free version ships with MapR • LucidWorks leaves Solr exposed but makes Solr administration much easier, which in turn makes it easier to use Lucene • URL http://host machine:8989
  • 38. © 2014 MapR Technologies 41 Solr LucidWorks Query Response Index Lucene Query Response Data Source Relationship of Solr /Lucene/LucidWorks
  • 39. © 2014 MapR Technologies 42 Other Options: Elastic Search • Apache Lucene library at the heart of several approaches – It can also be used on its own • Elastic Search is a different web interface for Lucene – Real time search and analytics – Open source (not Apache) – Big advantage is less accumulated cruft http://www.elasticsearch.com/
  • 40. © 2014 MapR Technologies 44 What is a Document? • Data is stored in collections made up of documents • Documents contain fields that can be – Indexed • Makes the field searchable; you don’t have to index every field – Stored • If you want Solr to return content, have Solr store the content • Not all data of interest must be stored: it can be accessed via a stored URL (good for very large data sets) – Multi-valued • A field can contain more than one value – Faceted • A way to refine a search or to compute statistics • Example: faceting on a country field could return “37 from US, 23 UK, 7 Japan”
  • 41. © 2014 MapR Technologies 45 Fields and How to Set Them Up • Lucene is mostly used for text – Text has to be tokenized • Also supports other types – Long, string, keywords, comma separated • Field properties such as stored, indexed, faceted can also be defined • Defaults aren’t usually so great
  • 42. © 2014 MapR Technologies 46 Example of Faceted Search The indexed fields “Area” and “Gender” have been faceted to provide counts for the results
  • 43. © 2014 MapR Technologies 48 Field-specific Searches Using Lucene Syntax • Documents often have title, author, keywords and body text • General syntax is field:(term1 term2) • Alternatives: field:(term1 OR term2), field:("term1 term2"~5), field:(term1 AND term2) • Default field, default interpretations work very well for text
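The field-specific forms above are plain strings, so they are easy to generate from code. The following is a minimal sketch; `field_query` is a hypothetical helper, not part of Lucene or Solr, and it does not escape special characters in the terms.

```python
def field_query(field, terms, op=None, slop=None):
    """Build a Lucene-syntax query string restricted to one field.

    op   -- optional boolean operator placed between terms ("AND" or "OR")
    slop -- if given, search for the terms as a phrase within this many positions
    """
    if slop is not None:
        # Proximity search: the ~N follows the closing quote with no space
        phrase = " ".join(terms)
        return f'{field}:("{phrase}"~{slop})'
    joiner = f" {op} " if op else " "
    return f"{field}:({joiner.join(terms)})"

print(field_query("title", ["call", "wild"]))
print(field_query("title", ["call", "wild"], op="AND"))
print(field_query("title", ["call", "wild"], slop=5))
```

The proximity form is the one most often mistyped: a space before the `~` changes how the query is parsed.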
  • 44. © 2014 MapR Technologies 49 Send Data to Solr (not LucidWorks) • LucidWorks has lots of spiders that can use file extension or mime type to trigger certain file parsers – Works best with web-ish sources and mime types – Also includes MapR and MapR high volume indexers • At modest volumes or for updates, it is more common to use JSON format {"id":"book_314", "title":{"set":"The Call of the Wild"}} • Use the REST interface to send update files curl http://localhost:8983/solr/update -H 'Content-type:application/json' --data-binary @file.json
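The same update can be prepared from Python with only the standard library. This is a sketch under assumptions: the URL is the one from the curl example, the document is the one from the slide, and wrapping the documents in a JSON list is an assumption about how several documents would be sent in one request. The request is built but not sent, so the snippet runs without a Solr server.

```python
import json
import urllib.request

def build_update_request(docs, url="http://localhost:8983/solr/update"):
    # Serialize the documents and prepare a POST with a JSON content type
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# The slide's example document; {"set": ...} replaces the field value
docs = [{"id": "book_314", "title": {"set": "The Call of the Wild"}}]
req = build_update_request(docs)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it (requires a running Solr instance):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```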
  • 45. © 2014 MapR Technologies 51 Back to recommendation: How do you abuse search to make recommendation easy?
  • 46. © 2014 MapR Technologies 52 Collection of Documents: Insert Meta-Data Search Technology Item meta-data Document for “puppy” id: t4 title: puppy desc: The sweetest little puppy ever. keywords: puppy, dog, pet Ingest easily via NFS
  • 47. © 2014 MapR Technologies 53 From Indicator Matrix to New Indicator Field ✔ id: t4 title: puppy desc: The sweetest little puppy ever. keywords: puppy, dog, pet indicators: (t1) Solr document for “puppy” Note: data for the indicator field is added directly to meta-data for a document in Apache Solr or Elastic Search index. You don’t need to create a separate index for the indicators.
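The step from indicator matrix to indicator field can be sketched as a simple merge of two inputs: the item meta-data and, for each item, the list of indicator item IDs. The puppy document and its indicator `t1` come from the slide; the function name and the shape of the `indicators` mapping are assumptions made for illustration.

```python
def add_indicators(item_docs, indicators):
    """Return copies of the item documents with an 'indicators' field added.

    item_docs  -- list of dicts holding item meta-data, each with an 'id'
    indicators -- dict mapping item id -> iterable of indicator item ids
                  (one row of the indicator matrix per item)
    """
    enriched_docs = []
    for doc in item_docs:
        enriched = dict(doc)  # copy so the input meta-data is left untouched
        enriched["indicators"] = sorted(indicators.get(doc["id"], []))
        enriched_docs.append(enriched)
    return enriched_docs

puppy = {
    "id": "t4",
    "title": "puppy",
    "desc": "The sweetest little puppy ever.",
    "keywords": ["puppy", "dog", "pet"],
}
# Indicator row for the puppy, as marked on the slide
indicator_rows = {"t4": ["t1"]}

result = add_indicators([puppy], indicator_rows)
print(result[0])
```

Because the indicators live in the same document as the meta-data, one index serves both ordinary search and recommendation queries.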
  • 48. © 2014 MapR Technologies 54 Let’s look at a real example: we built a music recommender
  • 49. © 2014 MapR Technologies 56
  • 50. © 2014 MapR Technologies 57 User activity: Listens to classic jazz hit “Take the A Train”
  • 51. © 2014 MapR Technologies 58 System delivers recommendations based on activity
  • 52. © 2014 MapR Technologies 59 Let’s look inside the music recommender…
  • 53. © 2014 MapR Technologies 60 Music Meta Data for Search Document Collections • MusicBrainz data • Data includes Artist ID, MusicBrainz ID, Name, Group/Person, From (geo locations) and Gender as seen in this sample
  • 54. © 2014 MapR Technologies 62 Sample User Behavior Histories: Music Log Files. Columns: Time, Event type, User ID, Artist ID, Track ID. Rows: 13 START 10113 2182654281; 23 BEACON 10113 2182654281; 24 START 10113 79600611935028; 34 BEACON 10113 79600611935028; 44 BEACON 10113 79600611935028; 54 BEACON 10113 79600611935028; 64 BEACON 10113 79600611935028; 74 BEACON 10113 79600611935028; 84 BEACON 10113 79600611935028; 94 BEACON 10113 79600611935028; 104 BEACON 10113 79600611935028; 109 FINISH 10113 79600611935028; 111 START 10113 58999912011972; 121 BEACON 10113 58999912011972
  • 55. © 2014 MapR Technologies 63 Sample Music Log Files Artist ID for jazz musician Duke Ellington What has user 119 done here in the highlighted lines?
  • 56. © 2014 MapR Technologies 64 Internals of a Recommendation Engine
  • 57. © 2014 MapR Technologies 65 Internals of a Recommendation Engine
  • 58. © 2014 MapR Technologies 66 Lucene Documents for Music Recommendation. Document 1: id: 1710; mbid: 592a3b6d-c42b-4567-99c9-ecf63bd66499; name: Chuck Berry; area: United States; gender: Male; indicator_artists: 386685,875994,637954,3418,1344,789739,1460, … Document 2: id: 541902; mbid: 983d4f8f-473e-4091-8394-415c105c4656; name: Charlie Winston; area: United Kingdom; gender: None; indicator_artists: 997727,815,830794,59588,900,2591,1344,696268, … Notice that data from the indicator matrix of the trained Mahout recommender model has been added to the indicator field in documents of the artists collection
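At recommendation time, the recent artist IDs in a user's history become an ordinary search query against the `indicator_artists` field. The sketch below builds such a query and, to show the effect without a running search engine, ranks the two slide documents by simple overlap. The host, the `artists` collection path and the history values are assumptions; the indicator IDs are only the ones visible on the slide (both lists are truncated there), and a real engine would use its own relevance scoring rather than a raw overlap count.

```python
from urllib.parse import urlencode

# Indicator IDs copied from the slide; both lists are truncated with "…" there
docs = {
    "Chuck Berry": {386685, 875994, 637954, 3418, 1344, 789739, 1460},
    "Charlie Winston": {997727, 815, 830794, 59588, 900, 2591, 1344, 696268},
}

def indicator_query(history):
    # Lucene field query: any document whose indicators match the history
    return "indicator_artists:(" + " ".join(str(a) for a in history) + ")"

def select_url(history, base="http://localhost:8983/solr/artists/select", rows=10):
    # Hypothetical host and collection; q, rows, fl and wt are Solr parameters
    params = {"q": indicator_query(history), "rows": rows, "fl": "id,name", "wt": "json"}
    return base + "?" + urlencode(params)

def overlap_ranking(history, docs):
    # Stand-in for search scoring: count indicators shared with the history
    hits = {name: len(ids & set(history)) for name, ids in docs.items()}
    return sorted((n for n in hits if hits[n] > 0), key=lambda n: -hits[n])

history = [1344, 3418]  # hypothetical recent listens, as artist IDs
print(indicator_query(history))
print(select_url(history))
print(overlap_ranking(history, docs))
```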
  • 59. © 2014 MapR Technologies 68 Offline Analysis Analysis Using Mahout Users History Log Files Indicators Search Technology Item Meta-Data
  • 60. © 2014 MapR Technologies 69 Log Files Mahout Analysis Search Technology Item Meta-Data Ingest easily via NFS MapR Cluster via NFS Python Use Python directly via NFS Pig Web TierRecommendations New User History Real-time recommendations using MapR data platform
  • 61. © 2014 MapR Technologies 70 A Quick Simplification • Users who do $h$: $Ah$ • Also do: $A^T(Ah)$ • User-centric recommendations: $A^T(Ah)$ • Item-centric recommendations: $(A^TA)h$
  • 62. © 2014 MapR Technologies 71 Architectural Advantage • User-centric recommendations: $A^T(Ah)$ • Item-centric recommendations: $(A^TA)h$
  • 63. © 2014 MapR Technologies 72 Architectural Advantage • User-centric recommendations, $A^T(Ah)$: with the first design, you have to do the real-time computation first (in parentheses). There is no way to pre-compute, so it is less efficient and slower. • Item-centric recommendations, $(A^TA)h$: with the second design, you can pre-compute offline (overnight) the things that change slowly. Only the smaller computation for the new user vector ($h$) is done in real time, so response is very fast.
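The two designs give the same answer because matrix multiplication is associative; only the order of work differs. The sketch below uses the Alice, Bob and Charles example from earlier in the deck with plain co-occurrence counts (no LLR filtering or weighting, which a real system would add). The helper functions are illustrative.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]

# History matrix A: rows are users, columns are items (apple, puppy, bicycle)
A = [
    [1, 1, 0],  # Alice: apple, puppy
    [1, 0, 0],  # Bob: apple
    [0, 0, 1],  # Charles: bicycle
]
h = [1, 0, 0]  # current user's recent history: apple

# User-centric: A h needs h, so nothing can be computed before the request
user_centric = matvec(transpose(A), matvec(A, h))

# Item-centric: A^T A does not depend on h and can be computed overnight
item_item = matmul(transpose(A), A)
item_centric = matvec(item_item, h)

print(user_centric, item_centric)
```

In the item-centric form, the request-time work is a single matrix-vector product against the pre-computed item-by-item matrix, which is what a search engine query over indicator fields amounts to.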
  • 64. © 2014 MapR Technologies 74 Tutorial Part 2: How to make recommendation better
  • 65. © 2014 MapR Technologies 75 Going Further: Multi-Modal Recommendation
  • 66. © 2014 MapR Technologies 76 Going Further: Multi-Modal Recommendation
  • 67. © 2014 MapR Technologies 77 For example • Users enter queries ($A$) – (actor = user, item = query) • Users view videos ($B$) – (actor = user, item = video) • $A^TA$ gives query recommendation – “did you mean to ask for” • $B^TB$ gives video recommendation – “you might like these videos”
  • 68. © 2014 MapR Technologies 78 The punch-line • $B^TA$ recommends videos in response to a query – (isn’t that a search engine?) – (not quite, it doesn’t look at content or meta-data)
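A small numeric sketch of the cross-recommendation: $A$ holds which users entered which queries, $B$ holds which users viewed which videos, and $(B^TA)h$ scores videos for a query vector $h$. All of the data below is made up for illustration, and raw counts stand in for the LLR-filtered indicators a real system would use.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]

# A: users x queries (hypothetical): users 0 and 1 entered query 0, user 2 entered query 1
A = [
    [1, 0],
    [1, 0],
    [0, 1],
]
# B: users x videos (hypothetical viewing history)
B = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]

# Cross-occurrence matrix: videos x queries, computable offline
cross = matmul(transpose(B), A)

h = [1, 0]  # the current user has just entered query 0
video_scores = matvec(cross, h)
print(cross, video_scores)
```

No video title or description is consulted anywhere: the scores come only from what people who entered the query went on to watch.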
  • 69. © 2014 MapR Technologies 79 Real-life example • Query: “Paco de Lucia” • Conventional meta-data search results: – “hombres de paco” times 400 – not much else • Recommendation based search: – Flamenco guitar and dancers – Spanish and classical guitar – Van Halen doing a classical/flamenco riff
  • 70. © 2014 MapR Technologies 80 Real-life example
  • 71. © 2014 MapR Technologies 81 Hypothetical Example • Want a navigational ontology? • Just put labels on a web page with traffic – This gives $A$ = users x label clicks • Remember viewing history – This gives $B$ = users x items • Cross recommend – $B^TA$ = label to item mapping • After several users click, results are whatever users think they should be
  • 72. © 2014 MapR Technologies 82 Nice. But we can do better?
  • 73. © 2014 MapR Technologies 84 Symmetry Gives Cross Recommendations • Conventional recommendations with off-line learning: $(A^TA)h$ • Cross recommendations: $(B^TA)h$
  • 74. © 2014 MapR Technologies 85 $A$: users × things
  • 75. © 2014 MapR Technologies 86 $\begin{bmatrix} A_1 & A_2 \end{bmatrix}$: users × (thing type 1, thing type 2)
  • 76. © 2014 MapR Technologies 87 $\begin{bmatrix} A_1 & A_2 \end{bmatrix}^T \begin{bmatrix} A_1 & A_2 \end{bmatrix} = \begin{bmatrix} A_1^T \\ A_2^T \end{bmatrix} \begin{bmatrix} A_1 & A_2 \end{bmatrix} = \begin{bmatrix} A_1^TA_1 & A_1^TA_2 \\ A_2^TA_1 & A_2^TA_2 \end{bmatrix}$; $\begin{bmatrix} r_1 \\ r_2 \end{bmatrix} = \begin{bmatrix} A_1^TA_1 & A_1^TA_2 \\ A_2^TA_1 & A_2^TA_2 \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \end{bmatrix}$; $r_1 = \begin{bmatrix} A_1^TA_1 & A_1^TA_2 \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \end{bmatrix}$
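The block identity on this slide can be checked numerically: recommendations for thing type 1 combine the ordinary term $A_1^TA_1h_1$ with the cross term $A_1^TA_2h_2$. The matrices below are hypothetical, and the helpers are illustrative.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical histories for three users and two kinds of things
A1 = [[1, 0], [1, 1], [0, 1]]  # users x things of type 1
A2 = [[1, 0], [0, 1], [0, 1]]  # users x things of type 2
h1 = [1, 0]                    # current user's history, type 1
h2 = [0, 1]                    # current user's history, type 2

# Full computation on the concatenated matrix [A1 A2] and stacked vector [h1; h2]
A = [row1 + row2 for row1, row2 in zip(A1, A2)]
h = h1 + h2
r = matvec(transpose(A), matvec(A, h))
r1_from_full = r[:len(h1)]

# Block form: r1 = A1^T A1 h1 + A1^T A2 h2
r1_blocks = vec_add(
    matvec(transpose(A1), matvec(A1, h1)),
    matvec(transpose(A1), matvec(A2, h2)),
)
print(r1_from_full, r1_blocks)
```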
  • 77. © 2014 MapR Technologies 88 Bonus Round: When worse is better
  • 78. © 2014 MapR Technologies 89 The Real Issues After First Production • Exploration • Diversity • Speed • Not the last fraction of a percent
  • 79. © 2014 MapR Technologies 90 Result Dithering • Dithering is used to re-order recommendation results – Re-ordering is done randomly • Dithering is guaranteed to make off-line performance worse • Dithering also has a near perfect record of making actual performance much better
  • 80. © 2014 MapR Technologies 91 Result Dithering • Dithering is used to re-order recommendation results – Re-ordering is done randomly • Dithering is guaranteed to make off-line performance worse • Dithering also has a near perfect record of making actual performance much better “Made more difference than any other change”
  • 81. © 2014 MapR Technologies 92 Why Dithering Works Real-time recommender Overnight training Log Files
  • 82. © 2014 MapR Technologies 93 Why Use Dithering?
  • 83. © 2014 MapR Technologies 94 Simple Dithering Algorithm • Synthetic score from log rank plus Gaussian: $s = \log r + N(0, \log \epsilon)$ • Pick noise scale to provide desired level of mixing: $\Delta r / r \propto \epsilon$ • Typically $\epsilon \in [1.5, 3]$ • Also… use floor(t/T) as seed
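The dithering recipe can be sketched in a few lines. Two assumptions are made here: the second argument of $N(0, \log \epsilon)$ is read as the standard deviation of the noise, and the seed is exactly `floor(t / T)`, so the ordering stays fixed within each time window of length `T`. The function and parameter names are illustrative; a production system would probably also mix a user or session ID into the seed.

```python
import math
import random

def dither(results, epsilon=2.0, t=0.0, period=600.0):
    """Re-order ranked results by the synthetic score log(rank) + Gaussian noise.

    results -- items in their original ranked order (best first)
    epsilon -- noise scale; epsilon = 1 gives zero noise and leaves the order unchanged
    t       -- current time in seconds
    period  -- window length T; the ordering is stable within each window
    """
    rng = random.Random(math.floor(t / period))  # floor(t/T) as seed
    sd = math.log(epsilon)                       # assumed: log(epsilon) is the standard deviation
    scored = [
        (math.log(rank) + rng.gauss(0.0, sd), item)
        for rank, item in enumerate(results, start=1)
    ]
    scored.sort(key=lambda pair: pair[0])        # lowest synthetic score first
    return [item for _, item in scored]

items = list("abcdefgh")
print(dither(items, epsilon=2.0, t=0.0))
print(dither(items, epsilon=2.0, t=700.0))  # a later time window, so a different seed
```

Because the noise is added to the log of the rank, the top few results move only a little while results far down the list can occasionally jump onto the first page, which is what produces exploration.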
  • 84. © 2014 MapR Technologies 95 Example … ε = 2. Each row is one dithered ordering, shown as the original ranks of the first 8 results: 1 2 8 3 9 15 7 6; 1 8 14 15 3 2 22 10; 1 3 8 2 10 5 7 4; 1 2 10 7 3 8 6 14; 1 5 33 15 2 9 11 29; 1 2 7 3 5 4 19 6; 1 3 5 23 9 7 4 2; 2 4 11 8 3 1 44 9; 2 3 1 4 6 7 8 33; 3 4 1 2 10 11 15 14; 11 1 2 4 5 7 3 14; 1 8 7 3 22 11 2 33
  • 85. © 2014 MapR Technologies 96 Lesson: Exploration is good
  • 86. © 2014 MapR Technologies 97 Thank you Ted Dunning, Chief Applications Architect, MapR Technologies Email tdunning@mapr.com tdunning@apache.org Twitter @Ted_Dunning Apache Mahout https://mahout.apache.org/ Twitter @ApacheMahout 22 June 2014 Big Data Everywhere Conference #DataIsrael

Editor's Notes

  1. Timing: Assume 5 min to here
  2. Mention that the Pony book said “RowSimilarityJob”…
  3. Talk track: Apache Mahout is an open-source project with international contributors and a vibrant community of users and developers. A new version – 0.8 – was recently released. Mahout is a library of scalable algorithms used for clustering, classification and recommendation. Mahout also includes a math library that is low level, flexible, scalable and makes certain functions very easy to carry out. Talk track: First let’s make a quick comparison of the three main areas of Mahout machine learning…
  4. Old joke: all the world can be divided into 2 categories: Scotch tape and non-Scotch tape… This is a way to think about the co-occurrence
  5. Talk track: Solr is a small data tool that has flourished in a big data world
  6. *For the hands-on lab in this course, we will use a free part of LucidWorks that comes as part of the MapR distribution
  7. Needed???
  8. Optional: don’t spend time on this
  9. Note to speaker: fast on this slide as overview and reference only; additional slides and labs will explain this
  10. TED: Do you need to talk about term locations? I left off SEE NEXT SLIDE FOR MORE ON FACETTING
  11. FOR HOW TO SET THEM UP, is it done from dashboard (see next slide) or as a command?
  12. Nice to show but don’t have to … they can just find it in the lab. I would show this slide very briefly and move on
  13. Skip this?
  14. Note to speaker: point out that using Solr is a state-of-the-art approach that simplifies deploying recommender
  15. Talk track: We built a real music recommender on MapR and deployed it to a website for a mock company, Music Machine. Everything worked except you didn’t really hear music play…
  16. Talk track: Here are documents for two different artists with indicator IDs that are part of the recommendation model. When recommendations are needed, the web-site uses recent visitor behavior to query against the indicators in these documents.
  17. Notes to trainer: Drawing this as a grid is a lot of work, so represent it with math. A is the history matrix and h is the vector of items for one (new, current) user. Ah finds the users who do the same things as in h, and A transpose times Ah gives the things those users do. These operations have the shape of matrix multiplications and many of the same properties (sometimes with weights, etc.). Had they been exactly matrix multiplications, we could simply move the parentheses. Our recommender does the item-centric version. General relationships in the data don’t change fast (what is related to what; nothing happens overnight to change how Mozart is related to Haydn). What does change fast is what the user did in the last five minutes. In the first case, we have to compute Ah first. The input to that computation (h) is only available now, in real time, so nothing can be computed ahead of time. The second case (A transpose A) only involves things that change slowly, so it can be pre-computed offline. This is significant because it moves a lot of computation for all users into an overnight process, so each real-time recommendation involves only a small part: one big matrix multiply in real time. Result: you get a fast response for the recommendations. The second form runs on one machine for one user (the real-time part).
  18. Same notes as 17.
  19. Same notes as 17.
  20. Problem starts here…
  21. Note to trainer: take a little time to explore this here and on the next couple of slides. Details enlarged on next slide
  22. Note to trainer: take a little time to explore this here and on the next couple of slides. Details enlarged on next slide
  23. This is a diagnostics window in the LucidWorks Solr index (not the web interface a user would see). It’s a way for the developer to do a rough evaluation (laugh test) of the choices offered by the recommendation engine. In other words, do these indicator artists represented by their indicator Id make reasonable recommendations Note to trainer: artist 303 happens to be The Beatles. Is that a good match for Chuck Berry?