Part 1: Non-Relational Databases
Part 2: Collaborative Filtering
          Simon Woodman
       [s.j.woodman@ncl.ac.uk]
Outline
•   Part 1: Non-Relational Databases (NoSQL)
     – Trends forcing change
     – NoSQL database types
     – Graph Databases (Neo4J)
     – Demo



•   Part 2: Making Recommendations
     – Background/example
     – Pearson Score
     – User based
     – Item based
Credit: http://ecogreenliving.net/
Trend 1: Data Size
[Chart: Digital information created, captured and replicated worldwide, in exabytes (y-axis 0 to 3000), by year from 2006 to 2012. Source: IDC 2009]
Trend 2: Connectedness
[Figure: Information connectivity over time (x-axis 1990 to 2020, spanning web 1.0, web 2.0 and “web 3.0”). In order of increasing connectivity: text documents, hypertext, RSS, blogs, wikis, user-generated content, tagging, folksonomies, RDF, ontologies, Giant Global Graph (GGG).]
Source: http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome
Trend 3: semi-structure
• “The great majority of the data out there is not structured and [there’s]
   no way in the world you can force people to structure it.” [1]


• Trend accelerated by the decentralization of content generation that is
   the hallmark of the age of participation (“web 2.0”)


• Evolving applications

    [1] Stefano Mazzocchi, Apache and MIT
Types of Databases

• Relational

• Key-Value Stores

• BigTable Clones

• Document Databases

• Graph Databases
Relational Databases
• Data Model: Normalised, multi-table with referential integrity
• Good for very static data
   – Payroll, accounts
   – Well understood
   – Not evolving
• SQL Queries (joins etc.)
• Good Tooling


• Examples: Oracle, MySQL, Postgres, …
Key-Value Stores
•       Data Model: (global) collection of K-V pairs
•       Massive Distributed HashMap
•       Partitioning and Replication usually ring based
           –      Load Balancer round robins the requests
           –      Hash(key) = partition
           –      Partition map maintains partition -> node mapping
           –      Quorum System (N, R, W), usually (3,2,2)


•       Scales Well (1000B rows)
•       How many apps need that?
           –      Google, Amazon, Facebook etc.
           –      <10 in the world

•       Examples: Dynomite, Voldemort, Tokyo

[http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf]
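The partitioning and quorum bullets above can be sketched in a few lines. This is only an illustration under stated assumptions, not how Dynamo or any of the listed stores is implemented: the ring size, the node names and the helper functions are hypothetical, and only the (N, R, W) = (3, 2, 2) values come from the slide.

```python
import hashlib

NUM_PARTITIONS = 8                                  # hypothetical ring size
NODES = ["node-a", "node-b", "node-c", "node-d"]    # hypothetical node names
N, R, W = 3, 2, 2   # replicas, read quorum, write quorum (values from the slide)


def partition_for(key):
    """Hash(key) = partition: map a key onto a fixed ring of partitions."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


def replicas_for(partition):
    """Partition map: pick the N nodes that follow the partition round the ring."""
    return [NODES[(partition + i) % len(NODES)] for i in range(N)]


def quorum_met(acks, needed):
    """A read (needed=R) or write (needed=W) succeeds once enough replicas reply."""
    return acks >= needed


partition = partition_for("user:42")
print(partition, replicas_for(partition))
print(quorum_met(2, W))   # two acknowledgements satisfy W = 2
```

Choosing R + W > N (here 2 + 2 > 3) means every read quorum overlaps every write quorum in at least one replica.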
BigTable Clones
•        Data model: single table, column families
•        Distributed storage of semi-structured data (column families)
•        Scale: “Petabyte range”
•        Supports MapReduce well




•        Examples: HBase, Hypertable


[http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf]
Document Databases
•   Inspired by Lotus Notes
•   Data model: collections of K-V collections
•   Document:
      –   Collection of K-V pairs (often JSON)
      –   Often versioned

•   Scales: Dependent on implementation


•   Can (potentially) store entire 3 tier web app in the database
    (probably NOT the best architecture!)




•   Example: CouchDB, MongoDB
Graph Databases
•   Inspired by Euler & graph theory
•   Data model: nodes, relationships, K-V on both
•   Scale: 10B entities
•   SPARQL Queries


•   No O/R Impedance mismatch
•   Semi Structured & Evolving Schema




•   Example: AllegroGraph, VertexDB, Neo4j
Social Network Problem


• System stores people and friends

• Find all “friends of friends”
RDBMS Solution
•   SQL: single join to get
    friends


•   SELECT p.name, p2.name
     FROM people AS p, people AS p2,
     friends AS f
     WHERE p.id = 1 AND p.id = f.id1 AND p2.id = f.id2;



•   SQL: 2-3 joins or subqueries to get “friends of friends”


•   i.e. Not trivial and doesn’t scale
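To make the "2-3 joins" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the same people and friends tables as the query above, treats friendship as directed (id1 to id2) as that query does, and uses made-up sample rows; "Anna" is a hypothetical extra person.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friends (id1 INTEGER, id2 INTEGER);
    INSERT INTO people VALUES (1, 'Simon'), (2, 'Chris'), (3, 'Paul'), (4, 'Anna');
    INSERT INTO friends VALUES (1, 2), (2, 3), (2, 4);
""")

# One self-join on friends per extra hop, plus a join to resolve names.
FRIENDS_OF_FRIENDS = """
    SELECT DISTINCT p3.name
    FROM friends AS f1
    JOIN friends AS f2 ON f2.id1 = f1.id2
    JOIN people  AS p3 ON p3.id  = f2.id2
    WHERE f1.id1 = ? AND f2.id2 <> f1.id1
    ORDER BY p3.name
"""


def friends_of_friends(person_id):
    """Return the names reachable in exactly two friendship hops."""
    return [row[0] for row in conn.execute(FRIENDS_OF_FRIENDS, (person_id,))]


print(friends_of_friends(1))
```

Each additional level of "friends of ..." needs another self-join on the friends table, which is where the query gets awkward.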
Graph DB Solution
• Graph Traversal



• pathExists(a,b)

  limit depth 2
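A depth-limited traversal like the one above can be sketched as a breadth-first search. This is plain Python over a hypothetical in-memory adjacency list, not the Neo4j traversal API; the function name path_exists and the sample graph are assumptions for illustration.

```python
from collections import deque

# Hypothetical adjacency list: person -> people they know.
GRAPH = {
    "Simon": ["Chris"],
    "Chris": ["Paul", "Anna"],
    "Paul": [],
    "Anna": ["Zoe"],
    "Zoe": [],
}


def path_exists(graph, start, goal, max_depth=2):
    """Breadth-first search that never expands nodes beyond max_depth hops."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, depth = queue.popleft()
        if node == goal:
            return True
        if depth == max_depth:
            continue  # depth limit reached, do not follow further relationships
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return False


print(path_exists(GRAPH, "Simon", "Paul"))   # friend of a friend
print(path_exists(GRAPH, "Simon", "Zoe"))    # three hops away
```

The work done depends on the neighbourhood visited rather than on the total number of people stored.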
Neo4J Model
• Nodes
• Relationships (edges)
• Properties on Both

[Diagram: three nodes numbered 1, 2 and 3. One node has name = “Simon”, job = “RA”; another has name = “Chris”. A relationship carries type = “KNOWS”, age = 4 years.]
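The node, relationship and property model can be sketched with two small data classes. This is an illustration only and not the Neo4j API; the class names are hypothetical, and the assignment of "Simon" to node 1 and "Chris" to node 3, with the KNOWS relationship between them, is an assumption about the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)   # key-value pairs on the node


@dataclass
class Relationship:
    start: Node
    end: Node
    rel_type: str
    properties: dict = field(default_factory=dict)   # key-value pairs on the edge


simon = Node(1, {"name": "Simon", "job": "RA"})
chris = Node(3, {"name": "Chris"})
knows = Relationship(simon, chris, "KNOWS", {"age_years": 4})

print(knows.start.properties["name"], knows.rel_type, knows.end.properties["name"])
```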
Live Demo!
Neo4J Model
• Transactions
• Reference Node
• Indexes (Apache Lucene)
• Visualisation
  – Neoclipse
  – The JIT
Neoclipse
Pros and Cons
• “Whiteboard friendly” – fits domain models better
• Scales up “enough”
• Evolve Schema
• Can represent semi-structured data
• Good Performance for graph/network traversals


• Lacks tool support
• Harder to write ad-hoc queries (SPARQL vs. SQL)
Important Reminders
• Other options exist apart from the Relational
  Database




• Fit the technology to the domain model, not the
  domain model to the technology
Questions?

• http://neo4j.org/



• Some material from

[http://nosql.mypopescu.com/post/342947902/
  presentation-graphs-neo4j-teh-awesome]
Part 2: Collaborative Filtering



• Calculating Similarities

• User based filtering

• Item based filtering
Why?
•   Sell more items
•   Increase market share
•   Better targeted advertising


•   Up sell rather than new-sell


•   Make more £££


•   Not perfect
     – Bad recommendations
     – Inappropriate recommendations
It can go wrong
It will go wrong
Preference Data
Movie Ratings    Online Shopping    Site Recommender
5                Bought       1     Like          1
4                Didn’t Buy   0     No vote       0
3                                   Didn’t Like  -1
2
1
Recommending Items

• Step 1: Calculate similarities
  – either user-user or item-item

• Step 2: Predict scores for “unseen” items

• Step 3: Normalise and order
Example Data: Movie Reviews

          Shawshank    The      Lock     Love
          Redemption   Ghost    Stock    Actually   Titanic   Seven
  Simon   5            4        4        1
  Chris   1            3        4        5          4
  Paul    4            5                            2         4
Calculating Similarity
• Method 1: Euclidean Distance Score
• Compare Common Rankings
• n-dimensional preference space
• Score 0 – 1
• 1 = Identical
• 0 = Highly dissimilar
Calculating Euclidean Distance Score


• Done for each pair of people


• Difference in each axis
• Square
• Add them together
• Add 1 (avoids divide by zero)
• Square Root
• Invert
Chris and Simon


•   Difference in each axis
     –   (5-1), (4-3) = 4, 1

•   Square
     –   16, 1

•   Add them together
     –   17

•   Add 1 (avoids divide by zero)
     –   = 18

•   Square Root
     –   = 4.24264069

•   Invert
     –   = 0.23570226
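The steps above translate directly into a short function. This sketch follows the order used on the slide (sum the squared differences, add 1, take the square root, invert) and uses only the two ratings that appear in the worked example; the function and variable names are illustrative.

```python
from math import sqrt


def euclidean_score(prefs_a, prefs_b):
    """Similarity between 0 and 1 based on the items both people have rated."""
    common = [item for item in prefs_a if item in prefs_b]
    if not common:
        return 0.0  # nothing in common, treat as highly dissimilar
    sum_of_squares = sum((prefs_a[item] - prefs_b[item]) ** 2 for item in common)
    # Adding 1 before the square root avoids dividing by zero for identical ratings.
    return 1.0 / sqrt(1.0 + sum_of_squares)


simon = {"Shawshank Redemption": 5, "The Ghost": 4}
chris = {"Shawshank Redemption": 1, "The Ghost": 3}

print(round(euclidean_score(simon, chris), 8))
```

Two people with identical ratings score exactly 1, and the score falls towards 0 as the ratings diverge.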
Euclidean Distance Score


• Easy to calculate

• Bad for people who are similar but
  consistently rate higher/lower
Pearson Correlation Coefficient

• More Complicated
• Line of Best Fit between commonly rated items
• Deals with grade inflation




• Other measures
   – Jaccard Coefficient
   – Manhattan Distance
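A minimal sketch of the Pearson score over commonly rated items is shown below, written from the standard definition of the correlation coefficient. The function name and the sample raters are hypothetical; the example shows the "grade inflation" case, where one person rates everything two points higher.

```python
from math import sqrt


def pearson_score(prefs_a, prefs_b):
    """Pearson correlation (-1 to 1) over the items both people have rated."""
    common = [item for item in prefs_a if item in prefs_b]
    n = len(common)
    if n == 0:
        return 0.0
    xs = [prefs_a[item] for item in common]
    ys = [prefs_b[item] for item in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    denominator = sqrt(spread_x * spread_y)
    if denominator == 0:
        return 0.0  # one person gave every common item the same rating
    return covariance / denominator


harsh = {"a": 1, "b": 2, "c": 3}
generous = {"a": 3, "b": 4, "c": 5}   # same taste, consistently two points higher

print(pearson_score(harsh, generous))
```

Because each person's ratings are measured relative to their own mean, a constant offset between two raters does not lower the score.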
User based Filtering
• Look at what similar people have liked but you
  haven’t seen
  – What if a similar person likes something that has bad
    reviews from everyone else?


• Weighted Score that ranks the other people and
  takes into account similarity
Recommending Items

                Similarity (ED)   Titanic   Sim x Titanic   Seven   Sim x Seven


    Chris            0.23           4           0.92
    Paul             0.78           2           1.56         4         3.12


    Total                                       2.48                   3.12
  Sim Sum                                       1.01                   0.78
Total/Sim Sum                               2.455445545                 4
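The table can be reproduced with a short weighted-average function. This is a sketch under the assumption that the similarities have already been computed; it takes the 0.23 and 0.78 values from the table as given, and the function name is illustrative.

```python
def predict_scores(similarities, ratings):
    """Similarity-weighted average rating for each item the other people rated."""
    totals, sim_sums = {}, {}
    for person, sim in similarities.items():
        if sim <= 0:
            continue  # ignore people with no positive similarity
        for item, rating in ratings.get(person, {}).items():
            totals[item] = totals.get(item, 0.0) + sim * rating   # "Sim x item"
            sim_sums[item] = sim_sums.get(item, 0.0) + sim        # "Sim Sum"
    return {item: totals[item] / sim_sums[item] for item in totals}


similarities = {"Chris": 0.23, "Paul": 0.78}
unseen_ratings = {"Chris": {"Titanic": 4}, "Paul": {"Titanic": 2, "Seven": 4}}

scores = predict_scores(similarities, unseen_ratings)
ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Dividing by the sum of similarities, rather than by the number of raters, stops an item from scoring higher simply because more people have rated it.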
User Based Filtering - Conclusions

• Calculate Similarity between users
• Recommend based on similar users


• Similarity
   – Euclidean Distance Score
   – Pearson Coefficient – better for non-normalised data


• Problem – need to compare every user/item to every other
  user/item
Item Based Filtering
• Pre-compute most similar items for each item
  – Item similarities change less often than user
    similarities and can be re-used



• Create a weighted list of items most similar to
  user’s top rated items
Recommending Items

                    Rating          Titanic (ED) Rat x Titanic   Seven (ED)   Rat x Seven
Shawshank              5               0.084         0.42          0.366         1.83
 The Ghost             4               0.125          0.5          0.487         1.948
 Lock Stock            4               0.091         0.364         0.318         1.272
Love Actually          1               0.737         0.737         0.184         0.184



    Total                              1.037         2.021         1.355         5.234
 Normalised (Rating / Similarity)                    1.948                    3.862730627
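The item-based table follows the same weighted-average pattern, with the user's own ratings weighted by item-to-item similarity. The sketch below assumes the item similarities were pre-computed and simply reuses the values from the table; the function and variable names are illustrative.

```python
def item_based_scores(user_ratings, item_similarities):
    """Score each unseen item from the user's ratings of items similar to it."""
    scores = {}
    for candidate, sims in item_similarities.items():
        rated = {item: sim for item, sim in sims.items() if item in user_ratings}
        weighted = sum(user_ratings[item] * sim for item, sim in rated.items())
        total_sim = sum(rated.values())
        if total_sim > 0:
            scores[candidate] = weighted / total_sim   # total (Rat x Sim) / total Sim
    return scores


user_ratings = {"Shawshank": 5, "The Ghost": 4, "Lock Stock": 4, "Love Actually": 1}
item_similarities = {
    "Titanic": {"Shawshank": 0.084, "The Ghost": 0.125,
                "Lock Stock": 0.091, "Love Actually": 0.737},
    "Seven": {"Shawshank": 0.366, "The Ghost": 0.487,
              "Lock Stock": 0.318, "Love Actually": 0.184},
}

scores = item_based_scores(user_ratings, item_similarities)
print(sorted(scores.items(), key=lambda pair: pair[1], reverse=True))
```

With these numbers Seven scores about 3.86 and Titanic about 1.95, so Seven is recommended first.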
Item Based Filtering - Conclusions

• Calculate Similarity between items
• Recommend based on user’s ratings for items


• Similarity (as before)
   – Euclidean Distance Score
   – Pearson Coefficient – better for non-normalised data



• Problem – need to maintain item similarity data set
Item vs. User Based Filtering
• Item based scales better
   – Need to maintain the similarities data set

• User based simpler to implement
• May (or may not) want to show users who is similar in
  terms of habits
• Perform equally on dense data sets
• Item based performs better on sparse data sets
Questions?
• Reference: Programming Collective Intelligence,
  Toby Segaran, O’Reilly 2007




• s.j.woodman@ncl.ac.uk

CSC 8101 Non Relational Databases


Editor's Notes

  1. Information overload. We are creating too much data to be able to store it: digital cameras, video cameras, CCTV, VoIP, sensors, medical imaging.
  2. Information overload.
  3. Over time data has evolved to be more interlinked and connected. Hypertext has links. Blogs have pingbacks. Tagging groups related data. Ontologies formalise it more. In the GGG the relationships contain the information rather than the data items, e.g. friends on FB: the data was there before but the relationships are the important part.
  4. Applications in the 70s and 80s were simple and rigid. That doesn’t work now with the interconnected world. Semi-structured data is bad for RDBMS. FB, Twitter, etc. have had to build their own databases.
  5. Used internally at Amazon in services like S3 and EC2. Quorum (N, R, W): N = number of replicas that will be written to; W = number of responses to wait for before a write succeeds; R = number of responses that must agree before a read is returned. This means that up to N − R nodes (for reads) or N − W nodes (for writes) can go down and the system still functions.
  6. Meeting Notes (29/11/2011 10:32): Part 1: 40 mins