CSC 8101 Non Relational Databases

Part 1: Non Relational Databases
Part 2: Collaborative Filtering
Simon Woodman
[s.j.woodman@ncl.ac.uk]

Outline
• Part 1: Non-Relational Databases (NoSQL)
– Trends forcing change
– NoSQL database types
– Graph Databases (Neo4J)
– Demo

• Part 2: Making Recommendations
– Background/example
– Pearson Score
– User based
– Item based

Credit: http://ecogreenliving.net/

Trend 1: Data Size
[Chart: digital information created, captured and replicated worldwide, in exabytes (y-axis 0 to 3000), 2006 to 2012]
Source: IDC 2009

Trend 2: Connectedness
[Figure: information connectivity against time (1990 to 2020; web 1.0, web 2.0, “web 3.0”), rising through text documents, hypertext, RSS, blogs, wikis, user-generated content, tagging, folksonomies, RDF and ontologies to the Giant Global Graph (GGG)]

Source: http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome

Trend 3: semi-structure
• “The great majority of the data out there is not structured and [there’s]
no way in the world you can force people to structure it.” [1]

• Trend accelerated by the decentralization of content generation that is
the hallmark of the age of participation (“web 2.0”)

• Evolving applications

[1] Stefano Mazzocchi, Apache and MIT

Types of Databases

• Relational

• Key-Value Stores

• BigTable Clones

• Document Databases

• Graph Databases

Relational Databases
• Data Model: Normalised, multi-table with referential integrity
• Good for very static data
– Payroll, accounts
– Well understood
– Not evolving
• SQL Queries (joins etc.)
• Good Tooling

• Examples: Oracle, MySQL, Postgres, …

Key-Value Stores
• Data Model: (global) collection of K-V pairs
• Massive Distributed HashMap
• Partitioning and Replication usually ring based
– Load Balancer round robins the requests
– Hash(key) = partition
– Partition map maintains partition -> node mapping
– Quorum System (N, R, W), usually (3,2,2)

• Scales Well (1000B rows)
• How many apps need that?
– Google, Amazon, Facebook etc.
– <10 in the world

• Examples: Dynomite, Voldemort, Tokyo

[http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf]
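
The partitioning and quorum bullets above can be sketched in a few lines of Python. This is a minimal illustration, not how Dynamo or any listed store is implemented: the ring size, the node names and the helper functions are all hypothetical, and only the (N, R, W) = (3, 2, 2) setting comes from the slide.

```python
import hashlib

NUM_PARTITIONS = 8                                  # hypothetical ring size
NODES = ["node-a", "node-b", "node-c", "node-d"]    # hypothetical node names
N, R, W = 3, 2, 2    # replicas, read quorum, write quorum (values from the slide)


def partition_for(key):
    """Hash(key) = partition: map a key onto one of the ring's partitions."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


def replicas_for(partition):
    """Partition map: the N consecutive nodes on the ring that hold a partition."""
    return [NODES[(partition + i) % len(NODES)] for i in range(N)]


def quorum_overlaps(n, r, w):
    """If R + W > N, every read quorum shares at least one node with every write quorum."""
    return r + w > n


partition = partition_for("user:42")
print(partition, replicas_for(partition), quorum_overlaps(N, R, W))
```

With (3, 2, 2) a read always touches at least one replica that saw the latest write, whereas a setting such as (3, 1, 1) would not guarantee that.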

BigTable Clones
• Data model: single table, column families
• Distributed storage of semi-structured data (column families)
• Scale: “Petabyte range”
• Supports MapReduce well

• Example: HBase, Hypertable

[http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf]

Document Databases
• Inspired by Lotus Notes
• Data model: collections of K-V collections
• Document:
– Collection of K-V pairs (often JSON)
– Often versioned

• Scales: Dependent on implementation

• Can (potentially) store entire 3 tier web app
in the database (probably NOT the best
architecture!)

• Example: CouchDB, MongoDB

Graph Databases
• Inspired by Euler & graph theory
• Data model: nodes, relationships, K-V on both
• Scale: 10B entities
• SPARQL Queries

• No O/R Impedance mismatch
• Semi Structured & Evolving Schema

• Example: AllegroGraph, VertexDB, Neo4j

Social Network Problem

• System stores people and friends

• Find all “friends of friends”

RDBMS Solution
• SQL: single join to get friends

• SELECT p.name, p2.name
  FROM people AS p, people AS p2, friends AS f
  WHERE p.id = 1 AND p.id = f.id1 AND p2.id = f.id2;

• SQL: 2-3 joins or subqueries to get “friends of friends”

• i.e. Not trivial and doesn’t scale
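
As a rough sketch of what the “friends of friends” query might look like, the example below runs one possible formulation against an in-memory SQLite database from Python. The people and friends tables follow the slide's query; the sample rows, the helper name and the choice to treat friendships as directed (id1 to id2) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friends (id1 INTEGER, id2 INTEGER);
    -- Hypothetical sample data: Simon -> Chris -> Paul
    INSERT INTO people VALUES (1, 'Simon'), (2, 'Chris'), (3, 'Paul');
    INSERT INTO friends VALUES (1, 2), (2, 3);
""")

# The friends table is joined to itself once per hop, then to people for the names.
FRIENDS_OF_FRIENDS = """
    SELECT DISTINCT p3.name
    FROM friends AS f1
    JOIN friends AS f2 ON f2.id1 = f1.id2
    JOIN people  AS p3 ON p3.id  = f2.id2
    WHERE f1.id1 = ? AND f2.id2 <> f1.id1
"""


def friends_of_friends(person_id):
    """Names reachable in exactly two hops, excluding the person themselves."""
    rows = conn.execute(FRIENDS_OF_FRIENDS, (person_id,)).fetchall()
    return sorted(name for (name,) in rows)


print(friends_of_friends(1))
```

Each extra hop adds another self-join on friends, which is where the query becomes harder to write and more expensive to run.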

Graph DB Solution
• Graph Traversal

• pathExists(a,b), limit depth 2
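
The depth-limited traversal can be illustrated without a graph database at all. The sketch below is plain Python over a hypothetical adjacency list; it is not the Neo4j API, and the function name path_exists and the sample graph are assumptions chosen to mirror the slide's pathExists(a,b) with a depth limit of 2.

```python
from collections import deque

# Hypothetical friendship graph stored as an adjacency list.
GRAPH = {
    "Simon": ["Chris"],
    "Chris": ["Simon", "Paul"],
    "Paul": ["Chris", "Anna"],
    "Anna": ["Paul"],
}


def path_exists(graph, start, goal, max_depth=2):
    """Breadth-first search that stops expanding nodes at max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == goal:
            return True
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return False


print(path_exists(GRAPH, "Simon", "Paul"))  # Paul is two hops away
print(path_exists(GRAPH, "Simon", "Anna"))  # Anna is three hops away
```

The traversal only visits nodes near the start node, so its cost depends on the size of the local neighbourhood rather than on the total number of people stored.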

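Neo4J Model
• Nodes
• Relationships (edges)
• Properties on Both

[Diagram: three nodes numbered 1, 2 and 3, with node properties name = “Simon”, job = “RA” and name = “Chris”, and a relationship with properties type = “KNOWS”, age = 4 years]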

Neo4J Model
• Transactions
• Reference Node
• Indexes (Apache Lucene)
• Visualisation
– Neoclipse
– The JIT

Pros and Cons
• “Whiteboard friendly” – fits domain models better
• Scales up “enough”
• Evolve Schema
• Can represent semi-structured data
• Good Performance for graph/network traversals

• Lacks tool support
• Harder to write ad-hoc queries (SPARQL vs. SQL)

Important Reminders
• Other options exist apart from the Relational
Database

• Fit the technology to the domain model, not the
domain model to the technology

Questions?

• http://neo4j.org/

• Some material from

[http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome]

Part 2: Collaborative Filtering

• Calculating Similarities

• User based filtering

• Item based filtering

Why?
• Sell more items
• Increase market share
• Better targeted advertising

• Up sell rather than new-sell

• Make more £££

• Not perfect
– Bad recommendations
– Inappropriate recommendations

Preference Data
Movie Ratings    Online Shopping    Site Recommender
5                Bought = 1         Like = 1
4                Didn’t Buy = 0     No vote = 0
3                                   Didn’t Like = -1
2
1

Recommending Items

• Step 1: Calculate similarities
– either user-user or item-item

• Step 2: Predict scores for “unseen” items

• Step 3: Normalise and order

Example Data: Movie Reviews

Films rated: Shawshank Redemption, The Ghost, Lock Stock, Love Actually, Titanic, Seven
Simon 5 4 4 1

Chris 1 3 4 5 4

Paul 4 5 2 4

Calculating Similarity
• Method 1: Euclidean Distance Score
• Compare Common Rankings
• n-dimensional preference space
• Score 0 – 1
• 1 = Identical
• 0 = Highly dissimilar

Calculating Euclidean Distance Score

• Done for each pair of people

• Difference in each axis
• Square
• Add them together
• Add 1 (avoids divide by zero)
• Square Root
• Invert
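
Read as a single expression, the steps above appear to amount to the following formula, where the symbols are introduced here for illustration: a_i and b_i are the two people's ratings of item i, and C is the set of items both have rated.

```latex
\text{score}(a, b) = \frac{1}{\sqrt{1 + \sum_{i \in C} (a_i - b_i)^2}}
```

Identical ratings give a sum of zero and a score of 1, and the score falls towards 0 as the ratings diverge.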

Chris and Simon

• Difference in each axis
– (5-1), (4-3) = 4, 1

• Square
– 16, 1

• Add them together
– 17

• Add 1 (avoids divide by zero)
– = 18

• Square Root
– = 4.24264069

• Invert
– = 0.23570226
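
The worked example can be checked with a short Python sketch. The function name, the dictionary layout and the keys film_a and film_b are assumptions for illustration (the slide gives only the two pairs of ratings), as is the choice to return 0 when two people share no rated items.

```python
from math import sqrt


def euclidean_score(a, b):
    """Similarity from 0 to 1 over the items both people have rated."""
    common = [item for item in a if item in b]
    if not common:
        return 0.0  # assumption: no shared items means no evidence of similarity
    sum_of_squares = sum((a[item] - b[item]) ** 2 for item in common)
    # Add 1 before the square root so identical ratings do not divide by zero.
    return 1.0 / sqrt(1.0 + sum_of_squares)


# The two axes used on the slide: (5 - 1) and (4 - 3).
simon = {"film_a": 5, "film_b": 4}
chris = {"film_a": 1, "film_b": 3}

print(round(euclidean_score(simon, chris), 8))  # 0.23570226
```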

Euclidean Distance Score

• Easy to calculate

• Bad for people who are similar but
consistently rate higher/lower

Pearson Correlation Coefficient

• More Complicated
• Line of Best Fit between commonly rated items
• Deals with grade inflation

• Other measures
– Jaccard Coefficient
– Manhattan Distance
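
A minimal Python sketch of the Pearson correlation coefficient over commonly rated items is given below. The function name, the sample ratings and the decision to return 0 when the coefficient is undefined are assumptions for illustration; the sample is chosen so that one person rates every item exactly two points higher than the other.

```python
from math import sqrt


def pearson_score(a, b):
    """Pearson correlation (-1 to 1) over the items both people have rated."""
    common = [item for item in a if item in b]
    n = len(common)
    if n == 0:
        return 0.0
    xs = [a[item] for item in common]
    ys = [b[item] for item in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    denominator = sqrt(spread_x * spread_y)
    if denominator == 0:
        return 0.0  # assumption: undefined correlation is treated as no similarity
    return covariance / denominator


# Hypothetical ratings: the second person is consistently two points more generous.
harsh = {"film_a": 1, "film_b": 2, "film_c": 3}
generous = {"film_a": 3, "film_b": 4, "film_c": 5}

print(pearson_score(harsh, generous))
```

Because the two sets of ratings lie on a straight line, the coefficient is 1 even though no individual rating matches, which is the grade inflation case that a distance-based score penalises.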

User based Filtering
• Look at what similar people have liked but you
haven’t seen
– But what if a similar person likes something that has
bad reviews from everyone else?

• Weighted Score that ranks the other people and
takes into account similarity

Recommending Items

               Similarity (ED)   Titanic   Sim x Titanic   Seven   Sim x Seven
Chris          0.23              4         0.92
Paul           0.78              2         1.56            4       3.12

Total                                      2.48                    3.12
Sim Sum                                    1.01                    0.78
Total/Sim Sum                              2.455445545             4
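
The weighted score in the table can be reproduced with the sketch below. The similarities and ratings are the table's own values; the function name and data layout are assumptions for illustration.

```python
def predict_scores(similarities, ratings):
    """Similarity-weighted average rating for each item rated by similar people."""
    totals = {}
    sim_sums = {}
    for person, sim in similarities.items():
        for item, rating in ratings[person].items():
            totals[item] = totals.get(item, 0.0) + sim * rating
            sim_sums[item] = sim_sums.get(item, 0.0) + sim
    # Dividing by the summed similarities stops items rated by many people
    # from winning on volume alone.
    return {item: totals[item] / sim_sums[item]
            for item in totals if sim_sums[item] > 0}


similarities = {"Chris": 0.23, "Paul": 0.78}  # Euclidean distance scores from the table
unseen_ratings = {
    "Chris": {"Titanic": 4},
    "Paul": {"Titanic": 2, "Seven": 4},
}

predicted = predict_scores(similarities, unseen_ratings)
for item, score in sorted(predicted.items(), key=lambda pair: pair[1], reverse=True):
    print(item, round(score, 3))
```

Seven comes out ahead of Titanic (about 4 against about 2.455), matching the Total/Sim Sum row.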

User Based Filtering - Conclusions

• Calculate Similarity between users
• Recommend based on similar users

• Similarity
– Euclidean Distance Score
– Pearson Coefficient – better for non-normalised data

• Problem – need to compare every user/item to every other
user/item

Item Based Filtering
• Pre-compute most similar items for each item
– Item similarities change less often than user
similarities and can be re-used

• Create a weighted list of items most similar to
user’s top rated items

Recommending Items

                                   Rating   Titanic (ED)   Rat x Titanic   Seven (ED)   Rat x Seven
Shawshank                          5        0.084          0.42            0.366        1.83
The Ghost                          4        0.125          0.5             0.487        1.948
Lock Stock                         4        0.091          0.364           0.318        1.272
Love Actually                      1        0.737          0.737           0.184        0.184

Total                                       1.037          2.021           1.355        5.234
Normalised (Rating / Similarity)                           1.948                        3.862730627
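
The item-based prediction in the table can be reproduced in the same way. The ratings and item similarities below are the table's own values (the ratings match Simon's row in the example data); the function name and data layout are assumptions for illustration, and the similarities are treated as already pre-computed.

```python
def predict_from_items(user_ratings, item_similarities):
    """Predict a score for each unseen item from the user's own ratings,
    weighted by how similar each rated item is to the unseen one."""
    predictions = {}
    for target, sims in item_similarities.items():
        rated = [(item, sim) for item, sim in sims.items() if item in user_ratings]
        weighted_total = sum(user_ratings[item] * sim for item, sim in rated)
        sim_total = sum(sim for _, sim in rated)
        if sim_total > 0:
            predictions[target] = weighted_total / sim_total
    return predictions


user_ratings = {"Shawshank": 5, "The Ghost": 4, "Lock Stock": 4, "Love Actually": 1}

# Pre-computed Euclidean distance scores between each unseen film and the rated films.
item_similarities = {
    "Titanic": {"Shawshank": 0.084, "The Ghost": 0.125,
                "Lock Stock": 0.091, "Love Actually": 0.737},
    "Seven": {"Shawshank": 0.366, "The Ghost": 0.487,
              "Lock Stock": 0.318, "Love Actually": 0.184},
}

for film, score in predict_from_items(user_ratings, item_similarities).items():
    print(film, round(score, 3))
```

Titanic is pulled down because it is most similar to the film this user rated lowest, while Seven is most similar to the films rated 4 and 5.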

Item Based Filtering - Conclusions

• Calculate Similarity between items
• Recommend based on user’s ratings for items

• Similarity (as before)
– Euclidean Distance Score
– Pearson Coefficient – better for non-normalised data

• Problem – need to maintain item similarity data set

Item vs. User Based Filtering
• Item based scales better
– Need to maintain the similarities data set

• User based simpler to implement
• May (or may not) want to show users who is similar in
terms of habits
• Perform equally on dense data sets
• Item based performs better on sparse data sets

Questions?
• Reference: Programming Collective Intelligence,
Toby Segaran, O’Reilly 2007

• s.j.woodman@ncl.ac.uk
