NoSQL - Life beyond the Outer Join
              Glen Smith
        (glen@bytecode.com.au)
Objectives


   Survey the landscape of NoSQL offerings
   Learn some of the terminology
   Look at some of the Java offerings in the space
   Take away source to play with
   Be able to ask questions (but you may not get
    answers)
What is NoSQL?


   (N)ot (O)nly SQL not “Anti SQL”
   Movement more than “one” technology
   Distributed Storage System
   Much weaker queries
   Scale across many machines
   Much larger data, much faster queries
Why NoSQL?


 Inspired by Distributed Data Storage problems
 Scale easily by adding servers
 Not suited to all problem types, but super-suited to
  certain large problem types
 High-write situations (eg activity tracking or timeline
  rendering for millions of users)
 A lot of relational uses are really dumbed down (eg
  fetch by PK with update)
What’s wrong with RDBMS?


 Nothing ;-)
 To scale RDBMS, your approach is typically:
   Shard your datasource
   Put in a bunch of read replicas
   Put memcached in front of those
 What could possibly go wrong? 
   Complex. Custom caching. Partitioning. Migrating of
    shards. Tons of moving parts.
How can I live w/o ACID?


   Atomic (it happens or not, no partial completes)
   Consistent (DB internals, referential integrity, field validation)
   Isolated (Can’t modify uncommitted data)
   Durable (written to disk/transaction log)

 But in a distributed db, life is not so simple...
The CAP theorem


In a distributed system, when you have state on more
  than one machine, pick any two:
 Consistency (easy in read-only states – copy!)
 Availability (can you get at your data? Is it up?)
 Partition Tolerance (3 machines on one net, 3 on the
  other, with a broken link. How do you take updates
  when you can’t keep both sides up to date? What if you
  don’t agree on what’s up?)
How do these NoSQL things work?


 Basically big distributed hashtables
 Push all logic into the write (update two lists – one for
  userId, one for email)
 Things don’t happen transactionally. These are two
  writes.
 There is no free lunch. The programmer is now
  handling consistency problems.
 You were thinking about query optimisation before,
  and now even more so.
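
To make the "two lists" point concrete, here is a minimal sketch in plain Java. The class `UserDirectory` and its in-memory maps are hypothetical stand-ins for a key/value store, not any real client API. It shows a save that performs two independent writes (one keyed by userId, one by email) and leaves the cleanup of stale entries to the programmer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a distributed key/value store.
class UserDirectory {
    static final class User {
        final String userId;
        final String email;
        final String name;

        User(String userId, String email, String name) {
            this.userId = userId;
            this.email = email;
            this.name = name;
        }
    }

    // "List" one: primary records keyed by userId.
    private final Map<String, User> byUserId = new HashMap<>();
    // "List" two: a hand-maintained index from email to userId.
    private final Map<String, String> userIdByEmail = new HashMap<>();

    // Two separate writes with no transaction around them. In a real
    // distributed store a failure between them leaves the index stale,
    // and it is the application's job to notice and repair that.
    void save(User user) {
        User previous = byUserId.put(user.userId, user);      // write 1
        if (previous != null && !previous.email.equals(user.email)) {
            userIdByEmail.remove(previous.email);             // drop the old index entry
        }
        userIdByEmail.put(user.email, user.userId);           // write 2
    }

    User findByUserId(String userId) {
        return byUserId.get(userId);
    }

    // The "query" is just two key lookups; all the work was done at write time.
    User findByEmail(String email) {
        String userId = userIdByEmail.get(email);
        return userId == null ? null : byUserId.get(userId);
    }
}
```

Reads stay cheap because the index was paid for on the write path, which is the trade these stores ask you to make.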
How big are we talking?


   Digg - 3 TB
   Facebook Inbox – 50 TB
   eBay – 2 PB
   Think about Twitter’s issues... Billions of queries a
    second over TBs of data.
The NoSQL Taxonomy


 Key-Value In-Memory stores (Memcached, Redis)
 Key-Value “Eventually Consistent” stores (“Dynamo
  Clones” like Cassandra, Voldemort, Riak)
 Document stores (Couchdb, Mongodb, JCR)
 Graph Databases (Neo4j)
 Tabular (“BigTable clones” like Hadoop/Hbase)
Memcached


   Developed for the original LiveJournal site
   LRU, distributed hashtable
   Logic is in both client and server
   Used in Google App Engine, Facebook, Twitter
   Ehcache now has similar service
   Good for things that outlive an app server
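
As an illustration of the LRU idea only (this is not how memcached itself is implemented), a least-recently-used cache can be sketched in a few lines of Java on top of `LinkedHashMap`. The class name `LruCache` and the `maxEntries` limit are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal single-process LRU cache; a sketch of the eviction policy only.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently used
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insert; returning true evicts the
    // least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

With a limit of two entries, putting "a" and "b", reading "a", then putting "c" evicts "b", because "a" was touched more recently.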
How does it work?


 Clients know how to:
     Send items to servers (consistent hashing)
     Handle a server failure
     Fetch keys from servers
     “Weight” servers by capacity
 Servers know how to:
   Store items they receive
   Expire them from the cache
   No inter-server comms – everything is unaware
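
The client-side logic above can be sketched as a consistent hash ring. This is a minimal illustration under stated assumptions, not the algorithm of any particular memcached client: the class `ConsistentHashRing`, the use of MD5, and the `pointsPerWeight` parameter are all choices made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Sketch of client-side key routing. Servers never talk to each other;
// the client alone decides which server owns a key.
class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // serverWeights maps a server name to a relative capacity; a server with
    // weight 2 gets twice as many points on the ring as one with weight 1.
    ConsistentHashRing(Map<String, Integer> serverWeights, int pointsPerWeight) {
        for (Map.Entry<String, Integer> entry : serverWeights.entrySet()) {
            int points = entry.getValue() * pointsPerWeight;
            for (int i = 0; i < points; i++) {
                ring.put(hash(entry.getKey() + "#" + i), entry.getKey());
            }
        }
    }

    // A key belongs to the first ring point at or after its hash, wrapping around.
    String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers available");
        }
        Map.Entry<Long, String> point = ring.ceilingEntry(hash(key));
        return (point != null ? point : ring.firstEntry()).getValue();
    }

    // On failure, drop the server's points; only keys it owned move elsewhere.
    void removeServer(String server) {
        ring.values().removeIf(server::equals);
    }

    // First 8 bytes of an MD5 digest, packed into a long.
    private static long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long result = 0;
            for (int i = 0; i < 8; i++) {
                result = (result << 8) | (digest[i] & 0xFF);
            }
            return result;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The point of the ring is that removing one server leaves every key owned by the other servers where it was, so a failure invalidates only a slice of the cache rather than all of it.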
Sample Code
Voldemort


   Less than Memcached, but also more!
   Not a cache, but a distributed key/value store
   Developed by LinkedIn
   Works on distributed hashmap w/failover
   Logic can be in client/server or just server
   Pluggable storage (mysql,bdb,mock)
   Pluggable serialization (JSON, Google PB, etc)
“Relaxed” Consistency


 Eventual consistency – data will come into sync but
  not immediately on the write. In practice “pretty
  soon” is milliseconds later
 We are actually used to this – eg Google indexes
  update every so often.
 Guarantees to read your own writes (eg your profile
  on LinkedIn)
 Tuneable to better performance/weaker consistency
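
The "tuneable" trade-off is usually expressed in Dynamo-style stores with three numbers: N replicas, R replicas that must answer a read, and W replicas that must acknowledge a write. The sketch below only does the arithmetic; the class `QuorumSettings` and its method names are hypothetical and not part of Voldemort's API.

```java
// Sketch of Dynamo-style quorum arithmetic (N, R, W).
class QuorumSettings {
    final int replicas;        // N: copies kept of each key
    final int requiredReads;   // R: replicas that must answer a read
    final int requiredWrites;  // W: replicas that must acknowledge a write

    QuorumSettings(int replicas, int requiredReads, int requiredWrites) {
        if (replicas < 1 || requiredReads < 1 || requiredWrites < 1
                || requiredReads > replicas || requiredWrites > replicas) {
            throw new IllegalArgumentException("need 1 <= R <= N and 1 <= W <= N");
        }
        this.replicas = replicas;
        this.requiredReads = requiredReads;
        this.requiredWrites = requiredWrites;
    }

    // If R + W > N, every read set shares at least one replica with every
    // write set, so a read always reaches a replica holding the latest
    // acknowledged write.
    boolean readAndWriteSetsOverlap() {
        return requiredReads + requiredWrites > replicas;
    }

    // Number of replicas that can be down while writes still succeed.
    int writeFailuresTolerated() {
        return replicas - requiredWrites;
    }
}
```

For example, N=3, R=2, W=2 gives overlapping read and write sets and survives one failed replica on writes, while N=3, R=1, W=1 is faster and survives two failures but gives up the overlap.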
What’s attractive?


   Data is automatically replicated
   Partitioning ensures each server holds only a subset of the data
   Server failure is handled transparently
   Data is rebalanced when servers added/removed
   Serialization is pluggable
   Apache License
Impressive Performance


 “We were able to move applications that needed to
  handle hundreds of millions of reads and writes per day
  from over 400ms to under 10ms while simultaneously
  increasing the amount of data we store.”
Performance Info




http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Sample Script


 Starting the server (or deploy as a .war)
bin\voldemort-server.bat config\single_node_cluster
 Starting the console
bin\voldemort-shell.bat test tcp://localhost:6666
 Run some queries
put "hello" "world"
get "hello"
put "hello" "world 2.0"
delete "hello"
Sample Code
CouchDb


 Document-Oriented Db – No Schema
 Written in Erlang (!) by a Notes Dev (!!!)
 Everything is stored in JSON, Restful API
 Clever replication concepts – works in disconnected
  settings
 Every write creates a new version (revision) of the document
 Map/Reduce baked in
 Apache License
What’s attractive?


 Schemaless operation – Adhoc data
 Incremental replication (great for disconnected
  settings)
 Great fault-tolerance (with versioned conflicts)
 Fast query with flexibility (MapReduce)
So what is this Map/Reduce thing?


  Popularized by Google’s MapReduce framework
  Map functions collect documents matching criteria
   and create a B-Tree
  Reduce functions operate on the B-Tree
  Everything happens in parallel on many machines
  Example: distributed grep
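
The distributed grep example can be sketched on a single machine with Java parallel streams standing in for the cluster. This is an illustration of the map and reduce phases only; the class `GrepMapReduce` and the idea of passing partitions as nested lists are assumptions for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Single-JVM sketch of "distributed grep". Each inner list plays the role
// of the data held by one machine.
class GrepMapReduce {
    static Map<String, Long> grep(List<List<String>> partitions, String pattern) {
        return partitions.parallelStream()
                // Map phase: every partition emits its matching lines
                .flatMap(List::stream)
                .filter(line -> line.contains(pattern))
                // Reduce phase: group identical lines (the key) and count them,
                // keeping the result sorted by key
                .collect(Collectors.groupingBy(
                        line -> line, TreeMap::new, Collectors.counting()));
    }
}
```

The map step needs no coordination between partitions, which is what lets a real cluster run it on many machines at once; only the reduce step has to bring results for the same key together.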
The Naked Couch


   http://127.0.0.1:5984/
   http://127.0.0.1:5984/_all_dbs
   http://127.0.0.1:5984/mydb (PUT)
   http://127.0.0.1:5984/_utils/ (Futon)
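
Because the API is plain HTTP, the URLs above can be driven from Java with nothing but the JDK's `java.net.http` package (Java 11 or later). The sketch below only builds the requests; the class `CouchRequests` is hypothetical, and it assumes a local CouchDB with no authentication, which newer CouchDB releases may not allow for database creation.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds the HTTP requests from the slide; sending them is left to the caller.
class CouchRequests {
    static final String BASE = "http://127.0.0.1:5984";

    // GET / returns the server's welcome document
    static HttpRequest serverInfo() {
        return HttpRequest.newBuilder(URI.create(BASE + "/")).GET().build();
    }

    // GET /_all_dbs lists the databases on the server
    static HttpRequest listDatabases() {
        return HttpRequest.newBuilder(URI.create(BASE + "/_all_dbs")).GET().build();
    }

    // PUT /<name> creates a database; the request carries no body
    static HttpRequest createDatabase(String name) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + name))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }
}
```

To actually send one, pass it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and read the JSON from the response body.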
Mapping Couch with Ektorp


 You lose some of the joy of schema-less
 But you do get lots of boilerplate ;-)
 Oh, and strong typing.
Writing a Couch MapReduce


 You write a map function to extract data
 You always return a key/value pair

function(doc) {
  // Skip documents without a string title so indexOf cannot throw
  if (typeof doc.title === "string" && doc.title.indexOf("Hi!") > -1) {
    emit(doc.title, doc);
  }
}
Neo4j


   Stores data in a graph of nodes and relationships
   Can handle billions of nodes per machine
   Means you can query on relationships!
   Supports ACID transactions
   One 500kb jar (!)
   Dual-licensed GPL/Commercial
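
To show what "query on relationships" means, here is a small sketch in plain Java. It deliberately does not use the Neo4j API; the class `SocialGraph` and the symmetric "knows" relationship are assumptions for the example. It answers a friends-of-friends question by walking relationships instead of joining tables.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// In-memory graph of people connected by a symmetric "knows" relationship.
class SocialGraph {
    private final Map<String, Set<String>> knows = new HashMap<>();

    void addRelationship(String a, String b) {
        knows.computeIfAbsent(a, key -> new HashSet<>()).add(b);
        knows.computeIfAbsent(b, key -> new HashSet<>()).add(a);
    }

    // People exactly two hops away: known by a friend, but neither the
    // person themselves nor already a direct friend.
    Set<String> friendsOfFriends(String person) {
        Set<String> direct = knows.getOrDefault(person, Set.of());
        Set<String> result = new TreeSet<>();
        for (String friend : direct) {
            for (String candidate : knows.getOrDefault(friend, Set.of())) {
                if (!candidate.equals(person) && !direct.contains(candidate)) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }
}
```

In a relational schema the same question needs a self-join per hop; in a graph store each hop is a pointer walk from a node to its neighbours.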
Sample Code
Blogvertising


 http://blogs.bytecode.com.au/glen
 http://twitter.com/glen_a_smith
 http://grailspodcast.com/


 Download all the source from today:
 http://bitbucket.org/glen_a_smith/cjug-nosql-examples
Q&A


 Looking for a good book?
