Slides presented at SDBigData Meetup:
http://www.meetup.com/sdbigdata/events/225691323/
There was a request for more Couchbase use case information and a NoSQL primer, so I added a number of slides that let me speak to those topics right before the presentation.
KEY POINT: COUCHBASE PROVIDES A SET OF MULTI-PURPOSE, CORE CAPABILITIES THAT SUPPORT A BROAD RANGE OF APPLICATIONS AND USE CASES, ALL IN A SINGLE DATA MANAGEMENT PLATFORM.
Couchbase provides a set of technology capabilities to support a broad range of applications and use cases:
High Availability Cache: Couchbase provides an integrated managed object cache, so you can start out using Couchbase as a high availability cache on top of your existing relational database. For example, you can use Couchbase as a session store in front of your relational database, if your relational DB is struggling to keep up with the load required for online interactive applications.
Key-Value Store: Many customers start with Couchbase as a cache and then broaden their usage to other capabilities, like using Couchbase as a Key-Value Store for things like Profile Management.
Document Database: From there, you can grow into using Couchbase as a Document Database, where you can do more with capabilities like indexing and Cross Data Center Replication.
Embedded Database: Couchbase also provides an embedded database called Couchbase Lite. It’s a purpose-built database for the device, so you can build applications that are always available and always work, whether offline or online.
Sync Management: Finally, as part of our solution for mobile applications, we provide Couchbase Sync Gateway, which automatically synchronizes data on the device with Couchbase Server in the cloud so your developer doesn’t have to write code to manage the complex sync process.
Starting with cache and then expanding to other capabilities is often a good way to learn the technology and get comfortable with Couchbase for a wider set of use cases.
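To make the "cache on top of your existing relational database" idea concrete, here is a minimal sketch of the cache-aside pattern. It is an illustration only: the `CacheAside` class, the `users` table, and the plain dict standing in for the cache tier are all assumptions made for this example, and nothing here calls a real Couchbase SDK.

```python
import sqlite3


class CacheAside:
    """Cache-aside wrapper: check the cache first, fall back to the database."""

    def __init__(self, db):
        self.db = db        # slow system of record (a relational database)
        self.cache = {}     # hypothetical stand-in for the cache tier (a dict)
        self.db_reads = 0   # counts how often the database was actually queried

    def get_profile(self, user_id):
        if user_id in self.cache:           # cache hit: no database work
            return self.cache[user_id]
        self.db_reads += 1                  # cache miss: query the database
        row = self.db.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        value = row[0] if row else None
        if value is not None:
            self.cache[user_id] = value     # populate the cache for later reads
        return value

    def update_profile(self, user_id, name):
        self.db.execute("UPDATE users SET name = ? WHERE id = ?", (name, user_id))
        self.cache.pop(user_id, None)       # invalidate so the next read reloads


def make_demo_db():
    """Build a throwaway in-memory database with one user (demo data only)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("INSERT INTO users VALUES (1, 'alice')")
    return db
```

Repeated reads of the same profile hit the dict and leave `db_reads` unchanged, which is the load relief the cache use case is after. Invalidating on write is one common choice; writing the new value through to the cache is another.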
Couchbase has emerged as a leading NoSQL provider for a number of reasons:
Best in performance and scalability
We’ve engineered Couchbase from the ground up for high performance and scalability
Couchbase is designed to deliver sub-millisecond responsiveness with very high throughput for both reads and writes
We consistently outperform competitors like MongoDB and DataStax in multiple independent benchmarks
Our performance advantage is driven in large part by our memory-centric architecture, which includes an integrated managed object cache and stream-based replication
Broad use case support
We’re the only NoSQL provider that has consolidated distributed cache, key-value store, and a JSON-based document database in a single platform
This means customers can use Couchbase for a much broader range of applications
Integrated mobile solution
We’re the only vendor that provides an end-to-end NoSQL mobile solution, which lets customers easily build mobile apps that run great online or offline
Includes a JSON database embedded on the device, along with a prebuilt syncing tier
So apps run great on the device, even with no network connection at all
Data on the device auto-syncs with the backend server when a connection is available
Simplified administration
We’ve designed Couchbase to be exceptionally easy to deploy and manage
Features such as an integrated Admin Console and single-click cluster expansion & rebalance dramatically increase admin efficiency
Each Couchbase node is exactly the same.
All nodes are broken down into two components: A data manager (on the left) and a cluster manager (on the right). It’s important to realize that these are separate processes within the system specifically designed so that a node can continue serving its data even in the face of cluster problems like network disruption.
The data manager is written in C and C++ and is responsible for the object caching layer, the persistence layer, and the querying engine. It is based on memcached and so provides a number of benefits:
-The very low lock contention of memcached allows for extremely high throughput and low latencies, whether for a small set of documents (or just one) or across millions of documents
-Being compatible with the memcached protocol means we are not only a drop-in replacement, but also inherit support for automatic item expiration (TTL) and atomic increment operations
-We’ve increased the maximum object size to 20 MB, but still recommend keeping objects much smaller
-Support for both binary objects and native JSON documents
-All of the metadata for the documents and their keys is kept in RAM at all times. While this does add a bit of overhead per item, it also allows for extremely fast “miss” speeds, which are critical to the operation of some applications: we don’t have to scan a disk to know when we don’t have some data.
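The expiration, atomic counter, and fast-miss points above can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not the memcached or Couchbase implementation: `TinyKV`, its injectable `clock`, and the choice to create a missing counter at `initial` are all hypothetical details picked for this example.

```python
import threading
import time


class TinyKV:
    """Toy key-value store with TTL expiry and a lock-protected increment."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock            # injectable clock (assumption, eases testing)
        self._lock = threading.Lock()
        self._meta = {}                # key -> expiry time or None, always in memory
        self._values = {}              # key -> stored value

    def _live(self, key):
        """Return True if key exists and has not expired; drop it if expired."""
        if key not in self._meta:      # a miss is answered from metadata alone
            return False
        expiry = self._meta[key]
        if expiry is not None and self._clock() >= expiry:
            del self._meta[key]
            self._values.pop(key, None)
            return False
        return True

    def set(self, key, value, ttl=None):
        with self._lock:
            self._values[key] = value
            self._meta[key] = None if ttl is None else self._clock() + ttl

    def get(self, key):
        with self._lock:
            return self._values[key] if self._live(key) else None

    def incr(self, key, delta=1, initial=0):
        """Read-modify-write under one lock, so concurrent callers never lose updates."""
        with self._lock:
            if not self._live(key):
                self._values[key] = initial   # toy choice: missing key starts at `initial`
                self._meta[key] = None
            else:
                self._values[key] += delta
            return self._values[key]
```

Because key metadata lives in a dict, a lookup for a key that was never stored returns immediately without touching any slower tier, which is the "fast miss" behavior described above.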
The cluster manager is based on Erlang/OTP which was developed by Ericsson to deal with managing hundreds or even thousands of distributed telco switches. This component is responsible for configuration, administration, process monitoring, statistics gathering and the UI and REST interface. Note that there is no data manipulation done through this interface.
Now, as you fill up memory (click), some data that has already been written to disk will be ejected from RAM to make room for new data. (click)
Couchbase supports holding much more data than you have RAM available. It’s important to size the RAM capacity appropriately for your working set: the portion of data your application is working with at any given point in time and needs very low latency, high throughput access to. In some applications this is the entire data set, in others it is much smaller. As RAM fills up, we use a “not recently used” algorithm to determine the best data to be ejected from cache.
Should a read now come in for one of those documents that has been ejected (click), it is copied back from disk into RAM and sent back to the application. The document then remains in RAM as long as there is space and it is being accessed.
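The ejection and read-back behavior described above can be sketched as follows. This is a simplified, one-bit "not recently used" variant written for illustration; the `WorkingSetCache` class, its `age()` sweep, and the dicts standing in for RAM and disk are assumptions, and the real server's bookkeeping is more involved.

```python
class WorkingSetCache:
    """Toy RAM-over-disk cache with one-bit not-recently-used (NRU) ejection."""

    def __init__(self, capacity):
        self.capacity = capacity   # maximum number of items resident in "RAM"
        self.ram = {}              # key -> value, currently resident
        self.disk = {}             # key -> value, everything that was persisted
        self.referenced = {}       # key -> True if used since the last age() sweep
        self.disk_reads = 0        # counts reads that had to go to "disk"

    def set(self, key, value):
        self.disk[key] = value     # toy simplification: persist immediately
        self._admit(key, value)

    def get(self, key):
        if key in self.ram:        # resident: serve from RAM and mark as used
            self.referenced[key] = True
            return self.ram[key]
        if key not in self.disk:   # never stored at all
            return None
        self.disk_reads += 1       # ejected earlier: copy back from disk into RAM
        value = self.disk[key]
        self._admit(key, value)
        return value

    def age(self):
        """Periodic sweep: clear every reference bit to start a new usage period."""
        for key in self.referenced:
            self.referenced[key] = False

    def _admit(self, key, value):
        if key not in self.ram and len(self.ram) >= self.capacity:
            self._eject_one()
        self.ram[key] = value
        self.referenced[key] = True

    def _eject_one(self):
        # Prefer an item not used since the last sweep; otherwise take the oldest.
        victim = next((k for k in self.ram if not self.referenced[k]),
                      next(iter(self.ram)))
        del self.ram[victim]       # safe to drop: the value is already on disk
        del self.referenced[victim]
```

Only items already persisted are dropped from RAM, so ejection never loses data; a later read simply pays one disk fetch to bring the item back.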
KEY POINTS: BIG DATA IS NOT ONE THING – IT’S A COMBINATION OF OPERATIONAL (NOSQL) AND ANALYTICAL DATABASES. YOU NEED BOTH. COUCHBASE PROVIDES THE OPERATIONAL SOLUTION.
Big data has two major pieces: Operational and Analytical
Operational is about:
Real time
Online, interactive
Customer/consumer facing
Processing data at high velocity
Analytical is about:
Offline analytics
Often batch oriented
Takes time processing
Directly touches relatively few users (business analysts)
These two pieces together form “Big Data”
There’s some overlap
NoSQL can deliver some analytics
Hadoop can deliver some operational
But in general each technology is designed for a separate purpose
Couchbase fits on the operational side, Hadoop on the analytics side
The data generated by users is published to Apache Kafka.
Next, it’s pulled into Apache Storm for real-time analysis and processing as well as into Hadoop.
Finally, Storm writes the data to Couchbase Server for real-time access by LivePerson agents while the data in Hadoop is eventually accessed via HP Vertica and MicroStrategy for offline business intelligence and analysis.
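The flow above, where one published stream feeds both a real-time path and an offline path, can be sketched with an append-only log that tracks a separate read offset per consumer. Everything here is a hypothetical in-memory stand-in: `EventLog`, the consumer names, and the event fields are assumptions for illustration, and no Kafka, Storm, Couchbase, or Hadoop API is used.

```python
class EventLog:
    """Toy append-only log where each consumer keeps its own read offset."""

    def __init__(self):
        self._events = []
        self._offsets = {}   # consumer name -> index of the next event to read

    def publish(self, event):
        self._events.append(event)

    def poll(self, consumer, max_events=100):
        """Return the next batch for this consumer and advance only its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch


def run_pipeline(events):
    log = EventLog()
    for event in events:
        log.publish(event)

    # Real-time path: keep the latest page per visitor for interactive lookups.
    agent_view = {}          # stand-in for the operational store
    for event in log.poll("realtime"):
        agent_view[event["visitor"]] = event["page"]

    # Offline path: the same events, read independently, kept in full.
    archive = list(log.poll("batch"))   # stand-in for the analytical store
    return agent_view, archive
```

Because offsets are per consumer, the real-time reader and the batch reader each see every event without interfering with one another.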
The data is first collected by a tracking and collection service. Next, Storm pulls the data in for filtering, enrichment, and statistical analysis. The raw data is written to one Couchbase Server cluster while the processed data is written to a separate Couchbase Server cluster. The processed data is accessed by a front end for visualization and analysis. In addition, the raw data is copied from Couchbase Server to Hadoop. It’s combined with additional data and the whole is moved into HBase for ad hoc analysis. PayPal was able to handle both the volume and the velocity of data as well as meet both operational and analytical requirements. They relied on data capture, stream processing, NoSQL and Hadoop to do so.
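As a rough sketch of the filter, enrich, and analyze steps with separate raw and processed stores, consider the function below. The event fields (`ip`, `amount`), the `country_by_ip` lookup, and the lists standing in for the two clusters are all hypothetical choices for this example, not details of the actual deployment.

```python
from statistics import mean


def process_events(events, country_by_ip):
    """Keep every raw event, and separately store filtered, enriched events."""
    raw_store = []         # stand-in for the raw-data cluster
    processed_store = []   # stand-in for the processed-data cluster

    for event in events:
        raw_store.append(event)              # raw data is kept unmodified
        if event.get("amount") is None:      # filter: skip events with no amount
            continue
        # Enrich: attach a country looked up from the IP address.
        country = country_by_ip.get(event["ip"], "unknown")
        processed_store.append(dict(event, country=country))

    # Statistical analysis over the processed events only.
    amounts = [e["amount"] for e in processed_store]
    stats = {
        "count": len(amounts),
        "mean_amount": mean(amounts) if amounts else 0.0,
    }
    return raw_store, processed_store, stats
```

Keeping the raw events untouched means they can later be copied elsewhere for ad hoc analysis, while the front end reads only the smaller processed set.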