2. CAP Theorem
Before we get into big data and the role of NOSQL, we must first
understand the CAP theorem. In theoretical computer science, the
CAP theorem, also known as Brewer's theorem, states that it is
impossible for a distributed computer system to simultaneously
provide all three of the following guarantees:
1. Consistency (all nodes see the same data at the same time)
2. Availability (a guarantee that every request receives a response about
whether it succeeded or failed)
3. Partition tolerance (the system continues to operate despite arbitrary
message loss or failure of part of the system)
Although all three cannot be guaranteed at once, a system can achieve
any two of them. For example, to get high availability and
partition tolerance, you need to sacrifice consistency.
4. The 5 Vs of Big Data
• Big data is a broad term for data sets so large or complex that
traditional data processing applications are inadequate.
• Challenges include analysis, capture, data curation,
search, sharing, storage, transfer, visualization,
and information privacy.
• We currently only see the beginnings of a transformation into a
big data economy. Any business that doesn’t seriously
consider the implications of Big Data runs the risk of being left
behind.
• To get a better understanding of what Big Data is, it is often
described using 5 Vs: Volume, Velocity, Variety, Veracity and Value.
5. Volume
Volume refers to the vast amounts of data generated
every second. We are not talking terabytes but
zettabytes or brontobytes. If we take all the data
generated in the world between the beginning of time
and 2008, the same amount of data will soon be
generated every minute. This makes most data sets too
large to store and analyse using traditional database
technology. New big data tools use distributed systems
so that we can store and analyse data across databases
that are dotted around anywhere in the world.
6. Variety
Variety refers to the different types of data we
can now use. In the past we only focused on
structured data that neatly fitted into tables or
relational databases, such as financial data. In
fact, 80% of the world’s data is unstructured
(text, images, video, voice, etc.). With big data
technology we can now analyse and bring
together data of different types such as
messages, social media conversations, photos,
sensor data, video or voice recordings.
7. Velocity
Velocity refers to the speed at which new data
is generated and the speed at which data moves
around. Just think of social media messages
going viral in seconds. Technology now allows
us to analyze the data while it is being
generated (sometimes referred to as in-memory
analytics), without ever putting it into databases.
8. Veracity & Value
Veracity refers to the truthfulness and
correctness of the data.
Value: having access to big data is no
good unless we can turn it into value.
Companies are starting to generate
amazing value from their big data.
9. Big Data and Human Brain
To understand how a big data solution could be architected, let’s try to
understand how the human brain is architected.
So the key is parallel processing. Hooray!
10. Hadoop & MapReduce
• In 2004, Google published a paper on a process called MapReduce that
used such an architecture.
• The MapReduce framework provides a parallel processing model and
associated implementation to process huge amounts of data. With
MapReduce, queries are split and distributed across parallel nodes and
processed in parallel (the Map step). The results are then gathered and
delivered (the Reduce step). The framework was very successful, so others
wanted to replicate the algorithm. Therefore, an implementation of the
MapReduce framework was adopted by an Apache open source project
named Hadoop.
• But MapReduce only covers processing the data. How can we store and
retrieve such huge amounts of data?
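The Map and Reduce steps described above can be sketched on a single machine in plain Java. This is only a toy illustration, not Hadoop code: the class name WordCountSketch and its methods are hypothetical, and Java parallel streams stand in for the parallel nodes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Toy illustration of the map and reduce steps on one machine.
// This is NOT Hadoop; all names here are hypothetical.
class WordCountSketch {

    // Map step: each chunk is processed independently and yields partial counts.
    static Map<String, Integer> mapChunk(String chunk) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Reduce step: gather the partial results and combine the counts per word.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    // Split the work across threads (map), then merge the results (reduce).
    static Map<String, Integer> wordCount(List<String> chunks) {
        List<Map<String, Integer>> partials = chunks.parallelStream()
                .map(WordCountSketch::mapChunk)
                .collect(Collectors.toList());
        return reduce(partials);
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("big data is big", "data moves fast");
        // prints {big=2, data=2, fast=1, is=1, moves=1}
        System.out.println(wordCount(chunks));
    }
}
```

In a real cluster each chunk would live on a different node, so the map step runs where the data is stored and only the small partial results travel over the network.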
11. NoSQL Database
• A NoSQL (often interpreted as Not only SQL) database, often used in big data and
real-time web applications, provides a mechanism for storage and retrieval of data
that is modeled in means other than the tabular relations used in relational
databases.
• Motivations for this approach include simplicity of design, horizontal scaling, and finer
control over availability. The data structures used by NoSQL databases (e.g. key-
value, graph, or document) differ from those used in relational databases, making
some operations faster in NoSQL and others faster in relational databases.
• The particular suitability of a given NoSQL database depends on the problem it must
solve.
12. Types of NoSQL databases
• There have been various approaches to classify NoSQL databases, each
with different categories and subcategories. Because of the variety of
approaches and overlaps it is difficult to get and maintain an overview of
non-relational databases. Nevertheless, a basic classification is based on
data model. A few examples in each category are:
• Column: Accumulo, Cassandra, Druid, HBase, Vertica
• Document: Lotus Notes, Clusterpoint, Apache CouchDB, Couchbase,
MarkLogic, MongoDB, OrientDB, Qizx
• Key-value: CouchDB, Dynamo, FoundationDB, MemcacheDB, Redis, Riak,
FairCom c-treeACE, Aerospike, OrientDB, MUMPS
• Graph: Allegro, Neo4J, InfiniteGraph, OrientDB, Virtuoso, Stardog
• Multi-model: OrientDB, FoundationDB, ArangoDB, Alchemy Database,
CortexDB
13. Graph Database
• This kind of database is designed for data
whose relations are well represented as a
graph (elements interconnected with an
undetermined number of relations
between them). The kind of data could be
social relations, public transport links, road
maps or network topologies.
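As a minimal sketch of the idea, the public transport example can be modeled as an adjacency list and queried with a breadth-first search. The class TransportGraphSketch and the stop names are hypothetical, and a real graph database would provide its own storage and query language instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical in-memory graph; not the API of any graph database.
class TransportGraphSketch {
    // Each stop maps to the list of stops it is directly linked to.
    private final Map<String, List<String>> links = new HashMap<>();

    // Add an undirected link between two stops.
    void addLink(String a, String b) {
        links.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        links.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }

    // Breadth-first search: minimum number of links between two stops, or -1.
    int hops(String from, String to) {
        Map<String, Integer> distance = new HashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        distance.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(to)) {
                return distance.get(current);
            }
            for (String next : links.getOrDefault(current, Collections.emptyList())) {
                if (!distance.containsKey(next)) {
                    distance.put(next, distance.get(current) + 1);
                    queue.add(next);
                }
            }
        }
        return -1; // no route between the two stops
    }

    public static void main(String[] args) {
        TransportGraphSketch g = new TransportGraphSketch();
        g.addLink("Central", "Museum");
        g.addLink("Museum", "Harbour");
        g.addLink("Central", "Airport");
        System.out.println(g.hops("Airport", "Harbour")); // prints 3
    }
}
```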
14. Key-value stores
• In this model, data is represented as a
collection of key-value pairs, such that
each possible key appears at most once
in the collection. The key-value model is
one of the simplest non-trivial data
models, and richer data models are often
implemented on top of it.
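A minimal sketch of the key-value model, assuming nothing more than an in-memory map: each key appears at most once, and a richer "user" record is layered on top by a key naming convention. The class KeyValueSketch and the key names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory key-value store; not the API of any real product.
class KeyValueSketch {
    private final Map<String, String> data = new HashMap<>();

    // Storing under an existing key replaces the old value,
    // so each key appears at most once in the collection.
    void put(String key, String value) {
        data.put(key, value);
    }

    String get(String key) {
        return data.get(key);
    }

    boolean delete(String key) {
        return data.remove(key) != null;
    }

    int size() {
        return data.size();
    }

    public static void main(String[] args) {
        KeyValueSketch store = new KeyValueSketch();
        // A richer model (a user record) built from plain key-value pairs.
        store.put("user:42:firstname", "John");
        store.put("user:42:lastname", "Doe");
        store.put("user:42:firstname", "Johnny"); // overwrites, no duplicate key
        System.out.println(store.get("user:42:firstname")); // prints Johnny
        System.out.println(store.size());                   // prints 2
    }
}
```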
15. Document-oriented databases
• The central concept of a document store is the notion of a
"document". While each document-oriented database
implementation differs on the details of this definition, in general,
they all assume that documents encapsulate and encode data (or
information) in some standard formats or encodings. Encodings in
use include XML, JSON as well as binary forms like BSON.
• Two widely used NoSQL solutions are MongoDB and Couchbase,
and both of them are document-oriented databases.
• Here is a sample document:
{
  "_id": "5897g42s0245afo4o473ai1e7",
  "firstname": "John",
  "lastname": "Doe",
  "age": 26,
  "sex": "M",
  "interests": [ "Reading", "Running", "Hacking" ]
}
19. Scalability
• In Couchbase, you can easily add servers to form a cluster and
obtain a distributed system. Couchbase is flexible enough to avoid
downtime while doing so. It relies on Erlang, a functional,
fault-tolerant language that supports hot code changes.
• For MongoDB, the configuration is a bit more complicated. For
example, once you have defined the shard key (the key to distribute
documents within a sharded cluster), it becomes difficult to change it
afterwards. The system is not as flexible, so you have to think
carefully about your data modeling before you move your
application into production.
• Scalability is why Couchbase is widely used in social gaming, where
millions of players can play and their numbers can increase
exponentially overnight.
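To see why a shard key is hard to change later, here is a simplified sketch that routes documents to shards by hashing the shard key value. It is an assumption-laden illustration, not how MongoDB or Couchbase actually place data: the class ShardKeySketch, the hash routing and the two-field documents are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Simplified illustration only: real sharded clusters use their own
// chunking and balancing logic. All names here are hypothetical.
class ShardKeySketch {

    // Route a document to a shard by hashing its shard key value.
    static int shardFor(String shardKeyValue, int shardCount) {
        return Math.floorMod(shardKeyValue.hashCode(), shardCount);
    }

    // Count the documents that would land on a different shard if the shard
    // key changed from the first field (userId) to the second (country).
    static int documentsToMove(List<String[]> documents, int shardCount) {
        int moved = 0;
        for (String[] doc : documents) {
            int byUserId = shardFor(doc[0], shardCount);
            int byCountry = shardFor(doc[1], shardCount);
            if (byUserId != byCountry) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[] {"user-1", "FR"},
                new String[] {"user-2", "DE"},
                new String[] {"user-3", "FR"});
        System.out.println("user-1 lives on shard " + shardFor("user-1", 4));
        System.out.println(documentsToMove(docs, 4) + " of " + docs.size()
                + " documents would have to move");
    }
}
```

Because the shard is derived from the key, changing the key means recomputing the placement of every document and physically moving many of them, which is why the choice deserves care before going to production.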
20. Monitoring tool
Couchbase comes with a turnkey monitoring package, while MongoDB requires an
additional subscription to a monitoring service. You can monitor MongoDB
from the command line, but a monitoring tool without a graphical interface is
relatively restrictive.
21. Introducing Couchbase
• Couchbase describes its product as the most complete, most scalable and
best performing NoSQL database.
• It is based on a shared-nothing architecture, a single node type, a built-in
caching layer and true auto-sharding. It also offers what Couchbase calls the
world’s first NoSQL mobile offering: Couchbase Mobile, a complete NoSQL
mobile solution comprising Couchbase Server, Couchbase Sync Gateway and
Couchbase Lite.
• Clients: AT&T, Amadeus, Bally’s, Beats Music, Cisco, Comcast,
Concur, Disney, eBay / PayPal, Neiman Marcus, Orbitz, Rakuten /
Viber, Sky, Tencent, Tesco, Verizon and Willis Group, as well as
hundreds of other household names worldwide
22. Real life Use Cases
Couchbase Server’s strength is its combination of 1) linear, horizontal
scalability, 2) sustained low latency and high throughput performance, and
3) the extensibility of the system.
A few use cases:
• Session store: User sessions are easily stored and managed in
Couchbase, for instance, by using the document ID naming scheme,
“user:USERID”. With Couchbase Server, you can flag items for deletion
after a certain amount of time, and therefore you have the option of having
Couchbase automatically delete old sessions.
• Social gaming: You can model and store game state, property state, time
lines, conversations and chats with Couchbase Server. The asynchronous
persistence algorithms of Couchbase were designed, built and deployed to
support some of the highest scale social games.
• Ad, offer, and content targeting: The same attributes which serve
Couchbase in the gaming context also apply well for real-time ad and
content targeting. For example, Couchbase provides a fast storage
capability for counters. Counters are useful for tracking visits, associating
users with various targeting profiles, tracking ad-offers, and for tracking ad-
inventory.
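The session store and counter use cases can be sketched in plain Java. This sketch does not use the Couchbase SDK: the class SessionStoreSketch is a hypothetical in-memory stand-in that only mimics the "user:USERID" naming scheme, expiry after a time to live, and counter increments. The current time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in; this is NOT the Couchbase SDK.
class SessionStoreSketch {

    private static final class Entry {
        final String value;
        final long expiresAtMillis;

        Entry(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> sessions = new HashMap<>();
    private final Map<String, Long> counters = new HashMap<>();

    // Store a session under the "user:USERID" key with a time to live.
    void putSession(String userId, String value, long ttlSeconds, long nowMillis) {
        sessions.put("user:" + userId, new Entry(value, nowMillis + ttlSeconds * 1000));
    }

    // Return the session, or null if it is missing or has expired.
    String getSession(String userId, long nowMillis) {
        String key = "user:" + userId;
        Entry entry = sessions.get(key);
        if (entry == null) {
            return null;
        }
        if (nowMillis >= entry.expiresAtMillis) {
            sessions.remove(key); // old sessions are dropped automatically
            return null;
        }
        return entry.value;
    }

    // Add delta to a named counter (for visits, ad impressions, ...) and return it.
    long increment(String counterKey, long delta) {
        return counters.merge(counterKey, delta, Long::sum);
    }

    public static void main(String[] args) {
        SessionStoreSketch store = new SessionStoreSketch();
        store.putSession("42", "cart=3 items", 60, 0L);
        System.out.println(store.getSession("42", 59_000L)); // prints cart=3 items
        System.out.println(store.getSession("42", 60_000L)); // prints null
        store.increment("visits:home", 1);
        System.out.println(store.increment("visits:home", 1)); // prints 2
    }
}
```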
23. Buckets
• Couchbase Server stores all of your application data either in RAM or on disk. The
data containers used in Couchbase Server are called buckets; there are two bucket
types in Couchbase, which reflect the two types of data storage that we use in
Couchbase Server. Buckets also serve as namespaces for documents and are used
to look up a document by key:
• Couchbase Buckets
• Memcached Buckets
• You can customize the properties of each bucket, within limits, using the Couchbase
Admin Console, the Couchbase Command Line Interface (CLI), or the Couchbase REST
Admin API. Quotas for RAM and disk space can be configured per bucket so you can
manage usage across a cluster.
• Couchbase Server is best suited for fast-changing data items of relatively small size.
For in-memory storage, using Couchbase Memcached buckets, the memcached
standard 1 megabyte limit applies to each value. Items suitable for storage include
shopping carts, user profile, user sessions, time lines, game states, pages,
conversations and product catalog. Items that are less suitable include large audio or
video media files.
• On that note, some Couchbase SDKs offer the additional feature of optionally
compressing/decompressing objects stored in Couchbase. Consider the trade-off
between CPU time and storage space before enabling it.
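The CPU-time versus space trade-off can be sketched with the JDK's own GZIP classes. This is an assumption-laden illustration, not the compression built into any Couchbase SDK: the class CompressionSketch and the 1 megabyte check are hypothetical helpers written for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; not part of any Couchbase SDK.
class CompressionSketch {
    // The memcached standard value limit mentioned above: 1 megabyte.
    static final int MEMCACHED_LIMIT_BYTES = 1024 * 1024;

    // Spend CPU time to shrink the value before storing it.
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Spend CPU time again to restore the value after reading it.
    static byte[] decompress(byte[] packed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static boolean fitsInMemcachedBucket(byte[] value) {
        return value.length <= MEMCACHED_LIMIT_BYTES;
    }

    public static void main(String[] args) {
        byte[] raw = new byte[2 * 1024 * 1024]; // 2 MB of highly repetitive data
        Arrays.fill(raw, (byte) 'a');
        byte[] packed = compress(raw);
        System.out.println("raw fits: " + fitsInMemcachedBucket(raw));
        System.out.println("compressed fits: " + fitsInMemcachedBucket(packed));
        System.out.println("round trip ok: " + Arrays.equals(raw, decompress(packed)));
    }
}
```

Highly repetitive data like this compresses very well; already-compressed media such as audio or video would gain little, which is one more reason such files are a poor fit.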
24. Couchbase Buckets
• Couchbase Buckets: provide data persistence and data replication. Data
stored in Couchbase Buckets is highly available and reconfigurable without
server downtime. They can survive node failures, restore data, and allow
cluster reconfiguration while still fulfilling service requests. The main
features are:
– Supports items up to 20MB in size.
– Persistence, including data sets that are larger than the allocated memory size
for a bucket. You can configure persistence per bucket and Couchbase Server
will persist data asynchronously from RAM to disk.
– Fully supports replication and server rebalancing. You can configure one or more
replica servers for a Couchbase bucket. If a node fails, a replica node can be
promoted to be the host node.
– Full range of statistics supported.
25. Memcached Buckets
• Memcached Buckets: provide in-memory document storage. Memcached
buckets cache frequently used data in memory, thereby reducing the
number of queries a database server must perform in response to web
application requests. Memcached buckets can work alongside relational
database technology, not only NoSQL databases.
– Item size limited to 1 MByte.
– No persistence.
– No replication; no rebalancing.
– Statistics about Memcached Buckets are on RAM usage and client-side
operations.
26. Keys & Metadata
• Everything you store in Couchbase Server is a document with a key and a value. The key is the
unique identifier of the document. The value is either a JSON document or, if you choose, a
byte stream, a simple data type, or another form of serialized object.
• Keys are also known as document IDs and serve the same function as a SQL primary key. A key
in Couchbase Server can be any string and is unique.
• By default, all documents contain metadata that is provided by the Couchbase Server. The
metadata is stored with the document and is used to change how the document is handled.
• CAS Value—Also called CAS token or CAS ID, this value is a unique identifier associated with a
document that is verified by the Couchbase Server before a document is deleted or changed and
provides a form of basic optimistic concurrency. When Couchbase Server checks a CAS value
before changing data, it effectively prevents data loss without having to lock records. Couchbase
Server prevents a document from being altered by an operation if another process has altered the
document, and therefore its CAS value, in the meantime.
• Time to Live (TTL)—This is an expiration for a document typically specified in seconds. By
default, any document created in Couchbase Server that does not have a given TTL will have an
indefinite life span and will remain in Couchbase Server unless an explicit delete call from a client
removes it. The Couchbase Server will delete values during regular maintenance if the TTL for an
item has expired.
Note: The expiration value deletes information from the entire database. It has no effect on when
the information is removed from the RAM caching layer.
• Flags—These are SDK-specific flags which are used to provide a variety of options during
storage, retrieval, update, and removal of documents. Typically flags are optional metadata used
by a Couchbase client library to perform additional processing of a document. An example of
a flag is the ability to specify that a document be formatted in a specific way before it is stored.
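The CAS check described above can be sketched as a small in-memory store. This is not the Couchbase SDK: the class CasSketch and its methods are hypothetical and only illustrate optimistic concurrency, where an update is rejected when the document changed since it was read.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in; this is NOT the Couchbase SDK.
class CasSketch {

    // A stored value together with the CAS value it was written with.
    static final class Versioned {
        final String value;
        final long cas;

        Versioned(String value, long cas) {
            this.value = value;
            this.cas = cas;
        }
    }

    private final Map<String, Versioned> docs = new HashMap<>();
    private long nextCas = 1;

    // Unconditional write: stores the value and returns its new CAS value.
    synchronized long set(String key, String value) {
        long cas = nextCas++;
        docs.put(key, new Versioned(value, cas));
        return cas;
    }

    synchronized Versioned get(String key) {
        return docs.get(key);
    }

    // Conditional write: succeeds only if nobody changed the document
    // since the caller read it (the CAS values still match). No lock is held
    // between the read and the write.
    synchronized boolean replaceIfUnchanged(String key, String newValue, long expectedCas) {
        Versioned current = docs.get(key);
        if (current == null || current.cas != expectedCas) {
            return false;
        }
        docs.put(key, new Versioned(newValue, nextCas++));
        return true;
    }

    public static void main(String[] args) {
        CasSketch store = new CasSketch();
        store.set("user:42", "score=10");
        long seenByA = store.get("user:42").cas; // client A reads
        long seenByB = store.get("user:42").cas; // client B reads the same version
        System.out.println(store.replaceIfUnchanged("user:42", "score=11", seenByA)); // prints true
        System.out.println(store.replaceIfUnchanged("user:42", "score=20", seenByB)); // prints false
    }
}
```

The client whose update was rejected would normally read the document again and retry with the fresh CAS value.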
27. Creating First Application
Components for your development environment:
• Couchbase Server: installed on a virtual or physical machine separate from the machine
containing your web application server. Download the appropriate version for your environment
here: http://www.couchbase.com/download
• Couchbase SDK: installed for runtime on the machine containing your web application server.
You will also need to make the SDKs available in your development environment in order to
compile/interpret your client-side code. The SDKs are programming-language and platform-
specific. You will use your SDK to communicate with the Couchbase Server from your web
application. Downloads for your chosen SDK are here: http://www.couchbase.com/develop
• Couchbase Admin Console: administering your Couchbase Server is done via the Couchbase
Admin Console, a web application viewable in most modern browsers. Your development
environment should therefore have Mozilla Firefox 3.6+, Apple Safari 5+,
Google Chrome 11+, or Internet Explorer 8+. You should enable JavaScript in
your browser.
The development languages supported by the Couchbase Client SDK Libraries are Java, .NET,
PHP, Ruby, C
28. Connecting A Bucket
• After you have your Couchbase Server up and running, and your
chosen Couchbase Client libraries installed on a web server, you
create the code that connects to the server from the client.
1. Make a new bucket request to the REST endpoint for buckets and
provide the new bucket settings as request parameters:
shell> curl -u Administrator:password \
    -d name=newBucket -d ramQuotaMB=100 -d authType=none \
    -d replicaNumber=1 -d proxyPort=11215 \
    http://localhost:8091/pools/default/buckets
29. Connecting to Couchbase Server
The following shows the basic steps for creating a connection:
• Include, import, link, or require Couchbase SDK libraries into your program
files. In the example that follows, we require 'couchbase'.
• Provide connection information for the Couchbase cluster. Typically this
includes URI, bucket ID, a password and optional parameters and can be
provided as a list or string. To avoid a failure on the initial connection, you should
provide and try at least two URLs for two different nodes. In the following
example, we provide connection information as "http://<host>:<port>/pools".
In this case there is no password required.
• Create an instance of a Couchbase client object. In the example that
follows, we create a new client instance in the client =
Couchbase.connect statement.
• Perform any database operations for your applications, such as read, write,
delete, or query.
• If needed, destroy the client, and therefore disconnect.
30. Connecting to Couchbase Server..
• The Java example below demonstrates why it is safest to provide
at least two possible node URIs when creating the initial
connection with the server. This way, if your application attempts to
connect but one node is down, the client automatically re-attempts
to connect with the second node URI:
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import com.couchbase.client.CouchbaseClient;
// Set up at least two URIs in case one server fails
// (replace each <host> with the address of a different node)
List<URI> servers = new ArrayList<URI>();
servers.add(URI.create("http://<host>:8091/pools"));
servers.add(URI.create("http://<host>:8091/pools"));
// Create a client talking to the default bucket (no password)
CouchbaseClient cbc = new CouchbaseClient(servers, "default", "");
// Read the document stored under the key "thisname"
System.err.println(cbc.get("thisname") + " is off developing with Couchbase!");
// Disconnect when finished
cbc.shutdown();