One of the most popular NoSQL databases, MongoDB is one of the building blocks for big data analysis. MongoDB can store unstructured data and makes it easy to analyze files with commonly available tools. This session will go over how big data analytics can improve sales outcomes by identifying users with a propensity to buy, based on information from social networks. All attendees will have a MongoDB instance on a public cloud, plus sample code to run big data analytics.
Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC
1.
2.
3. We offer MongoDB-as-a-Service on any cloud of your choice. You can read more about our
MongoDB-as-a-Service in the white paper on our website:
http://www.cumulogic.com/resources/mongodb_wp/
4. The goal of this boot camp is to give you hands-on experience with MongoDB
database-as-a-service: how to load data, and how a sample application analyzes that data. We
will use a small sample Twitter application for our hands-on lab, which will help you write a
MongoDB application. We will also briefly discuss a few performance-related metrics so you can
analyze and tweak the performance of your databases. At the same time, you will see how
you can easily launch a fully managed MongoDB instance in the cloud.
5. About a decade ago, business applications were transactional in nature, and most of the
issues were related to executing transactions (e.g. credit card processing) with low latency. As
a result, enterprise data was more “relational” in nature and was therefore “structured”.
The nature of business applications has changed, and enterprises are trying to figure out how
to use all the data in their enterprise systems, social media, machine logs, etc. to understand
how that data impacts their business and how they can gain a competitive advantage by
leveraging nuggets in that data.
Fast forward to today, and businesses are trying to solve a different problem. With the
diverse nature of data sources and data formats, we need newer technologies that scale and
provide answers, or identify those nuggets in the data, at much higher speed and lower cost than
traditional SQL database or data warehouse systems. Hence, we see a slew of new database
technologies being developed that promise to help solve these problems.
Depending on the nature of the data or the problem they solve, we can group these new
database technologies into three major categories: (1) document-oriented databases, which
store and crunch data in document formats, (2) key-value databases such as Riak and
Redis, and (3) graph databases. Depending on the type of data, you could use one of these
databases to solve your data analytics problems. Today, we will focus on MongoDB.
6. When should we use a NoSQL database vs. a SQL database, and which NoSQL
database?
As I mentioned before, the problems that NoSQL databases solve are related to the nature and
amount of data we want to process in our next-generation applications. We need databases
that can scale to petabytes of data at a fraction of the cost of a relational database. We need
database systems that can help us quickly analyze petabytes of data and provide results in
real time; hence the speed of data access is critical.
NoSQL database systems can provide high-speed, low-latency access to large
amounts of data. One key criterion to consider when choosing a NoSQL database is the
nature of your applications and the main issues with them: are they operational or analytical?
For example, for batch-processing, analytical apps, you may be better off with Hadoop, while for
operational issues of scalability and real-time processing, you may want to choose MongoDB.
So consider these criteria in making your decision, run some experiments, and
find the database that best fits your application needs.
7. 1. Let’s take a look at the key feature sets of MongoDB at a very high level. MongoDB is a
document-oriented database server. It stores objects as BSON (pronounced as bison), which
is a binary version of the JSON format, and it supports dynamic schemas, which essentially
means it is a schema-less database. There is no rigid SQL-like schema to store the data. This
gives flexibility in storing data types from different data sources such as social networks,
machine logs or CRM systems.
2. MongoDB supports indexing much like traditional SQL indexing, which means you can index
data on any field; fields with high cardinality improve query performance the most. (FYI: high
cardinality here means a field whose value varies across most records. For example, if we are
storing data about employees, the field that varies most is the phone number and not the city
name or company name.)
3. Most of you may be familiar with the concept of database sharding. MongoDB is a
horizontally scalable database and supports sharding, which means it stores data in smaller
chunks on several data nodes for low-latency access to the data. Hence MongoDB is widely
used in the cloud: you can scale the database by adding shards as your data grows
and maintain that low latency of data access even as the size of the data grows.
4. MongoDB is designed to be resilient for data durability and supports replica sets, which can
be geographically distributed.
5. MongoDB supports map-reduce operations and provides fast updates to the data.
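To make the schema-less and indexing points above concrete, here is a minimal sketch in plain Python. It does not talk to a MongoDB server; the employee documents and the helper names `field_cardinality` and `rank_index_candidates` are made up for illustration. It shows documents with differing fields sitting side by side, and a rough way to see which field varies most across records and is therefore the more promising index candidate.

```python
import json

# Hypothetical sample documents; note they do not all share the same fields.
employees = [
    {"name": "Ana", "city": "New York", "company": "Acme", "phone": "212-555-0101"},
    {"name": "Ben", "city": "New York", "company": "Acme", "phone": "212-555-0102"},
    {"name": "Cho", "city": "Boston", "company": "Acme", "phone": "617-555-0103",
     "twitter": "@cho"},  # extra field, no schema change needed
    {"name": "Dee", "city": "New York", "company": "Acme", "phone": "212-555-0104"},
]


def field_cardinality(docs, field):
    """Return distinct values of `field` divided by the number of documents."""
    values = [json.dumps(d[field], sort_keys=True) for d in docs if field in d]
    if not values:
        return 0.0
    return len(set(values)) / len(docs)


def rank_index_candidates(docs, fields):
    """Sort fields from most varied to least varied across the documents."""
    return sorted(fields, key=lambda f: field_cardinality(docs, f), reverse=True)


ranking = rank_index_candidates(employees, ["city", "company", "phone"])
```

With this sample data the phone number ranks first and the company name last, which matches the employee example in the notes. Real index selection should also look at which fields your queries actually filter on.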
FAQs:
Question: When do you want to use Hadoop vs. MongoDB for map-reduce?
Answer: You want to use Hadoop for batch jobs, where you can fire up analytics on
offline data, whereas you can use MongoDB for real-time data analytics.
Question: How does sharding work in MongoDB?
Answer: MongoDB sharding works by spreading writes across multiple data nodes.
The mongos process, which is MongoDB's routing process, directs each read or write to the
appropriate data node.
(Show the slide and refer to the sharding diagram.)
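The routing idea in the sharding answer can be sketched in a few lines of plain Python. This is a simplification under stated assumptions: the shard names, the `pick_shard` and `route_writes` helpers, and the hash-modulo rule are all hypothetical, and a real deployment partitions data into chunks by shard key rather than by a simple modulo. The sketch only shows that a router can decide, from the shard key alone, which data node receives a document.

```python
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names


def pick_shard(shard_key_value, shards=SHARDS):
    """Map a shard key value to one shard using a stable hash."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def route_writes(docs, key):
    """Group documents by the shard that would receive them."""
    placement = {name: [] for name in SHARDS}
    for doc in docs:
        placement[pick_shard(doc[key])].append(doc)
    return placement


# Made-up documents keyed by user name.
users = [{"user": f"user{i}"} for i in range(30)]
placement = route_writes(users, "user")
```

Because the hash is stable, a later read for the same key is routed to the same shard that took the write, so the router never has to ask every node.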
8. Since MongoDB scales very well horizontally, it is one of the most widely used databases in
the cloud. Given the complexity of managing MongoDB to maintain availability, data durability
and performance, you may want to leverage platforms that provide MongoDB-as-a-Service:
a web service call provisions a dedicated MongoDB server, fully sharded
and replicated, which scales automatically.
You will get a chance to use the MongoDB service on our platform shortly.
9. The specific MongoDB architecture that you choose will impact performance, availability
and data durability. MongoDB is flexible and supports high-availability and sharding
architectures to provide the level of redundancy, performance and SLA you want for your
service.
MongoDB supports replica set and sharding deployment architectures. Replica sets provide
high availability and data durability, while sharding provides scalability. You can configure
shards on replica sets to achieve the best of both: reliability and scalability.
10. This is a replica set with three replica nodes in two datacenters or two regions of a public
cloud.
MongoDB replicas are “eventually consistent”, which means data on the replicas may be out
of sync with the primary node. You may want to use this architecture
for data redundancy rather than for scaling. In this architecture, you still send reads and
writes to the primary node, which means that even with multiple nodes, your application
wouldn’t necessarily scale better. To maintain this level of redundancy yet improve
scalability, you can use sharding, as in the next slide.
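The replica lag described above can be illustrated with a toy model in plain Python. Everything here is an assumption for illustration: the `Node` and `ReplicaSet` classes, the node names and the list used as an operation log are made up and are not MongoDB's implementation. The sketch shows writes landing on the primary first and replicas catching up later, which is why a replica can briefly return older data.

```python
class Node:
    """One member of a hypothetical replica set, holding a key-value store."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.applied = 0  # number of log entries applied so far


class ReplicaSet:
    def __init__(self, secondary_names):
        self.primary = Node("primary")
        self.secondaries = [Node(n) for n in secondary_names]
        self.oplog = []  # ordered list of (key, value) writes

    def write(self, key, value):
        """Writes go to the primary and are appended to the operation log."""
        self.primary.data[key] = value
        self.oplog.append((key, value))
        self.primary.applied = len(self.oplog)

    def replicate(self, node, max_entries=None):
        """Apply pending log entries to a secondary, optionally only some."""
        pending = self.oplog[node.applied:]
        if max_entries is not None:
            pending = pending[:max_entries]
        for key, value in pending:
            node.data[key] = value
        node.applied += len(pending)


rs = ReplicaSet(["dc1-secondary", "dc2-secondary"])
rs.write("followers", 10)
rs.write("followers", 11)
rs.replicate(rs.secondaries[0], max_entries=1)  # this replica lags by one write
stale_value = rs.secondaries[0].data["followers"]
rs.replicate(rs.secondaries[0])  # now it catches up
fresh_value = rs.secondaries[0].data["followers"]
```

In this run the first replica returns the older value until it applies the remaining log entry, and the second replica has no data at all until it replicates. Reading from the primary avoids the stale read, at the cost of not spreading the read load.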
11. This is a three-shard deployment architecture which uses three replica sets and can be in a
single region or datacenter, or distributed geographically.
With this architecture, you get the benefit of both: data redundancy with replica sets and
high scalability with shards. Each shard can itself be a replica set, which provides data
redundancy for each shard. But keep in mind, there is an overhead to sharding and
replication, and you want to choose what’s best for your database.
12. Now let’s take a look at a sample application. We have a sample Twitter app for hands-on
experiments. We will use MongoDB-as-a-Service in the cloud and use the sample app to
analyze Twitter data.
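Before the lab, here is a rough idea of what "analyzing Twitter data" with a map-reduce style job can look like, written in plain Python rather than against a MongoDB server. This is not the lab's sample app: the tweets, the keyword list and the function names are invented for illustration. The map step emits a count for each purchase-intent keyword a user tweets, and the reduce step sums those counts per user to give a crude propensity-to-buy score.

```python
from collections import defaultdict

# Hypothetical tweets shaped like documents; real Twitter data has more fields.
tweets = [
    {"user": "alice", "text": "Thinking I want to buy a new phone this week"},
    {"user": "bob", "text": "Great game last night"},
    {"user": "alice", "text": "Any deal on tablets? Looking to buy one"},
    {"user": "carol", "text": "Need a laptop, what is the best price?"},
]

# Assumed purchase-intent keywords, chosen only for illustration.
INTENT_WORDS = {"buy", "deal", "price"}


def map_tweet(tweet):
    """Map step: emit (user, 1) for each intent keyword found in the tweet."""
    words = {w.strip(".,?!").lower() for w in tweet["text"].split()}
    return [(tweet["user"], 1) for _ in words & INTENT_WORDS]


def reduce_counts(pairs):
    """Reduce step: sum the emitted values per user."""
    totals = defaultdict(int)
    for user, count in pairs:
        totals[user] += count
    return dict(totals)


def intent_scores(docs):
    """Run the map step over all documents, then reduce the emitted pairs."""
    emitted = [pair for doc in docs for pair in map_tweet(doc)]
    return reduce_counts(emitted)


scores = intent_scores(tweets)
```

Users who never mention an intent keyword simply do not appear in the result, so the output is already a shortlist of users worth a closer look.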
13.
14. Just like any database, the performance of a MongoDB database must be monitored and
optimized for a given workload or application type.
These are the key metrics you want to look at in MongoDB: (1) CPU; (2) memory; (3) ops
counters: the total number of operations over a period of time, which shows you the number
of active and pending operations; (4) background flush: the number of disk writes when
MongoDB flushes all in-memory data to disk. You want to keep an eye on this number and
tune it if you wish to reduce the frequency of disk writes. There are other
metrics which we will see during our hands-on lab.
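One way to read the ops counters is as a rate, since a raw total only grows over time. Here is a small sketch in plain Python, assuming you have captured two cumulative counter snapshots a known number of seconds apart. The snapshot values and the counter names are made up for illustration, and `ops_per_second` is a hypothetical helper, not part of any MongoDB tool.

```python
def ops_per_second(before, after, elapsed_seconds):
    """Convert two cumulative counter snapshots into per-second rates."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return {
        name: (after[name] - before[name]) / elapsed_seconds
        for name in after
        if name in before
    }


# Made-up cumulative counters captured 10 seconds apart.
snapshot_t0 = {"insert": 1000, "query": 5000, "update": 200, "delete": 10}
snapshot_t1 = {"insert": 1500, "query": 5800, "update": 260, "delete": 10}

rates = ops_per_second(snapshot_t0, snapshot_t1, 10)
```

Comparing these rates across workloads, or before and after a configuration change, is usually more telling than a single total.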