Cloud, Big Data and NoSQL are popular buzzwords today.
This presentation shows how they all fit together.
It makes sense of all of the above and shows how these new technologies can help a business become more productive.
2. Agenda
What is the cloud
Data boom
NoSQL
Big Data
Cloud Distributions
What’s next
4. Make sense of: Cloud, Big Data and NoSQL
How they fit together
Make money!
5. What is the cloud
Cloud Computing is an Idea …
Infrastructure is provisioned by a cloud provider.
Automatic Scale.
Elasticity. Pay as you use.
Availability.
Simple, Automatic, Economic.
7. Lots of Data
Data doubles every 18 months
Pictures
Web site
emails
Sensors
Geo Information
Financial Information
Science
Art
. . . (Infinite list)
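As a rough worked example of the doubling claim above: if volume doubles every 18 months, it grows by a factor of 2^(t/18) after t months. The function name and the 1 TB starting volume below are assumptions chosen only for illustration.

```python
def projected_volume(initial_tb: float, months: float,
                     doubling_months: float = 18.0) -> float:
    """Volume after `months`, assuming it doubles every `doubling_months`."""
    return initial_tb * 2 ** (months / doubling_months)


# Starting from a hypothetical 1 TB, project a few years ahead.
for years in (0, 3, 6, 9):
    volume = projected_volume(1.0, years * 12)
    print(f"after {years} years: {volume:.0f} TB")
```

Under this assumption, data grows fourfold every three years, so a fixed-size machine falls behind quickly.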
8. No Limits
With the cloud it is now possible to mount a cluster of any size and conduct any computation at any scale.
Whoever makes sense of all the available data will rule the world.
The conclusion:
Use the cloud to analyze data at large scale.
10. Data has many forms
Yet data comes in many forms and shapes
Graphs
Time Series
Documents
Blobs
Geo
Sensors
Structured
Unstructured
Web
11. Not Relational
Not all types of data fit well into the relational world.
Not all data use cases fit well into the ACID convention
The relational model does not scale very well
Difficult to distribute
Difficult to replicate
12. The CAP Theorem
During a network partition, a distributed system must choose
either Consistency or Availability.
Sharded NoSQL
RDBMS
Replicated NoSQL
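The choice described above can be sketched with a toy two-replica store. This is a minimal illustration, not how any particular database implements it; the class names and the "CP"/"AP" mode strings are hypothetical.

```python
class Replica:
    """One copy of a single stored value."""
    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Toy store with replicas A and B; `mode` is "CP" or "AP" (hypothetical)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when A and B cannot talk to each other

    def write(self, value):
        # In this sketch, clients always write through replica A.
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot reach replica B")
        self.a.value = value
        if not self.partitioned:
            self.b.value = value  # replicate while the network is healthy

    def read_from_b(self):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm the latest write")
        return self.b.value  # in AP mode this may be stale


ap = TwoNodeStore("AP")
ap.write("v1")
ap.partitioned = True
ap.write("v2")            # accepted, but only replica A sees it
print(ap.read_from_b())   # the AP store answers with the stale "v1"
```

In "CP" mode the same sequence raises an error instead of answering, which is the availability cost of staying consistent during the partition.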
13. NoSQL
Large family of databases
No Schema
No relations enforced
Designed for high scale and distribution
Types of NoSQL databases
Key Value
Wide Columns
Documents
Graph
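To make the four families above more concrete, here is a rough sketch of how the same kind of record might be shaped in each model, using plain Python structures. The keys, field names, and the `followed_by` helper are invented for illustration and do not correspond to any product's API.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value = {"user:42": json.dumps({"name": "Ada", "city": "London"})}

# Wide column: row key -> column family -> column -> value.
# Different rows may carry different columns.
wide_column = {
    "user:42": {
        "profile": {"name": "Ada", "city": "London"},
        "logins": {"2014-01-01": "web", "2014-01-02": "mobile"},
    },
}

# Document: a nested, self-describing record with no fixed schema.
document = {"_id": 42, "name": "Ada",
            "address": {"city": "London"}, "tags": ["admin"]}

# Graph: nodes plus explicit, typed edges between them.
nodes = {42: {"name": "Ada"}, 7: {"name": "Bob"}}
edges = [(42, "FOLLOWS", 7)]


def followed_by(user_id):
    """Names of the users that `user_id` follows, found by walking edges."""
    return [nodes[dst]["name"]
            for src, rel, dst in edges
            if src == user_id and rel == "FOLLOWS"]


print(json.loads(key_value["user:42"])["name"])
print(wide_column["user:42"]["logins"]["2014-01-02"])
print(document["address"]["city"])
print(followed_by(42))
```

The shapes differ in what they make cheap: lookup by key, sparse columns per row, nested records, or relationship traversal.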
14. Motivation for NoSQL
Large Scale and Distribution
Simplicity
Low cost
Good fit with the data model
Volume, Velocity and Variety
15. Important
There is no one NoSQL solution for all use cases
There are more than 150 possible offerings…
16. The Cloud and NoSQL
All Cloud Providers have NoSQL solutions
Azure Tables
Google Bigtable
Amazon DynamoDB
NoSQL databases are deployed on a cluster
There are a large number of cloud hosting offerings for NoSQL clusters
MongoHQ (MongoDB)
Cassandra on Google Compute Engine
Many more
18. Big Data
What is Big?
“Big” means it cannot fit on a single machine.
Conclusion:
Big data has to be distributed.
19. Types of Big Data Processing
Query
General Analysis
Classification
Recommendation
Clustering
Auditing and monitoring
More…
20. Challenges
Develop a parallel algorithm
Reduce the network traffic -> bring compute to data
Monitor and manage a large number of parallel tasks
Survive failures
Performance
Linear scale
21. Batch Processing vs. Operational Intelligence
Batch Processing
Work on existing data
Provide results within minutes
Operational Intelligence
Work on a stream of data
Provide real-time results
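The contrast between the two modes can be sketched on a single metric, an average. The batch version needs all the data up front and returns one answer; the streaming version emits an updated answer after every event. Function names and sample readings are illustrative assumptions.

```python
from typing import Iterable, Iterator, List


def batch_average(stored: List[float]) -> float:
    """Batch: the data already exists; one answer after a full pass."""
    return sum(stored) / len(stored)


def streaming_average(stream: Iterable[float]) -> Iterator[float]:
    """Operational: keep a small running state and emit a result per event."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0]
print(batch_average(readings))            # 20.0
print(list(streaming_average(readings)))  # [10.0, 15.0, 20.0]
```

The streaming version never stores the full history, which is what lets it keep up with an unbounded stream.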
22. Distributed File System
No single server can store Big Data files
Distribute files across cluster
Failure is part of the game
Similar API to traditional File Systems
Examples:
HDFS
GFS
Cassandra FS
Mongo FS
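A minimal sketch of the idea above, assuming a simple round-robin placement: the file is cut into fixed-size blocks and each block is copied to several nodes, so a read still succeeds when a node fails. This is not the placement policy of HDFS or any other listed system; node names, the tiny block size, and the function names are hypothetical.

```python
def place_blocks(data: bytes, nodes: list, block_size: int = 4, replicas: int = 2):
    """Split `data` into blocks and assign each block to `replicas` nodes.

    Returns {block_index: (block_bytes, [node names])}.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for idx, block in enumerate(blocks):
        # Round-robin: consecutive nodes hold the copies of one block.
        holders = [nodes[(idx + r) % len(nodes)] for r in range(replicas)]
        placement[idx] = (block, holders)
    return placement


def read_file(placement, failed=frozenset()):
    """Reassemble the file, using any surviving replica of each block."""
    out = b""
    for idx in sorted(placement):
        block, holders = placement[idx]
        if all(holder in failed for holder in holders):
            raise IOError(f"block {idx} lost: all replicas are on failed nodes")
        out += block
    return out


placement = place_blocks(b"big data file!", ["n1", "n2", "n3"])
print(read_file(placement, failed={"n2"}))  # still readable with one node down
```

With two replicas on three nodes, any single failure is survivable; losing two nodes at once can lose a block, which is why real deployments tune the replica count.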
23. Hadoop
Big Data Analysis Platform
Batch Processing
Brings Compute tasks to data nodes
Parallel Processing using Map-Reduce
Open Source
Huge ecosystem
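The Map-Reduce model mentioned above can be sketched in a single process with the classic word count. This only illustrates the three phases (map, shuffle, reduce); it does not use Hadoop's API, and on a real cluster each phase would run in parallel across data nodes. All function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big cloud"]))
# {'big': 2, 'data': 1, 'cloud': 1}
```

Because each map call sees one line and each reduce call sees one key, both phases can be spread over many machines without coordination between tasks.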
24. Hadoop Ecosystem
Writing a valuable Map-Reduce job for Hadoop is not simple
Many open source projects provide abstractions
Pig
Hive
HBase
Sqoop
Mahout
ZooKeeper
More
25. Hadoop on the Cloud
Hadoop runs on a cluster
You can use a cluster as a service on major cloud offerings
26. Storm
Real-Time big data analytics
Process streams of data
Can be used with any programming language
Wide integration with data sources
28. Check your schema
Be open to using NoSQL data stores
Identify your use case and find the right database for you
Create a simple POC
29. Look for Big Data
Ask yourself: What can I gain from big data?
How can the new data or analysis scope enhance your existing set of capabilities?
What additional opportunities for intervention or process optimisation does it present?
Identify your use case and find the right product and data model.
Look for web distributions and create a simple POC
Consistency: A read sees all previously completed writes.
Availability: Reads and writes always succeed.
Partition tolerance: Guaranteed properties are maintained even when network failures prevent some machines from communicating with others.
https://foundationdb.com/white-papers/the-cap-theorem/
The basic idea is that if a client writes to one side of a partition, any reads that go to the other side of that partition can't possibly know about the most recent write. Now you're faced with a choice: do you respond to the reads with potentially stale information, or do you wait (potentially forever) to hear from the other side of the partition and compromise availability?