5. Lots of Data
Data doubles every 18 months
Pictures
Web sites
Emails
Sensors
Geo Information
Financial Information
Science
Art
. . . (Infinite list)
6. No Limits
With the cloud it is now possible to provision a
cluster of any size and run any computation at
any scale.
The one who will make sense of all available
data will rule the world.
The conclusion:
Use the cloud to analyze data at large scale.
8. Data has many forms
Yet data comes in many forms and shapes
Graphs
Documents
Time Series
Blobs
Geo
Sensors
Unstructured
Structured
Web
9. Problems with RDBMS
Does not scale very well
Sharding
Replication
Models data according to the relational model
Is this the best model for all data types?
Complex and Expensive
Requires a DBA
Expensive to buy
Oracle
SQL
10. No Relational
Not all types of data fit well into the relational
world.
Not all data use cases fit well into the ACID
model
The relational model does not scale very well
Difficult to distribute
Difficult to replicate
12. NO SQL
Large family of databases
No Schema
No relations enforced
Designed for high scale and distribution
Types of NO SQL DB
Key Value
Wide Columns
Documents
Graph
13. Motivation for NO SQL
Large Scale and Distribution
Simplicity
Low cost
Good fit with the data model
Volume, Velocity and Variety
14. What Is No Schema
Some data is structured, and some is not.
No SQL databases do not ENFORCE a
schema the way RDBMS systems do.
You can leverage data structure by creating
indexes and smart queries.
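A minimal sketch of the idea above, in plain Python rather than any particular database's API. Records with different shapes sit in the same collection, and an index is built over whichever records carry a given field. The `build_index` helper and the sample records are hypothetical, for illustration only.

```python
from collections import defaultdict

# Records in the same collection need not share the same fields.
records = [
    {"id": 1, "type": "picture", "country": "Israel", "width": 800},
    {"id": 2, "type": "email", "sender": "a@example.com"},
    {"id": 3, "type": "sensor", "country": "France", "reading": 21.5},
]

def build_index(docs, field):
    """Map each value of `field` to the ids of the records that have it."""
    index = defaultdict(list)
    for doc in docs:
        if field in doc:  # records without the field are simply skipped
            index[doc[field]].append(doc["id"])
    return dict(index)

# The structure that does exist (a "country" field) is still usable.
country_index = build_index(records, "country")
print(country_index)
```

Nothing rejects the email record for lacking a country; it just does not appear in that index.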
15. Types of NO SQL Databases
Key values
Wide column
Document
Graph
16. Key values
Data is organized as key-value pairs
Query by key and values
Simple indexes (by partition key)
Examples
Azure Table Storage
Amazon DynamoDB
Key1 Key2 Value1 Value2 Value3 Value4 Value5
Israel 1234 1 2 3
France 2345 4 5 8
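The table above can be sketched as an in-memory dictionary keyed by a (partition key, row key) pair. This is an illustration only, not the Azure Table Storage or DynamoDB API. The function names are hypothetical, and which value belongs to which column is an assumption, since the slide's layout does not survive.

```python
# Hypothetical in-memory key-value table: (partition key, row key) -> values.
table = {}

def put(partition_key, row_key, values):
    """Store a row; each row may carry a different set of value names."""
    table[(partition_key, row_key)] = dict(values)

def get(partition_key, row_key):
    """Point lookup by the full key; returns None when the row is absent."""
    return table.get((partition_key, row_key))

def query_partition(partition_key):
    """Return every row stored under one partition key."""
    return {rk: v for (pk, rk), v in table.items() if pk == partition_key}

# Column placement below is assumed, not taken from the slide.
put("Israel", 1234, {"Value1": 1, "Value2": 2, "Value3": 3})
put("France", 2345, {"Value1": 4, "Value2": 5, "Value5": 8})

print(get("Israel", 1234))
print(query_partition("France"))
```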
18. Wide column / Column Families
Data is organized as key-value groups
Store data by column
A column family is how the
data is stored on the disk
Query by key / key range only
No indexes (on some DBs)
Examples
Google Bigtable
Cassandra
HBase
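"Query by key / key range only" can be pictured as rows kept sorted by key, so that a range scan is a slice of the sorted keys. The sketch below uses only the Python standard library. The `SortedRows` class and the sample keys are hypothetical and do not reflect how Bigtable, Cassandra, or HBase are implemented.

```python
import bisect

class SortedRows:
    """Toy row store: rows are kept in row-key order to allow range scans."""

    def __init__(self):
        self._keys = []   # row keys, always sorted
        self._rows = {}   # row key -> {column name: value}

    def put(self, key, columns):
        if key not in self._rows:
            bisect.insort(self._keys, key)  # keep keys sorted on insert
            self._rows[key] = {}
        self._rows[key].update(columns)

    def get(self, key):
        """Point lookup by exact row key."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Return (key, columns) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

rows = SortedRows()
rows.put("user:010", {"name": "Carol"})
rows.put("user:001", {"name": "Alice"})
rows.put("user:002", {"name": "Bob", "city": "Paris"})
rows.put("video:001", {"title": "Intro"})

print(rows.scan("user:002", "user:011"))
```

There is no way here to ask for "all rows where city is Paris" without reading every row, which is the practical meaning of having no secondary indexes.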
19. Example – Cassandra Data Model
Column
Key value
Super Column
Collection of columns
Column Family
Dictionary of columns
Super Column Family
Dictionary of Column Families
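The four terms above nest inside each other, which plain Python containers can illustrate. This follows the slide's definitions literally and is not a Cassandra client API. All names and values are made up for the example.

```python
# Column: a single key and its value.
column = ("city", "Haifa")

# Super column: a collection of columns.
super_column = {"city": "Haifa", "zip": "31000"}

# Column family: a dictionary of columns.
column_family = {"name": "Dana", "city": "Haifa"}

# Super column family: a dictionary of column families.
super_column_family = {
    "home": {"city": "Haifa", "zip": "31000"},
    "work": {"city": "Tel Aviv"},  # entries need not share the same columns
}

def lookup(scf, family_name, column_name):
    """Walk two levels of nesting; returns None if either level is missing."""
    return scf.get(family_name, {}).get(column_name)

print(lookup(super_column_family, "work", "city"))
```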
23. Graph databases
Data is organized as elements and relations.
Query by relations
Supports complex graph-theoretic
computations
Examples
Neo4j
Stardog (used for the semantic web)
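"Query by relations" can be sketched with a list of (source, relation, target) edges and a traversal that follows one relation type. This is plain Python, not the query language of Neo4j or Stardog. The node names, relation names, and helper functions are hypothetical.

```python
from collections import deque

# Each edge is (source element, relation, target element).
edges = [
    ("Alice", "KNOWS", "Bob"),
    ("Bob", "KNOWS", "Carol"),
    ("Alice", "WORKS_AT", "Acme"),
]

def related(node, relation):
    """Direct neighbours of `node` along one relation type."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

def reachable(start, relation):
    """Breadth-first traversal that only follows `relation` edges."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        current = queue.popleft()
        for nxt in related(current, relation):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order

print(related("Alice", "KNOWS"))
print(reachable("Alice", "KNOWS"))
```

The second call answers "who does Alice know, directly or through others?", a question that needs repeated self-joins in a relational schema.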
24. RDF and OWL
Triple
Subject - Predicate - Object
Define facts
RDF (Resource Description Framework)
Defines some extra structure to triples.
Example: "rdf:type" is used to say that things are of certain types.
Schema:
Defines classes that represent the concepts of subjects,
objects, predicates, etc.
Enables making statements about classes of things and types of
relationships.
OWL
Adds semantics to the schema.
Expressed in triples.
Example: if "A isMarriedTo B", then this implies "B isMarriedTo A".
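The isMarriedTo example is a symmetric property (OWL expresses this as owl:SymmetricProperty). The sketch below stores facts as Python tuples and derives the implied triples by hand. It is not an RDF library or a reasoner, and the names `triples`, `symmetric_predicates`, and `infer_symmetric` are hypothetical.

```python
# Facts as (subject, predicate, object) triples.
triples = {
    ("Alice", "rdf:type", "Person"),
    ("Bob", "rdf:type", "Person"),
    ("Alice", "isMarriedTo", "Bob"),
}

# Predicates declared symmetric: "A p B" implies "B p A".
symmetric_predicates = {"isMarriedTo"}

def infer_symmetric(facts, symmetric):
    """Return the facts plus the reversed triple for each symmetric predicate."""
    inferred = set(facts)
    for subject, predicate, obj in facts:
        if predicate in symmetric:
            inferred.add((obj, predicate, subject))
    return inferred

all_facts = infer_symmetric(triples, symmetric_predicates)
print(("Bob", "isMarriedTo", "Alice") in all_facts)
```

Only the one stated marriage fact was stored; the reverse direction comes from the rule, which is the kind of semantics OWL adds on top of the triples.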
27. There is no one NO SQL solution for all
use cases
Important
There are more than 150 possible offerings…
28. Replication and Sharding
No SQL databases can span a large
cluster
Replication
Copy the data to multiple servers
Usually each data element is copied 3 times
One master, two slaves
Result: High Availability
Sharding
Split the data between servers
Horizontal partitioning of the data
Result: Horizontal scale
Replication and Sharding can be done together
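A minimal sketch of sharding and replication working together, assuming simple hash-based placement with the next servers in the list acting as replicas. Real databases use their own placement schemes. The server names, `shard_for`, and `placement` are hypothetical.

```python
import hashlib

SERVERS = ["server-0", "server-1", "server-2", "server-3", "server-4"]
REPLICAS = 3  # one master plus two slaves, as on the slide

def shard_for(key, n=len(SERVERS)):
    """Sharding: hash the key to pick the server that owns it."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n

def placement(key):
    """Replication: the owner (master) plus the next servers in the list."""
    first = shard_for(key)
    return [SERVERS[(first + i) % len(SERVERS)] for i in range(REPLICAS)]

for key in ["Israel", "France"]:
    master, *slaves = placement(key)
    print(key, "-> master:", master, "slaves:", slaves)
```

Different keys land on different masters (horizontal scale), while each key still lives on three servers (high availability).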
29. The Cloud and NO SQL
All Cloud Providers have NO SQL solutions
Azure Tables
Google Bigtable
Amazon DynamoDB
NO SQL Databases are deployed on a cluster
There is a large number of cloud hosting offerings for
No SQL clusters
MongoHQ (MongoDB)
Cassandra on Google Compute engine
Many more
Consistency: A read sees all previously completed writes.
Availability: Reads and writes always succeed.
Partition tolerance: Guaranteed properties are maintained even when network failures prevent some machines from communicating with others.
https://foundationdb.com/white-papers/the-cap-theorem/
The basic idea is that if a client writes to one side of a partition, any reads that go to the other side of that partition can't possibly know about the most recent write. Now you're faced with a choice: do you respond to the reads with potentially stale information, or do you wait (potentially forever) to hear from the other side of the partition and compromise availability?
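The choice described above can be sketched with two replicas and a flag that simulates the network partition. This is a toy model, not how any real database handles partitions. `TwoNodeStore`, `Unavailable`, and the `prefer_consistency` switch are hypothetical names, and raising an error stands in for "waiting, potentially forever".

```python
class Unavailable(Exception):
    """Raised when the store refuses to answer rather than risk a stale read."""

class TwoNodeStore:
    def __init__(self, prefer_consistency):
        self.a = {}               # replica on one side of the network
        self.b = {}               # replica on the other side
        self.partitioned = False  # True while the two sides cannot talk
        self.prefer_consistency = prefer_consistency

    def write_to_a(self, key, value):
        self.a[key] = value
        if not self.partitioned:
            self.b[key] = value   # replicate only while the link is up

    def read_from_b(self, key):
        if self.partitioned and self.prefer_consistency:
            # Give up availability: b cannot know about writes made on a.
            raise Unavailable("cannot confirm the latest write during a partition")
        # Give up consistency: answer with whatever b has, possibly stale.
        return self.b.get(key)

store = TwoNodeStore(prefer_consistency=False)
store.write_to_a("x", 1)
store.partitioned = True
store.write_to_a("x", 2)
print(store.read_from_b("x"))  # b never saw the second write
```

With `prefer_consistency=True` the same read raises `Unavailable` instead of returning the old value.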