2. The evolution of data stores
• Data modeling
• Data from the Developer’s standpoint
• Data from the DBA’s standpoint
• Impedance mismatch and the rise of ORM
Aswani Vonteddu
3. Hierarchical object graph model
Source: A co-Relational Model of Data for Large Shared Data Banks, by Erik Meijer and Gavin Bierman
4. Normalized for tables in RDBMS
Source: A co-Relational Model of Data for Large Shared Data Banks, by Erik Meijer and Gavin Bierman
5. Data – Summary
• In order to use an RDBMS,
– Designer has to model data into tables
– Developer must normalize/de-normalize
– DBA has to speed up queries
6. Impedance mismatch and the rise of ORMs (like Hibernate)
[Table(name="Products")]
class Product
{
    [Column(PrimaryKey=true)]int ID;
    [Column]string Title;
    [Column]string Author;
    [Column]int Year;
    [Column]int Pages;
    private EntitySet<Rating> _Ratings;
    [Association(Storage="_Ratings",
                 ThisKey="ID",
                 OtherKey="ProductID",
                 DeleteRule="ONDELETECASCADE")]
    ICollection<Rating> Ratings{ ... }
    private EntitySet<Keyword> _Keywords;
    […]
    ICollection<Keyword> Keywords{ ... }
}

[Table(name="Keywords")]
class Keyword
{
    [Column(PrimaryKey=true)]int ID;
    [Column]string Keyword;
    [Column(IsForeignKey=true)]int ProductID;
}

[Table(name="Ratings")]
class Rating
{
    [Column(PrimaryKey=true)]int ID;
    [Column]string Rating;
    [Column(IsForeignKey=true)]int ProductID;
}
Source: A co-Relational Model of Data for Large Shared Data Banks, by Erik Meijer and Gavin Bierman
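As a rough illustration of the mismatch, the sketch below flattens one hierarchical product object into rows for the three tables named in the mapping above. It is written in Python for brevity. The `normalize` helper and all sample values are made up for illustration; only the table and column names come from the slide.

```python
from itertools import count

def normalize(product, rating_ids, keyword_ids):
    """Split one nested product object into rows for three flat tables."""
    # The parent row keeps only the scalar columns.
    product_row = {k: v for k, v in product.items()
                   if k not in ("Ratings", "Keywords")}
    # Each child row carries a ProductID foreign key back to the parent.
    rating_rows = [{"ID": next(rating_ids), "Rating": r, "ProductID": product["ID"]}
                   for r in product["Ratings"]]
    keyword_rows = [{"ID": next(keyword_ids), "Keyword": k, "ProductID": product["ID"]}
                    for k in product["Keywords"]]
    return product_row, rating_rows, keyword_rows

# Hypothetical sample object; the values are placeholders.
product = {
    "ID": 1, "Title": "Example Title", "Author": "Example Author",
    "Year": 2012, "Pages": 100,
    "Ratings": ["good", "average"],
    "Keywords": ["data", "storage"],
}

product_row, rating_rows, keyword_rows = normalize(product, count(1), count(1))
```

The developer thinks in terms of the nested `product`; the database stores the three flat row sets, and an ORM exists to translate between the two.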
7. So what is Big Data?
• Sources
• Applications
• Technologies
8. What is Big Data?
• It is not a technology in itself.
• It is information about everything that is
happening around us, everywhere and every
minute
• Almost all of us have already contributed to Big Data,
with or without our knowledge, and we will
continue to do so.
• Unstructured
10. Sources
• Clickstream
• Tweets
• Facebook: pictures and comments
• Sensors
A Boeing 737 generates 240 TB of data
during a single cross-country flight.
13. Setting up a Big Data platform
• A Big Data platform must be equipped
with technologies for the following stages
of data processing:
• Acquisition
• Organization
• Analysis
14. Technologies
• Acquisition
– NoSQL databases (DynamoDB, Cassandra)
• Very high speed writes
• Organization & Analysis
– Map Reduce (Apache Hadoop)
• Code moves to the data, not the other way around
• A Map function and a Reduce function together
perform the desired analysis
15. NoSQL and why now?
• RDBMSs must ensure ACID properties
• The CAP theorem says that a distributed system
cannot guarantee all three of Consistency,
Availability and Partition tolerance at the
same time
• NoSQL databases are distributed, and are
better options than an RDBMS for applications
that can tolerate giving up one of those
properties.
16. Relational Databases
• Random disk access
• Data model is fully structured and
predefined
• Shared Everything architecture – Single
point of failure
22. Cassandra
[Figure: ring of four nodes N1 to N4; N1 is the coordinator, N2 the responsible node, N4 a replica node]
Steps shown:
1. ConsistencyLevel.ONE
2. Write request (sent to both N2 and N4)
3. Success
23. Cassandra
[Figure: ring of four nodes N1 to N4; N1 is the coordinator, N2 the responsible node, N4 a replica node]
Steps shown:
1. ConsistencyLevel.ONE
2. Write request (sent to both N2 and N4)
3. Success
4. Success
24. Cassandra
[Figure: ring of four nodes N1 to N4; N1 is the coordinator, N2 the responsible node, N4 a replica node]
Steps shown:
1. ConsistencyLevel.TWO
2. Write request (sent to both N2 and N4)
3 or 4. Success (shown twice)
25. Cassandra
[Figure: ring of four nodes N1 to N4; N1 is the coordinator, N2 the responsible node, N4 a replica node]
Steps shown:
1. ConsistencyLevel.TWO
2. Write request (sent to both N2 and N4)
3 or 4. Success (shown twice)
5. Success
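The consistency-level idea in these four slides can be sketched in a few lines of Python. This is a simplified, sequential model and not Cassandra's actual code: the `Node` class, `coordinate_write` function and the `ONE`/`TWO` constants are hypothetical names, and a real coordinator would answer the client as soon as enough acknowledgements arrive instead of waiting for every replica.

```python
class Node:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def write(self, key, value):
        """Store the value and acknowledge; a node that is down never acknowledges."""
        if not self.up:
            return False
        self.data[key] = value
        return True

ONE, TWO = 1, 2  # number of acknowledgements required for success

def coordinate_write(replicas, key, value, consistency_level):
    """Forward the write to all replicas; succeed if enough of them acknowledge."""
    acks = sum(1 for node in replicas if node.write(key, value))
    return acks >= consistency_level

n2 = Node("N2")            # responsible node
n4 = Node("N4", up=False)  # replica node, assumed down in this example

ok_one = coordinate_write([n2, n4], "k", "v", ONE)  # one ack is enough
ok_two = coordinate_write([n2, n4], "k", "v", TWO)  # needs both acks
```

With one replica down, the write at `ONE` succeeds while the same write at `TWO` fails, which is the trade-off between availability and consistency that the level controls.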
26. Cassandra
• Write operation:
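– Appended to the commit log for durability
– Written to the Memtable, an in-memory storage structure
(kept sorted by key)
– Memtable is flushed to an SSTable on disk when full
– Compaction merges SSTables in the background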
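A minimal Python sketch of this write path is shown below, under heavy simplification. The `Store` class and its `memtable_limit` parameter are hypothetical, plain dictionaries stand in for the memtable and the on-disk SSTables, and compaction is triggered by hand.

```python
class Store:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in memory: key -> (timestamp, value)
        self.sstables = []     # stand-ins for immutable on-disk tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))  # 1. commit log first
        self.memtable[key] = (timestamp, value)          # 2. then the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                                 # 3. spill to an SSTable

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        # 4. Merge all SSTables, keeping the newest timestamp for each key.
        merged = {}
        for table in self.sstables:
            for key, (ts, value) in table.items():
                if key not in merged or ts > merged[key][0]:
                    merged[key] = (ts, value)
        self.sstables = [dict(sorted(merged.items()))]

store = Store(memtable_limit=2)
store.write("a", "v1", timestamp=1)
store.write("b", "v1", timestamp=2)  # memtable full, first SSTable written
store.write("a", "v2", timestamp=3)
store.write("c", "v1", timestamp=4)  # second SSTable written
store.compact()                      # two SSTables merged into one
```

Writes never update an SSTable in place, so the older version of `a` survives on disk until compaction discards it.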
27. Cassandra
• Read operation:
– Coordinator node forwards the request
• To the responsible node
• And to replica nodes, based on the consistency level
requested
– Each node
• Looks up the key in the Memtable and all existing SSTables
• Takes the version with the latest timestamp.
– Bloom filters speed this up by letting a node skip
SSTables that cannot contain the key
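The per-node lookup can be sketched as follows. This is an illustrative Python model, not Cassandra's implementation: `BloomFilter`, `SSTable` and `read` are hypothetical names, the filter size and hash scheme are arbitrary choices, and dictionaries stand in for on-disk files.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

class SSTable:
    def __init__(self, rows):
        self.rows = dict(rows)  # key -> (timestamp, value)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)

def read(key, memtable, sstables):
    """Collect every stored version of the key and return the newest value."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    skipped = 0
    for table in sstables:
        if not table.bloom.might_contain(key):
            skipped += 1  # the filter rules this table out, no lookup needed
            continue
        if key in table.rows:
            candidates.append(table.rows[key])
    newest = max(candidates, key=lambda c: c[0]) if candidates else None
    return (newest[1] if newest else None), skipped

memtable = {"a": (3, "v2")}
sstables = [SSTable({"a": (1, "v1"), "b": (2, "v1")}), SSTable({"c": (4, "v1")})]
value, skipped = read("a", memtable, sstables)  # newest version of "a" wins
```

A Bloom filter can report false positives but never false negatives, so skipping a table on a negative answer is always safe.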
28. Cassandra
Indexes:
• Primary index (on the key)
supported by default by the
Cassandra engine
• Secondary indexes are
built as a new column family
with the column of interest
as the key
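The hand-built secondary index described here can be sketched with two dictionaries standing in for column families. The names `users`, `users_by_city`, `put_user` and `find_by_city`, and the `city` column, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

users = {}                        # primary column family: row key -> columns
users_by_city = defaultdict(set)  # index column family: city -> row keys

def put_user(user_id, columns):
    """Write the row and keep the hand-built index in step with it."""
    old = users.get(user_id)
    if old is not None:
        users_by_city[old["city"]].discard(user_id)  # drop the stale index entry
    users[user_id] = columns
    users_by_city[columns["city"]].add(user_id)

def find_by_city(city):
    """Look up row keys in the index, then fetch the rows by primary key."""
    return [users[uid] for uid in sorted(users_by_city.get(city, ()))]

put_user("u1", {"name": "A", "city": "Seattle"})
put_user("u2", {"name": "B", "city": "Seattle"})
put_user("u1", {"name": "A", "city": "Austin"})  # u1 moves, index is updated

in_seattle = find_by_city("Seattle")
in_austin = find_by_city("Austin")
```

The cost of this design is that every write touches two column families, and the application is responsible for keeping them consistent.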
29. Document DBs
• Similar to Key-Value stores, but Values
are often documents (JSON, ION, …)
• Documents are versioned
• Example: DynamoDB
30. Map Reduce
• Introduced by Google
• List processing system
• Scales to clusters with thousands of nodes
and to petabytes or exabytes of data
• Code is taken to the data, not the other way around
• Data must be split into disjoint partitions
• Maps the function onto the nodes where the data
resides
• And Reduces the results from all nodes to build
the final result
• Example: Hadoop
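The Map and Reduce steps can be illustrated with a word count, sketched here in plain Python on a single machine. This is not Hadoop's API: `map_fn`, `shuffle`, `reduce_fn` and `map_reduce` are hypothetical names, and the two strings stand in for disjoint data partitions that would live on different nodes.

```python
from collections import defaultdict

def map_fn(chunk):
    """Runs where the chunk lives; emits (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Group the emitted values by key across all mapper outputs."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def map_reduce(chunks):
    mapped = [map_fn(chunk) for chunk in chunks]  # one call per partition
    grouped = shuffle(mapped)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

chunks = ["big data is big", "data about data"]  # disjoint partitions
counts = map_reduce(chunks)
```

Because each `map_fn` call only sees its own partition, the calls can run in parallel on the nodes that hold the data; only the small intermediate pairs travel over the network.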
32. Big Data talent
• Deep analytical
– Mathematicians, operations research
analysts, statisticians, ...
• Big data savvy
– Business and functional managers;
budget, credit and financial
analysts
• Supporting Technology
– DBAs, System & Network administrators, and
Programmers
33. The DBA’s role here?
• Tremendous opportunity for the DBAs
• Like in the early 90’s when businesses
migrated from mainframes to Oracle/SQL
Server/DB2
• Where?
– Data modeling:
With vast amounts of data, redesigning a DHT
(distributed hash table) is many times harder than
redesigning an RDBMS, since data migration is painful
34. References
[1] McKinsey, Big data: The next frontier for
innovation, competition and productivity
[2] IDC, The rise of Big Data: Managing, Storing and gaining
value from endless information
• Others
– http://slidesha.re/LF8umk
– http://slidesha.re/LF8vGY
Editor's Notes
Industries: Healthcare, Telecommunications, Retail, Manufacturing, Public sector