Talk to Techmeetup Aberdeen on Big Data and NoSQL
Some links seem to be missing from the onscreen presentation, particularly http://www.dbshards.com/dbshards/ for the sharding diagram
2. Who am I ?
Lecturer at University of Dundee
Program director of Business Intelligence and new
program Data Science (http://goo.gl/ljl6N and
http://goo.gl/uwHSi )
Geek and Hacker
4. From evil Wikipedia
“In information technology, big data[1] consists of
datasets that grow so large that they become awkward
to work with using on-hand database management
tools.”
Which doesn’t tell us much
Any definition that relies on data “size” will become
obsolete very quickly as data storage capabilities grow.
5. Let’s try something different
The Three V’s
Volume
How Big is the data, Terabytes ? Petabytes?
Variety
Is it the same sort of data, what about blobs ? Does it
change ?
Velocity
How fast is it coming in ? Can we store it fast enough
and then use it ?
http://nosql.mypopescu.com/post/5547192335/bigdata-the-three-vs-volume-variety-velocity
6. The Twitter problem
Twitpocalypse
Overflow of status ids for 32 bit signed integers
But beyond that, can we physically store data fast
enough ?
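As a small illustration of the overflow behind the Twitpocalypse, the sketch below shows what happens when an id one past the largest 32 bit signed value is squeezed into 32 bits. The helper name as_int32 is made up for this example; it is not anything from Twitter's code.

```python
# Largest value a 32 bit signed integer can hold.
INT32_MAX = 2**31 - 1


def as_int32(value):
    """Interpret the low 32 bits of value as a signed 32 bit integer."""
    low_bits = value & 0xFFFFFFFF
    return int.from_bytes(low_bits.to_bytes(4, "big"), "big", signed=True)


last_safe_id = INT32_MAX
next_id = last_safe_id + 1

print("last id that fits:", as_int32(last_safe_id))
print("next id, truncated to 32 bits:", as_int32(next_id))
```

The next id wraps round to a large negative number, which is why clients that stored status ids in 32 bit signed integers broke.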
7. Suppose we are storing 16 columns of 16 bytes
At 100 per second
0.7 Terabyte per year
And at 1 million per second that’s
7 petabytes per year
This is volume
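The arithmetic on this slide can be checked with a few lines of Python. This is a rough sketch that assumes a 365 day year and no storage overhead (no indexes, no replication); the function names are just for this example.

```python
ROW_BYTES = 16 * 16                      # 16 columns of 16 bytes each
SECONDS_PER_YEAR = 365 * 24 * 60 * 60    # assumes a 365 day year


def bytes_per_year(rows_per_second):
    """Raw bytes written in a year at a steady insert rate."""
    return rows_per_second * ROW_BYTES * SECONDS_PER_YEAR


def in_binary_units(n_bytes, power):
    """Convert bytes to units of 1024**power (4 = TiB, 5 = PiB)."""
    return n_bytes / 1024 ** power


slow = bytes_per_year(100)
fast = bytes_per_year(1_000_000)

print("100 rows/s:       %.2f TiB per year" % in_binary_units(slow, 4))
print("1 million rows/s: %.2f PiB per year" % in_binary_units(fast, 5))
```

With binary units (powers of 1024) this gives about 0.73 and 7.2, matching the slide; with decimal units (powers of 1000) the same byte counts come out at about 0.81 TB and 8.1 PB.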
8. Variability
Data is sparse and can be different sizes
Over time the type of data changes
Consider click through data, as pages evolve new data
types and fields need to be stored
10. We need UDF
User Defined Functions inside the DB
Or a different way of dealing with it, such as Hadoop
or MRSQL.
11. So what is NoSQL ?
Throws away everything you know about Databases
Is a family of different databases
Lots of different “products”
BUT !
http://nosql.mypopescu.com/post/1016320617/mongodb-is-web-scale (warning might offend)
They should only be used when it’s sensible, they are
not magic sauce.
12. NoSQL types
Key-Value
Column-family
Document databases
Allow sharding across nodes
Graph
Fast for graph-like data and operations
14. Sharding ?
Distribution of data across nodes
Allows performance to be spread across multiple
machines
SQL databases can be sharded
Not all NoSQL databases can be sharded
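A minimal sketch of the idea, assuming the simplest possible scheme: hash the key and take it modulo the number of shards. The names shard_for and distribute are invented for this example, and real systems usually use something more elaborate so that adding a node does not move most of the keys.

```python
import hashlib


def shard_for(key, num_shards):
    """Pick a shard for a key by hashing it (stable across runs and machines)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard they would live on."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


users = ["jsmith", "jbrown", "acobley", "alice", "bob", "carol"]
for shard, members in distribute(users, 3).items():
    print("shard", shard, "->", members)
```

Each machine only has to store and serve the keys that hash to it, which is how the load gets spread.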
15. CAP Theorem
CAP (or Brewer’s) theorem says:
It’s impossible for a web service to provide all three of the
following guarantees at the same time
Consistency
Availability
Partition tolerance
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf
But see : http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
and http://codahale.com/you-cant-sacrifice-partition-tolerance/
17. Partitions ?
Essentially failing to achieve consistency within a set
time causes a partition.
You can sacrifice availability to ensure consistency
Partitions are rare and if you have one server, almost
never happen
Partitions are caused by network failures and failed nodes
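One way to picture "sacrifice availability to ensure consistency" is a write that refuses to go ahead unless a majority of replicas can be reached. This is only a toy sketch: the replicas are plain dictionaries, and quorum_write and Unavailable are names invented here, not part of any database's API.

```python
class Unavailable(Exception):
    """Raised when too few replicas are reachable to keep the data consistent."""


def quorum_write(replicas, reachable, key, value):
    """Write to the reachable replicas, but only if they form a majority."""
    quorum = len(replicas) // 2 + 1
    targets = [i for i in range(len(replicas)) if i in reachable]
    if len(targets) < quorum:
        # Refuse the write: the service is unavailable, but never inconsistent.
        raise Unavailable("only %d of %d replicas reachable" % (len(targets), len(replicas)))
    for i in targets:
        replicas[i][key] = value
    return len(targets)


replicas = [{}, {}, {}]
print("acks:", quorum_write(replicas, {0, 1}, "jsmith", "v1"))

try:
    # A partition leaves only one replica reachable.
    quorum_write(replicas, {2}, "jsmith", "v2")
except Unavailable as error:
    print("write rejected:", error)
```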
18. Eventual Consistency
Eventually all nodes will tell the same story
Isn’t this a mad idea ?
Facebook (Actually not)
The Internet is based on an Eventual Consistency DB
DNS
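A toy sketch of "eventually all nodes will tell the same story", assuming a last-write-wins rule based on timestamps. The two replicas are plain dictionaries and the write and sync helpers are invented for this example; real systems differ in how they detect and repair differences.

```python
def write(replica, key, value, timestamp):
    """Store the value unless the replica already holds a newer one."""
    current = replica.get(key)
    if current is None or timestamp > current[0]:
        replica[key] = (timestamp, value)


def sync(a, b):
    """Exchange data so both replicas keep the newest value for every key."""
    for key in set(a) | set(b):
        candidates = [r[key] for r in (a, b) if key in r]
        newest = max(candidates, key=lambda entry: entry[0])
        a[key] = newest
        b[key] = newest


node1, node2 = {}, {}
write(node1, "status", "at techmeetup", timestamp=1)
write(node2, "status", "gone home", timestamp=2)

print("before sync:", node1 == node2)
sync(node1, node2)
print("after sync: ", node1 == node2, node1["status"][1])
```

Between the writes and the sync the two nodes disagree, and a reader may see either answer; after the sync both return the newest value.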
20. Network topology of a Cassandra db
Multiple nodes
Cassandra can be Rack Aware
Keys are replicated across nodes
It’s essentially a DHT (Distributed Hash Table)
Think BitTorrent
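To make the DHT idea concrete, here is a minimal sketch of a hash ring with replication. It is not Cassandra's actual placement code and it ignores rack awareness: the Ring class, the node names and the use of MD5 for tokens are all assumptions made for this example.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = replication_factor
        # Each node owns one position on the ring, sorted by token.
        self.ring = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas_for(self, key):
        """First node clockwise from the key's token, plus the nodes after it."""
        start = bisect.bisect_left(self.tokens, token(key))
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=2)
for key in ["jsmith", "jbrown"]:
    print(key, "->", ring.replicas_for(key))
```

Any node can work out where a key lives from the key alone, so there is no central lookup table to become a bottleneck.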
21. CQL
Version 0.8 introduced CQL (Cassandra Query Language)
Almost looks like SQL !
http://crlog.info/2011/09/17/cassandra-query-language-cql-v2-0-reference/ Language ref
http://www.datastax.com/docs/0.8/dml/using_cql
22. Demo
Start Cassandra
Open CQLSH
Create Keyspace
Create a columnfamily
Now we can insert !
23. So why does this work ?
Jsmith
Password: ch@ngem3a
Jbrown
Gender: Male
Phone: 01382 345078
Column store, keys with name: value pairs underneath
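The layout on this slide can be modelled as a dictionary of dictionaries: a row key, with name: value pairs underneath. The ColumnFamily class below is a toy written for this example, not a Cassandra client; it only shows why rows in the same column family can carry different columns.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy model: row key -> {column name: value}."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def insert(self, key, **columns):
        # Inserting into an existing row just adds or overwrites columns.
        self.rows[key].update(columns)

    def select_all(self):
        return {key: dict(columns) for key, columns in self.rows.items()}


users = ColumnFamily()
users.insert("jsmith", password="ch@ngem3a")
users.insert("jbrown", gender="male")
users.insert("jbrown", phone="01382 345078")

for key, columns in users.select_all().items():
    print(key, columns)
```

The jsmith row has only a password and the jbrown row has only gender and phone; no row stores empty placeholders for the columns it lacks.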
24. Interfacing to Cassandra
Based on Thrift
http://thrift.apache.org/
Large number of Languages supported
http://wiki.apache.org/cassandra/ClientOptions
I’ve used Java and Hector
http://prettyprint.me/
Although there is a C# version
http://hectorsharp.com/
25. Cassandra JDBC
Very new, difficult to know how stable it is
Needs compiling, and needs libraries that are not shipped with Cassandra !
http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/
26. Astyanax
From Netflix
Based on Hector but said to be a lot simpler!
https://github.com/Netflix/astyanax/wiki
27. jBloggyAppy a demo app of Cassandra
All Source code on Github
https://github.com/acobley/jBoggyAppy
Feel free to use and abuse
Simple blogging App
28. A word on using OpenSource software
Versioning !
Things Change !
Documentation is wrong !
http://prettyprint.me/
You end up reading the unit tests to actually program.
29. One Last thing
Dundee DDD 17th November, Big Data track
Anyone interested in speaking ?
Larry Ellison must be mad that his “free” software MySQL is used on the biggest website in the world.
create keyspace test with strategy_class = 'SimpleStrategy' AND strategy_options:replication_factor=1;
use test;
create columnfamily users (KEY varchar Primary key, password varchar, gender varchar);
INSERT INTO users (KEY, password) VALUES ('jsmith', 'ch@ngem3a');
Select * from users;
INSERT INTO users (KEY, gender) VALUES ('jbrown', 'male');
INSERT INTO users (KEY, phone) VALUES ('jbrown', '01382 345078');
What are we going to get ?