Data storage: which system for which use?

#DevoxxMA @zouheircadi

ABOUT ME
• @ZouheirCADI
•  JEE architect (big data, perf., quality, ops, app, …)
•  Lecturer at ENST
• Co-organizer of Devoxx France
• (former …) Co-organizer of the Paris Java User Group

AGENDA
• Review of storage systems (OLAP and OLTP)
• RDBMS
• OLAP (Hadoop et Spark)
• OLTP
•  Key-Value : memcached
•  Document : couchdb
•  Column family
•  Search
• Conclusion

Why ?
• Share data
• Many users
• Expose a datamodel
• An organized one ?
• Scalability
• Depending on users or data processing
• Flexibility
• Embrace change

RDBMS

Key date
• 80s

RDBMS
Relational Database Management Systems
were invented to let you use one set of data
in multiple ways, including ways that are
unforeseen at the time the database is built
and the 1st applications are written.

Curt Monash, analyst/blogger

RDBMS
• Relational databases organize data in tables
• Which are made of many rows.
• Each row has data in each of several columns (every row in a table has the same columns)
• Relationships are implicit
Emp
empno  ename  job        deptno
7839   King   President  10
7698   Blake  Manager    20
Dept
deptno  dname    loc
10      Account  NY
20      Sales    CHI

RDBMS – KEY CONCEPTS

1st: Physical data independence
[Figure: physical files (accessed with fopen, fseek, fread) vs. logical model]
© http://www.slideshare.net/billhoweuw/dataintensive-scalable-science

2nd: Relational algebra
• Select, Project, Join
• Union, Intersection, Difference
© http://www.slideshare.net/billhoweuw/dataintensive-scalable-science
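A minimal sketch of the three core operators (select, project, join), applied to the Emp and Dept rows from the earlier slide. The in-memory lists and the helper function names are illustrative assumptions, not part of any database API.

```python
# Hypothetical in-memory relations mirroring the Emp and Dept tables.
emp = [
    {"empno": 7839, "ename": "King", "job": "President", "deptno": 10},
    {"empno": 7698, "ename": "Blake", "job": "Manager", "deptno": 20},
]
dept = [
    {"deptno": 10, "dname": "Account", "loc": "NY"},
    {"deptno": 20, "dname": "Sales", "loc": "CHI"},
]


def select(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Projection: keep only the named columns."""
    return [{c: row[c] for c in columns} for row in rows]


def join(left, right, key):
    """Join on a single shared column (simple nested iteration)."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Same question as "SELECT ename, dname ... WHERE ename = 'King'".
result = project(
    select(join(emp, dept, "deptno"), lambda row: row["ename"] == "King"),
    ["ename", "dname"],
)
print(result)
```

Composing the operators as plain functions shows why the algebra matters: the same result can be obtained by filtering before or after the join, which is the freedom a query optimizer exploits.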

RDBMS
• Logical expression of queries
SELECT e.ename, d.dname
FROM EMP e
JOIN DEPT d ON e.deptno = d.deptno
WHERE e.ename = 'King'

[Figure: two execution plans. Plan 1: Table scan + Table scan → Hash match → Select. Plan 2: Table scan + Table scan → Nested loops → Select.]
Query 1:
Select T1.Col2
From Table1 T1
Inner Join Table2 T2 ON T1.Col1 = T2.Col1
Query 2:
Select T1.Col2
From Table1 T1
Inner Join Table2 T2 ON T1.Col1 = T2.Col1
Where T1.Col1 = 1
© https://sqlcommitted.wordpress.com/tag/hash-match-join/
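The two physical join strategies named in the plans can be sketched as follows. This is an illustrative toy, assuming rows are Python dicts and that both strategies must return the same logical result; it does not reproduce how any specific database engine implements them.

```python
def nested_loop_join(left, right, key):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]


def hash_join(left, right, key):
    """Build a hash table on one input, then probe it with the other."""
    buckets = {}
    for r in right:
        buckets.setdefault(r[key], []).append(r)
    return [(l, r) for l in left for r in buckets.get(l[key], [])]


# Hypothetical sample rows for Table1 and Table2.
table1 = [
    {"Col1": 1, "Col2": "x"},
    {"Col1": 2, "Col2": "y"},
    {"Col1": 1, "Col2": "z"},
]
table2 = [{"Col1": 1}, {"Col1": 3}]

nested = nested_loop_join(table1, table2, "Col1")
hashed = hash_join(table1, table2, "Col1")
print([l["Col2"] for l, _ in hashed])
```

The query text is identical in both cases; only the physical operator differs, which is the point of separating the logical expression of a query from its execution plan.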

Transaction
• Atomicity: transactions are all or nothing
• Consistency: only valid data is saved
• Isolation: transactions do not affect each other
• Durability: written data will not be lost
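Atomicity and consistency can be observed with a small sketch using Python's standard sqlite3 module. The account table, the names and the amounts are hypothetical; the CHECK constraint stands in for "only valid data is saved".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would make alice negative
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

In the failing transfer the credit to bob is executed first and then undone by the rollback, so no partial update survives.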

Indexes
• Easy to produce
• Easy to use

Scalability
• Vertical scalability (scale up/down)
• More resources to a single node

Scalability
• Horizontal scalability (scale out/in)
• Add more nodes to a system

Shortcomings
• Scalability (almost not scalable …)
• SPOF (single point of failure)
• Difficult to serve users worldwide

NoSQL
• Not Only SQL
• Nothing to do with SQL
• Relaxation of transaction constraints in distributed systems
• CAP

CAP
• Consistency
•  Every read receives the most recent write or an error
• Availability
•  Every request receives a response, without guarantee that it contains the most recent version
• Partition tolerance
•  The system continues to operate despite arbitrary partitioning due to network failures
•  When a partition occurs, if operations are allowed to proceed, you might sacrifice consistency
•  If they are not, you might sacrifice availability
• NoSQL may sacrifice consistency
https://en.wikipedia.org/wiki/CAP_theorem

NoSQL
• More pragmatically
• Partitioning (load distribution)
• Replication (fault tolerance)
• Horizontal scalability
•  On commodity hardware
• Simple API
•  OLTP
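Partitioning and replication can be illustrated with a simple placement function. This is a sketch under stated assumptions: the node names and the replication factor are hypothetical, and the modulo placement is a simplification (real systems often use consistent hashing so that adding a node moves fewer keys).

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster
REPLICAS = 2  # hypothetical replication factor


def owners(key, nodes=NODES, replicas=REPLICAS):
    """Return the nodes that store `key`: a primary plus its successors."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    primary = int(digest, 16) % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(replicas)]


placement = {key: owners(key) for key in ["user:1", "user:2", "user:3"]}
for key, where in placement.items():
    print(key, "->", where)
```

Hashing the key spreads the load across nodes (partitioning), while storing each key on more than one node lets the data survive the loss of a single machine (replication).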

Key dates
• October 2003: GFS paper released
• December 2004: "MapReduce: Simplified Data Processing on Large Clusters" paper
• January 2006: Hadoop created
• October 2006: 600-machine Hadoop cluster at Yahoo
• April 2007: 1000-machine Hadoop cluster at Yahoo
https://en.wikipedia.org/wiki/Apache_Hadoop

Map Reduce
• MR
• Abstraction
• Programming model
• Implementations
• Open source
•  Hadoop
•  Less well known: CouchDB, Infinispan, Riak
• Proprietary: Google

Map Reduce
• MapReduce is
• a high level programming model
• and an associated implementation
• for processing and generating large data sets
• with a parallel, distributed algorithm on a cluster.
© https://en.wikipedia.org/wiki/MapReduce

[Figure: several map() tasks emit <key,value> pairs that are consumed by reduce() tasks]

Word count example
Input (one record per line):
devoxx morocco
devoxx france
devoxx poland
great conference
great conference
devoxx taroudant
Map (emit <word,1>):
devoxx,1 morocco,1
devoxx,1 france,1
devoxx,1 poland,1
great,1 conference,1
great,1 conference,1
devoxx,1 taroudant,1
Shuffle (group by key):
devoxx,1 devoxx,1 devoxx,1 devoxx,1
morocco,1
france,1
poland,1
great,1 great,1
conference,1 conference,1
taroudant,1
Reduce (sum per key):
devoxx,4
morocco,1
france,1
poland,1
great,2
conference,2
taroudant,1
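The map, shuffle and reduce stages of this word count can be sketched in a few lines. This is a single-process illustration of the programming model, not Hadoop code; the function names are assumptions chosen for readability.

```python
from collections import defaultdict

lines = [
    "devoxx morocco", "devoxx france", "devoxx poland",
    "great conference", "great conference", "devoxx taroudant",
]


def map_phase(line):
    """Emit a <word, 1> pair for every word of one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster each call to map_phase and reduce_phase can run on a different machine; only the shuffle requires moving data between them.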

Hadoop structure
• Data storage : HDFS
• Data processing : MAP REDUCE

Hadoop ecosystem

Hadoop conclusion
• Read-only data with simple processing
• Map-Reduce
• Move computation to data
•  Parallelization and distribution (high scalability)
•  Fault tolerance
•  Status and monitoring
•  "One-person deployment"
© https://en.wikipedia.org/wiki/MapReduce

When ?
BIG DATA
VOLUME
VELOCITY VARIETY

Software companies

M/R shortcomings
• Forces your pipeline into Map/Reduce tasks
• Other workflows (filter, join, map-reduce-map …)
• Reads from disk for every M/R task
• Iterative algorithms
• Only native Java programming interface
• Support for other languages: streaming module
• Interactive shell

Hadoop conclusion
• Major slowness problem
• MapReduce is slow, but it is currently the only option for processing data on HDFS
• Contradictory vendor roadmaps
• Vendor strategy (Google)

Hadoop conclusion
• Map-Reduce has served a great purpose,
though: many, many companies, research
labs and individuals are successfully
bringing Map-Reduce to bear on problems
to which it is suited: brute-force processing
with an optional aggregation.
http://the-paper-trail.org/blog/the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/

Hadoop conclusion
• But more important in the longer term, to
my mind, is the way that Map-Reduce
provided the justification for re-evaluating
the ways in which large-scale data
processing platforms are built (and
purchased!).
http://the-paper-trail.org/blog/the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/

Hadoop conclusion
• It’s well known in the industry that more
than 10 years ago Google invented
MapReduce, the technology at the heart of
first-generation Hadoop. It's less well
known that Google moved away from
MapReduce several years ago. Today at its
Google I/O 2014 …
https://www.datanami.com/2014/06/25/google-re-imagines-mapreduce-launches-dataflow/

Hadoop conclusion
• … Today at its Google I/O 2014 conference,
the Web giant unveiled a possible successor
to MapReduce called Dataflow, which it's
selling through its hosted cloud service.
https://www.datanami.com/2014/06/25/google-re-imagines-mapreduce-launches-dataflow/

Spark

Key dates
• 2009: AMPLab, University of California, Berkeley
• Original aim: proof of concept for Mesos
• 2012: version 0.5.1

[Figure: Spark architecture. The driver node runs the driver program and its Spark context; the cluster manager connects it to the worker nodes; each worker node runs an executor with a cache and several tasks.]

Spark
• Resilient Distributed Datasets (RDD)
• An RDD is a resilient, distributed collection of records
• Motivation
• Iterative algorithms in machine learning
• Supports 2 types of operations
• Transformations
• Actions

Spark - RDD
[Figure: one RDD partitioned across Server 1, Server 2 and Server 3]

Spark
• Transformations
• Functions that return another RDD
• map
• flatMap
• filter
• coalesce
• groupByKey

Spark – Transformation: map
Input:
Hello World
This Is Devoxx
Morocco
Held In
Casablanca
.map(_.toLowerCase)
Output:
hello world
this is devoxx
morocco
held in
casablanca

Spark – Transformation: flatMap
Input:
hello world
this is devoxx
morocco
held in
casablanca
.flatMap(line => line.split("\\s+"))
Output:
hello
world
this
is
….
devoxx

Spark – Transformation: map
Input:
hello
world
this
is
….
devoxx
.map(word => (word, 1))
Output:
(hello,1)
(world,1)
(this,1)
(is,1)
….
(devoxx,1)

Spark – Transformation: groupByKey
[Figure: partitions holding a mix of (a,1) and (b,1) pairs; groupByKey shuffles every pair so that all pairs with the same key end up together]

Spark – Transformation: reduceByKey
[Figure: partitions holding a mix of (a,1) and (b,1) pairs; reduceByKey combines the values per key and yields (a,6) and (b,6)]
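The difference between the two transformations can be approximated in plain Python. This is a toy model, not Spark code: the two partitions are hypothetical, and "records moved" simply counts what each strategy would have to send across the network.

```python
from collections import defaultdict

# Two hypothetical partitions of (key, 1) pairs, living on different nodes.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("a", 1), ("b", 1), ("b", 1), ("b", 1)],
]


def group_by_key(partitions):
    """Ship every pair across the network, then group values per key."""
    shuffled = [pair for part in partitions for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return dict(groups), len(shuffled)


def reduce_by_key(partitions, func):
    """Combine inside each partition first, then merge the partial results."""
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        partials.extend(local.items())
    merged = {}
    for key, value in partials:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(partials)


groups, moved_group = group_by_key(partitions)
totals, moved_reduce = reduce_by_key(partitions, lambda x, y: x + y)
print(totals, moved_group, moved_reduce)
```

Both approaches reach the same totals, but the pre-aggregating variant moves far fewer records, which is why reduceByKey is usually preferred for sums and counts.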

Spark
• Actions
• Functions that trigger computation and return something that isn't an RDD
•  collect(): copy all elements to the driver
•  count()
•  collectAsMap()
•  takeSample()
•  take(n): copy the first n elements
•  reduce(func): aggregates elements with func (takes 2 elements, returns one)
•  saveAsTextFile(path): save to local storage or HDFS

All in one
import org.apache.spark.SparkContext

val N = 10 // example value; N is not defined on the original slide
val sc = new SparkContext()
val docs = sc.textFile("hdfs://<path>")
val low = docs.map(line => line.toLowerCase)
val words = low.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))
val frequency = counts.reduceByKey(_ + _)
val top = frequency.map(_.swap).top(N) // N most frequent words

top.foreach(println)

Spark
• Caching
• By default, each job is reprocessed from HDFS
• Calling .cache() on an RDD marks it for caching
• The data is actually cached at the first computation (lazy)
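Lazy evaluation and caching can be mimicked with a small class. This is a toy stand-in written for illustration: LazyDataset and its evaluations counter are hypothetical names, and only the idea (transformations record work, actions trigger it, cache() stores the first result) reflects the behaviour described above.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records how to compute, computes on demand."""

    def __init__(self, compute):
        self._compute = compute   # function returning a list
        self._cached = None
        self._use_cache = False
        self.evaluations = 0      # how many times this dataset was computed

    def map(self, func):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(lambda: [func(x) for x in self.collect()])

    def cache(self):
        # Only marks the dataset; data is stored at the first action.
        self._use_cache = True
        return self

    def collect(self):
        # Action: triggers the computation (or reuses the cached result).
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


source = LazyDataset(lambda: ["Hello World", "This Is Devoxx"])
lowered = source.map(str.lower).cache()
before = lowered.evaluations  # cache() alone computed nothing
first = lowered.collect()
second = lowered.collect()
print(before, lowered.evaluations, first)
```

Without the cache() call, the second collect() would recompute the whole chain from the source, which mirrors the default reprocessing from HDFS.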

Spark
• Directed Acyclic Graphs (DAGs)
• Nodes are RDDs
• Arrows are transformations

Spark
• Batch
• Streaming
• Iterative
• Interactive

GOOGLE TRENDS SPARK vs. STORM vs. HIVE

Key dates
• BigTable (Google) : 2004
• Dynamo (Amazon) : 2007

Data model
• Key-Value
• Document
• Column

Key-value
• Associative array (map)
• Query model: PUT, GET, DELETE
[Figure: KEY → VALUE]
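The PUT, GET, DELETE query model is small enough to sketch in full. The class below is an in-memory illustration with hypothetical names; it does not reproduce the memcached protocol or any client library.

```python
class KeyValueStore:
    """Minimal in-memory key-value store exposing PUT, GET and DELETE."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The value is opaque: the store never looks inside it.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        existed = key in self._data
        self._data.pop(key, None)
        return existed


store = KeyValueStore()
store.put("session:42", b"opaque bytes")
value = store.get("session:42")
deleted = store.delete("session:42")
missing = store.get("session:42")
print(value, deleted, missing)
```

Because access is always by exact key, there is no way to ask "which sessions contain X"; that limitation is what the document model relaxes.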

Document
{
"id" : "987GREHLKE878YEFB",
"images": ["url1", "url2", "url3"],
"prix": "1290",
"type" : "APPARTEMENT",
"etage" : "2",
"pieces" : "2",
"chambres" : "1",
"surface" : "20",
"description": "desc ...",
"ville" : "PARIS",
"arrondissement" : "75004",
"departement" : "IDF"
}

Document
• Standard encoding format : JSON, BSON,
…
• Query model
• CRUD (Create, Read, Update, Delete)
• Select based on document content

{
"id" : "987GREHLKE878YEFB",
"images": ["url1", "url2", "url3"],
"prix": "1290",
"type" : "APPARTEMENT",
"etage" : "2",
"pieces" : "2",
"chambres" : "1",
"surface" : "20",
"description": "desc ...",
"ville" : "PARIS",
"arrondissement" : "75004",
"departement" : "IDF"
}
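Selecting on document content, as opposed to looking up by key, can be sketched with a tiny find helper. The second document and the find function are hypothetical; real document stores expose richer query languages and indexes.

```python
documents = [
    {"id": "987GREHLKE878YEFB", "type": "APPARTEMENT", "ville": "PARIS",
     "prix": "1290", "pieces": "2"},
    # Hypothetical second document, added only to have something to filter out.
    {"id": "HYPOTHETICAL-2", "type": "MAISON", "ville": "LYON",
     "prix": "2100", "pieces": "5"},
]


def find(collection, **criteria):
    """Return the documents whose fields equal all the given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


matches = find(documents, ville="PARIS", type="APPARTEMENT")
print([doc["id"] for doc in matches])
```

Using doc.get(field) means a document that lacks a field simply does not match, which fits collections where documents do not all share the same schema.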

{Column}
• Column family stores
• BigTable, HBase, Hypertable, Cassandra
• Column stores
• C-Store, Vertica
© http://dbmsmusings.blogspot.fr/2010/03/distinguishing-two-major-types-of_29.html

Data model
System         Level 1    Level 2      Level 3    Level 4
Relational DB  Databases  Tables       Rows       Columns
MongoDB        db         Collections  Documents  Fields
Elasticsearch  Indices    Types        Documents  Fields
© http://www.slideshare.net/yellow7/cassandra-backgroundandarchitecture

Column family stores
• Persistent (distributed) maps


Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
© http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/


Map<RowKey, SortedMap<SuperColumnKey, SortedMap<ColumnKey, ColumnValue>>>
© http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
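The nested-map view of a column family can be sketched directly. The ColumnFamily class, the row key and the date-shaped column keys are illustrative assumptions; sorting on read stands in for the SortedMap that a real store maintains on disk.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy Map<RowKey, SortedMap<ColumnKey, ColumnValue>>."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column_key, value):
        self._rows[row_key][column_key] = value

    def get_row(self, row_key):
        # Columns come back ordered by column key, like a SortedMap.
        return sorted(self._rows.get(row_key, {}).items())

    def slice(self, row_key, start, end):
        # Range scan over the column keys of one row (inclusive bounds).
        return [(c, v) for c, v in self.get_row(row_key) if start <= c <= end]


cf = ColumnFamily()
cf.put("user:1", "2016-11-03", "login")
cf.put("user:1", "2016-11-01", "signup")
cf.put("user:1", "2016-11-02", "purchase")
recent = cf.slice("user:1", "2016-11-02", "2016-11-03")
print(recent)
```

Each row can hold a different set of columns, and keeping the columns sorted is what makes range queries inside a row cheap.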

Replication model
• Master-less
•  Cassandra, DynamoDB, Riak
• Master-slave
•  MongoDB, Redis, HBase
• Master-master (or master-slave)
•  CouchDB
© http://www.slideshare.net/yellow7/cassandra-backgroundandarchitecture

Comparison criteria
• Data model
• Query model
• Replication model
• Consistency model
• Licensing, support, community

System Architecture

Why the schema-less explosion

• Start-ups vs. old-school enterprises
• (with a very short time to market)

• Allowed by business rules

Why the schema-less explosion: 3V

Constraints on using NoSQL
• Transactions
• Pushing conflict resolution to the client cannot be considered progress.
• A necessary evil, often dictated by the business

http://db-engines.com/en/ranking

https://www.gartner.com/doc/reprints?id=1-2PMFPEN&ct=151013&st=sb

https://www.google.com/trends/explore?date=2008-03-18%202016-10-18&q=RDBMS,NOSQL

URL REFERENCES
•  Hadoop: The Definitive Guide, Third Edition, Tom White, ISBN: 978-1-449-31152-0, O'Reilly.
•  https://www.postgresql.org/about/
•  https://blog.codeship.com/unleash-the-power-of-storing-json-in-postgres/
•  https://opentextbc.ca/dbdesign/chapter/chapter-5-data-modelling/
•  http://coronet.iicm.edu/is/scripts/lesson03.pdf
•  https://opentextbc.ca/dbdesign/chapter/chapter-3-characteristics-and-benefits-of-a-database/
•  http://gerardnico.com/wiki/relation/rdbms
•  https://en.wikipedia.org/wiki/Scalability
•  http://siliconangle.com/blog/2016/06/27/google-tools-up-with-its-spanner-database-looks-for-a-fight-with-aws/
•  http://www.cattell.net/datastores/Datastores.pdf
•  https://en.wikipedia.org/wiki/Apache_Hadoop
•  https://en.wikipedia.org/wiki/MapReduce
•  https://www.linkedin.com/pulse/rdbms-follows-acid-property-nosql-databases-base-does

URL REFERENCES
•  https://www.quora.com/Hadoop-Why-are-companies-investing-so-much-into-Hadoop-if-Google-released-the-MapReduce-paper-back-in-2004-Are-companies-just-going-to-follow-the-road-map-Google-created-Big-Table-Pregel-Dremel-etc-It-seems-to-me-that-companies-will-always-be-behind-the-curve
•  http://the-paper-trail.org/blog/the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/
•  https://www.mapr.com/ebooks/spark/01-what-is-apache-spark.html
•  https://www.digitalocean.com/community/tutorials/a-comparison-of-nosql-database-management-systems-and-models
•  https://cloud.google.com/bigtable/docs/overview
•  https://cloud.google.com/bigtable/docs/schema-design
•  https://en.wikipedia.org/wiki/Dremel_(software)
•  https://www.gartner.com/doc/reprints?id=1-2PMFPEN&ct=151013&st=sb
•  http://www.infoworld.com/article/3056637/database/nosql-chips-away-at-oracle-ibm-and-microsoft-dominance.html
•  http://www.slideshare.net/billhoweuw/dataintensive-scalable-science
