Big Data and NoSQL
beyond the latest buzz
Definition
Methodology
Architecture
Use cases
Definition
"Big Data technologies are a new generation of technologies and architectures, designed to extract economic value from very large volumes of heterogeneous data by capturing, exploring and/or analyzing them at high speed."
Contents
- Presentation
- Use cases
- Architecture
- Practical case
- Conclusion
- References
- Appendices
Methodology
Setting up a Big Data initiative always involves three steps:
- Collect and store the data: the infrastructure part.
- Analyze, correlate and aggregate the data: the analytics part.
- Exploit and display the Big Data analysis: how do we put the data and the analyses to work?
Architecture
Big Data architecture (diagram); the pipeline runs from data sources to business value:
- Collect the data: audio, video, images; documents, text, XML; web logs, clicks, social graphs, RSS; sensors, spatial data, GPS; other sources.
- Storage, organization and extraction: column-oriented NoSQL databases, distributed file system and MapReduce; SQL databases queried with SQL.
- Analyze & visualize: analytics, business intelligence.
- Business.
Big Data technology (diagram); the same pipeline, with concrete products:
- Collect the data: the same heterogeneous sources as above.
- Storage, organization and extraction: HBase, BigTable, Cassandra, Voldemort, … (column stores); HDFS, GFS, S3, … (distributed file systems); Oracle, DB2, MySQL, … queried with SQL.
- Analyze & visualize: analytics, business intelligence.
- Business.
Open-source Big Data architecture for the enterprise
Big Data full solution
Use cases
BigQuery use case
Hadoop use cases
Ò  Facebook uses Hadoop to store copies of internal log and
dimension data sources and as a source for reporting/analytics
and machine learning. There are two clusters, a 1100-machine
cluster with 8800 cores and about 12 PB raw storage and a 300-
machine cluster with 2400 cores and about 3 PB raw storage.
Ò  Yahoo! deploys more than 100,000 CPUs in > 40,000
computers running Hadoop. The biggest cluster has 4500 nodes
(2*4cpu boxes w 4*1TB disk & 16GB RAM). This is used to
support research for Ad Systems and Web Search and to do
scaling tests to support development of Hadoop on larger
clusters
Ò  eBay uses a 532 nodes cluster (8 * 532 cores, 5.3PB),
Java MapReduce, Pig, Hive and HBase
Ò  Twitter uses Hadoop to store and process tweets, log files, and
other data generated across Twitter. They use Cloudera’s CDH2
distribution of Hadoop. They use both Scala and Java to access
Hadoop’s MapReduce APIs as well as Pig, Avro, Hive, and
Cassandra.
Gartner talk
"By 2015, 4.4 million IT jobs will be created worldwide to support Big Data, including 1.9 million in the United States," said Peter Sondergaard, senior vice president and global head of research at Gartner.
Wanted
"Sophisticated Statistical Analysis"
$100,000 to $500,000
HBase
Presentation
Architecture & API
Use cases
Case study
Security
Hadoop ecosystem
Presentation
"HBase is the Hadoop database. Think of it as a distributed scalable Big Data store" (http://hbase.apache.org/)
"Project's goal is the hosting of very large tables – billions of rows X millions of columns – atop clusters of commodity hardware" (http://hbase.apache.org/)
"HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's BigTable" (http://hbase.apache.org/)
The Google trilogy
- The Google File System
  http://research.google.com/archive/gfs.html
- MapReduce: Simplified Data Processing on Large Clusters
  http://research.google.com/archive/mapreduce.html
- Bigtable: A Distributed Storage System for Structured Data
  http://research.google.com/archive/bigtable.html
Operating systems / Platforms
Installation / start / stop
$ mkdir bigdata
$ cd bigdata
$ wget http://apache.claz.org/hbase/hbase-0.92.1/hbase-0.92.1.tar.gz
…
$ tar xvfz hbase-0.92.1.tar.gz
…
$ export HBASE_HOME=`pwd`/hbase-0.92.1
$ $HBASE_HOME/bin/start-hbase.sh
…
$ $HBASE_HOME/bin/stop-hbase.sh
…
HBase Shell / example session
$ $HBASE_HOME/bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.92.1, r1298924, Fri Mar 9 16:58:34 UTC 2012
hbase(main):001:0> list
TABLE
0 row(s) in 0.5510 seconds
hbase(main):002:0> create 'mytable', 'cf'
0 row(s) in 1.1730 seconds
hbase(main):003:0> list
TABLE
mytable
1 row(s) in 0.0080 seconds
hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase'
0 row(s) in 0.2070 seconds
hbase(main):005:0> get 'mytable', 'first'
COLUMN CELL
cf:message timestamp=1323483954406, value=hello HBase
1 row(s) in 0.0250 seconds
HBase Shell / commands
- General: status, version
- DDL: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, is_disabled, is_enabled, list, show_filters
- DML: count, delete, deleteall, get, get_counter, incr, put, scan, truncate
- Tools: assign, balance_switch, balancer, close_region, compact, flush, hlog_roll, major_compact, move, split, unassign, zk_dump
- Replication: add_peer, disable_peer, enable_peer, list_peers, remove_peer, start_replication, stop_replication
- Security: grant, revoke, user_permission
Architecture & API
Logical architecture
Table design
Example table
- Data coordinates
[rowkey, column family, column qualifier, timestamp] → value
[fr.wikipedia.org/wiki/NoSQL, links, count, 1234567890] → 24
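This coordinate model is often described as a set of nested, sorted maps. A minimal in-memory sketch of that idea (illustrative only, not the HBase API):

```java
import java.util.*;

// Illustrative model of HBase's cell coordinates:
// a value is addressed by [row key, column family, column qualifier, timestamp].
public class CellCoordinates {
    // row -> family -> qualifier -> (timestamp -> value), newest timestamp first
    static Map<String, Map<String, Map<String, NavigableMap<Long, String>>>> store =
        new HashMap<>();

    static void put(String row, String family, String qualifier, long ts, String value) {
        store.computeIfAbsent(row, r -> new HashMap<>())
             .computeIfAbsent(family, f -> new HashMap<>())
             .computeIfAbsent(qualifier, q -> new TreeMap<>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    static String get(String row, String family, String qualifier) {
        // A read without an explicit timestamp returns the newest version,
        // mirroring HBase's default behaviour.
        return store.get(row).get(family).get(qualifier).firstEntry().getValue();
    }

    public static void main(String[] args) {
        put("fr.wikipedia.org/wiki/NoSQL", "links", "count", 1234567890L, "24");
        System.out.println(get("fr.wikipedia.org/wiki/NoSQL", "links", "count"));
    }
}
```

Keeping versions in a descending map is what makes "latest value wins" a cheap lookup rather than a scan.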
HBase Java API
- HBaseAdmin, HTableDescriptor, HColumnDescriptor
HTableDescriptor desc = new HTableDescriptor("TableName");
HColumnDescriptor cf = new HColumnDescriptor("Family".getBytes());
desc.addFamily(cf);
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
admin.createTable(desc);
HBase Java API
- HTablePool & HTableInterface
HTablePool pool = new HTablePool();
HTableInterface table = pool.getTable("TableName");
HBase Java API
- Put
byte[] cellValue = …
Put p = new Put("RowKey".getBytes());
p.add("Family".getBytes(), "Qualifier".getBytes(), cellValue);
table.put(p);
Write path
HBase Java API
- Get
Get g = new Get("RowKey".getBytes());
g.addColumn("Family".getBytes(), "Qualifier".getBytes());
Result r = table.get(g);
- Result
byte[] cellValue = r.getValue("Family".getBytes(), "Qualifier".getBytes());
HBase Java API
- Scan
Scan scan = new Scan();
scan.setStartRow("StartRow".getBytes());
scan.setStopRow("StopRow".getBytes());
scan.addColumn("Family".getBytes(), "Qualifier".getBytes());
ResultScanner rs = table.getScanner(scan);
for (Result r : rs) {
  // …
}
Read path
Contents
- Presentation
- Use cases
- Architecture
- Practical case
- Conclusion
- References
- Appendices
Use cases
You?
Case study
Search engine
Wikipedia.fr table
Creating the wikipedia table
HTableDescriptor desc = new HTableDescriptor("wikipedia");
HColumnDescriptor contentFamily = new HColumnDescriptor("content".getBytes());
contentFamily.setMaxVersions(1);
desc.addFamily(contentFamily);
HColumnDescriptor linksFamily = new HColumnDescriptor("links".getBytes());
linksFamily.setMaxVersions(1);
desc.addFamily(linksFamily);
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
admin.createTable(desc);
HBase API: data insertion by the crawlers
Put p = new Put(Util.toBytes(url));
p.add("content".getBytes(), "text".getBytes(),
      htmlParseData.getText().getBytes());
List<WebURL> links = htmlParseData.getOutgoingUrls();
int count = 0;
for (WebURL outGoingURL : links) {
  p.add("links".getBytes(), Bytes.toBytes(count++),
        outGoingURL.getURL().getBytes());
}
p.add("links".getBytes(), "count".getBytes(), Bytes.toBytes(count));
try {
  table.put(p);
} catch (IOException e) {
  e.printStackTrace();
}
Inverted index table
MapReduce
MapReduce inverted index – algorithm
method map(url, text)
  for all word ∈ text do
    emit(word, url)
method reduce(word, urls)
  count ← 0
  for all url ∈ urls do
    put(word, "links", count, url)
    count ← count + 1
  put(word, "links", "count", count)
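Stripped of the Hadoop and HBase plumbing, the map/emit/group logic of this algorithm can be sketched in a few lines of plain Java (the page contents below are hypothetical):

```java
import java.util.*;

// In-memory sketch of the inverted-index MapReduce: the "map" phase emits
// (word, url) pairs, and the grouping done here stands in for the shuffle
// that Hadoop performs between map and reduce.
public class InvertedIndexSketch {
    static Map<String, List<String>> invert(Map<String, String> pages) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            String url = page.getKey();
            for (String word : page.getValue().toLowerCase().split("\\W+")) {
                if (!word.isEmpty())
                    index.computeIfAbsent(word, w -> new ArrayList<>()).add(url);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("url1", "nosql stores scale");
        pages.put("url2", "sql and nosql");
        // every word now maps to the list of urls that contain it
        System.out.println(invert(pages).get("nosql"));
    }
}
```

The real job below does the same thing, except that the grouped lists are produced by the framework and the reducer writes them back into an HBase table.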
MapReduce inverted index – configuration
public static void main(String[] args) throws Exception {
  Configuration conf = HBaseConfiguration.create();
  Job job = new Job(conf, "Reverse Index Job");
  job.setJarByClass(InvertedIndex.class);
  Scan scan = new Scan();
  scan.setCaching(500);
  scan.addColumn("content".getBytes(), "text".getBytes());
  TableMapReduceUtil.initTableMapperJob("wikipedia", scan,
      Map.class, Text.class, Text.class, job);
  TableMapReduceUtil.initTableReducerJob("wikipedia_ri",
      Reduce.class, job);
  job.setNumReduceTasks(1);
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
MapReduce inverted index – Map
public static class Map extends TableMapper<Text, Text> {
  private static final Pattern PATTERN =
      Pattern.compile(SENTENCE_SPLITTER_REGEX);
  private Text key = new Text();
  private Text value = new Text();

  @Override
  protected void map(ImmutableBytesWritable rowkey, Result result,
      Context context) {
    byte[] b = result.getValue("content".getBytes(), "text".getBytes());
    String text = Bytes.toString(b);
    String[] words = PATTERN.split(text);
    value.set(result.getRow());
    for (String word : words) {
      key.set(word.getBytes());
      try {
        context.write(key, value);
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
  }
}
MapReduce inverted index – Reduce
public static class Reduce extends
    TableReducer<Text, Text, ImmutableBytesWritable> {
  @Override
  protected void reduce(Text rowkey, Iterable<Text> values,
      Context context) {
    Put p = new Put(rowkey.getBytes());
    int count = 0;
    for (Text link : values) {
      p.add("links".getBytes(),
          Bytes.toBytes(count++), link.getBytes());
    }
    p.add("links".getBytes(), "count".getBytes(),
        Bytes.toBytes(count));
    try {
      context.write(new ImmutableBytesWritable(rowkey.getBytes()), p);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
Question
How do we rank search results by importance?
Link counting
Weighted counting
Recursive definition
A little linear algebra
PageRank
Search engine
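The recursive definition sketched on these slides is typically computed by power iteration: start from a uniform rank vector and repeatedly redistribute each page's rank along its outgoing links. A tiny standalone sketch, with a hypothetical three-page graph and the usual damping factor of 0.85 (dangling pages without outlinks are not handled here):

```java
import java.util.*;

// Power-iteration PageRank sketch: rank flows along outgoing links,
// damped by d, with (1-d)/n distributed uniformly ("random jump").
public class PageRankSketch {
    static double[] pageRank(int[][] outLinks, int iterations) {
        int n = outLinks.length;
        double d = 0.85;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);               // start uniform
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);       // random-jump term
            for (int page = 0; page < n; page++)
                for (int target : outLinks[page]) // split rank among outlinks
                    next[target] += d * rank[page] / outLinks[page].length;
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // page 0 -> 1, page 1 -> 2, page 2 -> 0: a symmetric cycle,
        // so all three pages end up with equal rank.
        int[][] graph = {{1}, {2}, {0}};
        System.out.println(Arrays.toString(pageRank(graph, 50)));
    }
}
```

At web scale this same iteration is what the MapReduce machinery above is for: one job per iteration, ranks flowing along the "links" family.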
Digression 1
Google@Stanford, Larry Page and Sergey Brin – 1998
- 2-proc Pentium II 300 MHz, 512 MB, five 9 GB drives
- 2-proc Pentium II 300 MHz, 512 MB, four 9 GB drives
- 4-proc PPC 604 333 MHz, 512 MB, eight 9 GB drives
- 2-proc UltraSparc II 200 MHz, 256 MB, three 9 GB drives, six 4 GB drives
- Disk expansion, eight 9 GB drives
- Disk expansion, ten 9 GB drives
Total: CPU 2933 MHz (across 10 CPUs), RAM 1792 MB, HD 366 GB
Digression 2
Google 1999 – first production server
A formula without math :)
Data + science + insight = value
Security
- Kerberos
- Access Control List
Contents
- Presentation
- Use cases
- Architecture
- Practical case
- Conclusion
- References
- Appendices
Hadoop ecosystem
- Hadoop
- Pig
- Hive
- Mahout
- Whirr
Thank you

Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 

Último (20)

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 

Valtech - Big Data & NoSQL : au-delà du nouveau buzz

  • 1. Big Data et NoSQL au-delà du nouveau buzz
  • 4. Definition: "Big Data technologies are a new generation of technologies and architectures, designed to extract economic value from huge volumes of heterogeneous data by capturing, exploring and/or analyzing them in record time."
  • 5. Contents: Overview, Use cases, Architecture, Case study, Conclusion, References, Appendices. Section: Methodology
  • 6. Big Data / Methodology. Putting a Big Data approach in place always involves three steps: collect and store the data (the infrastructure part); analyze, correlate, and aggregate the data (the analytics part); exploit and display the Big Data analysis (how do we put the data and the analyses to work?).
  • 8. Big Data architecture (diagram): heterogeneous sources (audio, video, images; documents, text, XML; web logs, clicks, social graphs, RSS; sensors, spatial/GPS data; others) flow into the storage and organization layer: column-oriented NoSQL databases, a distributed file system with MapReduce, and SQL databases. These feed analytics and business intelligence. Pipeline stages: collect the data, store & organize, extract, analyze & visualize, business.
  • 9. Big Data technologies (the same diagram, with concrete products): HBase, BigTable, Cassandra, Voldemort, … as column-oriented NoSQL stores; HDFS, GFS, S3, … as distributed file systems; Oracle, DB2, MySQL, … as SQL databases.
  • 10. An open-source Big Data architecture for the enterprise
  • 11. Big Data full solution
  • 14. Hadoop use cases:
    Facebook uses Hadoop to store copies of internal log and dimension data sources and as a source for reporting/analytics and machine learning. There are two clusters: an 1100-machine cluster with 8800 cores and about 12 PB of raw storage, and a 300-machine cluster with 2400 cores and about 3 PB of raw storage.
    Yahoo! deploys more than 100,000 CPUs in over 40,000 computers running Hadoop. The biggest cluster has 4500 nodes (2×4-CPU boxes with 4×1 TB disks and 16 GB RAM). It is used to support research for ad systems and web search, and to run scaling tests that support development of Hadoop on larger clusters.
    eBay uses a 532-node cluster (8 × 532 cores, 5.3 PB) with Java MapReduce, Pig, Hive, and HBase.
    Twitter uses Hadoop to store and process tweets, log files, and other data generated across Twitter. They use Cloudera's CDH2 distribution of Hadoop, and access Hadoop's MapReduce APIs from both Scala and Java, as well as Pig, Avro, Hive, and Cassandra.
  • 15. Gartner talk: "By 2015, 4.4 million IT jobs will be created worldwide to support Big Data, including 1.9 million in the United States," said Peter Sondergaard, senior vice president and global head of research at Gartner. Wanted: "Sophisticated Statistical Analysis", $100,000 to $500,000.
  • 16. HBase
  • 17. Overview, Architecture & API, Use cases, Case study, Security, Hadoop ecosystem
  • 19. Overview: "HBase is the Hadoop database. Think of it as a distributed scalable Big Data store" http://hbase.apache.org/
  • 20. Overview: "Project's goal is the hosting of very large tables – billions of rows X millions of columns – atop clusters of commodity hardware" http://hbase.apache.org/
  • 21. Overview: "HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's BigTable" http://hbase.apache.org/
  • 22. The Google trilogy:
    The Google File System: http://research.google.com/archive/gfs.html
    MapReduce: Simplified Data Processing on Large Clusters: http://research.google.com/archive/mapreduce.html
    Bigtable: A Distributed Storage System for Structured Data: http://research.google.com/archive/bigtable.html
  • 24. Installation / startup / shutdown:
    $ mkdir bigdata
    $ cd bigdata
    $ wget http://apache.claz.org/hbase/hbase-0.92.1/hbase-0.92.1.tar.gz
    $ tar xvfz hbase-0.92.1.tar.gz
    $ export HBASE_HOME=`pwd`/hbase-0.92.1
    $ $HBASE_HOME/bin/start-hbase.sh
    $ $HBASE_HOME/bin/stop-hbase.sh
  • 25. HBase Shell, a sample session:
    $ $HBASE_HOME/bin/hbase shell
    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Type "exit<RETURN>" to leave the HBase Shell
    Version 0.92.1, r1298924, Fri Mar 9 16:58:34 UTC 2012
    hbase(main):001:0> list
    TABLE
    0 row(s) in 0.5510 seconds
    hbase(main):002:0> create 'mytable', 'cf'
    0 row(s) in 1.1730 seconds
    hbase(main):003:0> list
    TABLE
    mytable
    1 row(s) in 0.0080 seconds
    hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase'
    0 row(s) in 0.2070 seconds
    hbase(main):005:0> get 'mytable', 'first'
    COLUMN                CELL
    cf:message            timestamp=1323483954406, value=hello HBase
    1 row(s) in 0.0250 seconds
  • 26. HBase Shell commands:
    General: status, version
    DDL: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, is_disabled, is_enabled, list, show_filters
    DML: count, delete, deleteall, get, get_counter, incr, put, scan, truncate
    Tools: assign, balance_switch, balancer, close_region, compact, flush, hlog_roll, major_compact, move, split, unassign, zk_dump
    Replication: add_peer, disable_peer, enable_peer, list_peers, remove_peer, start_replication, stop_replication
    Security: grant, revoke, user_permission
  • 30. A sample table. Cell coordinates: [rowkey, column family, column qualifier, timestamp] → value, e.g. [fr.wikipedia.org/wiki/NoSQL, links, count, 1234567890] → 24
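This four-coordinate addressing can be mimicked with nested sorted maps. The sketch below is purely illustrative (the class names are invented for this example, and it is not the HBase API): it models a cell store where, as in HBase, the most recent timestamp wins on a read.

```java
import java.util.Comparator;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative only: models HBase's cell coordinates
// [rowkey, family, qualifier, timestamp] -> value with nested sorted maps.
class CellStore {
    // row -> family -> qualifier -> (timestamp, newest first) -> value
    private final SortedMap<String, SortedMap<String, SortedMap<String, SortedMap<Long, String>>>> rows =
            new TreeMap<String, SortedMap<String, SortedMap<String, SortedMap<Long, String>>>>();

    void put(String row, String family, String qualifier, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<String, SortedMap<String, SortedMap<Long, String>>>())
            .computeIfAbsent(family, f -> new TreeMap<String, SortedMap<Long, String>>())
            .computeIfAbsent(qualifier, q -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(ts, value);
    }

    // Latest version wins, like a default HBase Get.
    String get(String row, String family, String qualifier) {
        SortedMap<String, SortedMap<String, SortedMap<Long, String>>> fams = rows.get(row);
        if (fams == null) return null;
        SortedMap<String, SortedMap<Long, String>> quals = fams.get(family);
        if (quals == null) return null;
        SortedMap<Long, String> versions = quals.get(qualifier);
        if (versions == null || versions.isEmpty()) return null;
        return versions.get(versions.firstKey()); // newest timestamp first
    }
}

public class CellStoreDemo {
    public static void main(String[] args) {
        CellStore store = new CellStore();
        store.put("fr.wikipedia.org/wiki/NoSQL", "links", "count", 1234567890L, "24");
        store.put("fr.wikipedia.org/wiki/NoSQL", "links", "count", 1234567999L, "25");
        // The most recent timestamp wins:
        System.out.println(store.get("fr.wikipedia.org/wiki/NoSQL", "links", "count"));
    }
}
```

The point of the sketch is only the shape of the coordinate space: a cell is never addressed by "column" alone, but always by the full (row, family, qualifier, timestamp) tuple.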
  • 31. HBase Java API: HBaseAdmin, HTableDescriptor, HColumnDescriptor
    HTableDescriptor desc = new HTableDescriptor("TableName");
    HColumnDescriptor cf = new HColumnDescriptor("Family".getBytes());
    desc.addFamily(cf);
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);
  • 32. HBase Java API: HTablePool & HTableInterface
    HTablePool pool = new HTablePool();
    HTableInterface table = pool.getTable("TableName");
  • 33. HBase Java API: Put
    byte[] cellValue = …
    Put p = new Put("RowKey".getBytes());
    p.add("Family".getBytes(), "Qualifier".getBytes(), cellValue);
    table.put(p);
  • 35. HBase Java API: Get
    Get g = new Get("RowKey".getBytes());
    g.addColumn("Family".getBytes(), "Qualifier".getBytes());
    Result r = table.get(g);
  Result
    byte[] cellValue = r.getValue("Family".getBytes(), "Qualifier".getBytes());
  • 36. HBase Java API: Scan
    Scan scan = new Scan();
    scan.setStartRow("StartRow".getBytes());
    scan.setStopRow("StopRow".getBytes());
    scan.addColumn("Family".getBytes(), "Qualifier".getBytes());
    ResultScanner rs = table.getScanner(scan);
    for (Result r : rs) {
      // …
    }
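An HBase scan is a range read over rowkeys kept in lexicographic order, with the start row inclusive and the stop row exclusive. Those semantics can be sketched with a plain sorted map (illustrative only, not the HBase client itself):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of HBase Scan semantics: rows are kept sorted by rowkey,
// and a scan returns the range [startRow, stopRow) in key order.
public class ScanSketch {

    static SortedMap<String, String> scan(SortedMap<String, String> table,
                                          String startRow, String stopRow) {
        return table.subMap(startRow, stopRow); // start inclusive, stop exclusive
    }

    public static void main(String[] args) {
        SortedMap<String, String> table = new TreeMap<String, String>();
        table.put("row-a", "1");
        table.put("row-b", "2");
        table.put("row-c", "3");
        // Returns row-a and row-b; row-c is excluded (stop row is exclusive).
        System.out.println(scan(table, "row-a", "row-c").keySet());
    }
}
```

This is also why rowkey design matters so much in HBase: rows you want to read together should sort together.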
  • 38. Contents: Overview, Use cases, Architecture, Case study, Conclusion, References, Appendices. Section: Use cases
  • 44. Creating the Wikipedia table
    HTableDescriptor desc = new HTableDescriptor("wikipedia");
    HColumnDescriptor contentFamily = new HColumnDescriptor("content".getBytes());
    contentFamily.setMaxVersions(1);
    desc.addFamily(contentFamily);
    HColumnDescriptor linksFamily = new HColumnDescriptor("links".getBytes());
    linksFamily.setMaxVersions(1);
    desc.addFamily(linksFamily);
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);
  • 45. HBase API: inserting data from the crawlers
    Put p = new Put(Util.toBytes(url));
    p.add("content".getBytes(), "text".getBytes(), htmlParseData.getText().getBytes());
    List<WebURL> links = htmlParseData.getOutgoingUrls();
    int count = 0;
    for (WebURL outGoingURL : links) {
      p.add("links".getBytes(), Bytes.toBytes(count++), outGoingURL.getURL().getBytes());
    }
    p.add("links".getBytes(), "count".getBytes(), Bytes.toBytes(count));
    try {
      table.put(p);
    } catch (IOException e) {
      e.printStackTrace();
    }
  • 46. The inverted-index table
  • 48. MapReduce inverted index: the algorithm
    method map(url, text)
      for all word ∈ text do
        emit(word, url)
    method reduce(word, urls)
      count ← 0
      for all url ∈ urls do
        put(word, "links", count, url)
        count ← count + 1
      put(word, "links", "count", count)
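The same map/shuffle/reduce logic can be exercised outside Hadoop. The following in-memory sketch (invented class and method names, no HBase or Hadoop involved) makes the data flow concrete: map emits (word, url) pairs, the shuffle groups them by word, and reduce materializes each word's URL list and count.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the inverted-index MapReduce above.
public class InvertedIndexSketch {

    // "map" + "shuffle": for each (url, text) page, emit (word, url)
    // pairs and group the URLs by word.
    static Map<String, List<String>> map(Map<String, String> pages) {
        Map<String, List<String>> grouped = new TreeMap<String, List<String>>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            for (String word : page.getValue().toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    grouped.computeIfAbsent(word, w -> new ArrayList<String>()).add(page.getKey());
                }
            }
        }
        return grouped;
    }

    // "reduce": here we only compute the per-word count; the real reducer
    // would also write each (word, "links", i, url) cell as in the pseudocode.
    static Map<String, Integer> reduce(Map<String, List<String>> grouped) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        grouped.forEach((word, urls) -> counts.put(word, urls.size()));
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<String, String>();
        pages.put("url1", "nosql stores scale");
        pages.put("url2", "nosql scales reads");
        // Both pages mention "nosql", so its posting list has two URLs.
        System.out.println(map(pages).get("nosql"));
    }
}
```

The slides that follow implement exactly this flow on HBase, with `TableMapper` playing the role of `map` and `TableReducer` the role of `reduce`.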
  • 49. MapReduce inverted index: job configuration
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "Reverse Index Job");
      job.setJarByClass(InvertedIndex.class);
      Scan scan = new Scan();
      scan.setCaching(500);
      scan.addColumn("content".getBytes(), "text".getBytes());
      TableMapReduceUtil.initTableMapperJob("wikipedia", scan, Map.class, Text.class, Text.class, job);
      TableMapReduceUtil.initTableReducerJob("wikipedia_ri", Reduce.class, job);
      job.setNumReduceTasks(1);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  • 50. MapReduce inverted index: Map
    public static class Map extends TableMapper<Text, Text> {
      private static final Pattern PATTERN = Pattern.compile(SENTENCE_SPLITTER_REGEX);
      private Text key = new Text();
      private Text value = new Text();
      @Override
      protected void map(ImmutableBytesWritable rowkey, Result result, Context context) {
        byte[] b = result.getValue("content".getBytes(), "text".getBytes());
        String text = Bytes.toString(b);
        String[] words = PATTERN.split(text);
        value.set(result.getRow());
        for (String word : words) {
          key.set(word.getBytes());
          try {
            context.write(key, value);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      }
    }
  • 51. MapReduce inverted index: Reduce
    public static class Reduce extends TableReducer<Text, Text, ImmutableBytesWritable> {
      @Override
      protected void reduce(Text rowkey, Iterable<Text> values, Context context) {
        Put p = new Put(rowkey.getBytes());
        int count = 0;
        for (Text link : values) {
          p.add("links".getBytes(), Bytes.toBytes(count++), link.getBytes());
        }
        p.add("links".getBytes(), "count".getBytes(), Bytes.toBytes(count));
        try {
          context.write(new ImmutableBytesWritable(rowkey.getBytes()), p);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }
  • 52. Question: how do we rank search results by importance?
  • 56. A bit of linear algebra
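The slide itself is an image, but given the Google context the linear algebra it alludes to is presumably PageRank-style ranking: the rank vector is the dominant eigenvector of the link matrix, computed by power iteration. A small self-contained sketch (simplified: uniform damping, no dangling-node handling, invented names):

```java
import java.util.Arrays;

// Power-iteration sketch of PageRank over a tiny link graph.
// links[i] lists the pages that page i points to.
public class PageRankSketch {

    static double[] pageRank(int[][] links, int iterations, double damping) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                    // uniform start vector
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - damping) / n);      // teleport term
            for (int i = 0; i < n; i++) {
                for (int j : links[i]) {
                    // page i shares its rank equally along its out-links
                    next[j] += damping * rank[i] / links[i].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
        int[][] links = { {1, 2}, {2}, {0} };
        double[] r = pageRank(links, 50, 0.85);
        // Page 2 collects rank from both 0 and 1, so it ranks highest.
        System.out.printf("%.3f %.3f %.3f%n", r[0], r[1], r[2]);
    }
}
```

Applied to the inverted index built above, such a score lets search results for a word be sorted by the importance of the pages that contain it, rather than by raw match count.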
  • 59. Digression 1: Google@Stanford, Larry Page and Sergey Brin, 1998
    2-proc Pentium II 300 MHz, 512 MB, five 9 GB drives
    2-proc Pentium II 300 MHz, 512 MB, four 9 GB drives
    4-proc PPC 604 333 MHz, 512 MB, eight 9 GB drives
    2-proc UltraSparc II 200 MHz, 256 MB, three 9 GB drives, six 4 GB drives
    Disk expansion, eight 9 GB drives
    Disk expansion, ten 9 GB drives
    Totals: CPU 2933 MHz (across 10 CPUs), RAM 1792 MB, HD 366 GB
  • 60. Digression 2: Google, 1999, the first production server
  • 61. A formula with no math :) Data + science + insight = value
  • 64. Contents: Overview, Use cases, Architecture, Case study, Conclusion, References, Appendices. Section: Hadoop ecosystem