COURBOSPARK:
DECISION TREE FOR
TIME-SERIES ON SPARK
Christophe Salperwyck – EDF R&D
Simon Maby – OCTO Technology - @simonmaby
Xdata project: www.xdata.fr, grants from
"Investissement d'Avenir" program, 'Big Data' call
| 2
AGENDA
1. PROBLEM DESCRIPTION
2. IMPLEMENTATION
• Courbotree: presentation of the algorithm
• From MLlib to CourboSpark
3. PERFORMANCE
• Configuration (cluster description, Spark config…)
4. FEEDBACK ON SPARK/MLLIB
| 3
FRENCH METERS DATA
| 4
• 1 measurement every 10 min
• 35 million customers
• Time-series: 144 points x 365 days
→ Annual data volume: 1,800 billion records, 120 TB of raw data
BIG DATA!
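The volume figures can be checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted above; reading "120 TB" as decimal terabytes is an assumption, and the bytes-per-record value is derived here, not stated in the slides.

```python
# Back-of-the-envelope check of the annual data volume.
CUSTOMERS = 35_000_000            # 35 million customers (from the slide)
POINTS_PER_DAY = 24 * 60 // 10    # one measurement every 10 minutes
DAYS_PER_YEAR = 365

records_per_year = CUSTOMERS * POINTS_PER_DAY * DAYS_PER_YEAR

# Assumption: "120 TB" means 120 * 10**12 bytes.
RAW_BYTES = 120 * 10**12
bytes_per_record = RAW_BYTES / records_per_year

print(f"{POINTS_PER_DAY} points per day")
print(f"{records_per_year:,} records per year")
print(f"about {bytes_per_record:.0f} bytes per raw record")
```

The product comes to roughly 1.84 x 10^12 records, which the slide rounds to 1,800 billion.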
| 5
LOAD CURVES CLASSIFICATION
Contract type | Region | … | Equipment type | Load curve
9 kVA         | 75     | … | Elec           | [curve]
6 kVA         | 22     | … | Gas            | [curve]
…             | …      | … | …              | …
12 kVA        | 34     | … | Elec           | [curve]
| 6
WHY A DECISION TREE?
• Easy to understand
• Ability to explore the model
• Ability to choose the expressivity of the model
| 7
SPLIT CRITERIA: INERTIA
Goal: find the split on an explanatory feature that gives the most different curves
How to split? We can either:
• Minimize curve dispersion (intra inertia)
or
• Maximize differences between average curves (inter inertia)
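A minimal sketch of the two criteria, assuming curves are equal-length lists of floats and that inertia is measured with the squared Euclidean distance (the slides do not name the distance). The function names are illustrative, not CourboSpark's API.

```python
from typing import List, Sequence

Curve = Sequence[float]


def mean_curve(curves: List[Curve]) -> List[float]:
    """Point-wise average of a non-empty list of equal-length curves."""
    n = len(curves)
    return [sum(points) / n for points in zip(*curves)]


def sq_dist(a: Curve, b: Curve) -> float:
    """Squared Euclidean distance between two curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def intra_inertia(groups: List[List[Curve]]) -> float:
    """Dispersion of each curve around the mean curve of its own group."""
    total = 0.0
    for group in groups:
        center = mean_curve(group)
        total += sum(sq_dist(curve, center) for curve in group)
    return total


def inter_inertia(groups: List[List[Curve]]) -> float:
    """Weighted dispersion of the group mean curves around the global mean."""
    everything = [curve for group in groups for curve in group]
    center = mean_curve(everything)
    return sum(len(g) * sq_dist(mean_curve(g), center) for g in groups)


# Toy data: two-point "curves" for two equipment types.
elec = [[1.0, 3.0], [3.0, 5.0]]
gas = [[0.0, 0.0], [2.0, 0.0]]

intra = intra_inertia([elec, gas])
inter = inter_inertia([elec, gas])
total = intra_inertia([elec + gas])  # one single group = total inertia
print(intra, inter, total)
```

With this distance, intra inertia plus inter inertia equals the total inertia of the node (Huygens' decomposition), so for a given node minimizing one is the same as maximizing the other.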
| 8
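MAXIMIZE DIFFERENCES BETWEEN AVERAGE CURVES (feature: Equipment Type)
[Figure: mean load curves for Electrical and Gas equipment; x-axis: Hour, y-axis: P in W; annotation: ArgMax(d) between the two mean curves]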
| 9
EXISTING DISTRIBUTED DECISION TREE
Scalable Distributed Decision Trees in Spark MLLib
Manish Amde (Origami Logic), Hirakendu Das (Yahoo! Inc.), Evan Sparks (UC Berkeley), Ameet Talwalkar (UC Berkeley). Spark Summit 2014.
http://spark-summit.org/wp-content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf
A MapReduce Implementation of C4.5 Decision Tree Algorithm
Wei Dai, Wei Ji. International Journal of Database Theory and Application. Vol. 7, No. 1, 2014, pages 49-60.
http://www.chinacloud.cn/upload/2014-03/14031920373451.pdf
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
Biswanath Panda, Joshua S. Herbach, Sugato Basu, Roberto J. Bayardo. VLDB 2009.
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36296.pdf
Distributed Decision Tree Learning for Mining Big Data Streams
Arinto Murdopo, Master's thesis, Yahoo! Labs Barcelona, July 2013.
http://people.ac.upc.edu/leandro/emdc/arinto-emdc-thesis.pdf
| 10
MLLIB DECISION TREE PARALLELIZATION
| 11
HORIZONTAL STRATEGY
Step 1: compute average curves
Host 1: [0:10[ [10:20[   Host 2: [0:10[ [10:20[   Host 3: [0:10[ [10:20[
Step 2: collect and find the best split
Host 1: [0:10[ [10:20[
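The two steps can be simulated without a cluster. The sketch below is an assumption-laden illustration, not the MLlib or CourboSpark code: each "host" is a plain list of (feature value, curve) rows, bins have a fixed width, and candidate splits are scored by the inter inertia of the two sides, which for two groups equals n_left * n_right / (n_left + n_right) times the squared distance between their mean curves.

```python
def partial_stats(rows, bin_width):
    """Step 1, run independently on each host: per-bin (count, summed curve)."""
    stats = {}
    for feature_value, curve in rows:
        b = int(feature_value // bin_width)
        count, sums = stats.get(b, (0, [0.0] * len(curve)))
        stats[b] = (count + 1, [s + p for s, p in zip(sums, curve)])
    return stats


def merge_stats(all_stats):
    """Step 2a, on the collecting host: add up the per-host statistics."""
    merged = {}
    for stats in all_stats:
        for b, (count, sums) in stats.items():
            if b in merged:
                c0, s0 = merged[b]
                merged[b] = (c0 + count, [x + y for x, y in zip(s0, sums)])
            else:
                merged[b] = (count, list(sums))
    return merged


def best_split(merged, bin_width):
    """Step 2b: try every bin boundary, keep the one with the highest score."""
    bins = sorted(merged)
    best_score, best_threshold = float("-inf"), None
    for i in range(1, len(bins)):
        sides = []
        for side in (bins[:i], bins[i:]):
            count = sum(merged[b][0] for b in side)
            sums = [sum(col) for col in zip(*(merged[b][1] for b in side))]
            sides.append((count, [s / count for s in sums]))
        (n_l, m_l), (n_r, m_r) = sides
        dist2 = sum((x - y) ** 2 for x, y in zip(m_l, m_r))
        score = n_l * n_r / (n_l + n_r) * dist2
        if score > best_score:
            best_score, best_threshold = score, bins[i] * bin_width
    return best_threshold, best_score


# Three hypothetical hosts, each holding a horizontal slice of the rows.
host1 = [(3, [1.0, 1.0]), (12, [1.0, 3.0])]
host2 = [(7, [3.0, 1.0]), (25, [9.0, 9.0])]
host3 = [(15, [3.0, 3.0]), (28, [11.0, 9.0])]

merged = merge_stats(partial_stats(h, 10) for h in (host1, host2, host3))
threshold, score = best_split(merged, 10)
print(threshold, round(score, 2))
```

Only the per-bin counts and summed curves travel over the network, so the amount of data collected depends on the number of bins and the curve length, not on the number of rows.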
| 12
FROM MLLIB TO COURBOSPARK
To build the tree:
• Criteria: entropy, Gini, variance
• Data structure: LabeledPoint
| 13
FROM MLLIB TO COURBOSPARK
To build the tree:
• Criteria: entropy, Gini, variance, inertia (to compare time-series)
• Data structure: LabeledPoint, TimeSeries
• Finding split point for nominal features
For data visualization of the tree:
• Quantiles on the nodes and leaves
• Loss of inertia
• Number of curves per node and per leaf
| 14
DEALING WITH NOMINAL FEATURES
Current implementation for regression:
→ order the categories by their mean on the target
[Figure: categories placed along the target mean, in the order A, C, B, D]
Partitions tested: {A}/{CBD}, {AC}/{BD}, {ACB}/{D}
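A small sketch of this ordering trick for a scalar regression target. The sample values are made up so that the resulting order is A, C, B, D; the function name is illustrative.

```python
def ordered_partitions(samples):
    """Order categories by mean target, then list the n - 1 ordered splits.

    samples: iterable of (category, target) pairs.
    """
    totals = {}
    for category, target in samples:
        s, n = totals.get(category, (0.0, 0))
        totals[category] = (s + target, n + 1)
    order = sorted(totals, key=lambda c: totals[c][0] / totals[c][1])
    splits = [(order[:i], order[i:]) for i in range(1, len(order))]
    return order, splits


# Hypothetical data: means are A=1.5, C=4.0, B=5.5, D=9.0.
samples = [("A", 1.0), ("B", 5.0), ("C", 4.0), ("D", 9.0), ("A", 2.0), ("B", 6.0)]
order, splits = ordered_partitions(samples)
print(order)
for left, right in splits:
    print(left, "/", right)
```

Only n - 1 partitions are tested for n categories, which is why the trick is attractive when the target is a single number.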
| 15
NOMINAL VALUES: TYPE OF CONTRACT
4 CATEGORIES {A, B, C, D}
[Figure: categories A, B, C and D; how should they be ordered?]
| 16
DEALING WITH NOMINAL FEATURES
Hard to order curves…
Solution 1:
Compare curves 2 by 2 → {A}/{BCD}, {AB}/{CD}, {ABC}/{D}, {AC}/{BD}…
Problem:
Combinatorial problem depending on n, the number of distinct categories. Complexity is O(2^n)
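To make the blow-up concrete, the sketch below enumerates every way of cutting a set of categories into two non-empty groups. It is an illustration of the counting argument only; the generator name is hypothetical.

```python
from itertools import combinations


def binary_partitions(categories):
    """Yield each split of the categories into two non-empty groups, once."""
    cats = list(categories)
    first, rest = cats[0], cats[1:]
    # Pinning the first category to the left side avoids listing mirrored
    # splits twice; k stops before len(rest) so the right side is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = [first, *extra]
            right = [c for c in rest if c not in extra]
            yield left, right


four = list(binary_partitions("ABCD"))
for left, right in four:
    print("".join(left), "/", "".join(right))

# The count is 2**(n - 1) - 1, hence the exponential complexity.
for n in (4, 10, 20):
    print(n, 2 ** (n - 1) - 1)
```

With 4 categories there are 7 candidate splits; with 20 there are already more than half a million, each requiring average curves to be compared.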
| 17
DEALING WITH NOMINAL FEATURES
Solution 2:
Agglomerative Hierarchical Clustering. Bottom-up approach.
Complexity is O(n^3); we don’t expect n > 100
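A naive bottom-up sketch: start with one cluster per category, repeatedly merge the two clusters whose mean curves are closest, and stop when two clusters remain; those two groups form the candidate split. The squared Euclidean distance between cluster means is an assumption here, since the slides do not say which linkage CourboSpark uses.

```python
def agglomerative_split(mean_curves, counts):
    """Group categories into two clusters by successive closest-pair merges.

    mean_curves: dict category -> mean curve (list of floats)
    counts:      dict category -> number of curves behind that mean
    """
    clusters = [([cat], counts[cat], list(curve)) for cat, curve in mean_curves.items()]
    while len(clusters) > 2:
        best = None
        # O(n^2) scan for the closest pair, repeated about n times: O(n^3).
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum((x - y) ** 2 for x, y in zip(clusters[i][2], clusters[j][2]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (cats_i, n_i, m_i), (cats_j, n_j, m_j) = clusters[i], clusters[j]
        n = n_i + n_j
        # The merged mean is the count-weighted average of the two means.
        merged = (cats_i + cats_j, n, [(n_i * x + n_j * y) / n for x, y in zip(m_i, m_j)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c[0]) for c in clusters)


# Hypothetical mean curves for four contract types.
means = {"A": [1.0, 1.0], "B": [1.0, 2.0], "C": [8.0, 8.0], "D": [9.0, 8.0]}
sizes = {"A": 10, "B": 10, "C": 10, "D": 10}
groups = agglomerative_split(means, sizes)
print(groups)
```

For n around 100 this is on the order of a million distance computations per nominal feature, which stays cheap compared with the exhaustive enumeration.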
| 18
HOW TO
Algorithm parameters
Configure spark context
Load the data file
Learn the model
| 19
LOOKING FOR THE TEST CONFIGURATION
For a constant global capacity on 12 nodes:
• 120 cores + 120 GB RAM

#Executors | RAM per exec. | Cores per exec. | Performance on 100 GB data
12         | 10 GB         | 10              | 22 minutes
24         | 5 GB          | 5               | 17 minutes
60         | 2 GB          | 2               | 12 minutes
120        | 1 GB          | 1               | 15 minutes
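The four configurations all spread the same 120 cores and 120 GB over a different number of executors. The sketch below re-derives the per-executor figures and picks the fastest measured run; the timings are copied from the table, and the helper name is illustrative.

```python
TOTAL_CORES = 120
TOTAL_RAM_GB = 120


def per_executor(num_executors):
    """Return (RAM in GB, cores) per executor for a fixed global capacity."""
    if TOTAL_CORES % num_executors or TOTAL_RAM_GB % num_executors:
        raise ValueError("capacity does not divide evenly")
    return TOTAL_RAM_GB // num_executors, TOTAL_CORES // num_executors


# Measured run times on 100 GB of data, in minutes (from the table).
measured_minutes = {12: 22, 24: 17, 60: 12, 120: 15}

for executors, minutes in measured_minutes.items():
    ram, cores = per_executor(executors)
    print(f"{executors:>3} executors x ({ram} GB, {cores} cores): {minutes} min")

fastest = min(measured_minutes, key=measured_minutes.get)
print("fastest:", fastest, "executors")
```

In these measurements the best time is in the middle of the range: 60 small executors beat both 12 large ones and 120 single-core ones.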
| 20
SCALABILITY TO #CONTAINERS
| 21
SCALABILITY TO #CONTAINERS
| 22
SCALABILITY TO #CONTAINERS
| 23
SCALABILITY TO #LINES
| 24
FRAMEWORK STABILITY
Tested on:
• 10 GB, 100 GB, 200 GB, 300 GB, 400 GB, 500 GB, 1 TB
• Categorical and continuous variables
• Bin sizes from 100 to 1000
| 25
SCALABILITY TO #COLUMNS
| 26
SCALABILITY TO #CATEGORIES
| 27
| 28
REAL LIFE DATASET
[Figure: time in minutes (y-axis, 0 to 400) against data in GB (x-axis, 0 to 1400)]
• 9 executors with 20 GB and 8 cores
• 10 to 1,000 million load curves (10 numerical and 10 categorical features)
| 29
TUNING
• spark.default.parallelism
• spark.executor.memory
• spark.storage.memoryFraction
• spark.akka.frameSize
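These four properties can be passed to `spark-submit` as `--conf key=value` pairs. The sketch below only builds the flag list; every value in it is a placeholder chosen for illustration, not a setting reported in the talk. The keys are written as in the Spark 1.x configuration documentation (`spark.storage.memoryFraction`, `spark.akka.frameSize`); later Spark releases deprecated or removed the last two.

```python
def conf_flags(settings):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    flags = []
    for key, value in settings.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Placeholder values for illustration only; tune them for your own cluster.
settings = {
    "spark.default.parallelism": 240,
    "spark.executor.memory": "2g",
    "spark.storage.memoryFraction": 0.6,
    "spark.akka.frameSize": 128,
}

flags = conf_flags(settings)
print(" ".join(flags))
```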
| 30
FEEDBACK
Developers' view
• Flawless transition from local to cluster mode
• Debug mode with an IDE
• Good performance requires expertise
| 31
HEY SCALA <3
| 32
FEEDBACK
Data scientists' view
• The API is not very data oriented
• …but now we have Spark SQL and DataFrames!
• IPython + PySpark
• Feature engineering vs. model engineering
| 33
FEEDBACK
Ops view
• Better than MapReduce
• Performance is predictable for tested code
• YARNed
• Lots of releases, MLlib code is evolving quickly
| 34
FUTURE WORK
• Unbalanced trees
• Improve performance
• Other criteria for time-series comparison
• Missing values in explanatory features

Le Comptoir OCTO - Continuous discovery et continuous delivery pour construir...OCTO Technology
 
RefCard Tests sur tous les fronts
RefCard Tests sur tous les frontsRefCard Tests sur tous les fronts
RefCard Tests sur tous les frontsOCTO Technology
 
RefCard RESTful API Design
RefCard RESTful API DesignRefCard RESTful API Design
RefCard RESTful API DesignOCTO Technology
 
RefCard API Architecture Strategy
RefCard API Architecture StrategyRefCard API Architecture Strategy
RefCard API Architecture StrategyOCTO Technology
 




COURBOSPARK: DECISION TREE FOR TIME-SERIES ON SPARK

  • 1. COURBOSPARK: DECISION TREE FOR TIME-SERIES ON SPARK
      Christophe Salperwyck (EDF R&D)
      Simon Maby (OCTO Technology, @simonmaby)
      Xdata project: www.xdata.fr, grants from "Investissement d'Avenir" program, 'Big Data' call
  • 2. AGENDA
      1. PROBLEM DESCRIPTION
      2. IMPLEMENTATION
         • Courbotree: presentation of the algorithm
         • From MLlib to Courbospark
      3. PERFORMANCE
         • Configuration (cluster description, Spark config…)
      4. FEEDBACK ON SPARK/MLLIB
  • 4. • 1 measurement every 10 min
      • 35 million customers
      • Time-series: 144 points × 365 days
      → Annual data volume: 1,800 billion records, 120 TB of raw data
      BIG DATA!
  • 5. LOAD CURVES CLASSIFICATION
      Contract type   Region   …   Equipment type   Load curve
      9 kVA           75       …   Elec             [curve image]
      6 kVA           22       …   Gas              [curve image]
      …               …        …   …                …
      12 kVA          34       …   Elec             [curve image]
  • 6. WHY A DECISION TREE?
      • Easy to understand
      • Ability to explore the model
      • Ability to choose the expressivity of the model
  • 7. SPLIT CRITERIA: INERTIA
      Goal: find the curves that differ most depending on an explanatory feature.
      How to split? We can either:
      • Minimize the dispersion of curves within each group (intra inertia), or
      • Maximize the differences between the groups' average curves (inter inertia)
  • 8. MAXIMIZE DIFFERENCES BETWEEN AVERAGE CURVES (feature: Equipment Type)
      [Figure: mean load curves for Electrical and Gas; x-axis: hour; y-axis: P in W; annotation: ArgMax(d)]
  • 9. EXISTING DISTRIBUTED DECISION TREE
      • Scalable Distributed Decision Trees in Spark MLLib. Manish Amde (Origami Logic), Hirakendu Das (Yahoo! Inc.), Evan Sparks (UC Berkeley), Ameet Talwalkar (UC Berkeley). Spark Summit 2014. http://spark-summit.org/wp-content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf
      • A MapReduce Implementation of C4.5 Decision Tree Algorithm. Wei Dai, Wei Ji. International Journal of Database Theory and Application, Vol. 7, No. 1, 2014, pages 49-60. http://www.chinacloud.cn/upload/2014-03/14031920373451.pdf
      • PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Biswanath Panda, Joshua S. Herbach, Sugato Basu, Roberto J. Bayardo. VLDB 2009. http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36296.pdf
      • Distributed Decision Tree Learning for Mining Big Data Streams. Arinto Murdopo. Master thesis, Yahoo! Labs Barcelona, July 2013. http://people.ac.upc.edu/leandro/emdc/arinto-emdc-thesis.pdf
  • 10. MLLIB DECISION TREE PARALLELIZATION
  • 11. HORIZONTAL STRATEGY
      Step 1: compute average curves for each interval ([0:10[, [10:20[) on every host (Host 1, Host 2, Host 3)
      Step 2: collect the results on a single host (Host 1) and find the best split
  • 12. FROM MLLIB TO COURBOSPARK
      To build the tree:
      • Criteria: entropy, Gini, variance
      • Data structure: LabeledPoint
  • 13. FROM MLLIB TO COURBOSPARK
      To build the tree:
      • Criteria: entropy, Gini, variance, inertia (to compare time-series)
      • Data structure: LabeledPoint, TimeSeries
      • Finding split points for nominal features
      For data visualization of the tree:
      • Quantiles on the nodes and leaves
      • Loss of inertia
      • Number of curves per node and per leaf
  • 14. DEALING WITH NOMINAL FEATURES
      Current implementation for regression:
      → order the categories by their mean on the target
      [Figure: categories A, B, C, D placed along the mean target value]
      Partitions tested: {A}/{CBD}, {AC}/{BD}, {ACB}/{D}
  • 15. NOMINAL VALUES: TYPE OF CONTRACT
      4 categories {A, B, C, D}
      [Figure: curves for A, B, C, D; in what order?]
  • 16. DEALING WITH NOMINAL FEATURES
      Hard to order curves…
      Solution 1: compare curves 2 by 2
      → {A}/{BCD}, {AB}/{CD}, {ABC}/{D}, {AC}/{BD}…
      Problem: combinatorial explosion in n, the number of distinct categories. Complexity is O(2^n).
  • 17. DEALING WITH NOMINAL FEATURES
      Solution 2: agglomerative hierarchical clustering (bottom-up approach).
      Complexity is O(n^3); we don't expect n > 100.
  • 18. HOW TO
      • Algorithm parameters
      • Configure the Spark context
      • Load the data file
      • Learn the model
  • 19. LOOKING FOR THE TEST CONFIGURATION
      For a constant global capacity on 12 nodes: 120 cores + 120 GB RAM
      #Executors   RAM per exec.   Cores per exec.   Performance on 100 GB of data
      12           10 GB           10                22 minutes
      24           5 GB            5                 17 minutes
      60           2 GB            2                 12 minutes
      120          1 GB            1                 15 minutes
  • 20. SCALABILITY TO #CONTAINERS
  • 21. SCALABILITY TO #CONTAINERS
  • 22. SCALABILITY TO #CONTAINERS
  • 24. FRAMEWORK STABILITY
      Tested on:
      • 10 GB, 100 GB, 200 GB, 300 GB, 400 GB, 500 GB, 1 TB
      • Categorical and continuous variables
      • Bin sizes from 100 to 1000
  • 26. SCALABILITY TO #CATEGORIES
  • 28. REAL LIFE DATASET
      [Plot: time in minutes versus data in GB]
      • 9 executors with 20 GB and 8 cores
      • 10 to 1,000 million load curves (10 numerical and 10 categorical features)
  • 29. TUNING
      • spark.default.parallelism
      • spark.executor.memory
      • spark.storage.memoryFraction
      • spark.akka.frameSize
  • 30. FEEDBACK
      Developers' view
      • Flawless transition from local to cluster mode
      • Debug mode with an IDE
      • Good performance requires knowledge
  • 32. FEEDBACK
      Data scientists' view
      • The API is not very data-oriented
      • …but now we have Spark SQL and DataFrames!
      • IPython + PySpark
      • Feature engineering vs. model engineering
  • 33. FEEDBACK
      Ops view
      • Better than MapReduce
      • Performance is predictable for tested code
      • YARNed
      • Lots of releases; MLlib code is evolving quickly
  • 34. FUTURE WORK
      • Unbalanced trees
      • Improve performance
      • Other criteria for time-series comparison
      • Missing values in explanatory features
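
Note on slide 4: the annual volume can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on the slide. The bytes-per-record value is derived, not stated in the deck, and assumes "120 TB" means decimal terabytes.

```python
# Figures quoted on slide 4.
CUSTOMERS = 35_000_000
POINTS_PER_DAY = 24 * 60 // 10   # one measurement every 10 minutes -> 144
DAYS_PER_YEAR = 365
RAW_BYTES = 120 * 10**12         # assumption: 120 TB read as decimal terabytes

records_per_year = CUSTOMERS * POINTS_PER_DAY * DAYS_PER_YEAR
bytes_per_record = RAW_BYTES / records_per_year

print(f"{records_per_year:,} records per year")   # 1,839,600,000,000
print(f"about {bytes_per_record:.0f} bytes per record")
```

The product is roughly 1.84 trillion, which matches the "1800 billion records" on the slide.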
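
Note on slides 7 and 8: the two split criteria are linked by the usual decomposition total inertia = intra inertia + inter inertia, so for a fixed set of curves, minimizing one amounts to maximizing the other. The sketch below is a minimal single-machine illustration, not the Courbospark code. It assumes squared Euclidean distance between curves, and the function names and toy values are hypothetical.

```python
from typing import Dict, List, Sequence

Curve = Sequence[float]


def mean_curve(curves: List[Curve]) -> List[float]:
    """Point-wise average of equally long curves."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]


def squared_distance(a: Curve, b: Curve) -> float:
    """Squared Euclidean distance between two curves (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inertias(groups: Dict[str, List[Curve]]) -> Dict[str, float]:
    """Intra, inter and total inertia for curves grouped by a feature value."""
    all_curves = [c for curves in groups.values() for c in curves]
    global_mean = mean_curve(all_curves)
    intra = 0.0
    inter = 0.0
    for curves in groups.values():
        group_mean = mean_curve(curves)
        # Dispersion of the curves around their own group's average curve.
        intra += sum(squared_distance(c, group_mean) for c in curves)
        # Weighted distance of the group's average curve to the global average.
        inter += len(curves) * squared_distance(group_mean, global_mean)
    total = sum(squared_distance(c, global_mean) for c in all_curves)
    return {"intra": intra, "inter": inter, "total": total}


# Toy 4-point load curves (hypothetical values, not EDF data).
by_equipment = {
    "Elec": [[1.0, 3.0, 5.0, 3.0], [1.0, 5.0, 7.0, 3.0]],
    "Gas": [[1.0, 1.0, 2.0, 1.0], [1.0, 1.0, 2.0, 3.0]],
}
result = inertias(by_equipment)
print(result)
```

On this toy data the intra inertia is 6 and the inter inertia is 26, which sum to the total inertia of 32.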
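
Note on slide 14: for a scalar regression target, ordering the n categories by their mean target value leaves only n - 1 candidate binary partitions to test. The sketch below illustrates that idea in plain Python. It is not MLlib code, and the mean values are hypothetical, chosen so that the order is A, C, B, D as on the slide.

```python
from typing import Dict, List, Tuple


def ordered_partitions(mean_target: Dict[str, float]) -> List[Tuple[List[str], List[str]]]:
    """Sort categories by mean target, then cut the ordered list at each position."""
    ordered = sorted(mean_target, key=mean_target.get)
    return [(ordered[:k], ordered[k:]) for k in range(1, len(ordered))]


# Hypothetical per-category means of the target.
mean_target = {"A": 1.0, "B": 3.0, "C": 2.0, "D": 4.0}
partitions = ordered_partitions(mean_target)
for left, right in partitions:
    print("{" + "".join(left) + "}/{" + "".join(right) + "}")
# {A}/{CBD}
# {AC}/{BD}
# {ACB}/{D}
```

This shortcut relies on the target being a single number. A whole curve has no natural order, which is the difficulty raised on slide 16.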
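
Note on slide 17: one way to turn agglomerative clustering into a binary split is to merge categories bottom-up until two groups remain. The slides do not say which distance or linkage Courbospark uses. This sketch assumes squared Euclidean distance between count-weighted mean curves, and all names and values are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence, Tuple

# (categories in the cluster, number of curves, mean curve)
Cluster = Tuple[FrozenSet[str], int, List[float]]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_group_split(
    means: Dict[str, Sequence[float]], counts: Dict[str, int]
) -> List[FrozenSet[str]]:
    """Merge the two closest clusters repeatedly until only two groups remain."""
    clusters: List[Cluster] = [
        (frozenset([cat]), counts[cat], list(curve)) for cat, curve in means.items()
    ]
    while len(clusters) > 2:
        # Closest pair of clusters, measured between their mean curves.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: squared_distance(clusters[pair[0]][2], clusters[pair[1]][2]),
        )
        (cats_i, n_i, mean_i), (cats_j, n_j, mean_j) = clusters[i], clusters[j]
        # The merged mean curve is the count-weighted average of the two means.
        merged_mean = [(n_i * x + n_j * y) / (n_i + n_j) for x, y in zip(mean_i, mean_j)]
        merged: Cluster = (cats_i | cats_j, n_i + n_j, merged_mean)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [cats for cats, _, _ in clusters]


# Toy 2-point mean curves per contract category (hypothetical values).
means = {"A": [1.0, 1.0], "B": [1.0, 2.0], "C": [8.0, 8.0], "D": [9.0, 8.0]}
counts = {"A": 10, "B": 10, "C": 10, "D": 10}
split = two_group_split(means, counts)
print([sorted(group) for group in split])
```

Each pass scans all pairs of clusters and there are at most n - 2 merges, so this naive version does on the order of n^3 distance evaluations, in line with the O(n^3) on the slide.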