SlideShare uma empresa Scribd logo
1 de 43
Baixar para ler offline
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 1
Université Paris-Saclay / CNRS
BALÁZS KÉGL
RAMP	

DATA CHALLENGES WITH	

MODULARIZATION AND CODE SUBMISSION	

LESSONS LEARNED
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• A short history of RAMPs
• motivations, design principles, and the current tool
• Three data challenges	

• anomaly detection in the LHC ATLAS detector	

• classifying and quantifying drug preparations for cancer therapy
• time series forecasting of El Niño	

• What have we learned?	

• number of participants, incentives?	

• open vs closed?	

• blending vs human ingenuity
2
OUTLINE
Center for Data Science
Paris-Saclay
UNIVERSITÉ PARIS-SACLAY
3
+ horizontal multi-disciplinary and multi-partner
initiatives to create cohesion
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
Biology & bioinformatics
IBISC/UEvry
LRI/UPSud
Hepatinov
CESP/UPSud-UVSQ-Inserm
IGM-I2BC/UPSud
MIA/Agro
MIAj-MIG/INRA
LMAS/Centrale
Chemistry
EA4041/UPSud
Earth sciences
LATMOS/UVSQ
GEOPS/UPSud
IPSL/UVSQ
LSCE/UVSQ
LMD/Polytechnique
Economy
LM/ENSAE
RITM/UPSud
LFA/ENSAE
Neuroscience
UNICOG/Inserm
U1000/Inserm
NeuroSpin/CEA
Particle physics
astrophysics &
cosmology
LPP/Polytechnique
DMPH/ONERA
CosmoStat/CEA
IAS/UPSud
AIM/CEA
LAL/UPSud
250researchers in 35laboratories
Machine learning
LRI/UPSud
LTCI/Telecom
CMLA/Cachan
LS/ENSAE
LIX/Polytechnique
MIA/Agro
CMA/Polytechnique
LSS/Supélec
CVN/Centrale
LMAS/Centrale
DTIM/ONERA
IBISC/UEvry
Visualization
INRIA
LIMSI
Signal processing
LTCI/Telecom
CMA/Polytechnique
CVN/Centrale
LSS/Supélec
CMLA/Cachan
LIMSI
DTIM/ONERA
Statistics
LMO/UPSud
LS/ENSAE
LSS/Supélec
CMA/Polytechnique
LMAS/Centrale
MIA/AgroParisTech
machine learning
information retrieval
signal processing
data visualization
databases
Domain science
human society
life
brain
earth
universe
Tool building
software engineering
clouds/grids
high-performance
computing
optimization
Domain scientistSoftware engineer
datascience-paris-saclay.fr
LIST/CEA
4
Center for Data Science
Paris-Saclay
A multi-disciplinary initiative, building interfaces, matching
people, helping them launching projects
345 affiliated researchers, 50 laboratories
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 5
Data scientist
Data trainer
Applied scientist
Domain expertSoftware engineer
Data engineer
Tool building Data domains
Data science
statistics

machine learning
information retrieval
signal processing
data visualization
databases
• interdisciplinary projects
• data challenges
• ultrawalls and interactive visualization
• coding sprints
• Open Software Initiative
• code consolidator and engineering projects
software engineering

clouds/grids
high-performance

computing
optimization
energy and physical sciences
health and life sciences
Earth and environment
economy and society
brain
!
• data science RAMPs and TSs
• IT platform for linked data
THE DATA SCIENCE ECOSYSTEM
https://medium.com/@balazskegl/the-data-science-ecosystem-678459ba6013
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• We have realized how much data scientists are ill-equipped
to manage the data science development process
• collaborating with domain scientists or business units on data-driven
problem/product formulation
• bench for logging experiments, guiding (human) model search and tuning
• managing data science teams and collaborating with each other
• collaborating with AI (out-of-the-box AutoML does not work: the search
space is too big)	

• managing the productionalization of a prototype model
6
LACK OF TOOLS
7
RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
http://www.ramp.studio
• Organizers have no direct access to solutions
• Emphasize competition: participants cannot build on each
other’s solutions
• No modularization: ideas go unnoticed unless packaged into
a top submission
8
LIMITATIONS OF DATA CHALLENGES
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• Challenge with code submission
• Following Nielsen’s three crowdsourcing
principles:	

• modularity: workflows are sliced into workflow
element modules that can be tackled independently	

• encourage small contributions: e.g., copy another
submission, add features, change the
hyperparameters, resubmit	

• rich and well structured information commons:
open and download each other’s code, discuss on
slack
9
RAMP
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• Roughly two formats	

• single day hackatons with 20-50 participants, open leaderboard, 15
minute timeout	

• 1-3 week course challenges up 150 students (but no limit really):
closed phase with 1-3 submissions per day followed by an open
phase with 15 minute timeout	

• 600+ users, 5000+ models
10
RAMP	

RAPID ANALYTICS AND MODEL PROTOTYPING
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
RAMP	

RAPID ANALYTICS AND MODEL PROTOTYPING
11
frontend
DB
backend
users
submissions
score
problems
workflow
starting kit
crossval
data workflow
train+test+blend
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
RAMP	

12
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
RAMP	

13
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
RAMP	

14
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
RAMP	

15
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
RAMP	

16
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
RAMP	

17
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
RAMP	

18
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
RAMP	

19
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
RAMP	

20
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
RAMP	

21
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
RAMP	

22
23
Three recent RAMPs
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
ANOMALY DETECTION IN THE LHC
ATLAS DETECTOR	

24
reconstruction
+simulated anomalies
classifier
anomaly
(isSkewed = 1)
correct

(isSkewed = 0)
?
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
FORECASTING EL NINO SIX MONTHS 	

AHEAD
25
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
FORECASTING EL NINO SIX MONTHS 	

AHEAD
26
…
300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50… …
feature
extractor
x
(a fixed length feature vector) regressor
• We give the full series to the feature extractor	

• It could look ahead in the future (even inadvertently)	

• Checking lookahead by a randomized test
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
CLASSIFYING AND REGRESSING ON
MOLECULAR SPECTRA
27
chemotherapy
drug in
elastic pocket
laser
spectrometer
molecular
spectra
feature
extractor 1
feature
extractor 2
regressor
concentration
classifier
drug type
28
Analyzing the analysis
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
OPEN PHASE LETS PARTICIPANTS CATCH UP
29
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 30
T-SNE ON TEST PREDICTIONS	

starting kit
the crowd
early influencers
inventors
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 31
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 32
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 33
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 34
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 35
the single day hackaton ceiling
what you achieved with a well tuned deep net
the diversity gap
the human blender gap
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 36
the single day hackaton floor
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 37
the single day hackaton floor
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• Open phase helps novice participants to catch up: the goal of teaching!
• Sometimes also makes the best and blended score better	

• Human blending often beats machine blending	

• Human feature engineering easily beats deep learning on some data
• Course RAMPs beat single day hackatons significantly	

• larger number of students?	

• longer RAMPs?	

• novice and master-level students are better than data science researchers?	

• stronger incentives?	

• closed phase preceding an open phase (vs pure open RAMP) helps to create diversity?
38
WHAT WE LEARNED
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 39
closed phase followed by open
the second diversity gap
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 40
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 41
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• Fast development of analytics solutions	

• Teaching support	

• Networking	

• Support for collaborative team work
42
THE RAMP TOOL
A prototyping tool for collaborative
development of data science workflows
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• More RAMPs, sign up at http://www.ramp.studio

if interested	

• Tools to manage the crowd, redirect group attention,
etc.	

• Human - AI interaction: “hyperopt this module for me,
would you?”	

• Streamlining deployment
43
WHAT’S NEXT

Mais conteúdo relacionado

Destaque

Brochure Meca-19102016-bd
Brochure Meca-19102016-bdBrochure Meca-19102016-bd
Brochure Meca-19102016-bd
Camille Volant
 
Animation obtention, conversion et séparation des aromatiques
Animation obtention, conversion et séparation des aromatiquesAnimation obtention, conversion et séparation des aromatiques
Animation obtention, conversion et séparation des aromatiques
Tarik Taleb Bendiab
 
Baroffio y karsa
Baroffio y karsaBaroffio y karsa
Baroffio y karsa
jeanpyXD
 
Chromium and insulin sensitivity
Chromium and insulin sensitivityChromium and insulin sensitivity
Chromium and insulin sensitivity
fpgg
 
Ig system & equipment oil tankers
Ig system &  equipment oil tankersIg system &  equipment oil tankers
Ig system & equipment oil tankers
jabbar2002pk200
 

Destaque (18)

Santosh_Kr_Yadav_RAIM08
Santosh_Kr_Yadav_RAIM08Santosh_Kr_Yadav_RAIM08
Santosh_Kr_Yadav_RAIM08
 
Brochure Meca-19102016-bd
Brochure Meca-19102016-bdBrochure Meca-19102016-bd
Brochure Meca-19102016-bd
 
Protection des métaux contre la corrosion
Protection des métaux contre la corrosionProtection des métaux contre la corrosion
Protection des métaux contre la corrosion
 
TALAT Lecture 5203: Anodizing of Aluminium
TALAT Lecture 5203: Anodizing of AluminiumTALAT Lecture 5203: Anodizing of Aluminium
TALAT Lecture 5203: Anodizing of Aluminium
 
Hot & Sexy Naila Nayem | You can't Believe She is a Bangladeshi Ramp Model
Hot & Sexy Naila Nayem | You can't Believe She is a Bangladeshi Ramp ModelHot & Sexy Naila Nayem | You can't Believe She is a Bangladeshi Ramp Model
Hot & Sexy Naila Nayem | You can't Believe She is a Bangladeshi Ramp Model
 
Présentation de la plate-forme d'éco-conception CORINE
Présentation de la plate-forme d'éco-conception CORINEPrésentation de la plate-forme d'éco-conception CORINE
Présentation de la plate-forme d'éco-conception CORINE
 
Animation obtention, conversion et séparation des aromatiques
Animation obtention, conversion et séparation des aromatiquesAnimation obtention, conversion et séparation des aromatiques
Animation obtention, conversion et séparation des aromatiques
 
Baroffio y karsa
Baroffio y karsaBaroffio y karsa
Baroffio y karsa
 
Deep oxidation of heterogeneous VOCs: practice and feedback
Deep oxidation of heterogeneous VOCs: practice and feedbackDeep oxidation of heterogeneous VOCs: practice and feedback
Deep oxidation of heterogeneous VOCs: practice and feedback
 
Effect of Nanoporous Anodic Aluminum Oxide (AAO) Characteristics On Solar Abs...
Effect of Nanoporous Anodic Aluminum Oxide (AAO) Characteristics On Solar Abs...Effect of Nanoporous Anodic Aluminum Oxide (AAO) Characteristics On Solar Abs...
Effect of Nanoporous Anodic Aluminum Oxide (AAO) Characteristics On Solar Abs...
 
Chromium and insulin sensitivity
Chromium and insulin sensitivityChromium and insulin sensitivity
Chromium and insulin sensitivity
 
Chapter 5
Chapter 5Chapter 5
Chapter 5
 
Science of corrosion
Science of corrosionScience of corrosion
Science of corrosion
 
Ig system & equipment oil tankers
Ig system &  equipment oil tankersIg system &  equipment oil tankers
Ig system & equipment oil tankers
 
40 cfr 261.4(b)(6) The RCRA Exclusion From Hazardous Waste for Trivalent Chro...
40 cfr 261.4(b)(6) The RCRA Exclusion From Hazardous Waste for Trivalent Chro...40 cfr 261.4(b)(6) The RCRA Exclusion From Hazardous Waste for Trivalent Chro...
40 cfr 261.4(b)(6) The RCRA Exclusion From Hazardous Waste for Trivalent Chro...
 
Removal of chromium
Removal of chromiumRemoval of chromium
Removal of chromium
 
Principles of corrosion
Principles of corrosionPrinciples of corrosion
Principles of corrosion
 
10 major industrial applications of sulfuric acid
10 major industrial applications of sulfuric acid10 major industrial applications of sulfuric acid
10 major industrial applications of sulfuric acid
 

Mais de Balázs Kégl

Model-based reinforcement learning and self-driving engineering systems
Model-based reinforcement learning and self-driving engineering systemsModel-based reinforcement learning and self-driving engineering systems
Model-based reinforcement learning and self-driving engineering systems
Balázs Kégl
 
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
Balázs Kégl
 

Mais de Balázs Kégl (13)

Data-driven hypothesis generation using deep neural nets
Data-driven hypothesis generation using deep neural netsData-driven hypothesis generation using deep neural nets
Data-driven hypothesis generation using deep neural nets
 
Model-based reinforcement learning and self-driving engineering systems
Model-based reinforcement learning and self-driving engineering systemsModel-based reinforcement learning and self-driving engineering systems
Model-based reinforcement learning and self-driving engineering systems
 
Managing the AI process: putting humans (back) in the loop
Managing the AI process: putting humans (back) in the loopManaging the AI process: putting humans (back) in the loop
Managing the AI process: putting humans (back) in the loop
 
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
 
Machine learning in scientific workflows
Machine learning in scientific workflowsMachine learning in scientific workflows
Machine learning in scientific workflows
 
A historical introduction to deep learning: hardware, data, and tricks
A historical introduction to deep learning: hardware, data, and tricksA historical introduction to deep learning: hardware, data, and tricks
A historical introduction to deep learning: hardware, data, and tricks
 
Build your own data challenge, or just organize team work
Build your own data challenge, or just organize team workBuild your own data challenge, or just organize team work
Build your own data challenge, or just organize team work
 
RAMP: Collaborative challenge with code submission
RAMP: Collaborative challenge with code submissionRAMP: Collaborative challenge with code submission
RAMP: Collaborative challenge with code submission
 
Deep learning and the systemic challenges of data science initiatives
Deep learning and the systemic challenges of data science initiativesDeep learning and the systemic challenges of data science initiatives
Deep learning and the systemic challenges of data science initiatives
 
What is wrong with data challenges
What is wrong with data challengesWhat is wrong with data challenges
What is wrong with data challenges
 
The systemic challenges in data science initiatives (and some solutions)
The systemic challenges in data science initiatives (and some solutions)The systemic challenges in data science initiatives (and some solutions)
The systemic challenges in data science initiatives (and some solutions)
 
Learning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physicsLearning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physics
 
The Paris-Saclay Center for Data Science
The Paris-Saclay Center for Data ScienceThe Paris-Saclay Center for Data Science
The Paris-Saclay Center for Data Science
 

Último

Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Sérgio Sacani
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
RizalinePalanog2
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
Sérgio Sacani
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
AlMamun560346
 

Último (20)

Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Creating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsCreating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening Designs
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 

RAMP: data challenges with modularization and code submission

  • 1. Center for Data Science Paris-Saclay B. Kégl (CNRS) 1 Université Paris-Saclay / CNRS BALÁZS KÉGL RAMP DATA CHALLENGES WITH MODULARIZATION AND CODE SUBMISSION LESSONS LEARNED
  • 2. Center for Data Science Paris-Saclay B. Kégl (CNRS) • A short history of RAMPs • motivations, design principles, and the current tool • Three data challenges • anomaly detection in the LHC ATLAS detector • classifying and quantifying drug preparations for cancer therapy • time series forecasting of El Niño • What have we learned? • number of participants, incentives? • open vs closed? • blending vs human ingenuity 2 OUTLINE
  • 3. Center for Data Science Paris-Saclay UNIVERSITÉ PARIS-SACLAY 3 + horizontal multi-disciplinary and multi-partner initiatives to create cohesion
  • 4. Center for Data Science Paris-Saclay B. Kégl (CNRS) Biology & bioinformatics IBISC/UEvry LRI/UPSud Hepatinov CESP/UPSud-UVSQ-Inserm IGM-I2BC/UPSud MIA/Agro MIAj-MIG/INRA LMAS/Centrale Chemistry EA4041/UPSud Earth sciences LATMOS/UVSQ GEOPS/UPSud IPSL/UVSQ LSCE/UVSQ LMD/Polytechnique Economy LM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA CosmoStat/CEA IAS/UPSud AIM/CEA LAL/UPSud 250researchers in 35laboratories Machine learning LRI/UPSud LTCI/Telecom CMLA/Cachan LS/ENSAE LIX/Polytechnique MIA/Agro CMA/Polytechnique LSS/Supélec CVN/Centrale LMAS/Centrale DTIM/ONERA IBISC/UEvry Visualization INRIA LIMSI Signal processing LTCI/Telecom CMA/Polytechnique CVN/Centrale LSS/Supélec CMLA/Cachan LIMSI DTIM/ONERA Statistics LMO/UPSud LS/ENSAE LSS/Supélec CMA/Polytechnique LMAS/Centrale MIA/AgroParisTech machine learning information retrieval signal processing data visualization databases Domain science human society life brain earth universe Tool building software engineering clouds/grids high-performance computing optimization Domain scientistSoftware engineer datascience-paris-saclay.fr LIST/CEA 4 Center for Data Science Paris-Saclay A multi-disciplinary initiative, building interfaces, matching people, helping them launching projects 345 affiliated researchers, 50 laboratories
  • 5. Center for Data Science Paris-Saclay B. Kégl (CNRS) 5 Data scientist Data trainer Applied scientist Domain expertSoftware engineer Data engineer Tool building Data domains Data science statistics
 machine learning information retrieval signal processing data visualization databases • interdisciplinary projects • data challenges • ultrawalls and interactive visualization • coding sprints • Open Software Initiative • code consolidator and engineering projects software engineering
 clouds/grids high-performance
 computing optimization energy and physical sciences health and life sciences Earth and environment economy and society brain ! • data science RAMPs and TSs • IT platform for linked data THE DATA SCIENCE ECOSYSTEM https://medium.com/@balazskegl/the-data-science-ecosystem-678459ba6013
  • 6. Center for Data Science Paris-Saclay B. Kégl (CNRS) • We have realized how much data scientists are ill-equipped to manage the data science development process • collaborating with domain scientists or business units on data-driven problem/product formulation • bench for logging experiments, guiding (human) model search and tuning • managing data science teams and collaborating with each other • collaborating with AI (out-of-the-box AutoML does not work: the search space is too big) • managing the productionalization of a prototype model 6 LACK OF TOOLS
  • 7. 7 RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP) http://www.ramp.studio
  • 8. • Organizers have no direct access to solutions • Emphasize competition: participants cannot build on each other’s solutions • No modularization: ideas go unnoticed unless packaged into a top submission 8 LIMITATIONS OF DATA CHALLENGES
  • 9. Center for Data Science Paris-Saclay B. Kégl (CNRS) • Challenge with code submission • Following Nielsen’s three crowdsourcing principles: • modularity: workflows are sliced into workflow element modules that can be tackled independently • encourage small contributions: e.g., copy another submission, add features, change the hyperparameters, resubmit • rich and well structured information commons: open and download each other’s code, discuss on slack 9 RAMP
  • 10. Center for Data Science Paris-Saclay B. Kégl (CNRS) • Roughly two formats • single day hackatons with 20-50 participants, open leaderboard, 15 minute timeout • 1-3 week course challenges up 150 students (but no limit really): closed phase with 1-3 submissions per day followed by an open phase with 15 minute timeout • 600+ users, 5000+ models 10 RAMP RAPID ANALYTICS AND MODEL PROTOTYPING
  • 11. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP RAPID ANALYTICS AND MODEL PROTOTYPING 11 frontend DB backend users submissions score problems workflow starting kit crossval data workflow train+test+blend
  • 12. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 12
  • 13. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 13
  • 14. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 14
  • 15. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 15
  • 16. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 16
  • 17. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 17
  • 18. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 18
  • 19. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 19
  • 20. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 20
  • 21. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 21
  • 22. Center for Data Science Paris-Saclay B. Kégl (CNRS) RAMP 22
  • 24. Center for Data Science Paris-Saclay B. Kégl (CNRS) ANOMALY DETECTION IN THE LHC ATLAS DETECTOR 24 reconstruction +simulated anomalies classifier anomaly (isSkewed = 1) correct
 (isSkewed = 0) ?
  • 25. Center for Data Science Paris-Saclay B. Kégl (CNRS) FORECASTING EL NINO SIX MONTHS AHEAD 25
  • 26. Center for Data Science Paris-Saclay B. Kégl (CNRS) FORECASTING EL NINO SIX MONTHS AHEAD 26 … 300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50… … feature extractor x (a fixed length feature vector) regressor • We give the full series to the feature extractor • It could look ahead in the future (even inadvertently) • Checking lookahead by a randomized test
  • 27. Center for Data Science Paris-Saclay B. Kégl (CNRS) CLASSIFYING AND REGRESSING ON MOLECULAR SPECTRA 27 chemotherapy drug in elastic pocket laser spectrometer molecular spectra feature extractor 1 feature extractor 2 regressor concentration classifier drug type
  • 29. Center for Data Science Paris-Saclay B. Kégl (CNRS) OPEN PHASE LETS PARTICIPANTS CATCH UP 29
  • 30. Center for Data Science Paris-Saclay B. Kégl (CNRS) 30 T-SNE ON TEST PREDICTIONS starting kit the crowd early influencers inventors
  • 31. Center for Data Science Paris-Saclay B. Kégl (CNRS) 31
  • 32. Center for Data Science Paris-Saclay B. Kégl (CNRS) 32
  • 33. Center for Data Science Paris-Saclay B. Kégl (CNRS) 33
  • 34. Center for Data Science Paris-Saclay B. Kégl (CNRS) 34
  • 35. Center for Data Science Paris-Saclay B. Kégl (CNRS) 35 the single day hackaton ceiling what you achieved with a well tuned deep net the diversity gap the human blender gap
  • 36. Center for Data Science Paris-Saclay B. Kégl (CNRS) 36 the single day hackaton floor
  • 37. Center for Data Science Paris-Saclay B. Kégl (CNRS) 37 the single day hackaton floor
  • 38. Center for Data Science Paris-Saclay B. Kégl (CNRS) • Open phase helps novice participants to catch up: the goal of teaching! • Sometimes also makes the best and blended score better • Human blending often beats machine blending • Human feature engineering easily beats deep learning on some data • Course RAMPs beat single day hackatons significantly • larger number of students? • longer RAMPs? • novice and master-level students are better than data science researchers? • stronger incentives? • closed phase preceding an open phase (vs pure open RAMP) helps to create diversity? 38 WHAT WE LEARNED
  • 39. Center for Data Science Paris-Saclay B. Kégl (CNRS) 39 closed phase followed by open the second diversity gap
  • 40. Center for Data Science Paris-Saclay B. Kégl (CNRS) 40
  • 41. Center for Data Science Paris-Saclay B. Kégl (CNRS) 41
  • 42. Center for Data Science Paris-Saclay B. Kégl (CNRS) • Fast development of analytics solutions • Teaching support • Networking • Support for collaborative team work 42 THE RAMP TOOL A prototyping tool for collaborative development of data science workflows
  • 43. Center for Data Science Paris-Saclay B. Kégl (CNRS) • More RAMPs, sign up at http://www.ramp.studio
 if interested • Tools to manage the crowd, redirect group attention, etc. • Human - AI interaction: “hyperopt this module for me, would you?” • Streamlining deployment 43 WHAT’S NEXT