http://www.ramp.studio
1) A short history of RAMPs: Rapid Analytics and Model Prototyping
a) motivations
b) design principles
c) the current tool
2) Three data challenges
a) anomaly detection in the LHC ATLAS detector
b) classifying and regressing on molecular spectra
c) time series forecasting of El Niño
3) What have we learned?
a) Course RAMPs beat single day hackatons significantly
i) larger number of students?
ii) longer RAMPs?
iii) M.Sc. students are better than data science researchers?
iv) stronger incentives?
v) closed phase preceding an open phase (vs pure open RAMP) helps to create diversity?
b) Open phase helps novice participants to catch up: the goal of teaching!
c) Sometimes also makes the best and blended score better
d) Human blending often beats machine blending
e) Human feature engineering easily beats deep learning on some data
RAMP: data challenges with modularization and code submission
1. Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 1
Université Paris-Saclay / CNRS
BALÁZS KÉGL
RAMP
DATA CHALLENGES WITH
MODULARIZATION AND CODE SUBMISSION
LESSONS LEARNED
2. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• A short history of RAMPs
• motivations, design principles, and the current tool
• Three data challenges
• anomaly detection in the LHC ATLAS detector
• classifying and quantifying drug preparations for cancer therapy
• time series forecasting of El Niño
• What have we learned?
• number of participants, incentives?
• open vs closed?
• blending vs human ingenuity
2
OUTLINE
3. Center for Data Science
Paris-Saclay
UNIVERSITÉ PARIS-SACLAY
3
+ horizontal multi-disciplinary and multi-partner
initiatives to create cohesion
4. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
Biology & bioinformatics
IBISC/UEvry
LRI/UPSud
Hepatinov
CESP/UPSud-UVSQ-Inserm
IGM-I2BC/UPSud
MIA/Agro
MIAj-MIG/INRA
LMAS/Centrale
Chemistry
EA4041/UPSud
Earth sciences
LATMOS/UVSQ
GEOPS/UPSud
IPSL/UVSQ
LSCE/UVSQ
LMD/Polytechnique
Economy
LM/ENSAE
RITM/UPSud
LFA/ENSAE
Neuroscience
UNICOG/Inserm
U1000/Inserm
NeuroSpin/CEA
Particle physics
astrophysics &
cosmology
LPP/Polytechnique
DMPH/ONERA
CosmoStat/CEA
IAS/UPSud
AIM/CEA
LAL/UPSud
250researchers in 35laboratories
Machine learning
LRI/UPSud
LTCI/Telecom
CMLA/Cachan
LS/ENSAE
LIX/Polytechnique
MIA/Agro
CMA/Polytechnique
LSS/Supélec
CVN/Centrale
LMAS/Centrale
DTIM/ONERA
IBISC/UEvry
Visualization
INRIA
LIMSI
Signal processing
LTCI/Telecom
CMA/Polytechnique
CVN/Centrale
LSS/Supélec
CMLA/Cachan
LIMSI
DTIM/ONERA
Statistics
LMO/UPSud
LS/ENSAE
LSS/Supélec
CMA/Polytechnique
LMAS/Centrale
MIA/AgroParisTech
machine learning
information retrieval
signal processing
data visualization
databases
Domain science
human society
life
brain
earth
universe
Tool building
software engineering
clouds/grids
high-performance
computing
optimization
Domain scientistSoftware engineer
datascience-paris-saclay.fr
LIST/CEA
4
Center for Data Science
Paris-Saclay
A multi-disciplinary initiative, building interfaces, matching
people, helping them launching projects
345 affiliated researchers, 50 laboratories
5. Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 5
Data scientist
Data trainer
Applied scientist
Domain expertSoftware engineer
Data engineer
Tool building Data domains
Data science
statistics
machine learning
information retrieval
signal processing
data visualization
databases
• interdisciplinary projects
• data challenges
• ultrawalls and interactive visualization
• coding sprints
• Open Software Initiative
• code consolidator and engineering projects
software engineering
clouds/grids
high-performance
computing
optimization
energy and physical sciences
health and life sciences
Earth and environment
economy and society
brain
!
• data science RAMPs and TSs
• IT platform for linked data
THE DATA SCIENCE ECOSYSTEM
https://medium.com/@balazskegl/the-data-science-ecosystem-678459ba6013
6. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• We have realized how much data scientists are ill-equipped
to manage the data science development process
• collaborating with domain scientists or business units on data-driven
problem/product formulation
• bench for logging experiments, guiding (human) model search and tuning
• managing data science teams and collaborating with each other
• collaborating with AI (out-of-the-box AutoML does not work: the search
space is too big)
• managing the productionalization of a prototype model
6
LACK OF TOOLS
8. • Organizers have no direct access to solutions
• Emphasize competition: participants cannot build on each
other’s solutions
• No modularization: ideas go unnoticed unless packaged into
a top submission
8
LIMITATIONS OF DATA CHALLENGES
9. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• Challenge with code submission
• Following Nielsen’s three crowdsourcing
principles:
• modularity: workflows are sliced into workflow
element modules that can be tackled independently
• encourage small contributions: e.g., copy another
submission, add features, change the
hyperparameters, resubmit
• rich and well structured information commons:
open and download each other’s code, discuss on
slack
9
RAMP
10. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• Roughly two formats
• single day hackatons with 20-50 participants, open leaderboard, 15
minute timeout
• 1-3 week course challenges up 150 students (but no limit really):
closed phase with 1-3 submissions per day followed by an open
phase with 15 minute timeout
• 600+ users, 5000+ models
10
RAMP
RAPID ANALYTICS AND MODEL PROTOTYPING
11. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
RAMP
RAPID ANALYTICS AND MODEL PROTOTYPING
11
frontend
DB
backend
users
submissions
score
problems
workflow
starting kit
crossval
data workflow
train+test+blend
24. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
ANOMALY DETECTION IN THE LHC
ATLAS DETECTOR
24
reconstruction
+simulated anomalies
classifier
anomaly
(isSkewed = 1)
correct
(isSkewed = 0)
?
25. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
FORECASTING EL NINO SIX MONTHS
AHEAD
25
26. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
FORECASTING EL NINO SIX MONTHS
AHEAD
26
…
300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50… …
feature
extractor
x
(a fixed length feature vector) regressor
• We give the full series to the feature extractor
• It could look ahead in the future (even inadvertently)
• Checking lookahead by a randomized test
27. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
CLASSIFYING AND REGRESSING ON
MOLECULAR SPECTRA
27
chemotherapy
drug in
elastic pocket
laser
spectrometer
molecular
spectra
feature
extractor 1
feature
extractor 2
regressor
concentration
classifier
drug type
35. Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 35
the single day hackaton ceiling
what you achieved with a well tuned deep net
the diversity gap
the human blender gap
36. Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 36
the single day hackaton floor
37. Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 37
the single day hackaton floor
38. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• Open phase helps novice participants to catch up: the goal of teaching!
• Sometimes also makes the best and blended score better
• Human blending often beats machine blending
• Human feature engineering easily beats deep learning on some data
• Course RAMPs beat single day hackatons significantly
• larger number of students?
• longer RAMPs?
• novice and master-level students are better than data science researchers?
• stronger incentives?
• closed phase preceding an open phase (vs pure open RAMP) helps to create diversity?
38
WHAT WE LEARNED
39. Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 39
closed phase followed by open
the second diversity gap
42. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• Fast development of analytics solutions
• Teaching support
• Networking
• Support for collaborative team work
42
THE RAMP TOOL
A prototyping tool for collaborative
development of data science workflows
43. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
• More RAMPs, sign up at http://www.ramp.studio
if interested
• Tools to manage the crowd, redirect group attention,
etc.
• Human - AI interaction: “hyperopt this module for me,
would you?”
• Streamlining deployment
43
WHAT’S NEXT