1. WHAT IS WRONG WITH DATA CHALLENGES
THE HIGGSML STORY: THE GOOD, THE BAD AND THE UGLY
BALÁZS KÉGL
CNRS & Université Paris-Saclay
Center for Data Science, Paris-Saclay
2. Why am I so critical?
Why do I downplay our own success with the HiggsML challenge?
3. Because I believe that there is enormous potential in open innovation/crowdsourcing in science.
The current data challenge format is a single point in that landscape.
4. INTERMEDIARIES: THE GROWING INTEREST FOR «CROWDS» -> EXPLOSION OF TOOLS
(Olga Kokshagina, 2015)
• Crowdsourcing is a model that leverages novel technologies (web 2.0, mobile apps, social networks)
• to build content and a structured set of information by gathering contributions from large groups of individuals
5. CROWDSOURCING ANNOTATION
6. CROWDSOURCING COLLECTION AND ANNOTATION
12. OUTLINE
• Summary of our conclusions after the HiggsML challenge
  • the good, the bad and the ugly
• Elaborating on some of the points
• Rapid Analytics and Model Prototyping (RAMP)
  • an experimental format we have been developing
13. CIML WORKSHOP TOMORROW
14. THE GOOD
• Publicity, awareness
  • both in physics (about the technology) and in ML (about the problem)
• Triggering open data
  • http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• Learning a lot from Gábor on how to win a challenge
• Gábor getting hired by Google DeepMind
• Benchmarking
• Tool dissemination (XGBoost, Keras)
15. THE BAD
• No direct access to code
• No direct access to data scientists
• No fundamentally new ideas
• No incentive to collaborate
16. THE UGLY
• 18 months to prepare
  • legal issues, access to data
  • problem formulation: intellectually way more interesting than the challenge itself, but difficult to “market” or to crowdsource
  • once a problem is formalized and formatted as a challenge, the problem is solved (“learning is easy” - Gaël Varoquaux)
17. THE UGLY
• We asked the wrong question, on purpose!
  • because the right questions are complex and don’t fit the challenge setup
  • they would have led to way less participation
  • they would have led to bitterness among the participants, bad (?) for marketing
18. PUBLICITY, AWARENESS
• The HiggsML challenge on Kaggle
  • https://www.kaggle.com/c/higgs-boson
19. PUBLICITY, AWARENESS
[Figure: the “Classification for discovery” slide from B. Kégl, AppStat@LAL, “Learning to discover”.]
20. AWARENESS DYNAMICS
• HEPML workshop @ NIPS 2014
  • JMLR workshop proceedings: http://jmlr.csail.mit.edu/proceedings/papers/v42
• CERN Open Data
  • http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• DataScience@LHC
  • http://indico.cern.ch/event/395374/
• Flavours of Physics challenge
  • https://www.kaggle.com/c/flavours-of-physics
21. LEARNING FROM THE WINNER
https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
22. LEARNING FROM THE WINNER
• Sophisticated cross validation, CV bagging (see the sketch after this slide)
• Sophisticated calibration and model averaging
• The first step: pro participants check whether the effort is worthwhile (risk assessment)
  • variance estimate of the score
• Don’t use the public leaderboard score for model selection
• None of Gábor’s 200 out-of-the-ordinary ideas worked
https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
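To make the cross-validation and score-variance points concrete, here is a minimal sketch in Python with scikit-learn; the synthetic data, the gradient-boosting model, and the 10-fold setup are my own assumptions, not Gábor’s actual pipeline.

```python
# Two ideas from the winner's slides: estimate the spread of the validation
# score across folds (risk assessment), and CV bagging, i.e. average the
# test predictions of the per-fold models instead of refitting on all data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, random_state=0)  # placeholder data
X_train, y_train, X_test = X[:1500], y[:1500], X[1500:]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_scores, fold_test_preds = [], []
for train_idx, valid_idx in cv.split(X_train, y_train):
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_train[train_idx], y_train[train_idx])
    valid_proba = model.predict_proba(X_train[valid_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_train[valid_idx], valid_proba))
    fold_test_preds.append(model.predict_proba(X_test)[:, 1])  # CV bagging

# If the fold-to-fold std is comparable to the gaps on the leaderboard,
# the ranking is mostly noise.
print(f"CV AUC: {np.mean(fold_scores):.4f} +/- {np.std(fold_scores):.4f}")
bagged_test_proba = np.mean(fold_test_preds, axis=0)  # averaged fold models
```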
23. BENCHMARKING
[Figure: the “Classification for discovery” benchmark slide from “Learning to discover”.]
24. BENCHMARKING
But what score did we optimize?
And why?
25. CLASSIFICATION FOR DISCOVERY
Goal: optimize the expected discovery significance.
[Figure: count-per-year and probability histograms of background and signal versus the selection threshold. Flux × time and the selection give an expected background of, say, b = 100 events; the total count is, say, 150 events, so the excess is s = 50 events and AMS = s/√b = 5 sigma.]
When optimizing the selection region G = {x : g(x) = s}, we do not know n and the background expectation $\mu_b$, so we estimate $\mu_b$ by its empirical counterpart b to obtain the approximate median significance
$$\mathrm{AMS}_2 = \sqrt{2\left((s+b)\ln\left(1+\frac{s}{b}\right) - s\right)}. \quad (14)$$
Since $\ln(x+1) = x - x^2/2 + O(x^3)$, $\mathrm{AMS}_2$ can be rewritten as $\mathrm{AMS}_3 \times \sqrt{1 + O(s/b)}$, where
$$\mathrm{AMS}_3 = \frac{s}{\sqrt{b}}, \quad (15)$$
so the two are practically indistinguishable when $b \gg s$. This approximation may, depending on the chosen search region, be a valid surrogate.
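The slide’s arithmetic can be checked in a few lines; this sketch of Eqs. (14) and (15) is my own, not code distributed with the challenge.

```python
# Check the slide's example: s = 50, b = 100 gives AMS3 = s/sqrt(b) = 5 sigma;
# the more exact AMS2 of Eq. (14) is lower because s/b = 0.5 is not small.
import math

def ams2(s, b):
    """Approximate median significance, Eq. (14)."""
    return math.sqrt(2 * ((s + b) * math.log(1 + s / b) - s))

def ams3(s, b):
    """Leading-order approximation s / sqrt(b), Eq. (15); valid when b >> s."""
    return s / math.sqrt(b)

s, b = 50.0, 100.0
print(ams3(s, b))  # 5.0, the slide's 5-sigma number
print(ams2(s, b))  # ~4.65, showing the size of the O(s/b) correction
```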
26. How to handle systematic (model) uncertainties?
• OK, so let’s design an objective function that can take background systematics into consideration
• Likelihood with unknown background $b \sim N(\mu_b, \sigma_b)$:
$$L(\mu_s, \mu_b) = P(n, b \mid \mu_s, \mu_b, \sigma_b) = \frac{(\mu_s + \mu_b)^n}{n!}\, e^{-(\mu_s + \mu_b)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_b}\, e^{-(b - \mu_b)^2 / (2\sigma_b^2)}$$
• Profile likelihood ratio: $\lambda(0) = L(0, \hat{\hat{\mu}}_b)\,/\,L(\hat{\mu}_s, \hat{\mu}_b)$
• The new Approximate Median Significance (by Glen Cowan):
$$\mathrm{AMS} = \sqrt{2\left((s+b)\ln\frac{s+b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}$$
where
$$b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s+b)\sigma_b^2}\right)$$
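Below is a transcription of the formula into a short sketch (my own, not official challenge code); as a sanity check, when $\sigma_b \to 0$ it reduces to the old AMS.

```python
# Glen Cowan's AMS with a background systematic sigma_b, as on the slide.
# As sigma_b -> 0, b0 -> b and the expression reduces to the old AMS2.
import math

def ams_with_systematics(s, b, sigma_b):
    """AMS taking the background uncertainty sigma_b into account."""
    b0 = 0.5 * (b - sigma_b**2
                + math.sqrt((b - sigma_b**2)**2 + 4 * (s + b) * sigma_b**2))
    return math.sqrt(2 * ((s + b) * math.log((s + b) / b0) - s - b + b0)
                     + (b - b0)**2 / sigma_b**2)

s, b = 50.0, 100.0  # the running example from the previous slide
for sigma_b in (0.001, 5.0, 20.0):
    # the significance shrinks as the systematic uncertainty grows
    print(f"sigma_b = {sigma_b:6.3f}: AMS = {ams_with_systematics(s, b, sigma_b):.3f}")
```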
27. HOW TO HANDLE SYSTEMATIC UNCERTAINTIES
Why didn’t we use it?
28. How to handle systematic (model) uncertainties?
• The new Approximate Median Significance (the same formula as on slide 26)
[Figure: selection-threshold curves comparing the new AMS, the ATLAS selection, and the old AMS.]
29. LEARNING FROM THE WINNER
• Sophisticated cross validation, CV bagging
• Sophisticated calibration and model averaging
• The first step: pro participants check whether the effort is worthwhile (risk assessment)
  • variance estimate of the score
• Don’t use the public leaderboard score for model selection
• None of Gábor’s 200 out-of-the-ordinary ideas worked
30. THE TWO MOST COMMON DATA CHALLENGE KILLERS
• Leakage
• Variance of the test score
31. VARIANCE OF THE TEST SCORE
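As a minimal illustration of this killer, one can bootstrap a simulated test set; the test-set size and accuracy below are assumed numbers, not HiggsML figures.

```python
# Why test-score variance kills challenges: bootstrap the test set to see
# how much of a leaderboard gap is pure noise. Labels and predictions are
# simulated placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_test = 5000
y_true = rng.integers(0, 2, size=n_test)                         # fake labels
y_pred = np.where(rng.random(n_test) < 0.8, y_true, 1 - y_true)  # ~80% accurate

scores = []
for _ in range(1000):
    idx = rng.integers(0, n_test, size=n_test)  # bootstrap resample
    scores.append(np.mean(y_pred[idx] == y_true[idx]))
print(f"accuracy: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
# With 5000 test points the std is ~0.006: two submissions within half a
# percent of each other cannot be ranked reliably.
```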
32. DATA CHALLENGES
• Challenges are useful for
  • generating visibility in the data science community for novel application domains
  • benchmarking state-of-the-art techniques fairly on well-defined problems
  • finding talented data scientists
• Limitations
  • not necessarily adapted to solving complex and open-ended data science problems in realistic environments
  • no direct access to solutions and data scientists
  • no incentive to collaborate
34. RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
• Direct access to code, prototyping
• Incentivizing diversity
• Incentivizing collaboration
• Training
• Networking
35. WHERE DOES IT COME FROM?
• Our experience with the HiggsML challenge
• The need to connect data scientists to domain scientists and problems at the Paris-Saclay Center for Data Science
• Collaboration with management scientists specializing in managing innovation
• Michael Nielsen’s book: Reinventing Discovery
• 5+ iterations so far
36. UNIVERSITÉ PARIS-SACLAY
+ horizontal multi-disciplinary and multi-partner initiatives to create cohesion
37. Center for Data Science Paris-Saclay
A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay
http://www.datascience-paris-saclay.fr/
250 researchers in 35 laboratories:
• Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
• Chemistry: EA4041/UPSud
• Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
• Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
• Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
• Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud
• Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
• Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
• Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech
• Visualization: INRIA, LIMSI
• LIST/CEA
[Diagram: the data science landscape connecting data science (machine learning, information retrieval, signal processing, data visualization, databases), domain science (human society, life, brain, earth, universe), and tool building (software engineering, clouds/grids, high-performance computing, optimization), with the domain scientist and the software engineer as actors.]
38. THE DATA SCIENCE LANDSCAPE
• Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
• Domain science: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
• Tool building: software engineering, clouds/grids, high-performance computing, optimization
• Actors: data scientist, data trainer, applied scientist, domain scientist, software engineer, data engineer
39. https://medium.com/@balazskegl
40. TOOLS: LANDSCAPE TO ECOSYSTEM
[Diagram: the same landscape as slide 38, with the actors (data scientist, data trainer, applied scientist, domain expert, software engineer, data engineer) connected by the Center’s tools.]
• interdisciplinary projects
• matchmaking tool
• design and innovation strategy workshops
• data challenges
• coding sprints
• Open Software Initiative
• code consolidator and engineering projects
• data science RAMPs and training sprints (TSs)
• IT platform for linked data
• annotation tools
• SaaS data science platform
41. NIELSEN’S CROWDSOURCING PRINCIPLES
• Modularizing the collaboration
  • independent subtasks
  • reduces barriers
  • broadens the range of available expertise
• Encouraging small contributions
• A rich and well-structured information commons
  • so people can build on earlier work
42. RAMPS
• Single-day coding sessions
  • 20-40 participants
  • preparation is similar to challenges
• Goals
  • focusing and motivating top talent
  • promoting collaboration, speed, and efficiency
  • solving (prototyping) real problems
43. TRAINING SPRINTS
• Single-day training sessions
  • 20-40 participants
  • focusing on a single subject (deep learning, model tuning, functional data, etc.)
  • preparing RAMPs
55. CONCLUSIONS
• Explore the open innovation space
  • read Nielsen’s book
• Drop me an email (balazs.kegl@gmail.com) if you are interested in beta-testing the RAMP tool
• Come to our CIML workshop tomorrow