1. WHAT IS WRONG WITH DATA CHALLENGES
THE HIGGSML STORY: THE GOOD, THE BAD AND THE UGLY
BALÁZS KÉGL
CNRS & Université Paris-Saclay
Center for Data Science, Paris-Saclay
2. Why am I so critical?
Why do I downplay our own success with the HiggsML challenge?
3. Because I believe that there is enormous potential in open innovation/crowdsourcing in science.
The current data challenge format is a single point in that landscape.
4. INTERMEDIARIES: THE GROWING INTEREST FOR «CROWDS» -> EXPLOSION OF TOOLS
(Olga Kokshagina, 2015)
• Crowdsourcing is a model that leverages novel technologies (web 2.0, mobile apps, social networks)
• to build content and a structured set of information by gathering contributions from large groups of individuals
5. CROWDSOURCING ANNOTATION
6. CROWDSOURCING COLLECTION AND ANNOTATION
12. OUTLINE
• Summary of our conclusions after the HiggsML challenge
  • the good, the bad and the ugly
• Elaborating on some of the points
• Rapid Analytics and Model Prototyping (RAMP)
  • an experimental format we have been developing
13. CIML WORKSHOP TOMORROW
14. THE GOOD
• Publicity, awareness
  • both in physics (about the technology) and in ML (about the problem)
• Triggering open data
  • http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• Learning a lot from Gábor on how to win a challenge
• Gábor getting hired by Google DeepMind
• Benchmarking
• Tool dissemination (XGBoost, Keras)
15. THE BAD
• No direct access to code
• No direct access to data scientists
• No fundamentally new ideas
• No incentive to collaborate
16. THE UGLY
• 18 months to prepare
  • legal issues, access to data
  • problem formulation: intellectually way more interesting than the challenge itself, but difficult to “market” or to crowdsource
  • once a problem is formalized and formatted as a challenge, the problem is solved (“learning is easy” - Gaël Varoquaux)
17. THE UGLY
• We asked the wrong question, on purpose!
  • because the right questions are complex and don’t fit the challenge setup
  • they would have led to way less participation
  • they would have led to bitterness among the participants, bad (?) for marketing
18. PUBLICITY, AWARENESS
• The HiggsML challenge on Kaggle
  • https://www.kaggle.com/c/higgs-boson
19. PUBLICITY, AWARENESS
[Figure: the “Classification for discovery” slide from B. Kégl, AppStat@LAL, “Learning to discover”.]
20. AWARENESS DYNAMICS
• HEPML workshop @ NIPS 2014
  • JMLR workshop proceedings: http://jmlr.csail.mit.edu/proceedings/papers/v42
• CERN Open Data
  • http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• DataScience@LHC
  • http://indico.cern.ch/event/395374/
• Flavours of Physics challenge
  • https://www.kaggle.com/c/flavours-of-physics
21. LEARNING FROM THE WINNER
https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
22. LEARNING FROM THE WINNER
• Sophisticated cross validation, CV bagging (see the sketch after this slide)
• Sophisticated calibration and model averaging
• The first step: pro participants check whether the effort is worthwhile (risk assessment)
  • variance estimate of the score
• Don’t use the public leaderboard score for model selection
• None of Gábor’s 200 out-of-the-ordinary ideas worked
https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
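To make the cross-validation and score-variance points concrete, here is a minimal sketch in Python with scikit-learn; the synthetic data, the gradient-boosting model, and the 10-fold setup are my own assumptions, not Gábor’s actual pipeline.

```python
# Two ideas from the winner's slides: estimate the spread of the validation
# score across folds (risk assessment), and CV bagging, i.e. average the
# test predictions of the per-fold models instead of refitting on all data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, random_state=0)  # placeholder data
X_train, y_train, X_test = X[:1500], y[:1500], X[1500:]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_scores, fold_test_preds = [], []
for train_idx, valid_idx in cv.split(X_train, y_train):
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_train[train_idx], y_train[train_idx])
    valid_proba = model.predict_proba(X_train[valid_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_train[valid_idx], valid_proba))
    fold_test_preds.append(model.predict_proba(X_test)[:, 1])  # CV bagging

# If the fold-to-fold std is comparable to the gaps on the leaderboard,
# the ranking is mostly noise.
print(f"CV AUC: {np.mean(fold_scores):.4f} +/- {np.std(fold_scores):.4f}")
bagged_test_proba = np.mean(fold_test_preds, axis=0)  # averaged fold models
```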
23. BENCHMARKING
[Figure: the “Classification for discovery” benchmark slide from “Learning to discover”.]
24. BENCHMARKING
But what score did we optimize?
And why?
25. CLASSIFICATION FOR DISCOVERY
Goal: optimize the expected discovery significance.
[Figure: count-per-year and probability histograms of background and signal versus the selection threshold. Flux × time and the selection give an expected background of, say, b = 100 events; the total count is, say, 150 events, so the excess is s = 50 events and AMS = s/√b = 5 sigma.]
When optimizing the selection region G = {x : g(x) = s}, we do not know n and the background expectation $\mu_b$, so we estimate $\mu_b$ by its empirical counterpart b to obtain the approximate median significance
$$\mathrm{AMS}_2 = \sqrt{2\left((s+b)\ln\left(1+\frac{s}{b}\right) - s\right)}. \quad (14)$$
Since $\ln(x+1) = x - x^2/2 + O(x^3)$, $\mathrm{AMS}_2$ can be rewritten as $\mathrm{AMS}_3 \times \sqrt{1 + O(s/b)}$, where
$$\mathrm{AMS}_3 = \frac{s}{\sqrt{b}}, \quad (15)$$
so the two are practically indistinguishable when $b \gg s$. This approximation may, depending on the chosen search region, be a valid surrogate.
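The slide’s arithmetic can be checked in a few lines; this sketch of Eqs. (14) and (15) is my own, not code distributed with the challenge.

```python
# Check the slide's example: s = 50, b = 100 gives AMS3 = s/sqrt(b) = 5 sigma;
# the more exact AMS2 of Eq. (14) is lower because s/b = 0.5 is not small.
import math

def ams2(s, b):
    """Approximate median significance, Eq. (14)."""
    return math.sqrt(2 * ((s + b) * math.log(1 + s / b) - s))

def ams3(s, b):
    """Leading-order approximation s / sqrt(b), Eq. (15); valid when b >> s."""
    return s / math.sqrt(b)

s, b = 50.0, 100.0
print(ams3(s, b))  # 5.0, the slide's 5-sigma number
print(ams2(s, b))  # ~4.65, showing the size of the O(s/b) correction
```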
26. How to handle systematic (model) uncertainties?
• OK, so let’s design an objective function that can take background systematics into consideration
• Likelihood with unknown background $b \sim N(\mu_b, \sigma_b)$:
$$L(\mu_s, \mu_b) = P(n, b \mid \mu_s, \mu_b, \sigma_b) = \frac{(\mu_s + \mu_b)^n}{n!}\, e^{-(\mu_s + \mu_b)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_b}\, e^{-(b - \mu_b)^2 / (2\sigma_b^2)}$$
• Profile likelihood ratio: $\lambda(0) = L(0, \hat{\hat{\mu}}_b)\,/\,L(\hat{\mu}_s, \hat{\mu}_b)$
• The new Approximate Median Significance (by Glen Cowan):
$$\mathrm{AMS} = \sqrt{2\left((s+b)\ln\frac{s+b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}$$
where
$$b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s+b)\sigma_b^2}\right)$$
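Below is a transcription of the formula into a short sketch (my own, not official challenge code); as a sanity check, when $\sigma_b \to 0$ it reduces to the old AMS.

```python
# Glen Cowan's AMS with a background systematic sigma_b, as on the slide.
# As sigma_b -> 0, b0 -> b and the expression reduces to the old AMS2.
import math

def ams_with_systematics(s, b, sigma_b):
    """AMS taking the background uncertainty sigma_b into account."""
    b0 = 0.5 * (b - sigma_b**2
                + math.sqrt((b - sigma_b**2)**2 + 4 * (s + b) * sigma_b**2))
    return math.sqrt(2 * ((s + b) * math.log((s + b) / b0) - s - b + b0)
                     + (b - b0)**2 / sigma_b**2)

s, b = 50.0, 100.0  # the running example from the previous slide
for sigma_b in (0.001, 5.0, 20.0):
    # the significance shrinks as the systematic uncertainty grows
    print(f"sigma_b = {sigma_b:6.3f}: AMS = {ams_with_systematics(s, b, sigma_b):.3f}")
```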
27. HOW TO HANDLE SYSTEMATIC UNCERTAINTIES
Why didn’t we use it?
28. How to handle systematic (model) uncertainties?
• The new Approximate Median Significance (the same formula as on slide 26)
[Figure: selection-threshold curves comparing the new AMS, the ATLAS selection, and the old AMS.]
29. LEARNING FROM THE WINNER
• Sophisticated cross validation, CV bagging
• Sophisticated calibration and model averaging
• The first step: pro participants check whether the effort is worthwhile (risk assessment)
  • variance estimate of the score
• Don’t use the public leaderboard score for model selection
• None of Gábor’s 200 out-of-the-ordinary ideas worked
30. THE TWO MOST COMMON DATA CHALLENGE KILLERS
• Leakage
• Variance of the test score
31. VARIANCE OF THE TEST SCORE
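As a minimal illustration of this killer, one can bootstrap a simulated test set; the test-set size and accuracy below are assumed numbers, not HiggsML figures.

```python
# Why test-score variance kills challenges: bootstrap the test set to see
# how much of a leaderboard gap is pure noise. Labels and predictions are
# simulated placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_test = 5000
y_true = rng.integers(0, 2, size=n_test)                         # fake labels
y_pred = np.where(rng.random(n_test) < 0.8, y_true, 1 - y_true)  # ~80% accurate

scores = []
for _ in range(1000):
    idx = rng.integers(0, n_test, size=n_test)  # bootstrap resample
    scores.append(np.mean(y_pred[idx] == y_true[idx]))
print(f"accuracy: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
# With 5000 test points the std is ~0.006: two submissions within half a
# percent of each other cannot be ranked reliably.
```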
32. DATA CHALLENGES
• Challenges are useful for
  • generating visibility in the data science community for novel application domains
  • benchmarking state-of-the-art techniques fairly on well-defined problems
  • finding talented data scientists
• Limitations
  • not necessarily adapted to solving complex and open-ended data science problems in realistic environments
  • no direct access to solutions and data scientists
  • no incentive to collaborate
34. RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
• Direct access to code, prototyping
• Incentivizing diversity
• Incentivizing collaboration
• Training
• Networking
35. WHERE DOES IT COME FROM?
• Our experience with the HiggsML challenge
• The need to connect data scientists to domain scientists and problems at the Paris-Saclay Center for Data Science
• Collaboration with management scientists specializing in managing innovation
• Michael Nielsen’s book: Reinventing Discovery
• 5+ iterations so far
36. UNIVERSITÉ PARIS-SACLAY
+ horizontal multi-disciplinary and multi-partner initiatives to create cohesion
37. Center for Data Science Paris-Saclay
A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay
http://www.datascience-paris-saclay.fr/
250 researchers in 35 laboratories:
• Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
• Chemistry: EA4041/UPSud
• Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
• Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
• Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
• Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud
• Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
• Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
• Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech
• Visualization: INRIA, LIMSI
• LIST/CEA
[Diagram: the data science landscape connecting data science (machine learning, information retrieval, signal processing, data visualization, databases), domain science (human society, life, brain, earth, universe), and tool building (software engineering, clouds/grids, high-performance computing, optimization), with the domain scientist and the software engineer as actors.]
38. THE DATA SCIENCE LANDSCAPE
• Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
• Domain science: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
• Tool building: software engineering, clouds/grids, high-performance computing, optimization
• Actors: data scientist, data trainer, applied scientist, domain scientist, software engineer, data engineer
39. https://medium.com/@balazskegl
40. TOOLS: LANDSCAPE TO ECOSYSTEM
[Diagram: the same landscape as slide 38, with the actors (data scientist, data trainer, applied scientist, domain expert, software engineer, data engineer) connected by the Center’s tools.]
• interdisciplinary projects
• matchmaking tool
• design and innovation strategy workshops
• data challenges
• coding sprints
• Open Software Initiative
• code consolidator and engineering projects
• data science RAMPs and training sprints (TSs)
• IT platform for linked data
• annotation tools
• SaaS data science platform
41. NIELSEN’S CROWDSOURCING PRINCIPLES
• Modularizing the collaboration
  • independent subtasks
  • reduces barriers
  • broadens the range of available expertise
• Encouraging small contributions
• A rich and well-structured information commons
  • so people can build on earlier work
42. RAMPS
• Single-day coding sessions
  • 20-40 participants
  • preparation is similar to challenges
• Goals
  • focusing and motivating top talent
  • promoting collaboration, speed, and efficiency
  • solving (prototyping) real problems
43. TRAINING SPRINTS
• Single-day training sessions
  • 20-40 participants
  • focusing on a single subject (deep learning, model tuning, functional data, etc.)
  • preparing RAMPs
55. CONCLUSIONS
• Explore the open innovation space
  • read Nielsen’s book
• Drop me an email (balazs.kegl@gmail.com) if you are interested in beta-testing the RAMP tool
• Come to our CIML workshop tomorrow