Modern machine learning methods that could be useful for particle physics.
A personal summary of the "Connecting the Dots 2015" conference at Berkeley Lab, with ideas for what particle physics could try.
1. Connecting the Dots 2015
Tuesday Meeting
Tim Head
École Polytechnique Fédérale de Lausanne
24 March 2015
2. Question: What is pattern recognition in sparsely sampled data?
Obvious answer: Track reconstruction!
Interesting answer: Computer vision, track reconstruction, space object tracking, face recognition, jet reconstruction, self-driving cars, "Ok, Google ..."
Tim Head (EPFL) 24 March 2015 2
4. 1. Is an aggressive R&D effort in this field sufficiently motivated?
   2. Which are the most promising directions we should explore?
      1. Associative memory ASICs vs. FPGAs
      2. Retina/Hough transform
      3. Tracklets
      4. Cellular automata
      5. GPUs
      6. Commercial CPUs
      7. ...
What is the future of fast track finding for trigger applications beyond the ATLAS and CMS Phase II upgrades?
"Where charm leads, beauty goes. Followed by the Higgs." (Luciano Ristori)
5. Is an aggressive R&D effort in this field sufficiently motivated? An example (Luciano Ristori)
• In the post-Higgs era, in the absence of new physics, the key to progress in our field will be precision measurements.
• The HL-LHC at 10^35 cm^-2 s^-1 will produce ~10^14 beauty and charm decays per year. If we can harvest most of them, we could bring the precision of CP-violation measurements in rare decays from the present ~10^-2 to below ~10^-4.
• To do this we will need to change the way we perform experiments.
• 10^14 × 1 MB = 10^20 bytes = 10^5 PB/year → no way!
• We need to read out the detector for every single crossing, perform an almost complete analysis in real time, and retain only the information relevant to the process of interest (e.g. the few tracks involved in the decay).
• This involves finding all tracks down to low momentum, identifying decay vertices, computing invariant masses ... the complexity of this problem is 10-100 times worse than what we are now trying to solve for CMS Phase II.
• 10^14 × 1 kB = 10^17 bytes = 100 PB/year → possible!
"To stay ahead, we need completely new ideas." (Luciano Ristori)
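The bandwidth arithmetic in the bullets above can be checked directly; a trivial sketch, using only the round numbers quoted on the slide:

```python
# Back-of-the-envelope check of the trigger-bandwidth argument
# (pure arithmetic, no physics input beyond the numbers quoted).
n_decays = 1e14        # beauty + charm decays per year at the HL-LHC

raw_event = 1e6        # ~1 MB per event if the full readout is kept
reduced_event = 1e3    # ~1 kB if only the relevant tracks survive

raw = n_decays * raw_event          # bytes/year, keeping everything
reduced = n_decays * reduced_event  # bytes/year after real-time reduction

PB = 1e15
print(f"raw:     {raw / PB:.0e} PB/year")      # 10^5 PB/year -> no way
print(f"reduced: {reduced / PB:.0f} PB/year")  # 100 PB/year  -> possible
```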
6. It Is All About Representation
[Plot: "Original Data", two classes of points in the (X, Y) plane, both axes spanning -1.5 to 1.5.]
Separating black from white is hard work ...
7. It Is All About Representation
[Plots: the "Original Data" scatter in the (X, Y) plane, and its one-dimensional representation spanning roughly 2.5 to 6.5.]
Separating black from white is hard work ... until you learn about spherical coordinates.
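The toy example above can be reproduced in a few lines. This sketch (ring radii, noise level, and the cut value are made up) shows a radial coordinate turning a problem that no single Cartesian axis can solve into a trivially separable one-dimensional one:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_ring(radius, n=500, noise=0.05):
    """Points scattered around a circle of the given radius."""
    angle = rng.uniform(0, 2 * np.pi, n)
    r = radius + rng.normal(0, noise, n)
    return np.column_stack([r * np.cos(angle), r * np.sin(angle)])

black = make_ring(0.5)   # inner class
white = make_ring(1.2)   # outer class

# No single Cartesian axis separates the rings, but the radial
# coordinate r = sqrt(x^2 + y^2) does.
r_black = np.hypot(black[:, 0], black[:, 1])
r_white = np.hypot(white[:, 0], white[:, 1])

cut = 0.85
accuracy = 0.5 * (np.mean(r_black < cut) + np.mean(r_white > cut))
print(f"accuracy of a single cut on r: {accuracy:.3f}")
```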
10. Jet Clustering 101: The HEP Problem at Hand
[Figure: jet sketches, most labelled "QCD".]
Decay products of the W and Z all end up in the same jet. (Michael Kagan)
11. HEP Approach to Boosted Particle Tagging: N-subjettiness
• "Substructure" techniques analyze the constituents of a jet, e.g.
  - Is it a 1-prong, 2-prong, or 3-prong-like decay?
  - Is the energy split evenly amongst "sub-jets"?
  - Many substructure-related variables and algorithms exist.
• Example substructure variable:
  - The N-subjettiness ratio τ21 = τ2/τ1
  - A continuous version of subjet counting
• Example classification problem: separate W boson jets from QCD light jets.
[Plot: normalised τ21 distributions for W jets and QCD jets. ATLAS Simulation Preliminary, √s = 8 TeV, trimmed anti-kt R = 1.0 jets, |η_truth| < 1.2, 200 < p_T^truth < 350 GeV, reconstructed-mass window.]
N-subjettiness: after a lot of thinking, cook up a variable that can separate QCD from W jets. (Michael Kagan)
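The τ_N variables follow directly from the definition τ_N = (1/d0) Σ_k p_T,k min_a(ΔR_{a,k}) with d0 = R0 Σ_k p_T,k. This toy sketch takes the candidate subjet axes as given (an assumption: a real analysis picks them with an exclusive kT algorithm), and the jet constituents are invented:

```python
import numpy as np

def tau_n(pt, eta, phi, axes, R0=1.0):
    """N-subjettiness tau_N = (1/d0) sum_k pT_k min_a(dR_{a,k}),
    with d0 = R0 * sum_k pT_k.  The candidate subjet axes are taken
    as given here; a real analysis chooses them with exclusive kT."""
    deta = np.array([eta - a_eta for a_eta, _ in axes])
    dphi = np.array([phi - a_phi for _, a_phi in axes])
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi      # wrap phi to [-pi, pi]
    dr = np.sqrt(deta**2 + dphi**2)                  # (n_axes, n_constituents)
    return np.sum(pt * dr.min(axis=0)) / (R0 * np.sum(pt))

# Invented two-prong jet: two hard constituents plus soft radiation.
pt = np.array([50.0, 45.0, 5.0, 4.0])
eta = np.array([0.4, -0.4, 0.45, -0.35])
phi = np.array([0.0, 0.0, 0.05, -0.05])

tau1 = tau_n(pt, eta, phi, axes=[(0.0, 0.0)])
tau2 = tau_n(pt, eta, phi, axes=[(0.4, 0.0), (-0.4, 0.0)])
print(f"tau21 = {tau2 / tau1:.3f}")  # small ratio: consistent with two prongs
```

A genuine two-prong (W-like) jet gives τ2 much smaller than τ1, which is why the ratio discriminates against one-prong QCD jets.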
12. The Jet-Image
• Jets are built from calorimeter towers.
• Build an N×N grid of the towers containing the jet (here 25×25).
• The jet-image → calorimeter towers are like pixels in an image!
[Figure: an example jet from a W→qq' decay, and the corresponding jet-image.]
Calorimeter towers are like the pixels of an image. (Michael Kagan)
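The jet-image construction above amounts to a 2D histogram in (η, φ) around the jet axis. A minimal sketch with invented tower positions and energies (real towers come from the calorimeter geometry):

```python
import numpy as np

def jet_image(eta, phi, et, jet_eta, jet_phi, n_pix=25, half_width=1.25):
    """Bin tower transverse energies into an n_pix x n_pix grid
    centred on the jet axis."""
    d_eta = eta - jet_eta
    d_phi = (phi - jet_phi + np.pi) % (2 * np.pi) - np.pi  # wrap phi
    edges = np.linspace(-half_width, half_width, n_pix + 1)
    img, _, _ = np.histogram2d(d_eta, d_phi, bins=(edges, edges), weights=et)
    return img

# Invented towers scattered around a jet axis at (eta, phi) = (0, 0).
rng = np.random.default_rng(1)
eta = rng.normal(0.0, 0.3, 100)
phi = rng.normal(0.0, 0.3, 100)
et = rng.exponential(5.0, 100)

img = jet_image(eta, phi, et, jet_eta=0.0, jet_phi=0.0)
print(img.shape)  # pixel value = summed tower E_T in that cell
```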
13. Class Averages
[Plots: the average W jet image and the average light (QCD) jet image on the (Q1, Q2) grid, with cell coefficients on a log scale from 10^-9 to 10^-1.]
How can we extract the important features? How can we convert this into discrimination power?
After some preprocessing, there is a difference! (Michael Kagan)
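The class averages above are just a pixel-wise mean over a labelled stack of jet-images; a minimal sketch with random stand-in images (real inputs would be the preprocessed 25×25 images from the previous slide):

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-ins for preprocessed 25x25 jet-images, one stack per class.
w_images = rng.exponential(1.0, size=(1000, 25, 25))
qcd_images = rng.exponential(1.0, size=(1000, 25, 25))

# The "class average" is the pixel-wise mean over each stack; the
# difference image hints at where the discriminating features live.
avg_w = w_images.mean(axis=0)
avg_qcd = qcd_images.mean(axis=0)
difference = avg_w - avg_qcd   # structured for real jets, noise here
print(avg_w.shape, avg_qcd.shape)
```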
14. Fisher Discriminant
• Finds the direction that maximizes the ratio of between-class scatter to within-class scatter.
  - Extracts the "most important" feature for discrimination under this metric.
  - This can be written as a generalized eigenvalue problem.
• If the data is high dimensional, e.g. 625 elements, then the total scatter matrix S_t has a huge number of independent components, e.g. 192,495!
  - Not enough data to build a full-rank matrix → must regularize!
  - Details of the analytic solution: Z. Zhang et al., "Regularized Discriminant Analysis, Ridge Regression and Beyond", Journal of Machine Learning Research 11 (2010) 2199-2228.
A complicated way of saying ... (Michael Kagan)
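The two-class case can be sketched in a few lines of numpy: the Fisher direction reduces to w ∝ S_w^(-1)(m1 - m0), and since S_w is singular when samples are scarcer than dimensions, a plain ridge term stands in here for the analytic solution of Zhang et al. cited above. The "images" are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 625                          # fewer samples than dimensions
signal = rng.normal(0.3, 1.0, (n, d))    # toy "W jet" images, flattened
background = rng.normal(0.0, 1.0, (n, d))

m1, m0 = signal.mean(axis=0), background.mean(axis=0)
Sw = np.cov(signal, rowvar=False) + np.cov(background, rowvar=False)

# Sw has rank < d here, so solve with a ridge-regularized version.
lam = 1.0
w = np.linalg.solve(Sw + lam * np.eye(d), m1 - m0)

# Project onto w and cut at the midpoint of the projected class means.
threshold = w @ (m1 + m0) / 2
acc = 0.5 * (np.mean(signal @ w > threshold)
             + np.mean(background @ w < threshold))
print(f"toy separation accuracy on the training sample: {acc:.2f}")
```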
15. Fisher's Linear Discriminant
[Scatter plot of two overlapping classes of points.]
Find an axis along which we can separate the data.
16. Fisher's Linear Discriminant
[Scatter plot of two overlapping classes of points.]
Find an axis along which we can separate the data.
17. Performance
[Plot: background rejection versus signal efficiency (%) for the Fisher-jet discriminant and for N-subjettiness (τ2/τ1).]
We did not have to think long and hard about a variable, and are competitive! (Michael Kagan)
18. Computer Vision Applied Blindly
• By mapping concepts from images to jets you gain access to well-studied computer-vision techniques.
• No need to think up "clever" variables a priori: a flexible method!
• Computers can discover good ways to represent the data "by themselves".
• Fisher's linear discriminant was state of the art in 1997; things have moved on since then!
19. What About YouTube?
Let a computer watch YouTube and it will learn that cats are a useful thing (variable) to know about.
21. Deep Learning: Detecting the Higgs Boson
A two-class supervised learning problem: Higgs production vs. the primary background.
Machine learning classifier:
∙ 28 features
  ∙ 21 low-level features
  ∙ 7 high-level features derived by physicists
∙ 10M simulated collisions for training (50% each class)
∙ 500k validation set
∙ 500k test set
Do the seven high-level variables help? (Peter Sadowski)
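A miniature version of such a classifier can be trained end-to-end in plain numpy. Everything here is made up for illustration: four toy features instead of 28, an invented nonlinear "signal" rule, and one small hidden layer rather than the deep architectures of Baldi et al.:

```python
import numpy as np

rng = np.random.default_rng(3)

X = rng.normal(0, 1, (2000, 4))                      # toy low-level features
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(float)  # invented signal rule

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(0, 0.5, (4, 16)); b1 = np.zeros(16)  # hidden layer
W2 = rng.normal(0, 0.5, 16);      b2 = 0.0           # output layer
lr = 0.1

for _ in range(1000):                                # full-batch descent
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)                         # P(signal | x)
    g_out = (p - y) / len(y)                         # d(log loss)/d(logit)
    g_h = np.outer(g_out, W2) * (1 - h**2)           # backprop through tanh
    W2 -= lr * h.T @ g_out;  b2 -= lr * g_out.sum()
    W1 -= lr * X.T @ g_h;    b1 -= lr * g_h.sum(axis=0)

acc = np.mean((p > 0.5) == (y > 0.5))
print(f"training accuracy: {acc:.2f}")
```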
22. Deep Learning: Detecting the Higgs Boson
∙ Current approach: shallow models
  ∙ Boosted decision trees* (BDT)
  ∙ Shallow neural networks (NN)
∙ Our approach: deep neural networks (DNN)
[Diagram: BDT, NN, and DNN architectures.]
*Used for the Higgs discovery in 2012.
Things we knew in the 80s have finally started working! (Peter Sadowski)
23. Deep Learning for Particle Collider Data Analysis
Motivated by the successes of deep learning in vision and speech:
∙ huge progress on benchmark supervised learning tasks
∙ replacement of engineered features with learned features
[Diagram: engineered features vs. learned features.]
Deep neural networks can learn better representations of the data without human input. (Peter Sadowski)
24. Deep Learning: Detecting the Higgs Boson
Area under the ROC curve on the test set:

Technique   Low-level features   All features
BDT         0.73                 0.81
NN          0.733 (0.007)        0.816 (0.004)
DNN         0.880 (0.001)        0.885 (0.002)

Deep learning improves the AUC by 8% over shallow methods, and does not require engineered features. (Baldi et al., Nature Communications 2014)
No, adding high-level features does not improve performance. (Peter Sadowski)
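For reference, the AUC in the table is the probability that a randomly drawn signal event outscores a randomly drawn background event (the Mann-Whitney interpretation). A toy sketch with Gaussian classifier scores, not the paper's classifiers:

```python
import numpy as np

def auc(sig_scores, bkg_scores):
    """AUC via the Mann-Whitney statistic: P(signal score > background
    score), counting ties as half."""
    s = np.asarray(sig_scores)[:, None]
    b = np.asarray(bkg_scores)[None, :]
    return np.mean(s > b) + 0.5 * np.mean(s == b)

rng = np.random.default_rng(6)
sig = rng.normal(1.0, 1.0, 5000)   # toy classifier outputs for signal
bkg = rng.normal(0.0, 1.0, 5000)   # ... and for background

# For unit-variance Gaussians separated by 1, the exact AUC is
# Phi(1/sqrt(2)) ~ 0.76; the estimate should land close to that.
print(f"toy AUC: {auc(sig, bkg):.3f}")
```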
26. The Physics Equivalent of the Cat
What variables does a NN learn when you show it physics? We should find out!
27. Learn Expensive Parts of the Simulation
Mean squared error of networks trained to compute the 7 high-level features from the 21 low-level features:

Technique           Feature-regression MSE
Linear Regression   0.1468
NN                  0.0885
DNN, 3 layers       0.0821
DNN, 4 layers       0.0818
DNN, 5 layers       0.0815
DNN, 6 layers       0.0812

High-level features are easier to learn with deep nets.
Use a NN with multiple regression outputs to learn a fast simulation of some parts of the simulation? (Peter Sadowski)
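The multi-output regression idea can be sketched with a closed-form linear model standing in for the network: one fit maps all low-level inputs to all high-level targets at once. The features, targets, and the linear "true map" are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_low, n_high = 5000, 21, 7

X = rng.normal(0, 1, (n, n_low))                  # toy low-level features
true_map = rng.normal(0, 1, (n_low, n_high))      # invented ground truth
Y = X @ true_map + 0.1 * rng.normal(0, 1, (n, n_high))  # high-level targets

# One least-squares fit yields all 7 regression outputs simultaneously,
# the same "multiple regression outputs" structure a network would use.
Xb = np.hstack([X, np.ones((n, 1))])              # add a bias column
W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
mse = np.mean((Xb @ W - Y) ** 2)
print(f"surrogate MSE: {mse:.4f}")                # near the 0.01 noise floor
```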
28. Isolation or Flavour Tagging
Can we use "jets-are-like-images" ideas for this?
30. The End
• It is all about representation.
• A small conference with an unusual mix of attendees; check the agenda for more on traditional tracking, etc.
• LHCb is leading the way when it comes to "real time" tracking; others are following.
• To stay ahead of the other experiments we should investigate these new ML tools.