BIG DATA COMPETITION: MAXIMIZING YOUR POTENTIAL
EXAMPLED WITH THE 2014 HIGGS BOSON MACHINE LEARNING CHALLENGE
Dr. Cheng CHEN
email: cchen@goDCI.com
twitter: @cheng_chen_us
Development Consulting International LLC
goDCI.com
This presentation is copyright protected ©
PRESENTER 
Ohio State University, Tongji University 
Ph.D. Civil Engineering 
M.S. Applied Statistics 
Minor Computer Science 
Advanced trainings: 
City and Regional Planning 
Industrial and Systems Engineering 
Mathematics 
Passion: (this) machine learning 
2
HIGGS BOSON MACHINE LEARNING CHALLENGE
• Goal: improve the procedure that produces the selection region of the Higgs boson
• 4-month duration
• 1,785 teams
• Many machine learning experts, statisticians, and physicists
• The top 5 are from 5 different countries: Hungary, Netherlands, France, Russia, U.S.A./China
http://www.kaggle.com/c/higgs-boson/leaderboard
[Figure: workflow roadmap ©. Stages: Background; Data (Understand, Explore, Enhance); Model (Train, Select, Optimize, Validate). Activity labels: read, discuss, visualize, reduce, generate, apply, find, fine-tune, innovate, cross validate.]
READ AND DISCUSS
HIGGS BOSON
• a.k.a. the God Particle (explains some mass)
• A fundamental particle theorized in 1964 in the Standard Model of Particle Physics
• "Considered" discovered in 2011-2013 at the LHC by CERN
• A number of prestigious awards in 2013, including a Nobel Prize
A "definitive" answer might require "another few years" after the collider's 2015 restart. (deputy chair of physics at Brookhaven National Laboratory)
http://en.wikipedia.org/wiki/Higgs_boson
http://upload.wikimedia.org/wikipedia/commons/0/00/Standard_Model_of_Elementary_Particles.svg
CERN: THE EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH
• Established in 1954
• Birth of the World Wide Web (1989)
maps.google.com
LARGE HADRON COLLIDER (LHC)
• 27 km (17 mi) in circumference
• 175 meters (574 ft) beneath ground
• Built from 1998 to 2008
• Over 10,000 scientists and engineers
• Over 100 countries
• Seven particle detectors
https://www.llnl.gov/news/llnl-set-host-international-lattice-physics-conference
http://en.wikipedia.org/wiki/Large_Hadron_Collider
ATLAS
• 46 meters long
• 25 meters in diameter
• Weighs about 7,000 tonnes
• Contains some 3,000 km of cable
• Involves roughly 3,000 physicists from over 175 institutions in 38 countries
http://en.wikipedia.org/wiki/Large_Hadron_Collider
http://higgsml.lal.in2p3.fr/documentation/
CHALLENGES IN DETECTION OF HIGGS BOSON
• The Higgs boson cannot be measured directly (it decays immediately into lighter particles)
• Other particles can decay into the same set of lighter particles
• PRODUCTION and DECAY of the Higgs boson depend on its mass, and the mass was not predicted by theory (now we know it is close to 125 GeV)
Seeing a circular shadow does not mean the real object is a sphere
https://www2.physics.ox.ac.uk/sites/default/files/2012-03-27/sinead_farrington_pdf_17376.pdf
CURRENT DETECTION MECHANISM
• Raw data collected from the LHC
• Hundreds of millions of proton-proton collisions (events) per second
• 400 events of interest are selected per second
  – Signal event (i.e. Higgs boson)
  – Background event (i.e. other particles)
• Events in an ad hoc selection region (in certain channels) exceeding background noise
Needs improvement in significance and robustness of the selection criteria
SIMPLIFICATIONS FOR COMPETITION
• Simulated data
• Fixed mass (125 GeV)
• Simplified decay channel (next slide)
• Simplified background events (three representative types only)
  – Decay of the Z boson (91.2 GeV) into tau-tau
  – Decay of a pair of top quarks into lepton and hadronic tau
  – "Decay" of the W boson into lepton and hadronic tau due to imperfections in the particle identification procedure
• Simplified objective function (significance score)
SIMPLIFIED DECAY CHANNEL
• Decay of the tau-tau channel only
• One tau decays into a lepton and two neutrinos
• The other tau decays into a hadronic tau and a neutrino
• (Note: neutrinos cannot be detected)
Hadronic tau: a bunch of hadrons
Jets, MET: vectorized momenta are given
DATA DIMENSION
• 250,000 training
• 550,000 testing
• 30 variables
  – 17 primitive (momenta, direction)
  – 13 derived
4 rows in training data, shown transposed (one column per EventId)
Variable                       100000            100001          100002          100003
DER_mass_MMC                   138.47            160.937         NA              143.905
DER_mass_transverse_met_lep    51.655            68.768          162.172         81.417
DER_mass_vis                   97.827            103.235         125.953         80.943
DER_pt_h                       27.98             48.146          35.635          0.414
DER_deltaeta_jet_jet           0.91              NA              NA              NA
DER_mass_jet_jet               124.711           NA              NA              NA
DER_prodeta_jet_jet            2.666             NA              NA              NA
DER_deltar_tau_lep             3.064             3.473           3.148           3.31
DER_pt_tot                     41.928            2.078           9.336           0.414
DER_sum_pt                     197.76            125.157         197.814         75.968
DER_pt_ratio_lep_tau           1.582             0.879           3.776           2.354
DER_met_phi_centrality         1.396             1.414           1.414           -1.285
DER_lep_eta_centrality         0.2               NA              NA              NA
PRI_tau_pt                     32.638            42.014          32.154          22.647
PRI_tau_eta                    1.017             2.039           -0.705          -1.655
PRI_tau_phi                    0.381             -3.011          -2.093          0.01
PRI_lep_pt                     51.626            36.918          121.409         53.321
PRI_lep_eta                    2.273             0.501           -0.953          -0.522
PRI_lep_phi                    -2.414            0.103           1.052           -3.1
PRI_met                        16.824            44.704          54.283          31.082
PRI_met_phi                    -0.277            -1.916          -2.186          0.06
PRI_met_sumet                  258.733           164.546         260.414         86.062
PRI_jet_num                    2                 1               1               0
PRI_jet_leading_pt             67.435            46.226          44.251          NA
PRI_jet_leading_eta            2.15              0.725           2.053           NA
PRI_jet_leading_phi            0.444             1.158           -2.028          NA
PRI_jet_subleading_pt          46.062            NA              NA              NA
PRI_jet_subleading_eta         1.24              NA              NA              NA
PRI_jet_subleading_phi         -2.475            NA              NA              NA
PRI_jet_all_pt                 113.497           46.226          44.251          0
Weight                         0.00265331133733  2.23358448717   2.34738894364   5.44637821192
Label                          s                 b               b               b
Data loaded correctly
Notice NA values
MISSING VALUES
 #  col_name                      NA_count  NA_pct
 1  EventId
 2  DER_mass_MMC                    38,114     15%
 3  DER_mass_transverse_met_lep
 4  DER_mass_vis
 5  DER_pt_h
 6  DER_deltaeta_jet_jet           177,457     71%
 7  DER_mass_jet_jet               177,457     71%
 8  DER_prodeta_jet_jet            177,457     71%
 9  DER_deltar_tau_lep
10  DER_pt_tot
11  DER_sum_pt
12  DER_pt_ratio_lep_tau
13  DER_met_phi_centrality
14  DER_lep_eta_centrality         177,457     71%
15  PRI_tau_pt
16  PRI_tau_eta
17  PRI_tau_phi
18  PRI_lep_pt
19  PRI_lep_eta
20  PRI_lep_phi
21  PRI_met
22  PRI_met_phi
23  PRI_met_sumet
24  PRI_jet_num
25  PRI_jet_leading_pt              99,913     40%
26  PRI_jet_leading_eta             99,913     40%
27  PRI_jet_leading_phi             99,913     40%
28  PRI_jet_subleading_pt          177,457     71%
29  PRI_jet_subleading_eta         177,457     71%
30  PRI_jet_subleading_phi         177,457     71%
31  PRI_jet_all_pt
32  Weight
33  Label
Notice the consistency in missing values
HOW TO HANDLE MISSING VALUES
• Assign a value
  – Generate a random value
  – Fit a value (mean, median, nearest neighbor, etc.)
  – Fix a value (domain knowledge)
• Remove the record
• Leave as is
HISTOGRAM: PRI_jet_leading_pt
[Figure: three histograms of PRI_jet_leading_pt (y-axis: Count): raw values, log transformation, inverse transformation]
Density is more meaningful in the range of x
No fuzzy jump at the edge
HISTOGRAM (CONT'D): DER_pt_h
[Figure: three histograms of DER_pt_h (y-axis: Count): raw values, log transformation, inverse transformation]
Bi-modality is revealed
INTERACTIVE VISUALIZATION: R SHINY
DEMO: http://chencheng.shinyapps.io/demo_higgs
• Use a reasonable number of bins to display the underlying distribution
• Use a reasonable transformation to display the underlying distribution
HISTOGRAM (CONT'D): PRI_tau_eta
[Figure: histogram of PRI_tau_eta (y-axis: Count)]
Transformations are sometimes not necessary
Do that for all 30 variables
PAIRWISE CORRELATIONS: PRI_lep_phi & PRI_met_phi
[Figure: scatter plots of PRI_lep_phi against PRI_met_phi for background (BKG) and signal (SGN), with marginal counts]
• Set the transparency parameter appropriately to reveal important patterns
• A correlation coefficient of 0 does not mean the two variables are unrelated
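A small made-up example shows why a zero correlation coefficient is weak evidence: below, `ys` is fully determined by `xs`, yet the Pearson coefficient is numerically zero. The data are synthetic and the `pearson` helper is written out by hand for illustration.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Symmetric x values and y = x^2: a perfect (nonlinear) dependence.
xs = [i / 10 for i in range(-50, 51)]
ys = [x * x for x in xs]

r_nonlinear = pearson(xs, ys)  # essentially 0
r_linear = pearson(xs, xs)     # 1 for a perfectly linear relation
```

This is why the scatter plots are inspected directly instead of relying on a correlation table.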
FEATURE ENHANCEMENT: ROTATION
[Figure: rotated PRI_lep_phi & PRI_met_phi, BKG vs. SGN]
Validate visual "evidence" from various perspectives
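A rotation of a variable pair can be sketched as below. The 45 degree angle is taken from the summary slide later in the deck; the function name is hypothetical, and the slides do not say whether the angles were wrapped afterwards, so no wrapping is applied here.

```python
import math

def rotate(x, y, degrees=45.0):
    """Rotate the point (x, y) clockwise by the given angle into new axes (u, v)."""
    t = math.radians(degrees)
    u = x * math.cos(t) + y * math.sin(t)
    v = -x * math.sin(t) + y * math.cos(t)
    return u, v

# PRI_lep_phi and PRI_met_phi of sample event 100000.
lep_phi, met_phi = -2.414, -0.277
u, v = rotate(lep_phi, met_phi)
```

At 45 degrees, `u` is proportional to the sum of the two variables and `v` to their difference, so a diagonal band in the scatter plot ends up aligned with one axis.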
PAIRWISE VARIABLES (LOW RES.): DER_pt_h & DER_deltar_tau_lep
[Figure: low-resolution scatter plots of DER_pt_h against DER_deltar_tau_lep, BKG vs. SGN, with marginal counts]
PAIRWISE VARIABLES (HIGH RES.): DER_pt_h & DER_deltar_tau_lep
[Figure: the same scatter plots at high resolution, then with a fitted curve overlaid]
• Try high resolution
• Curve fitting
FEATURE ENHANCEMENT: CURVE FITTING
[Figure: DER_pt_h & DER_deltar_tau_lep, BKG vs. SGN, with marginal counts]
Enhance a variable based on its correlation with another variable
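One way to read "enhance a variable based on its correlation with another variable" is to fit a curve y ≈ f(x) and keep the residual as the new feature. The sketch below assumes that reading and uses a straight line as a stand-in, because the slides do not state which curve was fitted; the data points are invented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residual_feature(xs, ys):
    """New feature: the part of y that the fitted line on x does not explain."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Invented points lying close to the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.1, 6.9]
residuals = residual_feature(xs, ys)
```

The residual removes the shared trend, so differences between signal and background that were hidden along the curve may become easier to see.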
FEATURE ENHANCEMENT: ROTATION BY PRI_TAU_PHI
[Figure: DER_pt_h & PRI_lep_phi, BKG vs. SGN, with marginal counts, before and after rotation]
Feature enhancement by applying domain knowledge
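"Rotation by PRI_tau_phi" is read here as measuring every azimuth angle relative to the tau direction; that interpretation, and the helper names, are assumptions. The sketch subtracts the tau angle and wraps the result back into [-pi, pi).

```python
import math

def wrap_angle(phi):
    """Map an angle in radians into the interval [-pi, pi)."""
    return (phi + math.pi) % (2 * math.pi) - math.pi

def rotate_by_tau(phi, tau_phi):
    """Azimuth angle measured relative to the tau direction."""
    return wrap_angle(phi - tau_phi)

# Sample event 100000: PRI_tau_phi = 0.381, PRI_lep_phi = -2.414, PRI_met_phi = -0.277.
lep_rel = rotate_by_tau(-2.414, 0.381)
met_rel = rotate_by_tau(-0.277, 0.381)
wrapped = rotate_by_tau(3.0, -3.0)  # difference 6.0 wraps around to 6.0 - 2*pi
```

The domain knowledge behind this is that the absolute orientation of an event around the beam axis carries no information; only the angles between particles do.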
FEATURE ENHANCEMENT: ROTATION
[Figure: PRI_jet_leading_eta & PRI_jet_subleading_eta, BKG vs. SGN, with marginal counts]
DATA DRILL DOWN
DEMO: http://chencheng.shinyapps.io/demo_higgs
• Select variable(s): one variable for a histogram, two variables for a scatter plot
• Dynamically select a subset of data, e.g. PRI_jet_num = 2 or PRI_jet_num = 3
• Patterns in the subset data: PRI_jet_leading_eta & PRI_jet_subleading_eta, compared for PRI_jet_num = 2 and PRI_jet_num = 3
Interactive data visualization techniques are helpful
Do that for all 30 * 29 ≈ 900 pairs
PARTICLE LOCATION (0, S)
[Animation]
Convert numerical data back into actual objects with meaning
PARTICLE LOCATION (0, B)
[Animation]
INSPIRATION FROM ANIMATION
• Distance ratio between MET-Lep and Tau-Lep: d(MET, Lep) / d(Tau, Lep)
[Figure: histograms of dist_ratio_met_lep_tau (y-axis: Count), BKG vs. SGN, before and after adjusting the display]
Inspiration from meaningful visualization can be helpful
Adjust visualization for better efficiency
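The slides do not define the distance d, so the sketch below makes an explicit assumption: each object is placed in the transverse plane at (pt cos phi, pt sin phi) and d is the Euclidean distance between those points. The function names are hypothetical.

```python
import math

def transverse_point(pt, phi):
    """Place an object in the transverse plane from its momentum and azimuth (assumed geometry)."""
    return (pt * math.cos(phi), pt * math.sin(phi))

def distance_ratio(met, met_phi, lep_pt, lep_phi, tau_pt, tau_phi):
    """d(MET, Lep) / d(Tau, Lep) under the assumed transverse-plane geometry."""
    met_xy = transverse_point(met, met_phi)
    lep_xy = transverse_point(lep_pt, lep_phi)
    tau_xy = transverse_point(tau_pt, tau_phi)
    return math.dist(met_xy, lep_xy) / math.dist(tau_xy, lep_xy)

# Sample event 100000 from the data slide.
ratio = distance_ratio(
    met=16.824, met_phi=-0.277,
    lep_pt=51.626, lep_phi=-2.414,
    tau_pt=32.638, tau_phi=0.381,
)
```

A different definition of d (for example, a difference in azimuth only) would change the numbers but not the idea of turning a visual impression into a feature.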
FEATURE ENHANCEMENT
• Variable reduction
  – Simple rotation
  – Transformation
  – Domain knowledge
  – …
• Feature generation
  – Domain knowledge
  – Inspiration from various visualizations
  – Statistical approaches
  – …
Examples: 45 degree rotation, curve fitting, rotation by phi, distance_ratio, principal component analysis
MODELS
• Gradient boosting tree
• Neural network
• Bayesian network
• Support vector machine
• Generalized additive model
GRADIENT BOOSTING TREE
• Decision tree
  – Build many shallow trees
• Boosting
  – Build trees based on residual
• Bagging
  – Each tree uses a subset of the data
• Ensembling
  – Combine the trees
DECISION TREE
• Regression tree
[Figure: scatter of y (−1.0 to 1.0) against x (0.0 to 10.0) with the fitted step function, shown for depth = 1, 2, 3 and 4]
Regression tree with node depth = 4, each node shown as (mean of y, n); the trees for depth 1 to 3 stop at the corresponding level:
root: 0.19, n=100
  x < 6.614: −0.08, n=64
    x >= 3.049: −0.53, n=40
      x < 5.862: −0.67, n=32
        x >= 3.594: −0.8, n=25
        x < 3.594: −0.23, n=7
      x >= 5.862: 0.045, n=8
    x < 3.049: 0.67, n=24
  x >= 6.614: 0.66, n=36
    x >= 8.953: 0.086, n=7
    x < 8.953: 0.8, n=29
      x < 7.207: 0.57, n=7
      x >= 7.207: 0.87, n=22
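A depth-limited regression tree on a single variable can be sketched as follows. This is a plain greedy least-squares splitter written for illustration, not the routine that produced the slides; the sine-shaped toy data are an assumption based on the shape of the plotted points.

```python
import math

def _sse(ys):
    """Sum of squared deviations from the mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    """Return (threshold, sse) of the split minimising squared error, or None."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        sse = _sse([y for _, y in pairs[:i]]) + _sse([y for _, y in pairs[i:]])
        if best is None or sse < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, sse)
    return best

def build_tree(xs, ys, depth):
    """Grow a tree of at most `depth` levels; every node stores its mean and size."""
    node = {"mean": sum(ys) / len(ys), "n": len(ys)}
    split = best_split(xs, ys) if depth > 0 and len(ys) > 1 else None
    if split is None:
        return node
    thr = split[0]
    left = [(x, y) for x, y in zip(xs, ys) if x < thr]
    right = [(x, y) for x, y in zip(xs, ys) if x >= thr]
    node["threshold"] = thr
    node["left"] = build_tree([x for x, _ in left], [y for _, y in left], depth - 1)
    node["right"] = build_tree([x for x, _ in right], [y for _, y in right], depth - 1)
    return node

def predict(tree, x):
    """Walk down the tree and return the mean of the leaf that x falls into."""
    while "threshold" in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["mean"]

def training_sse(tree, xs, ys):
    return sum((y - predict(tree, x)) ** 2 for x, y in zip(xs, ys))

# Assumed toy data: 100 points on a sine curve over [0, 10).
xs = [i / 10 for i in range(100)]
ys = [math.sin(x) for x in xs]
errors = [training_sse(build_tree(xs, ys, d), xs, ys) for d in range(5)]
```

Each extra level can only lower the training error, which is why depth has to be limited or validated instead of being pushed as far as possible.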
DECISION TREE → GRADIENT BOOSTING TREE (V. 1) → (STOCHASTIC) GRADIENT BOOSTING TREE
The same pseudocode is shown on three slides; the comments mark the lines each slide highlights.
    X0 = X; Y0 = Y;                                   # store input
    latest_model = train_tree(X, Y);                  # base model: a single decision tree
    for ii = 1:NUM_ITER
        # stochastic part: get sampled index, use sampled records as input
        Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
        X = X0[Index_train];
        Y = Y0[Index_train];
        # gradient boosting part
        v_resid = Y - latest_model(X);                # get the residuals
        tree_add = train_tree(X, v_resid);            # fit a tree for residuals
        latest_model += LEARNING_RATE * tree_add      # additive model
(STOCHASTIC) GRADIENT BOOSTING TREE WITH WEIGHT
    X0 = X; Y0 = Y; W0 = wts;
    latest_model = train_tree(X, Y, wts);
    for ii = 1:NUM_ITER
        Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
        X = X0[Index_train];
        Y = Y0[Index_train];
        wts = W0[Index_train];                        # keep weights aligned with the sampled records
        v_resid = Y - latest_model(X);                # weights enter through the weighted tree fit
        tree_add = train_tree(X, v_resid, wts);
        latest_model += LEARNING_RATE * tree_add
(GENERAL) GRADIENT BOOSTING
    X0 = X; Y0 = Y; W0 = wts;
    latest_model = train_base_model(X, Y, wts);
    for ii = 1:NUM_ITER
        Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
        X = X0[Index_train];
        Y = Y0[Index_train];
        wts = W0[Index_train];                        # keep weights aligned with the sampled records
        v_pseudo_resid = get_pseudo_residual(X, Y, wts, latest_model, LOSS_FUNCTION_TYPE);
        model_add_base = train_base_model(X, v_pseudo_resid, wts);
        alpha = line_search(cost_function, model_add_base, X, Y, wts);
        latest_model += LEARNING_RATE * (alpha * model_add_base)
[Stochastic Gradient Boosting] Jerome H. Friedman, 1999
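The boosting loop can be turned into a small runnable Python sketch. It is a simplified stand-in, not the presenter's implementation: squared-error loss, depth-1 trees (stumps) on a single variable, a constant mean as the base model instead of a tree, no record weights, and made-up sine data. `boost` and `fit_stump` are hypothetical names.

```python
import math
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree and return it as a prediction function."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, left_mean, right_mean)
    if best is None:  # no valid split: fall back to a constant
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, thr, left_mean, right_mean = best
    return lambda x: left_mean if x < thr else right_mean

def boost(xs, ys, num_iter=100, learning_rate=0.1, frac_train=0.7, seed=0):
    """Stochastic gradient boosting for squared error with stumps as base learners."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)  # constant base model (simplification)
    stumps = []

    def model(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    n = len(xs)
    for _ in range(num_iter):
        idx = rng.sample(range(n), int(frac_train * n))   # get sampled index
        resid = [ys[i] - model(xs[i]) for i in idx]       # get the residuals
        stumps.append(fit_stump([xs[i] for i in idx], resid))  # fit a tree for residuals
    return model

def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(100)]
ys = [math.sin(x) for x in xs]
model = boost(xs, ys)
mean_y = sum(ys) / len(ys)
mse_base = mse(lambda x: mean_y, xs, ys)
mse_boosted = mse(model, xs, ys)
```

Because every stump is scaled by the learning rate, many small corrections add up slowly, which is the reason the number of iterations and the learning rate are tuned together.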
APPLYING GBM IN R
    # requires the gbm package; train is a data.table, hence with = FALSE
    gbm_model = gbm.fit(
        x = train[, x_vars, with = FALSE],
        y = train$Label,
        distribution = char_distr,
        w = w,
        n.trees = n_trees,
        interaction.depth = num_inter,
        n.minobsinnode = min_obs_node,
        shrinkage = shrinkage_rate,
        bag.fraction = frac_bag)
VARIABLE IMPORTANCE
[Figure: bar chart of relative importance by variable]
APPLY MODEL ON TEST DATA
EventId    Score   RankOrder   Class
1          0.98    501         s
2          0.42    259,579     b
3          0.46    264,125     b
.          .       .           .
.          .       .           .
449,998    0.86    31,154      s
449,999    0.12    489,251     b
550,000    0.79    110,154     b
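Turning scores into the RankOrder and Class columns can be sketched as below. Two things are assumptions made for the example: rank 1 is given to the highest score, which is how the sample rows read, and the fraction of events labelled as signal is a free parameter. The function name is hypothetical.

```python
def rank_and_classify(scores, signal_fraction):
    """Return (rank_order, classes); rank 1 goes to the highest score (assumed convention)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    cutoff = max(1, round(signal_fraction * len(scores)))
    classes = ["s" if rank <= cutoff else "b" for rank in ranks]
    return ranks, classes

# The six scores shown in the table, with the top third labelled as signal.
scores = [0.98, 0.42, 0.46, 0.86, 0.12, 0.79]
ranks, classes = rank_and_classify(scores, signal_fraction=1 / 3)
```

Check the rank direction against the competition's submission rules before submitting, because reversing it inverts the selection.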
GRADIENT BOOSTING PARAMETERS
• Number of iterations
• Minimum observations for each node
• Fraction of bagging (0.5 ~ 0.8)
• Learning rate (< 0.1)
• Depth of tree (4 ~ 8)
CROSS VALIDATION
• Split training data
  – 70% for training
  – 30% for cross validation
• Train model (70%)
• Measure performance (30%)
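The 70/30 split can be sketched with the standard library. The seed and the function name are illustrative choices; fixing the seed keeps the split reproducible across experiments.

```python
import random

def split_train_validation(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and validation parts."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Ten event ids in the style of the training data.
event_ids = list(range(100000, 100010))
train_ids, valid_ids = split_train_validation(event_ids)
```

The model is then fitted on `train_ids` only, and the score is measured on `valid_ids`, which the model has never seen.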
PERFORMANCE BASED ON AMS
Trade-off between:
• Ratio of signal/background events
• Number of records in the selection region
EventId    Score   RankOrder   Class   Truth
1          0.98    501         S       S
2          0.42    259,579     B
3          0.46    264,125     B
.          .       .           .
.          .       .           .
449,998    0.86    31,154      S       B
449,999    0.12    489,251     B
550,000    0.79    110,154     B
Selection region: s = sum(S), b = sum(B)
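The slides use AMS without writing it out. The sketch below uses the approximate median significance as defined in the challenge documentation, AMS = sqrt(2((s + b + b_reg) ln(1 + s / (b + b_reg)) − s)) with b_reg = 10; s and b are taken as the summed event weights of true signal and true background inside the selection region. The helper names and the example weights are made up.

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate median significance with regularisation term b_reg."""
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

def selection_sums(weights, truth, selected):
    """Sum the weights of true signal (s) and true background (b) among selected events."""
    s = sum(w for w, t, sel in zip(weights, truth, selected) if sel and t == "s")
    b = sum(w for w, t, sel in zip(weights, truth, selected) if sel and t == "b")
    return s, b

# Made-up weights: two selected events, one of each kind, and one rejected event.
s, b = selection_sums([1.5, 2.0, 4.0], ["s", "b", "b"], [True, True, False])
score = ams(s, b)
```

When s is much smaller than b the score behaves like s / sqrt(b), which is the trade-off named on the slide: a purer selection raises the ratio, a larger one raises the counts.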
PERFORMANCE BASED ON AMS
[Figure: AMS and percentage of signal plotted against percentile]
COMPARE TWO MODEL RESULTS
[Figure: AMS and percentage of signal plotted against percentile, training vs. cross validation]
AMS BY NUM. ITERATION
[Animation: AMS plotted against percentile as the number of iterations grows]
HEAT MAP OF AMS ON B-S PLANE
[Figure: heat map of AMS with axes s and b; annotation: >> 4]
OPTIMIZATION BASED ON OBJECTIVE FUNCTION
[Figure: AMS plotted against percentile, with points A, B and C marked]
HEAT MAP OF AMS ON B-S PLANE
[Figure: heat map of AMS with axes s and b, with points A, B and C marked]
Inspiration from the Lagrangian method: weight signal and background events by the partial derivatives of the AMS function
AMS CURVE ON B-S PLANE
[Figure: AMS curve on the b-s plane with points A, B and C; arrows show the partial derivative of AMS against s and the partial derivative of AMS against b]
Inspiration from the Lagrangian method: weight signal and background events by the partial derivatives of the AMS function
Ratio of the derivatives ==> relative weight
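The partial derivatives can be written in closed form for the challenge's AMS definition, AMS = sqrt(2((s + B) ln(1 + s / B) − s)) with B = b + b_reg and b_reg = 10: the derivative against s is ln(1 + s / B) / AMS and the derivative against b is (ln(1 + s / B) − s / B) / AMS. The sketch evaluates both at a made-up working point; how the presenter converted the ratio into event weights is not stated, so the last line is only one possible reading.

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate median significance with regularisation term b_reg."""
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

def ams_gradient(s, b, b_reg=10.0):
    """Closed-form partial derivatives of AMS against s and against b."""
    big_b = b + b_reg
    log_term = math.log(1.0 + s / big_b)
    value = ams(s, b, b_reg)
    d_s = log_term / value
    d_b = (log_term - s / big_b) / value
    return d_s, d_b

# Hypothetical working point on the b-s plane.
s0, b0 = 200.0, 5000.0
d_s, d_b = ams_gradient(s0, b0)
relative_weight = d_s / abs(d_b)  # one possible signal-to-background weight ratio
```

The derivative against s is positive and the one against b is negative, so near the working point each unit of signal weight gained is worth `relative_weight` units of background weight removed.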
IMPROVEMENT DUE TO WEIGHTING
[Figure: AMS and AMS* plotted against Num_Iterations, two panels]
AUGMENTED GRADIENT BOOSTING ©
[Diagram: iterative loop]
• Apply GBM
• Weight adjustment
• Remove very high and very low score records from train and test
IMPROVEMENT DUE TO ELIMINATION
[Figure: AMS and AMS* plotted against Num_Iterations, two panels]
AUGMENTED GRADIENT BOOSTING (GENERALIZED) ©
[Diagram: iterative loop]
• Apply ML model
• Weight adjustment
• Remove very high and very low score records from train and test
OTHER TOPICS
• Version control (Git, SourceTree)
  – Effectively implement many different ideas
• File organization
  – Efficiently pull out the file needed
• Effective code (R, Python)
  – It matters so much when dealing with big data
Thank you for your participation!
Any questions?
goDCI.com

Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 

Recently uploaded (20)

CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 

Big Data Competition: maximizing your potential
 exampled with the 2014 Higgs Boson Machine Learning Challenge

  • 1. BIG DATA COMPETITION: MAXIMIZING YOUR POTENTIAL, EXAMPLED WITH THE 2014 HIGGS BOSON MACHINE LEARNING CHALLENGE. Dr. Cheng CHEN, email: cchen@goDCI.com, twitter: @cheng_chen_us. Development Consulting International LLC, goDCI.com. This presentation is copyright protected ©
  • 2. PRESENTER: Ohio State University, Tongji University. Ph.D. Civil Engineering; M.S. Applied Statistics; Minor Computer Science. Advanced trainings: City and Regional Planning, Industrial and Systems Engineering, Mathematics. Passion: (this) machine learning
  • 3. HIGGS BOSON MACHINE LEARNING CHALLENGE • Goal: improve the procedure that produces the selection region of the Higgs boson • 4-month duration • 1,785 teams • Many machine learning experts, statisticians, and physicists • Top 5 are from 5 different countries: Hungary, Netherlands, France, Russia, U.S.A/China http://www.kaggle.com/c/higgs-boson/leaderboard
  • 4. Background (workflow diagram). Tracks: Data, Model. Stages: Understand, Explore, Enhance, Train, Select, Optimize, Validate. Step labels: read, discuss, visualize, reduce, generate, apply, find, fine-tune, innovate, cross validate ©
  • 5. Background (same workflow diagram as slide 4) ©
  • 7. HIGGS BOSON • a.k.a. the God Particle (explains some mass) • A fundamental particle theorized in 1964 in the Standard Model of particle physics • “Considered” discovered in 2011-2013 at the LHC by CERN • A number of prestigious awards in 2013, including a Nobel Prize • Quote: A "definitive" answer might require "another few years" after the collider's 2015 restart (deputy chair of physics at Brookhaven National Laboratory) http://en.wikipedia.org/wiki/Higgs_boson http://upload.wikimedia.org/wikipedia/commons/0/00/Standard_Model_of_Elementary_Particles.svg
  • 8. CERN: THE EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH • Established in 1954 • Birth of the World Wide Web (1989) (map: maps.google.com)
  • 9. LARGE HADRON COLLIDER (LHC) • 27 km (17 mi) in circumference • 175 meters (574 ft) beneath ground • Built from 1998 to 2008 • Over 10,000 scientists and engineers • Over 100 countries • Seven particle detectors https://www.llnl.gov/news/llnl-set-host-international-lattice-physics-conference http://en.wikipedia.org/wiki/Large_Hadron_Collider
  • 10-12. ATLAS • 46 meters long • 25 meters in diameter • Weighs about 7,000 tonnes • Contains some 3,000 km of cable • Involves roughly 3,000 physicists from over 175 institutions in 38 countries http://en.wikipedia.org/wiki/Large_Hadron_Collider http://higgsml.lal.in2p3.fr/documentation/
  • 13. CHALLENGES IN DETECTION OF HIGGS BOSON • The Higgs boson cannot be measured directly (it decays immediately into lighter particles) • Other particles can decay into the same set of lighter particles • Production and decay of the Higgs boson depend on its mass, which was not predicted by theory (now we know it is close to 125 GeV) • Analogy: seeing a circular shadow does not mean the real object is a sphere https://www2.physics.ox.ac.uk/sites/default/files/2012-03-27/sinead_farrington_pdf_17376.pdf
  • 14. CURRENT DETECTION MECHANISM • Raw data collected from the LHC • Hundreds of millions of proton-proton collisions (events) per second • 400 events of interest are selected per second – Signal events (i.e. Higgs boson) – Background events (i.e. other particles) • Events in an ad hoc selection region (in certain channels) exceeding background noise • Needs improvement in the significance and robustness of the selection criteria
  • 15. SIMPLIFICATIONS FOR COMPETITION • Simulated data • Fixed mass (125 GeV) • Simplified decay channel (next slide) • Simplified background events (three representative types only) – Decay of the Z boson (91.2 GeV) into tau-tau – Decay of a pair of top quarks into a lepton and a hadronic tau – “Decay” of the W boson into a lepton and a hadronic tau due to imperfections in the particle identification procedure • Simplified objective function (significance score)
  • 16-17. SIMPLIFIED DECAY CHANNEL • Decay of the tau-tau channel only • One tau decays into a lepton and two neutrinos • The other tau decays into a hadronic tau and a neutrino • (Note: neutrinos cannot be detected) • Hadronic tau: a bunch of hadrons
  • 18. SIMPLIFIED DECAY CHANNEL (same bullets as slides 16-17; diagram labels: Jets, MET) • Vectorized momenta are given
  • 19. Background (workflow diagram, repeated from slide 4) ©
  • 20. DATA DIMENSION • 250,000 training • 550,000 testing • 30 variables – 17 primitive (momenta, direction) – 13 derived • Data loaded correctly • Notice NA values • 4 rows in training data (one column per EventId):
      Variable                      100000            100001          100002          100003
      DER_mass_MMC                  138.47            160.937         NA              143.905
      DER_mass_transverse_met_lep   51.655            68.768          162.172         81.417
      DER_mass_vis                  97.827            103.235         125.953         80.943
      DER_pt_h                      27.98             48.146          35.635          0.414
      DER_deltaeta_jet_jet          0.91              NA              NA              NA
      DER_mass_jet_jet              124.711           NA              NA              NA
      DER_prodeta_jet_jet           2.666             NA              NA              NA
      DER_deltar_tau_lep            3.064             3.473           3.148           3.31
      DER_pt_tot                    41.928            2.078           9.336           0.414
      DER_sum_pt                    197.76            125.157         197.814         75.968
      DER_pt_ratio_lep_tau          1.582             0.879           3.776           2.354
      DER_met_phi_centrality        1.396             1.414           1.414           -1.285
      DER_lep_eta_centrality        0.2               NA              NA              NA
      PRI_tau_pt                    32.638            42.014          32.154          22.647
      PRI_tau_eta                   1.017             2.039           -0.705          -1.655
      PRI_tau_phi                   0.381             -3.011          -2.093          0.01
      PRI_lep_pt                    51.626            36.918          121.409         53.321
      PRI_lep_eta                   2.273             0.501           -0.953          -0.522
      PRI_lep_phi                   -2.414            0.103           1.052           -3.1
      PRI_met                       16.824            44.704          54.283          31.082
      PRI_met_phi                   -0.277            -1.916          -2.186          0.06
      PRI_met_sumet                 258.733           164.546         260.414         86.062
      PRI_jet_num                   2                 1               1               0
      PRI_jet_leading_pt            67.435            46.226          44.251          NA
      PRI_jet_leading_eta           2.15              0.725           2.053           NA
      PRI_jet_leading_phi           0.444             1.158           -2.028          NA
      PRI_jet_subleading_pt         46.062            NA              NA              NA
      PRI_jet_subleading_eta        1.24              NA              NA              NA
      PRI_jet_subleading_phi        -2.475            NA              NA              NA
      PRI_jet_all_pt                113.497           46.226          44.251          0
      Weight                        0.00265331133733  2.23358448717   2.34738894364   5.44637821192
      Label                         s                 b               b               b
  • 21-22. MISSING VALUES (NA count and percentage per column; the 33 columns not listed here have no NA values) • Notice the consistency in missing values
      No.  col_name                  NA_count   NA_pct
      2    DER_mass_MMC              38,114     15%
      6    DER_deltaeta_jet_jet      177,457    71%
      7    DER_mass_jet_jet          177,457    71%
      8    DER_prodeta_jet_jet       177,457    71%
      14   DER_lep_eta_centrality    177,457    71%
      25   PRI_jet_leading_pt        99,913     40%
      26   PRI_jet_leading_eta       99,913     40%
      27   PRI_jet_leading_phi       99,913     40%
      28   PRI_jet_subleading_pt     177,457    71%
      29   PRI_jet_subleading_eta    177,457    71%
      30   PRI_jet_subleading_phi    177,457    71%
  • 23-24. HOW TO HANDLE MISSING VALUES • Assign a value – Generate a random value – Fit a value (mean, median, nearest neighbor, etc.) – Fix a value (domain knowledge) • Remove the record • Leave as is
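The counting and "fit a value" options above can be sketched in a few lines. This is only an illustration, not the presenter's code: the helper names `count_missing` and `impute_median` are hypothetical, `None` stands in for NA, and the four values are the DER_mass_MMC entries shown on the data slide.

```python
from statistics import median

def count_missing(rows, columns):
    """Return {column: number of rows where the value is None}."""
    return {c: sum(1 for r in rows if r[c] is None) for c in columns}

def impute_median(rows, column):
    """Return new rows with None in `column` replaced by the observed median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [dict(r, **{column: fill if r[column] is None else r[column]})
            for r in rows]

# DER_mass_MMC for the four training rows on the data slide (None = NA).
rows = [
    {"EventId": 100000, "DER_mass_MMC": 138.47},
    {"EventId": 100001, "DER_mass_MMC": 160.937},
    {"EventId": 100002, "DER_mass_MMC": None},
    {"EventId": 100003, "DER_mass_MMC": 143.905},
]
missing = count_missing(rows, ["DER_mass_MMC"])
filled = impute_median(rows, "DER_mass_MMC")
```

Whether imputing is appropriate depends on why the value is missing; the consistent NA pattern in the jet columns suggests the values are undefined rather than unobserved, in which case "leave as is" may be the better option.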
  • 25. HISTOGRAM: PRI_jet_leading_pt (count histograms: original, log transformation, inverse transformation) • Density is more meaningful in the range of x • No fuzzy jump at the edge
  • 26. HISTOGRAM (CONT’D): DER_pt_h (count histograms: original, log transformation, inverse transformation) • Bi-modality is revealed
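A minimal sketch of why a log transformation changes what a histogram shows, assuming equal-width bins. The `histogram` helper is hypothetical, and the input mixes the four DER_pt_h values from the data slide with two made-up large values to mimic a right-skewed variable.

```python
import math

def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0          # avoid a zero width
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # maximum goes in last bin
        counts[idx] += 1
    return counts

# Illustrative right-skewed values (the last two are invented for the example).
pt = [0.414, 27.98, 35.635, 48.146, 120.0, 400.0]
raw_counts = histogram(pt, 3)
log_counts = histogram([math.log(v) for v in pt], 3)
```

On the raw scale nearly everything falls into the first bin; after the log transformation the counts spread across the bins, which is the effect the slides describe.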
  • 27-28. INTERACTIVE VISUALIZATION: R SHINY (demo) http://chencheng.shinyapps.io/demo_higgs
  • 29. INTERACTIVE VISUALIZATION: R SHINY (demo) • Use a reasonable number of bins to display the underlying distribution http://chencheng.shinyapps.io/demo_higgs
  • 30. INTERACTIVE VISUALIZATION: R SHINY (demo) • Use a reasonable transformation to display the underlying distribution http://chencheng.shinyapps.io/demo_higgs
  • 31. HISTOGRAM (CONT’D): PRI_tau_eta (count histogram) • Transformations are sometimes not necessary
  • 32. Do that for all 30 variables
  • 33. PAIRWISE CORRELATIONS: PRI_lep_phi & PRI_met_phi (count plots for background, BKG, and signal, SGN)
  • 34. PAIRWISE CORRELATIONS: PRI_lep_phi & PRI_met_phi (BKG, SGN) • Set the transparency parameter appropriately to reveal important patterns
  • 35. PAIRWISE CORRELATIONS: PRI_lep_phi & PRI_met_phi (BKG, SGN) • A correlation coefficient of 0 does not mean the two variables are unrelated
  • 36. PAIRWISE CORRELATIONS: PRI_lep_phi & PRI_met_phi (BKG, SGN)
  • 37-38. FEATURE ENHANCEMENT: ROTATION. Rotated PRI_lep_phi & PRI_met_phi (BKG, SGN) • Validate visual “evidence” from various perspectives
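A sketch of a plain 45 degree rotation of a variable pair, assuming that is the rotation meant on these slides. The `rotate` helper is hypothetical; the input is PRI_lep_phi and PRI_met_phi from the first training row on the data slide.

```python
import math

def rotate(x, y, degrees=45.0):
    """Rotate the point (x, y) counter-clockwise by `degrees` about the origin."""
    t = math.radians(degrees)
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))

# PRI_lep_phi and PRI_met_phi of EventId 100000.
lep_phi, met_phi = -2.414, -0.277
u, v = rotate(lep_phi, met_phi)
# u is proportional to (lep_phi - met_phi), v to (lep_phi + met_phi).
```

After the rotation one axis carries the difference of the two angles, so a diagonal band in the scatter plot becomes a pattern in a single variable.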
  • 39. PAIRWISE VARIABLES (LOW RES.): DER_pt_h & DER_deltar_tau_lep (count plots for BKG and SGN)
  • 40. PAIRWISE VARIABLES (HIGH RES.): DER_pt_h & DER_deltar_tau_lep (BKG, SGN) • Try high resolution
  • 41. PAIRWISE VARIABLES (HIGH RES.): DER_pt_h & DER_deltar_tau_lep (BKG, SGN) • Curve fitting
  • 42. FEATURE ENHANCEMENT: CURVE FITTING. DER_pt_h & DER_deltar_tau_lep (BKG, SGN) • Enhance a variable based on its correlation with another variable
  • 43. FEATURE ENHANCEMENT: ROTATION BY PRI_TAU_PHI. DER_pt_h & PRI_lep_phi (BKG, SGN) • Domain knowledge
  • 44. FEATURE ENHANCEMENT: ROTATION BY PRI_TAU_PHI. DER_pt_h & PRI_lep_phi (BKG, SGN) • Feature enhancement by applying domain knowledge
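One way to read "rotation by PRI_tau_phi" is to express the other azimuthal angles relative to the tau direction and wrap the result back into one full turn. The sketch below assumes that reading; `wrap_angle` and `rotate_by_tau` are hypothetical names, and the inputs come from the second training row on the data slide.

```python
import math

def wrap_angle(phi):
    """Map an angle in radians to the interval [-pi, pi)."""
    return (phi + math.pi) % (2 * math.pi) - math.pi

def rotate_by_tau(tau_phi, other_phi):
    """Express other_phi relative to the tau direction."""
    return wrap_angle(other_phi - tau_phi)

# EventId 100001: PRI_tau_phi = -3.011, PRI_lep_phi = 0.103.
lep_phi_rel = rotate_by_tau(-3.011, 0.103)
```

The domain knowledge behind this is that the absolute azimuthal orientation of an event carries no information, only the angles between particles do.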
  • 45. FEATURE ENHANCEMENT: ROTATION. PRI_jet_leading_eta & PRI_jet_subleading_eta (count plots for BKG and SGN)
  • 46. DATA DRILL DOWN • Select variable(s): one variable for a histogram, two variables for a scatter plot http://chencheng.shinyapps.io/demo_higgs
  • 47. DATA DRILL DOWN • Dynamically select a subset of data: PRI_jet_num = 2 http://chencheng.shinyapps.io/demo_higgs
  • 48. DATA DRILL DOWN • Patterns in the subset data: PRI_jet_leading_eta & PRI_jet_subleading_eta http://chencheng.shinyapps.io/demo_higgs
  • 49. DATA DRILL DOWN • Dynamically select a subset of data: PRI_jet_num = 3 http://chencheng.shinyapps.io/demo_higgs
  • 50. DATA DRILL DOWN • Patterns in the subset data: PRI_jet_leading_eta & PRI_jet_subleading_eta http://chencheng.shinyapps.io/demo_higgs
  • 51. DATA DRILL DOWN • Patterns in the subset data: PRI_jet_leading_eta & PRI_jet_subleading_eta, shown for PRI_jet_num = 2 and PRI_jet_num = 3 • Interactive data visualization techniques are helpful http://chencheng.shinyapps.io/demo_higgs
  • 52. Do that for all 30 * 29 = 870 (roughly 900) ordered pairs of variables
  • 53. PARTICLE LOCATION: (0, S) (animation) • Convert numerical data back into an actual object with meaning
  • 54. PARTICLE LOCATION: (0, B) (animation)
  • 55. INSPIRATION FROM ANIMATION • Distance ratio between MET-Lep and Tau-Lep: d(MET, Lep)/d(Tau, Lep) (count histogram of dist_ratio_met_lep_tau for BKG and SGN) • Inspiration from meaningful visualization can be helpful
  • 56. INSPIRATION FROM ANIMATION • Distance ratio between MET-Lep and Tau-Lep: d(MET, Lep)/d(Tau, Lep) (two count histograms of dist_ratio_met_lep_tau for BKG and SGN) • Adjust visualization for better efficiency
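The slides do not say how the distance d is measured. The sketch below assumes, purely for illustration, that each object is placed in the transverse plane at (pt * cos(phi), pt * sin(phi)) and that d is the Euclidean distance between those points; both helper names are hypothetical.

```python
import math

def transverse_xy(pt, phi):
    """Place an object in the transverse plane from its pt and phi (assumption)."""
    return (pt * math.cos(phi), pt * math.sin(phi))

def dist_ratio_met_lep_tau(met, met_phi, lep_pt, lep_phi, tau_pt, tau_phi):
    """d(MET, Lep) / d(Tau, Lep) under the transverse-plane assumption."""
    m = transverse_xy(met, met_phi)
    l = transverse_xy(lep_pt, lep_phi)
    t = transverse_xy(tau_pt, tau_phi)
    return math.dist(m, l) / math.dist(t, l)   # fails if tau and lepton coincide

# EventId 100000 from the data slide.
ratio = dist_ratio_met_lep_tau(
    met=16.824, met_phi=-0.277,
    lep_pt=51.626, lep_phi=-2.414,
    tau_pt=32.638, tau_phi=0.381,
)
```

If the presenter used a different notion of distance (for example, one based on angles only), the feature values would differ, so this should be read as one possible construction rather than the method behind the histograms.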
  • 57. FEATURE ENHANCEMENT • Variable reduction – Simple rotation – Transformation – Domain knowledge – … • Feature generation – Domain knowledge – Inspiration from various visualizations – Statistical approaches – … • Examples: 45 degree rotation, curve fitting, rotation by phi, distance_ratio, principal component analysis
  • 58. Background (workflow diagram, repeated from slide 4) ©
  • 59-60. MODELS • Gradient boosting tree • Neural network • Bayesian network • Support vector machine • Generalized additive model
  • 61-62. GRADIENT BOOSTING TREE • Decision tree – Build many shallow trees • Boosting – Build trees based on residuals • Bagging – Each tree uses a subset of the data • Ensembling – Combine the trees
  • 63. DECISION TREE • Regression tree (scatter plot of y against x; x from 0.0 to 10.0, y from −1.0 to 1.0)
  • 64. DECISION TREE • Regression tree with node depth = 1 (node mean, n = number of points):
      all data: 0.19 (n=100)
        x < 6.614: −0.08 (n=64)
        x >= 6.614: 0.66 (n=36)
  • 65. DECISION TREE • Regression tree with node depth = 2:
      all data: 0.19 (n=100)
        x < 6.614: −0.08 (n=64)
          x >= 3.049: −0.53 (n=40)
          x < 3.049: 0.67 (n=24)
        x >= 6.614: 0.66 (n=36)
          x >= 8.953: 0.086 (n=7)
          x < 8.953: 0.8 (n=29)
  • 66. DECISION TREE • Regression tree with node depth = 3:
      all data: 0.19 (n=100)
        x < 6.614: −0.08 (n=64)
          x >= 3.049: −0.53 (n=40)
            x < 5.862: −0.67 (n=32)
            x >= 5.862: 0.045 (n=8)
          x < 3.049: 0.67 (n=24)
        x >= 6.614: 0.66 (n=36)
          x >= 8.953: 0.086 (n=7)
          x < 8.953: 0.8 (n=29)
            x < 7.207: 0.57 (n=7)
            x >= 7.207: 0.87 (n=22)
  • 67. DECISION TREE • Regression tree with node depth = 4:
      all data: 0.19 (n=100)
        x < 6.614: −0.08 (n=64)
          x >= 3.049: −0.53 (n=40)
            x < 5.862: −0.67 (n=32)
              x >= 3.594: −0.8 (n=25)
              x < 3.594: −0.23 (n=7)
            x >= 5.862: 0.045 (n=8)
          x < 3.049: 0.67 (n=24)
        x >= 6.614: 0.66 (n=36)
          x >= 8.953: 0.086 (n=7)
          x < 8.953: 0.8 (n=29)
            x < 7.207: 0.57 (n=7)
            x >= 7.207: 0.87 (n=22)
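Each split in the trees above can be found by trying every threshold and keeping the one with the lowest squared error. The sketch below shows that search for a single split on one variable; `best_split` is a hypothetical helper written for clarity, not speed, and the data are a toy example rather than the points plotted on the slides.

```python
def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimising the squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                      # no threshold between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, lm, rm)
    return best[1:]

# Toy data with one obvious break between x = 3 and x = 7.
threshold, left_mean, right_mean = best_split(
    [1, 2, 3, 7, 8, 9], [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
)
```

A deeper tree repeats the same search inside each resulting group, which is how the depth 2 to depth 4 trees refine the depth 1 split.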
  • 68. DECISION TREE (base model)
      X0 = X; Y0 = Y;
      latest_model = train_tree(X, Y);                   # base model
      for ii = 1:NUM_ITER
        Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
        X = X0[Index_train]; Y = Y0[Index_train];
        v_resid = Y - wts * latest_model(X);
        tree(ii) = train_tree(X, v_resid, wts);
        latest_model += LEARNING_RATE * tree(ii)
      end
  • 69. GRADIENT BOOSTING TREE (V. 1)
      X0 = X; Y0 = Y;
      latest_model = train_tree(X, Y);
      for ii = 1:NUM_ITER
        Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
        X = X0[Index_train]; Y = Y0[Index_train];
        v_resid = Y - latest_model(X);                   # get the residuals
        tree_add = train_tree(X, v_resid);               # fit a tree for residuals
        latest_model += LEARNING_RATE * tree_add         # additive model
      end
  • 70. (STOCHASTIC) GRADIENT BOOSTING TREE
      X0 = X; Y0 = Y;                                    # store input
      latest_model = train_tree(X, Y);
      for ii = 1:NUM_ITER
        Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)   # get sampled index
        X = X0[Index_train]; Y = Y0[Index_train];        # sampled records as input
        v_resid = Y - latest_model(X);
        tree_add = train_tree(X, v_resid);
        latest_model += LEARNING_RATE * tree_add
      end
  • 71. (STOCHASTIC) GRADIENT BOOSTING TREE WITH WEIGHT
      X0 = X; Y0 = Y;
      latest_model = train_tree(X, Y, wts);
      for ii = 1:NUM_ITER
        Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
        X = X0[Index_train]; Y = Y0[Index_train];
        v_resid = Y - wts * latest_model(X);
        tree_add = train_tree(X, v_resid, wts);
        latest_model += LEARNING_RATE * tree_add
      end
  • 72. (GENERAL) GRADIENT BOOSTING ([Stochastic Gradient Boosting] Jerome H. Friedman, 1999)
      X0 = X; Y0 = Y;
      latest_model = train_base_model(X, Y, wts);
      for ii = 1:NUM_ITER
        Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
        X = X0[Index_train]; Y = Y0[Index_train];
        v_pseudo_resid = get_pseudo_residual(X, Y, wts, latest_model, LOSS_FUNCTION_TYPE);
        model_add_base = train_base_model(X, v_pseudo_resid, wts);
        alpha = linear_search(cost_function, model_add_base, X, Y, wts);
        latest_model += LEARNING_RATE * (alpha * model_add_base)
      end
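The unweighted stochastic version of the pseudocode can be made runnable with depth-1 trees as the base learner. This is a sketch under stated assumptions, not the presenter's implementation: `fit_stump`, `boost` and `mse` are hypothetical names, squared-error loss is assumed (so the pseudo-residual is simply y minus the prediction), and the data are a noiseless sine curve chosen to resemble the regression tree example.

```python
import math
import random

def fit_stump(xs, ys):
    """Depth-1 regression tree: returns a function x -> predicted y."""
    pairs = sorted(zip(xs, ys))
    mean_all = sum(ys) / len(ys)
    best = (sum((y - mean_all) ** 2 for y in ys), None, mean_all, mean_all)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    _, thr, lm, rm = best
    if thr is None:                       # no useful split: predict the mean
        return lambda x: mean_all
    return lambda x: lm if x < thr else rm

def boost(xs, ys, num_iter=300, learning_rate=0.1, frac_train=0.7, seed=0):
    """Stochastic gradient boosting with stumps and squared-error loss."""
    rng = random.Random(seed)
    base = fit_stump(xs, ys)              # base model, as in the pseudocode
    trees = []

    def predict(x):
        return base(x) + learning_rate * sum(t(x) for t in trees)

    n_sample = max(2, int(frac_train * len(xs)))
    for _ in range(num_iter):
        idx = rng.sample(range(len(xs)), n_sample)       # sampled records
        resid = [ys[i] - predict(xs[i]) for i in idx]    # residuals
        trees.append(fit_stump([xs[i] for i in idx], resid))
    return predict

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [10 * i / 59 for i in range(60)]
ys = [math.sin(x) for x in xs]
model = boost(xs, ys)
base_only = fit_stump(xs, ys)
```

Comparing `mse(model, xs, ys)` with `mse(base_only, xs, ys)` shows how much the added trees reduce the training error relative to the single base tree; judging generalisation would need held-out data.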
  • 73. Background (workflow diagram, repeated from slide 4) ©
  • 74. APPLYING GBM IN R
      gbm_model = gbm.fit(
        x = train[, x_vars, with = FALSE],
        y = train$Label,
        distribution = char_distr,
        w = w,
        n.trees = n_trees,
        interaction.depth = num_inter,
        n.minobsinnode = min_obs_node,
        shrinkage = shrinkage_rate,
        bag.fraction = frac_bag)
  • 75. VARIABLE IMPORTANCE (chart of relative importance by variable)
  • 76. APPLY MODEL ON TEST DATA
      EventId    Score   RankOrder   Class
      1          0.98    501         s
      2          0.42    259,579     b
      3          0.46    264,125     b
      ...        ...     ...         ...
      449,998    0.86    31,154      s
      449,999    0.12    489,251     b
      550,000    0.79    110,154     b
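Turning scores into the RankOrder and Class columns can be sketched as below, following the convention visible in the table (rank 1 is the highest score). The function name and the `signal_fraction` cut are hypothetical; the slides do not state which cut was used.

```python
def rank_and_classify(scores, signal_fraction=0.15):
    """Rank events by descending score and label the top fraction as signal."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_signal = int(round(signal_fraction * len(scores)))
    ranks = [0] * len(scores)
    classes = ["b"] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank                   # rank 1 = highest score
        if rank <= n_signal:
            classes[i] = "s"
    return ranks, classes

# The six scores shown in the table, with an illustrative cut at the top third.
scores = [0.98, 0.42, 0.46, 0.86, 0.12, 0.79]
ranks, classes = rank_and_classify(scores, signal_fraction=1 / 3)
```

The competition's submission format may use a different rank convention, so the direction of the ranking should be checked against the official rules before submitting.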
  • 77. Background (workflow diagram, repeated from slide 4)
  • 78. GRADIENT BOOSTING PARAMETERS • Number of iterations • Minimum number of observations in each node • Bagging fraction (0.5 ~ 0.8) • Learning rate (< 0.1) • Depth of tree (4 ~ 8)
  • 79. Background (workflow diagram, repeated from slide 4)
  • 80. CROSS VALIDATION • Split training data – 70% for training – 30% for cross validation • Train model (70%) • Measure performance (30%)
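The 70/30 split can be sketched as a seeded shuffle followed by a cut. The function name is hypothetical and the ten event ids are placeholders.

```python
import random

def split_train_validation(records, frac_train=0.7, seed=0):
    """Shuffle a copy of the records and cut it into train and validation parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(frac_train * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

event_ids = list(range(100000, 100010))
train_ids, valid_ids = split_train_validation(event_ids)
```

Fixing the seed keeps the split reproducible, so that differences between two models are not caused by different validation records.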
  • 81. PERFORMANCE BASED ON AMS • Trade-off between the ratio of signal to background events and the number of records in the selection region • Selection region: s = sum(S), b = sum(B)
      EventId    Score   RankOrder   Class   Truth
      1          0.98    501         S       S
      2          0.42    259,579     B
      3          0.46    264,125     B
      ...        ...     ...         ...
      449,998    0.86    31,154      S       B
      449,999    0.12    489,251     B
      550,000    0.79    110,154     B
  • 82. PERFORMANCE BASED ON AMS (plots against percentile: AMS, and AMS with percentage of signal)
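The slides do not write out the AMS formula. The sketch below uses the definition as recalled from the challenge documentation, with a regularisation term of 10 added to b; treat both the formula and the constant as assumptions to be checked against http://higgsml.lal.in2p3.fr/documentation/. The helper `ams_at_threshold` and the toy events are hypothetical.

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate median significance for signal sum s and background sum b."""
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

def ams_at_threshold(scores, weights, labels, threshold):
    """AMS when events with score >= threshold form the selection region."""
    s = sum(w for sc, w, y in zip(scores, weights, labels)
            if sc >= threshold and y == "s")
    b = sum(w for sc, w, y in zip(scores, weights, labels)
            if sc >= threshold and y == "b")
    return ams(s, b)

# Toy events: scores from the slide, unit weights and labels made up for the example.
scores = [0.98, 0.86, 0.79, 0.46, 0.42, 0.12]
weights = [1.0] * 6
labels = ["s", "b", "b", "b", "b", "b"]
value = ams_at_threshold(scores, weights, labels, threshold=0.8)
```

Sweeping the threshold over the score percentiles and plotting the resulting AMS gives a curve of the kind shown on these slides, which makes the trade-off between purity and the size of the selection region visible.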
  • 83-84. COMPARE TWO MODEL RESULTS: training and cross validation (plots against percentile: AMS, and AMS with percentage of signal)
  • 85. AMS BY NUM. ITERATION (animation; AMS against percentile)
  • 86. Background (workflow diagram, repeated from slide 4)
  • 87. HEAT MAP OF AMS ON B-S PLANE (axes: s, b)
  • 88. OPTIMIZATION BASED ON OBJECTIVE FUNCTION (AMS against percentile, with points A, B, C marked)
  • 89. HEAT MAP OF AMS ON B-S PLANE (axes: s, b; points A, B, C marked)
  • 90. HEAT MAP OF AMS ON B-S PLANE (axes: s, b; points A, B, C marked) • Inspiration from the Lagrangian method • Weight signal and background events by the partial derivatives of the AMS function
  • 91. AMS CURVE ON B-S PLANE (axes: s, b; points A, B, C marked) • Inspiration from the Lagrangian method • Weight signal and background events by the partial derivatives of the AMS function: partial derivative of AMS against s, partial derivative of AMS against b • Ratio of the derivatives ==> relative weight
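The partial derivatives mentioned above have a closed form if AMS is taken as sqrt(2((s + b + b_reg) ln(1 + s/(b + b_reg)) - s)) with b_reg = 10, which is the challenge definition as recalled here and an assumption as far as these slides are concerned. The sketch below computes them; how the presenter turned the ratio into event weights is not stated, so `relative_weight` is only one plausible reading.

```python
import math

B_REG = 10.0   # regularisation term, assumed from the challenge definition

def ams(s, b):
    return math.sqrt(2.0 * ((s + b + B_REG) * math.log(1.0 + s / (b + B_REG)) - s))

def ams_gradient(s, b):
    """Return (dAMS/ds, dAMS/db) at the point (s, b), for s > 0."""
    log_term = math.log(1.0 + s / (b + B_REG))
    value = ams(s, b)
    d_s = log_term / value                        # positive: more signal helps
    d_b = (log_term - s / (b + B_REG)) / value    # negative: more background hurts
    return d_s, d_b

# An arbitrary illustrative point on the b-s plane.
d_s, d_b = ams_gradient(50.0, 1000.0)
relative_weight = d_s / -d_b
```

The ratio says how many units of background the objective would trade for one unit of signal at that point, which is the sense in which the derivatives give a relative weight.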
  • 92. IMPROVEMENT DUE TO WEIGHTING (plot of AMS and AMS* against Num_Iterations)
  • 93. IMPROVEMENT DUE TO WEIGHTING (CONT’D) (plot of AMS and AMS* against Num_Iterations)
  • 94. AUGMENTED GRADIENT BOOSTING (diagram) • Apply GBM • Weight adjustment ©
  • 95. AUGMENTED GRADIENT BOOSTING (diagram) • Apply GBM • Weight adjustment • Remove very high and very low score records from train and test ©
  • 96. IMPROVEMENT DUE TO ELIMINATION (plot of AMS and AMS* against Num_Iterations)
  • 97. IMPROVEMENT DUE TO ELIMINATION (CONT’D) (plot of AMS and AMS* against Num_Iterations)
  • 98. AUGMENTED GRADIENT BOOSTING (diagram) • Apply ML model • Weight adjustment • Remove very high and very low score records from train and test ©
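The elimination step can be sketched as a simple filter on the model score. The cut-offs below are hypothetical, since the slides do not say what counts as "very high" or "very low", and `drop_confident` is an illustrative name.

```python
def drop_confident(records, low=0.02, high=0.98):
    """Keep only records whose score lies strictly between low and high."""
    return [r for r in records if low < r["score"] < high]

# Illustrative records; the scores are made up for the example.
records = [
    {"EventId": 1, "score": 0.995},
    {"EventId": 2, "score": 0.42},
    {"EventId": 3, "score": 0.005},
    {"EventId": 4, "score": 0.86},
]
remaining = drop_confident(records)
```

The idea, as far as the diagram suggests, is that later rounds of training concentrate on the records the model is still unsure about.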
  • 99. Background (workflow diagram, repeated from slide 4)
  • 100. OTHER TOPICS • Version control (Git, Source Tree) – Effectively implement many different ideas • File organization – Efficiently pull out the file needed • Effective code (R, Python) – It matters so much when dealing with big data
  • 101. Thank you for your participation! Any questions? goDCI.com