2. ML Overview
● “AI”, “ML”, lots of hype - but what does it actually mean?
● Systems that learn from their experience over time, without being explicitly programmed
● We aren’t in the business of building brains… (99% of us aren’t, at least) and you shouldn’t be either
6. The Problem
● Given a set of observations, we want to be able to predict what class a new point belongs to
● e.g. does this patient have disease X, given a set of measurements?
7. Example Algorithm: Random Forests
● We split the data in two with an axis-aligned line (a threshold on a single feature)
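A minimal sketch of what choosing one axis-aligned line can look like, assuming Gini impurity as the measure of how mixed each side is (the slides do not name a criterion). The names `gini` and `best_split` and the toy data are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for a more mixed one."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(points, labels):
    """Try every feature and threshold; keep the split with the lowest weighted impurity."""
    best = None  # (impurity, feature_index, threshold)
    n = len(points)
    for feature in range(len(points[0])):
        values = sorted({p[feature] for p in points})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for p, y in zip(points, labels) if p[feature] <= threshold]
            right = [y for p, y in zip(points, labels) if p[feature] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, feature, threshold)
    return best

# Toy data (hypothetical): class "a" has small x, class "b" has large x.
points = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (7.0, 2.0), (8.0, 5.0), (9.0, 1.0)]
labels = ["a", "a", "a", "b", "b", "b"]
impurity, feature, threshold = best_split(points, labels)
print(f"split on feature {feature} at {threshold} (impurity {impurity})")
```

On this toy data the search settles on the first feature, because a single vertical line separates the two classes cleanly while no horizontal line does.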
8. Example Algorithm: Random Forests
● We keep subdividing the regions that still contain a bad mix of classes
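The repeated subdivision can be sketched as a short recursive function. This is an illustrative toy, not a production tree learner: it assumes Gini impurity as the mixing measure and a hypothetical `max_depth` cut-off, and the data is invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(points, labels, depth=0, max_depth=5):
    # Stop once the region is pure (or deep enough) and predict its majority class.
    if len(set(labels)) == 1 or depth == max_depth:
        return majority(labels)
    best = None  # (score, feature, threshold, left_indices, right_indices)
    for feature in range(len(points[0])):
        values = sorted({p[feature] for p in points})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [i for i, p in enumerate(points) if p[feature] <= threshold]
            right = [i for i, p in enumerate(points) if p[feature] > threshold]
            score = sum(len(side) * gini([labels[i] for i in side]) for side in (left, right))
            if best is None or score < best[0]:
                best = (score, feature, threshold, left, right)
    if best is None:  # every point is identical, so no line can separate them
        return majority(labels)
    _, feature, threshold, left, right = best
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([points[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
        "right": build_tree([points[i] for i in right], [labels[i] for i in right], depth + 1, max_depth),
    }

def predict(tree, point):
    # Walk down the tree until we reach a leaf (a plain class label).
    while isinstance(tree, dict):
        side = "left" if point[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree

# Toy data (hypothetical): class "a" sits in the lower-left corner only.
points = [(1, 1), (2, 2), (1, 6), (2, 7), (6, 1), (7, 2), (6, 6), (7, 7)]
labels = ["a", "a", "b", "b", "b", "b", "b", "b"]
tree = build_tree(points, labels)
print(predict(tree, (1.5, 1.5)), predict(tree, (6.5, 1.5)), predict(tree, (1.5, 6.5)))
```

Here the first line leaves one side pure and the other mixed, so only the mixed side is split again.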
9. Example Algorithm: Random Forests
● We build many of these decision trees
● Each performs poorly individually
● Their combined vote is powerful
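A rough way to see why the combined vote helps: if each tree is only slightly better than a coin flip, and we assume (optimistically) that the trees make independent mistakes on a two-class problem, the majority is right far more often than any single tree. Real trees are correlated, so treat the numbers as an upper bound on the effect. The function name and the 60% figure are illustrative assumptions.

```python
from math import comb

def majority_vote_accuracy(n_trees, p_correct):
    """Probability that more than half of n_trees independent voters are right."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Hypothetical weak trees: each one is right 60% of the time.
for n_trees in (1, 11, 101):
    print(n_trees, round(majority_vote_accuracy(n_trees, 0.6), 3))
```

Using an odd number of trees avoids ties in the vote.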
12. The Problem
● Given a set of observations, we want to be able to predict what value a new point takes
● e.g. how profitable will our website be next month? What’s the value of my house?
13. Example Algorithm: Gaussian Processes
● We choose how we wish to join the dots
● In the simplest case, we fit a line to the data
● Infinitely many functions can join the dots - the simpler the better (Occam’s Razor)
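For the simplest case, fitting a line, ordinary least squares has a closed form. A small sketch with made-up data follows; `fit_line` is an illustrative name, and this is plain linear regression rather than a Gaussian process.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data (hypothetical) lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # value predicted for a new point at x = 5
print(slope, intercept, prediction)
```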
14. Example Algorithm: Gaussian Processes
● The ‘kernel’ describes what type of trends we expect and how to interpolate
https://github.com/jkfitzsimons/IPyNotebook_MachineLearning/blob/master/Just%20Another%20Kernel%20Cookbook....ipynb
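To make the role of the kernel concrete, here is a minimal sketch of the posterior mean of a zero-mean Gaussian process in one dimension. It assumes a squared-exponential (RBF) kernel with a hypothetical `length_scale`, a small `noise` term for numerical stability, and a hand-rolled linear solver so that it runs without third-party libraries. It is not taken from the linked notebook.

```python
import math

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel: nearby inputs are expected to have similar outputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x

def gp_posterior_mean(train_x, train_y, query_x, kernel, noise=1e-6):
    """Posterior mean k_*^T (K + noise * I)^-1 y of a zero-mean GP."""
    K = [
        [kernel(a, b) + (noise if i == j else 0.0) for j, b in enumerate(train_x)]
        for i, a in enumerate(train_x)
    ]
    weights = solve(K, train_y)
    return [sum(kernel(q, x) * w for x, w in zip(train_x, weights)) for q in query_x]

# Toy observations (hypothetical).
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.0, 0.8, 0.9, 0.1]
at_training_points = gp_posterior_mean(train_x, train_y, train_x, rbf_kernel)
in_between = gp_posterior_mean(train_x, train_y, [1.5], rbf_kernel)
far_away = gp_posterior_mean(train_x, train_y, [100.0], rbf_kernel)
print(at_training_points, in_between, far_away)
```

With this kernel the curve passes (almost exactly) through the observed dots, interpolates smoothly between them, and falls back to the prior mean of zero far from any data. Swapping in a different kernel changes how the dots are joined.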
18. Example Algorithm: Autoencoders
● The observations have an extremely complex relationship to the output
● We have a lot of data
● Most of the data is redundant
● We wish to learn the useful latent features
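As a heavily simplified illustration of learning a latent feature from redundant data, the sketch below trains a tiny linear autoencoder by gradient descent: two input numbers are squeezed into one latent number and then reconstructed. Real autoencoders are neural networks with non-linear layers; the linear model, learning rate, step count and toy data here are all assumptions chosen to keep the example short.

```python
def train_linear_autoencoder(data, steps=500, lr=0.05):
    """Compress 2-D points to one latent number and back, by gradient descent."""
    enc = [0.1, 0.1]  # encoder weights: z = enc . x
    dec = [0.1, 0.1]  # decoder weights: x_hat = dec * z
    n = len(data)
    for _ in range(steps):
        grad_enc = [0.0, 0.0]
        grad_dec = [0.0, 0.0]
        for x in data:
            z = enc[0] * x[0] + enc[1] * x[1]
            residual = [dec[i] * z - x[i] for i in range(2)]
            back = residual[0] * dec[0] + residual[1] * dec[1]  # error flowing back to z
            for i in range(2):
                grad_dec[i] += 2 * residual[i] * z / n
                grad_enc[i] += 2 * back * x[i] / n
        enc = [enc[i] - lr * grad_enc[i] for i in range(2)]
        dec = [dec[i] - lr * grad_dec[i] for i in range(2)]
    return enc, dec

def reconstruction_error(data, enc, dec):
    """Mean squared distance between each point and its reconstruction."""
    total = 0.0
    for x in data:
        z = enc[0] * x[0] + enc[1] * x[1]
        total += sum((dec[i] * z - x[i]) ** 2 for i in range(2))
    return total / len(data)

# Redundant toy data (hypothetical): the second measurement is always twice the first.
data = [(t, 2 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
error_before = reconstruction_error(data, [0.1, 0.1], [0.1, 0.1])
enc, dec = train_linear_autoencoder(data)
error_after = reconstruction_error(data, enc, dec)
print(error_before, error_after)
```

Because the two measurements carry the same information, a single latent number is enough to reconstruct both.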
24. Yes, but how?
● How does one actually go about using it in any practical setting?
● Many applications are invisible - it is hard to see the actual process
● There are principles and general concerns
● Four main issues: data, pipelining, error risk, institutionalization
25. #1: All comes down to data
● Quantity is important, but it’s far from being the only thing
● Hygiene is key - structured is better than unstructured, complete is better than partial
● Bottleneck is often knowing what data is important, matched to goals
● Data scientists spend 80%+ of their time cleaning + preprocessing data, before any analysis is done
● Side note: Data science != machine learning; some highly competent data scientists are skilled in ML methods, but they may not necessarily be able to create new algorithms
26. #2: Data pipelining
● Having the data is no good if you can’t get it to where it needs to be
● Operating on the data in place is the ideal, but extremely difficult
● The data lake problem: lake grows exponentially, replication
● Define streaming vs batch (examples of streaming vs batch)
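One way to picture the streaming vs batch distinction is to compute the same statistic both ways. The sketch below is an illustration under simple assumptions (a mean over made-up sensor readings; the names `batch_mean` and `StreamingMean` are hypothetical), not a description of any particular pipeline tool.

```python
def batch_mean(values):
    """Batch: wait until all the data has landed, then process it in one go."""
    return sum(values) / len(values)

class StreamingMean:
    """Streaming: update the answer as each record arrives, keeping only constant state."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

readings = [12.0, 15.0, 11.0, 14.0, 18.0]  # made-up sensor readings

stream = StreamingMean()
running = [stream.update(r) for r in readings]  # an answer is available after every record

print(batch_mean(readings), running)
```

Both approaches agree once all records have been seen; the streaming version never needs the full dataset in one place, which is what matters when the data cannot easily be moved.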
27. #3: Error risk
● Machine learning models are never 100% accurate
● What happens when the model is wrong?
● Play out the consequences, their magnitude, and their scope
● The best applications have low risk and high gain
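Playing out the consequences can be as simple as weighting each kind of error by what it costs. The sketch below compares two hypothetical applications that share the same error rates; every number in it is invented purely for illustration.

```python
def expected_cost_per_decision(p_false_positive, p_false_negative,
                               cost_false_positive, cost_false_negative):
    """Average cost of model errors per decision.

    The probabilities are fractions of all decisions that end up as each error type.
    """
    return (p_false_positive * cost_false_positive
            + p_false_negative * cost_false_negative)

# Same (hypothetical) error rates, very different consequences.
movie_recommender = expected_cost_per_decision(0.05, 0.05, 1.0, 1.0)
disease_screening = expected_cost_per_decision(0.05, 0.05, 500.0, 100_000.0)

print(movie_recommender, disease_screening)
```

The model quality is identical in both cases; only the cost of being wrong differs, which is why the same accuracy can be acceptable in one setting and unacceptable in another.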
28. #4: Institutionalization
● Every project must consider how the results will be used
● Who will use the results? Will the results be factored into decision-making, or will action be taken automatically?
● It’s not just about “doing machine learning”, it’s about creating a culture that uses ML as a core tool
● Data-driven decision making, only more evolved
● Leaders in the space make it so that every person in their organization can answer the “why” question
30. The Upshot
● Google cut the energy used for cooling its data centers by up to 40%, which translates to $100M USD / year
● Self-driving cars are a reality now (Uber, Tesla, countless others)
● IBM Watson is being used for developing cancer treatments and providing supporting diagnoses
● Better security: access control at Amazon
● Genome sequencing (makes heavy use of various ML methods)
● CERN, LHC: Collision data (Higgs Boson, anyone?)
● George Washington University: automatically learning optimal climate models
<shameless plug>
Dubai Holding: increase profit margins by 25% in real estate businesses, $12B AED
</shameless plug>