This document provides an overview of machine learning concepts and example algorithms. It discusses how machine learning systems can learn from experience without explicit programming. It then covers classification and regression problems, with random forests and Gaussian processes as example algorithms. The document also discusses feature learning, with autoencoders and PCA as examples. Finally, it discusses practical considerations for applying machine learning, including the importance of data quality, data pipelines, managing error risk, and institutionalizing machine learning applications.
2. ML Overview
● “AI”, “ML”, lots of hype - but what does it actually mean?
● Systems that learn from their experience over time, without explicit programming
● We aren’t in the business of building brains… (99% of us aren’t, at least) and you shouldn’t be either
6. The Problem
● Given a set of observations, we want to be able to predict what class a new point belongs to
● e.g. does this patient have disease X, given a set of measurements?
7. Example Algorithm: Random Forests
● We start by splitting the data into two regions with an axis-aligned line
8. Example Algorithm: Random Forests
● We keep subdividing the regions that still contain a poor mix of classes
9. Example Algorithm: Random Forests
● We build many of these decision trees
● Each one performs poorly on its own
● Their combined vote is powerful (see the sketch below)
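A minimal sketch of this idea in Python, using scikit-learn’s RandomForestClassifier on a synthetic dataset (the library, dataset, and parameters are illustrative assumptions, not part of the original slides):

# Each tree sees a bootstrap sample of the rows and a random subset of the
# features at each split, so individual trees are weak but the majority
# vote across all of them is strong.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("held-out accuracy:", forest.score(X_test, y_test))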
12. The Problem
● Given a set of observations, we want to be able to predict what value a new point takes
● e.g. how profitable will our website be next month? What’s the value of my house?
13. Example Algorithm: Gaussian Processes
● We pick a method for how we wish to join the dots
● In the simplest case, we fit a line to the data
● Infinitely many functions can join the dots - the simpler the better (Occam’s Razor)
14. Example Algorithm: Gaussian Processes
● The ‘kernel’ describes what type of trends we expect and how to interpolate (see the sketch below)
https://github.com/jkfitzsimons/IPyNotebook_MachineLearning/blob/master/Just%20Another%20Kernel%20Cookbook....ipynb
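As a hedged illustration (again using scikit-learn, not the presenter’s own code), Gaussian process regression with an RBF kernel joins the dots smoothly and also reports its own uncertainty:

# The kernel encodes what kind of trends we expect: an RBF kernel says
# "nearby inputs should have similar outputs", i.e. a smooth interpolation.
# The data and parameters here are synthetic and purely illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 10, 20).reshape(-1, 1)            # observed inputs ("the dots")
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)    # noisy observations

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1)
gp.fit(X, y)

X_new = np.array([[2.5], [7.5]])
mean, std = gp.predict(X_new, return_std=True)       # predictions with uncertainty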
18. Example Algorithm: Autoencoders
● The observations have an extremely complex relationship to the output
● We have a lot of data
● Most of the data is redundant
● We wish to learn the useful latent features (see the sketch below)
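A minimal autoencoder sketch, assuming TensorFlow/Keras is available (the layer sizes and data are illustrative only, not from the original slides):

# The network is trained to reproduce its own input; the narrow bottleneck
# layer forces it to keep only the useful structure and discard redundancy.
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 100)                  # stand-in for high-dimensional data

encoder = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(100,)),
    keras.layers.Dense(8, activation="relu"),  # the learned latent features
])
decoder = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(8,)),
    keras.layers.Dense(100, activation="sigmoid"),
])
autoencoder = keras.Sequential([encoder, decoder])

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

latent = encoder.predict(X)                    # 8 learned features per observation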
24. Yes, but how?
● How does one actually go about using ML in a practical setting?
● Many applications are invisible - it’s hard to see the actual process
● There are principles and general concerns
● Four main issues: data, pipelining, error risk, institutionalization
25. #1: It all comes down to data
● Quantity is important, but it’s far from being the only thing
● Hygiene is key - structured is better than unstructured, complete is better than partial
● The bottleneck is often knowing which data is important, matched to goals
● Data scientists spend 80%+ of their time cleaning and preprocessing data before any analysis is done (see the sketch below)
● Side note: data science != machine learning; some highly competent data scientists are skilled in ML methods, but they may not necessarily be able to create new algorithms
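A small pandas sketch of this “hygiene” work (the file and column names are hypothetical, chosen only to illustrate the kind of preprocessing involved):

# De-duplicate, coerce types, and drop incomplete records before any modelling.
import pandas as pd

df = pd.read_csv("observations.csv")                      # hypothetical input file
df = df.drop_duplicates()                                 # remove repeated records
df["age"] = pd.to_numeric(df["age"], errors="coerce")     # bad entries become NaN
df["visit_date"] = pd.to_datetime(df["visit_date"])       # enforce consistent types
df = df.dropna(subset=["age", "outcome"])                 # complete beats partial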
26. #2: Data pipelining
● Having the data is no good if you can’t get it to where it needs to be
● Operating on the data in place is the ideal, but extremely difficult
● The data lake problem: the lake grows exponentially and data gets replicated
● Decide whether data should be processed as a stream or in batches (see the contrast sketched below)
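An illustrative contrast between the two modes (not from the original slides; the running mean is just a stand-in for any per-record computation):

def batch_mean(path):
    # Batch: load everything, then compute. Simple, but the whole dataset
    # must fit in memory and the answer only arrives when the job finishes.
    with open(path) as f:
        values = [float(line) for line in f]
    return sum(values) / len(values)

def streaming_mean(lines):
    # Streaming: consume one record at a time and keep a running result.
    # Constant memory, and the answer is always up to date.
    total, count = 0.0, 0
    for line in lines:
        total += float(line)
        count += 1
        yield total / count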
27. #3: Error risk
● Machine learning models are never 100% accurate
● What happens when the model is wrong?
● Play out the consequences, their magnitude, and their scope
● The best applications have low risk and high gain
28. #4: Institutionalization
● Every project must consider how the results will be used
● Who will use the results? Will the results be factored into decision-making, or will action be taken automatically?
● It’s not just about “doing machine learning”, it’s about creating a culture that uses ML as a core tool
● Data-driven decision making, only more evolved
● Leaders in the space make it so that every person in their organization can answer the “why” question
30. The Upshot
● Google dropped energy usage in data centers by 40%, which translates to $100M USD / year
● Self-driving cars are a reality now (Uber, Tesla, countless others)
● IBM Watson is being used to develop cancer treatments and provide supporting diagnoses
● Better security: access control at Amazon
● Genome sequencing (makes heavy use of various ML methods)
● CERN, LHC: Collision data (Higgs Boson, anyone?)
● George Washington University: automatically learning optimal climate models
<shameless plug>
Dubai Holding: increase profit margins by 25% in real estate businesses (AED 12B)
</shameless plug>