# ML Masterclass

Oct 25, 2016

• 1. Machine Learning Masterclass
• 2. ML Overview ● “AI”, “ML”, lots of hype - but what does it actually mean? ● Systems that learn from their experience over time, without explicit programming ● We aren’t in the business of building brains… (99% of us aren’t, at least) and you shouldn’t be either
• 3. ML Fun: Where’s Wally (Waldo)
• 4. ML Fun: Where’s Wally (Waldo)
• 5. Classification
• 6. The Problem ● Given a set of observations we want to be able to predict what class a new point belongs to ● e.g. does this patient have disease X given a set of measurements
• 7. Example Algorithm: Random Forests ● We subdivide the data into two regions with an axis-aligned line
• 8. Example Algorithm: Random Forests ● We continue subdividing the data in the areas which have a bad mix of classes
• 9. Example Algorithm: Random Forests ● We build many of these decision trees ● Each performs poorly individually ● Their combined vote is powerful
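The three Random Forest slides above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the talk: the toy data, the Gini impurity measure, the one-random-feature-per-split rule, and all function names (`build_tree`, `build_forest`, `predict_forest`) are assumptions made for the example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when all labels agree, larger for a 'bad mix' of classes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(points, labels, rng):
    """Try axis-aligned thresholds on one randomly chosen feature; keep the purest."""
    feature = rng.randrange(len(points[0]))
    best = None
    for threshold in sorted({p[feature] for p in points}):
        left = [l for p, l in zip(points, labels) if p[feature] <= threshold]
        right = [l for p, l in zip(points, labels) if p[feature] > threshold]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, feature, threshold)
    return best

def build_tree(points, labels, rng, depth=0, max_depth=4):
    """Keep subdividing regions that still contain a mix of classes."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or gini(labels) == 0.0:
        return majority  # leaf: a plain class label
    split = best_split(points, labels, rng)
    if split is None:
        return majority
    _, feature, threshold = split
    left = [(p, l) for p, l in zip(points, labels) if p[feature] <= threshold]
    right = [(p, l) for p, l in zip(points, labels) if p[feature] > threshold]
    return (feature, threshold,
            build_tree(*zip(*left), rng, depth + 1, max_depth),
            build_tree(*zip(*right), rng, depth + 1, max_depth))

def predict_tree(tree, point):
    """Walk down the tree; internal nodes are tuples, leaves are class labels."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if point[feature] <= threshold else right
    return tree

def build_forest(points, labels, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(points)) for _ in points]
        forest.append(build_tree([points[i] for i in idx],
                                 [labels[i] for i in idx], rng))
    return forest

def predict_forest(forest, point):
    """Combine the individual trees by majority vote."""
    votes = Counter(predict_tree(tree, point) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical measurements for eight patients (two features each).
points = [(1.0, 1.2), (1.5, 0.8), (0.9, 1.9), (2.0, 1.0),
          (6.0, 6.5), (7.1, 5.9), (6.4, 7.2), (5.8, 6.1)]
labels = ["healthy"] * 4 + ["disease"] * 4

forest = build_forest(points, labels)
print(predict_forest(forest, (0.5, 0.5)), predict_forest(forest, (8.0, 8.0)))
```

A single tree trained on an unlucky bootstrap sample can misplace its threshold, which is why the vote across many trees matters.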
• 10. Many Algorithms
• 11. Regression
• 12. The Problem ● Given a set of observations we want to be able to predict what value a new point takes ● e.g. how profitable will our website be next month? What’s the value of my house?
• 13. Example Algorithm: Gaussian Processes ● We pick a method for joining the dots ● In the simplest case, we fit a line to the data ● Infinitely many functions can join the dots - the simpler the better (Occam’s Razor)
• 14. Example Algorithm: Gaussian Processes ● The ‘kernel’ describes what type of trends we expect and how to interpolate https://github.com/jkfitzsimons/IPyNotebook_MachineLearning/blob/master/Just%20Another%20Kernel%20Cookbook....ipynb
• 15. Example Algorithm: Gaussian Processes ● The ‘kernel’ describes what type of trends we expect and how to interpolate
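To make "the kernel describes how to interpolate" concrete, here is a minimal sketch of the Gaussian process posterior mean in plain Python. The squared-exponential (RBF) kernel, the `length_scale` and `noise` values, the toy data, and the helper names are all assumptions for illustration; the linked notebook may use different kernels and tooling.

```python
import math

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel: nearby inputs are expected to have similar outputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_fit(xs, ys, kernel, noise=1e-6):
    """Return weights alpha = (K + noise * I)^-1 y for the training data."""
    K = [[kernel(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    return solve(K, ys)

def gp_predict(x_new, xs, alpha, kernel):
    """Posterior mean at x_new: a kernel-weighted combination of the training points."""
    return sum(kernel(x_new, x) * a for x, a in zip(xs, alpha))

# Hypothetical observations (the "dots" to be joined).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.8, 0.9, 0.1, -0.8]

alpha = gp_fit(xs, ys, rbf_kernel)
for x in (0.5, 2.0, 3.5):
    print(x, round(gp_predict(x, xs, alpha, rbf_kernel), 3))
```

Swapping `rbf_kernel` for a different kernel function changes the kind of trend the model expects between and beyond the observed points, without touching the rest of the code.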
• 16. Feature Learning
• 18. Example Algorithm: Autoencoders ● The observations have an extremely complex relationship to the output ● We have a lot of data ● Most of the data is redundant ● We wish to learn the useful latent features
• 19. Example Algorithm: PCA (EigenFaces)
• 20. Example Algorithm: PCA (EigenFaces)
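The idea behind PCA (and EigenFaces, which applies it to face images) can be sketched on a toy dataset: two measured features that are almost redundant, so one latent feature describes them well. This is an illustrative sketch only; the data, the use of power iteration to find the first principal component, and the function names are assumptions, not material from the slides.

```python
import math

def mean_center(data):
    """Subtract the per-feature mean so the data is centred on the origin."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data], means

def covariance(centered):
    """Sample covariance matrix of mean-centred data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=100):
    """Power iteration: converges to the direction of greatest variance."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: the second feature is roughly twice the first.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]

centered, means = mean_center(data)
component = first_component(covariance(centered))

# One latent feature per observation: its coordinate along the component.
scores = [sum(c * v for c, v in zip(row, component)) for row in centered]

# Reconstruct both original features from the single latent feature.
reconstructed = [[m + s * v for m, v in zip(means, component)] for s in scores]
print(component)
print(reconstructed[0], data[0])
```

With images, each pixel is a feature and the leading components are the "eigenfaces"; the mechanics are the same, only the dimensionality is larger.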
• 24. Yes, but how? ● How does one actually go about using it in any practical setting? ● Many applications invisible - hard to see the actual process ● There are principles and general concerns ● Four main issues: data, pipelining, error risk, institutionalization
• 25. #1: All comes down to data ● Quantity is important, but it’s far from being the only thing ● Hygiene is key - structured is better than unstructured, complete is better than partial ● Bottleneck is often knowing what data is important, matched to goals ● Data scientists spend 80%+ of their time cleaning + preprocessing data, before any analysis is done ● Side note: Data science != machine learning; some highly competent data scientists are skilled in ML methods, but they may not necessarily be able to create new algorithms
• 26. #2: Data pipelining ● Having the data is no good if you can’t get it to where it needs to be ● Operating in-place is the ultimate, but extremely difficult ● The data lake problem: lake grows exponentially, replication ● Define streaming vs batch (examples of streaming vs batch)
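One way to picture streaming vs batch is computing the same statistic both ways. In the sketch below, the batch version needs the whole dataset in hand before it can answer, while the streaming version updates its answer as each record arrives and never stores the history. The readings and function names are hypothetical.

```python
def batch_mean(values):
    """Batch: collect everything first, then compute once."""
    return sum(values) / len(values)

def streaming_mean(stream):
    """Streaming: update the running mean per record, yielding the latest estimate."""
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count
        yield mean

# Hypothetical sensor readings arriving one at a time.
readings = [3.0, 5.0, 4.0, 8.0]

running = list(streaming_mean(iter(readings)))
print(batch_mean(readings))
print(running)
```

Both approaches agree once all the data has been seen; the difference is when an answer is available and how much data must be held or moved to get it.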
• 27. #3: Error risk ● Machine learning models are never 100% accurate ● What happens when the model is wrong? ● Play out consequences, their magnitude, and scope ● The best applications have low risk and high gain
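Playing out the consequences can start with simple arithmetic: weigh the gain from correct predictions against the cost of wrong ones. The sketch below uses entirely hypothetical accuracies, gains, and costs (in arbitrary units) to show why a less accurate model in a low-risk setting can be a better application than a more accurate model where mistakes are expensive.

```python
def expected_value_per_decision(accuracy, gain_when_right, cost_when_wrong):
    """Average payoff of one automated decision, given how often the model is right."""
    return accuracy * gain_when_right - (1.0 - accuracy) * cost_when_wrong

# Hypothetical low-risk case: a product recommendation that is merely ignored when wrong.
recommendation = expected_value_per_decision(
    accuracy=0.90, gain_when_right=1.0, cost_when_wrong=0.1)

# Hypothetical high-risk case: rare mistakes, but each one is very expensive.
high_stakes = expected_value_per_decision(
    accuracy=0.99, gain_when_right=1.0, cost_when_wrong=500.0)

print(round(recommendation, 2), round(high_stakes, 2))
```

This ignores scope (how many people or systems one error touches), which would scale the cost term further.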
• 28. #4: Institutionalization ● Every project must consider how the results will be used ● Who will use the results? Will the results be factored into decision-making, or will action be taken automatically? ● It’s not just about “doing machine learning”, it’s about creating a culture that uses ML as a core tool ● Data-driven decision making, only more evolved ● Leaders in the space make it so that every person in their organization can answer the “why” question
• 29. A lot of work!
• 30. The Upshot ● Google dropped energy usage in data centers by 40%, which translates to \$100M USD / year ● Self-driving cars are a reality now (Uber, Tesla, countless others) ● IBM Watson is being used for developing cancer treatments and providing supporting diagnoses ● Better security: access control at Amazon ● Genome sequencing (makes heavy use of various ML methods) ● CERN, LHC: Collision data (Higgs Boson, anyone?) ● George Washington University: automatically learning optimal climate models ● <shameless plug> Dubai Holding: increase profit margins by 25% in real estate businesses, \$12B AED </shameless plug>