
Intro to Data Science for Non-Data Scientists


Erin LeDell and Chen Huang's presentations from the Intro to Data Science for Non-Data Scientists Meetup at H2O HQ on 08.20.15

- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata


Intro to Data Science for Non-Data Scientists

  1. H2O.ai Machine Intelligence: Data Science for Non-Data Scientists
     - Erin LeDell, Ph.D.
     - Silicon Valley Big Data Science, August 2015
  2. H2O.ai
     - H2O Company
       - Team: 35. Founded in 2012, Mountain View, CA
       - Stanford Math & Systems Engineers
     - H2O Software
       - Open Source Software
       - Ease of Use via Web Interface
       - R, Python, Scala, Spark & Hadoop Interfaces
       - Distributed Algorithms Scale to Big Data
  3. Scientific Advisory Council
     - Dr. Trevor Hastie
       - John A. Overdeck Professor of Mathematics, Stanford University
       - PhD in Statistics, Stanford University
       - Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
       - Co-author with John Chambers, Statistical Models in S
       - Co-author, Generalized Additive Models
       - 108,404 citations (via Google Scholar)
     - Dr. Rob Tibshirani
       - Professor of Statistics and Health Research and Policy, Stanford University
       - PhD in Statistics, Stanford University
       - COPSS Presidents' Award recipient
       - Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
       - Author, Regression Shrinkage and Selection via the Lasso
       - Co-author, An Introduction to the Bootstrap
     - Dr. Stephen Boyd
       - Professor of Electrical Engineering and Computer Science, Stanford University
       - PhD in Electrical Engineering and Computer Science, UC Berkeley
       - Co-author, Convex Optimization
       - Co-author, Linear Matrix Inequalities in System and Control Theory
       - Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
  4. What is Data Science?
     - Problem Formulation
       - Identify an outcome of interest and the type of task: classification / regression / clustering
       - Identify the potential predictor variables
       - Identify the independent sampling units
     - Collect & Process Data
       - Conduct research experiment (e.g. Clinical Trial)
       - Collect examples / randomly sample the population
       - Transform, clean, impute, filter, aggregate data
     - Machine Learning
       - Prepare the data for machine learning (X, Y)
       - Modeling using a machine learning algorithm (training)
       - Model evaluation and comparison
     - Insights & Action
       - Sensitivity & Cost Analysis
       - Translate results into action items
       - Feed results into research pipeline
  5. Source: marketingdistillery.com
  6. What is Machine Learning?
     - What it is:
       - “Field of study that gives computers the ability to learn without being explicitly programmed.” (Samuel, 1959)
       - “Machine learning and statistics are closely related fields. The ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics.” (Jordan, 2014)
       - M.I. Jordan also suggested the term data science as a placeholder name for the overall field.
     - What it’s not:
       - Unlike rules-based systems, which require a human expert to hard-code domain knowledge directly into the system, a machine learning algorithm learns how to make decisions from the data alone.
  7. Machine Learning Overview
     - Regression
       - Predict a real-valued response (viral load, weight)
       - Gaussian, Gamma, Poisson and Tweedie
       - MSE and R^2
     - Classification
       - Multi-class or Binary classification
       - Ranking
       - Accuracy and AUC
     - Clustering
       - Unsupervised learning (no training labels)
       - Partition the data / identify clusters
       - AIC and BIC
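The two regression metrics named on the overview slide, MSE and R^2, can be computed by hand. The following is a minimal sketch in plain Python; the function names and the example weights are made up for illustration and do not come from the talk.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed weights (kg) and the predictions of some model.
observed = [60.0, 72.0, 85.0, 90.0]
predicted = [62.0, 70.0, 83.0, 93.0]

mse = mean_squared_error(observed, predicted)  # 5.25
r2 = r_squared(observed, predicted)            # about 0.96
```

A lower MSE means predictions sit closer to the observed values; an R^2 near 1 means the model explains most of the variation in the response.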
  8. Machine Learning Workflow
     - Example of a supervised machine learning workflow. Source: NLTK
  9. ML Model Performance
     - Test & Train
       - Partition the original data (randomly) into a training set and a test set (e.g. 70/30).
       - Train a model using the “training set” and evaluate performance on the “test set” or “validation set.”
     - K-fold Cross-validation
       - Train & test K models as shown.
       - Average the model performance over the K test sets.
       - Report cross-validated metrics.
     - Performance Metrics
       - Regression: R^2, MSE, RMSE
       - Classification: Accuracy, F1, H-measure
       - Ranking (Binary Outcome): AUC, Partial AUC
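The train/test split and K-fold cross-validation described on this slide can be sketched with the standard library alone. Everything below is illustrative: the helper names are hypothetical, and the "model" is just a baseline that predicts the mean of the training responses, standing in for a real learning algorithm.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Randomly partition rows into (train, test), e.g. 70/30."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each row is tested exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validated_mse(ys, k=5):
    """Average test-fold MSE of a 'predict the training mean' baseline."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        mean_y = sum(ys[j] for j in train_idx) / len(train_idx)  # "training"
        mse = sum((ys[j] - mean_y) ** 2 for j in test_idx) / len(test_idx)
        fold_scores.append(mse)
    return sum(fold_scores) / k


ys = [float(v) for v in range(10)]      # hypothetical responses
train, test = train_test_split(ys)      # 7 training rows, 3 test rows
cv_mse = cross_validated_mse(ys, k=5)   # averaged over the 5 test folds
```

The key property is that no row is ever used to evaluate a model that was trained on it, which is what makes the reported metric an honest estimate of performance on new data.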
  10. What is Deep Learning?
     - What it is:
       - “A branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, composed of multiple non-linear transformations.” (Wikipedia, 2015)
       - Deep neural networks have more than one hidden layer in their architecture. That’s what’s “deep.”
       - Very useful for complex input data such as images, video, audio.
     - What it’s not:
       - Deep learning architectures, specifically artificial neural networks (ANNs), have been around since 1980, so they are not new. However, breakthroughs in training techniques led to their recent resurgence (mid-2000s). Combined with modern computing power, they are quite effective.
  11. Deep Learning Architecture
     - Example of a deep neural net architecture.
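As a rough illustration of what "more than one hidden layer" means, here is a forward pass through a small fully connected network with two hidden layers, written in plain Python. The layer sizes, activation functions and random (untrained) weights are all assumptions made for this sketch; the slide's actual architecture diagram is not reproduced here.

```python
import math
import random


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each output unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def random_layer(n_in, n_out, rng):
    """Untrained random weights, for illustration only."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)
layer_sizes = [4, 5, 3, 1]  # input, two hidden layers, output (assumed sizes)
layers = [random_layer(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]


def forward(x):
    """Pass x through both hidden layers (ReLU), then the output layer (sigmoid)."""
    h = x
    for weights, biases in layers[:-1]:
        h = dense(h, weights, biases, relu)
    out_w, out_b = layers[-1]
    return dense(h, out_w, out_b, sigmoid)[0]


probability = forward([0.2, -0.1, 0.7, 1.5])  # a value between 0 and 1
```

Training would adjust the weights so that the output matches known labels; this sketch only shows how each layer applies a non-linear transformation to the output of the previous one.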
  12. What is Ensemble Learning?
     - What it is:
       - “Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms.” (Wikipedia, 2015)
       - Random Forests and Gradient Boosting Machines (GBM) are both ensembles of decision trees.
       - Stacking, or Super Learning, is a technique for combining various learners into a single, powerful learner using a second-level metalearning algorithm.
     - What it’s not:
       - Ensembles typically achieve superior model performance over singular methods. However, this comes at a price: computation time.
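The simplest way to see the ensemble idea is majority voting, sketched below. Note that this is plain voting, not stacking or Super Learning (which would train a second-level metalearner on the base learners' predictions), and the three rule-based "learners" and the customer fields are hypothetical stand-ins for trained models.

```python
from collections import Counter


# Three hypothetical base learners: each maps a customer record to a label.
def rule_by_age(customer):
    return "buy" if customer["age"] < 40 else "skip"


def rule_by_income(customer):
    return "buy" if customer["income"] > 50000 else "skip"


def rule_by_visits(customer):
    return "buy" if customer["visits"] >= 3 else "skip"


BASE_LEARNERS = [rule_by_age, rule_by_income, rule_by_visits]


def ensemble_predict(customer):
    """Return the label chosen by the majority of the base learners."""
    votes = Counter(learner(customer) for learner in BASE_LEARNERS)
    return votes.most_common(1)[0][0]


customer = {"age": 35, "income": 42000, "visits": 5}
prediction = ensemble_predict(customer)  # votes: buy, skip, buy
```

Every prediction requires running all base learners, which is where the extra computation time mentioned on the slide comes from.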
  13. Where to learn more?
     - H2O Online Training (free): http://learn.h2o.ai
     - H2O Slidedecks: http://www.slideshare.net/0xdata
     - H2O Video Presentations: https://www.youtube.com/user/0xdata
     - H2O Community Events & Meetups: http://h2o.ai/events
     - Machine Learning & Data Science courses: http://coursebuffet.com
  14. H2O World: Customers, Community, Evangelists
     - November 9, 10, 11, Computer History Museum
     - h2oworld.h2o.ai
     - 20% off registration using code: h2ocommunity
  15. Questions?
     - @ledell on Twitter, GitHub
     - erin@h2o.ai
     - http://www.stat.berkeley.edu/~ledell
  16. Data Science for Non-Data Scientists, aka How the Business Views Data Science
     - Chen Huang, August 20, 2015
  17. Agenda
     - Introduction
     - Data Science Primer
     - Working with Data Scientists
     - Decoding the Data Science Lingo
     - Q&A
  18. Introduction
     - Who am I?
     - Why am I giving this talk?
  19. Who am I?
     - Data Strategist
     - Career in Business Intelligence, Analytics, and Big Data
     - Various roles
       - Consultant
       - Developer
       - Business and Data Analyst
       - Product Manager
       - Functional and Technical Trainer
       - Client Services
     - Worked in various industries
       - Health care, pharmaceutics, communications and high tech, consumer products, automotive, finance, government contracting
     - August 2015, San Francisco, CA
  20. Why am I giving this talk?
     - July 2011, Beijing, China
  21. Data Science Primer
     - What can Data Science do for the Business?
     - Applications of Data Science
     - Data-Driven Decisions
     - What does a Data Scientist do?
     - Data Science Skills
  22. What can Data Science do for the Business?
     - Data science means extracting useful information and knowledge from large volumes of data in order to improve business decision-making, or providing the business with the insights it needs to make data-driven decisions.
     - Diagram labels: Data, Business
  23. What can Data do?
     - Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
  24. Applications of Data Science
     - Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
  25. Data-Driven Decisions
     - Practice of basing decisions on data, rather than purely on intuition
     - There is evidence that data-driven decision making and big data technologies substantially improve business performance
  26. The Art and Science of Data Science
     - Discover unknowns in data
     - Obtain predictive, actionable insights
     - Communicate business data stories
     - Build confidence in decision making
     - Create valuable Data Products that have business impact
     - Source: http://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do
  27. What does a Data Scientist do?
     - Is curious about data: explores data and discovers unknowns
     - Understands data relationships
     - Understands the business and has domain knowledge
     - Can tell relevant stories with data
     - Has a holistic view of the business
     - Knows machine learning, statistics, probability
     - Can hack and code
     - Defines and tests hypotheses, runs experiments
     - Asks good questions
     - Source: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
  28. Data Science Skills
     - Image: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
  29. Image: http://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize
  30. Image: http://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize
  31. Working with Data Scientists
     - Collaboration
     - Data Science Cycle
     - Organizational Models for Data Science Teams
  32. Working with Data Scientists
     - Diagram labels: Data Science, Business, Data Engineering
  33. Data Science Cycle
     - Image: https://en.wikipedia.org/wiki/Data_science
  34. Organizational Models for Data Science Teams
     - Image: http://www.slideshare.net/emcacademics/building-data-science-teams-31057129
  35. Decoding the Data Science Lingo
  36. Machine Learning
     - Data Science Definition:
       - A subfield of computer science and artificial intelligence (AI) that focuses on the design of systems that can learn from and make decisions and predictions based on data.
       - Machine learning enables computers to act and make data-driven decisions rather than being explicitly programmed to carry out a certain task.
       - Machine Learning programs are also designed to learn and improve over time when exposed to new data.
     - Business Application:
       - Everything!
     - Definition source: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
  37. Unsupervised Learning
     - Data Science Definition:
       - Where a program, given a dataset, can automatically find patterns and relationships within the dataset.
       - The business decides how deep the grouping goes, or how many categories there are.
       - Clustering or grouping of like data.
       - Examples: k-means clustering, hierarchical clustering
     - Business Application:
       - Customer segmentation
       - Understanding users and behaviors
       - Classifying unknown and pre-defined images into categories
     - Definition source: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
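To make k-means clustering, one of the examples named on this slide, a little more concrete, here is a minimal one-dimensional version in plain Python. The spending figures and the initial centers are invented for illustration; the number of initial centers plays the role of the "how many categories" decision that the business makes.

```python
def k_means_1d(values, centers, iterations=10):
    """Lloyd's algorithm on 1-D data; `centers` are the initial guesses."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spend per customer; no labels are provided.
monthly_spend = [10, 12, 11, 95, 100, 105]
centers, segments = k_means_1d(monthly_spend, centers=[0, 50])
# centers  -> [11.0, 100.0]
# segments -> [[10, 12, 11], [95, 100, 105]]
```

The algorithm is never told which customers are "low" or "high" spenders; the two segments emerge from the data alone, which is what makes this unsupervised.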
  38. Supervised Learning
     - Data Science Definition:
       - Where a program is “trained” on a pre-defined dataset.
       - Based on its training data, the program can make accurate decisions when given new data.
     - Business Application:
       - Classifying Twitter sentiments
       - Recommender systems
     - Definition source: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
  39. Score
     - Data Science Definition:
       - Any of a number of ways to evaluate how well the model assigns the correct class value to the test instances.
     - Business Application:
       - Confidence gauge
     - Definition source: https://mlcorner.wordpress.com/tag/scoring/
  40. Score Cont.
     - True Positive (TP): the instance is positive and it is classified as positive
     - False Negative (FN): the instance is positive but it is classified as negative
     - True Negative (TN): the instance is negative and it is classified as negative
     - False Positive (FP): the instance is negative but it is classified as positive
     - Classification problems:
       - Precision = the proportion of instances classified as positive that are actually positive = TP/(TP+FP)
       - Accuracy = the proportion of correctly classified instances = (TP+TN)/(TP+TN+FP+FN)
       - Recall or Sensitivity = the proportion of actual positives that are correctly classified = TP/(TP+FN)
       - Specificity = the proportion of actual negatives that are correctly classified = TN/(TN+FP)
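The four counts and the four scores on this slide translate directly into code. The sketch below uses made-up labels (1 = positive, 0 = negative); the function names are illustrative only.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, TN and FP by comparing actual and predicted labels."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


def classification_scores(tp, fn, tn, fp):
    """Apply the formulas from the slide to the four counts."""
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical test set: 4 actual positives and 6 actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fn, tn, fp = confusion_counts(actual, predicted)  # 3, 1, 5, 1
scores = classification_scores(tp, fn, tn, fp)
# precision 0.75, accuracy 0.8, recall 0.75, specificity 5/6
```

Precision and recall happen to match here only because this example has one false positive and one false negative; in general they move independently.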
  41. Classification
     - Data Science Definition:
       - Sub-category of Supervised Learning
       - Classification is the process of taking some sort of input and assigning a label to it. The predictions are discrete: categories, or answers of a “yes or no” nature.
       - Examples: Logistic Regression, Random Forest
     - Business Application:
       - What customers should a company target with its marketing campaigns?
       - Is this Nigerian prince committing fraud? (Spam classification)
       - Is this actually Barack Obama’s Facebook profile and review on Amazon? (Fraud detection)
     - Definition source: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
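Logistic regression, one of the classification examples named on this slide, can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: a single predictor, batch gradient descent with an arbitrarily chosen learning rate and number of epochs, and invented data about site visits and purchases.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


def classify(x, w, b, threshold=0.5):
    """Turn the predicted probability into a discrete yes/no label."""
    return 1 if sigmoid(w * x + b) >= threshold else 0


# Hypothetical data: number of site visits (x), whether the customer bought (y).
visits = [1, 2, 3, 4, 5, 6]
bought = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(visits, bought)
labels = [classify(x, w, b) for x in (1.5, 5.5)]  # [0, 1]
```

The model outputs a probability; the threshold is what turns it into the discrete label that makes this a classification task rather than a regression task.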
  42. Regression
     - Data Science Definition:
       - Sub-category of Supervised Learning
       - Regression is a type of algorithm that predicts continuous values.
     - Business Application:
       - How much would a user spend on a mobile game like CandyCrush?
       - How much would someone spend on healthcare out of pocket?
       - How many attendees will come to this event based on past registration?
     - Definition source: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
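The event-attendance question on this slide can be illustrated with the simplest regression model, a straight line fitted by ordinary least squares. The registration and attendance numbers below are invented, and real data would not fall on a perfect line as they do here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past events: registrations and the attendees who showed up.
registrations = [100, 150, 200, 250]
attendees = [60, 90, 120, 150]

slope, intercept = fit_line(registrations, attendees)
expected_attendees = slope * 300 + intercept  # about 180 for 300 registrations
```

Unlike a classifier, the model returns a number on a continuous scale, so the answer can be any value rather than one of a fixed set of labels.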
  43. Decision Trees
     - Data Science Definition:
       - Using a tree-like graph or model of decisions and their possible consequences.
     - Business Application:
       - Medical Testing (e.g. health incidences, etc.)
       - Genealogy breakdowns (e.g. eye color, blood type, etc.)
     - Definition source: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
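A decision tree is easy to picture as nested questions. The sketch below walks a small hand-written tree; in practice a learning algorithm would build the tree from data, and the questions and outcomes here are purely hypothetical and not medical advice.

```python
# Each internal node asks a yes/no question; each leaf is a decision.
TREE = {
    "question": "fever",
    "yes": {
        "question": "cough",
        "yes": "test for flu",
        "no": "monitor",
    },
    "no": "no test needed",
}


def decide(node, patient):
    """Follow the branches that match the patient's answers until a leaf."""
    while isinstance(node, dict):
        branch = "yes" if patient[node["question"]] else "no"
        node = node[branch]
    return node


decision = decide(TREE, {"fever": True, "cough": True})  # "test for flu"
```

Because every prediction is just a path of answered questions, a decision tree is one of the easier models to explain to a business audience.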
  44. Deep Learning
     - Data Science Definition:
       - A category of machine learning algorithms that often use Artificial Neural Networks to generate a model.
     - Business Application:
       - Image classification
       - Language processing
       - Audio processing
       - Outlier and fraud detection
     - Definition source: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
  45. Questions?
