Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro

Data Science Tech Institute - Big Data and Data Science Conference around Dr Gregory Piatetsky-Shapiro.
Keynote - An overview on Big Data & Data Science Dr Gregory Piatetsky-Shapiro - KDnuggets.com Founder & Editor.
Paris May 23rd & Nice May 26th 2016 @ Data ScienceTech Institute (https://www.datasciencetech.institute/)

Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro

  1. 1. 1 https://www.datasciencetech.institute/
  2. 2. Data Science: Past, Present, and Future (La Science des données: passé, présent et futur). Gregory Piatetsky-Shapiro, KDnuggets. © KDnuggets 2016
  3. 3. Predicting Behavior – Key to Survival © KDnuggets 2016 3 Better prediction – better intelligence
  4. 4. “Predictions”: Astrology © KDnuggets 2016 4 My May 26 Horoscope: So what if things aren't completely wonderful in your life right now? Just keep your hopes high, and your fingers crossed. … Being with the people who make you feel good about yourself will help keep your thoughts bright, so get together with your closest friend as soon as you can.. www.astrology.com/horoscope/daily/aries.html
  5. 5. “Predictions” : Turkish Coffee Grinds © KDnuggets 2016 5 If a big chunk of the coffee grounds falls down on the saucer then it is taken as the first positive sign of your reading. “Trouble and worries are leaving you”.
  6. 6. Pundits “Predictions” • Nate Silver FiveThirtyEight.com prediction for Trump winning Republican nomination: • Aug 2015: 2% • Sep 2015: 5% • Nov 2015: 6% • Jan 2016: 12% • May 2016: 99% © KDnuggets 2016 6
  7. 7. Desire to Predict – Deep Human Trait © KDnuggets 2016 7 • People are hard-wired to see patterns • People want predictions • Human intuition does not work on large scale data, for understanding probability • Good story is essential to a convincing prediction (whether true or false) Lessons
  8. 8. Data Science Data-Driven, Scientific approach to prediction and data analysis 8
  9. 9. Outline • Intro, Data Science History and Terms • 10 Real-World Data Science Lessons • Data Science Now: Polls & Trends • Data Science Roles • Data Science Job Trends • Data Science Future © KDnuggets 2016 9
  10. 10. What do we call it? • Statistics • Data Mining • Knowledge Discovery in Data (KDD) • Predictive Analytics • Data Analytics • Data Science • …? © KDnuggets 2016 10 Core Idea: Finding Useful Patterns in Data
  11. 11. Pre-history (1800-2008): Statistics. From Google Ngram Viewer, English-language books, case-insensitive search; other languages need to be considered for the full picture. • “Statistics” is the biggest term in the 20th century • “Analytics” is used increasingly through the 20th century • “Data mining” appears in the late 1990s
  12. 12. French Books, 1800-2008 Statistiques vs Mathematiques © KDnuggets 2016 12
  13. 13. “Data Mining” Surges in 1996. [Chart series: Analytics, Data Mining.] Milestones: KDD-95, 1st Conference on Knowledge Discovery and Data Mining, Montreal; Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, Eds: U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Google N-grams search, case insensitive, smoothing 1
  14. 14. Earliest use of “data mining”: 1962. Source: Google Books, after eliminating many “following data. Mining cost is” examples, which refer to mining of minerals, and books from “1958” that have a CD attached (errors in book year). The earliest “data mining” reference I found is
  15. 15. Very Recent History Using Google Trends (c) KDnuggets 2016 16
  16. 16. Google Trends, 2005-2016 (global, all regions): After 2006, Analytics > Data Mining
  17. 17. >50% of “Analytics” searches are for “Google Analytics” (Google Analytics introduced Dec 2005)
  18. 18. Google Trends, 2005-2016. [Chart series: data science, analytics - Google, big data, data mining.]
  19. 19. Google Trends, 2005-2016. 2012: Analytics down, Big Data up
  20. 20. Google Trends, 2005-2016. 2013: Data Science grows
  21. 21. Google Trends: Machine Learning, Data Science, Deep Learning (2009-2015)
  22. 22. Google Trends: Machine Learning © KDnuggets 2016 23 Machine Learning ~ “Machine Learning”
  23. 23. Google Trends: Data Science (2009-2015). [Data Science] != “Data Science”. Lesson: Examine assumptions carefully
  24. 24. Regional Interest in “Data Science” in 2015 (Google Trends). Note: search for “Data Science” is different from [Data Science]
  25. 25. KDnuggets Audience by Region, Q1 2016 © KDnuggets 2016 26
  26. 26. Data Science History • < 1900 - Statistics • 1960s - Data Mining = bad activity, data “dredging” • 1990 - “Data Mining” is good, surges in 1996 • 2003 - “Data Mining” peaks (bad in press, invasion of privacy?), slowly declines, but still popular • 2006 - Google Analytics • 2007 - Business/Data/Predictive Analytics • 2012 - Big Data • 2014 - Data Science • 2015 - Deep Learning • 2018 - ??
  27. 27. 10 Real-World Lessons from the Art & Practice of Data Science & Data Mining
  28. 28. Lesson 1: It is an Iterative, Circular Process. Waterfall model does NOT work for Data Science
  29. 29. CRISP-DM: Iterative, Circular Process. Data Mining Process – CRISP-DM, 1998: 1. Business Understanding 2. Data Understanding 3. Data Preparation 4. Modeling 5. Evaluation 6. Deployment. See www.kdnuggets.com/2016/03/data-science-process-rediscovered.html
  30. 30. Academic Data Science Process (Harvard, 2013). See www.kdnuggets.com/2016/03/data-science-process-rediscovered.html
  31. 31. Machine Learning Workflow, MS Azure. See www.kdnuggets.com/2016/04/developers-need-know-about-machine-learning.html and blogs.msdn.microsoft.com/continuous_learning/2014/11/15/end-to-end-predictive-model-in-azureml-using-linear-regression/
  32. 32. Lesson 2: Data Engineering Takes the Bulk of Time • Building Machine Learning/Predictive Models is the key (and most fun) part, but only a small part of the whole process • 60-80% (?) of the time is spent on Data Preparation/Engineering
  33. 33. Competitions are different. March Machine Learning Mania 2016, Winner's Interview: 1st Place, Miguel Alomar. How #MachineLearning @Kaggle winner spent time: 35% read forums, 25% build models, 25% evaluate results, 15% data preparation. https://twitter.com/kdnuggets/status/730417186167263232 http://blog.kaggle.com/2016/05/10/march-machine-learning-mania-2016-winners-interview-1st-place-miguel-alomar/
  34. 34. Lesson 3: Question Assumptions. Problem: Measurement deciles are not uniform. Decile 1 is too rare, Decile 0 is too frequent. Why? (* Not actual data)
  35. 35. Mass Spectrometry. Mass spectrometry (MS) is an analytical technique that ionizes chemical species and sorts the ions based on their mass-to-charge ratio. It can produce a large number (~20,000) of m/z values for a sample. Goal: find biomarkers for disease, test, condition
  36. 36. Question Assumptions. Instead of Measurement deciles, examine actual ranges, including 0: • Nothing between 1 and 14 • Value 0 is too frequent. Why? (* Not actual data)
  37. 37. Question Assumptions. Instead of Measurement deciles, examine actual ranges, including 0: • Nothing between 1 and 14 • Value 0 is too frequent. Why? Someone added a rule to round raw measurement values below 15 to zero. (* Not actual data)
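
The check described on the slides above (look at actual value ranges instead of deciles) can be sketched in a few lines of Python. This is only an illustration under assumptions: the measurements are made up, and `value_gaps`, `zero_share`, and the `min_gap` cutoff are hypothetical names and settings, not something taken from the talk.

```python
def value_gaps(values, min_gap=10):
    """Return (low, high) pairs of adjacent distinct values that are at least min_gap apart."""
    distinct = sorted(set(values))
    return [(a, b) for a, b in zip(distinct, distinct[1:]) if b - a >= min_gap]


def zero_share(values):
    """Fraction of measurements that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


# Hypothetical raw measurements (not actual data).
raw = [3, 7, 12, 14, 15, 16, 18, 22, 25, 31, 40, 9, 1, 17, 28]

# Simulate the hidden rule from the slide: raw values below 15 are stored as zero.
stored = [0 if v < 15 else v for v in raw]

print(value_gaps(stored))   # the empty range between 0 and 15 shows up as a gap
print(zero_share(stored))   # zeros are far more frequent than any other single value
```

A gap starting at 0 combined with an unusually large share of zeros is the kind of pattern that prompts the "Why?" on the slide.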
  38. 38. “The best data scientists have one thing in common – unbelievable curiosity.” DJ Patil, first US Chief Data Scientist, April 2016. http://www.sciencefriday.com/articles/10-questions-for-the-nations-first-chief-data-scientist
  39. 39. Lesson 4: Focus on the Right Metric - Actionable • Consumer: Churn may depend on age, region, usage, and rate plan. Rate plan easiest to change. • Uplift Modeling in Marketing and Politics: focus on persuadables © KDnuggets 2016 40
  40. 40. Right Metric: Uplift Modeling. Don’t model whether the consumer will buy; model whether the consumer will buy in response to an offer
  41. 41. Right Metric: Uplift Modeling. From Eric Siegel's presentation at PAW, 2011; used in the Obama 2012 campaign. www.thefiscaltimes.com/Articles/2013/01/21/The-Real-Story-Behind-Obamas-Election-Victory
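
To make the uplift idea concrete, here is a minimal sketch that treats uplift as the difference in response rate between customers who received an offer and a control group who did not. That definition is one common choice and an assumption here, not necessarily the one used in the cited presentation; the segment names and outcomes are invented.

```python
def response_rate(outcomes):
    """Share of customers who bought (outcomes are 1 = bought, 0 = did not buy)."""
    return sum(outcomes) / len(outcomes)


def uplift(treated, control):
    """Response rate with the offer minus response rate without it."""
    return response_rate(treated) - response_rate(control)


# Hypothetical outcomes per segment (illustrative only).
segments = {
    "persuadable": {
        "treated": [1, 1, 1, 0, 1, 0, 1, 1, 0, 1],
        "control": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    },
    "sure_thing": {
        "treated": [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
        "control": [1, 1, 1, 1, 1, 0, 1, 1, 1, 1],
    },
}

# Rank segments by how much the offer changes behavior, not by how likely they are to buy.
ranked = sorted(
    segments,
    key=lambda name: uplift(segments[name]["treated"], segments[name]["control"]),
    reverse=True,
)
print(ranked)
```

In this toy data the "sure_thing" segment buys most often, yet the offer changes nothing for it, so a plain purchase-propensity model would rank it first while an uplift view ranks it last.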
  42. 42. Lesson 5: Be a Fox, not a Hedgehog. Read Isaiah Berlin's 1953 essay, The Hedgehog and the Fox: a fox knows many things, but a hedgehog knows one important thing.
  43. 43. Lesson 5: Modeling. No Free Lunch Theorem: no method is universally the best (Wolpert). In Kaggle competitions, there are 2 ways to win (Anthony Goldbloom, 2016): • Handcrafted feature engineering • Or Deep Learning Neural Networks (www.kdnuggets.com/2016/01/anthony-goldbloom-secret-winning-kaggle-competitions.html). Also: • XGBoost – winning method in many recent Kaggle competitions • Ensemble methods. For structured data (Sebastian Raschka): • SVM (Support Vector Machines) for smaller data • Random Forests – more data, more automated (www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html). For unstructured data: • Deep Learning
  44. 44. Lesson 6: Avoid Overfitting © KDnuggets 2016 45 http://www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html Many examples at http://tylervigen.com/spurious-correlations
  45. 45. Avoid Overfitting. “Irreproducible” results are a BIG problem in social sciences and medicine: see John P. A. Ioannidis's famous paper, Why Most Published Research Findings Are False (PLoS Medicine, 2005). Due to • Small samples • Testing too many hypotheses • Confirmation bias (explicit or implicit) • Poor training
  46. 46. How to Avoid Overfitting • If it is too good to be true, it probably is • Find the simplest possible hypothesis • Adjusting the False Discovery Rate • Randomization Testing • Nested cross-validation (train, test, holdout) • Regularization (adding a penalty for complexity) © KDnuggets 2016 47 www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html
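
Of the safeguards listed on the slide above, randomization testing is easy to sketch with the standard library alone. The example below is a two-sided permutation test on the difference in group means; the function name, the number of permutations, the seed, and the data are all assumptions made for illustration, not details from the talk.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Shuffles the pooled values and counts how often a random split produces a
    difference at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical data: one clearly separated pair of groups, one identical pair.
separated = permutation_p_value([10, 11, 12, 13, 14, 15, 16, 17], [1, 2, 3, 4, 5, 6, 7, 8])
identical = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
print(separated, identical)
```

If a pattern survives only in the original labeling and disappears under shuffling, it is more likely to be real; if shuffled data produce it just as often, it is probably an artifact of searching too hard.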
  47. 47. Lesson 7: Tell a story • Combine facts into a story • Combine visual and text presentation • Explanation gives credibility • Dynamic / Interactive • Examples: Kefir, Google Analytics, Quill © KDnuggets 2016 48
  48. 48. KEFIR (KEy FInding Reporter), 1994 • Overview report www.kdnuggets.com/data_mining_course/kefir/overview.htm • Inpatient admissions www.kdnuggets.com/data_mining_course/kefir/s2.htm © KDnuggets 2016 49
  49. 49. Quill report for KDnuggets • Sessions Stay Flat, But Way Higher Than 12-Month Weekly Average • Sessions remained flat compared to the prior week. The 121,040 sessions, however, were above your 85,105-session weekly average for the year. Your site's total pageviews stayed flat last week at 206,124, while pages per session grew less than a percent to 1.7. That's equal to your weekly average for the year. • Among all your pages, Analytics, Data Mining, and Data Science had both the highest bounce rate (43%) and the most pageviews (8,734) last week. © KDnuggets 2016 50
  50. 50. La Diseuse de bonne aventure, Caravaggio, 1595 (Louvre) © KDnuggets 2016 51 Beware of Fortune tellers!
  51. 51. Lesson 8: Limits to Predicting Human Behavior? • Inherent randomness, complexity in human behavior • Individual predictions have limited accuracy (but can still be better than random and very useful for consumer analytics) • Aggregate predictions (eg who will win the election) more accurate, because individual randomness cancels out (c) KDnuggets 2016 52
  52. 52. Example: Netflix Prize, 2006 • The most advanced algorithms were only a few percent better than basic algorithms. See Gregory Piatetsky, “Big Data: Hype & Reality”, Harvard Business Review 2012, https://hbr.org/2012/10/big-data-hype-and-reality/
  53. 53. Direct Marketing Lift: Random and Model-sorted Lists. [Chart: CPH (Cumulative Pct Hits) vs. Pct of list, for Random and Model lists.] 5% of the random list has 5% of the hits; 5% of the model-score-ranked list has 21% of the hits. Lift(5%) = 21%/5% = 4.2
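
The lift arithmetic on the slide above can be reproduced with a short sketch. The list below is synthetic and built so that the top 5% of model-ranked records contain 21% of all hits, matching the slide's numbers; `lift_at` is a hypothetical helper name.

```python
def lift_at(scored, fraction):
    """Lift at a given fraction of the list.

    scored is a list of (model_score, is_hit) pairs; lift is the share of all
    hits captured in the top `fraction` of the score-ranked list, divided by
    that fraction.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    total_hits = sum(hit for _, hit in ranked)
    top_hits = sum(hit for _, hit in ranked[:top_n])
    share_of_hits = top_hits / total_hits
    return share_of_hits / (top_n / len(ranked))


# Synthetic list of 1000 records with 100 hits in total:
# 21 hits among the 50 highest-scored records, 79 hits among the rest.
records = []
for rank in range(1000):
    score = 1000 - rank
    if rank < 50:
        is_hit = 1 if rank < 21 else 0
    else:
        is_hit = 1 if rank < 50 + 79 else 0
    records.append((score, is_hit))

print(round(lift_at(records, 0.05), 2))
```

A random ordering would put about 5% of the hits in any 5% slice, which corresponds to a lift of 1.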
  54. 54. Most lift curves are surprisingly similar: a limit to human predictability? Study of lift curves in banking, telecom: • Best lift curves are similar • Special point T = Target percentage • Lift(T) ~ sqrt(1/T). [Chart: Actual lift(T) and Est. lift(T) vs. 100*T%.] G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, in Proceedings of KDD-99 Conference, ACM Press, 1999.
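
The rule of thumb Lift(T) ~ sqrt(1/T) from the slide above is simple enough to tabulate. The sketch below assumes T is given as a fraction between 0 and 1 (so 5% is 0.05); `estimated_lift` is a hypothetical name, and the output is only the heuristic estimate, not measured lift.

```python
import math


def estimated_lift(target_fraction):
    """Heuristic lift estimate sqrt(1/T), with T expressed as a fraction in (0, 1]."""
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    return math.sqrt(1 / target_fraction)


# Tabulate the estimate for a few target percentages.
for pct in (1, 5, 10, 25):
    print(pct, round(estimated_lift(pct / 100), 2))
```

The estimate falls as T grows and reaches 1 at T = 100%, which matches the intuition that there is no lift left once the whole list is selected.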
  55. 55. More recent data is more predictive! • Real-time behavior data more predictive than historical, demographic data • Ad retargeting © KDnuggets 2016 56
  56. 56. Lesson 9: Deployment & Maintenance • Netflix Prize winning algorithm not deployed: “… the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Also, our focus on improving Netflix personalization had shifted to the next level by then.” http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html • Technical debt of Machine Learning (Google, research.google.com/pubs/pub43146.html)
  57. 57. Modeling in Real World vs Kaggle • ROI of extra accuracy vs cost of maintenance • Is model explainable ? (legal, acceptance reasons) • Does model discriminate on basis of race, gender,…? • Netflix Prize algorithm which won $1M - not implemented • In real-world, simpler is usually better © KDnuggets 2016 58
  58. 58. Deployment Test and Monitor • Monitor assumptions – Do fields have the same value distributions • Detect when model is no longer valid, needs rebuilding • Automatic model re-build © KDnuggets 2016 59
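
One way to monitor whether a field still has the same value distribution as at training time is to compare category frequencies. The sketch below uses total variation distance and a 0.2 threshold; both are illustrative choices rather than anything prescribed in the slides, and the field values are invented.

```python
from collections import Counter


def distribution(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {key: count / total for key, count in counts.items()}


def total_variation_distance(train_values, live_values):
    """Half the sum of absolute frequency differences; 0 means identical, 1 means disjoint."""
    p = distribution(train_values)
    q = distribution(live_values)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def needs_rebuild(train_values, live_values, threshold=0.2):
    """Flag the model for rebuilding when the field has drifted past the (assumed) threshold."""
    return total_variation_distance(train_values, live_values) > threshold


# Hypothetical rate-plan field at training time and in two later periods.
train = ["basic"] * 50 + ["plus"] * 30 + ["premium"] * 20
stable = ["basic"] * 50 + ["plus"] * 30 + ["premium"] * 20
drifted = ["basic"] * 20 + ["plus"] * 30 + ["premium"] * 50

print(needs_rebuild(train, stable), needs_rebuild(train, drifted))
```

In practice the threshold would be tuned per field, and a flag like this would trigger the review or automatic re-build mentioned on the slide.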
  59. 59. Lesson 10: Don’t just predict, optimize • Prediction is usually just one part of making a decision • Consider cost, frequency, latency, human behavior, etc • Goal: Optimization • From Data Science to Decision Science © KDnuggets 2016 60
  60. 60. Privacy in the age of Big Data • Privacy laws much stricter in Europe • Individual Privacy vs Benefits for all (eg aggregated health-care data) • Image and Face recognition (eg Facebook) • Very hard to keep privacy with so many digital breadcrumbs • Privacy vs Security (eg FBI vs Apple) • Politicians are behind technology curve – researchers should help society, politicians make an informed decision © KDnuggets 2016 61
  61. 61. When Is It Ethical to Analyze a Particular Dataset?
  62. 62. Data Ethics Golden Rule: Don’t do with someone else's data what you don’t want done with your data
  63. 63. Data Science Now: What, Where, How. KDnuggets Polls Findings, www.KDnuggets.com/polls/
  64. 64. Where did you apply Analytics, Data Mining, Data Science? Avg. number of industries: 2.7. Most popular: CRM, Finance, Banking, Health Care, Science, e-commerce. Highest growth in: Games, 121%; Entertainment / Music, 74%; Social Good/Non-profit, 68%; Finance, 42%; Education, 30%. www.kdnuggets.com/2016/01/poll-analytics-data-mining-data-science-applied-2015.html
  65. 65. Data Types Analyzed/Mined. Most popular: table data, time series, text, itemsets/transactions. Fastest growing: music/audio, JSON. www.kdnuggets.com/polls/2014/data-types-sources-analyzed.html
  66. 66. Largest Dataset Analyzed? © KDnuggets 2016 67 www.kdnuggets.com/2015/08/largest-dataset-analyzed-more-gigabytes-petabytes.html
  67. 67. Largest Dataset Analyzed? © KDnuggets 2016 68 Python swallowed an Elephant? Antoine de Saint-Exupery
  68. 68. Largest Dataset Analyzed? © KDnuggets 2016 69 Big Data Miners – elite group . www.kdnuggets.com/2015/08/largest-dataset-analyzed-more-gigabytes-petabytes.html Median in 11-100 GB range, slight increase.
  69. 69. Largest Dataset Analyzed by Region © KDnuggets 2016 70 Big Data Miners: TeraBytes and Petabytes 10-25%
  70. 70. 4 Main Languages of Data Science © KDnuggets 2016 71 www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html
  71. 71. 4 Main Languages of Data Science, 2 © KDnuggets 2016 72
  72. 72. R vs Python © KDnuggets 2016 74 http://www.kdnuggets.com/2015/07/poll-primary-analytics-language-r-python.html Surprising Stability: 88% of R users stayed with R and 91% stayed with Python. % of primary R , Python users up, while % Other or None down.
  73. 73. Data Science Roles 77(c) KDnuggets 2016
  74. 74. Data Science Roles • Data Analyst • (Big) Data Engineer • Data Scientist • Machine Learning Researcher • Data Science Manager/Director • Chief Data Officer • Company Founder © KDnuggets 2016 78
  75. 75. Data Science Venn Diagram, 2010 © KDnuggets 2016 79 Drew Conway, 2010
  76. 76. LinkedIn Data Skills LinkedIn has 334,000 Titles with “Data” • Data Analyst 60,273 • Data Scientist 12,680 • Database Analyst 4,357 • Business Data Analyst 1,709 • Senior Data Scientist 1,691 • Sr. Data Analyst 1,131 Thanks to Lutz Finger, Director of Analytics at LinkedIn for this custom study © KDnuggets 2016 80
  77. 77. LinkedIn: 4 Groups of Skills Skills of people with “Data” in the title grouped into dedicated clusters - using similarity of members with similar skills. Database Management and Software • Access Database BTEQ Cubes Data Center Data Modeling Database Admin Database Administration Database Design Databases DB2 Embedded SQL FastExport FastLoad MDX Memcached Microsoft SQL Server MLOAD MongoDB Multiload MySQL NoSQL OA Framework Oracle Oracle Developer Suite Oracle Discoverer Oracle Enterprise Manager Oracle PL/SQL Development Oracle RAC Oracle SQL Developer Performance Tuning PhpMyAdmin PL/SQL PostgreSQL RDBMS Redis Relational Databases Replication RMAN SQL SQL Server Management Studio SQL*Plus SQL400 SQLite Stored Procedures Sybase T-SQL Teradata Toad TPT TPUMP Machine Learning • Computational Linguistics Data Visualization Information Retrieval Machine Learning Natural Language Processing Research Design Sentiment Analysis Structural Bioinformatics Text Mining Mathematics • Algebra Applied Mathematics Calculus Differential Equations Fortran Geometry Image Analysis LabVIEW Linear Algebra Maple Mathematica Mathematical Modeling Mathematics Matlab Monte Carlo Simulation Numerical Analysis Numerical Simulation Operations Research Partial Differential Equations Pre-Calculus Scientific Computing Simulations Trigonometry Statistical Analysis and Data Mining • A/B Testing Analytics ANOVA Business Analytics Cluster Analysis Data Analysis Data Mining Decision Trees Design of Experiments Economic Modeling Experimental Design Factor Analysis Google Analytics JMP Linear Regression Logistic Regression Marketing Analytics Minitab Pattern Recognition Predictive Analytics Predictive Modeling Primary Research Questionnaire Design Questionnaires R Sampling SAS SAS Programming SDTM Secondary Research SPSS Statistical Consulting Statistical Data Analysis Statistical Modeling Statistical Programming Statistics Survey Research Survival Analysis Time Series Analysis Web Analytics 
  78. 78. LinkedIn Skills. Number of skills relating to Data : number of LinkedIn members. 1 : 9,708,214; 2 : 3,870,376; 3 : 2,065,318; 4 : 1,097,849; 5 : 576,310; 6 : 305,266; 7 : 169,351; 8 : 98,284; 9 : 60,419; 10 : 37,689
  79. 79. Data Science Skills, Updated. [Diagram labels: Database, Coding Skills; Domain/Business Expertise.]
  80. 80. Data Analyst/BI Analyst. [Diagram labels: Database, Coding Skills; Domain/Business Expertise; Data Analyst.] Glassdoor, Apr 2016: US avg salary $60-70,000; positions: 13,000
  81. 81. Data Engineer. [Diagram labels: Database, Coding Skills; Domain/Business Expertise; Data Engineer.] Glassdoor, Apr 2016: US salary $95,500; jobs: 40,296. France (Ingénieur … Data): 5K jobs
  82. 82. Machine Learning Researcher. [Diagram labels: Database, Coding Skills; Domain/Business Expertise; ML Researcher.]
  83. 83. “Unicorn” Data Scientist. [Diagram labels: Database, Coding Skills; Domain/Business Expertise; “Unicorn” Data Scientist.] Glassdoor, Apr 2016: US salary $113,400; jobs: 2572. France: €43,500; jobs: 180
  84. 84. Data Science Manager/Director. [Diagram labels: Database, Coding Skills; Domain/Business Expertise; People Management Skills; Data Science Leader.]
  85. 85. Company Founder. [Diagram labels: Database, Coding Skills; Domain/Business Expertise; People Management Skills + Vision; Founder.]
  86. 86. Data Career Progression. [Diagram labels: BI/Data Analyst; Data Engineer; Data Scientist; Machine Learning Researcher; Data Science Manager/Director; Company Founder/CEO; Chief Data Officer; Chief Scientist.]
  87. 87. DATA SCIENCE JOB TRENDS (c) KDnuggets 2016 92
  88. 88. Shortage of Data Scientists? • McKinsey (2011): shortage by 2018 in US – 140-190,000 people with deep analytical skills – 1.5 M managers/analysts with the know-how to use the analysis of big data to make effective decisions. Source: www.mckinsey.com/mgi/publications/big_data/
  89. 89. Data Scientist – Sexiest Job of the 21st Century? • Thomas H. Davenport and D.J. Patil (Harvard Business Review, 2012)
  90. 90. “Data Scientist” - leading job trend. “Data Scientist” job has grown 1,700% from 2012 to 2016. Top 5 Tech Job Trends in 2016: Data Scientist, Devops, Puppet, PaaS, Hadoop? Source: Indeed.com/jobtrends
  91. 91. Attention to Detail: [Data Scientist] != “Data Scientist”. [Chart series: Data Scientist; “Data Scientist” = “data scientist”.] Source: Indeed.com/jobtrends
  92. 92. “Data Scientist” vs Statistician. [Chart series: “Data Scientist”, Statistician.] Source: Indeed.com job trends
  93. 93. Data Scientist jobs on KDnuggets. [Chart: % Data Scientist jobs on KDnuggets, 2010-2015, y-axis 0% to 40%.] Including Senior, Junior, Principal, Chief DS, …
  94. 94. LinkedIn 25 Hot Skills © KDnuggets 2016 99 2015 2014
  95. 95. Data Science Future 100
  96. 96. Big Data • Next Industrial Revolution • Data Science is the Engine of Big Data
  97. 97. Doing Old Things Better. Application areas – Direct marketing/Customer modeling – Recommendations – Fraud detection – Security/Intelligence – Healthcare – … • Competition will level companies
  98. 98. Big Data Enables New Things! • Google – first big success of big data • Social networks (Facebook, Twitter, LinkedIn, …) success depends on network size, i.e. big data • Big Data in Health-care – image analysis, diagnosis – Personalized medicine • Recommendations - Netflix streaming
  99. 99. New services, products, platforms • Image recognition – FB uses to decide what to show users • Face recognition - security • Location-based services – Tinder • Big Data to Power AI and Machine Learning – Imagine Google DeepMind, IBM Watson, Siri in 2020?
  100. 100. Gartner Hype Cycle, 2012. [Chart label: Big Data.]
  101. 101. Gartner Hype Cycle, 2013. [Chart label: Big Data.]
  102. 102. Gartner Hype Cycle, 2014. [Chart labels: Big Data, Data Science.] See http://diggdata.in/ which has 4 years of Gartner Hype Cycle
  103. 103. Gartner Hype Cycle, 2015. [Chart labels: Big Data, Citizen Data Science, Machine Learning.] www.kdnuggets.com/2015/08/gartner-2015-hype-cycle-big-data-is-out-machine-learning-is-in.html
  104. 104. “Citizen” Data Science © KDnuggets 2016 110 This is Bob, our new Citizen Data Scientist. He previously worked as a citizen dentist and a citizen pilot.
  105. 105. Golden Age of Data Science, Machine Learning • Amazing New Tools • Very Complex Algorithms are very easy to use • scikit-learn, iPython notebooks, etc • One-Click deployment of TensorFlow on AWS with GPU © KDnuggets 2016 111
  106. 106. Data Science Automated ? © KDnuggets 2016 112 Expert Human Ability Current Computer Ability
  107. 107. Data Science Automated ? © KDnuggets 2016 113 Expert Human Ability
  108. 108. Data Science Automated By 2025? © KDnuggets 2016 114 KDnuggets Poll in 2015: 51% of voters expect Data Science Automation to happen in 10 years or less - www.kdnuggets.com/2015/05/data-scientists-automated-2025.html
  109. 109. Data Science Automation © KDnuggets 2016 115 I remember when only a Deep Learning supercomputer could beat me in a Data Science competition
  110. 110. Data Science Automation KDnuggets: Software: Automated Data Science: • AutoDiscovery from ButlerScientifics • Automatic Business Modeler from Algolytics • Automatic Statistician project • DataRobot • DMWay • ForecastThis DSX • FeatureLab • Loom Systems, • machineJS: Automated machine learning • Quill from Narrative Science • SAP Predictive Analytics • Savvy from Yseop. • Skytree Machine Learning Software • Tree-based Pipeline Optimization Tool (TPOT) © KDnuggets 2016 116
  111. 111. Data Science Automation • New tools make Data Scientists more productive • Make data results more widely available • Automate lower-level Data Science tasks © KDnuggets 2016 117
  112. 112. “Soft” Data Science Skills Harder to Automate • Curiosity • Intuition • Business Knowledge • Selecting a good metric • Posing the right question • Presentation Skills Data Science – still a great profession © KDnuggets 2016 118
  113. 113. Questions? KDnuggets: Analytics, Big Data, Data Science • Subscribe to KDnuggets News email at www.KDnuggets.com/subscribe.html • Email to editor1@kdnuggets.com • Twitter: @kdnuggets • facebook.com/kdnuggets • LinkedIn group: KDnuggets

Editor's Notes

  • Churn: the best algorithms for predicting churn have a lift of 5-7, i.e. 5-7 times better than random.
    Behavioral advertising: 2-3% CTR, 10 times better than random
  • The future is bright for Big Data, but use caution when evaluating claims
