
Data science pitfalls

Challenges and pitfalls in applied data science, with historical and personal examples.

Published in: Technology

  1. Data Science Pitfalls. Pedro Tabacof, co-founder of Datart.
  2. Data science is easy, right? 1. Get data: df = pd.read_csv(...) 2. Clean data: df = df.fillna(0) 3. Extract features: df = pd.get_dummies(df, columns=["dayofweek"]) 4. Train model: clf = sklearn.ensemble.RandomForestClassifier().fit(X, y) 5. Test model: sklearn.metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]) 6. Deploy: clf.predict(X_new)
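The six steps on the slide can be run end to end. Below is a minimal self-contained sketch, with synthetic data standing in for the CSV (the column names are invented for illustration). Note that ROC AUC needs probability scores from `predict_proba`, not hard labels from `predict`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 1. Get data (synthetic stand-in for pd.read_csv; columns are invented)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount": rng.normal(100, 20, 500),
    "dayofweek": rng.integers(0, 7, 500),
})
df["label"] = (df["amount"] + rng.normal(0, 10, 500) > 100).astype(int)

# 2. Clean data
df = df.fillna(0)

# 3. Extract features: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["dayofweek"])

X, y = df.drop(columns="label"), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 4. Train model
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# 5. Test model: AUC needs probability scores, not hard class predictions
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# 6. "Deploy": score new rows with the same feature pipeline
print(round(auc, 2))
```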
  3. Hundreds of frameworks and APIs.
  4. That is the easy part. In this talk we will see the real challenges.
  5. Correlation vs. causation. Correlation is easy: predictive models. Causation is hard: physical models, experiments, or strong assumptions. Causation in practice is determined by randomized controlled experiments ("A/B testing").
  6. Ronald Fisher. Father of modern statistics. Founder of population genetics. Thought smoking did not cause cancer.
  7. James Lind. First randomized controlled trial, in 1747. Showed citrus fruits cured scurvy, a deadly disease at sea. Lemon juice became a staple in the British navy.
  8. Robert Falcon Scott. Great British explorer and hero. Disastrous expedition to the South Pole in 1912. Crew showed symptoms of scurvy.
  9. Frequentist hypothesis testing. 1. z-test: z.test(x, y = NULL, …) 2. Student's t-test: t.test(x, y, ...) 3. ANOVA: fit <- aov(y ~ A, data=df) 4. Chi-squared test: chisq.test(df) 5. F-test: var.test(x, ...) 6. Paired, pooled, interactions, etc.
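The R calls on the slide have direct Python counterparts in `scipy.stats`. A small sketch (sample sizes, effect size, and the contingency table are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(0.0, 1.0, 200)  # control group
b = rng.normal(0.3, 1.0, 200)  # treatment group with a real 0.3 shift

# Student's t-test (Welch's variant, which drops the equal-variance assumption)
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

# Chi-squared test of independence on a 2x2 contingency table
table = np.array([[30, 70], [45, 55]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(p_value, p, dof)
```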
  10. Power lines cause leukemia. 25-year study, 800 ailments.
  11. What is a p-value?
  12. What is a p-value? The probability of obtaining a result equal to or more extreme than what was actually observed, assuming the null hypothesis is true.
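One way to internalize that definition: when the null hypothesis is true, p-values are uniformly distributed, so "significant" results at the 5% level appear about 5% of the time by chance alone (which is exactly the trap of testing 800 ailments). A small simulation under an assumed setup of two identical normal populations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 2000
false_positives = 0
for _ in range(n_experiments):
    # Both samples come from the SAME distribution: the null is true
    a = rng.normal(0, 1, 50)
    b = rng.normal(0, 1, 50)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

rate = false_positives / n_experiments
print(rate)  # close to 0.05 by construction
```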
  13. Statistical significance does not mean practical significance. Check the effect size.
  14. What is a 95% confidence interval?
  15. What is a 95% confidence interval? A range generated by a procedure that contains the true value 95% of the time.
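The "95%" belongs to the procedure, not to any single interval. A simulation of that coverage property, with an assumed true mean of 10 and a t-interval for the sample mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, n, trials, covered = 10.0, 40, 2000, 0
for _ in range(trials):
    sample = rng.normal(true_mean, 2.0, n)
    # 95% t-interval for the mean of this one sample
    lo, hi = stats.t.interval(0.95, n - 1, loc=sample.mean(),
                              scale=stats.sem(sample))
    if lo <= true_mean <= hi:
        covered += 1

coverage = covered / trials
print(coverage)  # close to 0.95: the procedure covers, not one interval
```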
  16. Biases and fallacies. Selection bias. Confirmation bias. Survivorship bias. Winner's curse. Hawthorne effect. Shy Tory effect. Availability heuristic. Gambler's fallacy. Regression to the mean. Optimism bias. Texas sharpshooter fallacy. And many more.
  17. Kidney transplant. Kidney donors: 1/3 the risk of kidney failure compared to the general population, but 8x the risk when compared to the correct reference group (similarly healthy people).
  18. Roosevelt vs. Landon. 1936 US presidential election. The Literary Digest poll: 10M questionnaires, 2.3M returned, predicted a Landon victory. Gallup poll: random sample of about 50K respondents, predicted a Roosevelt victory.
  19. Abraham Wald. Statistician doing research in WWII. Experts thought they should reinforce where the planes were most hit. Wald thought the opposite. Classical case of survivorship bias.
  20. Baselines. 1. Classification: most frequent class. 2. Regression: mean value. 3. Time series: last value. 4. Simple models: linear regression, decision trees, k-NN, etc. 5. Standard models: CNNs for computer vision, LDA for topic models, ARIMA for time series, LSTMs for speech, etc. 6. XGBoost.
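The first three baselines are one-liners. A sketch using scikit-learn's dummy estimators on made-up data (the features are ignored by the dummies, which is the point):

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # ignored by the dummies
y_class = rng.choice([0, 0, 0, 1], size=100)  # imbalanced binary labels
y_reg = rng.normal(50, 5, size=100)

# 1. Classification baseline: always predict the most frequent class
clf = DummyClassifier(strategy="most_frequent").fit(X, y_class)

# 2. Regression baseline: always predict the mean value
reg = DummyRegressor(strategy="mean").fit(X, y_reg)

# 3. Time-series baseline: just repeat the last observed value
naive_forecast = y_reg[-1]

print(clf.score(X, y_class), reg.predict(X[:1])[0], naive_forecast)
```

Any model that cannot beat these is not learning anything useful.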
  21. Time series. Easy to be fooled by zoomed-out graphs. Naive baseline: repeat the last value.
  22. Anomaly detection. 99% regular points, 1% anomalies. Trivial to get 99% accuracy. What about the AUC?
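The accuracy trap is easy to demonstrate: a "detector" that never flags anything scores ~99% accuracy on 1% anomalies, while its AUC exposes it as useless. A sketch on simulated labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # roughly 1% anomalies

# A "detector" that never flags anything
y_pred = np.zeros_like(y_true)
scores = np.zeros(len(y_true))  # one constant score for every point

acc = accuracy_score(y_true, y_pred)  # ~0.99: looks impressive
auc = roc_auc_score(y_true, scores)   # 0.5: no discrimination at all
print(acc, auc)
```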
  23. Overfitting. Learning the noise instead of the signal. Hyperparameter tuning on the test set. Leakage. Testing methodology: stratified, hierarchical, temporal, etc.
  24. Google Flu Trends.
  25. AUC: 100%. Buyers -> personal information -> X. Visitors -> no personal information -> NaN -> mean(X). Perfect discrimination between X and its mean value.
  26. Prostate cancer. Dataset says which patients had prostate surgery. Completely useless as a predictor in practice, perfect in the competition. Obvious leakage.
  27. Temporal evaluation. Train on the present (current issues, the training data), test on the future (new issues). Not only for time series.
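The train-on-the-past, test-on-the-future split can be sketched with scikit-learn's `TimeSeriesSplit`; a random split would let the model peek at the future. The data below is an invented stand-in for rows already ordered by time:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Assumed setup: 100 rows already ordered by time (e.g. creation date)
df = pd.DataFrame({
    "t": pd.date_range("2020-01-01", periods=100),
    "x": np.arange(100),
})

# TimeSeriesSplit always trains on the past and tests on strictly later rows
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(df):
    # Every training row precedes every test row
    print(train_idx.max() < test_idx.min(), len(train_idx), len(test_idx))
```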
  28. Soft sensor wizard. Wizard for predictive model building, used by chemical engineers in the wild. Next -> Back -> Next -> Back -> ... "RNG optimization".
  29. Visualization and interpretability. See your data. Challenge your models.
  30. Anscombe's quartet. Same mean (x and y). Same standard deviation (x and y). Same correlation. Same regression line.
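This is easy to verify numerically. A sketch with two of Anscombe's four datasets (dataset I, a plain linear cloud, and dataset IV, a vertical line plus one outlier), using the published values:

```python
import numpy as np

# Anscombe's quartet: dataset I and the pathological dataset IV
x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

for x, y in [(x1, y1), (x4, y4)]:
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    print(round(x.mean(), 1), round(y.mean(), 2),
          round(r, 2), round(slope, 2), round(intercept, 2))
# The summary statistics agree to two decimal places;
# only a plot reveals how different the two datasets are
```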
  31. Simpson's paradox. Success rates for two kidney-stone treatments:

                     Treatment A      Treatment B
       Small stones  93% (81/87)      87% (234/270)
       Large stones  73% (192/263)    69% (55/80)
       Both          78% (273/350)    83% (289/350)

      Treatment A wins within each subgroup, yet B appears to win overall.
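The reversal can be reproduced directly from the numbers on the slide:

```python
import pandas as pd

# The kidney-stone numbers from the slide: successes out of patients treated
df = pd.DataFrame([
    ("small", "A", 81, 87), ("small", "B", 234, 270),
    ("large", "A", 192, 263), ("large", "B", 55, 80),
], columns=["stones", "treatment", "success", "total"])

# Within EACH subgroup, treatment A has the higher success rate
per_group = df.assign(rate=df["success"] / df["total"])

# Aggregated over subgroups, treatment B appears to win
agg = df.groupby("treatment")[["success", "total"]].sum()
agg["rate"] = agg["success"] / agg["total"]
print(per_group)
print(agg)
```

The flip happens because treatment A was given mostly to the harder (large-stone) cases, so the subgroup sizes confound the aggregate.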
  32. Simpson's paradox.
  33. LIME: interpretability.
  34. Pneumonia screening. Rich Caruana's work at Microsoft. State-of-the-art models and simple baselines. A linear model showed that patients with asthma would be sent home.
  35. Criminal recidivism prediction. Same arrest (drug possession). Left: one non-violent prior offense, rated high risk. Right: one violent prior offense, rated low risk. One of them was arrested 3 times afterwards, the other not at all.
  36. Deployment. Covariate shift. Technical debt. Misuse of predictions. Interaction with users. Experimental validation.
  37. Netflix Prize. $1 million for the best predictive model. Winning solution: ensemble of hundreds of models. What went into production: not the winning solution.
  38. Hidden technical debt.
  39. Calibration. Does a 70% chance of positive mean I am right 70% of the time?
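Calibration can be checked by binning predicted probabilities and comparing each bin's average prediction with its observed positive rate. A sketch with scikit-learn's `calibration_curve`, on simulated scores that are perfectly calibrated by construction:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
# Assumed setup: predicted scores that ARE the true positive probabilities,
# so the model is perfectly calibrated by construction
p = rng.uniform(0, 1, 20_000)
y = (rng.random(20_000) < p).astype(int)

# Observed fraction of positives per predicted-probability bin
frac_pos, mean_pred = calibration_curve(y, p, n_bins=10)
# For a calibrated model, frac_pos tracks mean_pred: 70% really means 70%
max_gap = np.abs(frac_pos - mean_pred).max()
print(max_gap)
```

A real model's curve will typically deviate from the diagonal; techniques such as Platt scaling or isotonic regression can correct it.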
  40. Online learning. Nobody used my online learning algorithm for parameter tuning. It worked in theory and in simulations, and everyone liked the idea. But no one was comfortable with an algorithm second-guessing human judgement.
  41. Historical anecdotes. Those who don't know history are doomed to repeat it.
  42. Vulcan. A planet predicted from Mercury's orbit and Newtonian physics, by the same methodology that led to the discovery of Neptune. Many actual "sightings" of Vulcan.
  43. Ignaz Semmelweis. Obstetricians did not wash their hands; their mortality rate was 3x higher than the midwives'. Semmelweis showed washing hands greatly reduces mortality. Ignored by the establishment: "Doctors' hands are clean".
  44. George Stigler. Nobel prize in economics. He and other U. of Chicago economists found a great arbitrage opportunity: the ton of wheat. The British ton is not the same as the American ton.
  45. LTCM. Long-Term Capital Management. Two Nobel prize winners on the board. Sophisticated models, high leverage. Lost $4.6 billion in four months.
  46. Sally Clark. Two of her babies died of SIDS. SIDS risk: 1 in 8,500. "1 in 72M chance of two SIDS deaths": the square of the individual risk, wrongly assuming independence. Found guilty of murder.
  47. Data science is hard. Machine learning goes way beyond training predictive models. Statistics goes way beyond p-values and frequentist hypothesis tests. A data scientist must understand the data, models, assumptions, production environment, objectives, and business. What can be automated by frameworks, tools, and APIs is the easy part. The hard part is delivering actual value.
  48. Thank you! Questions? ptabacof@datart.com.br