A data scientist's study plan

I am writing a book to help anyone train to become a data scientist. This is a work in progress and is likely to be updated frequently.

Published in: Data & Analytics

  1. Having fun with stats, maths and games in life! A Study Plan to become a practicing data scientist! Raman Kannan, Adjunct, MoT and CS&E Department, Tandon School of Engineering, New York University, 1/1/2016
  2. Outline for having fun with stats, maths and games in life. A Study Plan to become a practicing data scientist! Raman Kannan, Adjunct, MoT and CS&E Departments, Tandon School of Engineering, NYU
  3. Contents
     Introduction
     Basics: Khan Academy
     why now, perfect storm
     advances for computing hardware, networking, tools for communication
     introduction to data
     sample/population
     iid
     bias
     Relationship
     univariate regression
     multivariate
     logistic regression
     Linear Algebra
     matrices: identity, square, rectangular, symmetric
     operations: transpose, inversion, decomposition
     roots, positive definiteness, eigenvalues
     Cholesky, principal components, singular value decomposition
     Applications
     analytics
     descriptive
     predictive
     prescriptive
     learning and intelligence need big data 3V
     dimensionality reduction
     unsupervised learning
     clustering
     supervised
     classification
     measures of classification: TP, TN, FP, FN, accuracy, precision, sensitivity
     semisupervised, hybrid
     network, hidden, feedback, selfcorrecting
  4. deep learning, Boltzmann Machine, Markov Chain
     Information Retrieval: Entropy, Gain
  5. Introduction
     Paraphrasing Einstein: the problem of the "qualified labor" shortfall cannot be solved if we continue with the same mentality that created it. We need to be disruptive. There is no need for a university or college degree, or for any formal structure. Mathematics and analytics are a universal, basic language, and anyone (returning veterans, dropouts) can become proficient if they are willing to be disruptive like Gates or Zuckerberg. With that hope, this document attempts to lay out a path to becoming a practicing data scientist.
     It could at first be daunting. But don't be intimidated! Even though I respect Malcolm Gladwell, I have to encourage you to ignore his 10,000-hour rule. Anyone with passion, determination and discipline can become a data scientist in far fewer than 10,000 hours, maybe in six months: approximately 4 hours per day * 5 days per week * 4 weeks per month * 6 months = 480 hours.
     All this stuff is basic and mostly intuitive, and it lurks in the subconscious realm of the cognitive apparatus, even that of monkeys, dogs, leopards and, of course, human beings. Otherwise, we could not catch a ball, a frisbee or prey. I assure you none of this involves string theory, Riemann surfaces, Hilbert spaces or the Tychonoff embedding theorem. We already do so much of this subconsciously; we just have to transfer it to the conscious realm. Let us go!
     Basics: Khan Academy
     why now, perfect storm
     advances for computing hardware, networking, tools for communication
     introduction to data
       operational filter > transactional vs master data
       domain filter > what values can it hold >
         categorical/qualitative (nominal, ordinal)
         numerical/quantitative (interval, ratio)
     Statistics Refresher
  6. sample/population
     iid
     bias
     randomness
     outlier, anomaly, Bonferroni test
     sample means, convergence to population mean
       CLT: Central Limit Theorem
       LLN: Law of Large Numbers
     Benford's Law, small digits
     central tendencies, measures of, moments
       mean (median, mode), variance (standard deviation), skew, kurtosis
     comovement, relationship
       correlation, covariance
     distributions
       normal (Gaussian), Poisson, uniform
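The "sample means, convergence to population mean" item on slide 6 can be made concrete with a small simulation. This is only a sketch using Python's standard library: the synthetic population, the seed and the helper name `sample_mean` are illustrative assumptions, not part of the study plan itself.

```python
import random
import statistics


def sample_mean(population, n, rng):
    """Mean of n draws (with replacement) from the population."""
    return statistics.fmean(rng.choice(population) for _ in range(n))


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 values spread uniformly over [0, 100].
population = [rng.uniform(0, 100) for _ in range(10_000)]
population_mean = statistics.fmean(population)

# Law of Large Numbers: the gap tends to shrink as the sample grows.
for n in (10, 100, 1_000, 10_000):
    gap = abs(sample_mean(population, n, rng) - population_mean)
    print(f"n={n:>6}  |sample mean - population mean| = {gap:.3f}")
```

Repeating the draw many times for a fixed n and plotting the resulting sample means would illustrate the Central Limit Theorem: their histogram looks roughly normal even though the population here is uniform.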
  7. probability
     basic properties: certainty, uncertainty, impossibility, knowable, unknowables, known unknowables, unknown unknowables
     counting/frequentist
     discrete, conditional, joint probabilities
     Bayesian probability
     continuous probability
     Relationship
     regression
       parametric
       nonparametric
     independent vs dependent variables
       dependent: also known as response
       independent: also known as regressors, predictors
  8. univariate regression
     linear relationship y = mx + c
     quality of the relationship, goodness of fit
       p-value, null hypothesis, R-squared
     assumptions
       autocorrelation
       multicollinearity
       heteroskedasticity (nonconstant variance)
     tests of normality
     tests of randomness
     transformation
     mixtures
     standard normal
     lognormal
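The linear relationship y = mx + c and the R-squared goodness-of-fit measure from slide 8 can be sketched in a few lines of plain Python. The function name `fit_line` and the data points are made up for illustration; the fit is ordinary least squares, which is one common way to estimate m and c.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c; returns (m, c, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and co-movement of x and y, around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    # R-squared: share of the variance in y explained by the line.
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return m, c, r_squared


# Made-up observations lying roughly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
m, c, r2 = fit_line(xs, ys)
print(f"slope m = {m:.2f}, intercept c = {c:.2f}, R-squared = {r2:.4f}")
```

An R-squared close to 1 says the line explains most of the variation in y. It does not check the assumptions listed on the slide (autocorrelation, heteroskedasticity and so on), which need their own tests.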
  9. multivariate
     logistic regression
       odds ratio
     Linear Algebra
     matrices: identity, square, rectangular, symmetric
     operations: transpose, inversion, decomposition
     roots, positive definiteness, eigenvalues
     Cholesky, principal components, singular value decomposition
     Applications
     analytics
       descriptive
       predictive
       prescriptive
     learning and intelligence need big data 3V
     dimensionality reduction
     unsupervised learning
  10. clustering
     supervised
     classification
     measures of classification: TP, TN, FP, FN, accuracy, precision, sensitivity
     semisupervised, hybrid
     network, hidden, feedback, selfcorrecting
     deep learning, Boltzmann Machine, Markov Chain
     Information Retrieval: Entropy, Gain
     References (to be continued)
       KhanAcademy.com
       http://tutors4you.com/probabilitytutorial.htm
       http://www.mathportal.org/linear-algebra/vectors/dot-product.php
       http://www.stat.berkeley.edu/~brill/Stat153/tstests.pdf
       http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
       http://singhal.info/ieee2001.pdf
       Introduction to Information Retrieval
       http://www.cs.columbia.edu/~gravano/Qual/Papers/singhal.pdf
       http://times.cs.uiuc.edu/course/410/note/mle.pdf
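The measures of classification named on slide 10 (TP, TN, FP, FN, accuracy, precision, sensitivity) all follow from four counts. The sketch below assumes binary labels with 1 as the positive class; the function name `classification_measures` and the label vectors are hypothetical and serve only to show the arithmetic.

```python
def classification_measures(actual, predicted, positive=1):
    """Count TP/TN/FP/FN and derive accuracy, precision and sensitivity.

    Assumes at least one actual positive and one predicted positive,
    so that neither ratio divides by zero.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "accuracy": (tp + tn) / len(pairs),   # share of all calls that are right
        "precision": tp / (tp + fp),          # share of predicted positives that are right
        "sensitivity": tp / (tp + fn),        # share of actual positives that are found
    }


# Made-up labels: 4 actual positives, 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
for name, value in classification_measures(actual, predicted).items():
    print(f"{name:>11}: {value}")
```

Sensitivity is also called recall or the true positive rate. Accuracy alone can mislead when one class is rare, which is why precision and sensitivity are usually reported next to it.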
  11. Acknowledgements
     To all those who have taught me everything I have learned in life, starting with my mother.
