Big data and data science overview


  1. Colleen M. Farrelly
  2. • Oxford English Dictionary:
       ◦ “An all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications”
     • Defined by volume, variety, velocity
     • 2008 computer scientist predictions:
       ◦ Big Data will “transform the activities of companies, scientific researchers, medical practitioners, and our nation’s defense and intelligence operations”
     • According to the New York Times:
       ◦ Big data science “typically means applying the tools of artificial intelligence, like machine learning, to vast new troves of data beyond that captured in standard databases”
  3. • Wider (more variables per observation)
     • Longer (more observations)
     • Wider and longer
     • Complex subgroupings within wider or longer sets
     • Many correlations
     • Noisy
     • Missing data
  4. • Computational challenges of storage and statistical program memory
       ◦ R holds its working data in memory, so on a laptop the workspace is limited (for example, to about 2 GB) unless more RAM is added.
       ◦ Algorithm computing time grows according to scaling rules, many of which are faster than linear (polynomial or even exponential). Under quadratic scaling, for example, if 2 GB takes 4 minutes, then 4 GB takes 16 minutes…
     • Statistical challenges from data structure
       ◦ Wide data violates many statistical assumptions.
       ◦ Correlations among predictors also violate statistical assumptions and create problems for the underlying linear algebra calculations.
       ◦ Potential for lots of informative missing data that can’t be imputed using existing statistical methods.
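The point about correlated predictors and linear algebra can be illustrated with a minimal sketch. It assumes a model with exactly two predictors, where least squares has to invert a 2×2 correlation matrix whose determinant is 1 − r². As the correlation r approaches 1 that determinant approaches zero, and the variance inflation factor 1 / (1 − r²) grows without bound. The simulated data, the seed, and the function names are illustrative assumptions, not part of the slides.

```python
import math
import random


def pearson_r(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def variance_inflation(a, b):
    """Variance inflation factor for a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)


rng = random.Random(0)  # fixed seed (an assumption) so the sketch is repeatable
x1 = [rng.gauss(0, 1) for _ in range(2000)]
x2_independent = [rng.gauss(0, 1) for _ in range(2000)]
# Nearly a copy of x1: the kind of predictor pair the slide warns about.
x2_correlated = [x + 0.05 * rng.gauss(0, 1) for x in x1]

vif_independent = variance_inflation(x1, x2_independent)
vif_correlated = variance_inflation(x1, x2_correlated)
print(f"VIF, unrelated predictors:  {vif_independent:.2f}")
print(f"VIF, correlated predictors: {vif_correlated:.0f}")
```

A factor near 1 means the coefficient estimates are about as precise as the data allow; a factor in the hundreds means their variances are inflated by that much, which is one way the numerical trouble shows up in practice.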
  5. • More computing resources
       ◦ Expensive
       ◦ Cloud computing
       ◦ Does not solve statistical issues posed by big data
     • New statistical methods
       ◦ Rely on a new set of tools from computer science
       ◦ Work around limitations of existing multivariate data analysis methods
       ◦ Don’t always scale as big data grows
         ▪ Still have computational issues
         ▪ Need for larger and larger training sets for good performance
  6. • Hadoop
       ◦ Open-source software for storage and processing of big data across computer cores/clusters
       ◦ Compatible with existing statistical software
     • MapReduce
       ◦ Distributed computing strategy for big data processing and analyses
       ◦ Compute pieces of a problem in parallel and combine the partial answers for shorter compute times
     • SQL/NoSQL
       ◦ Database languages and systems (SQL for relational databases, NoSQL for non-relational stores) used for:
         ▪ Database construction/modifications
         ▪ Pulling pieces of data for further analyses/reporting
     • R
       ◦ Free open-source software with existing machine learning algorithms and a coding environment to create and test new machine learning algorithms
     • Simulations
       ◦ Use data structure and relationship rules to create a dataset with a prespecified structure
       ◦ Allow for testing and validation of new algorithms against datasets with known answers
       ◦ Useful for comparing existing algorithms with new algorithms
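Two of the ideas on this slide, MapReduce and simulation against a known answer, can be sketched together in a few lines. This is a toy illustration under stated assumptions, not how Hadoop itself is programmed: the slope and intercept, the chunk size, and every function name are hypothetical, and the thread pool only stands in for the worker nodes of a cluster. The map step reduces each chunk to a handful of sums, the reduce step adds those sums, and the combined totals give a least-squares line that can be checked against the values used to simulate the data.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical "known answer" built into the simulated dataset.
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, 1.0


def simulate(n, seed=0):
    """Simulate (x, y) pairs with y = intercept + slope * x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0, 1)) for x in xs]


def map_chunk(chunk):
    """Map step: summarize one chunk as (n, sum x, sum y, sum x^2, sum x*y)."""
    return (
        len(chunk),
        sum(x for x, _ in chunk),
        sum(y for _, y in chunk),
        sum(x * x for x, _ in chunk),
        sum(x * y for x, y in chunk),
    )


def combine(a, b):
    """Reduce step: add two partial summaries element by element."""
    return tuple(u + v for u, v in zip(a, b))


data = simulate(10_000)
chunks = [data[i:i + 1_000] for i in range(0, len(data), 1_000)]

# Each chunk is summarized independently, then the summaries are merged.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_chunk, chunks))
n, sx, sy, sxx, sxy = reduce(combine, partials)

# Closed-form simple linear regression from the combined sums.
slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
intercept = (sy - slope * sx) / n
print(f"estimated slope={slope:.3f}, intercept={intercept:.3f}")
```

Because the summaries are plain sums, the order in which chunks are combined does not matter, which is what makes this kind of calculation easy to distribute.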
  7. • Statistics
       ◦ Hypothesis testing (parametric and nonparametric) and experimental design
       ◦ Generalized linear models
       ◦ Longitudinal, time series, and survival models
       ◦ Bayesian methods
     • Mathematics
       ◦ Multivariable calculus
       ◦ Linear algebra
       ◦ Probability theory
       ◦ Optimization
       ◦ Graph theory/discrete math
       ◦ Real analysis/topology
     • Machine learning
       ◦ Technically, considered a branch of statistics
       ◦ Supervised, unsupervised, and semi-supervised models
       ◦ Serve to extend statistical models and relax assumptions on data
       ◦ Includes algorithms from topological data analysis and network analysis
  8. • A data scientist is a professional who blends several different areas of expertise to draw insights from disparate data sources (particularly big data) so that inferences can be made about specific problems/decisions within the field of application.
     • Data science is a blend of statistical, machine learning, computer science, mathematical, and domain knowledge used to leverage data for decision-making in that domain (business, medical, social media…).
  9. • Discuss problem with leadership to understand the problem and how results might be used.
       ◦ Providing a predictive algorithm that performs well but doesn’t provide insight into the problem might not be useful.
       ◦ There may be related items that leadership hasn’t considered, items that can enrich the project.
     • Define data that needs to be pulled.
       ◦ May exist in database.
       ◦ May need to find elsewhere.
     • Pull and clean data.
       ◦ Examine for errors or bias.
       ◦ Deal with missing data.
     • Perform analyses and interpret output.
       ◦ Can be supervised (fit to outcome) or unsupervised (exploratory).
       ◦ Typically involves visualization of important results.
     • Compile summary of actionable insights for leadership.
       ◦ Simplification
       ◦ Business value (no point in doing analysis if it can’t be implemented!)
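The "pull and clean data" step might look something like the following sketch. It uses Python's built-in `sqlite3` module with an in-memory database. The `visits` table, its columns, and its rows are made up for illustration. The code only audits missing values and separates complete cases; which strategy to apply afterwards is a judgment the slides leave open.

```python
import sqlite3

# Hypothetical table standing in for an existing company database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, age REAL, cost REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, 34.0, 120.0), (2, None, 80.0), (3, 51.0, None), (4, 47.0, 200.0)],
)

# Pull only the fields needed for the analysis.
cursor = conn.execute("SELECT patient_id, age, cost FROM visits")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()

# Audit missingness per column before deciding how to handle it.
missing_by_column = {
    name: sum(row[i] is None for row in rows)
    for i, name in enumerate(columns)
}
# Rows with no missing values at all.
complete_cases = [row for row in rows if all(value is not None for value in row)]

print(missing_by_column)
print(f"{len(complete_cases)} of {len(rows)} rows are complete")
```

Counting missing values first matters because, as an earlier slide notes, missingness can itself be informative, so silently dropping or imputing rows may bias the analysis.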
  10. • Mathematical/Statistical Background
        ◦ Graduate degree, typically in mathematics/statistics, computer science, or engineering
        ◦ Training in machine learning and algorithm design
        ◦ Experience with R and SAS statistical languages/programs
      • Computer Science Background
        ◦ Python/MATLAB/other high-level computing languages
        ◦ Hadoop/MapReduce concepts
        ◦ SQL or NoSQL coding for database extraction/management
        ◦ Experience with structured or unstructured data
        ◦ Data mining/algorithm design
      • Field of Application Expertise
        ◦ Intellectual curiosity
        ◦ Understanding of the industry of application (marketing, medical, finance…)
        ◦ Communication skills to relate findings to non-technical leaders
  11. • From a quick Indeed.com search:
        ◦ Allstate Insurance
        ◦ Sprint
        ◦ Twitter
        ◦ APS Healthcare
        ◦ XOR Security
        ◦ LinkedIn
        ◦ IBM
        ◦ Intel
        ◦ Roche Pharmaceuticals
        ◦ Amazon
        ◦ Capital One
  12. • According to NewVantage and others:
        ◦ 2016 revenue gained from data science is estimated at $130.1 billion.
        ◦ This is expected to grow to $203 billion by 2020.
      • Individual company results vary according to:
        ◦ Team talent and expertise
        ◦ Data collected (and quality of data)
        ◦ Competitor strengths in data science
      • Current and projected shortages of those with analytics talent will impact the market.
        ◦ Hubs of data science are emerging outside California: Boston, New York, Austin, Chicago, Jacksonville, Tampa, Charlotte, Atlanta…
        ◦ Across industries: healthcare, tech, finance, energy…
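As a rough check on the revenue figures quoted above, the implied compound annual growth rate can be computed directly. This is a back-of-the-envelope sketch that assumes smooth year-over-year growth between the two quoted figures. The `cagr` helper is a hypothetical name, not something from the slides or their sources.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the slide, in billions of US dollars.
rate = cagr(130.1, 203.0, 2020 - 2016)
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption the quoted numbers correspond to growth of a little under 12% per year.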

Editor's Notes

  • http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/
    Bryant, R., Katz, R. H., & Lazowska, E. D. (2008). Big-data computing: creating revolutionary breakthroughs in commerce, science and society.
    Lohr, S. (2012). How big data became so big. New York Times, 11.
    Cuzzocrea, A., Song, I. Y., & Davis, K. C. (2011, October). Analytics over large-scale multidimensional data: the big data revolution!. In Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP (pp. 101-104). ACM.
    Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.
    Brown, B., Chui, M., & Manyika, J. (2011). Are you ready for the era of ‘big data’. McKinsey Quarterly, 4, 24-35.
  • Heidema, A. G., Boer, J. M., Nagelkerke, N., Mariman, E. C., & Feskens, E. J. (2006). The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC genetics, 7(1), 23.
    Draper, N. R., Smith, H., & Pownell, E. (1966). Applied regression analysis (Vol. 3). New York: Wiley.
    Gopalkrishnan, V., Steier, D., Lewis, H., & Guszcza, J. (2012, August). Big data, big business: bridging the gap. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (pp. 7-11). ACM.
  • Bekkerman, R., Bilenko, M., & Langford, J. (Eds.). (2011). Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press.

    Christopher K. Riesbeck. From conceptual analyzer to Direct Memory Access Parsing: an overview., chapter 8. Ellis Horwood Limited, 1986.

    M. W. Berry. Large-scale sparse singular value computations. The International Journal of Supercomputer Applications, 6(1):13–49, Spring, 1992.

    Caporaso, J. G., Baumgartner Jr, W. A., Kim, H., Lu, Z., Johnson, H. L., Medvedeva, O., ... & Hunter, L. (2006). Concept Recognition, Information Retrieval, and Machine Learning in Genomics Question-Answering. In TREC.
    Madden, S. (2012). From databases to big data. IEEE Internet Computing, 16(3), 4-6.
    Agrawal, D., Das, S., & El Abbadi, A. (2011, March). Big data and cloud computing: current state and future opportunities. In Proceedings of the 14th International Conference on Extending Database Technology (pp. 530-533). ACM.
  • http://www.kdnuggets.com/2014/11/9-must-have-skills-data-scientist.html