As presented at BigConf on 28 March 2014 in Silver Spring, MD
http://www.bigconf.io/schedule/index#charlie_greenbacker
=========================
Harvard Business Review called data scientist "the sexiest job of the 21st century." These days, data scientists face an onslaught of companies pitching products that promise to solve all their problems. Is there such a thing as a "silver bullet" for data science, and is it worth the hefty price tag?
This talk briefly discusses what data science is, argues that open source software is usually the right choice for data scientists, and examines some of the leading OSS tools for data science available today. Topics include statistical analysis, data mining, machine learning, natural language processing, and data visualization. Additional materials are provided on the presentation's companion website: oss4ds.com
Open Source Software for Data Scientists -- BigConf 2014
1. Open Source Software for Data Scientists
Charlie Greenbacker, Director of Data Science
28 Mar 2014
2. Altamira Technologies Corporation 2014
Agenda
■ What is a Data Scientist?
■ Why use Open Source Software?
■ Survey of Open Source Software Tools:
¤ Statistical Analysis
¤ Data Mining
¤ Machine Learning
¤ Natural Language Processing
¤ Social Network Analysis
¤ Data Visualization
3. About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable
photo: Columbia Pictures
18. Statistical Analysis
■ Name: R
■ Creator: Gentleman, Ihaka, et al.
■ License: GPL Version 2
■ Website: r-project.org
■ Source: cran.us.r-project.org/src/base/
■ Features:
¤ Language & environment for statistical computing & viz
¤ Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more…
¤ 5000+ packages available in CRAN repository
19. Data Mining
■ Name: Pandas
■ Creator: Wes McKinney, et al.
■ License: BSD 3-Clause License
■ Website: pandas.pydata.org
■ Source: github.com/pydata/pandas
■ Features:
¤ Data analysis workflow in Python
¤ DataFrame object for fast manipulation & indexing
¤ Tools for reading & writing data between formats
¤ Label-based slicing, indexing, and subsetting of data
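A minimal sketch of the features listed above, not taken from the talk itself. The column names (`sku`, `units`, `price`) and the values are hypothetical, invented only for illustration; the calls used (`DataFrame`, `.loc`, `to_csv`, `read_csv`, `to_json`) are standard pandas API.

```python
import io

import pandas as pd

# Hypothetical sample data, invented for this illustration.
df = pd.DataFrame(
    {
        "sku": ["a", "b", "c"],
        "units": [10, 3, 7],
        "price": [2.5, 4.0, 1.0],
    }
).set_index("sku")

# Label-based slicing: select rows by index label and columns by name.
subset = df.loc[["a", "c"], ["units"]]

# Subsetting with a boolean mask.
large = df[df["units"] > 5]

# Fast column-wise manipulation: derive a new column from two others.
df["revenue"] = df["units"] * df["price"]

# Moving data between formats: write CSV to an in-memory string,
# read it back, then serialize the result as JSON.
csv_text = df.to_csv()
restored = pd.read_csv(io.StringIO(csv_text), index_col="sku")
json_text = restored.to_json(orient="index")

print(subset)
print(large)
print(json_text)
```

The in-memory `io.StringIO` buffer stands in for a file on disk so the example runs without touching the filesystem; a real workflow would more likely pass file paths to `to_csv` and `read_csv`.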
20. Data Mining
■ Name: Impala
■ Creator: Cloudera
■ License: Apache License 2.0
■ Website: impala.io
■ Source: github.com/cloudera/impala
■ Features:
¤ MPP query engine implemented on Hadoop
¤ Low latency, high concurrency SQL & BI queries
¤ Same interfaces as Apache Hive, but ~24x faster
¤ Written in C++; does not use MapReduce
31. Final Thought…
Save your $$$ for:
■ People
¤ salaries, training, etc.
■ Resources
¤ hardware, AWS, etc.
■ Proprietary software
¤ if no viable OSS alternative exists
photo: Brett Weinstein (http://bit.ly/1dHXvqJ)