This presentation introduces Data Science, the Data Scientist role and its characteristics, and how the Python ecosystem provides great tools for the Data Science process (Obtain, Scrub, Explore, Model, Interpret).
To illustrate, an attached IPython Notebook ( http://bit.ly/python4datascience_nb ) walks through the full process for a corporate network analysis, using Pandas, Matplotlib, Scikit-learn, NumPy and SciPy.
8. WHAT IS A DATA SCIENTIST?
http://www.datasciencecentral.com/profiles/blogs/are-you-a-data-scientist
A Data Scientist is someone with deliberate dual personality who can first
build a curious business case defined with a telescopic vision and can then dive
deep with microscopic lens to sift through DATA to reach the goal while
defining and executing all the intermittent tasks.
9. WHAT IS A DATA SCIENTIST?
Data scientists explore and transform data in novel ways to
create and publish new features and combine data from diverse
sources to create new value. Data scientists make visualizations
with researchers, engineers, web developers, and designers to
expose raw, intermediate, and refined data early and often.
Applied researchers solve the heavy problems that data
scientists uncover and that stand in the way of delivering value.
These problems take intense effort and require novel methods
from statistics and machine learning.
[Agile Data Science, O’Reilly, 2014]
19. INQUIRE
1. Which communities are more popular?
2. Is the engagement of users in corporate communities increasing?
3. How are post publishing times distributed over the day?
4. What is the percentage of interactions (likes and comments)?
5. How are likes distributed across users?
6. Is there a relationship between publishing hour and number of interactions?
7. Which communities are more engaging (greater avg. interactions per post)?
8. What are the most relevant words in the posts?
9. How to group posts about similar subjects?
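As a rough illustration of how questions 3 and 7 above might be turned into code, the sketch below uses only the Python standard library on a tiny, made-up list of posts. The field names (`community`, `published`, `likes`, `comments`) and the sample values are assumptions for illustration, not the notebook's actual schema. The attached notebook does this kind of work with Pandas.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical sample posts; the real data set and its columns are not shown in the slides.
posts = [
    {"community": "dev", "published": "2014-05-05 09:15", "likes": 3, "comments": 1},
    {"community": "dev", "published": "2014-05-05 14:40", "likes": 0, "comments": 2},
    {"community": "hr", "published": "2014-05-06 09:05", "likes": 5, "comments": 4},
    {"community": "hr", "published": "2014-05-06 17:30", "likes": 1, "comments": 0},
]


def posts_per_hour(posts):
    """Question 3: count posts by hour of day (0-23)."""
    hours = (datetime.strptime(p["published"], "%Y-%m-%d %H:%M").hour for p in posts)
    return Counter(hours)


def avg_interactions_by_community(posts):
    """Question 7: mean of likes + comments per post, for each community."""
    totals = defaultdict(lambda: [0, 0])  # community -> [interaction sum, post count]
    for p in posts:
        entry = totals[p["community"]]
        entry[0] += p["likes"] + p["comments"]
        entry[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


hourly = posts_per_hour(posts)
engagement = avg_interactions_by_community(posts)
```

With a real export, the same two aggregations would typically be a group-by on the hour and on the community.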
21. OBTAIN
• Download data from another location (e.g., a web page or server)
• Query data from a database (e.g., MySQL or Oracle)
• Extract data from an API (e.g., Twitter, Facebook)
• Extract data from another file (e.g., an HTML file or spreadsheet)
• Generate data yourself (e.g., reading sensors or taking surveys)
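Three of the sources listed above (a file, a database, an API response) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: an in-memory SQLite database stands in for MySQL or Oracle, a JSON string stands in for an API response, and the `posts` table, its columns, and all sample values are hypothetical.

```python
import csv
import io
import json
import sqlite3


def from_file(text):
    """Parse CSV text (a stand-in for a spreadsheet export) into dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def from_database(conn):
    """Query rows from an open database connection."""
    cur = conn.execute("SELECT id, title FROM posts ORDER BY id")
    return [{"id": row[0], "title": row[1]} for row in cur]


def from_api_payload(payload):
    """Decode a JSON body, such as the text of an API response."""
    return json.loads(payload)["posts"]


# Hypothetical inputs, built in memory so the sketch runs without any network or server.
csv_text = "id,title\n1,Hello\n2,World\n"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER, title TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", [(1, "Hello"), (2, "World")])
api_payload = '{"posts": [{"id": 3, "title": "From API"}]}'

file_rows = from_file(csv_text)
db_rows = from_database(conn)
api_rows = from_api_payload(api_payload)
```

Note that the CSV reader returns every value as a string, while the database and JSON sources keep integer ids. Type differences like this are usually handled in the Scrub step.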
34. PYTHON IN HADOOP
• Hadoop Streaming: runs MapReduce jobs from any executable script, including Python!
Example using AWS Elastic MapReduce:
http://workingsweng.com.br/2014/04/clusterizando-raios-com-hadoop-e-k-means-em-map-reduce/
• Other supporting options for Python in Hadoop: Hadoopy, Pig UDFs in Jython
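To make the Hadoop Streaming idea concrete, here is a minimal word-count sketch, not the job from the linked example. Streaming passes records as text lines, conventionally `key<TAB>value`, and sorts the mapper output by key before the reducer sees it. The function names and the `run_locally` helper are assumptions for illustration; the helper only simulates the map, sort, and reduce phases in a single process.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per word; records must arrive sorted by key."""
    pairs = (rec.rstrip("\n").split("\t", 1) for rec in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in one process, for testing only."""
    return list(reducer(sorted(mapper(lines))))


counts = run_locally(["Python in Hadoop", "python streaming"])
```

In an actual streaming job, the mapper and reducer would each be wrapped in a small script that reads `sys.stdin` and prints its records, since Hadoop runs them as separate executables.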
35. THE NEXT-GEN DATA SCIENTIST
The best minds of my generation are thinking about how to
make people click ads... That sucks. [Jeff Hammerbacher]
Next-gen data scientists don’t try to impress with
complicated algorithms and models that don’t work.
They spend a lot more time trying to get data into shape
than anyone cares to admit—maybe up to 90% of their
time. Finally, they don’t find religion in tools, methods, or
academic departments. They are versatile and
interdisciplinary.
[Doing Data Science, O’Reilly, 2014]
36. DATA SCIENCE COURSES
• Introduction to Data Science (Univ. of Washington)
• Data Science specialization (Johns Hopkins)
• Intro to Hadoop and MapReduce (Cloudera)
• Machine Learning (Stanford)
• Statistical Learning (Stanford)
http://workingsweng.com.br/2014/04/cursos-mooc-e-especializacoes-em-data-science/