2. What Is Data Science?
Extraction of knowledge from data (also known as
knowledge discovery and data mining, KDD).
Data science :=
Computer science (for data structures,
algorithms, visualization, big data support, general
programming) +
Statistics (for regressions and inference) +
Domain knowledge (for asking questions and
interpreting results).
4. Data Science and Other Disciplines: BI
Business Intelligence (BI) engineers traditionally build tools that others use to
analyze data; they do not analyze the data themselves. Data scientists both build
the tools and use them to analyze data. If you are a software engineer, you need
to learn statistical modeling and how to communicate results. You will need to
work with the datasets yourself and use them to make decisions.
5. Data Science and Other Disciplines: STATS
Statisticians are traditionally content to assume that all their data fits in main
memory at once. They have traditionally used existing math, or created new math,
to squeeze as much information as possible from a small number of observations
or features. Data scientists recognize the need to use and create math for
analyses in data-poor environments, but they also use and create software
engineering tools to handle very large datasets, and they recognize that some of
the models are the same in both cases. To be a data scientist, you need to learn
to deal with data that does not fit in memory, because it is no longer safe to
assume that it will.
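One way to work with data that does not fit in memory is to process it as a stream and keep only running totals. The sketch below is an illustration, not part of the original slides: it uses Welford's online algorithm to compute a mean and sample standard deviation in a single pass. The function names `streaming_mean_std` and `read_numbers` are hypothetical, and a generator stands in for a large file.

```python
import math
from typing import Iterable, Iterator, Tuple


def streaming_mean_std(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample standard deviation) in one pass.

    Welford's online update keeps only three numbers in memory,
    no matter how many values the iterable yields.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return count, mean, std


def read_numbers(lines: Iterable[str]) -> Iterator[float]:
    """Lazily parse one number per line; also works on an open file object."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield float(line)


if __name__ == "__main__":
    # Assumption: this generator stands in for a file too large to load at once.
    fake_file = (f"{i}\n" for i in range(1, 1_000_001))
    n, mean, std = streaming_mean_std(read_numbers(fake_file))
    print(n, mean, round(std, 2))
```

Because `read_numbers` is a generator, passing it an open file handle reads one line at a time, so memory use stays constant as the file grows.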
6. Data Science and Other Disciplines: DB
Database programmers and administrators bring useful skills to data science,
but they are traditionally focused on one data model: relational. Data science
also means handling graphs (nodes and edges, e.g., PageRank), images, video, and
text, and using SQL when appropriate. To be a data scientist, you need to deal
with unstructured data.
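To make the graph example concrete, here is a minimal sketch of PageRank by power iteration on a graph stored as adjacency lists. It is an illustration under assumptions, not a production implementation: the damping factor of 0.85 and the fixed iteration count are conventional defaults chosen here, and the function name `pagerank` is hypothetical.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank on a directed graph given as adjacency lists."""
    # Collect every node, including ones that only appear as link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # Split this node's rank evenly over its outgoing edges.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
    for node, score in sorted(pagerank(links).items()):
        print(node, round(score, 4))
```

The point of the example is the data model: nodes and edges held in a dictionary rather than rows in a relational table.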
7. Data Science and Other Disciplines: Visualization
Visualization experts and business analysts bring useful skills but are
traditionally not concerned with massive scale, such as hundreds or thousands of
machines. If you are a business analyst, you need to learn about algorithms and
their tradeoffs at large scale. With cloud computing and the right algorithm you
may get an answer, but it may cost more or less than it did 5 years ago. It is no
longer safe to throw your trust over the wall to some algorithm, or to staff who
run it for you. You will need to internalize the tradeoffs of choosing one model
over another yourself.
8. Data Science and Other Disciplines: ML
Machine learning is similar to data science, but it is only a small part of it.
Data science also covers getting data, cleaning it, exploring it, and building
interactive visualizations and data products for yourself and for others to use
(e.g., data-driven language translators and spellcheckers), in addition to doing ML.
9. Topics
● Numeric data analysis
● Signal processing
● Text data analysis (information/document/text retrieval, natural language
processing)
● Statistical inference
● Databases (information integration)
● Complex network analysis
● Data visualization
10. Define the Question of Study
● Descriptive: Describe a set of data.
● Exploratory: Find new relationships.
● Inferential: Use a small data sample to describe a bigger population. Based
on statistics.
● Predictive: Use data on some objects to predict values for another object.
● Causal: Does one variable affect another variable? Based on statistics.
Correlation != Causation.
● Mechanistic: Exactly how does one variable affect another variable? Based
on deep domain knowledge.
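The difference between a descriptive and an inferential question can be sketched in a few lines. In this illustration, `describe` summarizes only the sample in hand, while `mean_confidence_interval` uses the same sample to estimate a range for the population mean. The function names are hypothetical, and the interval uses a normal approximation as a simplifying assumption; for small samples a Student's t interval would be the more usual choice.

```python
import math
import statistics
from typing import Dict, Sequence, Tuple


def describe(sample: Sequence[float]) -> Dict[str, float]:
    """Descriptive: summarize only the data in hand."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "stdev": statistics.stdev(sample),  # sample standard deviation
    }


def mean_confidence_interval(sample: Sequence[float],
                             confidence: float = 0.95) -> Tuple[float, float]:
    """Inferential: estimate a range for the population mean.

    Assumption: normal approximation (z-interval), which is only rough
    for small samples.
    """
    summary = describe(sample)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    margin = z * summary["stdev"] / math.sqrt(summary["n"])
    return summary["mean"] - margin, summary["mean"] + margin


if __name__ == "__main__":
    sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    print(describe(sample))
    print(mean_confidence_interval(sample))
```

Neither function says anything about cause: a narrow interval around a mean, or a strong correlation, still leaves the causal and mechanistic questions open.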
11. Get and Clean Data
1. Define the ideal data set
Determine what data you can access
2. Obtain the data
Raw data vs. processed data: always start from the raw data, process it once, and
record all processing steps.
3. Clean the data
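The steps above can be sketched as a small cleaning function that never modifies the raw input and returns a log of what it did. This is an illustration under assumptions: the column names, the sample CSV text in `RAW`, and the cleaning rules (trim whitespace, drop rows with missing required fields, drop exact duplicates) are all hypothetical and would depend on the real dataset.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple


def clean_rows(rows: Iterable[Dict[str, str]],
               required: List[str]) -> Tuple[List[Dict[str, str]], List[str]]:
    """Return (cleaned rows, log of processing steps); the input is not modified."""
    cleaned = []
    seen = set()
    dropped_missing = 0
    dropped_duplicate = 0
    for row in rows:
        # Build a new dict so the raw row stays untouched.
        row = {key: (value or "").strip() for key, value in row.items()}
        if any(not row.get(field) for field in required):
            dropped_missing += 1
            continue
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            dropped_duplicate += 1
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    log = [
        "stripped whitespace from all fields",
        f"dropped {dropped_missing} rows missing any of {required}",
        f"dropped {dropped_duplicate} exact duplicate rows",
    ]
    return cleaned, log


# Hypothetical raw data; in practice this would be read from the raw file.
RAW = "id,city,temp\n1, Boston ,12.5\n2,,9.0\n1,Boston,12.5\n3,Salem,11.0\n"

rows, log = clean_rows(csv.DictReader(io.StringIO(RAW)),
                       required=["id", "city", "temp"])
for step in log:
    print(step)
```

Writing the log next to the cleaned output gives a record of every processing step, so the cleaned data can always be rebuilt from the raw data.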
12. Explore Data
● Exploratory data analysis
● Model data and predict
● Interpret results
● Challenge results
● Present results to the data sponsor
13. Create Reproducible Code
● Don't do things by hand; teach the computer! Anything done by hand must be
precisely documented
● Don't use interactive GUI tools (no history!)
● Use version control software (Git/GitHub)
● Avoid intermediate files, unless they are hard to build (in which case cache
them)
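The caching advice in the last bullet can be sketched as a helper that rebuilds an intermediate result only when no cached copy exists. This is a minimal illustration, assuming the result is JSON-serializable; `cached_json` and `expensive_summary` are hypothetical names, and the sketch does not handle invalidating a stale cache.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_json(path: Path, build: Callable[[], object]) -> object:
    """Return build()'s result, reusing the file at `path` if it already exists."""
    path = Path(path)
    if path.exists():
        with path.open("r", encoding="utf-8") as handle:
            return json.load(handle)
    result = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        json.dump(result, handle)
    return result


def expensive_summary() -> dict:
    """Hypothetical stand-in for a slow download or computation."""
    return {"total": sum(range(1000))}


if __name__ == "__main__":
    cache_dir = Path(tempfile.mkdtemp())
    first = cached_json(cache_dir / "summary.json", expensive_summary)   # computed
    second = cached_json(cache_dir / "summary.json", expensive_summary)  # read back
    print(first == second)
```

Deleting the cache file forces a rebuild, so the cached copy never becomes the only source of the result.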
14. Report Structure
● Project report
○ Abstract: A brief description of the project.
○ Introduction.
○ Methods.
○ Results.
○ Conclusion.
● Code
○ Well-commented scripts that can be executed without any command line parameters or
interaction.
15. Suggested Directory Structure
● data – for the input data, if needed
● cache – for the previously downloaded data
● results – for numerical results
● code – for the Python script(s)
● doc – for the report and figures
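The layout above can be created with a short script instead of by hand. This is a sketch: the directory names come from the list above, while `create_project_layout` and the temporary project root used in the demo are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List

# Directory names taken from the suggested structure.
PROJECT_DIRS = ["data", "cache", "results", "code", "doc"]


def create_project_layout(root: Path) -> List[Path]:
    """Create the suggested directories under `root`; safe to run repeatedly."""
    created = []
    for name in PROJECT_DIRS:
        directory = Path(root) / name
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for the real project root.
    project_root = Path(tempfile.mkdtemp()) / "project"
    for directory in create_project_layout(project_root):
        print(directory)
```

Because `exist_ok=True` makes the call idempotent, the script can run at the start of every analysis without any command line parameters or interaction.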