MORE THAN MINING
“The sexiest job in the next 10 years will be
— Hal Varian,
While the concept of data science has been around for
decades, the role of data scientist has become a
sought-after, in-demand career, leading to the rise of a new
generation of data scientists.
This phenomenon in technology development reflects
the staggering growth rate of “big data.”
Technology innovation and the World Wide Web provide for
the growth of new types of data — such as user-generated
content — and tools that can be used to interpret it.
Social media platforms such as Facebook (the largest social
network and valued at $52 billion) depend on data science to
create innovative, interactive features that encourage users
to get interested and stay that way — all so that we know it's
But what does the term ‘Data Science’ really mean?
What is data science?
Data science can be broken down into four essential parts.
Mining data: collecting and formatting
Statistics: information analysis
Representation or visualization in the form of presentations,
infographics, graphs or charts
Implications of the data, application of the data, interaction
using the data and predictions formed from studying it
Defining a data scientist
A good data scientist understands the importance of:
Their eyes search for information on the web.
Their voice asks questions about what they hope to accomplish at the end of
the project, setting
Extraction
They take the information they want and organize it using formulas. They
organize the information in order to form educated, insightful conclusions
using these statistical and mathematical methods:
Vectorized operations
Time Series Analysis
Expansion & Application
The appropriate data flows out of the person in the form of keywords,
Facebook “Likes” and other statistics.
Creating new theories and predictions based upon the data.
Ask questions to further expound upon the data beyond the reaches of
hard numbers or facts.
Apply the information in a useful, innovative manner to applications
whose success depends on data science.
Immediately process terabytes of data that flow in to prevent
pile-up and missed opportunities.
For example, statistics regarding holiday shopping trends are
imperative around the holiday season. If the statistics are
processed and the conclusions are drawn too late, the season has
passed and the information can no longer be utilized to its full potential.
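A minimal Python sketch of why timely processing matters, using a very simple form of time series analysis. The weekly sales figures, the 3-week window and the 20% threshold are all invented for illustration; none of them come from a real data set or from the source.

```python
# Hypothetical weekly sales leading into a holiday season (invented numbers).
weekly_sales = [100, 110, 105, 120, 180, 260, 240, 130]


def moving_average(values, window):
    """Trailing moving average, one value per full window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def first_surge(trend, threshold=0.20):
    """Index of the first trend value that exceeds the previous
    trend value by more than `threshold`, or None if there is none."""
    for i in range(1, len(trend)):
        if trend[i] > trend[i - 1] * (1 + threshold):
            return i
    return None


trend = moving_average(weekly_sales, window=3)
surge_at = first_surge(trend)
print("smoothed trend:", [round(t, 1) for t in trend])
print("surge first visible at trend index:", surge_at)
```

If this calculation only runs after the final week, the surge is spotted once sales have already fallen back, which is the missed opportunity the holiday shopping example describes.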
for a data scientist
A successful data scientist must have a combination of skills that opens up
possibilities both for that individual and their team. Visualization processes are
often disjointed since each person is typically assigned to a specific part of the
project. The designer depends on the information architect. The information
architect depends on stats from the statistician, and so on. A true data scientist
should be skilled in multiple areas.
Hacking and Computer Science
Knowing how to take advantage of computers and the internet to create
data-mining formulas
Mathematics, Statistics, Data Mining
Pulling important statistics and coherently organizing them using
mathematic prowess and computer formulas
Creativity & Insight
Knowing what statistics are important and how to leverage them
Dangers of data science
Statistics can be displayed in a misleading manner
Leading the pollee:
What type of question are you more likely to answer “yes” to?
Should Americans be taxed so others can take advantage of welfare and avoid working?
Should taxes support the government’s aid to those who are unable to find work?
Facts that are left out
Including only the starting
and ending points
of data makes the change
seem more drastic.
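A small Python sketch of this effect. The monthly values are hypothetical and were chosen so that the series swings up and down; the point is only to contrast what the two endpoints suggest with what the full series shows.

```python
# Hypothetical monthly values (invented); they swing rather than climb steadily.
monthly = [50, 72, 48, 75, 51, 74, 49, 76]


def percent_change(first, last):
    """Percentage change from `first` to `last`."""
    return (last - first) / first * 100


def count_declines(values):
    """Number of steps where a value is lower than the one before it."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur < prev)


endpoint_change = percent_change(monthly[0], monthly[-1])
declines = count_declines(monthly)

print(f"endpoints only: {endpoint_change:.0f}% change")
print(f"full series: {declines} declines across {len(monthly) - 1} steps")
```

Reporting only the first and last values suggests one large rise, while the full series shows repeated drops along the way.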
A collage of carefully
9 of 10
combined to induce a
Selection bias occurs when an unrepresentative sample
is taken for a survey or study and the results are then
advertised to consumers as if they represented the total
population. A toothpaste brand's advertising is one
example of how ‘studies’ can often be
weighted in a company's favor.
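Selection bias can be sketched in a few lines of Python. The population below is entirely hypothetical: the group sizes and the "recommends" counts are invented so that surveying only existing customers gives a very different answer from surveying everyone.

```python
# Hypothetical population of 1,000 people (all counts invented).
population = (
    [{"customer": True, "recommends": True}] * 90
    + [{"customer": True, "recommends": False}] * 10
    + [{"customer": False, "recommends": True}] * 210
    + [{"customer": False, "recommends": False}] * 690
)


def share_recommending(people):
    """Fraction of the given people who recommend the product."""
    return sum(p["recommends"] for p in people) / len(people)


# Unrepresentative sample: only people who are already customers.
biased_sample = [p for p in population if p["customer"]]

biased_share = share_recommending(biased_sample)
true_share = share_recommending(population)

print(f"customers only: {biased_share:.0%} recommend")
print(f"whole population: {true_share:.0%} recommend")
```

Advertising the customers-only figure as if it described everyone is the misrepresentation the text warns about.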
Ironically, facts and stats can be used to
paint a very inaccurate — and damaging —
picture of a business, organization or
Facts about data science
1790: The first big data collection project in
history was the U.S. Census, which
started in 1790.
5MB: When hard drives were first
invented, a 5 megabyte server
took up roughly the space of a
luxury refrigerator. Today, a
32 gigabyte micro-SD card
measures around 5/8 x 3/8 inch
and weighs about 0.5 grams.
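To put the two capacities side by side, here is a short Python calculation. It assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes); the source does not say which convention it uses.

```python
# Assumption: decimal (SI) units, as commonly used by storage vendors.
MB = 10**6
GB = 10**9

early_drive_bytes = 5 * MB   # the 5 megabyte unit from the text
micro_sd_bytes = 32 * GB     # the 32 gigabyte micro-SD card from the text

# How many of the early 5 MB units fit into one 32 GB card.
ratio = micro_sd_bytes // early_drive_bytes
print(f"One 32 GB card holds as much as {ratio:,} of the 5 MB units")
```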
When collecting mass quantities of data, some remedial human input is needed.
This gave birth to crowdsourcing; the best example is
Amazon's Mechanical Turk.
Modern collection of big data is made possible by cloud computing,
the spreading of data across several physical resources that can be accessed
remotely, rather than concentrated at one location.
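One common way to spread data across several machines is to hash each record's key and use the hash to pick a machine. The sketch below illustrates that general idea only; it is not how any particular cloud provider works, and the node names and record keys are hypothetical.

```python
import hashlib

# Hypothetical storage nodes; in practice these would be remote machines.
NODES = ["node-a", "node-b", "node-c"]


def assign_node(key, nodes=NODES):
    """Pick a node for `key` from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes=NODES):
    """Group keys by the node each one is assigned to."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[assign_node(key, nodes)].append(key)
    return placement


# Hypothetical record keys.
records = [f"user-{i}" for i in range(12)]
placement = distribute(records)
for node, keys in placement.items():
    print(node, len(keys), "records")
```

Because the hash is stable, the same key always maps to the same node, so a record can be found again without a central index.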
“The computing and processing of
data is literally 100 to 1,000 times
faster and cheaper than before.”
— Scott Yara, Greenplum