Data is now marketed and labelled as "the new oil". To this end, data is being extracted from all aspects of our everyday lives in the hope that analyzing these large volumes will yield useful insights and knowledge. This talk provides a broad overview of the Data Science process, from "tapping" into online data sources, through analysis, to data visualization. For each stage of the process we outline the tasks to follow, along with practical challenges and strategies to overcome them.
The Data Science Process: From Mining Raw Data to Story Visualization
1. 06/03/2019 | Demetris Trihinas | trihinas.d@unic.ac.cy
Tutorial | MSc Research Seminars | Department of Computer Science
The Data Science Process
From Mining Raw Data to Story
Visualization
Demetris Trihinas
Department of Computer Science
University of Nicosia
trihinas.d@unic.ac.cy
2.
Full-Time Faculty Member
University of Nicosia
“Designing and developing scalable and self-adaptive tools for data
management, exploration and visualization”
@dtrihinas
http://dtrihinas.info
https://ailab.unic.ac.cy/
https://www.slideshare.net/DemetrisTrihinas
3.
“The quest for knowledge used to begin with grand theories (a hypothesis). Now it begins with massive amounts of data. Welcome to the Petabyte Age.”
Wired, Jun 2008
4.
State | Unemployment
------------------------------
NY | 1.72
CA | 2.43
DC | 3.54
…
[Diagram: Today’s Talk]
Raw bits n’ bytes -> Structure -> Knowledge -> Story
Data -> Information -> Understanding -> Wisdom
Population (initial data) -> Data model -> Algorithmic model -> Visual model -> Cause/Effect (why?)
6.
Data Collection
• In the process of data democratization… the world’s data have never been more open than today.
• The world’s data sources (e.g., social media, news outlets) often permit (restricted) access to their data.
• Web Scraping: methodically scrape website content
• Application Programming Interfaces (APIs)
• “ASK for permission and GET access to resource(s)”
• So… turn the “tap” of a data source (coding task) and store the data somewhere (data warehousing) for analysis.
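The "turn the tap and store the data" step could look roughly like the sketch below. It is only an illustration under stated assumptions: the endpoint URL, the Bearer-token header and the `raw_items` table are hypothetical and do not describe any particular provider's API.

```python
import json
import sqlite3
import urllib.request


def fetch_items(url, token):
    """GET a JSON list from `url`, presenting the access token (assumed Bearer scheme)."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def store_items(conn, items):
    """Append raw items to a local table, skipping ids already stored; return the row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_items (id TEXT PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT OR IGNORE INTO raw_items (id, payload) VALUES (?, ?)",
        [(str(item["id"]), json.dumps(item)) for item in items],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM raw_items").fetchone()[0]
```

A collection run would then be something like `store_items(sqlite3.connect("warehouse.db"), fetch_items("https://api.example.com/items", token))`, where the URL is a placeholder.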
8.
Data Collection via API
[Diagram: Data Collection, the tweet sink, Data Warehouse]
• Request: “GET access to tweets”
• Response: “You can have 1% for free with this access token. For > 1% pay up!”
• Request: “GET tweets with token from @dtrihinas or with #data_mining”
• Request: “Also, ask for #cyprus and #cyprus”
11.
Data Overview
• Trawling through a couple of articles manually is easy.
• But… what about thousands of news articles from
multiple news outlets?
Humans are slow, Computers are fast!
• Get the data, store it and then mine it!
13.
Data Models
• The representation chosen to store and extract data.
• For example, db schemas, spreadsheets, objects, etc.
Y = f(X, parameters, random noise)
We understand the world!
14.
Big Data refers to datasets that are too large or
complex for traditional data-processing application
software to adequately deal with.
20.
Batch Data
• Assumes that the data is available when and if we want it
(e.g., reading and parsing data from a file or database)
• The processing engine knows the dataset in advance and
controls the input rate of the data
[Diagram: the Processing Engine fetches data from the Database and counts events by color: <red, 3>, <yellow, 1>, <blue, 2>, <green, 2>]
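A minimal sketch of the batch case, assuming the whole dataset is already sitting in memory after being fetched. The event list is made up to reproduce the counts shown on the slide.

```python
from collections import Counter


def count_by_color(events):
    """Batch job: the full dataset is known up front, so one pass gives the final answer."""
    return Counter(event["color"] for event in events)


# Hypothetical events standing in for rows fetched from a file or database.
events = [
    {"color": color}
    for color in ["red", "blue", "red", "yellow", "green", "blue", "red", "green"]
]
counts = count_by_color(events)
```

Because the engine controls when and how fast it reads, it can simply iterate until the input is exhausted.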
21.
Data Streams
• Unbounded Data -> the volume of the data is overwhelming
• Conceptually infinite sequence of data items
• Push Model -> data arrives at high velocity and different rates
• Potentially multiple sources pushing data to the processing engine at different rates (data distribution changes over time)
[Diagram: sources src1, src2 and src3 push data to the Processing Engine; plot of input rate over time t]
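With unbounded data there is no "end of input", so results have to be produced incrementally. The sketch below keeps running counts and emits a snapshot every few items. A Python generator stands in for the stream here, which is an assumption made for brevity: it does not model push semantics, multiple sources or varying input rates.

```python
from collections import Counter
from itertools import islice


def color_source():
    """Hypothetical, conceptually infinite source of events."""
    while True:
        yield from ("red", "blue", "red")


def running_counts(stream, window=3):
    """Update counts item by item and yield a snapshot after every `window` items."""
    counts = Counter()
    for seen, color in enumerate(stream, start=1):
        counts[color] += 1
        if seen % window == 0:
            yield dict(counts)


# Take only the first two snapshots; the stream itself never finishes.
snapshots = list(islice(running_counts(color_source(), window=3), 2))
```

The consumer decides how many snapshots to read, since waiting for a final answer would block forever.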
22.
US Presidential Elections 2016
[Figure: Per Minute Emotions During First Debate; happiness and anger for Clinton and Trump; 200K tweets/min]
https://qz.com/810092
29.
Data Warehousing
• Data warehousing provides data storage and
management capabilities.
• Memory and storage have never been cheaper.
1MB today is 10 times cheaper than 5 years ago!
31.
Computing Power
• Cloud Computing - Abundance of computing power.
• Rent instead of buying expensive compute power (this also removes side costs, e.g., cooling and physical security).
33.
Marketing Mantra
Collect whatever data you can, whenever and wherever
possible.
The expectation is that collected data will have value either for the purpose collected or for a purpose not yet envisioned.
35.
Data Mining
• Data is useless unless you can convert it to structured
information and ultimately into knowledge.
• So… data mining provides you with the intelligence to
convert data into knowledge.
38.
What is NOT Data Mining
• Any question you can ask and get an –immediate and
concrete– answer from a database.
• How many sofa models does IKEA currently have in stock?
• How many sofas did IKEA sell in Sweden last month?
• Which IKEA customers bought a sofa worth more than 500
euros this year?
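Questions of this kind map directly onto a database query, which is why they are lookups rather than mining. A small sketch, assuming a hypothetical `sales` table with made-up rows (the year filter is left out for brevity):

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT, price_eur REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("anna", "sofa", 650.0),
        ("john", "sofa", 320.0),
        ("mat", "lamp", 40.0),
        ("anna", "sofa", 900.0),
    ],
)


def sofa_buyers_over(conn, min_price_eur):
    """Return customers who bought a sofa above the given price: a direct, exact answer."""
    rows = conn.execute(
        "SELECT DISTINCT customer FROM sales "
        "WHERE item = 'sofa' AND price_eur > ? ORDER BY customer",
        (min_price_eur,),
    )
    return [customer for (customer,) in rows]
```

The answer is fully determined by the stored rows; no model is learned and nothing is inferred.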
39.
Algorithmic Models
• Attempt to understand and represent reality through a particular lens (e.g., mathematical, biological).
• Artificial construction where all extraneous detail has been removed or abstracted.
We don’t understand the world (but try to!)
[Diagram: Y <- Model (black box) <- X, with refinement]
State | Unemployment
------------------------------
NY | 1.72
CA | 2.43
DC | 3.54
…
41.
Classification
• Develop models (or functions) that assign items to classes based on their attributes.
• “Give a label to your data!”
• Should the IKEA sofa model S be added to this month’s
discount items (yes, no)?
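One way to "give a label to your data" is a nearest-neighbour rule: a new item receives the label of the most similar labelled example. The sketch below is a toy illustration, not the method used in the talk. The features (units in stock, weekly sales) and all numbers are hypothetical.

```python
from math import dist

# Hypothetical labelled examples: (units_in_stock, weekly_sales) -> discount?
training = [
    ((120, 5), "yes"),
    ((100, 8), "yes"),
    ((20, 30), "no"),
    ((15, 25), "no"),
]


def predict_label(training, features):
    """1-nearest-neighbour: return the label of the closest training example."""
    _, label = min(training, key=lambda example: dist(example[0], features))
    return label
```

For instance, `predict_label(training, (110, 6))` looks at the overstocked, slow-selling examples nearby and answers accordingly. Real features would normally be scaled before distances are compared.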
44.
Clustering
• Develop models that group data together based on their similarity to each other and their dissimilarity to data in other groups.
• Group IKEA customers based on how much disposable
income they have, or how often they tend to shop at a
particular IKEA branch.
• Similar to classification but with unknown classes.
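A compact sketch of clustering with unknown classes, using a one-dimensional k-means as the example technique (the talk does not name a specific algorithm). The income values and the starting centroids are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Alternate between assigning values to the nearest centroid and re-averaging."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly disposable income of customers, in euros.
incomes = [500, 520, 480, 1500, 1550, 1600, 3000, 3100]
centers, groups = kmeans_1d(incomes, centroids=[500, 1500, 3000])
```

No labels are supplied; the groups emerge from the distances alone, and choosing the number of groups is left to the analyst.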
48.
Pattern Discovery
• One of the most basic techniques in data mining is learning
to recognize patterns in the data.
• This is usually a recognition of some aberration in your data
happening at regular intervals, or an ebb and flow of a
certain variable over time.
• For example, sales of a certain product may spike just before the holidays, or warmer weather may drive more people to your website.
50.
Association
• Association is related to tracking patterns, but is more
specific to dependently linked attributes.
• A model is developed to look for specific events or attributes that are highly correlated with another event or attribute.
• When your customers buy a specific item, they also
often buy a second, related item.
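The "bought together" idea is commonly quantified with the confidence of a rule A -> B, i.e. the share of baskets containing A that also contain B. A small sketch with hypothetical baskets:

```python
from collections import Counter
from itertools import combinations


def pair_confidence(baskets):
    """Return confidence(A -> B) = count(A and B) / count(A) for every co-occurring pair."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    confidence = {}
    for (a, b), together in pair_counts.items():
        confidence[(a, b)] = together / item_counts[a]
        confidence[(b, a)] = together / item_counts[b]
    return confidence


# Hypothetical shopping baskets.
baskets = [["sofa", "cushion"], ["sofa", "cushion", "lamp"], ["sofa"], ["lamp"]]
rules = pair_confidence(baskets)
```

Here every basket with a cushion also has a sofa, while only two of the three sofa baskets include a cushion, so the rule is stronger in one direction than the other.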
54.
Correlation
• Correlation is a statistical technique that tells us how strongly pairs of variables are related.
• But… correlation does not tell us the why and how
behind the relationship.
• So… correlation just says that a relationship exists.
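As a concrete instance, the Pearson coefficient is one common measure of how strongly two variables move together, from -1 (opposite) through 0 (no linear relationship) to +1 (together). A minimal sketch; the choice of Pearson is an assumption, since the slide does not name a specific coefficient.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)
```

The value says only that a relationship exists and how strong it is; nothing in the formula distinguishes cause from effect, since swapping `xs` and `ys` gives the same number.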
56.
Causation
• Causation denotes that any change in the value of one
variable will cause a change in the value of another
variable.
• This means that one variable makes the other happen.
57.
Exercise and Calories
• When a person exercises, the number of calories burned increases every minute.
• The former (exercise) is causing the latter (calories
burned) to happen.
58.
Ice-Cream and Homicides in New York
• A study in the 1990s claimed that ice-cream sales are the cause of homicides in New York.
• As the sales of ice-cream rise and fall, so does the number of homicides -> correlation.
• But… does the consumption of ice-cream actually
cause the death of people in NY?
https://www.nytimes.com/2009/06/19/nyregion/19murder.html
59.
Correlation Does NOT Imply Causation
• The two things are, yes, correlated.
• But this does NOT mean one causes the other.
Correlation is what we fall back on when we can’t see under the covers.
So the less information we have, the more we are forced to rely on observed correlations.
60.
Confidence Intervals
• How many football games do US citizens go to?
• To get an -exact- answer (100% correct), you must ask everyone in the US (>320M people) -> Not practical!
• Use a random sample, meaning ask (many) fewer people -> but we won’t be 100% correct.
61.
Confidence Intervals
• What we try to achieve: get an interval that we are confident contains the actual answer.
“I am 95% confident that the average number of football games people in the U.S. go to lies between 10 and 12”
• So basically, CIs describe the level of uncertainty associated with a sample estimate.
62.
Random Sample Selection
• Random… means random!
• You cannot just select 1000 people from one city; the sample won’t represent the whole US.
• You cannot just send FB messages to 1000 random people; you will get a representation of US FB users, and of course not all US citizens use FB.
• So… constructing a random sample is actually hard!
63.
Confidence Intervals
• Random sample: 1000 US citizens
• Avg is 11 games and SD is 5 games.
• Let’s say we want a 95% confidence interval.
[Figure: distribution centered at 11 with the 95% region marked]
With some statistics we get an interval of ±1 game for the 95% CI.
We are 95% confident that the average US citizen watches between 10 and 12 games a year.
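The "some statistics" step can be sketched with the usual normal approximation, margin = z × SD / √n, which is an assumption about the method rather than something the slide states. With the numbers above the margin comes out at roughly ±0.31 games, so the ±1 quoted on the slide is a generous rounding that still contains the estimate.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(mean, sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a sample mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin


low, high = mean_confidence_interval(mean=11, sd=5, n=1000)
```

Asking for more confidence widens the interval, while a larger sample narrows it, at the rate of 1/√n.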
70.
Why Visualize Your Results?
Visualizations make large volumes of data easier to interpret because the human eye can immediately focus on the main information.
76.
Data Science Process
[Diagram: Data Collection -> (raw data) -> Data Warehousing -> (struct info) -> Data Preprocessing -> (preprocessed info) -> Data Mining -> (insights) -> Data Visualization -> (story)]
77.
Data Preprocessing
• Data mining, especially on big data, is an expensive process in both compute and time.
• Data Preprocessing can significantly increase performance if performed before mining.
• Data Cleaning
• Data Reduction
• Data Transformation
Preprocessing can take around 60% of your effort, but it is totally worth it!
79.
Data Cleaning
• You would assume that data stored in a database is ready for analysis, but… “dirty data”.
• Removing duplicate, erroneous or NA data.
• Statistically imputing missing data.

id | name | age | score
------------------------------
1000 | Anna | 42 | 84.7
1001 | John | fifty | 89.5
age MUST be a number

id | name | age | score
------------------------------
1000 | Anna | 42 | 84.7
1001 | John | 50 | 89.5
1002 | Mat | 29 | NA
Mat was sick on test day but is a C-average student, so let’s assume he would have scored a 72.0
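The cleaning steps can be sketched as follows. Two things here are assumptions for illustration: the duplicate John row is added to show de-duplication, and the missing score is filled with the mean of the known scores, which is a different (purely statistical) choice from the domain-knowledge guess of 72.0 on the slide.

```python
from statistics import mean


def clean_rows(rows):
    """Drop rows with non-numeric ages and duplicate ids, then mean-impute missing scores."""
    seen, cleaned = set(), []
    for row in rows:
        try:
            age = int(row["age"])
        except (TypeError, ValueError):
            continue  # erroneous value such as "fifty"
        if row["id"] in seen:
            continue  # duplicate record
        seen.add(row["id"])
        cleaned.append({**row, "age": age})
    known = [row["score"] for row in cleaned if row["score"] is not None]
    fill = round(mean(known), 1) if known else None
    return [{**row, "score": fill if row["score"] is None else row["score"]} for row in cleaned]


# Rows based on the slide, plus hypothetical duplicate entries for John.
rows = [
    {"id": 1000, "name": "Anna", "age": "42", "score": 84.7},
    {"id": 1001, "name": "John", "age": "fifty", "score": 89.5},
    {"id": 1001, "name": "John", "age": "50", "score": 89.5},
    {"id": 1001, "name": "John", "age": "50", "score": 89.5},
    {"id": 1002, "name": "Mat", "age": "29", "score": None},
]
cleaned = clean_rows(rows)
```

Whether to repair, drop or impute a value is a judgment call, and each choice shifts the later analysis, so it is worth recording which rule was applied.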
80.
Data Transformation
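• Reshape, sort and combine data to suitable format(s) for analysis.

id | name | age | score
------------------------------
1000 | Anna | 42 | 84.7
1001 | John | 50 | 89.7
1002 | Mat | 29 | 72.0

id | name | Eats Breakfast
------------------------------
1000 | Anna | Yes
1001 | John | yes
1002 | Mat | no

Combined and sorted by score:
id | name | age | score | Breakfast
------------------------------
1001 | John | 50 | 90 | 1
1000 | Anna | 42 | 85 | 1
1002 | Mat | 29 | 72 | 0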
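The transformation on this slide (combine two tables, encode the breakfast answer as 1/0, round the score, sort by score) can be sketched in a few lines. The dictionary field names are assumptions chosen for the example.

```python
def transform(scores, breakfast):
    """Join the two tables on id, encode yes/no as 1/0, round scores, sort by score."""
    eats = {row["id"]: row["eats_breakfast"] for row in breakfast}
    combined = [
        {
            "id": row["id"],
            "name": row["name"],
            "age": row["age"],
            "score": round(row["score"]),
            "breakfast": 1 if eats[row["id"]].strip().lower() == "yes" else 0,
        }
        for row in scores
    ]
    return sorted(combined, key=lambda row: row["score"], reverse=True)


scores = [
    {"id": 1000, "name": "Anna", "age": 42, "score": 84.7},
    {"id": 1001, "name": "John", "age": 50, "score": 89.7},
    {"id": 1002, "name": "Mat", "age": 29, "score": 72.0},
]
breakfast = [
    {"id": 1000, "eats_breakfast": "Yes"},
    {"id": 1001, "eats_breakfast": "yes"},
    {"id": 1002, "eats_breakfast": "no"},
]
table = transform(scores, breakfast)
```

Lower-casing before the comparison is what makes "Yes" and "yes" map to the same value.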
81.
Data Reduction
• Filter out data that is not needed for the analysis, to consume fewer resources and less time.
• Analysis will be performed on US citizens so remove others.
• Use only a sample of the data to get an approximate, but quick, answer.
• Create random sample of 1K rows instead of 1M rows.
• Reduce the dimensionality of the problem.
• The field age is not relevant to analysis.
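The three reduction steps can be sketched together: filter rows, take a random sample, and drop fields that the analysis does not need. The `country` field and the default values are hypothetical, and the fixed seed is there only to make the sample reproducible.

```python
import random


def reduce_rows(rows, country="US", sample_size=1000, drop_fields=("age",), seed=42):
    """Keep rows for one country, sample at most `sample_size` of them, drop unused fields."""
    kept = [row for row in rows if row["country"] == country]
    if len(kept) > sample_size:
        kept = random.Random(seed).sample(kept, sample_size)
    return [
        {key: value for key, value in row.items() if key not in drop_fields}
        for row in kept
    ]
```

Filtering before sampling matters here: sampling first would spend part of the sample on rows that are discarded anyway.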