A slide-based conversation with a group of Swedish CIOs in March 2018. We discussed what data collection implies for those not directly involved in it, and for organisations, talent hunters, and ultimately ecosystems.
3. AGENDA
1. What are “Big Data”?
2. Data and Data Science
3. Machine learning at scale
4. Still to come
4. WHAT ARE BIG DATA?
• Big Data is a broad term used to describe data sets that are large, complex, and cannot be
addressed by traditional IT methodologies and applications (Davenport, 2013)
• New technologies—both hardware and software—have had to be designed to manage the
volume
6. DIGITAL TRACE DATA
Taxonomy of Digital Data
Understanding different data types is crucial to correctly address problematic areas relating to the use and collection of digital trace data.
Data that we leave behind
Content Data
- User’s name and address
- Substantial and personal: can be identified/linked to a person
- Explicitly shared or traced through content shared on social media like Facebook
Metadata
- User’s IP address, time of login (data about data)
- Strength is in scale, as companies can use it to recognise user patterns
- Potentially problematic if it reveals things that we don’t want to reveal; for example, the presence of a mobile device at a protest might reveal the identities of protesters
Entrusted Data: content we post on a medium not controlled by us (e.g. Facebook). We don’t control what firms do with our traces
Incidental Data: data about us shared by others (e.g. tagged photos). We neither influence nor control our data traces
Service Data: information we provide to be able to use a service
Disclosed Data: content that we post online, but on a medium that we control (e.g. blogs), limiting our data traces
Behavioural Data: unintentionally shared; captured by services from our devices (e.g. time spent on a site)
Derived Data: data inferred about us from other data (e.g. our credit profiles built by firms using personal data)
7. DIGITAL TRACES
• Make existing services more efficient
• Create new services
• Access (or create?) new markets
8. “The loan amounts users are initially presented with currently
tend to be either £111 or £265, although I have also achieved
figures of £350 and £361. In my informal survey, those using
Apple products (a Safari browser, or say an iPhone or an
iPad) seemed to be most consistently offered £265. Although
tests with some obscure browsers suggest that it is likely that
it is less that you are ‘uprated’ by using Apple products, than
you are ‘down rated’ by using less niche browsers like Firefox
and Internet explorer.” (Deville 2013)
“The firm has found that people who
immediately shove the slider up to the
maximum amount on offer, currently £400
for 30 days for a first-time applicant for a
personal loan, are more likely than others
to default.” (Pollock 2012)
10. STRUCTURED VS UNSTRUCTURED DATA
• Structured: clean, organised, in a database format. Has relational properties and can be
divided into fields (e.g. what you have been working with in SQL)
• Thought to be 5-10% of all data
• Semi-structured: unstructured data that has some organisational properties that make it
easier to query, but not enough to be considered structured (e.g. your CSV files)
• Also around 5-10% of data
• Unstructured data: no structure, no clear relational properties (e.g. images, multimedia,
business documents)
• Around 80% of all data
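The three categories above can be sketched with one piece of information stored three ways. This is a minimal illustration only: the customer record, the table name and the city list are hypothetical, and JSON stands in for the semi-structured case (the slide's own example is CSV).

```python
import json
import sqlite3

# Structured: a fixed schema with typed fields, queryable with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Anna', 'Stockholm')")
structured_city = conn.execute(
    "SELECT city FROM customers WHERE id = 1"
).fetchone()[0]
conn.close()

# Semi-structured: self-describing keys, but no enforced schema;
# records may nest or omit fields.
record = json.loads('{"id": 1, "name": "Anna", "address": {"city": "Stockholm"}}')
semi_structured_city = record["address"]["city"]

# Unstructured: free text; the city has to be found by searching the content.
note = "Met Anna at the Stockholm office to discuss the new data platform."
known_cities = ["Gothenburg", "Stockholm", "Malmo"]  # hypothetical lookup list
unstructured_city = next((c for c in known_cities if c in note), None)

print(structured_city, semi_structured_city, unstructured_city)
```

The further down the list, the more work is needed before the same question ("which city?") can be answered.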
11. AGENDA
1. What are “Big Data”?
2. Data and Data Science
3. Machine learning at scale
4. Still to come
18. “TRAINING” AN ALGO
“A computer program is said to learn from
experience (E) with some class of tasks (T) and
a performance measure (P) if its performance at
tasks in T as measured by P improves with E”
[Diagram elements: Training data, Feature extraction, ML algorithm, Model; Test data, Model (learnt during training phase), Predictions]
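The training and prediction flow on this slide can be sketched end to end. This is a toy illustration under stated assumptions, not the method used in the talk: the messages, the `spam`/`ham` labels, the keyword list and the nearest-centroid "algorithm" are all hypothetical choices made to keep the example small.

```python
def extract_features(text):
    """Feature extraction: reduce a raw message to two numbers."""
    words = text.lower().split()
    spam_words = {"free", "win", "prize"}  # hypothetical keyword list
    return [len(words), sum(w in spam_words for w in words)]

def train(samples, labels):
    """ML algorithm: learn one mean feature vector (centroid) per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(model, x):
    """Use the learnt model: pick the class whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

training_data = [
    ("win a free prize now", "spam"),
    ("free prize win win", "spam"),
    ("meeting moved to monday morning", "ham"),
    ("please review the attached budget report", "ham"),
]
test_data = ["claim your free prize", "agenda for the board meeting"]

X_train = [extract_features(text) for text, _ in training_data]
y_train = [label for _, label in training_data]
model = train(X_train, y_train)

predictions = [predict(model, extract_features(text)) for text in test_data]
print(predictions)  # ['spam', 'ham']
```

Note that the test messages never appear in the training data; the model is judged on inputs it has not seen.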
19. TERMINOLOGY
• Features: distinct traits that can be used to describe each item in a
quantitative manner.
• Sample: item(s) to process (e.g. classify). It can be a document, a picture, a sound, a
video, a row in database or CSV file, or whatever you can describe with a fixed set of
quantitative traits.
• Feature extraction: reduces samples to a simpler representation, e.g. vectors
• Training data: data used to discover potentially predictive relationships.
• Test data: separate data used to evaluate the model that was built
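As a small illustration of these terms, the sketch below turns two text samples into fixed-length feature vectors by counting words (a "bag of words"). The sample sentences and the function names are hypothetical; word counting is only one of many possible feature extraction schemes.

```python
def build_vocabulary(documents):
    """Collect every distinct word; each word becomes one feature."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_vector(document, vocabulary):
    """Describe one sample as counts over the fixed set of features."""
    words = document.lower().split()
    return [words.count(term) for term in vocabulary]

samples = ["big data big plans", "data science plans"]
vocabulary = build_vocabulary(samples)
vectors = [to_vector(sample, vocabulary) for sample in samples]

print(vocabulary)  # ['big', 'data', 'plans', 'science']
print(vectors)     # [[2, 1, 1, 0], [0, 1, 1, 1]]
```

Every sample ends up with the same number of quantitative traits, which is what lets a learning algorithm compare them.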
21. SUPERVISED LEARNING
• The correct classes of the training data are known
Credit: http://us.hudson.com/legal/blog/postid/513/predictive-analytics-artificial-intelligence-science-fiction-e-discovery-truth
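A minimal sketch of supervised learning, assuming a k-nearest-neighbours classifier (one common choice, not necessarily the one pictured on the slide). The points and the `low`/`high` labels are made up; what matters is that every training point comes with its known class.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest labelled points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Training data: the correct class of every point is known in advance.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(points, labels, (1.2, 1.4)))  # low
print(knn_predict(points, labels, (6.4, 6.1)))  # high
```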
22. UNSUPERVISED LEARNING
• The correct classes of the training data are not known
Credit: http://us.hudson.com/legal/blog/postid/513/predictive-analytics-artificial-intelligence-science-fiction-e-discovery-truth
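A minimal sketch of unsupervised learning, assuming one-dimensional k-means clustering as the technique. The values (imagined as minutes spent on a site) and the starting centres are hypothetical; no labels are supplied, and the algorithm groups the values on its own.

```python
def kmeans_1d(values, centres, iterations=10):
    """Group values around the nearest centre, then move each centre to its group's mean."""
    centres = list(centres)
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for value in values:
            nearest = min(range(len(centres)), key=lambda i: abs(value - centres[i]))
            clusters[nearest].append(value)
        # An empty cluster keeps its previous centre.
        centres = [
            sum(group) / len(group) if group else centres[i]
            for i, group in enumerate(clusters)
        ]
    return centres, clusters

minutes_on_site = [1, 2, 3, 10, 11, 12]  # unlabelled observations
start = [min(minutes_on_site), max(minutes_on_site)]  # simple deterministic start
centres, clusters = kmeans_1d(minutes_on_site, start)

print(centres)   # [2.0, 11.0]
print(clusters)  # [[1, 2, 3], [10, 11, 12]]
```

The two groups emerge from the data alone; deciding what they mean (say, casual versus engaged visitors) is still left to a person.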
23. SEMI-SUPERVISED LEARNING
• A mix of supervised and unsupervised learning
Credit: http://us.hudson.com/legal/blog/postid/513/predictive-analytics-artificial-intelligence-science-fiction-e-discovery-truth
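One way to picture the mix is self-training: start from a few labelled points and let labels spread to nearby unlabelled ones. The sketch below assumes that approach (there are others), with hypothetical one-dimensional values and class names `A` and `B`.

```python
def self_train(labelled, unlabelled):
    """labelled: list of (value, label) pairs; unlabelled: list of values.

    Repeatedly take the unlabelled value closest to any labelled value
    (the most confident guess) and give it that neighbour's label.
    """
    labelled = list(labelled)
    remaining = list(unlabelled)
    while remaining:
        value, (_, label) = min(
            ((u, min(labelled, key=lambda p: abs(p[0] - u))) for u in remaining),
            key=lambda pair: abs(pair[0] - pair[1][0]),
        )
        labelled.append((value, label))
        remaining.remove(value)
    return labelled

known = [(1, "A"), (9, "B")]      # small labelled set
unknown = [2, 3, 4, 7, 8]         # larger unlabelled set
result = dict(self_train(known, unknown))

print(result)  # {1: 'A', 9: 'B', 2: 'A', 3: 'A', 4: 'A', 8: 'B', 7: 'B'}
```

Only two points were labelled by hand; the remaining five received labels from the structure of the data.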
24. REINFORCEMENT LEARNING
• Allows the machine or software agent to learn its behaviour based on feedback from the
environment.
• This behaviour can be learnt once and for all, or keep on adapting as time goes by.
Credit: http://us.hudson.com/legal/blog/postid/513/predictive-analytics-artificial-intelligence-science-fiction-e-discovery-truth
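A minimal sketch of learning from feedback, assuming an epsilon-greedy agent choosing between three actions. The payoffs are hypothetical and fixed (real environments are usually noisy, which is why continued exploration matters); the agent never sees them directly, only the reward for the action it just took.

```python
import random

def run_agent(payoffs, steps=200, epsilon=0.1, seed=0):
    """Learn the value of each action purely from the rewards received."""
    rng = random.Random(seed)
    estimates = [0.0] * len(payoffs)
    pulls = [0] * len(payoffs)
    for step in range(steps):
        if step < len(payoffs):
            action = step  # try every action once to begin with
        elif rng.random() < epsilon:
            action = rng.randrange(len(payoffs))  # explore: random action
        else:
            action = max(range(len(payoffs)), key=lambda a: estimates[a])  # exploit
        reward = payoffs[action]  # feedback from the (hypothetical) environment
        pulls[action] += 1
        # Running average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / pulls[action]
    return estimates, pulls

estimates, pulls = run_agent([0.2, 0.8, 0.5])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action)  # 1
```

Keeping `epsilon` above zero corresponds to an agent that keeps adapting over time; setting it to zero after an initial period corresponds to behaviour learnt once and for all.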
28. MANAGERIAL CHALLENGES
• Leadership
Set clear goals, define success, ask the right questions, be creative, create a vision,
deal with stakeholders …
• Talent management
Obvious: Data scientists, computer scientists.
Also: Those who can reframe questions so that data can answer them, design
experiments, visualize and interpret data, speak the language of business.
• Technology
Commonly used: Hadoop. IT departments will need to adapt.
• Decision making
Bring people who understand the problem together with the relevant data.
• Company culture
Stop relying on hunches. Ask yourself “What do we know?”, not “What do we think?”
29. RECOMMENDATIONS
• Self-regulate
• Be transparent / educate your customers
• Need for clear rules around ownership
• Public infrastructure?
• Is data collection anti-competitive?
• Trust?
30. AGENDA
1. What are “Big Data”?
2. Data and Data Science
3. Machine learning at scale
4. Still to come
32. ARTIFICIAL INTELLIGENCE
• “[The automation of] activities that we associate with human thinking, activities such
as decision-making, problem solving, learning ...” (Bellman, 1978)
• “A field of study that seeks to explain and emulate intelligent behavior in terms of
computational processes” (Schalkoff, 1990)
• Turing Test: “Is a machine able to exhibit intelligent behavior equivalent to, or
indistinguishable from, that of a human?”