Data is often called "the new oil" or "the new gold". In the context of AI systems, we often treat data more like "the new bacon": bigger data is better data, and we overfeed AI systems with data as if it were a cheap, infinitely available resource.
We want to fight data's bacon-like image by promoting the concept of data minimalism for AI as a strategy to enhance both the quality and the sustainability of AI systems. To survive as data minimalists, we compute the (monetary) value of single data points and then try to keep only the valuable ones.
Implementing this concept is as challenging and as interesting as it sounds. As a corporate-scale example, we show how much data is actually wasted in an e-commerce recommender system, and how we also found toxic data while applying our data-minimization strategies.
This topic was presented at a joint event of Munich Datageeks and Women in Big Data Munich:
https://munich-datageeks.de/
https://www.womeninbigdata.org/
For more content like this, visit the IT Knowledge Bank website:
https://www.itknowledgebank.com/
Video
You can watch the recording of the presentation on our YouTube channel:
https://youtu.be/zNAXnWUaqaU
About Michaela Regneri
Michaela Regneri works as a Senior Expert for Artificial Intelligence & Cognitive Computing at OTTO (Hamburg). She is fascinated by AI, especially by its visual, linguistic and cognitive implications for human-computer interaction.
After her PhD in Computational Linguistics, she joined Der SPIEGEL as an R&D engineer, working on search and text mining for the newsroom. In 2016, she started to work at OTTO as a product manager for Business Intelligence Analytics, developing applications with and around data science.
In her current role, she continues to drive & challenge different areas of AI for e-commerce, with a particular interest in AI innovation processes and corporate digital responsibility.
Data Minimalism & Data Value by Michaela Regneri
1. 1 Data Minimalism & Data Value (Why Data shouldn‘t be the new Bacon)
Michaela Regneri
Munich Datageeks, January 2020
Data Minimalism & Data Value
(Or: Why Data should not be the new Bacon)
Michaela Regneri
Munich Data Geeks / Women in Big Data, 22.01.2020
Joint work with: Julia Georgi, Jurij Kost, Niklas Pietsch, Sabine Stamm
Special Thanks to Malte Hoffmann & Timo Schulz
2.
Data Minimalism
3.
Data Minimalism & AI
4.
Data Costs beyond Money: Energy & Emissions
- The internet needs more energy than a metropolis (25 power plants)
- Data traffic causes more carbon emissions than air traffic.
5.
Data Costs beyond Money: Safety (and trust, and more money)
- Data carries enormous value
- Data value does not necessarily depend on its mass!
[Figure: US Data Breaches, 2005 to 2018. Series: # data breaches, # stolen records (millions). Source: Statista]
7.
How much data do we need?
Does every click count?
[Figure: standard learning curve (depends on algorithm & task!). Axes: Performance (gains!) over Data Volume (costs!)]
Assumption 1: most data value happens in automation
Assumption 2: You can measure performance (usage-based value)
8.
How much data do we need? (At OTTO, in real life)
Example case: a recommender system (in multiple versions)
In our case: "Customers who clicked this item…"
9.
How much data do we need? (At OTTO, in real life)
Experiment: How does Data Volume affect KPIs? (Machine Learning System)
[Figure: user sessions are fed into Word2Vec, which produces "You might also like" recommendations. The system is trained on 10% of the data, then 20%, 30%, …; each time the data volume is increased, the KPI change is evaluated.]
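The experiment loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the system from the talk: a simple co-occurrence recommender stands in for Word2Vec so that the example needs no external library, and the hit rate on held-out sessions is a hypothetical offline stand-in for business KPIs such as revenue or conversion rate. All function names are assumptions made for this sketch.

```python
import random
from collections import Counter, defaultdict


def train_cooccurrence(sessions):
    """Count how often two items appear in the same session."""
    counts = defaultdict(Counter)
    for session in sessions:
        items = set(session)
        for a in items:
            for b in items:
                if a != b:
                    counts[a][b] += 1
    return counts


def recommend(model, item, k=3):
    """Return up to k items most often seen together with `item`."""
    return [other for other, _ in model[item].most_common(k)]


def hit_rate(model, test_sessions, k=3):
    """Share of test sessions in which a recommendation for the first
    item also occurs later in the same session (assumed KPI stand-in)."""
    if not test_sessions:
        return 0.0
    hits = 0
    for session in test_sessions:
        recs = recommend(model, session[0], k)
        if any(r in session[1:] for r in recs):
            hits += 1
    return hits / len(test_sessions)


def volume_experiment(train_sessions, test_sessions, fractions, seed=0):
    """Retrain on growing fractions of the data and record the KPI."""
    rng = random.Random(seed)
    shuffled = train_sessions[:]
    rng.shuffle(shuffled)
    results = {}
    for fraction in fractions:
        n = max(1, int(len(shuffled) * fraction))
        model = train_cooccurrence(shuffled[:n])
        results[fraction] = hit_rate(model, test_sessions)
    return results


train = [["apple", "pear"], ["apple", "pear"],
         ["apple", "cherry"], ["grape", "chocolate"]]
test = [["apple", "pear"]]
results = volume_experiment(train, test, [0.5, 1.0])
```

Plotting `results` against the fractions gives a learning curve of the kind the slides describe; where it flattens, additional data adds cost without adding measurable value.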
10.
How much data do we need? (At OTTO, in real life)
Experiment: How does Data Volume affect KPIs? (Machine Learning System)
[Figure: KPIs (normalized from 0 to 1) plotted against amount of data (relative to max.): 0.70%, 3.33%, 10%, 20%, …, 100%. Legend: computing time, revenue, conversion rate, % of customers who bought a recommendation. Annotation: ~9 TB]
12.
Which data do we need?
Finding clicks that matter.
[Figure: the customer feeds data to the AI algorithm (craving knowledge)]
- "Data celery": expectable page impression (e.g. Daily Deal); redundant or irrelevant information
- "Data deluxe burger": click on new search result; new & relevant information
14.
Data Value: Sensitivity Analysis
Computing the value of individual data points (by leaving them out)
[Figure: a reference system trained on all user sessions is compared with 500 test systems, with one data point omitted in each. Question: what is the difference in recommendation quality?]
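The leave-one-out idea on this slide can be sketched as below. This is a minimal illustration under stated assumptions, not the setup used in the talk: `train` and `evaluate` are hypothetical callables standing in for the real recommender and its KPI measurement, and the toy model and toy KPI exist only to make the example runnable.

```python
from statistics import mean


def leave_one_out_values(data_points, train, evaluate):
    """Return one value per data point: reference KPI minus KPI without it.

    `train` and `evaluate` are hypothetical callables supplied by the caller:
    train(list_of_points) -> model, evaluate(model) -> KPI (higher is better).
    """
    reference_kpi = evaluate(train(data_points))
    values = []
    for i in range(len(data_points)):
        reduced = data_points[:i] + data_points[i + 1:]  # omit exactly one point
        values.append(reference_kpi - evaluate(train(reduced)))
    return values


# Toy stand-ins: the "model" is the mean of the data, the "KPI" is the
# negative distance of that mean from a target of 10.
def toy_train(points):
    return mean(points)


def toy_evaluate(model):
    return -abs(model - 10)


values = leave_one_out_values([10, 10, 40], toy_train, toy_evaluate)
```

A positive value means the system performs worse without the point, so the point is worth keeping; a negative value means the system improves once the point is removed. Each data point requires one retraining run, which is why this is practical only at lab scale or on a sample, such as the 500 test systems mentioned on the slide.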
15.
Data value: Results
(Lab-scale experiment, real revenue data)
- More than 62% of test data points with positive value
- About 11% with negative value ("toxic data")
- 26% of the data points with (virtually) no effect
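Grouping per-point values into the three categories above could look like the following sketch. The `tolerance` threshold that decides what counts as "(virtually) no effect" is an assumption for illustration; the slides do not state how that boundary was drawn.

```python
def classify_values(values, tolerance=1e-6):
    """Return the share of positive, toxic (negative) and neutral values.

    `tolerance` is a hypothetical cut-off: values within +/- tolerance
    are treated as having (virtually) no effect.
    """
    counts = {"positive": 0, "toxic": 0, "neutral": 0}
    for value in values:
        if value > tolerance:
            counts["positive"] += 1
        elif value < -tolerance:
            counts["toxic"] += 1
        else:
            counts["neutral"] += 1
    total = len(values)
    if total == 0:
        return {label: 0.0 for label in counts}
    return {label: count / total for label, count in counts.items()}


shares = classify_values([5.0, 5.0, -10.0, 0.0])
```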
16.
Data Value ↔ Informational Value
(1) Determine your system's business impact
(2) Sensitivity analysis: what does a single data point change?
(3) Relate output changes to KPI changes (more informed does not always imply better performance)
17.
Toxic Data: Typical Online Marketing Example
18.
Toxic Data: New Edition of Tech vs. Sales
Experiment: „Deal of the Day“ and Recommendation Quality
• Generates lots of clicks / engagement
• Generates lots of "unnatural" user sessions, too…
19.
Toxic Data: New Edition of Tech vs. Sales
Experiment: "Deal of the Day" and Recommendation Quality
Product as Deal of the Day (averaged over one month's deals)
- Click rate: -28%
- Conversion rate: -8%
[Figure: timeline from 30 days ahead of "deal day" to 30 days after "deal day"]
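A before/after comparison of this kind can be computed roughly as sketched below. The slides only say the figures were averaged over one month's deals; the exact rates compared and the aggregation used are not stated, so the per-product averaging here is an assumption, and the function names and sample numbers are purely illustrative.

```python
from statistics import mean


def relative_change(before, after):
    """Relative change from `before` to `after`, e.g. -0.28 for -28%."""
    if before == 0:
        raise ValueError("rate before the deal day must be non-zero")
    return (after - before) / before


def average_deal_effect(deals):
    """Average the relative change over several deal products.

    `deals` is a list of (rate_before, rate_after) pairs, one per product,
    where each rate covers the window before or after its deal day.
    """
    return mean(relative_change(before, after) for before, after in deals)


# Illustrative numbers only, not data from the talk.
effect = average_deal_effect([(0.10, 0.05), (0.20, 0.20)])
```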
20.
Data Minimalism
- Sustainability: economic, ecological and social necessity for future-proof systems
- Quality: enabling optimization by filtering toxic data
- Is as complex as the decision system using the data (but still feasible; needs explainable AI)
…so much fun research to do. ☺
21.
Looking forward to chatting about…
- Data value & explainable AI
- AI, ethics & digital literacy
- Applied research & innovation processes
michaela.regneri@otto.de
Analyzing Hypersensitive AI: Instability in Corporate-Scale Machine Learning. Explainable AI (XAI) 2018
Computing the Value of Data: Towards Applied Data Minimalism. Green Data Mining 2019