SlideShare uma empresa Scribd logo
1 de 32
Baixar para ler offline
Privacy by design
Big data technology summit, 2018-02-22
Lars Albertsson
www.mapflat.com
1
Who’s talking?
● KTH-PDC Center for High Performance Computing (MSc thesis)
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (building very large machines)
● Google (Hangouts, productivity)
● Recorded Future (natural language processing startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat (independent data engineering consultant)
○ ~15 clients: Spotify, 3 banks, 3 conglomerates, 4 startups, 5 *tech, misc
2
Privacy protection resources
3
All of this might go
wrong. Large fine.
Pour your data into
our product.
404
Privacy by design
● Required by GDPR
● Technical scope
○ Engineering toolbox
○ Puzzle pieces - not complete solutions
● Assuming that you solve:
○ Legal requirements
○ Security primitives
○ ...
● Disclaimers:
○ This is not a description of company X
○ This is not legal / compliance advice
4
Archi-
tecture
Statis-
tics
Legal
Organi-
sation
Security
Process Privacy UX
Culture
Requirements, engineer’s perspective
● Right to be forgotten
● Limited collection
● Limited retention
● Limited access
○ From employees
○ In case of security breach
● Consent for processing
● Right for explanations
● Right to correct data
● User data enumeration
● User data export
5
Ancient data-centric systems
● The monolith
● All data in one place
● Analytics + online serving from
single database
● Current state, mutable
- Please delete me?
- What data have you got on me?
- Please correct this data
- Sure, no problem!
6
DB
Presentation
Logic
Storage
Event-oriented / big data systems
7
Every event All events, ever,
raw, unprocessed
Refinement
pipeline
Artifact of
value
● Motivated by
○ New types of data-driven (AI) features
○ Quicker product iterations
■ Data-driven product feedback (A/B tests)
■ Democratised data - fewer teams involved in changes
○ Robustness - scales to more complex business logic
Enable disruption
Data
lake
Data processing at scale
8
Cluster storage
Batch processing
AI feature
DatasetJob
Pipeline
Data-driven product
development
AnalyticsCold
store
Ingress Egress
Factors of success
9
Functional architecture:
● Event-oriented - append only
● Immutability
● At-least-once semantics
● Reproducibility
○ Through 1000s of copies
● Redundancy
- Please delete me?
- What data have you got on me?
- Please correct this data
- Hold on a second...
Solution space
10
Technical
feasibility
Easy to do
the right thing
Awareness
culture
Personal information (PII) classification
You need to establish a field/dataset
classification. Example:
Is application content sensitive? Depends.
● Music, video playlists - perhaps not
● Running tracks, taxi rides - apparently
● In-application messages - probably
11
● Red - sensitive data
○ Messages
○ GPS location
○ Views, preferences
● Yellow - personal data
○ IDs (user, device)
○ Name, email, address
○ IP address
● Green - insensitive data
○ Not related to persons
○ Aggregated numbers
● Grey zone
○ Birth date, zip code
○ Recommendation / ads models?
Eye of the needle tool
● Provide data access through gateway tool
○ Thin wrapper around Spark/Hadoop/S3/...
○ Hard-wired configuration
● Governance
○ Access audit, verification
○ Policing/retention for experiment data
12
Eye of the needle tool
● Easy to do the right thing
○ Right resource choice, e.g. “allocate temporary
cluster/storage”
○ Enforce practices, e.g. run jobs from central repository code
○ No command for data download
● Enabling for data scientists
○ Empowered without operations
○ Directory of resources
13
Possible strategy: Privacy protection at ingress
Scramble on arrival
+ Simple to implement
- Limits value extraction
- Deanonymisation possible
IMHO not a feasible strategy
14
Privacy protection at egress
Processing in opaque box
+ Enabling
+ Simpler to reason about
- Strict operations required
- Exploratory analytics need explicit egress /
classification
15
Machines are
allowed to see
intermediate data
Humans &
services interact
with exported
data
Towards oblivion
● Left to its own devices,
personal (PII) data spreads
like weed
● PII data needs to be
governed, collared, or
discarded
○ Discard what you can
16
● Discard all PII
○ User id in example
● No link between records or datasets
● Replace with non-PII
○ E.g. age, gender, country
● Still no link
○ Beware: rare combination => not anonymised
Drop user id
Discard: Anonymisation
17
Replace user id with
demographics
Useful for
business
insights
Useful for
metrics
Partial discard: Pseudonymisation
● Hash PII
● Records are linked
○ Across datasets
○ Still PII, GDPR applies
○ Persons can be identified (with additional data)
○ Hash recoverable from PII
● Hash PII + salt
○ Hash not recoverable
● Records are still linked
○ Across datasets if salt is constant
18
Hash user id Useful for
recommendations
Hash user id
+ salt
Useful for product
insights
● Push reruns with
workflow orchestrator
- No versioning support in tools
- Computationally expensive
- Easy to miss datasets
- PII in cleartext everywhere
+ No data model changes required
+ Usually necessary for egress storage
Governance: Recomputation
19
● Fields reference PII table
● Clear single record => oblivion
- PII table injection needed
- Key with UUID or hash
- Extra join
- Multiple or wide PII tables
+ PII table can be well protected
Ejected record pattern
20
● PII fields encrypted
● Per-user decryption key table
● Clear single user key => oblivion
- Extra join + decrypt
- Requires user-defined function in SQL?
- Decryption (user) id needed
+ Multi-field oblivion
+ Single dataset leak → no PII leak
+ Handles changing PII fields
Lost key pattern
21
● Input:
○ Page view events
○ User account creations
○ User deletion requests
● Business job outputs:
○ Web daily active user count, per country
○ Duplicate display name detection → email
Example: Lost key pattern
22
Raw
NewUser
Raw
Forgotten
User
WebDau
UserName
Duplicate
User
service
DB
Raw
PageView
Web app
● Split RawNewUser
○ Encryption key
○ Non-PII + encrypted PII
Example: Lost key pattern
23
NewUser
NewUser
Key
Raw
NewUser
Raw
Forgotten
User
WebDau
UserName
Duplicate
User
service
DB
Raw
PageView
Web app
Joinable
● UserKey = latest user encryption keys
○ Recursive - depends on yesterday
○ Yesterday's + new - forgotten
● User = all users ever seen
○ Recursive
○ Yesterday's + new
■ grows forever
○ Encrypted PII
Example: Lost key pattern
24
NewUser
NewUser
Key
Raw
NewUser
User
Raw
Forgotten
User
WebDau
UserName
Duplicate
UserKey
User
service
DB
Raw
PageView
Web app
Joinable
● Encrypt page view PII
○ Pseudonymised
● WebDau aggregation requires no PII
● UserNameDuplicate requires email for push
○ Depend on UserKey.latest for decrypting email in User
○ Egress DB should have limited retention
Example: Lost key pattern
25
NewUser
NewUser
Key
Raw
NewUser
User
Raw
Forgotten
User
WebDau
UserName
Duplicate
UserKey
User
service
Latest day
dependency
DB
Raw
PageView
Web app PageView
Joinable
Must have
retention
Tombstone line
● Produce dataset/stream of forgotten users
● Egress components, e.g. online service
databases, may need push for removal.
○ Higher PII leak risk
26
DB Service
Data model deadly sins
● Using PII data as key
○ Username, email
● Publishing entity ids containing PII data
○ E.g. user shared resources (favourites, compilations) including username
● Publishing pseudonymised datasets
○ They can be de-pseudonymised with external data
○ E.g. AOL, Netflix, ...
27
Cluster storage Cluster storage
Limited retention → lake freeze
● Remove expired raw dataset, freeze derived datasets
● Workflow DAG still works
28
Cold store Derived Cold store Derived
www.mapflat.com
What about streaming?
29
Job
Ads Search Feed
App App App
StreamStream Stream
Stream Stream Stream
Job
Job
Stream Stream Stream
Job
Job
Data lake
Business
intelligence
Job
● Unified log - bus of all business events
○ Streams = infinite datasets
● Pipelines with stream processing jobs
○ Governance & reprocessing difficult
● Ejected record & lost key patterns work
○ PII or encryption key in database table
● Retention is naturally limited
www.mapflat.com
Correcting invalid data = human in the loop
● Humans are lousy data processors
○ Expensive to execute
○ Not completely deterministic
○ Not ready to kick off at 2 am
○ Don't read Avro very well
○ Not compatible with CI/CD
30
● Add human curation to cold store
○ Pipeline job merges human curation input
○ Overrides data from other sources
Curation
service
Human
overrides
Lineage
● Tooling for tracking data flow
● Dataset granularity
○ Workflow manager?
● Field granularity
○ Framework instrumentation?
● Multiple use cases
○ (Discovering data)
○ (Pipeline change management)
○ Detecting dead end data flows
○ Right to export data
○ Explanation of model decisions
31
Resources Credits
● https://www.slideshare.net/lallea/protecting
-privacy-in-practice
● http://www.slideshare.net/lallea/data-pipeli
nes-from-zero-to-solid
● http://www.mapflat.com/lands/resources/re
ading-list
● https://ico.org.uk/
● EU Article 29 Working Party
● ENISA: "Privacy by design in big data"
● GDPR-podden
32
● Alexander Kjeldaas, independent
● Lena Sundin, independent
● Oscar Söderlund, Spotify
● Oskar Löthberg, Spotify
● Sofia Edvardsen,
Sharp Cookie Advisors
● Øyvind Løkling,
Schibsted Media Group
● Enno Runne, Baymarkets

Mais conteúdo relacionado

Semelhante a Privacy by Design - Lars Albertsson, Mapflat

Big Data overview
Big Data overviewBig Data overview
Big Data overviewalexisroos
 
Big Data & Social Analytics presentation
Big Data & Social Analytics presentationBig Data & Social Analytics presentation
Big Data & Social Analytics presentationgustavosouto
 
Data_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdfData_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdfprevota
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Jonathan Singer
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big dataLars Albertsson
 
The end of analytics as we know it gauc 2020 - iih nordic - steen rasmussen v2
The end of analytics as we know it   gauc 2020 - iih nordic - steen rasmussen v2The end of analytics as we know it   gauc 2020 - iih nordic - steen rasmussen v2
The end of analytics as we know it gauc 2020 - iih nordic - steen rasmussen v2Steen Rasmussen
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Staying safe in the cloud
Staying safe in the cloudStaying safe in the cloud
Staying safe in the cloudOleg Podsechin
 
Piano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingPiano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingMartinStrycek
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data opsLars Albertsson
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesLars Albertsson
 
Bio-IT Trends From The Trenches (digital edition)
Bio-IT Trends From The Trenches (digital edition)Bio-IT Trends From The Trenches (digital edition)
Bio-IT Trends From The Trenches (digital edition)Chris Dagdigian
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!DataWorks Summit
 
Presto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@MyntraPresto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@MyntraShubham Tagra
 
Choosing the Right Database - Facebook DevC Malang Hackdays 2017
Choosing the Right Database - Facebook DevC Malang Hackdays 2017Choosing the Right Database - Facebook DevC Malang Hackdays 2017
Choosing the Right Database - Facebook DevC Malang Hackdays 2017Rendy Bambang Junior
 

Semelhante a Privacy by Design - Lars Albertsson, Mapflat (20)

Big Data overview
Big Data overviewBig Data overview
Big Data overview
 
Workflow Engines + Luigi
Workflow Engines + LuigiWorkflow Engines + Luigi
Workflow Engines + Luigi
 
Big Data & Social Analytics presentation
Big Data & Social Analytics presentationBig Data & Social Analytics presentation
Big Data & Social Analytics presentation
 
Paving The Way To Data Driven
Paving The Way To Data DrivenPaving The Way To Data Driven
Paving The Way To Data Driven
 
Data_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdfData_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdf
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
 
The end of analytics as we know it gauc 2020 - iih nordic - steen rasmussen v2
The end of analytics as we know it   gauc 2020 - iih nordic - steen rasmussen v2The end of analytics as we know it   gauc 2020 - iih nordic - steen rasmussen v2
The end of analytics as we know it gauc 2020 - iih nordic - steen rasmussen v2
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Data democratised
Data democratisedData democratised
Data democratised
 
Staying safe in the cloud
Staying safe in the cloudStaying safe in the cloud
Staying safe in the cloud
 
Role of ML engineer
Role of ML engineerRole of ML engineer
Role of ML engineer
 
Piano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingPiano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processing
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
Bio-IT Trends From The Trenches (digital edition)
Bio-IT Trends From The Trenches (digital edition)Bio-IT Trends From The Trenches (digital edition)
Bio-IT Trends From The Trenches (digital edition)
 
The Big Bad Data
The Big Bad DataThe Big Bad Data
The Big Bad Data
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
 
Presto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@MyntraPresto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@Myntra
 
Choosing the Right Database - Facebook DevC Malang Hackdays 2017
Choosing the Right Database - Facebook DevC Malang Hackdays 2017Choosing the Right Database - Facebook DevC Malang Hackdays 2017
Choosing the Right Database - Facebook DevC Malang Hackdays 2017
 

Mais de Evention

The Factorization Machines algorithm for building recommendation system - Paw...
The Factorization Machines algorithm for building recommendation system - Paw...The Factorization Machines algorithm for building recommendation system - Paw...
The Factorization Machines algorithm for building recommendation system - Paw...Evention
 
A/B testing powered by Big data - Saurabh Goyal, Booking.com
A/B testing powered by Big data - Saurabh Goyal, Booking.comA/B testing powered by Big data - Saurabh Goyal, Booking.com
A/B testing powered by Big data - Saurabh Goyal, Booking.comEvention
 
Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...
Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...
Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...Evention
 
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...Evention
 
Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...
Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...
Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...Evention
 
Building a Modern Data Pipeline: Lessons Learned - Saulius Valatka, Adform
Building a Modern Data Pipeline: Lessons Learned - Saulius Valatka, AdformBuilding a Modern Data Pipeline: Lessons Learned - Saulius Valatka, Adform
Building a Modern Data Pipeline: Lessons Learned - Saulius Valatka, AdformEvention
 
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansApache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansEvention
 
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...Evention
 
Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...
Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...
Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...Evention
 
Enhancing Spark - increase streaming capabilities of your applications - Kami...
Enhancing Spark - increase streaming capabilities of your applications - Kami...Enhancing Spark - increase streaming capabilities of your applications - Kami...
Enhancing Spark - increase streaming capabilities of your applications - Kami...Evention
 
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...Evention
 
Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...
Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...
Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...Evention
 
Stream processing with Apache Flink - Maximilian Michels Data Artisans
Stream processing with Apache Flink - Maximilian Michels Data ArtisansStream processing with Apache Flink - Maximilian Michels Data Artisans
Stream processing with Apache Flink - Maximilian Michels Data ArtisansEvention
 
Scaling Cassandra in all directions - Jimmy Mardell Spotify
Scaling Cassandra in all directions - Jimmy Mardell SpotifyScaling Cassandra in all directions - Jimmy Mardell Spotify
Scaling Cassandra in all directions - Jimmy Mardell SpotifyEvention
 
Big Data for unstructured data Dariusz Śliwa
Big Data for unstructured data Dariusz ŚliwaBig Data for unstructured data Dariusz Śliwa
Big Data for unstructured data Dariusz ŚliwaEvention
 
Elastic development. Implementing Big Data search Grzegorz Kołpuć
Elastic development. Implementing Big Data search Grzegorz KołpućElastic development. Implementing Big Data search Grzegorz Kołpuć
Elastic development. Implementing Big Data search Grzegorz KołpućEvention
 
H2 o deep water making deep learning accessible to everyone -jo-fai chow
H2 o deep water   making deep learning accessible to everyone -jo-fai chowH2 o deep water   making deep learning accessible to everyone -jo-fai chow
H2 o deep water making deep learning accessible to everyone -jo-fai chowEvention
 
That won’t fit into RAM - Michał Brzezicki
That won’t fit into RAM -  Michał  BrzezickiThat won’t fit into RAM -  Michał  Brzezicki
That won’t fit into RAM - Michał BrzezickiEvention
 
Stream Analytics with SQL on Apache Flink - Fabian Hueske
Stream Analytics with SQL on Apache Flink - Fabian HueskeStream Analytics with SQL on Apache Flink - Fabian Hueske
Stream Analytics with SQL on Apache Flink - Fabian HueskeEvention
 
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Evention
 

Mais de Evention (20)

The Factorization Machines algorithm for building recommendation system - Paw...
The Factorization Machines algorithm for building recommendation system - Paw...The Factorization Machines algorithm for building recommendation system - Paw...
The Factorization Machines algorithm for building recommendation system - Paw...
 
A/B testing powered by Big data - Saurabh Goyal, Booking.com
A/B testing powered by Big data - Saurabh Goyal, Booking.comA/B testing powered by Big data - Saurabh Goyal, Booking.com
A/B testing powered by Big data - Saurabh Goyal, Booking.com
 
Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...
Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...
Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...
 
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...
 
Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...
Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...
Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...
 
Building a Modern Data Pipeline: Lessons Learned - Saulius Valatka, Adform
Building a Modern Data Pipeline: Lessons Learned - Saulius Valatka, AdformBuilding a Modern Data Pipeline: Lessons Learned - Saulius Valatka, Adform
Building a Modern Data Pipeline: Lessons Learned - Saulius Valatka, Adform
 
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansApache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
 
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...
 
Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...
Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...
Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...
 
Enhancing Spark - increase streaming capabilities of your applications - Kami...
Enhancing Spark - increase streaming capabilities of your applications - Kami...Enhancing Spark - increase streaming capabilities of your applications - Kami...
Enhancing Spark - increase streaming capabilities of your applications - Kami...
 
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...
 
Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...
Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...
Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...
 
Stream processing with Apache Flink - Maximilian Michels Data Artisans
Stream processing with Apache Flink - Maximilian Michels Data ArtisansStream processing with Apache Flink - Maximilian Michels Data Artisans
Stream processing with Apache Flink - Maximilian Michels Data Artisans
 
Scaling Cassandra in all directions - Jimmy Mardell Spotify
Scaling Cassandra in all directions - Jimmy Mardell SpotifyScaling Cassandra in all directions - Jimmy Mardell Spotify
Scaling Cassandra in all directions - Jimmy Mardell Spotify
 
Big Data for unstructured data Dariusz Śliwa
Big Data for unstructured data Dariusz ŚliwaBig Data for unstructured data Dariusz Śliwa
Big Data for unstructured data Dariusz Śliwa
 
Elastic development. Implementing Big Data search Grzegorz Kołpuć
Elastic development. Implementing Big Data search Grzegorz KołpućElastic development. Implementing Big Data search Grzegorz Kołpuć
Elastic development. Implementing Big Data search Grzegorz Kołpuć
 
H2 o deep water making deep learning accessible to everyone -jo-fai chow
H2 o deep water   making deep learning accessible to everyone -jo-fai chowH2 o deep water   making deep learning accessible to everyone -jo-fai chow
H2 o deep water making deep learning accessible to everyone -jo-fai chow
 
That won’t fit into RAM - Michał Brzezicki
That won’t fit into RAM -  Michał  BrzezickiThat won’t fit into RAM -  Michał  Brzezicki
That won’t fit into RAM - Michał Brzezicki
 
Stream Analytics with SQL on Apache Flink - Fabian Hueske
Stream Analytics with SQL on Apache Flink - Fabian HueskeStream Analytics with SQL on Apache Flink - Fabian Hueske
Stream Analytics with SQL on Apache Flink - Fabian Hueske
 
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
 

Último

Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...gajnagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...kumargunjan9515
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...gajnagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...HyderabadDolls
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...nirzagarg
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numberssuginr1
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1ranjankumarbehera14
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...gajnagarg
 

Último (20)

Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 

Privacy by Design - Lars Albertsson, Mapflat

  • 1. Privacy by design Big data technology summit, 2018-02-22 Lars Albertsson www.mapflat.com 1
  • 2. Who’s talking? ● KTH-PDC Center for High Performance Computing (MSc thesis) ● Swedish Institute of Computer Science (distributed system test+debug tools) ● Sun Microsystems (building very large machines) ● Google (Hangouts, productivity) ● Recorded Future (natural language processing startup) ● Cinnober Financial Tech. (trading systems) ● Spotify (data processing & modelling) ● Schibsted Media Group (data processing & modelling) ● Mapflat (independent data engineering consultant) ○ ~15 clients: Spotify, 3 banks, 3 conglomerates, 4 startups, 5 *tech, misc 2
  • 3. Privacy protection resources 3 All of this might go wrong. Large fine. Pour your data into our product. 404
  • 4. Privacy by design ● Required by GDPR ● Technical scope ○ Engineering toolbox ○ Puzzle pieces - not complete solutions ● Assuming that you solve: ○ Legal requirements ○ Security primitives ○ ... ● Disclaimers: ○ This is not a description of company X ○ This is not legal / compliance advice 4 Archi- tecture Statis- tics Legal Organi- sation Security Process Privacy UX Culture
  • 5. Requirements, engineer’s perspective ● Right to be forgotten ● Limited collection ● Limited retention ● Limited access ○ From employees ○ In case of security breach ● Consent for processing ● Right for explanations ● Right to correct data ● User data enumeration ● User data export 5
  • 6. Ancient data-centric systems ● The monolith ● All data in one place ● Analytics + online serving from single database ● Current state, mutable - Please delete me? - What data have you got on me? - Please correct this data - Sure, no problem! 6 DB Presentation Logic Storage
  • 7. Event-oriented / big data systems 7 Every event All events, ever, raw, unprocessed Refinement pipeline Artifact of value ● Motivated by ○ New types of data-driven (AI) features ○ Quicker product iterations ■ Data-driven product feedback (A/B tests) ■ Democratised data - fewer teams involved in changes ○ Robustness - scales to more complex business logic Enable disruption
  • 8. Data lake Data processing at scale 8 Cluster storage Batch processing AI feature DatasetJob Pipeline Data-driven product development AnalyticsCold store Ingress Egress
  • 9. Factors of success 9 Functional architecture: ● Event-oriented - append only ● Immutability ● At-least-once semantics ● Reproducibility ○ Through 1000s of copies ● Redundancy - Please delete me? - What data have you got on me? - Please correct this data - Hold on a second...
  • 10. Solution space 10 Technical feasibility Easy to do the right thing Awareness culture
  • 11. Personal information (PII) classification You need to establish a field/dataset classification. Example: Is application content sensitive? Depends. ● Music, video playlists - perhaps not ● Running tracks, taxi rides - apparently ● In-application messages - probably 11 ● Red - sensitive data ○ Messages ○ GPS location ○ Views, preferences ● Yellow - personal data ○ IDs (user, device) ○ Name, email, address ○ IP address ● Green - insensitive data ○ Not related to persons ○ Aggregated numbers ● Grey zone ○ Birth date, zip code ○ Recommendation / ads models?
  • 12. Eye of the needle tool ● Provide data access through gateway tool ○ Thin wrapper around Spark/Hadoop/S3/... ○ Hard-wired configuration ● Governance ○ Access audit, verification ○ Policing/retention for experiment data 12
  • 13. Eye of the needle tool ● Easy to do the right thing ○ Right resource choice, e.g. “allocate temporary cluster/storage” ○ Enforce practices, e.g. run jobs from central repository code ○ No command for data download ● Enabling for data scientists ○ Empowered without operations ○ Directory of resources 13
  • 14. Possible strategy: Privacy protection at ingress Scramble on arrival + Simple to implement - Limits value extraction - Deanonymisation possible IMHO not a feasible strategy 14
  • 15. Privacy protection at egress Processing in opaque box + Enabling + Simpler to reason about - Strict operations required - Exploratory analytics need explicit egress / classification 15 Machines are allowed to see intermediate data Humans & services interact with exported data
  • 16. Towards oblivion ● Left to its own devices, personal (PII) data spreads like weed ● PII data needs to be governed, collared, or discarded ○ Discard what you can 16
  • 17. ● Discard all PII ○ User id in example ● No link between records or datasets ● Replace with non-PII ○ E.g. age, gender, country ● Still no link ○ Beware: rare combination => not anonymised Drop user id Discard: Anonymisation 17 Replace user id with demographics Useful for business insights Useful for metrics
  • 18. Partial discard: Pseudonymisation ● Hash PII ● Records are linked ○ Across datasets ○ Still PII, GDPR applies ○ Persons can be identified (with additional data) ○ Hash recoverable from PII ● Hash PII + salt ○ Hash not recoverable ● Records are still linked ○ Across datasets if salt is constant 18 Hash user id Useful for recommendations Hash user id + salt Useful for product insights
  • 19. ● Push reruns with workflow orchestrator - No versioning support in tools - Computationally expensive - Easy to miss datasets - PII in cleartext everywhere + No data model changes required + Usually necessary for egress storage Governance: Recomputation 19
  • 20. ● Fields reference PII table ● Clear single record => oblivion - PII table injection needed - Key with UUID or hash - Extra join - Multiple or wide PII tables + PII table can be well protected Ejected record pattern 20
  • 21. ● PII fields encrypted ● Per-user decryption key table ● Clear single user key => oblivion - Extra join + decrypt - Requires user-defined function in SQL? - Decryption (user) id needed + Multi-field oblivion + Single dataset leak → no PII leak + Handles changing PII fields Lost key pattern 21
  • 22. ● Input: ○ Page view events ○ User account creations ○ User deletion requests ● Business job outputs: ○ Web daily active user count, per country ○ Duplicate display name detection → email Example: Lost key pattern 22 Raw NewUser Raw Forgotten User WebDau UserName Duplicate User service DB Raw PageView Web app
  • 23. ● Split RawNewUser ○ Encryption key ○ Non-PII + encrypted PII Example: Lost key pattern 23 NewUser NewUser Key Raw NewUser Raw Forgotten User WebDau UserName Duplicate User service DB Raw PageView Web app Joinable
  • 24. ● UserKey = latest user encryption keys ○ Recursive - depends on yesterday ○ Yesterday's + new - forgotten ● User = all users ever seen ○ Recursive ○ Yesterday's + new ■ grows forever ○ Encrypted PII Example: Lost key pattern 24 NewUser NewUser Key Raw NewUser User Raw Forgotten User WebDau UserName Duplicate UserKey User service DB Raw PageView Web app Joinable
  • 25. ● Encrypt page view PII ○ Pseudonymised ● WebDau aggregation requires no PII ● UserNameDuplicate requires email for push ○ Depend on UserKey.latest for decrypting email in User ○ Egress DB should have limited retention Example: Lost key pattern 25 NewUser NewUser Key Raw NewUser User Raw Forgotten User WebDau UserName Duplicate UserKey User service Latest day dependency DB Raw PageView Web app PageView Joinable Must have retention
  • 26. Tombstone line ● Produce dataset/stream of forgotten users ● Egress components, e.g. online service databases, may need push for removal. ○ Higher PII leak risk 26 DB Service
  • 27. Data model deadly sins ● Using PII data as key ○ Username, email ● Publishing entity ids containing PII data ○ E.g. user shared resources (favourites, compilations) including username ● Publishing pseudonymised datasets ○ They can be de-pseudonymised with external data ○ E.g. AOL, Netflix, ... 27
  • 28. Cluster storage Cluster storage Limited retention → lake freeze ● Remove expired raw dataset, freeze derived datasets ● Workflow DAG still works 28 Cold store Derived Cold store Derived
  • 29. www.mapflat.com What about streaming? 29 Job Ads Search Feed App App App StreamStream Stream Stream Stream Stream Job Job Stream Stream Stream Job Job Data lake Business intelligence Job ● Unified log - bus of all business events ○ Streams = infinite datasets ● Pipelines with stream processing jobs ○ Governance & reprocessing difficult ● Ejected record & lost key patterns work ○ PII or encryption key in database table ● Retention is naturally limited
  • 30. www.mapflat.com Correcting invalid data = human in the loop ● Humans are lousy data processors ○ Expensive to execute ○ Not completely deterministic ○ Not ready to kick off at 2 am ○ Don't read Avro very well ○ Not compatible with CI/CD 30 ● Add human curation to cold store ○ Pipeline job merges human curation input ○ Overrides data from other sources Curation service Human overrides
  • 31. Lineage ● Tooling for tracking data flow ● Dataset granularity ○ Workflow manager? ● Field granularity ○ Framework instrumentation? ● Multiple use cases ○ (Discovering data) ○ (Pipeline change management) ○ Detecting dead end data flows ○ Right to export data ○ Explanation of model decisions 31
  • 32. Resources Credits ● https://www.slideshare.net/lallea/protecting -privacy-in-practice ● http://www.slideshare.net/lallea/data-pipeli nes-from-zero-to-solid ● http://www.mapflat.com/lands/resources/re ading-list ● https://ico.org.uk/ ● EU Article 29 Working Party ● ENISA: "Privacy by design in big data" ● GDPR-podden 32 ● Alexander Kjeldaas, independent ● Lena Sundin, independent ● Oscar Söderlund, Spotify ● Oskar Löthberg, Spotify ● Sofia Edvardsen, Sharp Cookie Advisors ● Øyvind Løkling, Schibsted Media Group ● Enno Runne, Baymarkets