Main Takeaways:
- Understand the importance of proactively thinking about customer privacy and why ML-based solutions are ideal to tackle that problem
- Bootstrapping an ML workflow and leading your ML scientists through the different steps - goal setting, data collection, data labeling, picking the right ML model, validation, and setting goal success criteria
- Avoiding common pitfalls, not getting overwhelmed with data and complexity, and managing leadership expectations
5. Using ML to protect customer privacy
Pushpak Pujari
PM at Verkada | ex Sr. PM at Amazon
6. Bio
PM at Verkada for Security Cameras
Sr. PM at Amazon Alexa AI - Privacy
Sr. PM at Amazon Web Services IoT
Wharton MBA, EE from IIT Delhi
Hobbies: Tennis, Hiking, Beer Brewing
7. Takeaways from this Webinar
Privacy fundamentals
Privacy preservation techniques
Using ML for privacy – a walkthrough
Strategies for being an impactful Privacy PM
8. Why Privacy Matters
Companies collect and retain tons of customer data:
• Fulfilling a service request
• Legal or regulatory requirements
• Better CX: recommendations, marketing etc.
• Resell data to 3P
Collected data can contain sensitive information
Such data landing in the wrong hands can be devastating, both for the customer and the organization
9. Why Privacy Matters
• Data breaches happen way more frequently than you think
• Data is spread across different organizations and media, so it is almost impossible to track data lineage
• Rise of privacy laws (HIPAA, GDPR, CCPA, COPPA etc.) with more coming soon
• Growing distrust of social media providers
• Customers want transparency on how their data is being used
10. What constitutes Personal Data
• Direct identifiers
• E.g.: full name, address, SSN, phone number
• Indirect identifiers
• E.g.: location history, gender, demographic information, salary
11. Data classification
• Identified: contains direct or indirect identifiers
• Pseudonymous: eliminate or transform direct identifiers
• De-identified: direct and known indirect identifiers removed
• Anonymous: mathematically proven to prevent re-identification
[Figure: "John Doe" (Personal Data) becomes "eEf2gT_334" (Pseudonymized Data); "Mary Jane" (Personal Data) becomes "********" (Anonymous Data). Other labels in the figure: Key, Random Noise]
14. Benefits of being Privacy-first
• Avoid huge fines
• Prevent loss of business licenses
• Brand impact, trust
• Customer loyalty and retention
• Increase Customer Lifetime Value and higher conversion
• Competitive moat
Privacy-first positioning is table stakes
15. Sources of Privacy Risk
Raw Customer Data and its derivatives
Metadata and logs
ML Models
For attackers, raw data is the holy grail, but ML models should not be ignored
16. Privacy Risks from ML Models
[Figure: a test dataset of potential members is fed to the model; the output distributions of predictions differ between members and non-members of the training dataset, and the delta between them denotes privacy risk]
Source: Privacy-Preserving Machine Learning: Threats and Solutions
17. Don’t be alarmed!
• Locking customer data in a secure vault and throwing away the key is not the answer
• The goal is to keep using customer data to deliver great CX without sacrificing customer privacy
The rest of the presentation focuses on using ML to mitigate privacy risks while leaving enough utility in the data
19. Direct Identifiers: Examples
• Name
• Address (all geographic subdivisions smaller than state)
• All dates related to an individual
• Telephone / Fax numbers
• Email address
• Social Security Number
• Medical record number
• Health plan beneficiary number
• Any account number
• Any certificate or license number
• Vehicle identifiers including license plate numbers
• Device identifiers and serial numbers
• Web URLs
• Internet Protocol (IP) Address
• Biometrics including finger or voice print
• Photographic image - not limited to images of the face
20. Direct Identifier Detection and Filtering
Define a list of identifiers and scan
datasets for said identifiers
Easiest to implement
No measurable guarantees
Needs humans in the loop
Maintaining and improving models is hard
21. Pseudonymization
Map direct identifiers to unique tokens
Can be one-way or two-way
Easier to implement
Allows joins with other data tables
One-way: original values cannot be read back from the tokens alone
Two-way: original data can be extracted by whoever holds the key or mapping
Needs consistent implementation
[Figure: card number 4145 4455 3489 9985 tokenized to 41ss utoh dkjbg 9985]
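The one-way and two-way variants mentioned above could be sketched roughly as follows. This is a minimal illustration, not the method used in the talk: the keyed hash (HMAC-SHA256), the in-memory vault, and the names `SECRET_KEY`, `one_way_token`, and `TwoWayTokenizer` are all assumptions made for the example.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret key; a real system would load it from a key manager.
SECRET_KEY = secrets.token_bytes(32)


def one_way_token(value, key=SECRET_KEY):
    """One-way: a keyed hash, so the same input always maps to the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated only to keep the token short


class TwoWayTokenizer:
    """Two-way: random tokens stored in a lookup table (hypothetical in-memory vault)."""

    def __init__(self):
        self._forward = {}  # original value -> token
        self._reverse = {}  # token -> original value

    def tokenize(self, value):
        if value not in self._forward:
            token = secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token):
        # Anyone with access to the vault can recover the original value.
        return self._reverse[token]
```

Because both variants return the same token for the same input, tokenized columns can still be joined across tables, which is why a consistent implementation (same key, same normalization of inputs) matters.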
22. K-anonymization
Generalize quasi-identifiers and make each
record indistinguishable from at least k-1
other records
Stronger anonymization
Reduces data utility
Choosing ideal k value is hard
Choosing generalization logic is hard
[Figure: Zip Codes 94401, 94454, 94432 generalized to 944*; Age values 26, 24, 27, 29]
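A small sketch of the k-anonymity check, using the zip codes and ages from the slide. The generalization rules (keep the first three zip digits, bucket age by decade), the fourth zip code, and the function names are assumptions picked for illustration; choosing these rules well is exactly the hard part noted above.

```python
from collections import Counter


def generalize(record):
    """Coarsen quasi-identifiers: keep 3 zip digits, bucket age by decade."""
    zip_code, age = record
    decade = (age // 10) * 10
    return (zip_code[:3] + "*", f"{decade}-{decade + 9}")


def is_k_anonymous(records, k):
    """True if every generalized quasi-identifier combination occurs at least k times."""
    counts = Counter(generalize(r) for r in records)
    return all(count >= k for count in counts.values())


# Zip codes and ages from the slide; the last zip code is repeated as a made-up value.
records = [("94401", 26), ("94454", 24), ("94432", 27), ("94401", 29)]
```

With these rules all four records collapse into the same group ("944*", "20-29"), so `is_k_anonymous(records, 4)` holds, at the cost of losing the exact zip code and age.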
23. Differential Privacy
Query outcome barely changes whether or not any one record is included
Measurable privacy guarantees
Hard to choose right parameters
Not practical for a lot of use cases (yet)
Maintaining DP datasets over time is expensive
Picture credit: Winton Research
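As a rough illustration of the idea, the sketch below answers a counting query with Laplace noise, a standard way to get a measurable guarantee controlled by a parameter epsilon. It is not taken from the talk; the function names and the choice of a counting query are assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng=None):
    """Noisy count of records matching `predicate`.

    Adding or removing one record changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon is used.
    """
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller epsilon means more noise and stronger privacy but less accurate answers, which is one reason picking the right parameters is hard.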
24. ML to detect direct identifiers: a walkthrough
• Use cases:
• [p0] Scan search phrases for direct identifiers, if found delete immediately
• [p1] If an employee is trying to access customer data for customer analytics, ensure
that it contains no direct identifiers
• Functional requirements
• Detect 5 types of identifiers: full name, address, telephone numbers, email id, SSN
• en_US locale only
• Goal Success Criteria – precision 70%, recall 95%
• Non-functional requirements
• [p0] Scan 1 query (~5 search words) in 250ms
• [p1] Provide API for batch detection
25. Ingredients for a spicy ML model
Training
Data
Success
Metrics
Model
architecture
ML
Infrastructure
Continuous
improvement
26. Training Data
• Garbage-in, garbage-out: training data should be as close as possible to your runtime data in syntax and semantics
• Human labeling challenges
• Identifying which search phrases contain PII so they can be annotated
• Ambiguity: high-quality ground truth requires multiple passes
• Using actual customer data might lead to privacy exposure
• Track labeling metrics, as they directly impact model performance
• Ensure size and diversity in training data to minimize overfitting and underfitting
27. Metrics and Performance Evaluation
Precision and Recall: which one is more important?
Sampling challenges with skewed identifier distribution
Measurement can be expensive
How frequently should you run the measurement workflow?
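For reference, precision and recall could be computed from a labeled sample like this. The sketch treats each search phrase as an ID, with one set of phrases the model flagged and one set that human labelers marked as containing an identifier. The default thresholds are the goal success criteria stated earlier (precision 70%, recall 95%); the function names and the handling of empty sets are assumptions.

```python
def precision_recall(predicted, actual):
    """Precision and recall for sets of phrase IDs flagged by the model vs. by labelers."""
    true_positives = len(predicted & actual)
    # Convention chosen here: an empty set counts as a perfect score.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


def meets_goals(precision, recall, min_precision=0.70, min_recall=0.95):
    """Check a measurement against the goal success criteria."""
    return precision >= min_precision and recall >= min_recall
```

In this use case a missed identifier (low recall) leaks customer data, while a false alarm (low precision) only deletes a harmless phrase, which may be why the recall target is set much higher than the precision target.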
28. Model Architecture: Choose Your Weapon
• Logistic Regression based binary classifiers
• Easy to implement
• Hard to attribute what is working and what isn’t
• Regular Expression (Regex)
• Highly effective for direct identifiers which have consistent schema
• Dumb, hard to generalize, hard to expand and scale
• NER (Stanford NER, Stanza, FLAIR, spaCy, transformers like BERT)
• Ideal for names, addresses and context dependent identifiers
• Computationally expensive, requires large training data
• No one size fits all solution
• Trial and Error based experimentation is key
29. Model Architecture: Choose Your Weapon
1. Name - NER
2. Address - NER
3. Telephone numbers - Regex
4. Email address - Regex
5. Social Security Number - Regex
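For the three regex-based identifiers, a detector could look roughly like the sketch below. The patterns are deliberately simplified assumptions for en_US formats (they do not validate SSN ranges or cover every phone or email format), and the names `PATTERNS` and `find_identifiers` are hypothetical. Names and addresses would go through an NER model instead and are not shown.

```python
import re

# Simplified, illustrative patterns; production patterns need far more edge cases.
PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(
        r"(?<!\d)(?:\+?1[-. ]?)?(?:\(\d{3}\)\s?|\d{3}[-. ])\d{3}[-. ]\d{4}(?!\d)"
    ),
}


def find_identifiers(text):
    """Return (label, matched_text) pairs for every pattern hit in `text`."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


def contains_identifier(text):
    """Convenience check for the 'if found, delete immediately' use case."""
    return bool(find_identifiers(text))
```

Precompiled patterns like these are cheap to run on a ~5-word query, which is part of why regex suits identifiers with a consistent schema; the tradeoff is that each new format needs another hand-written pattern.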
30. Infrastructure
All public cloud providers have offerings
for training, testing, hosting and MLOps
Work with ML scientists to pick
framework of choice
33. The most rewarding PM opportunity
Can seem technically challenging and ambiguous, but:
• True opportunity to lead and stand out
• Core Product Management
• Tremendous learning opportunity, build specific skills for the data-first world
• Truly multi-disciplinary, cutting across AI/ML/data, security, legal, compliance, cloud
• Create positive impact and make the world a better place
34. Strategies to Gain Leverage
• Partner: Identify who cares (CISO, senior leadership)
• Quantify: Quantify impact on Brand and tie it to organization’s business metrics
• Goals: Work backwards from Customer Promises
• Vision: Set an exciting and appealing North Star vision
35. Strategies to Gain Leverage
• Team: Put together a cross-team task force of curious people
• Incremental: Build an incremental roadmap with a few quick wins
• Visibility: Provide continuous visibility
• Incentivize: Create an adoption plan with the right incentives
36. Where to begin
• Follow the data: chart the customer data lifecycle (Ingestion, Storage, Usage, Deletion)
• Create threat map: where are humans in the loop, and what tools do they use to access the data?
• Identify use cases: identify drivers and define success metrics
• Privacy vs Utility tradeoff
38. Resources
• Visual guide to practical data de-identification: https://fpf.org/wp-content/uploads/2016/04/FPF_Visual-Guide-to-Practical-Data-DeID.pdf
• Google's Patent on PII detection: https://patents.google.com/patent/US8561185B1/en
• Microsoft Presidio: https://github.com/microsoft/presidio
• Use NER mode to detect person names in text: https://pii-tools.com/detect-person-names-in-text/
• Custom NLP approaches to data anonymization: https://towardsdatascience.com/nlp-approaches-to-data-anonymization-1fb5bde6b929
• Detecting and redacting PII using Amazon Comprehend: https://aws.amazon.com/blogs/machine-learning/detecting-and-redacting-pii-using-amazon-comprehend/