Using ML to Protect Customer Privacy by fmr Amazon Sr PM

www.productschool.com


Using ML to
protect customer
privacy
Pushpak Pujari
PM at Verkada | ex Sr. PM at Amazon

Bio
PM at Verkada for Security Cameras
Sr. PM at Amazon Alexa AI - Privacy
Sr. PM at Amazon Web Services IoT
Wharton MBA, EE from IIT Delhi
Hobbies: Tennis, Hiking, Beer Brewing

Takeaways from this Webinar
Privacy fundamentals
Privacy preservation techniques
Using ML for privacy – a walkthrough
Strategies for being an impactful Privacy PM

Why Privacy Matters
Companies collect and retain tons of
customer data:
• Fulfilling a service request
• Legal or regulatory requirements
• Better CX: recommendations, marketing etc.
• Resell data to 3P
Collected data can contain sensitive
information
Such data landing in the wrong hands can be
devastating – both for the customer and the
organization

Why Privacy Matters
• Data breaches happen way more
frequently than you think
• Data is spread across different
organizations and media, making it almost
impossible to track data lineage
• Rise of privacy laws (HIPAA, GDPR, CCPA,
COPPA etc.) with more coming soon
• Growing distrust of social media providers
• Customers want transparency on how their
data is being used

What constitutes Personal Data
• Direct identifiers
• E.g.: Full name, address, SSN, phone
number
• Indirect identifiers
• E.g.: location history, gender, demographic
information, salary

Data classification
• Identified: contains direct or
indirect identifiers
• Pseudonymous: eliminate or
transform direct identifiers
• De-identified: direct and known
indirect identifiers removed
• Anonymous: mathematically
proven to prevent re-
identification
[Figure: "John Doe" (personal data) is mapped with a key to the token "eEf2gT_334" (pseudonymized data); "Mary Jane" (personal data) is combined with random noise to give "********" (anonymous data).]

Privacy vs Utility Tradeoff
Picture credit: Mostly AI

Stakeholders in Privacy
Enactment
• Compliance Team
• Information Security
• Legal
• Privacy Engineering
• Product

Benefits of being Privacy-first
• Avoid huge fines
• Prevent loss of business licenses
• Brand impact, trust
• Customer loyalty and retention
• Increase Customer Lifetime Value and higher conversion
• Competitive moat
Privacy-first positioning is table stakes

Sources of
Privacy Risk
Raw Customer Data and its
derivatives
Metadata and logs
ML Models
For attackers, raw data is the holy grail, but ML models should not be ignored

Privacy Risks from ML Models
[Figure: a test dataset of potential members is fed to the model. The output distributions of its predictions differ between members and non-members of the training dataset, and the delta between the two distributions denotes privacy risk.]
Source: Privacy-Preserving Machine Learning: Threats and Solutions
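
The delta between member and non-member outputs can be illustrated with a rough sketch. The confidence scores below are made-up illustrative numbers, and the function names are hypothetical; this is a simple confidence-gap check under those assumptions, not a full membership inference attack and not the method from the cited source.

```python
from statistics import mean

def confidence_gap(member_conf, nonmember_conf):
    """Difference in average prediction confidence (the 'delta')."""
    return mean(member_conf) - mean(nonmember_conf)

def threshold_attack_accuracy(member_conf, nonmember_conf, threshold):
    """Guess 'member' when confidence >= threshold; return the share of correct guesses."""
    correct = sum(c >= threshold for c in member_conf)
    correct += sum(c < threshold for c in nonmember_conf)
    return correct / (len(member_conf) + len(nonmember_conf))

# Illustrative (invented) top-class confidences from a model
member_conf = [0.99, 0.97, 0.95, 0.98]      # records seen during training
nonmember_conf = [0.80, 0.92, 0.70, 0.85]   # records the model never saw

gap = confidence_gap(member_conf, nonmember_conf)
accuracy = threshold_attack_accuracy(member_conf, nonmember_conf, threshold=0.93)
print(f"gap={gap:.3f} attack_accuracy={accuracy:.2f}")
```

With balanced groups, an accuracy near 0.5 means the guesser does no better than chance; the further above 0.5 it gets, the more the model's outputs reveal about who was in the training set.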

Don’t be alarmed!
• Locking customer data in a secure vault and
throwing away the key is not the answer
• Goal is to protect customer data while using it
to deliver great CX without sacrificing
customer privacy
The rest of this presentation focuses on using ML
to mitigate privacy risks while leaving
enough utility in the data

Privacy Preservation Techniques
Data Sanitization
• Direct Identifier Detection and Filtering
• Pseudonymization
• K-anonymization
• Differential Privacy
Privacy-preserving Computation
• Homomorphic Encryption
• Secure Multi-Party Computation
• Trusted Execution Environments
• Federated Learning

Direct
Identifiers
Examples
• Name
• Address (all geographic subdivisions smaller than state)
• All dates related to an individual
• Telephone / Fax numbers
• Email address
• Social Security Number
• Medical record number
• Health plan beneficiary number
• Any account number
• Any certificate or license number
• Vehicle identifiers including license plate numbers
• Device identifiers and serial numbers
• Web URLs
• Internet Protocol (IP) Address
• Biometrics including finger or voice print
• Photographic image - not limited to images of the face

Direct Identifier Detection
and Filtering
Define a list of identifiers and scan
datasets for said identifiers
Easiest to implement
No measurable guarantees
Needs humans in the loop
Maintaining and improving models is hard

Pseudonymization
Map direct identifiers to unique tokens
Can be one-way or two-way
Easier to implement
Allows joins with other data tables
Re-identification impossible from tokens
Original data can be extracted
Needs consistent implementation
[Figure: card number 4145 4455 3489 9985 is tokenized to 41ss utoh dkjbg 9985]
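
A minimal sketch of one-way pseudonymization, assuming a keyed hash (HMAC-SHA256) as the token function. The key value, function names, and token lengths are hypothetical choices for illustration; the slides do not say which scheme was used. The same input and key always yield the same token, which is what allows joins across tables.

```python
import hashlib
import hmac

# Hypothetical demo key; a real key would live in a secrets manager.
SECRET_KEY = b"demo-key-rotate-me"

def pseudonymize(value: str, key: bytes = SECRET_KEY, length: int = 16) -> str:
    """One-way token: identical input and key always produce the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]

def mask_card(card_number: str, key: bytes = SECRET_KEY) -> str:
    """Tokenize a card number but keep the last four digits readable."""
    digits = "".join(ch for ch in card_number if ch.isdigit())
    return pseudonymize(digits[:-4], key, length=12) + digits[-4:]

token_a = pseudonymize("john.doe@example.com")
token_b = pseudonymize("john.doe@example.com")
card_token = mask_card("4145 4455 3489 9985")
print(token_a == token_b, card_token[-4:])
```

A two-way variant would instead store a token-to-value mapping (or use reversible encryption), which is where the "original data can be extracted" risk comes from.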

K-anonymization
Generalize quasi-identifiers and make each
record indistinguishable from at least k-1
other records
Stronger anonymization
Reduces data utility
Choosing ideal k value is hard
Choosing generalization logic is hard
[Figure: zip codes 94401, 94454 and 94432 are generalized to 944*; the age column shows the values 26, 24, 27 and 29]
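
The generalization step can be sketched as follows. The 3-digit zip prefix follows the slide's 944* example; the 10-year age band, the pairing of zip codes with ages, and the function names are assumptions made for illustration.

```python
from collections import Counter

def generalize(record: dict) -> tuple:
    """Coarsen quasi-identifiers: 3-digit zip prefix and a 10-year age band."""
    zip_prefix = record["zip"][:3] + "*"
    low = (record["age"] // 10) * 10
    return (zip_prefix, f"{low}-{low + 9}")

def is_k_anonymous(records: list, k: int) -> bool:
    """True if every generalized group contains at least k records."""
    groups = Counter(generalize(r) for r in records)
    return all(count >= k for count in groups.values())

# Zip/age pairing below is assumed; the slide lists the columns separately.
records = [
    {"zip": "94401", "age": 26},
    {"zip": "94454", "age": 24},
    {"zip": "94432", "age": 27},
    {"zip": "94401", "age": 29},
]

print(generalize(records[0]))          # ('944*', '20-29')
print(is_k_anonymous(records, k=4))    # True
```

Coarser bands make larger groups (higher k) but throw away more detail, which is the utility cost the slide mentions.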

Differential Privacy
Query outcome does not depend materially
on any single record
Measurable privacy guarantees
Hard to choose right parameters
Not practical for a lot of use cases (yet)
Maintaining DP datasets over time is expensive
Picture credit: Winton Research
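
As a hedged illustration of the idea, the sketch below adds Laplace noise to a counting query, a standard textbook mechanism for differential privacy. The epsilon value, the sample ages, and the function names are assumptions for the example; choosing epsilon for a real dataset is the hard parameter problem noted above.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample: the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count of values matching predicate.

    A counting query has sensitivity 1 (one record changes the count by
    at most 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

ages = [26, 24, 27, 29, 41, 35, 52, 19]
rng = random.Random(7)  # seeded only to make the demo repeatable
noisy = dp_count(ages, lambda a: a < 30, epsilon=0.5, rng=rng)
print(f"noisy count of under-30s: {noisy:.2f} (true count is 5)")
```

A smaller epsilon means more noise and stronger privacy; each additional query spends more of the privacy budget, which is part of why maintaining DP datasets over time is costly.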

ML to detect direct identifiers: a walkthrough
• Use cases:
• [p0] Scan search phrases for direct identifiers, if found delete immediately
• [p1] If an employee is trying to access customer data for customer analytics, ensure
that it contains no direct identifiers
• Functional requirements
• Detect 5 types of identifiers: full name, address, telephone numbers, email id, SSN
• en_US locale only
• Goal Success Criteria – precision 70%, recall 95%
• Non-functional requirements
• [p0] Scan 1 query (~5 search words) in 250ms
• [p1] Provide API for batch detection

Ingredients for a spicy ML model
Training
Data
Success
Metrics
Model
architecture
ML
Infrastructure
Continuous
improvement

Training Data
• Garbage-in, garbage-out: training data should be as close to your
runtime data as possible in syntax and semantics
• Human labeling challenges
• Identifying which search phrases contain PII so it can be annotated
• Ambiguity – high quality ground truth requires multiple passes
• Using actual customer data might lead to privacy exposure
• Track labeling metrics, as they directly impact model performance
• Size and diversity in training data to minimize overfit and underfit

Metrics and
Performance
Evaluation
Precision and Recall – which one is
more important?
Sampling challenges with skewed
identifier distribution
Measurement can be expensive
How frequently should you run the
measurement workflow?
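
A small sketch of how precision and recall could be computed from a labeled sample and compared against the walkthrough's targets (precision 70%, recall 95%). The labels and predictions are invented for illustration, and the function name is hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = contains an identifier)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented ground truth and model output for ten search phrases
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
meets_targets = precision >= 0.70 and recall >= 0.95
print(f"precision={precision:.2f} recall={recall:.2f} meets_targets={meets_targets}")
```

A recall target higher than the precision target reflects a choice that a missed identifier (false negative) costs more than an unnecessary deletion (false positive).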

Model Architecture: Choose Your Weapon
• Logistic Regression based binary classifiers
• Easy to implement
• Hard to attribute what is working and what isn’t
• Regular Expression (Regex)
• Highly effective for direct identifiers which have consistent schema
• Dumb, hard to generalize, hard to expand and scale
• NER (Stanford NER, Stanza, FLAIR, spaCy, transformers like BERT)
• Ideal for names, addresses and context dependent identifiers
• Computationally expensive, requires large training data
• No one size fits all solution
• Trial and Error based experimentation is key

Model Architecture: Choose Your Weapon
1. Name - NER
2. Address - NER
3. Telephone numbers - Regex
4. Email address - Regex
5. Social Security Number - Regex
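
For the three regex-assigned identifier types, a minimal sketch might look like the following. The patterns are simplified assumptions for en_US formats and will miss edge cases; names and addresses are left out because they need a trained NER model. All names here are hypothetical.

```python
import re

# Simplified, illustrative patterns; production patterns need far more care.
PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(
        r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
    ),
}

def detect_identifiers(text: str) -> list:
    """Return (label, matched_text) pairs for every pattern hit."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group()))
    return findings

def redact(text: str) -> str:
    """Replace each detected identifier with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

query = "call (650) 555-0100 or mail jane@example.com ssn 123-45-6789"
print(detect_identifiers(query))
print(redact(query))
```

Regexes like these are fast enough to sit comfortably inside a per-query latency budget, which is one reason to reserve the heavier NER model for the identifier types that need context.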

Infrastructure
All public cloud providers have offerings
for training, testing, hosting and MLOps
Work with ML scientists to pick
framework of choice

Continuous
Improvement
Workflow
Re-train your model
periodically
Track model
performance metrics
regularly
Optimize training
frequency
Watch out for model
drift over time
Track labeling quality
metrics regularly
Optimize labeling
workflow

Effective Privacy PM
Cheat Sheet

The most rewarding
PM opportunity
Can seem technically challenging and ambiguous
but
• True opportunity to lead and stand out
• Core Product Management
• Tremendous learning opportunity, build specific
skills for the data-first world
• Truly multi-disciplinary, cutting across AI/ML/data,
security, legal, compliance, cloud
• Create positive impact and make the world a
better place

Strategies to Gain Leverage
• Partner: Identify who cares – CISO, senior leadership
• Quantify: Quantify impact on brand and tie it to the organization’s
business metrics
• Goals: Work backwards from Customer Promises
• Vision: Set an exciting and appealing North Star vision

Strategies to Gain Leverage
• Team: Put together a cross-team task force of curious people
• Incremental: Build an incremental roadmap with a few quick wins
• Visibility: Provide continuous visibility
• Incentivize: Create an adoption plan with the right incentives

Where to begin
• Follow the data: chart the customer data lifecycle
(Ingestion, Storage, Usage, Deletion)
• Create threat map: where are humans in the loop, and
what tools do they use to access the data?
• Identify use cases: Privacy vs Utility tradeoff
• Identify drivers and define success metrics

Best Practices
• Stay abreast of new technology
• Build a community
• Join conferences
• Experiment

Resources
• Visual guide to practical data de-identification: https://fpf.org/wp-content/uploads/2016/04/FPF_Visual-Guide-to-Practical-Data-DeID.pdf
• Google's Patent on PII detection: https://patents.google.com/patent/US8561185B1/en
• Microsoft Presidio: https://github.com/microsoft/presidio
• Use NER mode to detect person names in text: https://pii-tools.com/detect-person-names-in-text/
• Custom NLP approaches to data anonymization: https://towardsdatascience.com/nlp-approaches-to-data-anonymization-1fb5bde6b929
• Detecting and redacting PII using Amazon Comprehend: https://aws.amazon.com/blogs/machine-learning/detecting-and-redacting-pii-using-amazon-comprehend/

Thank
you!
• https://www.linkedin.com/in/pushpakpujari/
• @pushpakpujari
