(Ab)using Identifiers: Indiscernibility of Identity
1. (Ab)using Identifiers
Ben Gross
BayCHI
2009-11-10
University of Illinois at Urbana-Champaign
Library and Information Science
bgross@acm.org
http://bengross.com/
6. Why you might care
•Usability implications
•Productivity implications
•Security implications
•Employee satisfaction
@
7. How did I get here?
•“I only have one email address...”
•“Well, except that one I only use for...”
•“And that other one I use with...”
@
8. Half a million users
“... average user has 6.5 passwords, each of
which is shared across 3.9 different sites.
Each user has about 25 accounts that require
passwords, and types an average of 8
passwords per day.”
Dinei Florêncio and Cormac Herley. A Large-Scale
Study of Web Password Habits. WWW ’07
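The quoted figures are consistent with each other: passwords per user times sites per password lands close to the 25 accounts the study reports. A quick arithmetic check, using only the numbers in the quote (the variable names are illustrative):

```python
# Figures taken from the quoted study summary.
passwords_per_user = 6.5
sites_per_password = 3.9

# If each password is reused on about 3.9 sites, the implied number of
# password-protected accounts per user is the product of the two.
accounts = passwords_per_user * sites_per_password
print(round(accounts))  # 25
```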
@
10. Data
•Financial services: average # of email addresses = 1.8 (min 1 / max 4); IM = 1.8 (min 1 / max 4)
•Design firm: average # of email addresses = 3.6 (min 1 / max 10); IM = 1.7 (min 1 / max 3)
•Combined total: average = 3.3
@
11. “The individual in ordinary work situations
presents himself and his activity to others, the
ways in which he guides and controls the
impression they form of him and the kinds of
things he may and may not do while sustaining
his performance before them.”
Erving Goffman
The Presentation of Self in Everyday Life, 1959.
@
13. Social factors
•“I knew that my college one wasn't
forever, so I wanted something more
permanent after I graduated.”
•“...I didn't like the name that I
picked when it was my first email.”
•“...you just say oh my first name and
last name at gmail.com ... something
easy to remember.”
@
14. Technical factors
•Namespace saturation AKA the
jimsm1th77@hotmail.com problem
•Firewalls and VPNs AKA “They
don’t let me use Hotmail at work...”
•Configuration problems AKA “What
does SMTP-AUTH with MD5
checksums on port 567 mean?”
@
16. It’s Just Data...
“We’re an information economy. They
teach you that in school. What they don't
tell you is that it's impossible to move, to
live, to operate at any level without leaving
traces, bits, seemingly meaningless
fragments that can be retrieved,
amplified...”
William Gibson, Johnny Mnemonic
@
21. Managing Flash Cookies
http://www.macromedia.com/support/documentation/en/flashplayer/help/settings_manager07.html
22. Referer (sic)
•adsl-75-18-132-43.dsl.pltn13.sbcglobal.net - - [10/Nov/2009:14:50:56 -0800] "GET /wireless.html HTTP/1.1" 200 29149 "http://bengross.com/voip.html" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-us) AppleWebKit/531.9 (KHTML, like Gecko) Version/4.0.3 Safari/531.9"
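The log entry on this slide appears to follow the common "combined" access log layout, so the Referer and User-Agent fields can be pulled out with one regular expression. A minimal sketch under that assumption; the pattern, field names, and the `parse_line` helper are illustrative and not taken from the talk:

```python
import re

# Assumed layout: host ident user [time] "request" status size "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

# The log entry from the slide, joined back into a single line.
line = (
    'adsl-75-18-132-43.dsl.pltn13.sbcglobal.net - - '
    '[10/Nov/2009:14:50:56 -0800] "GET /wireless.html HTTP/1.1" 200 29149 '
    '"http://bengross.com/voip.html" '
    '"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-us) '
    'AppleWebKit/531.9 (KHTML, like Gecko) Version/4.0.3 Safari/531.9"'
)

fields = parse_line(line)
print(fields["referer"])  # http://bengross.com/voip.html
```

The same dictionary also exposes the host name, timestamp, and browser string, which is the point of the slide: one ordinary request already carries several identifying fields.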
@
23. Leaky Headers
On the Leakage of Personally Identifiable
Information Via Online Social Networks
Balachander Krishnamurthy and Craig Wills
@
24. More Options
•URL Munging and Session IDs in URL
•Flash Cookies/Local Shared Object
•Silverlight Cookies
•Virtual Page Views, Event (Google
Analytics) User Defined Values
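As a rough illustration of the first option, a site can carry a session ID through every link by rewriting the query string. This is a sketch only; the `sid` parameter name and the `add_session_id` helper are hypothetical:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def add_session_id(url, session_id, param="sid"):
    """Return url with param=session_id in its query string.

    Any existing value for the same parameter is replaced.
    """
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))

print(add_session_id("http://example.com/wireless.html?lang=en", "a1b2c3"))
# http://example.com/wireless.html?lang=en&sid=a1b2c3
```

Because the ID lives in the URL, it also travels in the Referer header whenever the visitor follows a link to another site.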
@
25. Synthetic IDs
•Everything in the Referer header can
be used to form a synthetic identifier.
•The User Agent is a good source
•IP addresses if you have them
•Screen dimensions, user agent
•Hash of IP address/remote ports
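One way to picture a synthetic ID is a hash over whatever request attributes are available. The sketch below is an assumption about how such an identifier might be built, not a description of any particular tracker; the function name, the choice of SHA-256, and the 16-character truncation are all illustrative:

```python
import hashlib

def synthetic_id(ip, user_agent, screen, extra=()):
    """Hash a few request attributes into a short, repeatable identifier.

    screen is a (width, height) tuple; extra is any further strings
    (for example a Referer value) to mix in.
    """
    parts = [ip, user_agent, "%dx%d" % screen, *extra]
    # Join with a control character that is unlikely to occur in the
    # fields, so ("ab", "c") and ("a", "bc") hash differently.
    material = "\x1f".join(parts).encode("utf-8")
    return hashlib.sha256(material).hexdigest()[:16]

# 192.0.2.1 is a documentation-only address, used here as a placeholder.
ua = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-us)"
first = synthetic_id("192.0.2.1", ua, (1440, 900))
again = synthetic_id("192.0.2.1", ua, (1440, 900))
other = synthetic_id("192.0.2.1", ua, (1280, 800))
print(first == again, first == other)  # True False
```

No cookie is stored, yet the same browser on the same connection keeps producing the same value.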
@
26. Other Sources of Bits
•Last Modified and ETag headers
•HTTP Keepalive
•SSL Session IDs
•TCP Timestamps
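To see why a caching header counts as a source of bits, consider ETag. The server can hand each new visitor a unique ETag, and the browser sends it back in If-None-Match on the next request, much like a cookie. A simplified sketch of that server-side logic, with a hypothetical `respond` function and headers modeled as a plain dict (exact-case keys assumed):

```python
import uuid

def respond(request_headers):
    """Return (status, response_headers, visitor_id) for one request."""
    tag = request_headers.get("If-None-Match")
    if tag:
        # Returning browser: the cached ETag comes back with the
        # revalidation request and identifies the visitor.
        return 304, {"ETag": tag}, tag.strip('"')
    # First visit: mint a unique value and send it as the ETag.
    visitor_id = uuid.uuid4().hex
    return 200, {"ETag": '"%s"' % visitor_id}, visitor_id

first_status, first_headers, first_id = respond({})
second_status, _, second_id = respond({"If-None-Match": first_headers["ETag"]})
print(first_status, second_status, first_id == second_id)  # 200 304 True
```

Clearing cookies does not clear this value; it survives until the cached resource is evicted.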
@
27. The Art of Being Lost
•“We do not collect personal contact
information from visitors to your
website. Personal contact information
means billing address, physical
address, individual name, email
address, etc.” (OpenTracker.com)
@
28. Netflix Data Released
•Dataset contains 100,480,507 movie
ratings, created by 480,189 Netflix
subscribers between December 1999 and
December 2005.
•“...all customer identifying information
has been removed; all that remains are
ratings and dates. This follows our
privacy policy...”
•No unique identifiers or quasi-identifiers
@
29. You Only Need Two
•Robust De-anonymization of Large Sparse
Datasets by Arvind Narayanan and Vitaly
Shmatikov
•IMDb as a source of entropy
•“With 8 movie ratings (of which 2 may be
completely wrong) and dates that may have
a 14-day error, 99% of records can be
uniquely identified in the dataset.”
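The quoted result can be pictured as fuzzy matching: given a handful of (movie, rating, date) observations from a public source, find the records that agree with most of them within a date tolerance. The toy below is a deliberately simplified sketch, not the scoring algorithm from the Narayanan and Shmatikov paper; the data, record IDs, and function names are invented for illustration:

```python
from datetime import date

def matches(aux_item, record, max_days=14):
    """True if record rates the same movie identically within max_days."""
    movie, rating, when = aux_item
    if movie not in record:
        return False
    rec_rating, rec_when = record[movie]
    return rec_rating == rating and abs((rec_when - when).days) <= max_days

def candidates(aux, dataset, max_wrong=2, max_days=14):
    """Return IDs of records that agree with all but max_wrong aux items."""
    needed = len(aux) - max_wrong
    return [
        rid
        for rid, record in dataset.items()
        if sum(matches(item, record, max_days) for item in aux) >= needed
    ]

# Invented "anonymized" records: movie -> (rating, date rated).
dataset = {
    "r1": {"A": (5, date(2004, 3, 1)), "B": (3, date(2004, 5, 10)),
           "C": (4, date(2005, 1, 2))},
    "r2": {"A": (5, date(2004, 3, 2)), "B": (1, date(2003, 1, 1)),
           "D": (2, date(2005, 6, 6))},
}
# Invented public observations with slightly different dates.
aux = [("A", 5, date(2004, 3, 5)), ("B", 3, date(2004, 5, 1)),
       ("C", 4, date(2005, 1, 10))]

print(candidates(aux, dataset, max_wrong=1))  # ['r1']
```

In a sparse dataset, few subscribers share even a small set of movie, rating, and approximate date combinations, so the candidate list shrinks to one record quickly.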
@
30. It comes down to this
“Q: If you don't publicly rate movies on IMDb and similar
forums, there is nothing to worry about.
A: ...you should not ever mention any movies you
watched prior to 2005 on a public blog or website.
Everybody who was a Netflix subscriber prior to 2005
should restrain themselves from these activities...
We do not think this is a feasible privacy policy.”
FAQ
“How to Break Anonymity of the Netflix Prize Dataset”
@
31. Guessing Your SSN
•Predicting Social Security Numbers
from Public Data by Alessandro Acquisti
and Ralph Gross
•...I’ll just need the last 4 of your SSN for
verification purposes...
•“...we accurately predicted the first 5
digits of 2% of California records with
1980 birthdays, and 90% of Vermont
records with 1995 birthdays.”
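The two bullets above combine badly. An SSN has nine digits (a 3-digit area number, a 2-digit group number, and a 4-digit serial number), so a correct guess of the first five plus a disclosed "last 4" leaves nothing unknown. A small counting sketch; it gives an upper bound that ignores number ranges that were never issued, and the function name is illustrative:

```python
AREA_DIGITS, GROUP_DIGITS, SERIAL_DIGITS = 3, 2, 4

def remaining_candidates(known_prefix_digits=0, known_suffix_digits=0):
    """Count 9-digit strings consistent with the digits already known."""
    total = AREA_DIGITS + GROUP_DIGITS + SERIAL_DIGITS
    unknown = max(total - known_prefix_digits - known_suffix_digits, 0)
    return 10 ** unknown

print(remaining_candidates())      # 1000000000
print(remaining_candidates(5))     # 10000
print(remaining_candidates(5, 4))  # 1
```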
@
32. Disclosure and UI
•“Facebook Beacon is a way for you to
bring actions you take online into
Facebook. Beacon works by allowing
affiliate websites to send stories about
actions you take to Facebook.”
•Launched November 2007
•Class action lawsuit August 2008
•Shut down September 2009
@