Privacy awareness has increased over the past few years. Earlier, people shared their personal information freely; now they are more cautious about sharing it with strangers, whether in person or online. Given these expectations, it is difficult to accept that a large amount of personally identifiable information (PII) is publicly available on the web. Information portals in the form of e-governance websites run by the Delhi Government in India provide access to such PII without any anonymization. Several databases, e.g., voter rolls, driving licence numbers, the MTNL phone directory and PAN card records, serve as repositories of personal information about Delhi residents. This large amount of available personal information can be exploited due to the absence of a proper written law on privacy in India. PII can also be collected from social networking sites such as Facebook, Twitter and GooglePlus, where users share information about themselves. Since users post this information themselves, it may not be considered a privacy breach, but aggregating it can reveal much more and pose a bigger threat. For example, data from social networks and open government databases can be combined to connect an online identity to a real-world identity. Even though privacy awareness has increased, the threats made possible by the availability of this large amount of personal data are still unknown.
To bring such issues to public notice, we developed Open-source Collation of eGovernment data And Networks (OCEAN), a system where the user enters a small amount of information (e.g., a name) about a person and gets a large amount of personal information about him / her, such as name, age, address, date of birth, mother's name, father's name, voter ID, driving licence number and PAN. By aggregating information within the Voter ID database, OCEAN creates a family tree of the user, giving out the details of his / her family members as well. We also calculated a privacy score, which measures the risk associated with an individual in terms of how much PII of that person is revealed from open government data sources. 1,693 users had the highest privacy score, making them the most vulnerable to risks. Using OCEAN, we could collect 8,195,053 voter roll records; 224,982 driving licence records; 53,419 PAN card numbers; 1,557,715 Twitter; 3,377,102 Facebook; 29,393 Foursquare; 186,798 LinkedIn and 28,900 GooglePlus records. We received 661 total hits (657 unique visitors) from the day we released the system, January 21, 2013, until October 10, 2013. To the best of our knowledge, this is the first deployed real-world tool that provides personal information about residents of Delhi to everyone free of cost.
Full Report: http://arxiv.org/abs/1312.2784
OCEAN: Open-source Collation of eGovernment data And Networks: Understanding Privacy Leaks in Open Government Data
1. OCEAN: Open-source Collation of
eGovernment data And Networks
Understanding Privacy Leaks in Open
Government Data
Srishti Gupta
Advisor: Dr. Ponnurangam Kumaraguru
M.Tech Thesis Defense
20-November-2013
2. Thesis Committee
Dr. Muttukrishnan Rajarajan, City University,
London
Dr. Vinayak Naik, IIIT-Delhi
Dr. PK (Chair), IIIT-Delhi
4. Academic Honors
Gupta, S., Gupta, M., and Kumaraguru, P. OCEAN: Open-source Collation of eGovernment data And Networks. Poster at Security and Privacy Symposium (SPS), IIT-K, 2013.
Best Poster
Gupta, S., Gupta, M., and Kumaraguru, P. Is Government a Friend or Foe? Privacy in Open Government Data. Poster at IBM-ICARE, IISc Bangalore, 2012.
6. Presentation Outline
Presentation Outline
Research Motivation and Aim
Related Work
Research Contribution
Methodology
Experiments and Analysis
Conclusion
Future Work
Questions
8. Ways to get PII
OSN
E-mail, Docs,
Spreadsheet
Mail Thefts, Pharming
Shoulder Surfing
Dumpster Diving
Social Engineering
(e.g., Fake accounts)
Not credible
Limited Info.
Open Government Data Source
9. Research Motivation and Aim
Open Government Data Sources
‘Open’: Publicly available
eGovernment initiatives by different state governments in
the form of databases / services.
Objective?
Improve information gathering procedure
Reduce the burden on citizens to access their data
Pros: Improved data availability, easy verification.
Cons: Databases publicly available, leading to information
disclosure, privacy breach.
11. Research Motivation and Aim
PII Leakage
Voter ID, Name, Father’s name, Age, Gender, Date Of
Birth, DL number, PAN, Phone number
Personally Identifiable
Information (PII)
12. Research Motivation and Aim
The Other Side! “People’s View”
CONSCIOUS
DECISION !
(Kumaraguru, 2012)
14. Research Motivation and Aim
Research Aim
To develop a technology to showcase publicly available
personal information online
To highlight the privacy issues on aggregation of available
personal information
17. Related Work and Research Contribution
Related Work
Yasni
(www.yasni.com)
18. Related Work and Research Contribution
Related Work
Pipl
(www.pipl.com)
19. Related Work and Research Contribution
Related Work
Various country-specific systems built with Open Government Data
Name | Country | Description
IndianKanoon (http://www.indiankanoon.org/) | India | Legal search engine. Indexes judgements of the Supreme Court and several High Courts.
OpenCivic.in (http://www.opencivic.in/) | India | Application Programming Interface. Gives data about state assembly elections and profiles of MPs in Maharashtra.
ABQ Ride (http://www.cabq.gov/abq-apps/city-appslisting/abq-ride) | USA | Real-time locations of city buses. Fares for other public transportation.
Illustreets (http://data.gov.uk/apps/illustreets) | UK | Comparing locations. Gives crime, education, transport and census data for a location.
20. Related Work and Research Contribution
Research Gap
Indian Kanoon
Open
Government Data
Open Source Data
Aggregation
OCEAN
Yasni / Pipl
PII Leakage
22. Related Work and Research Contribution
Research Contribution
First deployed system which shows the aggregated personal
information about the residents of Delhi.
Threat modelling on the various open government databases.
Privacy Score: risk associated with a person based on the PII leaked.
Empirical understanding of privacy perceptions, awareness and
expectations of the users from the open government data.
23. Presentation Outline
Presentation Outline
Research Motivation and Aim
Related Work
Research Contribution
Methodology
Identification of open government data sources
Threat Modelling
Data Extraction
Information Aggregation
Experiments and Analysis
Conclusion
Future Work
Questions
31. Methodology
II. Threat Modelling
[Figure: data flow across the trust boundary between the user and open government data]
Voter rolls: user supplies Name, Constituency; returns Name, Address, Relation name, Age, Gender, Voter ID
Driving licence: user supplies Driving licence number; returns Name, Address, Father’s name, Driving licence no., DOB
PAN: user supplies Name, DOB; returns Name, PAN
32. Research Motivation and Aim
Attack Scenario (I)
Online Voter ID card – Multiple fake voter ID cards can be
created from the available PII
33. Research Motivation and Aim
Attack Scenario (II)
View tax statements (Income tax e-filing) – Fake accounts
can be created to view TDS statements.
34. Research Motivation and Aim
Attack Scenario (III)
Procure a SIM card / phone connection
Fake documents can be created
Credit / debit cards can be applied for in the victim’s name
Networking accounts can be created
35. Methodology
II. Threat Modelling
DREAD Model: Microsoft’s Risk Assessment Model
Term | Remarks
Damage | How big would the damage be if the attack succeeded?
Reproducibility | How easy is it to reproduce the attack?
Exploitability | How much time, effort, and expertise is needed to exploit the threat?
Affected Users | If a threat were exploited, what percentage of users would be affected?
Discoverability | How easy is it for an attacker to discover this threat?
36. Methodology
II. Threat Modelling
Scheme: High (3), Medium (2), Low (1)
Threat: Malicious user can identify PII of Delhi residents
[Threat modelling: http://msdn.microsoft.com/en-us/library/ff648644.aspx]
37. Methodology
II. Threat Modelling
According to Microsoft’s DREAD model,
Range | Level of risk
5 – 7 | Low
8 – 11 | Medium
12 – 15 | High
In our case,
Overall rating = 2 + 3 + 2 + 3 + 3 = 13 (High)
This means that the threat poses a significant risk to the
various information portal websites of the Delhi government
and needs to be addressed as soon as possible.
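The rating above can be reproduced with a few lines of Python. This is a minimal sketch, not the thesis code: the function name `dread_rating` is hypothetical, and the assumption that the five scores are listed in the order Damage, Reproducibility, Exploitability, Affected Users, Discoverability is ours, since the per-category scores are not shown in the transcript.

```python
def dread_rating(damage, reproducibility, exploitability,
                 affected_users, discoverability):
    """Sum five DREAD scores (1 = Low, 2 = Medium, 3 = High) and
    map the total onto the Low / Medium / High bands from the slide."""
    scores = (damage, reproducibility, exploitability,
              affected_users, discoverability)
    if any(s not in (1, 2, 3) for s in scores):
        raise ValueError("each DREAD score must be 1, 2 or 3")
    total = sum(scores)          # ranges from 5 to 15
    if total <= 7:
        level = "Low"
    elif total <= 11:
        level = "Medium"
    else:
        level = "High"
    return total, level


# Scores from the slide; the category order is an assumption.
total, level = dread_rating(2, 3, 2, 3, 3)
print(total, level)
```

With the scores from the slide, the total falls in the 12 – 15 band.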
39. Methodology
III. Data Extraction
Data was collected from various open government data sources using
PHP scripts and stored as MySQL databases.
Open government websites:
Voter rolls [81,95,053]: names queried using letters a-z, across 70 constituencies
Driving licence [2,24,982]: 5 random seeds, ‘incremental attack’
PAN [53,419]: name and DOB taken from driving licence records
40. Methodology
III. Data Extraction
Public data from various online social networking sites was
collected using public API calls.
OAuth tokens were used for authentication and authorization.
[Figure: unique name, API calls, records collected per network]
Facebook [33,77,102]
Twitter [15,57,715]
LinkedIn [1,86,798]
Foursquare [29,393]
GooglePlus [28,900]
42. Methodology
IV. Information Aggregation
Family Tree
Information within Voter ID database aggregated to find
relationships among records.
OCEAN has 3,90,353 such users.
43. Methodology
IV. Information Aggregation
Mapping of users across Voter ID and Driving licence database.
Table Schema:
Database | Attributes
Voter ID | Voter ID, Name, Address, Father's / Mother's / Husband's name, Age, Gender
Driving Licence | Name, Address, Father's name, DOB, Validity period, Vehicle category
Mapping was done on the basis of similarity between the name, relation name and
address of the users across the two databases.
OCEAN has 6,384 such users.
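The slides do not say which similarity measure or threshold was used for this mapping. As an illustration only, the sketch below assumes the standard-library `difflib.SequenceMatcher` ratio and a hypothetical threshold of 0.85; the field names and the two records are synthetic and invented for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_probable_match(voter, licence, threshold=0.85):
    """Treat two records as the same person only if name, relation name
    and address are all similar enough. The threshold is an assumption."""
    field_pairs = [
        ("name", "name"),
        ("relation_name", "father_name"),
        ("address", "address"),
    ]
    return all(
        similarity(voter[v_field], licence[d_field]) >= threshold
        for v_field, d_field in field_pairs
    )


# Synthetic records, not real data.
voter = {"name": "Asha Verma", "relation_name": "Ramesh Verma",
         "address": "12 Example Road, Delhi"}
licence = {"name": "ASHA VERMA", "father_name": "Ramesh Verma",
           "address": "12 Example Road, Delhi"}
print(is_probable_match(voter, licence))
```

Requiring all three fields to agree keeps false matches down for common names, at the cost of missing records whose address is written differently in the two databases.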
45. Methodology
IV. Information Aggregation
Mapping of users across Voter ID, Driving licence and PAN
database.
The subset of driving licence records that had a PAN was chosen.
OCEAN has 1,693 such users.
46. Methodology
IV. Information Aggregation
Mapping users across Foursquare, Facebook and Twitter.
Some users list their accounts on other OSNs on Foursquare. The
information available for such users is aggregated together.
OCEAN has 11 such users.
48. Presentation Outline
Presentation Outline
Research Motivation and Aim
Related Work and Research Contribution
Methodology
System User Interface
Experiments and Analysis
Conclusion
Future Work
Questions
50. Experiments and Analysis
Survey Dataset
62 complete responses.
51% males, 49% females.
77% in the age group 20 – 25.
23% had experienced online identity theft themselves or through friends.
51. Experiments and Analysis
Evaluation Metric I - Privacy Score
Privacy score measures the risk associated with a person on the
basis of how much PII about that person is revealed from open
government data sources.
Privacy score (user) = Σ Sensitivity score (attributes)
Sensitivity score -> {1, 2, 3, 4, 5}
Range | Level
<20 % | 1
21 – 30 % | 2
31 – 50 % | 3
51 – 60 % | 4
>61 % | 5
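The bands above can be expressed as a small lookup. This is a sketch under one stated assumption: the bands leave gaps at their edges (for example between 20 % and 21 %), so the function below treats each upper bound as inclusive and assigns everything above 60 % to level 5. The function name `sensitivity_level` is hypothetical.

```python
def sensitivity_level(percent_unwilling):
    """Map the percentage of respondents unwilling to share an
    attribute onto a sensitivity level from 1 to 5.

    Assumption: upper bounds are inclusive (<=20 -> 1, <=30 -> 2,
    <=50 -> 3, <=60 -> 4, otherwise 5)."""
    if not 0 <= percent_unwilling <= 100:
        raise ValueError("percentage must be between 0 and 100")
    for upper_bound, level in ((20, 1), (30, 2), (50, 3), (60, 4)):
        if percent_unwilling <= upper_bound:
            return level
    return 5


print(sensitivity_level(56.4))   # a value in the 51 - 60 % band
```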
52. Experiments and Analysis
Privacy Score
Attribute | Percentage of users unwilling to share personal information with anyone | Privacy Level
Voter ID | 56.4% | 4
Driving licence no. | 58% | 4
PAN | 67.7% | 5
Full name | 14.5% | 1
Home address | 82.25% | 5
Age | 29% | 2
DOB | 50% | 3
Father’s name | 38.7% | 3
Gender | 14.5% | 1
[Figure: scale running from Level 5 to Level 1, labelled "Willingness to share"]
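Given the privacy levels in the table above, the privacy score of a person is the sum of the levels of the attributes revealed about them. The sketch below is illustrative: the dictionary keys and the function name `privacy_score` are hypothetical, and the example record lists the attributes available for a person who appears only in the voter rolls.

```python
# Privacy levels taken from the table above; key names are hypothetical.
SENSITIVITY = {
    "voter_id": 4,
    "dl_number": 4,
    "pan": 5,
    "full_name": 1,
    "home_address": 5,
    "age": 2,
    "dob": 3,
    "fathers_name": 3,
    "gender": 1,
}


def privacy_score(revealed_attributes):
    """Sum the sensitivity levels of the distinct attributes revealed."""
    return sum(SENSITIVITY[attr] for attr in set(revealed_attributes))


# Attributes exposed for a person found only in the voter rolls.
voter_only = ["voter_id", "full_name", "fathers_name",
              "age", "gender", "home_address"]
print(privacy_score(voter_only))   # 4 + 1 + 3 + 2 + 1 + 5 = 16
```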
53. Experiments and Analysis
Privacy Score
Privacy score for 84,22,459 users:
Case 1: Users having only Voter ID (97.3%)
PS = Σ(Voter ID, name, father’s name, age, gender, address) = 16
Case 2: Users having only Driving licence number (2%)
PS = Σ(DL number, name, relative’s name, DOB, address) = 17
Case 3: Users having only PAN (1%)
PS = Σ(PAN, DL number, name, relative’s name, DOB, address) = 25
54. Experiments and Analysis
Privacy Score
Case 4: Users having Voter ID and DL number (0.07%)
PS = Σ(Voter ID, DL number, name, father’s name, age, gender, DOB,
address) = 24
Case 5: Users having Voter ID, DL number and PAN (0.02%)
PS = Σ(Voter ID, DL number, PAN, name, father’s name, age, gender,
DOB, address) = 29
1,693 people
Highest Risk!
55. Evaluation Metrics
Evaluation Metric II
Recall (Based on user study)
Recall = (Number of people who could be identified in the system) / (Total number of search operations done on the system)
Thus, Recall = ( 179 / 389 ) = 46%
Low Recall
Data collection not 100%.
(Out of 12 million voter records, we have ~8 million records)
Respondents might be unclear about constituency.
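The recall figure is a simple ratio and can be checked directly. A minimal sketch, with a hypothetical function name:

```python
def recall(identified, total_searches):
    """Fraction of search operations that identified the person."""
    if total_searches <= 0:
        raise ValueError("total_searches must be positive")
    return identified / total_searches


# Counts reported in the user study.
print(f"{recall(179, 389):.0%}")   # prints 46%
```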
56. Evaluation Metrics
Evaluation Metric III
System Usability Score (SUS)
Measured using the standard method defined by Brooke.
For OCEAN, the value was 74.5 / 100, which means that people found the
system usable and convenient to use.
(Brooke, 1996)
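For reference, the standard SUS calculation takes ten responses on a 1 to 5 scale: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a score out of 100. The sketch below scores a single respondent; the function name is hypothetical and the sample responses are made up, not taken from the survey.

```python
def sus_score(responses):
    """Score one respondent's ten SUS answers (each 1-5) out of 100."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected ten responses, each between 1 and 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:      # odd items are positively worded
            total += response - 1
        else:                         # even items are negatively worded
            total += 5 - response
    return total * 2.5


# Made-up answers for illustration.
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
```

A system-level figure such as the 74.5 reported here would typically be the mean of the per-respondent scores.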
57. Experiments and Analysis
User Awareness
Government started various open initiatives to increase
the level of transparency with citizens.
But only 19% of survey respondents were aware of them.
Around 76% have been using them for less than 2 years.
Proper schemes are required to publicise their existence.
58. Experiments and Analysis
User Experience
A majority (62%) were shocked to see the availability of
personal information to this extent.
People felt that the information can be used maliciously
against them.
People now feel scared of sharing their information with
various government departments.
60. Feedback
Feedback
“It was an eye-opener to a common man.”
“Waiting for an upgraded version which will work for other states also.”
“I am really shocked that the exact ID numbers are available online without much security against data mining at this scale.”
“A great shortcoming and security flaw has been pointed out by OCEAN. Great work.”
“Good system. Great work! Didn't know such a system existed.”
62. Conclusion
Conclusion
A large amount of personal information is available on
government servers.
Information aggregation yields more information about a
person.
Threat modelling on open government data sources shows the risk
associated with PII leakage and the need for preventive measures.
1,693 users are most vulnerable to identity theft.
People felt the need for access control on the data and proper
privacy laws against the misuse of information.
64. Future Work
Future Work
Datasets can be extended to other states in India.
Mapping users across offline (govt. databases) and online
(social networking sites) worlds.
Data collection can be expanded to improve the recall.
66. References
Kumaraguru, P., and Sachdeva, N. Privacy in India: Attitudes and
Awareness V 2.0. Tech. rep., PreCog-TR-12-001, PreCog@IIIT-Delhi,
2012. http://precog.iiitd.edu.in/research/privacyindia/
McCallister, Erika, Tim Grance, and Karen Scarfone. "Guide to
protecting the confidentiality of personally identifiable information
(PII) (draft), January 2009." NIST Special Publication: 800-122.
Schwartz, Paul M., and Daniel J. Solove. "PII Problem: Privacy and a
New Concept of Personally Identifiable Information, The." NYUL Rev. 86
(2011): 1814.
Mont, Marco Casassa, Siani Pearson, and Pete Bramhall. "Towards
accountable management of identity and privacy: Sticky policies and
enforceable tracing services." Database and Expert Systems
Applications, 2003. Proceedings. 14th International Workshop on. IEEE,
2003.
Jones, Rosie, et al. "I know what you did last summer: query logs and
user privacy." Proceedings of the sixteenth ACM conference on
Conference on information and knowledge management. ACM, 2007
67. References (I)
Nashash, Hyam. "EDUCATION AS A BUILDING BLOCK IN OPENING UP
GOVERNMENT DATA." European Scientific Journal 9.13 (2013).
Barber, Grayson. "Personal Information in Government Records:
Protecting the Public Interest in Privacy." St. Louis U. Pub. L. Rev. 25
(2006): 63.
Krishnamurthy, Balachander, and Craig E. Wills. "On the leakage of
personally identifiable information via online social networks."
Proceedings of the 2nd ACM workshop on Online social networks.
ACM, 2009.
Jurgens, David. "That’s What Friends Are For: Inferring Location in
Online Social Media Platforms Based on Social Relationships." Seventh
International AAAI Conference on Weblogs and Social Media. 2013.
Zheleva, Elena, and Lise Getoor. "To join or not to join: the illusion of
privacy in social networks with mixed public and private user profiles."
Proceedings of the 18th international conference on World wide web.
ACM, 2009.
68. References (II)
Mislove, Alan, et al. "You are who you know: inferring user profiles in
online social networks." Proceedings of the third ACM international
conference on Web search and data mining. ACM, 2010.
Harel, Amir, et al. "M-score: estimating the potential damage of data
leakage incident by assigning misuseability weight." Proceedings of the
2010 ACM workshop on Insider threats. ACM, 2010.
Wright, Glover, Pranesh Prakash Sunil Abraham, and Nishant Shah.
"Open government data study: India." Study commissioned by the
Transparency and Accountability Initiative (2010).
Godse, Mr Vinayak, and Director–Data Protection. "RISE PROJECT."
(2010).
Brooke, John. "SUS-A quick and dirty usability scale." Usability
evaluation in industry 189 (1996): 194.
Social media report 2012: Social media comes of age.
http://www.nielsen.com/us/en/reports/2012/state-of-the-media-thesocial-media-report-2012.html