Secure and Privacy-Preserving Big-Data Processing

Secure Big-Data Processing
University of California, Irvine, USA.
IEEE Big Data 2019, Los Angeles, California, USA
Anton Burtsev Sharad Mehrotra Shantanu Sharma

• Introduction
• How to securely process data at the cloud?
• Challenges and overview of existing state-of-the-art
• Cryptographic Techniques
• Encryption-based Data Outsourcing
• Secret-Sharing-based Data Outsourcing
• Exploiting Trusted Computing Platforms
• Secure hardware
• Hybrid cloud
• Data Partitioning-based Outsourced Data Processing
• Conclusion and Open Problems
Contents

Storage
Distributed File Systems (DFS)
Hadoop DFS, Google File System
(GFS), Gfarm, Amazon S3
Machine
Google Compute Engine, Amazon
Web Services, Microsoft Azure,
Rackspace OpenCloud
Network
Relational databases, Key-Value
Store, NoSQL, MapReduce
Database operations, e.g., selection,
projection, aggregation, and join, and
clustering, machine learning
IaaS PaaS SaaS
Data providers The public cloud Users
Big-Data Processing in the Cloud
Figure reference: Philip Derbeko, Shlomi Dolev, Ehud Gudes, and Shantanu Sharma.
“Security and privacy aspects in MapReduce on clouds: A survey.” Computer Science Review 20 (2016): 1-28.

• Utility model
• Pay for only what you use
• No infrastructure build-up cost and/or
database administration costs
• Elastic
• Use as much as you need (virtually
limitless)
• No system management
headaches
• Failures, loss of data, software
upgrades, patches, bug fixes, etc.
• Cost amortization
• Cheaper due to economy of scale
• Better control over IT investment
Why Cloud?
Public Cloud
Elastic, pay-as-you-go
service
Private
Existing servers or data centers
Hybrid
Utilize both public & private

• Data resides in shared systems
administration of which is not in owners'
control
• Unknown applications and processes share
resources with apps and data.
• Data owners have no control over the
cloud’s internal data security personnel,
policies or their enforcement
• Insider attacks
• Data mining attacks leading to information
leakage
• Cloud providers' compliance with government
subpoenas
Key Challenge: Loss of Control
End Users
Public Cloud

• Availability
• Will the owners always have access to data and services?
• Integrity
• Will the cloud provide answers to queries correctly?
• Security
• Will the cloud implement its own security policies correctly?
• Privacy and confidentiality
• Will sensitive data remain confidential?
• Will data be vulnerable to misuse? By other tenants? By the service provider?
Implications of Loss of Control

What is The Solution?
Encrypt sensitive data before uploading to the cloud

Secure Computing
Download the encrypted data and compute at the trusted side
Cryptographic Solutions at the Cloud Exploiting Trusted Computing
Trusted Private Cloud Untrusted Public Cloud
Download encrypted data
Upload encrypted data
Encrypted data Cleartext data The DB owner Secure hardware Cleartext data processing
Cleartext results
Encrypted query

• An adversary may learn about data:
• From ciphertext (ciphertext representation-based attack)
• From prior knowledge of data distribution (frequency-count attack)
• From the size of the output to a query (output-size attack)
• From the access pattern used by the mechanism in answering a query
(access-pattern attack)
• From knowledge of queries that have executed (search-pattern attack)
• From knowledge of frequency of queries (workload-skew attack)
Common Attacks in Data Outsourcing

• Honest-but-Curious versus Malicious adversary
• Honest-but-curious
• Executes protocols correctly, but wishes to learn about data
• Malicious
• Might sabotage data or computation
• Passive versus Active Adversary
• Passive
• Makes inferences based on passive observations - ciphertext, queries,
workload, and access patterns
• Active
• May actively inject new data, execute queries, or interfere with the
execution
Adversarial Cloud Model

• Semantic Security
• Access to ciphertext does not help provide any information about the
plaintext other than what the adversary knew a-priori.
• Difficult to use directly
• Equivalent notion – Indistinguishability
• Adversary cannot distinguish between the ciphertexts of two
plaintexts
• Easier to prove using a real-versus-ideal game
• Security definition needs to be adapted in data outsourcing
• Since leakages occur from encrypted data representation and query
execution
Defining Security
Reference: Shafi Goldwasser, and Silvio Micali. "Probabilistic encryption." Journal of computer and system sciences 28, no. 2 (1984): 270-299.
Curtmola, Reza, Juan Garay, Seny Kamara, and Rafail Ostrovsky. "Searchable symmetric encryption: improved definitions and efficient constructions." Journal of Computer Security 19, no. 5 (2011): 895-934.

Security Goal: IND-CKA1: Real Game Model with Leakage
Profile
D0
E(D0)
Leakages
e.g., access-patterns, search-patterns,
output-size
A set of queries
A set of encrypted queries (i.e., trapdoors/tokens) for
the requested set of queries

Security Goal: IND-CKA1: Ideal Game Model with
Leakage Profile
D0
E(D’)
1. Leakages (e.g., access-patterns, search-patterns, output-size) from the real game
2. Generate a fake dataset (D’) having the same leakages
3. Randomly select D0 or D’ and encrypt it
Which dataset is
encrypted – D0 or fake?
The same set of queries as in the real game
A set of encrypted queries (i.e., trapdoors/tokens) for the
requested set of queries

• Many cloud providers support
encryption at rest
• Microsoft Always Encrypted
• Amazon Aurora , MariaDB
Cloud Layers and Security
IaaS
PaaS
SaaS
• Secure MapReduce, Secure
Spark, Secure SQL…
• Microsoft Always Encrypted,
Jana@Galois Inc.,
Pulsar@Stealth Software
Technologies
• Application security
• Garble Cloud, Cloud Protect,
SPORC

Encryption-based Cryptographic Approaches
Encrypted data
Cleartext data
The DB ownerThe DB owner
Encrypted processing
Trusted Private
Cloud
Untrusted Public Cloud Users
• Fully homomorphic approach
• Very inefficient and not practical
• Partially homomorphic
• Additive: e.g., Paillier
• Multiplicative: e.g., ElGamal
• Searchable encryption
• Bucketization [Hore et al., VLDB, 04]
• Searchable Encryption [Song et al., IEEE
SP, 00]
• Secure indexes – encrypted Bloom filters
[Goh, 03]
• Order-Preserving Encryption (OPE)
[Agrawal et al., SIGMOD, 04]
• Conjunctive keyword search [Golle et al.,
ACNS, 04]
• Encrypted inverted lists [Curtmola et al.,
CCS, 06]
• Onion encryption [Popa et al., SOSP, 11]
Different approaches
• Different levels of security
• Support different operations
• Different levels of efficiency

MPC and Secret Shared Mechanisms
Untrusted Public Clouds
Users
• Techniques:
• Secret-sharing [Shamir, CACM, 1979]
• Distributed Point Function [Gilboa et al., EUROCRYPT,
2014.]
• Function secret-sharing [Boyle et al., EUROCRYPT, 2015]
• Homomorphic Secret-Sharing [Boyle et al., CCS, 2017]
• Accumulating-Automata [Dolev et al, SCC@ASIACCS ,
2014]
• Obscure [Gupta et al, CS@UCI, 2019]
• Conclave [Volgushev et al. arxiv, 2019]
• SMCQL [Bater et al., PVLDB, 2017]
• Systems:
• Jana by Galois
• Partisia
• PULSAR by Stealth Software Technologies
• Secret Double Octopus and SecretSkyDB Ltd
• Sharemind by Cybernetica
• Unbound Tech.
Secret-Shared Data
Cleartext data
The DB ownerThe DB owner
Secret-Shared processing
Trusted Private
Cloud
• Secure against stronger adversaries
• Information-theoretically secure
• Secure against access-pattern-based attacks
• However, much more expensive
• 5-6 orders of magnitude more expensive compared
to plaintext processing

Cryptographic Techniques vs Security Threats
represents technique is resilient to a given attack.
Resilient to attacks
Techniques Data at rest During query execution
Ciphertext
indistinguishability
Output-
Size
Workload-skew Access-patterns
Full Download
Deterministic Encryption/OPE X X X X
Non-Deterministic Encryption X X X
Searchable encryption X X X
Homomorphic + ORAM X X
Shamir’s Secret-sharing X X
Multi-party computations-Jana X X
Reference: Sharad Mehrotra, Shantanu Sharma, and Jeffrey D. Ullman. "Scaling Cryptographic Techniques by Exploiting Data Sensitivity at a Public Cloud." In Proceedings of the Ninth ACM Conference on Data
and Application Security and Privacy, pp. 165-167. ACM, 2019.

• Efficiency
• How expensive are the cryptographic operations? Is operation linear or sublinear
in the size of the data (indexable versus non-indexable)?
• Generality
• What queries can the technique support – selection, range, join, aggregation
• Dynamic Operations
• Does the scheme support insertion/deletions/updates?
• Client-Side Execution
• How much work does the client have to do? During insertion/updates/queries.
• Security
• How much security does the scheme offer? Quantifiable leakage, e.g.,
orderability, distribution? Semantic security?
Cryptographic Techniques – Design Criteria

Exploiting Trusted Platform
Trusted Private
Cloud
Untrusted Public Cloud Users
Trusted Private
Cloud Untrusted Public Cloud Users
Hybrid Cloud Scenario
Secure Hardware Scenario
Cleartext non-sensitive data Cleartext sensitive data The DB owner Cleartext non-sensitive data processing
Secure
hardware
Cleartext sensitive data processing
• Distribute computation between
untrusted platform and trusted
platform
• Solutions differ on the trusted platform
exploited, degree of integration, security
offered, and computations supported
• Hybrid Cloud-based Solutions
• HybrEx, SEMROD, Sedic
• Secure FPGA-based solutions
• Microsoft Cipherbase
• Intel SGX-based solutions
• Opaque, EnclaveDB, VC3, HardIDX

• Minimizing data movement between trusted and untrusted platforms
• Movement between trusted and untrusted platforms can lead to leakage
• Mapping complex operator workflow between trusted and untrusted
platforms
• Existing trusted hardware is vulnerable to side-channel attacks
• Oblivious access at different levels, e.g., register and cache-line
• Cost vs security
Trusted Platform – Challenges

Security Techniques vs Computation Cost
Selecting a single row from TPC-H Customer table of 1.5M rows and 8 columns
Searchable encryption: DSSE: Distributed
Searchable Symmetric Encryption (PULSAR
by Stealth Software Technologies)
MPC: Multi-party computation (Jana by
Galois)
Opaque SGX based solution [Zhang et al.,
NSDI, 2017]
• Cryptographic Overheads:
• Searchable encryption – ~2 orders of magnitude
• Secure hardware - ~3-4 orders of magnitude
• MPC based solution - ~5-6 orders of magnitude

Can we design an outsourcing solution that is
simultaneously:
Efficient – significantly better compared to downloading
cryptographically secured data, and
Secure – similar to downloading the data and local processing
Secure Data Outsourcing: Challenge
A possible approach??
Partitioned computing that exploits partial sensitivity of data
to restrict cryptographic overheads to only sensitive data
Trusted Private Cloud Untrusted Public Cloud Users
Cleartext non-sensitive data Cleartext sensitive data The DB owner Partition computation Encrypted sensitive data
Reference: Sharad Mehrotra, Shantanu Sharma, Jeffrey Ullman, and Anurag Mishra. "Partitioned data security
on outsourced sensitive and non-sensitive data." In 2019 IEEE ICDE, pp. 650-661. IEEE, 2019.

• Organization data is often only partially sensitive
• Sensitivity dictated by policies
• Sensitivity dictates what data and in what form is it outsourced
• E.g., General office emails possibly not sensitive (hence outsourced)
• Information related to a sensitive project sensitive (hence not outsourced in
plaintext)
• Can we exploit partially sensitive nature of data to scale cryptographic
solutions without compromising security of sensitive data?
• Commercial encrypted database solutions (e.g., Jana by Galois Inc.) are beginning
to explore such solutions
Data Sensitivity

Partitioned Data Security Challenge
• Non-Linkability
• The Adversary does not learn relationship between any encrypted and plaintext
value
• Ciphertext Indistinguishability
• The adversary does not learn any relationships between encrypted values
• unless underlying crypto allows such relationships to be learnt (e.g., OPE)
Reference: Sharad Mehrotra, Shantanu Sharma, Jeffrey Ullman, and Anurag Mishra. "Partitioned data security on outsourced sensitive and non-sensitive data."
In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 650-661. IEEE, 2019.

Cryptographic Solutions
• Encryption-based Techniques
• Bucketization [Hore et al. VLDB 04]
• Searchable Encryption [Song et al., IEEE SP 00]
• Secure indexes – encrypted Bloom filters [Goh, 03]
• Bilinear maps [Boneh et al., EuroCrypt 03]
• Order-Preserving Encryption (OPE) [Agrawal et al.,
SIGMOD 04]
• Modular-OPE [Boldyreva et al., CRYPTO 11]
• Conjunctive keyword search [Golle et al., ACNS 04]
• Encrypted inverted lists [Curtmola et al., CCS 06]
• Fully homomorphic encryption [Gentry, STOC 09]
• Onion encryption [Popa et al., SOSP 11]
• Dynamic Searchable Encryption [Cash et al., NDSS 14]
• PBTree [Li et al., VLDB 14]
• IBTree [Li et al., ICDE 17]
• Secret-Sharing Techniques
• Shamir’s secret-sharing [Shamir, CACM 79]
• Multi-Linear Secret-Sharing Schemes [Brickell et al., J. of
Cryptology 91, Bertilsson et al., AUSCRYPT 92]
• Verifiable secret sharing [Rabin et al., STOC 89]
• Proactive Secret Sharing [Herzberg et al., CRYPTO 95]
• Function Secret Sharing [Boyle et al., EUROCRYPT 15]
• Homomorphic secret sharing [Boyle et al. CRYPTO 16]
• Accumulating Automata [Dolev et al., TCS 19]
• Encryption-based Systems
• CryptDB [Popa et al., SOSP 11]
• Monomi [Tu et al., VLDB 13]
• Cipherbase [Arasu et al., CIDR 13]
• TrustedDB [Bajaj et al., IEEE TKDE 13]
• CorrectDB [Bajaj et al., VLDB 13]
• ZeroDB [Egorov et al., arxiv 16]
• MrCrypt [Tetali et al., OOPSLA 13]
• EncKV [Yuan et al., ASIACCS 17]
• Microsoft Always Encrypted
• Oracle 12c
• Amazon Aurora
• MariaDB
• Secret-Sharing-based Systems
• SSSDB [Avni et al., ALGOCLOUD 15]
• Splinter [Wang et al., NSDI 17]
• OBSCURE [Gupta et al, VLDB 19]
• Cybernetica
• Jana by Galois Inc.
• Partisia
• Secret Double Octopus
• SecretSkyDB Ltd
• PULSAR by Stealth Software Technologies Inc.
• Unbound Tech.

EmpID name DID
E1 Alice D1
E2 Bob D2
E3 Carl D1
Problems
DDID Dname
D1 Sale
D2 Coding
On the relations, execute the following in a secure manner:
1. Selection query
(e.g., SELECT * FROM employee WHERE name= ‘Alice’)
2. Join query
(e.g., SELECT * FROM employee INNER JOIN department ON employee.DID = department.DDID)
3. Aggregation query
(e.g., SELECT count(*) FROM employee WHERE DID=‘D1’)
employee department

ID Dept Comment
Id1 D1 W1
Id2 D1 W2
.
.
.
.
.
.
.
.
.
Idi Di Wi
Idk Dk Wk
Searchable Encryption: Ciphertext Generation
A relation
Wi Ek(Wi)
Ek():
Deterministic
encryption
Li Ri
Si Ti
ki = fk(Li)
Ti= fki(Si)
 Ciphertext (CT)
n-m bits m bits
n bits
Trapdoor for wi
Key generation
Partitioning the encrypted
word into two parts
Pseudorandom string
Reference: Dawn Xiaoding Song, David Wagner, and Adrian Perrig. “Practical techniques for searches on encrypted data.”
In Proceeding 2000 IEEE Symposium on Security and Privacy. S&P 2000, pp. 44-55. IEEE, 2000.

Searchable Encryption: Search at the Cloud
Ciphertext (CT)
Si Ti

Matching
or not???
CTLi CTRi
User provided values
Ek(Wi)
ki = fk(E1)
E1 E2
Partitioning the ciphertext
into two parts
Partitioning the encrypted
word into two parts
n-m bits m bitsn-m bits m bits
Ti= fki(Si)
Ek(Wi)
Data outsourcing method
Idea
A ⊕ B = C
A ⊕ C = B
B ⊕ C = A
Reference: Dawn Xiaoding Song, David Wagner, and Adrian Perrig. “Practical techniques for searches on encrypted data.”
In Proceeding 2000 IEEE Symposium on Security and Privacy. S&P 2000, pp. 44-55. IEEE, 2000.
Advantage
Does not reveal anything before query execution, unlike
deterministic encryption, which reveals the distribution of values even before any query is executed
Disadvantage
Requires a linear scan of the entire data, i.e., no index support
Question
Can we have indexable searchable encryption?
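The ciphertext generation and cloud-side matching steps described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the exact construction of Song et al.: HMAC-SHA256 stands in for the deterministic encryption Ek() and for the keyed functions f and F, the byte sizes N and M are arbitrary, and all function names are illustrative. Because a PRF replaces Ek(), decryption by the owner is not shown.

```python
import hashlib
import hmac

N, M = 32, 16  # assumed sizes in bytes: n = full block, m = check part


def prf(key: bytes, msg: bytes, out_len: int) -> bytes:
    """HMAC-SHA256 truncated to out_len bytes (stand-in for f and F)."""
    return hmac.new(key, msg, hashlib.sha256).digest()[:out_len]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def pre_encrypt(k_det: bytes, word: str) -> bytes:
    """Stand-in for the deterministic step Ek(Wi); a PRF, so not invertible."""
    return prf(k_det, word.encode(), N)


def encrypt_word(k_det: bytes, k_key: bytes, seed: bytes, position: int, word: str) -> bytes:
    x = pre_encrypt(k_det, word)                          # Ek(Wi) = Li || Ri
    k_i = prf(k_key, x[: N - M], 32)                      # ki = fk(Li)
    s_i = prf(seed, position.to_bytes(8, "big"), N - M)   # pseudorandom Si
    t_i = prf(k_i, s_i, M)                                # Ti = F_ki(Si)
    return xor(x, s_i + t_i)                              # CT = Ek(Wi) XOR (Si || Ti)


def trapdoor(k_det: bytes, k_key: bytes, word: str):
    """Values the user hands to the cloud: Ek(W) and ki."""
    x = pre_encrypt(k_det, word)
    return x, prf(k_key, x[: N - M], 32)


def server_match(ciphertext: bytes, x: bytes, k_i: bytes) -> bool:
    """Cloud side: CT XOR Ek(W) should have the form (S || F_ki(S))."""
    st = xor(ciphertext, x)
    s, t = st[: N - M], st[N - M:]
    return hmac.compare_digest(t, prf(k_i, s, M))


def linear_scan(ciphertexts, x: bytes, k_i: bytes):
    """The cloud must test every ciphertext: there is no index."""
    return [server_match(ct, x, k_i) for ct in ciphertexts]
```

The check relies on the XOR identities listed on the slide: XORing the stored ciphertext with the user-provided Ek(W) recovers (Si || Ti) only when the word matches. The `linear_scan` helper makes the stated disadvantage visible, since every stored ciphertext is touched per query.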

• The cloud maintains an index
• User sends keywords and the cloud traverses the index to answer the query
• Issues:
• Index Generator: Who will create an index – the DB owner vs server?
• Most work assumes that the DB owner creates the index
• Index Traverse: Interactive vs non-interactive – can the cloud traverse the index on
its own?
• Index Update: Can the cloud update the index?
• Techniques:
• Early approaches: The DB owner generated, interactive traversal, non-updateable
• Exploit oblivious techniques
• Implemented by Stealth Software Technologies, Inc.
• Recent: PB-Tree: The DB owner generated, non-interactive traversal, updateable
Indexable Searchable Encryption
Reza Curtmola, Juan Garay, Seny Kamara, and Rafail Ostrovsky: Searchable Symmetric Encryption: Improved Definitions and Efficient Constructions.
Our
focus

• Consider
• A number n
• The number of bits needed to represent the number n in binary form = w
• Prefix family will contain w+1 items
• Prefix family
• Consider a number 6
• 6 in 5-bit binary = (00110)
• Prefix family of 6 is F(6) = {00110, 0011*, 001**, 00***, 0****,*****}
• What will a node of the index contain?
• Leaf node: Prefix family of one of the data items
• Other nodes: Union of prefix families of their child nodes
Indexable Searchable Encryption
Reference:
Li, Rui, Alex X. Liu, Ann L. Wang, and Bezawada Bruhadeshwar. "Fast range query processing with strong privacy protection for cloud computing." Proceedings of the VLDB Endowment 7, no. 14 (2014): 1953-1964.
Rui Li and Alex X. Liu. "Adaptively secure conjunctive query processing over encrypted data for cloud computing." In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 697-708. IEEE, 2017.
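The prefix-family definition above is easy to check with a few lines of Python. This is only an illustrative sketch; the function name `prefix_family` is an assumption and does not come from the cited papers.

```python
def prefix_family(n: int, w: int) -> list:
    """Return the w+1 prefixes of n written as a w-bit binary string."""
    if not 0 <= n < 2 ** w:
        raise ValueError("n does not fit in w bits")
    bits = format(n, "0{}b".format(w))
    # Replace the last i bits by '*' for i = 0..w.
    return [bits[: w - i] + "*" * i for i in range(w + 1)]
```

For n = 6 and w = 5 this reproduces the family F(6) listed on the slide, from `00110` down to `*****`.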

• Step 1:
• Find prefix family of all
such numbers
• Step 2:
• Allocate the prefix family
to the root node
• Step 3:
• Divide the number in the
given node until a node
contains prefix family of
one of the given numbers
Create Index using Prefix Family: Top-Down Way
F(1), F(6), F(7), F(9), F(11), F(12), F(13), F(16), F(20), F(25)
F(1), F(6), F(7), F(16), F(20) F(9), F(11), F(12), F(13), F(25)
F(1), F(6), F(7) F(16), F(20) F(12), F(13), F(25)
F(6),F(7)
F(1)F(6) F(7) F(20) F(16)
F(12), F(13)
F(12) F(13) F(25) F(9) F(11)
F(9), F(11)
Create index on the following numbers
1, 6, 7, 9, 11, 12, 13, 16, 20, 25

• Step 1:
• User creates prefix family of 6 and sends to the cloud
• Step 2:
• The cloud starts from the root node to find the prefix family of the given query
Execute a Point Query using the Index
F(1), F(6), F(7), F(9) ,F(11), F(12), F(13), F(16), F(20), F(25)
F(1), F(6), F(7), F(16), F(20) F(9), F(11), F(12), F(13), F(25)
F(1), F(6), F(7) F(16), F(20) F(12), F(13), F(25)
F(6),F(7)
F(1)F(6) F(7) F(20) F(16)
F(12), F(13)
F(12) F(13) F(25) F(9) F(11)
F(9), F(11)
Query: Find 6
F(6) 
Reference:
A
B C
D E
F
G
H
I
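The top-down index construction and the point query can be sketched on cleartext prefix sets. This is a simplified sketch under assumptions: nodes store plain sets instead of Bloom filters, the split rule (halving the list) is arbitrary and differs from the partition drawn in the figure, and the names `Node`, `build`, and `point_query` are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def prefix_family(n: int, w: int) -> list:
    bits = format(n, "0{}b".format(w))
    return [bits[: w - i] + "*" * i for i in range(w + 1)]


@dataclass
class Node:
    prefixes: frozenset                 # union of prefix families below this node
    value: Optional[int] = None         # set only on leaves
    children: List["Node"] = field(default_factory=list)


def build(values: list, w: int) -> Node:
    """Split the values until each leaf holds the prefix family of one value."""
    if len(values) == 1:
        return Node(frozenset(prefix_family(values[0], w)), values[0])
    mid = len(values) // 2
    kids = [build(values[:mid], w), build(values[mid:], w)]
    return Node(frozenset().union(*(k.prefixes for k in kids)), None, kids)


def point_query(node: Node, query_family: frozenset) -> list:
    """Descend only into nodes that contain the whole queried prefix family."""
    if not query_family <= node.prefixes:
        return []
    if node.value is not None:
        return [node.value]
    return [v for child in node.children for v in point_query(child, query_family)]
```

With the ten numbers from the slide, querying with F(6) reaches exactly the leaf for 6, while a value that is absent (for example 8) is rejected already at the root because its full-length prefix is in no node.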

Execute a Range Query
F(1), F(6), F(7), F(9) ,F(11), F(12), F(13), F(16), F(20), F(25)
F(1), F(6), F(7), F(16), F(20) F(9), F(11), F(12), F(13), F(25)
F(1), F(6), F(7) F(16), F(20) F(12), F(13), F(25)
F(6),F(7)
F(1)F(6) F(7) F(20) F(16)
F(12), F(13)
F(12) F(13) F(25) F(9) F(11)
F(9), F(11)
Query: Find all numbers between [0,8]
• Step 1:
• Represent the endpoints of the range predicate as prefix families
• F(0)= {00000, 0000*, 000**, 00***, 0****,*****}
• F(8) = {01000, 0100*, 010**, 01***, 0****,*****}
• Step 2:
• Minimum set of prefixes such that union of prefixes cover the range
• {00***,01000}
• Step 3:
• Check node for
minimum set of
prefixes
000 → 0
001 → 1
010 → 2
011 → 3
100 → 4
101 → 5
110 → 6
111 → 7
A
B C
D E
F
G
H I
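Step 2, finding the minimum set of prefixes whose union covers the range, can be sketched as a recursive descent over the binary trie of w-bit numbers. This is an illustrative sketch; the function names are assumptions rather than names from the PBtree paper.

```python
def prefix_family(n: int, w: int) -> list:
    bits = format(n, "0{}b".format(w))
    return [bits[: w - i] + "*" * i for i in range(w + 1)]


def range_to_prefixes(lo: int, hi: int, w: int) -> list:
    """Minimal list of prefixes covering every integer in [lo, hi]."""
    if not 0 <= lo <= hi < 2 ** w:
        raise ValueError("range must lie within w-bit numbers")
    out = []

    def cover(prefix_bits: str) -> None:
        span = w - len(prefix_bits)            # number of wildcard bits
        start = (int(prefix_bits, 2) << span) if prefix_bits else 0
        end = start + (1 << span) - 1
        if end < lo or start > hi:             # disjoint from the range
            return
        if lo <= start and end <= hi:          # fully inside: emit one prefix
            out.append(prefix_bits + "*" * span)
            return
        cover(prefix_bits + "0")
        cover(prefix_bits + "1")

    cover("")
    return out


def in_range_by_prefix(value: int, cover_set: set, w: int) -> bool:
    """A value is in the range iff its prefix family meets the cover."""
    return bool(cover_set & set(prefix_family(value, w)))
```

For the range [0,8] with w = 5 this yields the two prefixes shown on the slide. A data item matches the range exactly when its prefix family shares a prefix with that set, which is the membership test a tree node performs in Step 3.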

• Indistinguishability
• Use Bloom filters
• Any prefix will be hashed to r locations using HMAC with r keys
• Node Indistinguishability
• Associate each node v with a random number v.R, then hash r times as follows:
• HMAC(k1, v.R, p), …, HMAC(kr, v.R, p)
• Reverse engineering
• An adversary can do reverse engineering to create PB-tree after observing many
queries or by asking queries
• How to solve this issue?
• IB-Tree
Issues with the Index (PB-Tree)
Reference:
Two nodes may contain overlapping prefix families
F(6) = {00110, 0011*, 001**, 00***, 0****,*****}
F(7) = {00111, 0011*, 001**, 00***, 0****,*****}
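The Bloom-filter encoding with per-node randomness can be sketched as follows. This is a hedged sketch, not the PBtree implementation: it assumes one plausible way to combine the pieces, namely that the user sends HMAC(k_i, p) and the node-specific position is derived by a second HMAC keyed with v.R, and the filter size `M_BITS` and all names are illustrative.

```python
import hashlib
import hmac
import secrets

M_BITS = 1024  # assumed Bloom filter size in bits


def h(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()


def trapdoor(keys: list, prefix: str) -> list:
    """User side: one keyed hash per secret key k_1..k_r."""
    return [h(k, prefix.encode()) for k in keys]


class NodeFilter:
    """Bloom filter of one tree node v, salted with a random v.R."""

    def __init__(self, keys: list, prefixes: list):
        self.r = secrets.token_bytes(16)       # v.R, stored with the node
        self.bits = bytearray(M_BITS // 8)
        for p in prefixes:
            for t in trapdoor(keys, p):
                pos = self._pos(t)
                self.bits[pos // 8] |= 1 << (pos % 8)

    def _pos(self, t: bytes) -> int:
        # Second hash keyed by v.R: the same prefix maps to different
        # positions in different nodes.
        return int.from_bytes(h(self.r, t), "big") % M_BITS

    def might_contain(self, trap: list) -> bool:
        """Cloud side: needs only the trapdoor and v.R, not the keys."""
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in map(self._pos, trap))
```

With this layout, two nodes holding F(6) and F(7) both answer positively for the shared prefix `0011*`, yet the bit positions set for that prefix differ between the nodes, so overlapping families are not visible from the filters alone. As with any Bloom filter, `might_contain` can return false positives.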

• Indistinguishable Bloom Filter
• Twin cell: 0 and 1
• For i-th location, which cell stores 1: HMAC(kr+1, i) ⊕ rB
• rB is a random number for IBF B.
• IB-Tree
• A tree like PBtree, but all nodes use Indistinguishable Bloom Filter
Searchable Encryption for Adaptive Adversary
Reference and slide credit: Rui Li and Alex X. Liu. "Adaptively secure conjunctive query processing over
encrypted data for cloud computing." In 2017 IEEE 33rd International Conference on Data Engineering
(ICDE), pp. 697-708. IEEE, 2017.
Selected
Unselected

Name :=John and Age=[1,15]
Name :John,
Age:0001
NR
N32
N11
N21
N31
d2d1
N34
N22
N33
d4d3
N36
N12
N23
N35
d6d5
N37
d7
Name :John,
Age:001*
Name :John,
Age:01**
Name :John,
Age:1***
U U U
Processing Range Queries on IB Tree
Slide credit: Alex Liu: Adaptively Secure Conjunction Query
Processing over Encrypted Data for Cloud Computing.
Minimum set of prefixes such that union of prefixes cover the range

•Secure execution of selection queries
• Point and range queries
• Indexable vs non-indexable
•What about join and aggregation?
What we have discussed so far?

Bucketization
NAME SALARY
John 54500
Mary 111029
James 95300
Lisa 145000
E-tuple Bucket_id
fErf!$Q!! Xr2k%s
F%%3w& 11vb$$
&%gfsdf$ bbcr3@
%%33w& Xxrty*
Q: SELECT name FROM EMPLOYEE
WHERE salary ≥ 90k AND salary < 110k
false positive
Q: SELECT name FROM EMPLOYEE
WHERE Bucket_id = bbcr3@
OR Bucket_id = 11vb$$
Bucket ID
30k – 50k 1bx!23
50k – 70k Xr2k%s
70k – 80k Rtes12!
80k – 90k Cvtr^e
90k – 100k bbcr3@
100k – 115k 11vb$$
115k – 130k 23wqa%
130k – 160k Xxrty*
Pros
• Generality: allows large class of predicates to be evaluated (most of SQL)
• Efficient implementation: index
Cons
• Incurs overhead on client: pruning of false positives
Database Owner Site Cloud Site
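The query rewriting and the client-side pruning of false positives can be sketched in Python using the bucket table from the slide. This is a simplified sketch under assumptions: encryption is omitted, so `OWNER_KEYSTORE` is a hypothetical stand-in for decrypting an E-tuple at the owner, Lisa's salary is read as 145000, and all function names are illustrative.

```python
# (low, high, bucket id): bucket covers low <= salary < high (boundaries assumed half-open).
BUCKETS = [
    (30_000, 50_000, "1bx!23"),
    (50_000, 70_000, "Xr2k%s"),
    (70_000, 80_000, "Rtes12!"),
    (80_000, 90_000, "Cvtr^e"),
    (90_000, 100_000, "bbcr3@"),
    (100_000, 115_000, "11vb$$"),
    (115_000, 130_000, "23wqa%"),
    (130_000, 160_000, "Xxrty*"),
]

# Stand-in for decryption at the owner: opaque E-tuple -> cleartext row.
OWNER_KEYSTORE = {
    "fErf!$Q!!": ("John", 54_500),
    "F%%3w&": ("Mary", 111_029),
    "&%gfsdf$": ("James", 95_300),
    "%%33w&": ("Lisa", 145_000),
}


def bucket_id(salary: int) -> str:
    for low, high, bid in BUCKETS:
        if low <= salary < high:
            return bid
    raise ValueError("salary outside all buckets")


# What the cloud stores: (E-tuple, Bucket_id) pairs only.
CLOUD_TABLE = [(et, bucket_id(row[1])) for et, row in OWNER_KEYSTORE.items()]


def rewrite_range(lo: int, hi: int) -> list:
    """Owner side: buckets overlapping the predicate lo <= salary < hi."""
    return [bid for low, high, bid in BUCKETS if low < hi and high > lo]


def cloud_select(bucket_ids: list) -> list:
    """Cloud side: return every E-tuple whose bucket id was requested."""
    wanted = set(bucket_ids)
    return [et for et, bid in CLOUD_TABLE if bid in wanted]


def owner_filter(etuples: list, lo: int, hi: int) -> list:
    """Owner side: decrypt and prune false positives."""
    rows = [OWNER_KEYSTORE[et] for et in etuples]
    return [name for name, salary in rows if lo <= salary < hi]
```

Running the slide's query (salary ≥ 90k and salary < 110k) requests the two buckets shown in the rewritten query; the cloud returns two E-tuples, and the owner discards Mary's row because 111029 falls in a requested bucket but outside the predicate. That discarded row is the false positive marked on the slide.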

• Buckets’ impact
• Query execution overhead
• Security
• Security metrics
• How large is the span of the bucket? – larger the better
• How are the frequencies distributed? More uniform the better
• Cost metrics
• How many false positives are generated for a predicate?
• What is the storage overhead due to metadata ?
• Improving security
• Introduce randomness to increase security level
Bucketization
Reference: Hakan Hacigümüş, Bala Iyer, Chen Li, and Sharad Mehrotra. "Executing SQL over encrypted data in the database-service-provider model." In Proceedings of the 2002 ACM SIGMOD international conference on
Management of data, pp. 216-227. ACM, 2002.

• We can do
• Joins at the cloud-side based on bucket-ids
• But with computational overhead at the DB owner due to filtering
• Can we avoid computation overheads at the DB owner in join
operation?
• Precompute join operation before outsourcing the data
What We Have Seen in Bucketization?

• Represent data in different format
• Execute join among tables before outsourcing
Precomputed Joins: Different Representation of Datasets
Slide credit: Seny Kamara and Tarik Moataz:
SQL on Structurally-Encrypted Databases

Data Outsourcing
Precomputed Joins
Reference and slide credit: Seny Kamara and Tarik Moataz. "SQL on structurally-encrypted databases." In International Conference on the Theory and Application of Cryptology and Information Security,
pp. 149-180. Springer, Cham, 2018.
Join
ProjectionSelectionSelection Projection
Disadvantages:
1. Joins are precomputed
2. Aggregation queries cannot be executed at the cloud
3. Complex queries cannot be solved at the cloud

• Non-indexable Searchable Encryption
• Indexable Searchable Encryption for point and range queries
• Bucketization for join, aggregation, and most of SQL
• Precomputed joins
•Is there any system based on these techniques or based
on encryption?
What we have seen so far?

• CryptDB
• Monomi
• Arx
• Cipherbase
• TrustedDB
• CorrectDB
• SDB
• EncKV
Encryption-based Systems
• ZeroDB
• MrCrypt
• Crypsis
• Microsoft Always Encrypted
• Oracle 12c
• Amazon Aurora
• MariaDB

• Can be seen as a two-column table
• One column for key
• Another column for value
• Also, they can store complicated relational table in this format
• Example:
• Person database:
• Key: Person ID, Value: Person record
• Key: City, Value: PersonID
• Key: PersonID, Value: Name
Key-Value Store
Person id name age city
1001 alice 20 LA
1002 bob 25 LA
1003 tom 20 NY
Key = City
Value = PersonID
LA 1001
LA 1002
NY 1003
Key = PersonID
Value = Name
1001 alice
1002 bob
1003 tom

EncKV: Encrypted Key-Value Store
1
5
2
4
3
6
Server
Hash on Row Id to allocate key-value pair to a server
LA 1001
LA 1002
NY 1003
city Person
id
Person
Id
Name
Reference and Slide credit: Xingliang Yuan, Yu Guo, Xinyu Wang, Cong Wang, Baochun Li, and Xiaohua Jia. “Enckv: An encrypted key-value store with rich queries.”
In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 423-435. ACM, 2017.
H(Gk(city||LA||i),2) Enck(1002)
H(Gk(name||bob||i),1) Enck(1002)
H(Gk(city||NY||i),1) Enck(1003)
… …
Server i
Attribute
Name
Attribute
Value
Server i
Value
Occurrence
Double
encryption
1001 alice
1002 bob
1003 tom

EncKV: Encrypted Key-Value Store
H(Gk(name||bob||i),1) Enck(1002)
H(Gk(city||NY||i),1) Enck(1003)
… …
Enck(1001)
Enck(1002)
2
Encrypted exact match Index
Pk(name||1002) Enck(bob)
Pk(age||1001) Enck(20)
Pk(name||1001) Enck(alice)
… …
Server i
Encrypted data records
select “name” where “city=LA”
Gk(city||LA||i)1
Pk(name||1001)
Pk(name||1002)
3
Enck(bob)
Enck(alice)
4
• Observations:
• The server does not learn whether the index entries of two different values belong to the same attribute
or not before query execution, e.g., H(Gk(city||LA||i),1) and H(Gk(city||NY||i),1) for LA and NY.
• At any two servers, the index entries for the same attribute are different, e.g., H(Gk(city||LA||i),1) and
H(Gk(city||LA||j),1) for server i and j.
P, H,G: PRF
Reference and Slide credit: Xingliang Yuan, Yu Guo, Xinyu Wang, Cong Wang, Baochun Li, and
Xiaohua Jia. “Enckv: An encrypted key-value store with rich queries.” In Proceedings of the 2017 ACM
on Asia Conference on Computer and Communications Security, pp. 423-435. ACM, 2017.
Double
encryption
Communication Overhead
If more than 1M people from LA in the table???
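The two-round flow on this slide (index lookup, then record fetch) can be sketched for a single server i. This is a toy sketch under loudly stated assumptions, not the EncKV implementation: HMAC-SHA256 plays the role of the PRFs G, H, and P, `enc`/`dec` are a toy XOR-pad cipher standing in for Enck and must not be used in practice, and every function name is illustrative.

```python
import hashlib
import hmac
import secrets


def prf(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def enc(key: bytes, plaintext: bytes) -> bytes:
    """TOY stand-in for Enck: random nonce plus PRF-derived pad (max 32 bytes)."""
    if len(plaintext) > 32:
        raise ValueError("toy cipher handles at most 32 bytes")
    nonce = secrets.token_bytes(16)
    return nonce + xor(plaintext, prf(key, b"enc" + nonce))


def dec(key: bytes, ciphertext: bytes) -> bytes:
    nonce, body = ciphertext[:16], ciphertext[16:]
    return xor(body, prf(key, b"enc" + nonce))


def build_server(k_g: bytes, k_p: bytes, k_e: bytes, server_id: int, rows: dict):
    """Owner side: build the exact-match index and the encrypted records."""
    index, records, counters = {}, {}, {}
    for rid, row in rows.items():
        for attr, val in row.items():
            token = prf(k_g, "{}||{}||{}".format(attr, val, server_id).encode())
            c = counters[token] = counters.get(token, 0) + 1
            index[prf(token, str(c).encode())] = enc(k_e, str(rid).encode())   # H(G(...), c)
            records[prf(k_p, "{}||{}".format(attr, rid).encode())] = enc(k_e, val.encode())
    return index, records


def server_lookup(index: dict, token: bytes) -> list:
    """Server side: probe H(token, 1), H(token, 2), ... until a miss."""
    out, c = [], 1
    while prf(token, str(c).encode()) in index:
        out.append(index[prf(token, str(c).encode())])
        c += 1
    return out


def select_name_where_city(k_g, k_p, k_e, server_id, index, records, city: str) -> list:
    token = prf(k_g, "city||{}||{}".format(city, server_id).encode())   # round 1
    rids = [int(dec(k_e, ct).decode()) for ct in server_lookup(index, token)]
    labels = [prf(k_p, "name||{}".format(rid).encode()) for rid in rids]  # round 2
    return [dec(k_e, records[label]).decode() for label in labels]
```

Note that the client receives and decrypts one encrypted row id per matching record before it can ask for the names, which is the communication overhead the slide points out for very frequent values such as a city with more than 1M residents.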

• Some encryption techniques are fast, but reveal information
• Deterministic encryption is fast but reveals distribution of values
• Order-Preserving encryption (OPE) is fast but reveals order of the values
• Searchable encryption is fast but secure only until a query is executed; after that, it
reveals data distribution or order of the values
• Bucketization is more secure as compared to above techniques and fast
• Retrieve more items
• Require client-side processing
• CryptDB is fast but insecure, due to using deterministic encryption and OPE
• Open issues:
• Need a fast and secure encryption technique that can support
different types of SQL queries
• Need an index that a cloud can build
Pros and Cons of Encryption-based Techniques

Secret-Sharing-based Data
Outsourcing

•Encryption techniques are computationally secure
• A powerful adversary can break the encryption technique
• Google and CWI Amsterdam, with sufficient computational resources, produced a practical SHA-1 collision (https://shattered.io/)
•Information-theoretical security
• Secure regardless of the computational power of an
adversary
• Quantum secure
Why Secret-Sharing?

Shamir’s Secret-Sharing (SSS) [Shamir79] – Key Idea
• One point → Infinite number of lines
• Two points → Only one line
• Where f(0) is the secret
• Alice wants to share her secret value 5 to Bob and Carl
• Bob and Carl do not communicate with each other
• Impact of degree of the polynomial vs security
• 𝑓 servers collude → polynomial degree should be 𝑓 + 1
• Servers do not collude → a polynomial of the degree 1
• Fault tolerant
• Due to creating multiple shares
Reference: Adi Shamir. “How to share a secret.” Communications of the ACM 22, no. 11 (1979): 612-613.

Shamir’s Secret-Sharing (SSS)
Secret
S
Secret Owner Non-Communicating Public Servers
s1
s2
s3
s4
Mathematical operations
f(x) = S + ax
Each server
cannot learn
the secret S
Secret-Share Creation:
e.g., under the assumption that
no server will collude
Reference: Adi Shamir. “How to share a secret.” Communications of the ACM 22, no. 11 (1979): 612-613.

Secret
S
s1
s2
s3
s4
Lagrange Interpolation
Secret Reconstruction
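Share creation with f(x) = S + ax and reconstruction by Lagrange interpolation can be sketched as below. This is a minimal sketch under assumptions: arithmetic is done modulo the prime 2^61 - 1 (Shamir's scheme works over a finite field; the specific prime is chosen here only for illustration), servers are identified by x = 1..n, and the function names are illustrative.

```python
import secrets

P = 2 ** 61 - 1  # a Mersenne prime, used as the field modulus (assumed choice)


def make_shares(secret: int, degree: int, n: int) -> list:
    """Evaluate a random polynomial with f(0) = secret at x = 1..n."""
    coeffs = [secret % P] + [secrets.randbelow(P) for _ in range(degree)]
    return [
        (x, sum(c * pow(x, j, P) for j, c in enumerate(coeffs)) % P)
        for x in range(1, n + 1)
    ]


def reconstruct(shares: list) -> int:
    """Lagrange interpolation at x = 0 over the field Z_P."""
    total = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total
```

With a degree-1 polynomial, any two of the four shares determine the line and therefore f(0), while a single share is consistent with every possible secret. Creating more shares than the minimum is also what gives the fault tolerance mentioned earlier.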


• Similar to Order-Preserving Encryption (OPE)
• If cleartext values have a relation, such as 𝒂 < 𝒃, then
• S(a) < S(b)
• Efficient for maximum/minimum and range queries
Order-Preserving Secret-Sharing
Reference: Fatih Emekci, Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. “Dividing secrets to secure data outsourcing.” Information Sciences 263 (2014): 198-210.

Computing over Secret Shared Data
Secret Sharing
Communicating Servers
(Jana and Sharemind)
Non-communicating
servers (SSSDB, OBSCURE)
• Selection and aggregation queries
• Significant communication overheads
amongst servers
• Selection and aggregation queries
Our
focus

• Outsource the above relation using Shamir’s secret-sharing
• Add all secret-shared values of ‘Salary’ attributes
• Exploit additive homomorphic property
Simple Aggregation using Secret-Shared Data
EmpID Name Salary Dept
E101 John 1000 Testing
E101 John 100000 Security
E102 Adam 5000 Testing
E103 Eve 2000 Design
SELECT SUM(Salary) FROM Employee
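The additive homomorphic property means each server can add its own shares locally, and the sums are themselves shares of the total. A minimal sketch, assuming three non-communicating servers, degree-1 polynomials, and a prime field modulus chosen only for illustration (function names are likewise illustrative):

```python
import secrets

P = 2 ** 61 - 1  # assumed prime field modulus


def make_shares(secret: int, degree: int, n: int) -> list:
    coeffs = [secret % P] + [secrets.randbelow(P) for _ in range(degree)]
    return [sum(c * pow(x, j, P) for j, c in enumerate(coeffs)) % P for x in range(1, n + 1)]


def reconstruct(ys: list) -> int:
    """Lagrange interpolation at 0; ys[i] is the value held by server x = i + 1."""
    total = 0
    for i, yi in enumerate(ys):
        xi, num, den = i + 1, 1, 1
        for j in range(len(ys)):
            xj = j + 1
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def outsource(column: list, n_servers: int) -> list:
    """Return one list of shares per server (a fresh polynomial per value)."""
    per_value = [make_shares(v, 1, n_servers) for v in column]
    return [[shares[s] for shares in per_value] for s in range(n_servers)]


def server_sum(local_shares: list) -> int:
    """Each server adds its shares without talking to the others."""
    return sum(local_shares) % P
```

For the Salary column of the relation above, each server returns a single number, and interpolating the three returned numbers yields the sum of the four salaries. No server sees any individual salary or the total.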

•Aggregation with complex selection obliviously, i.e.,
access-pattern hiding
•Complex Selection Query Execution
•Join Query Execution
Challenges

• The DB owner keeps each polynomial,
which was used to create database
shares
• To execute a query, the DB owner
creates shares of the query predicate
and fetches the desired value from the
clouds
• Very fast
• Access-pattern attack
• Distribution revealing
DB Owner Assisted Query Execution
Reference: Fatih Emekci, Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. “Dividing secrets to secure data outsourcing.” Information Sciences 263 (2014): 198-210.

• How to search on secret-shared outsourced data
• Without remembering the polynomials that were used to create the
dataset
• Otherwise, the DB owner might as well store the entire dataset
• Supporting multiple-DB owners
Big Question
Solution
Non-interactive string-matching over the secret-shared data

Step 1: Unary representation
Step 2: Creating secret-shares of unary represented data
Step 3: Outsourcing the data
String Matching over Secret-Shared Data
A
B
C
1, 0, 0
0, 1, 0
0, 0, 1
Polynomials
Secret-shares
Secret-shares
Secret-shares
Reference: Dolev et al. Accumulating automata and cascaded equations automata for communicationless information theoretically secure multi-party computation, TCS 2019.

String-Matching over Secret-Shared Data
Secret-Share Creation by the DB owner
B
0
1
0
0 + 5x
1 + 9x
0 + 2x
5
10
2
10
19
4
15
28
6
This represents B → (0, 1, 0) in secret-shared form
The adversary cannot learn the actual value, B
Dolev et al. Accumulating automata and cascaded equations automata for communicationless information theoretically secure multi-party computation, 2019.

5
10
2
10
19
4
15
28
6
User wants
to search
for
B0
1
0
0 + x
1 + 2x
0 + 4x
No need to share any
polynomial b/w the DB
owner and the user
1
3
4
2
5
8
3
7
12
Secret-Share
Creation by
the user
These shares represent
B → (0, 1, 0)
in secret-shared form
The adversary cannot
learn the actual value,
B, of either the dataset
or the query predicate
Dolev et al. Accumulating automata and cascaded equations automata for communicationless information theoretically secure multi-party computation, 2019.

5
10
2
10
19
4
15
28
6
1
3
4
2
5
8
3
7
12
Cloud
operations:
Multiplication
and addition of
shares
5
30
8
20
95
32
45
196
72
43
147
313
User wants
to search
for
B
Lagrange
interpolation
Answer = 1
This is the multiplication of [0,1,0]
and [0,1,0] in secret-shared form.
So using SSS, we are hiding 1 or 0
from the adversary.
Each cloud sends only one value to the user,
regardless of dataset size →
Less communication cost
Dolev et al. Accumulating automata and cascaded
equations automata for communicationless
information theoretically secure multi-party
computation, 2019.
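The three slides above can be replayed as a small sketch. It reuses the slide's example polynomials (slopes 5, 9, 2 for the DB owner and 1, 2, 4 for the user) and evaluation points x = 1, 2, 3; the function names are illustrative assumptions, and plain integers stand in for the finite-field arithmetic and random coefficients a real deployment of Shamir's scheme would use.

```python
from fractions import Fraction

ALPHABET = "ABC"  # toy alphabet from the slides


def unary(symbol):
    """Unary (one-hot) vector of a symbol, e.g. 'B' -> [0, 1, 0]."""
    return [1 if s == symbol else 0 for s in ALPHABET]


def make_shares(bits, slopes, xs):
    """Share each bit with the degree-1 polynomial bit + slope*x; one row per server."""
    return [[b + a * x for b, a in zip(bits, slopes)] for x in xs]


def server_match(data_share, query_share):
    """Run locally at one server: position-wise multiply, then add."""
    return sum(d * q for d, q in zip(data_share, query_share))


def interpolate_at_zero(xs, ys):
    """Lagrange interpolation of the constant term f(0)."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= Fraction(-xj, xi - xj)
        total += term
    return total


xs = (1, 2, 3)
data_shares = make_shares(unary("B"), slopes=[5, 9, 2], xs=xs)   # DB owner
query_shares = make_shares(unary("B"), slopes=[1, 2, 4], xs=xs)  # user
results = [server_match(d, q) for d, q in zip(data_shares, query_shares)]
answer = interpolate_at_zero(xs, results)
print(results, answer)  # [43, 147, 313] 1
```

Each server returns a single number, and the product of two degree-1 polynomials has degree 2, which is why three servers suffice here. Searching for a different symbol, such as A, interpolates to 0.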
Can we use this string-matching technique for solving
other operations such as selection and aggregation?
V1 V2

• Based on string-matching techniques explained previously
• Supporting database outsourcing using SSS
• Execute complex selection (conjunctive and disjunctive) in an
oblivious manner
• No communication among servers
• Minimize work at the database owner site
• Result Verification Methods
• Count, Sum, Maximum, Minimum, Top-K
• Tuple verification
OBSCURE: Oblivious and Verifiable Aggregation Queries
Reference: Peeyush Gupta, Yin Li, Sharad Mehrotra, Nisha Panwar, Shantanu Sharma, and Sumaya Almanee. “Obscure: Information-theoretic oblivious and verifiable aggregation queries.”
Proceedings of the VLDB Endowment 12, no. 9 (2019): 1030-1043.

OBSCURE: Data Outsourcing using OBSCURE
Employee Relation (cleartext)
EmpID  Name  Salary
E101   John  1000
E101   John  100000
E102   Adam  5000
E103   Eve   1000

E1: Create shares using SSS
EmpID  Name  Salary  TID  Index
E101   John  1000    3    3
E101   John  100000  2    2
E102   Adam  5000    5    5
E103   Eve   1000    4    4
For verification purpose

E2: Create shares using OP-SS
CleartextTID  SSTID  Salary
5             5      5000
4             4      1000
3             3      1000
2             2      100000
Only the order of values is revealed, but which row has the highest value is not revealed.
Fast answering of maximum-finding queries.

• Step 1: Convert query predicates to secret-share representation
• Step 2: Send secret-shares query predicate to the servers
OBSCURE: Conjunctive Count Query
select count(*) from Employee where Name = ‘John’ and Salary = 1000
String-matching operation over secret-shares, one per query predicate:
Name  Query predicate  Answer (V1)  Salary  Query predicate  Answer (V2)  V1 × V2
John  John             1            1000    1000             1            1
John  John             1            100000  1000             0            0
Adam  John             0            5000    1000             0            0
Eve   John             0            1000    1000             1            0
Multiply the answers of the string-matching operations row by row, then add: final answer to the query = 1
Multiplication increases the degree of the polynomial
If we have a smaller number of servers than the desired
number of servers, then we can still solve the problem by
1. Increasing communication rounds
2. Increasing computation time
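A minimal sketch of why multiplication raises the number of servers needed. It assumes, as the slide's figure does, that each per-row match bit is already available as a degree-1 share; the fixed slopes are only for reproducibility (they would be random in practice), and the helper names are hypothetical.

```python
from fractions import Fraction


def interpolate_at_zero(xs, ys):
    """Lagrange interpolation of f(0) from the points (xs[i], ys[i])."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= Fraction(-xj, xi - xj)
        total += term
    return total


def conjunctive_count(match_columns, xs):
    """match_columns[p][r] is the 0/1 match bit of predicate p on row r."""
    n_rows = len(match_columns[0])
    server_sums = []
    for x in xs:  # each server works on its own shares only
        total = 0
        for r in range(n_rows):
            product = 1
            for p, column in enumerate(match_columns):
                slope = 1 + p + 2 * r  # fixed here; random in practice
                product *= column[r] + slope * x  # share of the match bit at x
            total += product  # add across rows
        server_sums.append(total)
    return interpolate_at_zero(xs, server_sums)


name = [1, 1, 0, 0]    # Name = 'John'
salary = [1, 0, 0, 1]  # Salary = 1000
age = [1, 1, 0, 1]     # Age = 40

two_predicates = conjunctive_count([name, salary], xs=(1, 2, 3))
three_predicates = conjunctive_count([name, salary, age], xs=(1, 2, 3, 4))
too_few_shares = conjunctive_count([name, salary, age], xs=(1, 2, 3))
print(two_predicates, three_predicates, too_few_shares)
```

With k multiplied predicates the summed polynomial has degree k, so k + 1 servers are needed; with only three servers the three-predicate query interpolates to a wrong value, which is the situation the extra rounds or user-side work are meant to handle.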
V1 V2

OBSCURE: Count Query – Security Guarantees
• Identical operations on each row → Oblivious execution
• Hide access-patterns: The adversary cannot learn which rows have satisfied the query
• The adversary cannot learn anything
• By observing the values of the data and query predicates, since all values are secret-shared
• No output-size attack

Impact of #Shares – Conjunctive Count Query
select count(*) from Emp where
Name = ‘John’ and Salary = 1000 and Age = 40
Name  Predicate  V1  Salary  Predicate  V2  Age  Predicate  V3  V1 × V2 × V3
John  John       1   1000    1000       1   40   40         1   1
John  John       1   100000  1000       0   40   40         1   0
Adam  John       0   5000    1000       0   50   40         0   0
Eve   John       0   1000    1000       1   40   40         1   0
Multiply row by row, then add: count = 1
Polynomial degree = 3
• Minimum number of shares needed to interpolate a polynomial of degree 3
• Need four shares

Impact of #Shares – Conjunctive Count Query
select count(*) from Emp where
Name = ‘John’ and Salary = 1000 and Age = 40
• What if you have only three shares?
• Compute the result of any two predicates at the servers, e.g., Salary = 1000 and Age = 40
• And execute the remaining query at the user side
Name  Predicate  V1  Salary  Predicate  V2  Age  Predicate  V3  V' = V2 × V3
John  John       1   1000    1000       1   40   40         1   1
John  John       1   100000  1000       0   40   40         1   0
Adam  John       0   5000    1000       0   50   40         0   0
Eve   John       0   1000    1000       1   40   40         1   1
Multiply at the servers: V' = V2 × V3; V1 is combined with V' at the user side

OBSCURE: Count Query Result Verification
EmpID  Name  Salary  TID  Index
E101   John  1000    3    3
E101   John  100000  2    2
E102   Adam  5000    5    5
E103   Eve   1000    4    4
With something added for verification:
EmpID  Name  Salary  TID  Index  A  B
E101   John  1000    3    3      1  1
E101   John  100000  2    2      1  1
E102   Adam  5000    5    5      1  1
E103   Eve   1000    4    4      1  1
A and B: two columns, each holding 1 in SSS form

Verify the answer of the following query:
Count query result for each row (Value)  A  B  1 - Value  Value × A  (1 - Value) × B
1                                        1  1  0          1          0
0                                        1  1  1          0          1
0                                        1  1  1          0          1
0                                        1  1  1          0          1
Add all values of Value × A: 1
Add all values of (1 - Value) × B: 3

Verify the answer of the following query:
Values returned to the user: 1 and 3
The first value matches the result of the count query → The count query result is correct
The sum of the two values equals the number of rows in the dataset → The server has scanned all the rows to compute the answer
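The check above can be sketched on cleartext values. In OBSCURE the servers would perform the same multiplications and additions over secret-shares, so this only illustrates the arithmetic; the function names are hypothetical.

```python
def verification_values(row_results, col_a, col_b):
    """Server side: sum of Value*A and sum of (1 - Value)*B over all rows."""
    first = sum(v * a for v, a in zip(row_results, col_a))
    second = sum((1 - v) * b for v, b in zip(row_results, col_b))
    return first, second


def verify_count(claimed_count, first, second, n_rows):
    """User side: the count must match, and every row must have been scanned."""
    return claimed_count == first and first + second == n_rows


row_results = [1, 0, 0, 0]  # per-row result of the count query
first, second = verification_values(row_results, [1] * 4, [1] * 4)
print(first, second, verify_count(1, first, second, n_rows=4))  # 1 3 True

# A server that skips the last row fails the second check.
skipped = verification_values(row_results[:3], [1] * 3, [1] * 3)
print(verify_count(1, *skipped, n_rows=4))  # False
```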

OBSCURE: Maximum Query
select * from Employee where Salary
in (select max(Salary) from Employee)
EmpID Name Salary Dept TID Index
E101 John 1000 Testing 3 3
E101 John 100000 Security 2 2
E102 Adam 5000 Testing 5 5
E103 Eve 1000 Design 4 4
5 5 5000
4 4 1000
3 3 1000
2 2 100000
Find the tuple with the
maximum salary
2 2 100000
Output
Based on string matching over TID
and SSTID, find the tuple having the
maximum salary
E101 John 100000 Security 2 2
E1
E2

• Dataset
• TPC-H LineItem Table 1M and 6M rows
• Cloud Machines
• 15 AWS servers, each 144GB RAM, 3.0GHz Intel Xeon CPU with 72 cores
• Database Owner or User Machine
• A 16GB RAM machine with one core
OBSCURE: Experimental Results

OBSCURE vs MPC (communication among servers)

OBSCURE vs Downloading and Local Processing
1M rows: at most 13 seconds
6M rows: at most 50 seconds
Computation time at a resource-constrained user
(1GB RAM and single-core 1.35GHz CPU)
1M rows → at most 13 seconds < 26 seconds (downloading)
6M rows → at most 50 seconds < 385 seconds (downloading)

OBSCURE: Experiments Results – Query Execution vs
Verification Time
• At most 10 seconds on 1M rows count and sum operation verification
• At most 35 seconds on 6M rows count and sum operation verification

• Cybernetica
• Galois Inc.
• Partisia
• Secret Double Octopus
• SecretSkyDB Ltd
• Stealth Software Technologies
• Unbound Tech.
Industrial Efforts (1)

• Built on top of PostgreSQL and
supports most of SQL
• Offer multiple encryption techniques
• Deterministic encryption
• Order Preserving Encryption
• Multi-Party Computation using SPDZ
engine
• Users can select different encryption
techniques for different attributes
• Provide end-to-end privacy
preservation for relational queries
• Limitations:
• Joins are allowed only on deterministically encrypted attributes
• Search in the MPC domain is very slow
Industrial Efforts (2): Jana by Galois Inc.
Thanks: David Archer @ Galois

• Stealth Software Technologies, Inc. is a small
business based in Los Angeles
• Team of world-class cryptographers and software
engineers who are pioneers in their field and have
been building solutions
• Private Updateable Lightweight Scalable Active
Repository (PULSAR) as part of the DARPA
Brandeis program
• Aggregate and analyze data while maintaining
privacy and authenticity of data
• Combines a multitude of cryptographic techniques
• Secure database via access-pattern hiding Searchable
Encryption [IKLO16]
• Function Secret Sharing [BGI15]
• Efficient Secure Multiparty Computation (MPC)
• Garbled RAM [LO13]
Industrial Efforts (3): PULSAR by Stealth Software
Technologies, Inc.
Thanks: Steve Lu @ Stealth

Comparing Secret-Sharing-based Systems
Features | Pulsar (v0.5.1-269) | Jana (v1.7.6) | OBSCURE
Incremental insert support | No | Yes | Yes
Indexing support | Yes | Yes -- limited to DET and plaintext | No
Support for sensitivity | No --- encrypts all data | Yes --- encrypts attributes as described by the application designer | No
Select and range queries | Yes | Yes | Yes
Analytical query support (group by and join) | No --- (but application-level joins possible, though may leak data) | Partial --- Join only on plaintext or DET attributes | No join
User-defined functions | No | No | No
Academic/Industry | Industry | Industry | Academic

• Information-theoretically secure
• Secure regardless of the computational power of an adversary
• Quantum secure
• Communication overheads
• Cannot deal with complex queries
• Join queries
• Nested queries
• Require more than one non-communicating server
Pros and Cons of Secret-Sharing-based Techniques

Motivation for SGX
• Security and isolation in
commodity systems
• Privilege levels (rings) protect the
kernel from user programs

Motivation for SGX
• Security and isolation in commodity systems
• Page tables protect programs
from each other

Operating Systems haven’t changed for decades
• 40 years old
• Time-sharing
• Expensive hardware
• Overly general
Ken Thompson (sitting) and Dennis Ritchie working together at a PDP-11 (1972)

• 17,000,000 LoC
• 40 subsystems
• 3,200 device drivers

Modern Kernels are Vulnerable
[Figure: Linux Kernel Vulnerabilities by Year, 2009 to 2019; vertical axis from 0 to 500]

Motivation for SGX
• Security and isolation in commodity systems
• Page tables protect programs
from each other
• Until one program (malware)
attacks the kernel and then
attacks any program in the
system

TCB of A Modern System
• Attack surface is giant
• OS kernel
• 17,000,000 lines of code
• 40 major subsystems
• 3,200 device drivers
• Virtual Machine Monitor
• Hypervisor
• QEMU emulator
• Device drivers
• Parts of host kernel
(KVM)/Domain0 (Xen)

Enclaves
• Applications can protect their
secrets
• TCB is small
• Intel CPU
• App code itself
• Protected from malicious
• BIOS
• SMM
• Hypervisor
• Kernel
• Familiar application
environment

SGX Enclaves
• Trusted execution environment embedded in the process

SGX Enclaves
• Trusted execution environment embedded in the process
• Its own code and data
• Controlled entry points
• Multi-threading
• Confidentiality
• Integrity

• Enclave entries and exits are expensive
• Memory is encrypted
• Limited physical memory
Performance

Powerful Adversary Model
• OS + VMM
• Controlled execution environment
• Control over page faults
• Suspending execution
• Single stepping
• Flushing caches

• Every architectural component of the CPU
• Branch target buffers
• S. Lee et al., “Inferring fine-grained control flow inside SGX enclaves with branch shadowing,” in USENIX Security, 2017
• G. Chen et al., “SgxPectre attacks: Stealing intel secrets from SGX enclaves via speculative execution,” arXiv preprint, 2018.
• Pattern-history table
• D. O'Keeffe et al., "Spectre attack against SGX enclave," 2018
• Caches
• Brasser et al., "Software grand exposure: SGX cache attacks are practical," in WOOT, 2017
• J. Gotzfried et al., "Cache attacks on Intel SGX," in EuroSec, 2017
• A. Moghimi et al., "Cachezoom: How SGX amplifies the power of cache attacks," in CHES, 2017
• M. Hahnel et al., "High-resolution side channels for untrusted operating systems," in USENIX ATC, 2017
• M. Schwarz et al., "Malware guard extension: Using SGX to conceal cache attacks," in DIMVA, 2017
• DRAM row buffer
• W. Wang et al., "Leaky cauldron on the dark land: Understanding memory side-channel hazards in SGX," in CCS, 2017
• Page-tables
• W. Wang et al., “Leaky cauldron on the dark land: Understanding memory side-channel hazards in SGX,” in CCS, 2017
• J. Van Bulck et al., “Telling your secrets without page faults: stealthy page table-based attacks on enclaved execution,” in USENIX Security, 2017
• Page-fault exception handlers
• Y. Xu et al., “Controlled-channel attacks: Deterministic side channels for untrusted operating systems,” 2015
• S. Shinde et al., “Preventing page faults from telling your secrets,” in CCS, 2016
• Speculative execution
• J. V. Bulck et al., “Foreshadow: Extracting the keys to the Intel SGX kingdom with transient out-of-order execution,” in USENIX, 2018
Side-Channel Attacks

• Controlled channel attacks
Page-Fault Tracing Attacks
Reference: Y. Xu et al., “Controlled-channel attacks: Deterministic side channels for untrusted operating systems,” 2015

• Page fault address depends on sensitive data
Page Fault Tracing Attacks

• Insertions are
deterministic
• Word order is known
• Observe sequence of
page faults
• Lookups exhibit the same
sequences
Example: Recovering Text via Spell Checker

• Wizard of Oz
• All words
• 96% accuracy
Example: Recovering Text via Spell Checker

Cache Attacks: Prime + Probe
Reference: Brasser et al., "Software grand exposure: SGX cache attacks are practical," in WOOT, 2017

• Isolated core
• Execute attack in L1
• Separate instruction and
data caches
• No self-pollution
• SMT
• Uninterrupted execution
• Performance Monitoring
Counters (PMC)
• Cache-misses
Controlled Execution Environment
Reference: Brasser et al., "Software grand exposure: SGX cache attacks are practical," in WOOT, 2017

Example: Cache-Tracing Attack
Reference: M. Hahnel et al., "High-resolution side channels for untrusted operating systems," in USENIX ATC, 2017

Text Reconstruction

Cache-Tracing: Reconstructed Text

• SGX does not clear branch history
Branch Shadowing Attack
Reference: S. Lee et al., “Inferring fine-grained control flow inside SGX enclaves with branch shadowing,” in USENIX Security, 2017

• SGX does not clear branch
history
• Can we extract this
information?

• 66% of a 1024-bit RSA private key recovered from a
single run
• Full key from 10 runs

Data-Oblivious Primitives
• Assignments and comparisons
Reference: Ohrimenko, Olga, et al. "Oblivious multi-party machine learning on trusted processors." USENIX Security, 2016.

• Array access
• Scan entire array
• AVX instructions
Data-Oblivious Primitives
Reference: Ohrimenko, Olga, et al. "Oblivious multi-party machine learning on trusted processors." USENIX Security, 2016.
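The idea behind these primitives can be sketched as follows. This only shows the access pattern: every array slot is touched regardless of the secret index, and selection is done arithmetically instead of with a data-dependent branch. Python gives no constant-time guarantees, so real enclave code would rely on hardware instructions (the slide mentions AVX); the function names are illustrative assumptions.

```python
def oblivious_select(cond, a, b):
    """Return a if cond == 1 else b, without a data-dependent branch."""
    return cond * a + (1 - cond) * b


def oblivious_read(array, secret_index):
    """Read array[secret_index] while scanning the entire array."""
    result = 0
    for i, value in enumerate(array):
        hit = int(i == secret_index)  # 0/1 comparison result
        result = oblivious_select(hit, value, result)
    return result


def oblivious_write(array, secret_index, new_value):
    """Write to one slot while rewriting every slot."""
    for i in range(len(array)):
        hit = int(i == secret_index)
        array[i] = oblivious_select(hit, new_value, array[i])


records = [10, 20, 30, 40]
print(oblivious_read(records, 2))  # 30
oblivious_write(records, 1, 99)
print(records)  # [10, 99, 30, 40]
```

The cost is linear in the array size for every access, which is the price of hiding which element was used.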

• Will be fixed
• Caches
• Partitioned caches
• Branch predictors and likely other microarchitectural components of the CPU
• Speculative Taint Tracking (STT)
• Yu, Jiyong, et al. "Speculative Taint Tracking (STT): A Comprehensive Protection for
Speculatively Accessed Data." Micro, 2019
• What will not be fixed
• Paging attacks
• SGX inherently leaves page table under control of the OS
• Memory
• Enclave’s memory is observable by the OS and hardware attacks
• ORAM adds about 10x overhead
What will be fixed in hardware

Can we design any system supporting
database operations using SGX?
• It’s possible to build an oblivious database
• Oblivious primitives for accessing records
• Oblivious sort for joins
• Parallel bitonic sort: N * (log N)^2 comparisons
Question
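A minimal sketch of a bitonic sorting network, the kind of oblivious sort referred to above. The sequence of compared positions depends only on the input length, never on the data, which is what makes it suitable for oblivious joins. This sequential version assumes a power-of-two length; it is not taken from any of the cited systems.

```python
def bitonic_sort(values):
    """Sort with a fixed compare-exchange schedule (length must be a power of two)."""
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # distance between compared positions
            for i in range(n):
                partner = i ^ j
                if partner > i:  # depends on indices only
                    ascending = (i & k) == 0
                    low, high = min(a[i], a[partner]), max(a[i], a[partner])
                    a[i], a[partner] = (low, high) if ascending else (high, low)
            j //= 2
        k *= 2
    return a


print(bitonic_sort([5, 1, 4, 8, 2, 7, 3, 6]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Each of the O((log N)^2) rounds performs N/2 independent compare-exchanges, so the rounds parallelize naturally.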

Opaque (1)
Patient ID disease
1 Fever
2 Cancer
3 Fever
4 Cancer
5 Cancer
Reference and slide credit: Wenting Zheng, Ankur Dave, Jethro G. Beekman, Raluca Ada Popa, Joseph E. Gonzalez, and Ion Stoica. "Opaque: An oblivious and encrypted distributed
analytics platform." In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp. 283-298. 2017.
• First system supporting database joins and aggregations using SGX
• However, supports primary key to foreign key join only
• Example query: How many people are suffering from cancer and from fever?

Opaque (2)
Patient ID disease
1 Fever
2 Cancer
3 Fever
4 Cancer
5 Cancer
Reference: Wenting Zheng, Ankur Dave, Jethro G. Beekman, Raluca Ada Popa, Joseph E. Gonzalez, and Ion Stoica. "Opaque: An oblivious and encrypted distributed analytics platform."
In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp. 283-298. 2017.

Opaque (2): Oblivious Aggregation
Patient ID disease
1 Fever
2 Cancer
3 Fever
4 Cancer
5 Cancer
Patient ID disease
2 Cancer
4 Cancer
5 Cancer
1 Fever
3 Fever
Quicksort
in
SGX
Cancer, 3
Fever, 2
Patient ID disease
1 Fever
2 Cancer
3 Fever
4 Cancer
5 Cancer
Decrypt
Reference: Wenting Zheng, Ankur Dave, Jethro G. Beekman, Raluca Ada Popa, Joseph E. Gonzalez, and Ion Stoica. "Opaque: An oblivious and encrypted distributed analytics platform."
In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp. 283-298. 2017.
What is the wrong assumption in Opaque?
It does not deal with side-channel attacks (cache-line, branching)
But not all side-channel attacks can be fixed in the future

• ObliDB
• Selection and join
• HardIDX
• Secure indexes using SGX
• VC3
• For secure MapReduce computations
• EnclaveDB
• For secure transaction support
• Hermetic
• Mixed differential privacy
Other Systems using Intel SGX

Data Partitioning-based Outsourced
Data Processing

• What have we seen in the previous slides?
• Many cryptographic solutions exist
• Not efficient for answering even simple queries
Scaling Secure Data Management
“At scale” solutions require a choice between
generality, security, or performance.
Weaker Security Models: use weaker
models of security to scale computation
(explored in several prior systems)
Partitioned Computing: exploit partial
sensitivity of data to prevent expensive
cryptography on data that is not
sensitive

Partitioned Data Security
• Non-Linkability
• The adversary does not learn any relationship between an encrypted value and a
plaintext value
• Ciphertext Indistinguishability
• The adversary does not learn any relationships between encrypted values
• unless the underlying crypto allows such relationships to be learnt (e.g., OPE)

Partitioned Computations at Public Cloud (1)
Name Department
t1 E(Adam) E(Defense)
t2 E(John) E(Security)
t3 E(Clark) E(Crypto)
t4 E(Lisa) E(Defense)
Name Department
t5 Adam Testing
t6 John Testing
t7 Lisa Design
t8 Clark Design
Query Q Answer A
Query Qs Query Qns
Answer Ans
Answer As
Sensitive Data Ds
Non-sensitive Data Dns

Leakage due to Partitioned Computing… (2)
Name Department
Name Department
t5 Adam Testing
t6 John Testing
t7 Lisa Design
t8 Clark Design
Sensitive Data Ds
Query: Retrieve John rows
Query
value
Tuples retrieved
from sensitive side
Tuples retrieved from
non-sensitive side
John T2 T6
Adversarial view
T2 is John’s row.
Reference: Sharad Mehrotra, Shantanu Sharma, Jeffrey Ullman, and Anurag Mishra. "Partitioned data security on outsourced sensitive and non-sensitive data." In 2019 IEEE 35th International Conference on Data Engineering (ICDE),

What if we use access-pattern-hiding techniques? (3)
Name Department
Name Department
t5 Adam Testing
t6 John Testing
t7 Lisa Design
t8 Clark Design
Sensitive Data Ds
Query: Retrieve John rows
Query
value
Tuples retrieved
from sensitive side
non-sensitive side
John E(….) T6
Adversarial view
Output size reveals that one of
John’s records is sensitive.
Reference: Sharad Mehrotra, Shantanu Sharma, Jeffrey Ullman, and Anurag Mishra. "Partitioned data security on outsourced sensitive and non-sensitive data." In 2019 IEEE 35th International Conference on Data Engineering (ICDE),

Secure Partitioned Computation (1)
• Data partitioned into bins
• Non-sensitive data partitioned into
non-sensitive bins (NSB)
• Sensitive data partitioned into
sensitive bin (SB)
……E( x)……..
…… x ……..
…… y……..
…… z .……..
…….……..
……E(y) ……..
…… E(z)……..
…….……..
Ds
Dns
SB(x)
SB(y)
SB(z)
NSB(x)
NSB(y)
NSB(z)
Query
value
Tuples retrieved
from sensitive side
non-sensitive side
John SB(y) NSB(y)
Adversarial view
• Query Q for value y mapped to
all values in the bin
corresponding to y
• Retrieves all data in NSB(y) over
non-sensitive data
• Retrieves all data in SB(y) over
sensitive data

• Bins are created such that for every pair of a sensitive bin s and a non-sensitive bin ns,
there exists a value v,
• such that s = SB(v) and ns = NSB(v)
• The adversarial view does not allow the adversary to learn linkability
between sensitive and non-sensitive records
……E( x)……..
…… x ……..
…… y……..
…… z .……..
…….……..
……E(y) ……..
…… E(z)……..
…….……..
Ds
Dns
SB(x)
SB(y)
SB(z)
NSB(x)
NSB(y)
NSB(z)

• Association amongst each sensitive bin and non-sensitive bin prevents
• Leakage through joint access of data
• Output size attacks
• Workload skew attacks can be prevented through (careful) addition of
(minimal) fake queries
……E( x)……..
…… x ……..
…… y……..
…… z .……..
…….……..
……E(y) ……..
…… E(z)……..
…….……..
Ds
SB(x)
SB(y)
SB(z)
NSB(x)
NSB(y)
NSB(z)
Dns

Query Binning
• Assumptions
• Equal number of sensitive and non-sensitive attribute values
• Each distinct attribute value appears in at most one tuple in sensitive and one
tuple in non-sensitive data
• Number of values is a product of approximately equal factors
***The paper relaxes all these assumptions

The Algorithm: One Tuple Per Value
Bin Creation: Inputs: S and NS
• Permute all sensitive values
• Find approximately square factors x and y of |NS|, i.e., |NS| = x * y such that x ≥ y
• Create x sensitive bins, each containing at most y values
• Create |NS|/x non-sensitive bins
• Assign the ith sensitive value to the (i mod x)th sensitive bin
• Assigning non-sensitive values: assign the non-sensitive value
corresponding to the ith sensitive value, which is allocated to the
jth bin, to the jth position of the ith non-sensitive bin
• NSB[j][i] ← allocateNS(SB[i][j])
• Fill remaining NS values
S = 6 NS = 6
x = 3
y = 2
SB1
SB2
SB3
NSB1
NSB2
S1
S2
S3
S4
S5
S6
NS2 NS3 NS1
NS7 NS6 NS4
S = {S1, S2, S3, S4, S5, S6}
NS = {NS1, NS2, NS3, NS4, NS6, NS7}

The Algorithm: One Tuple Per Value
• Bin Retrieval: Input: Query(w)
• If w is in a sensitive bin SB[i][j], then
• Retrieve ith sensitive bin and jth non-sensitive bin
• If w is in a non-sensitive bin NSB[i][j], then
• Retrieve ith non-sensitive bin and jth sensitive bin
S = 6 NS = 6
x = 3
y = 2
S = {S1, S2, S3, S4, S5, S6}
NS = {NS1, NS2, NS3, NS4, NS6, NS7}
Query: S2 → SB2, NSB1
Query: NS7 → NSB1, SB2
SB1
SB2
SB3
NSB1
NSB2
S1
S2
S3
S4
S5
S6
NS2 NS3 NS1
NS7 NS6 NS4
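Bin creation and bin retrieval can be sketched for the base case stated in the assumptions: every value occurs once on the sensitive side and once on the non-sensitive side, and the value list is assumed to be already permuted. The six letters are a made-up example rather than the slide's S and NS sets, and all function names are hypothetical.

```python
from math import isqrt


def approx_square_factors(n):
    """Return (x, y) with x * y == n, x >= y, and y as close to sqrt(n) as possible."""
    for y in range(isqrt(n), 0, -1):
        if n % y == 0:
            return n // y, y


def create_bins(values):
    """Build x sensitive bins of y slots and y non-sensitive bins of x slots."""
    x, y = approx_square_factors(len(values))
    sb = [[None] * y for _ in range(x)]
    nsb = [[None] * x for _ in range(y)]
    for k, v in enumerate(values):
        i, j = k % x, k // x   # sensitive bin i, position j
        sb[i][j] = v           # sensitive (encrypted) copy of v
        nsb[j][i] = v          # NSB[j][i] <- counterpart of SB[i][j]
    return sb, nsb


def retrieve(w, sb, nsb):
    """Return (sensitive bin index, non-sensitive bin index) fetched for query w."""
    for i, bin_ in enumerate(sb):
        if w in bin_:
            return i, bin_.index(w)   # SB[i][j] -> fetch SB i and NSB j
    for i, bin_ in enumerate(nsb):
        if w in bin_:
            return bin_.index(w), i   # NSB[i][j] -> fetch NSB i and SB j
    raise KeyError(w)


values = list("abcdef")
sb, nsb = create_bins(values)
print(sb)   # [['a', 'd'], ['b', 'e'], ['c', 'f']]
print(nsb)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
print(retrieve("e", sb, nsb))  # (1, 1)
```

Every (sensitive bin, non-sensitive bin) pair is fetched for exactly one value, so the pair retrieved for a query does not tell the adversary which value inside the bins was requested.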

Query Execution Cost on Outsourced Data
Techniques Time Resilient to attacks
Size Workload-skew Access-patterns
SGX 10500x
Query Binning + SGX (60% sensitivity) 8929x
Multi-party computations-Jana 954363x
Query Binning + Jana (60% sensitivity) 680131x
x is the time to search a predicate in cleartext.
A check mark shows that a technique is resilient to a given attack.
Experiments are conducted over 1.5M rows.

Experimental Results (Selection Query)
• X-axis = Data sensitivity (1%, 2%, 20%, 40%, 60%)
• Y-axis = time
SGX Opaque + Partition computing vs SGX Opaque
Data set size = 6M rows
Jana MPC + Partition computing vs Jana MPC
Data set size = 1M rows

Analytical Model
• When is query binning better compared to pure cryptographic approach?
Ratio of cost of QB versus
crypto only approach
After several rounds of
simplifications (see paper)
Under ideal assumptions….
QB is better than cryptographic only
solution if this holds (see paper)
Ratio of computation cost of cryptographic
techniques vs plaintext per tuple
Ratio of cryptographic computation vs
communication cost per tuple (typically much
greater than 1 for strong cryptographic techniques)
Average query selectivity
Ratio of sensitive data

• What if there is no approximate square factor?
• Select nearest square number
• What if there is no 1-to-1 mapping between sensitive and non-sensitive values, and
the values differ in their number of tuples?
• Bin-packing algorithm
• What about range queries?
• With the help of a modified B-tree created over non-sensitive bins
• What about join queries?
• Keep pseudo-sensitive data with sensitive data
• What about aggregation queries?
• Execute like a selection query without tuple fetching
Query Binning Extensions

Distinct Values are not a Product of Approximately
Square Factors (1)
• What will happen when the number of distinct values is not a product
of approximately square factors?
• Increasing communication cost
• For example, 82 non-sensitive values result in 41 sensitive bins and 2
non-sensitive bins
SB1: E(s1), SB2: E(s2), …, SB41: E(s41)
At most 1 value in a sensitive bin
NSB1: ns1, ns2, …, ns41
NSB2: ns42, ns43, …, ns82
At most 41 values in a non-sensitive bin
Communication cost = 42

Distinct Values are not a Product of Approximately
Square Factors (2)
• Reducing communication cost by finding the nearest square number
• In the case of 82 non-sensitive values, 81 is the nearest square number
• Thus, create 9 sensitive bins and 9 non-sensitive bins
41 sensitive values, 82 non-sensitive values
SB1: ….E(x)…., SB2: …E(y)….., …, SB9: ….E(z)…..
At most 5 values in a sensitive bin
NSB1: ns1, ns2, …, ns10
NSB2: ns11, ns12, …, ns19
…
NSB9: ns74, ns75, …, ns82
At most 10 values in a non-sensitive bin
Communication cost = 15
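The two communication costs (42 versus 15) can be reproduced with a short calculation. The sketch assumes the cost of a query is the maximum number of values in one sensitive bin plus the maximum number in one non-sensitive bin, which matches the figures on the two slides; the function names are hypothetical.

```python
from math import ceil, isqrt


def approx_square_factors(n):
    """Return (x, y) with x * y == n, x >= y, and y as close to sqrt(n) as possible."""
    for y in range(isqrt(n), 0, -1):
        if n % y == 0:
            return n // y, y


def cost_exact_factors(n_sensitive, n_non_sensitive):
    """x sensitive bins and y non-sensitive bins from the exact factorization."""
    x, y = approx_square_factors(n_non_sensitive)
    return ceil(n_sensitive / x) + ceil(n_non_sensitive / y)


def cost_nearest_square(n_sensitive, n_non_sensitive):
    """Use side = sqrt(nearest square number) bins on both sides."""
    r = isqrt(n_non_sensitive)
    below, above = n_non_sensitive - r * r, (r + 1) ** 2 - n_non_sensitive
    side = r if below <= above else r + 1
    return ceil(n_sensitive / side) + ceil(n_non_sensitive / side)


print(cost_exact_factors(41, 82))   # 42: 1 sensitive + 41 non-sensitive values
print(cost_nearest_square(41, 82))  # 15: 5 sensitive + 10 non-sensitive values
```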

The Algorithm: General Case: Multiple Tuples per Value
(1)
• What will happen if all values have a
different number of tuples??
• Size of each sensitive bin is different now
• Assumption: More non-sensitive values
have more sensitive associated tuples.
• From tuple retrieval, the adversary learns
which bin contains the sensitive value
corresponding to a non-sensitive value
• E.g., retrieval of SB1 and NSB1 reveals that
S1 is allocated to SB1
S = 6 NS = 6
x = 3
y = 2
SB1
SB2
SB3
NSB1
NSB2
S1
S2
S3
S4
S5
S6
NS2 NS3NS1
NS7 NS6NS4
S1 = 10
S2 = 2
S3 = 1
S4 = 15
S5 = 2
S6 = 1
NS1 = 200
NS2 = 20
NS3 = 10
NS4 = 150
NS5 = 10
NS7 = 10
Size of bin
25
4
2
Size of
bin
230
170

The Algorithm: General Case: Multiple Tuples per Value
(2)
• What will happen if all values have a
different number of tuples?
• Solution: Simply add fake tuples to
sensitive bins
• Problem: Too many fake tuples,
leading to increased communication
cost
• So how to overcome this problem?
S = 6 NS = 6
x = 3
y = 2
SB1
SB2
SB3
NSB1
NSB2
S1
S2
S3
S4
S5
S6
NS2 NS3NS1
NS7 NS6NS4
S1 = 10
S2 = 2
S3 = 1
S4 = 15
S5 = 2
S6 = 1
NS1 = 200
NS2 = 20
NS3 = 10
NS4 = 150
NS5 = 10
NS7 = 10
Size of bin
25
4
2
Size of
bin
230
170
Added fake
tuples
0
21
23
We add 44 fake tuples to
sensitive data

The Algorithm: General Case: Multiple Tuples per Value
(3)
• What will happen if all values have a
different number of tuples?
• Solution: Bin-packing-based approach
• Sorting: Sort all the values in decreasing
order of the number of tuples
• Allocate sensitive values
• Add fake tuples
• Allocate non-sensitive values as shown
previously
S = 6 NS = 6
x = 3
y = 2
SB1
SB2
SB3
NSB1
NSB2
S4
S1
S2
S6
S3
S5
NS1 NS2NS7
NS3 NS5NS6
S1 = 10
S2 = 2
S3 = 1
S4 = 15
S5 = 2
S6 = 1
NS1 = 200
NS2 = 20
NS3 = 10
NS4 = 150
NS5 = 10
NS7 = 10
Size of bins
before adding
fake tuples
16
11
4
Added fake
tuples
0
5
12
S4 = 15
S1 = 10
S2 = 2
S5 = 2
S3 = 1
S6 = 1
After
sorting
We add fewer fake tuples than the simple
solution of padding the original bins:
44 (simple solution) vs 17 (bin packing) fake tuples
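The 44 versus 17 comparison can be checked with the tuple counts from the slide (S1 = 10, S2 = 2, S3 = 1, S4 = 15, S5 = 2, S6 = 1, with x = 3 bins of y = 2 values). The greedy rule used below, placing each value in the least-loaded bin that still has a free slot, is an assumption that happens to reproduce the slide's bin sizes of 16, 11, and 4; the paper's exact allocation rule may differ.

```python
def fake_tuples(bins, counts):
    """Padding needed so that every bin reaches the size of the largest bin."""
    sizes = [sum(counts[v] for v in b) for b in bins]
    return [max(sizes) - s for s in sizes]


def round_robin_bins(values, x):
    """Simple allocation: the k-th value goes to bin k mod x."""
    bins = [[] for _ in range(x)]
    for k, v in enumerate(values):
        bins[k % x].append(v)
    return bins


def packed_bins(counts, x, y):
    """Sort by tuple count (descending), then fill the least-loaded non-full bin."""
    bins = [[] for _ in range(x)]
    loads = [0] * x
    for v in sorted(counts, key=counts.get, reverse=True):
        open_bins = [i for i in range(x) if len(bins[i]) < y]
        i = min(open_bins, key=lambda b: loads[b])
        bins[i].append(v)
        loads[i] += counts[v]
    return bins


counts = {"S1": 10, "S2": 2, "S3": 1, "S4": 15, "S5": 2, "S6": 1}
simple = fake_tuples(round_robin_bins(list(counts), 3), counts)
packed = fake_tuples(packed_bins(counts, 3, 2), counts)
print(simple, sum(simple))  # [0, 21, 23] 44
print(packed, sum(packed))  # [0, 5, 12] 17
```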

Range Queries
• A full binary tree is constructed over all non-sensitive values
• Bins are created for each level of the tree, except the root node
• Bins are retrieved based on least-matching
• For example, a range query from ns8 to ns12 → Bins as per node ns23 and ns8
Bins for each node of each level of the tree

• We discussed:
• Encryption-based techniques and systems
• Secret-sharing-based techniques and systems
• Existing cryptographic techniques involve a trade-off
• Functionality vs security vs overhead
• Secret-sharing is secure but has limited applicability
• Searchable encryption is fast but reveals information
• Trusted platform-based approaches are faster than cryptographic techniques
• But there is no completely trusted platform at the public cloud
• Existing secure hardware has several vulnerabilities
• Can we exploit a secure mediation approach?
• Different cryptographic techniques at the same time
• Security is not clear
• Initial effort: partitioned computation, but with security challenges -- a naïve query execution on
partitioned data can lead to information leakage
Conclusion

Contact Information
Shantanu Sharma
University of California, Irvine, USA.
shantanu.sharma[AT]uci[DOT]edu
toshantanusharma[AT]gmail[DOT]com
Slides are available at
ics.uci.edu/~shantas/
