This document discusses data obfuscation techniques in Splunk Enterprise, including anonymization and pseudonymization. It covers securing data in flight using encryption and authentication. For data at rest, it discusses integrity controls and encryption using OS, devices, or Vormetric. It then details how Splunk supports anonymization through SEDCMD transforms or at search time. Pseudonymization techniques include hashing or duplicating data to different indexes. The document demonstrates modular inputs and a custom data handler to encrypt and anonymize fields before indexing.
5. The Drivers
Collect and Process Data
Stakeholder*: Workers Council, Data Privacy Officer, GDPR, Privacy Shield, PCI, ….
Requirements*: Anonymization, Pseudonymization, Pseudonymization, Encryption, RAW event archival for 1 year (3 months online)
*Examples only | Your legal department will assist you.
6. The Drivers
Make sure you have a flexible platform that fits your needs, even if they change!
7. Spoilt for Choice
What
– Confidentiality / Integrity / Authenticity
Where
– At Source / In Flight / At Rest / Presentation Layer
How
– Anonymization / Pseudonymization
Usability, Maintainability, Cost, …
9. Data-in-Flight
Ways to secure your connections to Splunk Enterprise
Encryption and/or authentication using your own certificates for:
– Communications between the browser and Splunk Web
– Communication from Splunk forwarders to indexers
– Other types of communication, such as communications between Splunk
instances over the management port
Type of exchange | Client function | Server function | Encryption | Certificate authentication | Common Name checking | Type of data exchanged
Browser to Splunk Web | Browser | Splunk Web | NOT enabled by default | dictated by client (browser) | dictated by client (browser) | search term results
Inter-Splunk communication | Splunk Web | splunkd | enabled by default | NOT enabled by default | NOT enabled by default | search term results
Forwarding | splunkd as a forwarder | splunkd as an indexer | NOT enabled by default | NOT enabled by default | NOT enabled by default | data to be indexed
Deployment server to indexers | splunkd as a forwarder | splunkd as an indexer | NOT enabled by default | NOT enabled by default | NOT enabled by default | Not recommended. Use Pass4SymmKey instead.
http://docs.splunk.com/Documentation/Splunk/latest/Security/AboutsecuringyourSplunkconfigurationwithSSL
11. Data-at-Rest Integrity
Ways to ensure the integrity of your machine data stored in Splunk
Compute SHA256 hash for every slice in hot bucket
When bucket rolls from hot to warm, create SHA256 hash of the file
containing the hashes of the individual slices
Can verify integrity from the CLI
Enable for an entire index
http://docs.splunk.com/Documentation/Splunk/latest/Security/Dataintegritycontrol
http://blogs.splunk.com/2015/10/28/data-integrity-is-back-baby/
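The two-level hashing described on this slide can be sketched in Python. This is a minimal illustration of the idea only, not Splunk's on-disk format: the function names, the newline-separated hash listing, and the sample slices are all assumptions made for the example.

```python
import hashlib


def slice_hashes(slices):
    """Return the SHA-256 hex digest of every slice, in order."""
    return [hashlib.sha256(s).hexdigest() for s in slices]


def bucket_hash(hashes):
    """Hash a listing of the per-slice hashes (assumed: one hash per line)."""
    listing = "\n".join(hashes).encode("ascii")
    return hashlib.sha256(listing).hexdigest()


def verify(slices, recorded_hashes, recorded_bucket_hash):
    """True only if every slice and the hash listing itself are unchanged."""
    return (
        slice_hashes(slices) == recorded_hashes
        and bucket_hash(recorded_hashes) == recorded_bucket_hash
    )


# Hypothetical slices of raw data in a hot bucket.
slices = [b"event 1\nevent 2\n", b"event 3\n"]
hashes = slice_hashes(slices)   # computed per slice while the bucket is hot
top = bucket_hash(hashes)       # computed when the bucket rolls to warm

print(verify(slices, hashes, top))
```

Changing any byte in any slice changes that slice's hash, and changing the recorded slice hashes changes the top-level hash, so tampering at either level is detectable.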
12. Data-at-Rest Encryption
Entire data set
Encryption of all data Splunk writes to
disk (index, raw data, metadata)
Pros:
– Easy to implement with OS-level or device-level encryption
– Covers all data
– Transparent to Splunk
Cons:
– Applies to all indexes on a given file system
– Performance overhead
– Limited security against rogue users
15. What is Anonymization?
Anonymization of data means processing it with the aim of irreversibly
preventing the identification of the individual to whom it relates.
2016-12-24 09:00 host1 mm28522 login successful
2016-12-24 09:00 host1 ****** login successful
16. What is Pseudonymization?
Pseudonymization of data means replacing any identifying
characteristics of data with a pseudonym, or, in other words, a value
which does not allow the data subject to be directly identified.
2016-12-24 09:00 host1 mm28522 login successful
2016-12-24 09:00 host1 0fc43cd589ec74ddb677501adf6c295b login successful
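A pseudonym like the one in the second event can be produced by hashing the identifier. The sketch below uses HMAC-SHA256 with a secret key, which is an assumption for illustration and not necessarily how the value on the slide was generated; the key and helper names are hypothetical. A keyed hash is chosen because plain hashes of short usernames can be reversed by hashing a list of candidate names.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a managed key store.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value, key=SECRET_KEY):
    """Map an identifier to a stable pseudonym using a keyed hash."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_event(event, username):
    """Replace every occurrence of the username in the event text."""
    return event.replace(username, pseudonymize(username))


event = "2016-12-24 09:00 host1 mm28522 login successful"
masked = pseudonymize_event(event, "mm28522")
print(masked)
```

The same input always yields the same pseudonym, so events from one user can still be correlated without exposing the user ID.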
18. Anonymization
At Rest / At Indexing Time / Modify Raw Events
SEDCMD or TRANSFORMS
props.conf
[source::.../accounts.log]
SEDCMD-accounts = s/ssn=\d{5}(\d{4})/ssn=xxxxx\1/g
[source::.../another.log]
TRANSFORMS-anon=ssn-anon
transforms.conf
[ssn-anon]
REGEX=(?m)^(.*ssn=)\d{5}(\d{4}.*)$
FORMAT=$1xxxxx$2
DEST_KEY=_raw
https://docs.splunk.com/Documentation/Splunk/latest/Data/Anonymizedata
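The masking rule on this slide (replace the first five SSN digits, keep the last four) can be tried outside Splunk before deploying it. The following Python sketch applies the equivalent substitution with the standard `re` module; the sample event and the function name are made up for the example.

```python
import re

# Same intent as the sed expression: five digits masked, last four kept.
SSN_PATTERN = re.compile(r"ssn=\d{5}(\d{4})")


def mask_ssn(raw_event):
    """Mask the first five digits of every ssn=NNNNNNNNN occurrence."""
    return SSN_PATTERN.sub(r"ssn=xxxxx\1", raw_event)


sample = "user=alice ssn=123456789 action=update"
print(mask_ssn(sample))  # user=alice ssn=xxxxx6789 action=update
```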
19. Anonymization
Presentation Layer / At Search Time
Locked down User
– Pre-defined App with dashboard access only
– No search app, no raw search, no raw event drill down
| eval username = "******"
https://docs.splunk.com/Documentation/Splunk/6.5.1/Data/Anonymizedata
21. Pseudonymization
Presentation Layer / At Search Time
Locked down User
– Pre-defined App with dashboard access only
– No search app, no raw search, no raw event drill down
| eval username = sha256(username)
or use your own custom search command
https://docs.splunk.com/Documentation/Splunk/6.5.1/Data/Anonymizedata
22. Pseudonymization
At Source / Application
Data pseudonymization before Splunk picks it up
Pros:
– Managed as early as possible in the process
– Data source owner responsible
– Data-Privacy challenge solved for data stored on
source as well
Cons:
– Individual solution per data source/type/method
required
23. Pseudonymization
Event Duplication Into Different Indexes
User authorization managed via role based
access control for indexes
Pros:
– Easy to implement and maintain, easy usability,
low complexity
Cons:
– Storage costs (can be limited with tsidx
retention, at the cost of slower searches)
– License costs
idx_cleartext
idx_pseudonym
24. Pseudonymization
Using Summary Index
Scheduled summary search transforms the
data and stores it in a new summary index
Pros:
– Summary index does not count against license
– Everything GUI managed
– Allows grouped aggregation (anonymization, too)
Cons:
– Regular search utilizing resources
– Breaks out-of-the-box CIM (source=search name,
sourcetype=stash, original sourcetype moved to
orig_sourcetype)
idx_cleartext
idx_summary
25. Pseudonymization
Modular Input
Data is piped, in a decentralized way, through a
custom method using a modular input
Pros:
– High flexibility in the choice of encryption, hashing,
and other methods and requirements
– Processing can be done decentralized at each
forwarder to distribute processing load
Cons:
– Scripting required for modular inputs
27. Summing Up
Many possible ways – each has pros and cons
Anonymization
– Data aggregation might be needed as an additional layer, because specific access
to a specific file from a specific host can potentially be traced back to an individual
Pseudonymization
– Requires a proper concept so that the pros and cons are known and accepted
in advance, and the impact and additional complexity are understood in
production and operations
We are transparent about the possibilities and support multiple
ways and levels of data obfuscation.
Choose the best and most efficient
combination for you!
30. Modular Input
Search on Splunkbase
https://splunkbase.splunk.com/apps/#/search/Modular%20Input/
31. Protocol Data Inputs
Different input protocols
A custom data handler allows you to
pre-process data
– Polyglot: many programming
languages can be used. E.g. Java,
JavaScript, Python, …
Different output protocols
Data Handler
https://splunkbase.splunk.com/app/1901/
32. Demo Scenarios
Encryption (Modular Input)
– Log file with sensitive data
– Read log file data
– File Monitor input (UF)
– Protocol Data Inputs
– Data Handler encrypts field values
– Data sent and stored
Decryption (Custom Search Command)
– Events in Splunk with encrypted field values
– User is authorized to use custom search command
– Custom search command
– Decrypts fields
Anonymization (SEDCMD)
– Log file with sensitive data
– Read log file data
– File Monitor Input (UF)
– Pipeline
– Apply SEDCMD and replace data
– Data stored
33. Log File With Sensitive Data – cleartext.log
Field | Description | Action we want to take
first | First name | Encrypt with AES
name | Last name | Encrypt with AES
dob | Date of birth | Encrypt with AES
uid | Employee ID | Anonymize
37. Protocol Data Inputs Configuration – Data Handler
Parameters for custom data handler:
• regex: identify fields to encrypt
• AES_Key_File: Key to use to encrypt
PDI Custom data handler (here: Java)
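The role of the `regex` parameter can be illustrated with a small Python sketch of regex-driven field rewriting. It is not the demo's Java handler: the key=value log layout, the pattern, and all names below are assumptions. AES is not part of the Python standard library, so the transform is passed in as a function, and the stand-in used here is a keyed hash rather than real encryption.

```python
import hashlib
import hmac
import re

# Hypothetical pattern: key=value fields named first, name, or dob.
FIELD_REGEX = re.compile(r"\b(first|name|dob)=(\S+)")


def rewrite_fields(event, field_regex, transform):
    """Replace the value of every matched field with transform(value)."""
    return field_regex.sub(
        lambda m: f"{m.group(1)}={transform(m.group(2))}", event
    )


def stand_in_transform(value):
    """Placeholder only: a keyed hash, NOT AES and not reversible."""
    digest = hmac.new(b"demo-key", value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical event layout; the real cleartext.log format is not shown here.
event = "first=Jane name=Doe dob=1980-01-01 uid=mm28522 action=login"
print(rewrite_fields(event, FIELD_REGEX, stand_in_transform))
```

In a real handler, the transform would be an AES encryption routine that reads its key from the file named by `AES_Key_File`, so that an authorized custom search command can decrypt the values later.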