This document discusses data democratization using Splunk. It describes how Splunk can be used to provide universal access to data through delegated access models, standardized data models, and automation. Key points include:
1. Splunk can implement a delegated access model using apps, indexes, and user roles to securely share sensitive data.
2. Standardized data models and semantic logging help combat knowledge fragmentation and enable consistent analysis.
3. Automating data onboarding and validation helps improve adoption by reducing backlogs and ensuring data quality.
2. About Me
• Splunking Since 2008
• Largest Splunk Implementation:
• 3 TB/day
• 1.2 PB Searchable
• 900 Users
• Interests:
• Guitars
• And the occasional Uke
3. What is Splunk?
• Google Search for IT Data?
• Log aggregation Tool?
• Data Visualisation Tool?
• Data Platform with App Creation Capabilities
• Proprietary Search Language - SPL
• Correlation of Structured and Unstructured Data Sources
• Visualisation capabilities
• Out of the Box
• Modular
4. Getting Data In
[Diagram] Data flow: unstructured and structured (JSON, CSV, XML) data sources reach the indexer via forwarders or the HTTP Event Collector (HEC). The indexer pipeline performs line breaking, timestamp recognition and data segmentation, then persists events to disk. An index is made up of buckets, each holding keywords and raw data.
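The line breaking and timestamp recognition steps are driven by per-sourcetype parsing settings. A minimal sketch, assuming a hypothetical sourcetype named web:access (the name and timestamp format are illustrative, not from the deck):

```ini
# props.conf - illustrative parsing settings for the pipeline steps above
[web:access]
# Line breaking: each newline-delimited line is one event
LINE_BREAKER = ([\r\n]+)
SHOULD_LINEMERGE = false
# Timestamp recognition: e.g. [10/Oct/2017:13:55:36 +0000]
TIME_PREFIX = ^\[
TIME_FORMAT = %d/%b/%Y:%H:%M:%S %z
MAX_TIMESTAMP_LOOKAHEAD = 30
```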
5. Data Collection using Splunk
Forwarder
• Splunk forwarder capabilities
• File based Inputs
• Database Inputs
• Scripted Inputs
• Forwarder Configurations deployed as modular add-ons
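A file-based and a scripted input as they might appear in a forwarder add-on (paths, index and sourcetype names are illustrative assumptions):

```ini
# inputs.conf on a Universal Forwarder
# File-based input: tail the application's access log
[monitor:///var/log/myapp/access.log]
index = my_product
sourcetype = web:access

# Scripted input: run a collection script every 5 minutes
[script://./bin/db_row_count.sh]
interval = 300
index = my_product
sourcetype = db:metrics
```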
6. Typical Splunk Search
index=<my_product> sourcetype=web.access checkout | stats avg(response_time) as "Average Response Time" by request
7. Searching Data
[Diagram] Search execution: indexers handle the "map" phase - querying the index by keyword, loading raw results into memory, applying data extractions, transformations and lookups (knowledge objects), and running streaming commands. Search heads handle the "reduce" phase - receiving and reducing results, running any additional commands, then visualising, reporting or alerting.
8. So what about Knowledge Objects?
• Most Knowledge Objects are configurable from UI
• Common Types:
• Field Extractions - regex to extract fields
• Field Aliases - alternate names for fields
• Lookups - backed by flat files or the KV store
• Tags - provide an event-grouping abstraction
• Eventtypes - provide event categorisation
• Calculated Fields - derive fields via eval expressions
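Several of these knowledge objects live in props.conf and transforms.conf. A sketch, assuming the same hypothetical web:access sourcetype and a db_locations lookup file (all names here are illustrative):

```ini
# props.conf - field extraction, field alias and automatic lookup
[web:access]
EXTRACT-customer = customer_id=(?<customer_id>\d+)
FIELDALIAS-resp = resp_time_milliseconds AS response_time_ms
LOOKUP-location = db_locations customer_id OUTPUT location

# transforms.conf - the file-based lookup definition
[db_locations]
filename = db_locations.csv
```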
9. Goal?
• Queries like:
index=<my_website> "/checkout/auth/confirmation" | rex "<some humungous regex that extracts customer id in addition to other things>" | eval response_time_seconds = resp_time_milliseconds/1000 | where http_code == 200 | lookup db_locations customer_id OUTPUT location | stats avg(response_time_seconds) as avg_response_time by location
• Become:
eventtype=auth_successful tag=web | stats avg(response_time_seconds) as average_response_time by location
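The shortened query is backed by an eventtype and a tag defined once and shared. A sketch of how these could be declared (the base search is abbreviated from the long query above):

```ini
# eventtypes.conf - encapsulate the long base search under one name
[auth_successful]
search = index=<my_website> "/checkout/auth/confirmation" http_code=200

# tags.conf - tag the eventtype so related events group under "web"
[eventtype=auth_successful]
web = enabled
```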
12. Scenario
• Microservices Architecture
• Numerous Development Teams working under different service
umbrellas
• Mix of legacy systems with modern services
• Dependence on vendor integrations
• Data can be sensitive
13. Typical Data Democratisation Issues
• Security - some data is sensitive yet valuable, but we'd like an open access model
• Knowledge Fragmentation - it's our data, let's make sure everyone knows what it means
• Adoption - people need to like it; it shouldn't get in the way
• Scalability
• Chargeback - it's not my data, why should I pay for it?
14. Security - Delegated Access Model
• Splunk Search Apps can serve knowledge containers
• Knowledge Object ownership can be scoped local to the app or global to the entire system.
• Splunk Indexes are data containers.
• Data Access granted by index
• Assign an app per product or service umbrella
• Assign Data Owner
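Index-level access is granted through roles. A minimal sketch of a role confined to one product's index (role and index names are illustrative assumptions):

```ini
# authorize.conf - a role granting search access only to one index
[role_my_product_user]
importRoles = user
srchIndexesAllowed = my_product
srchIndexesDefault = my_product
```

Mapping single sign-on groups to roles like this one is what makes the delegated access model manageable at scale.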
16. Splunk Security Must Have!
• Splunk Authentication is Poor
• No Password Policy
• No Centralised management for multiple search nodes
• Single Sign On - Splunk supports:
• Ping Identity
• Okta
• ADFS
• Azure AD
• LDAP
• Custom Auth
• Use an entitlement framework on top of single sign-on groups
17. Combating Knowledge Fragmentation
• Semantic Logging:
• Logging for the sole purpose of analytics
• Rich datasets can be viewed in multiple dimensions
• Define Developer Guidelines:
• Ensure Correlation Identifiers are present in all events
• Precision Timestamps
• Incorporate Logging into SDLC
• Standardise Logging Formats
• Standardise Log content per service - e.g. BAM metrics
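The guidelines above can be sketched in application code. A minimal example (function and field names are illustrative, not from the deck) that emits one JSON event per line with a correlation identifier and a precision timestamp:

```python
import json
import logging
import time
import uuid

def semantic_event(service, action, **fields):
    """Build a semantic log event per the guidelines above:
    a correlation identifier, a precision timestamp, and a
    standardised JSON format that Splunk parses natively."""
    event = {
        "timestamp": round(time.time(), 6),  # epoch seconds, microsecond precision
        "correlation_id": str(uuid.uuid4()),  # propagate across services in practice
        "service": service,
        "action": action,
    }
    event.update(fields)
    return json.dumps(event)

# One event per line; a forwarder or HEC picks these up.
logging.basicConfig(format="%(message)s", level=logging.INFO)
logging.info(semantic_event("checkout", "auth_confirmed",
                            customer_id=42, response_time_ms=118))
```

In a real system the correlation_id would be received from the upstream caller rather than generated per event, so one transaction can be traced across services.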
18. Combating Knowledge Fragmentation
Reality - not all logs can be logged semantically, at least not without significant refactoring.
Splunk Solution - Data Models
19. Data Models
• Enable go go gadget "schema on the fly"
• Hierarchically structured search-time mapping of semantic
knowledge.
• Accessed via Datasets tab in Splunk 6.5
20. Example: Splunk CIM
• Splunk Common Information Model (CIM)
• Collection of Data Models based on subject area
• Shared Semantic model
• Support consistent and normalised treatment of data
• Enables third-party apps to be integrated with your data.
• Reference Tables:
http://docs.splunk.com/Documentation/CIM/4.6.0/User/Howtousethesereferencetables
22. Pivot
• UI Developed to enable the creation of analytics off structured data
models
• Supports:
• Tables
• Charts - Line, Scatter, Column, Bar, Bubble, Pie
• Single Value Visualisations
24. Performance
• Data Models can be accelerated, which can lead to:
• Decreased search optimisation effort
• Decreased dashboard optimisation effort
• Increased storage requirements
• Speed-ups of up to 1000x
• Speed is dependent on the cardinality of the data
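Accelerated data models are queried with tstats against the acceleration summaries rather than the raw events. A sketch, assuming an accelerated CIM Web data model (response_time, status and site are CIM Web fields):

```
| tstats summariesonly=true avg(Web.response_time) as avg_response_time
    from datamodel=Web where Web.status=200 by Web.site
```

summariesonly=true restricts the search to pre-summarised data, which is where the large speed-ups come from.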
25. Notable Splunk Apps on CIM
• Splunk Enterprise Security
• Splunk PCI Compliance
• Insight Engines - Search Splunk using Natural Language
26. Adoption
• Most users complain about backlogs when onboarding data
• Automating the onboarding process isn’t as easy as it sounds. Data Validation is key to deriving value.
• Universal Forwarder:
• Standardise Log Locations
• Standardise Time Stamps
• HTTP Event Collector:
• Send data directly from your application to Splunk
• Utilise Indexer Acknowledgement
• Notable implementations:
• Docker - Splunk Logging Driver
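Sending directly via HEC needs only an HTTP POST of a JSON envelope. A minimal stdlib sketch (the endpoint URL, token, index and sourcetype are hypothetical placeholders):

```python
import json
import urllib.request

HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # hypothetical endpoint
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"  # hypothetical token

def build_hec_payload(event, index="my_product", sourcetype="app:json"):
    """Wrap an application event in the envelope the HEC event endpoint expects."""
    return json.dumps({
        "event": event,
        "index": index,
        "sourcetype": sourcetype,
    })

def send_event(event):
    """POST one event to HEC, authenticating with the token header."""
    req = urllib.request.Request(
        HEC_URL,
        data=build_hec_payload(event).encode("utf-8"),
        headers={"Authorization": f"Splunk {HEC_TOKEN}"},
    )
    # With indexer acknowledgement enabled on the token, the response
    # carries an ackId that can be polled to confirm the event was indexed.
    return urllib.request.urlopen(req)
```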
27. Newish Splunk Features
• Machine Learning Toolkit
• Comes with built-in assistants for supported algorithms
• Extend algorithms available - python sci-kit learn
• ITSI
• Modular Visualisations
• New Custom Search Command Creation Capability
• TSIDX Reduction - Decrease Storage Costs