2. Analytics at Splunk
• Analytics can be anywhere
– It’s not a separate department
• High value use cases
• Solve critical business problems
• Persona-based approach
• Enterprise-wide user adoption
• Continuous Business Insights
• Drive decision making
8. Intro to Personas
• Persona is a concept we use to define various user types in a Splunk
deployment.
• This is different than a Splunk role.
• Core IT personas (e.g. SysAdmins, Developers and Splunk Admins)
keep systems running, fix them when they break and plan for
capacity
• As your Splunk deployment grows out of Core IT…
– Each business unit has its own set of personas
– They have unique problems to solve and their own preferred ways to interact with or consume data
9. Building Data Science & Analytics Teams
There is no “one size fits all” data scientist. Data Science & Analytics teams
are made up of people with complementary skill sets.
Source: Schutt & O’Neil. Doing Data Science. 2013
11. Developing for the Business: Gather Requirements
• What is the question I’m trying to answer?
– What is their Business Problem?
– What department are we dealing with?
– Where do they fit in the organization?
– Who is the end user primary contact?
– Do they have a (trained) power user?
– Engagement/support model: self-service, full change control with formal requests, or a 2-hour power session?
12. Developing for the Business: Get relevant data
• Where is the data that will help me answer the question?
• What are the relevant fields and what is the best way to
retrieve them?
• What data sources drive those constructs?
• Is the primary data in Splunk?
• Can I enrich Splunk data sources with external data feeds and
provide mash-ups?
• Should I be replacing legacy SQL queries with DBConnect?
• Should I index DBConnect data or just use it as a lookup?
16. Anomaly Detection & Clustering
• Anomaly Detection is one of Splunk’s most common use cases:
– Faster-than-human transactions
– Intrusion & insider threat detection
– High-value customer purchase patterns
• Lots of solutions for Anomaly Detection:
– Clustering: cluster, kmeans, Event Patterns tab
– AD: anomalies, anomalousvalue, outlier
– Alert on rate of statistical outliers (e.g. 5% → 15% triggers alert)
– Advanced threat detection (Enterprise Security)
• Integrate high-risk anomalies into incident review
17. Data Visualization
Data Viz: The creation and study of the visual representation of data.
• After processing, all data must be consumed:
– Machines can consume any kind of data
– People must visualize or listen to the data
• Splunk helps deliver actionable insights:
– Out-of-the-box charts & tables
– Easy-to-customize D3 visualizations
– Drilldown & form inputs enable interactivity
Source: Satoshi’s Custom Visualizations app
https://splunkbase.splunk.com/app/2717/
18. Custom Viz: Sankey Chart
• Sankey charts illustrate flows through multiple stages
– You choose nodes & edges
• Lots of use cases:
– Customer paths through website
– Order tracking through system
– Any type of process flow
• Drilldown to go further:
– Why do these flows yield purchases?
– Which edges have high traffic?
– Where are the bottlenecks?
Nodes = stations. Edges = routes
Citibike data from:
http://www.citibikenyc.com/system-data
19. Predictive Analytics
Use predict to forecast time series into the future.
• Implements a Kalman filter to identify seasonal trends.
– Best fit line & uncertainty envelope
• Lots of applications:
– Forecast revenue & other KPIs
– Estimate MTTR & server outages
– Dynamic baselining
– Capacity planning (AWS App)
– Security threats (Enterprise Security)
• Remember: the future is always uncertain…
21. Growing beyond IT: Call to action!
• CIO and CDO care about Actionable Insights
• Build some Executive dashboards
• Crossing silos can be tricky
• Organization, communication, and documentation help immensely!
22. Next Steps
• Reach out to your local technical team!
– Your local Sales Engineers are happy to help
– Analytics SMEs are available for advanced use cases
– Analytics Specialist team is available for escalations
• We’ve got you covered. We’re here to help!
Unlike Security, Analytics is everywhere. It depends on who you are talking to and what problems they have.
Not just
Data mash-ups
Financial/KPI’s/Metrics
Ops
Social
ODBC
DB Connect
Modular Inputs
Streams
MINT
Splunk 6 takes large-scale machine data analytics to the next level by introducing three breakthrough innovations:
Pivot – opens up the power of Splunk search to non-technical users with an easy-to-use drag and drop interface to explore, manipulate and visualize data
Data Model – defines meaningful relationships in underlying machine data and makes the data more useful to a broader base of non-technical users
Analytics Store – patent-pending technology that accelerates data models by delivering extremely high-performance data retrieval for analytical operations, up to 1000x faster than Splunk 5
Let’s dig into each of these new features in more detail.
ODBC
DB Connect
Talk about Data Sift Modular Input
Streams
MINT
This slide demonstrates the collaborative nature of Data Science & Analytics teams. There is no “one size fits all” data professional. Data Science and Analytics are cross-functional endeavors, and you need people from lots of different backgrounds.
Math & Stats, some Machine Learning & Comp Sci – this person is a good Data Researcher to have onboard. The green one here is stronger in CS & Programming, and is more of a Data Developer. The red one here has a ton of Domain Expertise, Communication and Data Viz skills, and is a great Data Businessperson. Together these three form a really solid Data Science team.
Mention Splunk assets
DBConnect
ODBC
(e.g. Add a column, filter a few rows based on field x, compute sum of field volume and split by product)
Definition: an anomaly is an event which is vastly dissimilar to other events. Note: “dissimilarity” is in the eye of the beholder, and there are lots of different similarity metrics. If you spot something which might be an anomaly, probe deeper.
Example: fraudulent transactions.
First, we want to identify metrics of interest. Events are high-dimensional data objects, and metrics are one-dimensional projections. It’s not enough to just look at one metric: we need to keep track of multiple metrics simultaneously.
For each of these metrics, we want to find those events that are highly dispersive: i.e., very far away from central behavior.
Non-average: find those events which fall more than a few standard deviations away from the mean. From the empirical (68-95-99.7) rule, if we have normally distributed data, we know that about 99.7% of the data should fall within three standard deviations of the mean. Note: if you have 1,000,000 transactions, this means that ~3,000 transactions are more than three standard deviations away! That’s still a lot, so be careful.
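The arithmetic above can be checked with a short sketch. It assumes perfectly normal data, and the function names are made up for illustration. Using the exact normal tail instead of the rounded 99.7% figure gives roughly 2,700 expected events beyond three standard deviations, the same order of magnitude as the ~3,000 quoted above.

```python
import math


def normal_tail_fraction(k: float) -> float:
    """Fraction of a normal distribution more than k standard deviations
    from the mean, counting both tails."""
    return math.erfc(k / math.sqrt(2))


def expected_outliers(n_events: int, k: float = 3.0) -> float:
    """Expected number of events beyond k standard deviations,
    assuming the metric is normally distributed (hypothetical helper)."""
    return n_events * normal_tail_fraction(k)


print(round(expected_outliers(1_000_000)))  # prints 2700
```

Even with a strict three-sigma threshold, a large event volume still produces thousands of candidates, which is why a single metric is rarely enough on its own.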
Also keep in mind that financial data is often heavy-tailed, so the normal assumption may not hold. For example, my transactions aren’t a uniform process: I mostly make small purchases but occasionally I’ll make a very large purchase.
Non-typical: find those events which fall far outside the IQR. Note: by definition, the IQR only captures 50% of the data, so we don’t want to set a trigger for outside-IQR! But we may want to flag anything more than 1.5 * IQR below the first quartile or above the third quartile, or maybe everything outside of the 10th to 90th percentile range.
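A minimal sketch of the 1.5 * IQR rule, using only the Python standard library. The purchase amounts are made up and the function name is hypothetical. Quartiles can be computed in several slightly different ways, so the fences may shift a little between tools.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Made-up purchase amounts: mostly small, with one very large purchase.
purchases = [12, 15, 9, 14, 11, 13, 10, 16, 12, 950]
print(iqr_outliers(purchases))  # prints [950]
```

Because the fences are built from quartiles rather than the mean and standard deviation, the one huge purchase does not distort the threshold used to detect it.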
Apply to financial data.
Also: feel data with mobile or watch notifications
Here’s the predict command in action, applied to Lending Club Denied Loans data. This implements a Kalman filter, which captures the trends and fluctuations of the data, and forecasts them 2 years into the future. Notice something funny with this algorithm: the forecast starts to get periodic. The algorithm can only generalize from what it knows, so you should think of the thick line as a “best guess” given the past data. We actually expect the real trajectory to bounce around this “uncertainty envelope”.
Crazy dip in # of denied loans in November 2013
sourcetype=lending_club_denied_loan | timechart span=7d count | predict count future_timespan=104
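To give a feel for the "best guess plus uncertainty envelope" idea, here is a minimal sketch of the simplest Kalman filter, a local-level model. It is not the predict command's implementation, it does not model seasonality, and the noise variances q and r, the function name and the weekly counts are all made-up assumptions.

```python
import math


def local_level_forecast(series, horizon, q=1.0, r=10.0):
    """Filter a series with a local-level Kalman filter, then forecast.

    q is the assumed process-noise variance and r the assumed
    observation-noise variance (both hypothetical tuning values).
    Returns a list of (best_guess, lower, upper) tuples, one per step.
    """
    level, var = series[0], r  # initialise from the first observation
    for y in series[1:]:
        var_pred = var + q                  # predict: uncertainty grows
        gain = var_pred / (var_pred + r)    # Kalman gain
        level = level + gain * (y - level)  # update toward the observation
        var = (1.0 - gain) * var_pred       # update: uncertainty shrinks
    forecasts = []
    for h in range(1, horizon + 1):
        # Forecast variance grows with each step into the future.
        spread = 1.96 * math.sqrt(var + h * q + r)
        forecasts.append((level, level - spread, level + spread))
    return forecasts


weekly_counts = [100, 104, 98, 110, 107, 115, 111, 120]  # made-up data
for best, low, high in local_level_forecast(weekly_counts, horizon=4):
    print(f"{best:.1f} [{low:.1f}, {high:.1f}]")
```

The envelope widens the further out the forecast goes, which is the point of the reminder that the future is always uncertain.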
Shout outs to other talks
Splunk for Data Scientists: Tom and Olivier
Advanced Use Cases for Analytics: Archana and James
Do you want cool analytics insights?
How are customers using our product?
How do failed or degraded transactions impact customers?
How can I gain Operational Visibility into concurrent transactions?