With more than 14,000 customers in 110+ countries, Splunk is the market leader in analyzing machine data to deliver operational intelligence for security, IT and the business. Our rapid growth as a company meant that our Infrastructure Engineering Team, responsible for all the common tooling, build and test systems, and frameworks used by Splunk engineers, was bogged down with a sprawl of virtual machines and physical servers that had become incredibly difficult to manage. And as our customers' demand for data has grown, testing at the scale of petabytes/day has become our new normal. We needed a reliable and scalable "Test Lab" for functional and performance testing.
With Docker Enterprise Edition, our engineers can create small test stacks on their laptops just as easily as they create multi-petabyte stacks in our Test Lab. Support for Windows, Role-Based Access Control, and vendor support for both the orchestration platform and the container engine were key factors in choosing Docker over other solutions.
In this talk, we will cover the architecture, tooling, and frameworks we built to manage our workloads, which have grown to run on over 600 bare-metal servers, with tens of thousands of containers being created every day. We will share the lessons learned from running at scale. Lastly, we will demonstrate how we use Splunk to monitor and manage Docker Enterprise Edition.
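As context for the monitoring piece mentioned above: one common way to wire container logs into Splunk is Docker's built-in `splunk` logging driver, which forwards logs to a Splunk HTTP Event Collector. The HEC token and URL below are placeholders, not Splunk's actual endpoints:

```shell
# Forward a container's logs to Splunk via Docker's built-in "splunk"
# logging driver. Token and URL are placeholders for a real HEC endpoint.
docker run -d \
  --log-driver=splunk \
  --log-opt splunk-token=00000000-0000-0000-0000-000000000000 \
  --log-opt splunk-url=https://splunk.example.com:8088 \
  nginx
```

The same options can be set cluster-wide as defaults in `/etc/docker/daemon.json` so every container is monitored without per-run flags.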
3. Agenda
● Splunk’s challenges
● Docker Enterprise Edition
● Splunk + Docker EE
● Demo
● Before & After Docker Metrics
● Lessons Learned
4. Infrastructure
● Large Scale Scrum
“LeSS Huge” Model
● Engineering
Infrastructure Area
○ 13 teams (about 100 engineers)
in California & Shanghai
○ Test quality, automation, common
tools, frameworks, build systems
○ Working on an overhaul of all test
infrastructure, including CI/CD
5. Splunk Challenges
Test Server Sprawl
● 100+ bare-metal servers for functional and performance testing
● Too many frameworks, manual work
● Many days to set up a single test
Multiple CI Environments
● Bamboo vs. Jenkins
● Plans & agents managed by hand
● Physical server agents, very poor scalability
● LONG wait times, build bottleneck, not enough capacity
6. Wiki-Managed Infrastructure

| Host (access) | iLO (access) | Hardware | Owner | Reserved until |
| --- | --- | --- | --- | --- |
| soln-perf66.sv.splunk.com (root) | soln-perf66.ilo.sv.splunk.com (AD credential) | 2x 8-core Xeon 2.40 GHz, 32 GB RAM, 2x 1 TB SATA 6G HDD (RAID 1) | Allan (large ES env) | forever |
| soln-perf67.sv.splunk.com (root) | soln-perf67.ilo.sv.splunk.com (AD credential) | 2x 8-core Xeon 2.40 GHz, 32 GB RAM, 2x 1 TB SATA 6G HDD (RAID 1) | Mike – UCP dev cluster | forever |
| soln-perf68.sv.splunk.com (root) | soln-perf68.ilo.sv.splunk.com (AD credential) | 2x 8-core Xeon 2.40 GHz, 32 GB RAM, 2x 1 TB SATA 6G HDD (RAID 1) | Dhananjay | 4/26/2017 |
| soln-perf69.sv.splunk.com (root) | soln-perf69.ilo.sv.splunk.com (AD credential) | 2x 8-core Xeon 2.40 GHz, 32 GB RAM, 2x 1 TB SATA 6G HDD (RAID 1) | Stream – Manan/Vladimir | forever |
8. Selling Docker Internally
● Why?
○ Reduce waste of underutilized servers
○ Automate manual testing work
○ Enables more testing, higher quality
● Start small, deliver value quickly with each
iteration
● Engaged the CPO, CFO & CEO as scale & scope grew
9. Why Docker EE
● Windows Server 2016 & Linux
● Role Based Access Control
● Compose & common Docker API
● End-user experience: Docker on
Desktop and Docker on Infrastructure
● End-to-end support from vendor
○ Around the corner (literally)
13. Build a Secure Software Supply Chain (CaaS)
[Diagram: developers push traditional, third-party (Docker Store), and microservices images into the image registry, where they are security scanned & signed; IT operations manage deployment through the control plane]
14. Hybrid Applications

services:
  database:
    image: sixeyed/atsea-db:mssql
    ports:
      - mode: host
        target: 1433
    networks:
      - atsea
    deploy:
      endpoint_mode: dnsrr
      placement:
        constraints:
          - 'node.platform.os == windows'
  appserver:
    image: sixeyed/atsea-app:mssql
    ports:
      - target: 8080
        published: 8080
    networks:
      - atsea
    deploy:
      placement:
        constraints:
          - 'node.platform.os == linux'
networks:
  atsea:

● node.platform.os is a built-in label that can be used for workload placement
● Windows and Linux nodes can share a common overlay network
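A stack file like the one on this slide can be deployed to a swarm with `docker stack deploy`; the file name and stack name here are assumptions for illustration:

```shell
# Deploy the hybrid stack (assumes the compose file is saved as
# docker-stack.yml and the client is talking to a swarm manager).
docker stack deploy -c docker-stack.yml atsea

# Check which node (Windows or Linux) each task landed on.
docker stack ps atsea
```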
15. Enhanced RBAC
● Grant = Subject + Role + Collection (1:1:1 mapping of subject to role to collection)
○ Subject: who (orgs, teams, users) can perform an action
○ Role: what they can do
○ Collection: where work can be done
16. RBAC for Nodes
Support secure multi-tenancy across multiple teams through node-based isolation and segregation

Key features:
● Enforce node affinity and node anti-affinity rules to allow certain users/teams/orgs to deploy within a subset of nodes or outside certain nodes (e.g. production nodes)
● Set up different security zones within the same cluster to isolate and segregate access to protected information

Benefits:
● Support multiple teams within the same cluster while providing physical separation and isolation
● Prevent "noisy neighbors" by limiting a team's resources to approved nodes
● Meet compliance and regulatory requirements by isolating sensitive workloads to certain nodes and limiting access to those nodes

[Diagram: worker nodes partitioned into Prod, Dev, and PHI zones, accessed by Dev Team A, Dev Team B, Ops, and SecOps teams]
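At the engine level, the node-based placement described on this slide can be sketched with node labels and scheduling constraints (UCP layers its collection-based RBAC on top of this mechanism); the node, zone, and service names below are hypothetical:

```shell
# Tag a worker node as belonging to the "prod" zone
# (hypothetical node name and label).
docker node update --label-add zone=prod worker-1

# Constrain a service so its tasks schedule only onto prod-zone nodes.
docker service create --name web \
  --constraint 'node.labels.zone == prod' \
  nginx
```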
18. Demo
● Windows Server 2016 & Linux
● Role Based Access Control
● Compose & common Docker API
● End-user experience: Docker on
Desktop and Docker on Infrastructure
● End-to-end support from vendor
○ Around the corner (literally)
24. New CI/CD Platform
[Diagram: Jenkins-driven pipeline running on UCP]
1. Git push
2. Notify Jenkins to build
3. Stage Build (runner, shared volume)
4. Stage Build (runner)
5. Stage AppCert (runner, app cert)
6. Stage Publish (runner → Artifactory)
● Splunkins as report service
● Build status notified at each stage
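The build-and-publish stages in the pipeline above might reduce to something like the following on a runner; the registry hostname and image name are illustrative, not Splunk's actual configuration:

```shell
set -e
# Tag the image with the commit that triggered the build
# (hypothetical registry and image name).
IMAGE=artifactory.example.com/ci/myapp
TAG=$(git rev-parse --short HEAD)

docker build -t "$IMAGE:$TAG" .   # Stage Build
docker push "$IMAGE:$TAG"         # Stage Publish to Artifactory's Docker registry
```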
25. Results
● Test setup time reduced from days to
minutes
○ Far more efficient use of the
hardware we have
○ Enables us to run more tests,
more frequently, earlier in
release cycles
● Eliminated CI/CD Bottlenecks
● Example: DMA performance
improvements in Splunk 7.0
28. Challenges & Journey
Internal:
○ Basic container knowledge
■ Trained developers and ops
○ "Let's blame the new guy"
■ Real source: bad drives, memory chips, networking
29. Challenges & Journey
Bugs found & fixed:
○ Placement strategies (anti-affinity, binpacking)
○ LDAP (AD) & RBAC
○ Performance & timeouts at scale of 300-600 servers
○ Worked with support to prioritize
30. Lessons Learned
● Test things before you update production
● Perform root cause analysis of problems
● Start small, over-communicate & set realistic expectations
● Monitor & respond quickly to problems
● Deliver what your customers ask for
● Invite contributions & participation