With more than 14,000 customers in 110+ countries, Splunk is the market leader in analyzing machine data to deliver operational intelligence for security, IT and the business. Our rapid growth as a company meant that our Infrastructure Engineering Team, responsible for all the common tooling, build and test systems, and frameworks used by Splunk engineers, was bogged down with a sprawl of virtual machines and physical servers that had become incredibly difficult to manage. And as our customers' demand for data has grown, testing at the scale of petabytes/day has become our new normal. We needed a reliable and scalable "Test Lab" for functional and performance testing.
With Docker Enterprise Edition, our engineers can create small test stacks on their laptops just as easily as they create multi-petabyte stacks in our Test Lab. Support for Windows, Role-Based Access Control, and vendor support for both the orchestration platform and the container engine were key factors in choosing Docker over other solutions.
In this talk, we will cover the architecture, tooling, and frameworks we built to manage our workloads, which have grown to run on over 600 bare-metal servers, with tens of thousands of containers being created every day. We will share the lessons learned from running at scale. Lastly, we will demonstrate how we use Splunk to monitor and manage Docker Enterprise Edition.
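As context for the monitoring piece mentioned above: one common way to wire container logs into Splunk is Docker's built-in `splunk` logging driver, which forwards logs to a Splunk HTTP Event Collector. The HEC token and URL below are placeholders, not Splunk's actual endpoints:

```shell
# Forward a container's logs to Splunk via Docker's built-in "splunk"
# logging driver. Token and URL are placeholders for a real HEC endpoint.
docker run -d \
  --log-driver=splunk \
  --log-opt splunk-token=00000000-0000-0000-0000-000000000000 \
  --log-opt splunk-url=https://splunk.example.com:8088 \
  nginx
```

The same options can be set cluster-wide as defaults in `/etc/docker/daemon.json` so every container is monitored without per-run flags.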
3. Agenda
● Splunk’s challenges
● Docker Enterprise Edition
● Splunk + Docker EE
● Demo
● Before & After Docker Metrics
● Lessons Learned
4. Infrastructure
● Large Scale Scrum
“LeSS Huge” Model
● Engineering
Infrastructure Area
○ 13 teams (about 100 engineers)
in California & Shanghai
○ Test quality, automation, common
tools, frameworks, build systems
○ Working on an overhaul of all test
infrastructure, including CI/CD
5. Splunk Challenges
Test Server Sprawl
● 100+ bare-metal servers for functional and performance testing
● Too many frameworks, manual work
● Many days to set up a single test
Multiple CI Environments
● Bamboo vs. Jenkins
● Plans & agents managed by hand
● Physical server agents, very poor scalability
● LONG wait times, build bottleneck, not enough capacity
6. Wiki-Managed Infrastructure

| Host (access) | iLO (access) | Hardware | Owner | Reserved until |
| --- | --- | --- | --- | --- |
| soln-perf66.sv.splunk.com (root) | soln-perf66.ilo.sv.splunk.com (AD credential) | 2x 8-core Xeon 2.40 GHz, 32 GB RAM, 2x 1 TB SATA 6G HDD (RAID 1) | Allan (large ES env) | forever |
| soln-perf67.sv.splunk.com (root) | soln-perf67.ilo.sv.splunk.com (AD credential) | 2x 8-core Xeon 2.40 GHz, 32 GB RAM, 2x 1 TB SATA 6G HDD (RAID 1) | Mike – UCP dev cluster | forever |
| soln-perf68.sv.splunk.com (root) | soln-perf68.ilo.sv.splunk.com (AD credential) | 2x 8-core Xeon 2.40 GHz, 32 GB RAM, 2x 1 TB SATA 6G HDD (RAID 1) | Dhananjay | 4/26/2017 |
| soln-perf69.sv.splunk.com (root) | soln-perf69.ilo.sv.splunk.com (AD credential) | 2x 8-core Xeon 2.40 GHz, 32 GB RAM, 2x 1 TB SATA 6G HDD (RAID 1) | Stream – Manan/Vladimir | forever |
8. Selling Docker Internally
● Why?
○ Reduce waste of underutilized servers
○ Automate manual testing work
○ Enables more testing, higher quality
● Start small, deliver value quickly with each
iteration
● Engaged the CPO, CFO & CEO as scale & scope grew
9. Why Docker EE
● Windows Server 2016 & Linux
● Role Based Access Control
● Compose & common Docker API
● End-user experience: Docker on
Desktop and Docker on Infrastructure
● End-to-end support from vendor
○ Around the corner (literally)
13. Build a Secure Software Supply Chain (CaaS)
[Diagram: developers push traditional, third-party (Docker Store), and microservices images into the image registry, where they are security scanned & signed; IT operations manage deployment through the control plane]
14. Hybrid Applications

services:
  database:
    image: sixeyed/atsea-db:mssql
    ports:
      - mode: host
        target: 1433
    networks:
      - atsea
    deploy:
      endpoint_mode: dnsrr
      placement:
        constraints:
          - 'node.platform.os == windows'
  appserver:
    image: sixeyed/atsea-app:mssql
    ports:
      - target: 8080
        published: 8080
    networks:
      - atsea
    deploy:
      placement:
        constraints:
          - 'node.platform.os == linux'
networks:
  atsea:

● node.platform.os is a built-in label that can be used for workload placement
● Windows and Linux nodes can share a common overlay network
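A stack file like the one on this slide can be deployed to a swarm with `docker stack deploy`; the file name and stack name here are assumptions for illustration:

```shell
# Deploy the hybrid stack (assumes the compose file is saved as
# docker-stack.yml and the client is talking to a swarm manager).
docker stack deploy -c docker-stack.yml atsea

# Check which node (Windows or Linux) each task landed on.
docker stack ps atsea
```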
15. Enhanced RBAC
● Grant = Subject + Role + Collection (1:1:1 mapping of subject to role to collection)
○ Subject: who (orgs, teams, users) can perform an action
○ Role: what they can do
○ Collection: where work can be done
16. RBAC for Nodes
Support secure multi-tenancy across multiple teams through node-based isolation and segregation

Key features:
● Enforce node affinity and node anti-affinity rules to allow certain users/teams/orgs to deploy within a subset of nodes or outside certain nodes (e.g. production nodes)
● Set up different security zones within the same cluster to isolate and segregate access to protected information

Benefits:
● Support multiple teams within the same cluster while providing physical separation and isolation
● Prevent "noisy neighbors" by limiting a team's resources to approved nodes
● Meet compliance and regulatory requirements by isolating sensitive workloads to certain nodes and limiting access to those nodes

[Diagram: worker nodes partitioned into Prod, Dev, and PHI zones, accessed by Dev Team A, Dev Team B, Ops, and SecOps teams]
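At the engine level, the node-based placement described on this slide can be sketched with node labels and scheduling constraints (UCP layers its collection-based RBAC on top of this mechanism); the node, zone, and service names below are hypothetical:

```shell
# Tag a worker node as belonging to the "prod" zone
# (hypothetical node name and label).
docker node update --label-add zone=prod worker-1

# Constrain a service so its tasks schedule only onto prod-zone nodes.
docker service create --name web \
  --constraint 'node.labels.zone == prod' \
  nginx
```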
18. Demo
● Windows Server 2016 & Linux
● Role Based Access Control
● Compose & common Docker API
● End-user experience: Docker on
Desktop and Docker on Infrastructure
● End-to-end support from vendor
○ Around the corner (literally)
24. New CI/CD Platform
[Diagram: Jenkins-driven pipeline running on UCP]
1. Git push
2. Notify Jenkins to build
3. Stage Build (runner, shared volume)
4. Stage Build (runner)
5. Stage AppCert (runner, app cert)
6. Stage Publish (runner → Artifactory)
● Splunkins as report service
● Build status notified at each stage
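The build-and-publish stages in the pipeline above might reduce to something like the following on a runner; the registry hostname and image name are illustrative, not Splunk's actual configuration:

```shell
set -e
# Tag the image with the commit that triggered the build
# (hypothetical registry and image name).
IMAGE=artifactory.example.com/ci/myapp
TAG=$(git rev-parse --short HEAD)

docker build -t "$IMAGE:$TAG" .   # Stage Build
docker push "$IMAGE:$TAG"         # Stage Publish to Artifactory's Docker registry
```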
25. Results
● Test setup time reduced from days to
minutes
○ Far more efficient use of the
hardware we have
○ Enables us to run more tests,
more frequently, earlier in
release cycles
● Eliminated CI/CD Bottlenecks
● Example: DMA performance
improvements in Splunk 7.0
28. Challenges & Journey
Internal:
○ Basic container knowledge
■ Trained developers and ops
○ "Let's blame the new guy"
■ Real source: bad drives, memory chips, networking
29. Challenges & Journey
Bugs found & fixed:
○ Placement strategies (anti-affinity, binpacking)
○ LDAP (AD) & RBAC
○ Performance & timeouts at scale of 300-600 servers
○ Worked with support to prioritize
30. Lessons Learned
● Test things before you update production
● Perform root cause analysis of problems
● Start small, over-communicate & set realistic expectations
● Monitor & respond quickly to problems
● Deliver what your customers ask for
● Invite contributions & participation