DevOps in a Regulated and Embedded Environment (AgileDC)
1. © COPYRIGHT 2016 COVEROS, INC. ALL RIGHTS RESERVED.
Agility. Security. Delivered.
DevOps in a Regulated and
Embedded Environment
By: Arjun Comar
(Was DevOps on a Legacy Project)
twitter: @arjuncomar email: arjun.comar@coveros.com
2. Agenda
• About Me
• Agile, DevOps, and Medical Devices: What’s the Problem?
• Git Flow in a Regulated World
• Expect to Deploy
• Scaling for Success and Resource Management
• Questions
3. About Me
• B.S. in Computer Science from the Rose-Hulman Institute of Technology
• Worked on everything from the Linux
kernel to computer vision.
• Interested in software quality and
correctness.
• Been with Coveros for ~2.5 years.
• Run the local HaskellDC meetup group.
4. About Coveros
• Coveros builds security-critical applications using
agile methods.
• Coveros Services
• Agile transformations
• Agile development and testing
• DevOps and continuous integration
• Application security analysis
• Agile & Security training
• Government qualifications
• DCAA approved rates and accounting
• TS facility clearance
Areas of Expertise
5. Select Clients
6. Medical Devices and the Law
• Writing the code isn’t sufficient; release requires regulatory approval.
• Approval is per feature (epic)
• Contingent on development, testing, risk mitigation, etc.
• We want short-lived branches, but…
• If we don’t get approval for one feature, business still wants to release
the others
• Unmerge all the feature branches that went into an epic?
• Further requirements around documentation, especially:
• Design
• Testing
• Risk Management
7. Legacy Problems
• C code, embedded device target
• cross compilation: Windows -> QNX
• Some modules only built on WinXP
• Manual build, deploy, test process
• Custom hardware, custom firmware
• Old codebase, not written to be unit tested
• Unit test execution requires target environment
• Codebase of roughly 200 kLOC
• Hardware platform ~25 years old
8. Integration and Deployment
• Manual builds, deploy to unit test?
• Unmaintained deployment scripts
• Written by a contractor in ksh
• Last maintainer had already left the company
• Working deployments flashed the unit with a USB stick and physical dongle
• Rewrite with Chef? ...Ansible? … Bash?
• try: sh run over telnet
• No ruby, python, perl, bash, ssh, dhcp
• Network deployments/updates to a device that goes in a human
being…?
9. Feedback Cycles
• Deployments took ~30 minutes and required physical interaction
through the process
• Testing involved long protocols with detailed and very particular
steps
• A test pass took the test team ~5-6 weeks: sometimes as many as 8, never fewer than 3-4.
• Release cycle on the order of years.
10. Resource Needs and Team Size
• Business wanted multiple features in development in parallel
• Different tests take different lengths of time to run
• even when automated
• seconds -> weeks
• Business needed 4 teams like the one they had
• Continuous integration targets, unit test targets, deployment
testing targets, full functional test targets, partially automated test
targets
• Performance, reliability, security, durability, etc.?
11. Solutions
One thing at a time...
12. Git Flow in a Regulated World
13. Git Workflows
• Linux Kernel: benevolent dictator, many trusted lieutenants, an
insane number of contributors.
• GitHub: a single maintainer (or a small team of maintainers); contributors submit pull requests
• Corporate git usage: a trusted team of developers co-maintains a shared repository
14. Enter: Git Flow
15. But I can’t merge back daily...
• No, really. Daily merges back to develop mean that pulling an epic out requires a virtually impossible unmerge.
• Might be legally required not to go forward with a feature
• Can’t get approval until feature is developed and tested with
known risks documented and mitigated
• Business still wants to release what they can
16. Can’t not integrate...
• Long-lived lines of development, all separate
• Tested independently prior to release
• Business wants to release, so we integrate the necessary branches and…
• Disaster: merge conflicts, retest everything, unknown interactions everywhere
17. Extending the workflow to deal with regulation
• Extend the git flow model
• Keep epic-specific code in ‘develop/epic-name’ branches
• Use ‘feature/epic-name/feature-name’ branches for daily work
• Merge these back daily!
• Epic branches get merged back for a release
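The branch layout on this slide implies a simple rule for where each branch merges back. Here is a minimal Python sketch of that rule, assuming the naming scheme is followed exactly. The function name `merge_target` is hypothetical and not part of git flow or any tool.

```python
def merge_target(branch: str) -> str:
    """Return the branch that `branch` merges back into under the extended model."""
    parts = branch.split("/")
    if len(parts) == 3 and parts[0] == "feature":
        # feature/<epic>/<feature> merges back daily into its epic branch.
        return f"develop/{parts[1]}"
    if len(parts) == 2 and parts[0] == "develop":
        # develop/<epic> merges into develop only when the epic is released.
        return "develop"
    raise ValueError(f"not an epic or feature branch: {branch!r}")
```

For example, `merge_target("feature/dosing/ui")` gives `"develop/dosing"`, where `dosing` is a made-up epic name.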
18. Integrating Continuously
• Use tooling to manage the problem for you
• Have Jenkins (or your CI stack of choice) do builds by merging
develop with the epic branches first
• develop holds code that will be released, so features that conflict with it must be fixed
• Run the normal deployment and testing cycle on these builds
• merge conflicts are failed builds
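One way a CI job could treat merge conflicts as failed builds is to trial-merge develop with each epic branch before building. The Python sketch below is an illustration, not the pipeline from the talk. It assumes Git 2.38 or newer, where `git merge-tree --write-tree` exits 0 for a clean merge and 1 for a conflicted one. The helper names `git_merges_cleanly` and `check_epics` are hypothetical.

```python
import subprocess
from typing import Callable, Dict, Iterable


def git_merges_cleanly(base: str, branch: str, repo: str = ".") -> bool:
    """Trial-merge two refs without touching the working tree.

    Assumes Git 2.38+: `git merge-tree --write-tree` exits 0 for a clean
    merge, 1 when there are conflicts, and anything else on error.
    """
    result = subprocess.run(
        ["git", "-C", repo, "merge-tree", "--write-tree", base, branch],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode not in (0, 1):
        raise RuntimeError(f"git merge-tree failed for {base} and {branch}")
    return result.returncode == 0


def check_epics(
    epics: Iterable[str],
    merges_cleanly: Callable[[str, str], bool] = git_merges_cleanly,
    base: str = "develop",
) -> Dict[str, bool]:
    """Map each epic branch to True if it merges cleanly with `base`."""
    return {epic: merges_cleanly(base, epic) for epic in epics}
```

Any epic mapped to `False` would be reported as a failed build. The merge check is passed in as a parameter so the reporting logic can be exercised without a repository.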
19. Integrating even more continuously
• Still need to know if there are potential conflicts between epic branches
• fail early, fail often, right?
• Take all the epic branches and merge them with develop
• Run a full build/deploy/test cycle on this mess as well.
• Any failures found -> failed build
• If it doesn’t cleanly merge, we can’t release, right?
• The software should always be ready to release; make it a business
decision, not a technical one.
20. Digging deeper to unearth conflict
• Better error detection and reporting:
• If we merge everything together, the branches merged later look like they cause conflicts more often
• Branches that conflict exclude each other
• Find conflicting pairs and report them both as failed
• Conflicts may only show up with the interaction of 3+ branches
• But this gets exponentially harder to detect as the number of branches grows
21. Do what you can
• Merge all possible epic branch pairs together, track+report failures
• Report these failures once or the team will ignore you...
• Branches that cleanly merge with everything get merged together with develop and built
• This assesses the health of the software as it exists at this moment
• This might be expensive, so do it overnight.
• Shortcuts:
• If ‘A’ merges with ‘B’, then ‘B’ merges with ‘A’
• ‘A’ always merges with ‘A’
• (You only need the upper triangle of the n x n matrix: n(n-1)/2 trial merges)
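The pairwise check and its shortcuts can be sketched in a few lines of Python. This is an illustration under assumptions, not the job from the talk. `merges_cleanly` stands for whatever trial-merge check the CI server provides, and the function names are hypothetical. `itertools.combinations` visits each unordered pair once, which is the "top half of the matrix" shortcut.

```python
from itertools import combinations
from typing import Callable, List, Set, Tuple


def conflicting_pairs(
    branches: List[str],
    merges_cleanly: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Check each unordered pair once: n * (n - 1) / 2 trial merges."""
    return [
        (a, b)
        for a, b in combinations(branches, 2)
        if not merges_cleanly(a, b)
    ]


def clean_branches(
    branches: List[str], conflicts: List[Tuple[str, str]]
) -> List[str]:
    """Branches that appear in no conflicting pair, in their original order."""
    tainted: Set[str] = {name for pair in conflicts for name in pair}
    return [b for b in branches if b not in tainted]
```

Both branches of every conflicting pair are reported as failed, and the remaining clean branches are the candidates to merge with develop for the nightly build. A pairwise-clean set can still conflict when three or more branches interact, so the combined build remains the final check.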
22. This is a lot of work...
• Long-lived branches are hard to deal with.
• You could even go further and build the sets of conflicting
branches that can be merged together
• This is really hard; it’s easier to ask the team to fix the mess.
• If you don’t have to do it, don’t.
• You probably don’t unless regulatory constraints make you.
23. Expect to Deploy
What a lifesaver
24. Expect?
• Tcl scripting language used to automate interactive programs
• ...like telnet and ftp
• Was used to automate testing way back in the day
• Turns out to be rather perfect for scripting deployments, testing,
etc. in this tool restricted environment
• sh, ksh, telnet, ftp
• not: bash, python, ruby, ssh, perl, etc.
25. Wait, why not use ...
• Yes, we could have tried to beat that wall down
• Lots of effort/expertise to produce a working build of python for
the target environment
• QNX support would probably have been willing to help
• But loading new software onto the target environment to increase
its capabilities is fundamentally risky
• Business was understandably risk averse
• The DevOps team at this point was rather limited: me, myself, and I.
26. A little expect script
$ cat login.expect
#!/usr/bin/expect
# Usage: login.expect ADDR USER PASS
set timeout 20
set addr [lindex $argv 0]
set user [lindex $argv 1]
set pass [lindex $argv 2]
spawn telnet $addr
expect "login:"
send "$user\r"
expect "Password:"
send "$pass\r"
# Wait for the root shell prompt, then hand control to the user.
expect "#"
interact
27. Adding a little abstraction
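proc login { addr user pass } {
    # spawn_id is local to a proc by default; make it global so the
    # caller can keep talking to the session after login returns.
    global spawn_id
    spawn telnet $addr
    expect {
        timeout { send_user "Could not connect\n"; exit 1 }
        eof { send_user "Connection refused\n"; exit 1 }
        "login:"
    }
    send "$user\r"
    expect "Password:"
    send "$pass\r"
    expect {
        timeout { send_user "Failed to login.\n"; exit 1 }
        "#"
    }
}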
28. Separation of Concerns
• It only takes minor modifications to use the same logic to connect
to ftp
• Use ftp to upload the deployment archive and the sh install script
• Use telnet to set permissions and execute install script on archive
• Deployment logic is now separate from connecting, setup, etc.
• “talking to the target” vs “doing stuff on the target”
• This is exactly the separation chef/puppet/ansible provide
• (They provide a whole lot of other value as well, but it’s nice to recover any of it!)
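The split between "talking to the target" and "doing stuff on the target" can be illustrated with a small Python sketch. The talk implemented this in Expect over ftp and telnet. Every name here (`Transport`, `deploy`, `RecordingTransport`) and the remote paths are hypothetical, chosen only to show the shape of the separation.

```python
from typing import List, Protocol


class Transport(Protocol):
    """'Talking to the target': how files and commands reach the device."""

    def upload(self, local_path: str, remote_path: str) -> None: ...
    def run(self, command: str) -> None: ...


def deploy(target: Transport, archive: str, installer: str) -> None:
    """'Doing stuff on the target': the deployment steps themselves."""
    # Remote paths are placeholders for illustration only.
    target.upload(archive, "/tmp/deploy.tar")
    target.upload(installer, "/tmp/install.sh")
    target.run("chmod +x /tmp/install.sh")
    target.run("/tmp/install.sh /tmp/deploy.tar")


class RecordingTransport:
    """Stand-in transport that records calls instead of contacting a device."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def upload(self, local_path: str, remote_path: str) -> None:
        self.log.append(f"ftp put {local_path} {remote_path}")

    def run(self, command: str) -> None:
        self.log.append(f"telnet: {command}")
```

Because `deploy` only depends on the `Transport` interface, the same steps could run over a real ftp/telnet session or over a recording stub when testing the deployment logic itself.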
29. Towards a deployment framework
• How many environments like this are out there?
• limited tooling, embedded platform, etc.
• If there are a lot… we have the start of a deployment framework to
target these environments
• Dependencies are very minimal, can be used to target virtually
anything
• With work, we could get something idempotent with clean
modularity and composability.
• A whole lot of work… Is there a market that needs this?
30. Scaling for Success and Resource Management
31. Resource Needs
• Embedded device with potential hardware attachments for
particular tests -- virtualization is out.
• Unit tests need to run in the target environment so one target is
needed at a minimum just for rapid feedback CI.
• Basic integration testing (i.e. devint env) takes ~1 min to ~10 mins
• Fully automated functional testing takes ~10 mins to 1+ hours to
run (i.e. test env)
• Partially automated tests require interaction, need another target.
• Longer-term testing (e.g. stress, durability, performance) takes weeks and needs its own target.
• ~5 targets minimum to support development for basic CI/CD
32. Tackling Resource Allocation
• If a new build kicks off and reaches deployment testing while the previous round of smoke testing is still ongoing, what happens?
• Probably: the target gets bricked as OS-level code is updated while the machine is in use.
• Even if the pipeline is built carefully so these things can’t happen,
there’s always PEBKAC
• Deployment and testing tools need to be smart enough to check if
a console is available before attempting to use it
• We need a resource allocator...
33. Making a first pass
• Track the target state on the target
• Use an old Unix trick -- drop a lock file in a well-known spot, and
make tools attempt to acquire the lock before using the target
• Pros: Extremely simple to implement and use; it’s a really simple pair of shell scripts.
• Cons: If the lockfile isn’t cleaned up, the target is unavailable; if a tool (or user) doesn’t check for the lock, it can still cause problems. It’s hard to track which targets are in use where, and there’s no centralized management.
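The lock-file trick relies on creating the file atomically, so that two tools can never both believe they hold the target. In the talk the lock lived on the target and was managed by a pair of shell scripts. The Python sketch below shows the same idea against a local path, and every function name in it is hypothetical.

```python
import os
from contextlib import contextmanager


def acquire_lock(path: str, owner: str) -> bool:
    """Atomically create the lock file; return False if it already exists."""
    try:
        # O_CREAT | O_EXCL fails if the file exists, so two tools cannot both win.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock:
        lock.write(owner)  # record who holds the target, to help debugging
    return True


def release_lock(path: str) -> None:
    """Remove the lock file so the target becomes available again."""
    os.remove(path)


@contextmanager
def locked(path: str, owner: str):
    """Hold the lock for the duration of a `with` block, releasing on any exit."""
    if not acquire_lock(path, owner):
        raise RuntimeError(f"target is busy: {path}")
    try:
        yield
    finally:
        release_lock(path)
```

The context manager addresses the first "con" only partly: the lock is released when the block raises, but a killed process still leaves a stale file behind, which is one reason a centralized manager is attractive.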
34. Aside: Jenkins Pipeline
• Specifying the pipeline in Groovy instead of shell/Jenkins XML prevented a lot of bugs.
• acquireLock and releaseLock have simple contracts and provide strong guarantees with the try/finally idiom.
• This is tricky/hard to achieve with traditional Jenkins.
def locking(target, action) {
    // Acquire outside the try so a failed acquire never releases a lock that was not taken.
    acquireLock(target)
    try {
        action()
    } finally {
        releaseLock(target)
    }
}

downloadTests(latest)
locking(targetAddr) {
    deploy(targetAddr)
    runTests(targetAddr, myBuild, testTags)
}
35. Multiple teams, multiple workstreams
• Goal is to reduce cycle time. If one team has to wait for another team’s build to finish before getting feedback, we’re wasting time.
• Key takeaway: we can’t effectively share environments between
parallel streams of development.
• Business wanted ~4 streams of work progressing in parallel.
• Team needs to be able to support old releases via hotfixes (~2 old,
previous release, current stream of development).
• Hardware/firmware platform changes between releases
• Test automation team needs an environment to test their tests.
• DevOps team needs to be able to test pipeline changes.
• ~40 target machines to effectively support CI/CD pipeline.
36. That’s a lot of equipment...
• Where do you put it all?
• Shelving/rackspace, cooling, switches, networking…
• Units are expensive; if they aren’t in use/needed, business is going
to get annoyed.
• Hard to track utilization, load, etc. from a really decentralized
place.
• We might also be able to save money / use fewer targets if we’re
more intelligent about allocating them; i.e. allocate on demand.
• Centralization also means we can start hitting nice-to-haves:
• console access from the web browser for debugging
• status/health check daemon reporting to the manager
37. Centralized Resource Management
• Pool available targets, expose REST API to acquire a target for use,
release a target, check a target, etc.
• Track target status, usage metrics, target requester statistics in
backend database.
• Set up a simple frontend to display statistics about usage, provide
a manual form to acquire a target for manual/ad-hoc testing, etc.
• Like a library; acquire target for duration, get grumpy emails if it’s
not returned in time.
• Can be easily expanded to provide additional services over time.
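The library-style lending model can be sketched as a small in-memory pool. This is an assumption-laden illustration of the bookkeeping such a service might keep behind its REST API, not the system from the talk. `TargetPool`, `Lease`, and their methods are hypothetical names, and a real manager would persist leases in the backend database.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Lease:
    requester: str
    expires_at: float  # seconds, on the same clock passed as `now`


class TargetPool:
    """In-memory stand-in for the manager's backend."""

    def __init__(self, targets: List[str]) -> None:
        self._leases: Dict[str, Optional[Lease]] = {t: None for t in targets}

    def acquire(self, requester: str, duration: float,
                now: Optional[float] = None) -> Optional[str]:
        """Lease any free target for `duration` seconds, or None if all are busy."""
        now = time.time() if now is None else now
        for target, lease in self._leases.items():
            if lease is None:
                self._leases[target] = Lease(requester, now + duration)
                return target
        return None

    def release(self, target: str) -> None:
        """Return a target to the pool."""
        self._leases[target] = None

    def check(self, target: str) -> Optional[Lease]:
        """Report who holds a target, or None if it is free."""
        return self._leases[target]

    def overdue(self, now: Optional[float] = None) -> Dict[str, str]:
        """Targets held past their due time, mapped to who gets the grumpy email."""
        now = time.time() if now is None else now
        return {t: lease.requester for t, lease in self._leases.items()
                if lease is not None and lease.expires_at < now}
```

Each method maps naturally onto one endpoint (acquire, release, check), and `overdue` is the query a reminder job would run. Passing `now` explicitly keeps the lease logic easy to test.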
38. Lightning Quick Recap
• Integrate continuously to keep software testable, increase quality,
and build confidence.
• Prioritize the delivery of working software.
• Fail early, fail often.
• Make your tools serve your needs.
• Set yourself up for success -- plan ahead to cover scaling needs.
39. That was fast...
• There’s a lot more I’d love to talk about.
• Please feel free to ask me questions during the break or
afterwards.
• Thanks for your time!