This talk is a step-by-step, live-coding class on how to write automated tests for infrastructure code, including the code you write for use with tools such as Terraform, Kubernetes, Docker, and Packer. Topics covered include unit tests, integration tests, end-to-end tests, test parallelism, retries, error handling, static analysis, and more.
How to test infrastructure code: automated testing for Terraform, Kubernetes, Docker, Packer and more
1. Automated testing for:
✓ terraform
✓ docker
✓ packer
✓ kubernetes
✓ and more
Passed: 5. Failed: 0. Skipped: 0.
Test run successful.
How to test infrastructure code
22. We know how to write automated
tests for application code…
23. resource "aws_lambda_function" "web_app" {
function_name = var.name
role = aws_iam_role.lambda.arn
# ...
}
resource "aws_api_gateway_integration" "proxy" {
type = "AWS_PROXY"
uri = aws_lambda_function.web_app.invoke_arn
# ...
}
But how do you test that your Terraform code
deploys infrastructure that works?
24. apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world-app-deployment
spec:
  selector:
    matchLabels:
      app: hello-world-app
  replicas: 1
  template:
    metadata:
      labels:
        app: hello-world-app
    spec:
      containers:
        - name: hello-world-app
          image: gruntwork-io/hello-world-app:v1
          ports:
            - containerPort: 8080
How do you test that your Kubernetes code
configures your services correctly?
25. This talk is about how to write
tests for your infrastructure code.
47. Instead, break your infra code into
small modules and unit test those!
[Diagram: the architecture broken up into many small modules]
48. With app code, you can test units
in isolation from the outside world
49. resource "aws_lambda_function" "web_app" {
function_name = var.name
role = aws_iam_role.lambda.arn
# ...
}
resource "aws_api_gateway_integration" "proxy" {
type = "AWS_PROXY"
uri = aws_lambda_function.web_app.invoke_arn
# ...
}
But 99% of infrastructure code is about
talking to the outside world…
50. resource "aws_lambda_function" "web_app" {
function_name = var.name
role = aws_iam_role.lambda.arn
# ...
}
resource "aws_api_gateway_integration" "proxy" {
type = "AWS_PROXY"
uri = aws_lambda_function.web_app.invoke_arn
# ...
}
If you try to isolate a unit from the
outside world, you’re left with nothing!
51. So you can only test infra code by
deploying to a real environment
53. Therefore, the test strategy is:
1. Deploy real infrastructure
2. Validate it works
(e.g., via HTTP requests, API calls, SSH commands, etc.)
3. Undeploy the infrastructure
(So it’s really integration testing of a single unit!)
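The three steps above can be sketched in plain Go without any cloud account. This is only a minimal sketch: a local test server stands in for real infrastructure, and deploy, validate, and runStrategy are hypothetical names, not Terratest functions.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// deploy stands in for "terraform apply": it starts a local web server
// and returns its URL plus an undeploy function. (Hypothetical helper.)
func deploy() (url string, undeploy func()) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "Hello, World!")
	}))
	return srv.URL, srv.Close
}

// validate checks that the deployed service answers with the expected
// status code and body.
func validate(url string, wantStatus int, wantBody string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	if resp.StatusCode != wantStatus || string(body) != wantBody {
		return fmt.Errorf("got %d %q, want %d %q", resp.StatusCode, body, wantStatus, wantBody)
	}
	return nil
}

// runStrategy performs the three steps: deploy, validate, undeploy.
func runStrategy() error {
	url, undeploy := deploy() // 1. Deploy
	defer undeploy()          // 3. Undeploy, even if validation fails
	return validate(url, http.StatusOK, "Hello, World!") // 2. Validate
}

func main() {
	if err := runStrategy(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("PASS")
}
```

The defer is the important design choice: the undeploy step runs whether or not validation succeeds, so a failing test does not leave resources behind.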
54. Tools that help with this strategy:
Tool               Deploy/Undeploy  Validate  Works with
Terratest          Yes              Yes       Terraform, Kubernetes, Packer, Docker, Servers, Cloud APIs, etc.
kitchen-terraform  Yes              Yes       Terraform
InSpec             No               Yes       Servers, Cloud APIs
Serverspec         No               Yes       Servers
Goss               No               Yes       Servers
55. In this talk, we’ll use Terratest:
Tool               Deploy/Undeploy  Validate  Works with
Terratest          Yes              Yes       Terraform, Kubernetes, Packer, Docker, Servers, Cloud APIs, etc.
kitchen-terraform  Yes              Yes       Terraform
InSpec             No               Yes       Servers, Cloud APIs
Serverspec         No               Yes       Servers
Goss               No               Yes       Servers
56. Unit tests
1. Unit testing basics
2. Example: Terraform unit tests
3. Example: Docker/Kubernetes unit tests
4. Cleaning up after tests
57. Sample code for this talk is at:
github.com/gruntwork-io/infrastructure-as-code-testing-talk
58. An example of a Terraform
module you may want to test:
60. resource "aws_lambda_function" "web_app" {
function_name = var.name
role = aws_iam_role.lambda.arn
# ...
}
resource "aws_api_gateway_integration" "proxy" {
type = "AWS_PROXY"
uri = aws_lambda_function.web_app.invoke_arn
# ...
}
Under the hood, this example runs on
top of AWS Lambda & API Gateway
61. $ terraform apply
Outputs:
url = ruvvwv3sh1.execute-api.us-east-2.amazonaws.com
$ curl ruvvwv3sh1.execute-api.us-east-2.amazonaws.com
Hello, World!
When you run terraform apply, it
deploys and outputs the URL
62. Let’s write a unit test for
hello-world-app with Terratest
66. func TestHelloWorldAppUnit(t *testing.T) {
terraformOptions := &terraform.Options{
TerraformDir: "../examples/hello-world-app",
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
validate(t, terraformOptions)
}
2. Run terraform init and terraform
apply to deploy your module
67. func TestHelloWorldAppUnit(t *testing.T) {
terraformOptions := &terraform.Options{
TerraformDir: "../examples/hello-world-app",
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
validate(t, terraformOptions)
}
3. Validate the infrastructure works.
We’ll come back to this shortly.
68. func TestHelloWorldAppUnit(t *testing.T) {
terraformOptions := &terraform.Options{
TerraformDir: "../examples/hello-world-app",
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
validate(t, terraformOptions)
}
4. Run terraform destroy at the end of
the test to undeploy everything
69. func validate(t *testing.T, opts *terraform.Options) {
url := terraform.Output(t, opts, "url")
http_helper.HttpGetWithRetry(t,
url, // URL to test
200, // Expected status code
"Hello, World!", // Expected body
10, // Max retries
3 * time.Second, // Time between retries
)
}
The validate function
70. func validate(t *testing.T, opts *terraform.Options) {
url := terraform.Output(t, opts, "url")
http_helper.HttpGetWithRetry(t,
url, // URL to test
200, // Expected status code
"Hello, World!", // Expected body
10, // Max retries
3 * time.Second, // Time between retries
)
}
1. Run terraform output to get the web
service URL
71. func validate(t *testing.T, opts *terraform.Options) {
url := terraform.Output(t, opts, "url")
http_helper.HttpGetWithRetry(t,
url, // URL to test
200, // Expected status code
"Hello, World!", // Expected body
10, // Max retries
3 * time.Second, // Time between retries
)
}
2. Make HTTP requests to the URL
72. func validate(t *testing.T, opts *terraform.Options) {
url := terraform.Output(t, opts, "url")
http_helper.HttpGetWithRetry(t,
url, // URL to test
200, // Expected status code
"Hello, World!", // Expected body
10, // Max retries
3 * time.Second, // Time between retries
)
}
3. Check the response for an expected
status and body
73. func validate(t *testing.T, opts *terraform.Options) {
url := terraform.Output(t, opts, "url")
http_helper.HttpGetWithRetry(t,
url, // URL to test
200, // Expected status code
"Hello, World!", // Expected body
10, // Max retries
3 * time.Second, // Time between retries
)
}
4. Retry the request up to 10 times, as
deployment is asynchronous
74. Note: since we’re testing a
web service, we use HTTP
requests to validate it.
75. Examples of other ways to validate:
Infrastructure  Example             Validate with…  Example tool
Web service     Dockerized web app  HTTP requests   Terratest http_helper package
Server          EC2 instance        SSH commands    Terratest ssh package
Cloud service   SQS                 Cloud APIs      Terratest aws or gcp packages
Database        MySQL               SQL queries     MySQL driver for Go
77. $ go test -v -timeout 15m -run TestHelloWorldAppUnit
…
--- PASS: TestHelloWorldAppUnit (31.57s)
Then run go test. You now have a unit
test you can run after every commit!
78. Unit tests
1. Unit testing basics
2. Example: Terraform unit tests
3. Example: Docker/Kubernetes unit tests
4. Cleaning up after tests
94. func validate(t *testing.T, opts *k8s.KubectlOptions) {
k8s.WaitUntilServiceAvailable(t, opts, "hello-world-app-service", 10, 1*time.Second)
http_helper.HttpGetWithRetry(t,
serviceUrl(t, opts), // URL to test
200, // Expected status code
"Hello, World!", // Expected body
10, // Max retries
3*time.Second, // Time between retries
)
}
The validate function
95. func validate(t *testing.T, opts *k8s.KubectlOptions) {
k8s.WaitUntilServiceAvailable(t, opts, "hello-world-app-service", 10, 1*time.Second)
http_helper.HttpGetWithRetry(t,
serviceUrl(t, opts), // URL to test
200, // Expected status code
"Hello, World!", // Expected body
10, // Max retries
3*time.Second, // Time between retries
)
}
1. Wait until the service is deployed
96. func validate(t *testing.T, opts *k8s.KubectlOptions) {
k8s.WaitUntilServiceAvailable(t, opts, "hello-world-app-service", 10, 1*time.Second)
http_helper.HttpGetWithRetry(t,
serviceUrl(t, opts), // URL to test
200, // Expected status code
"Hello, World!", // Expected body
10, // Max retries
3*time.Second, // Time between retries
)
}
2. Make HTTP requests
97. func validate(t *testing.T, opts *k8s.KubectlOptions) {
k8s.WaitUntilServiceAvailable(t, opts, "hello-world-app-service", 10, 1*time.Second)
http_helper.HttpGetWithRetry(t,
serviceUrl(t, opts), // URL to test
200, // Expected status code
"Hello, World!", // Expected body
10, // Max retries
3*time.Second, // Time between retries
)
}
3. Use the serviceUrl function to get the URL
101. $ go test -v -timeout 15m -run TestDockerKubernetes
…
--- PASS: TestDockerKubernetes (5.69s)
Run go test. You can validate your
config after every commit in seconds!
102. Unit tests
1. Unit testing basics
2. Example: Terraform unit tests
3. Example: Docker/Kubernetes unit tests
4. Cleaning up after tests
104. Pro tip #1: run tests in completely
separate “sandbox” accounts
105. Pro tip #2: run these tools in cron jobs to clean up left-over resources
Tool              Clouds             Features
cloud-nuke        AWS (GCP planned)  Delete all resources older than a certain date; in a certain region; of a certain type.
Janitor Monkey    AWS                Configurable rules of what to delete. Notify owners of pending deletions.
aws-nuke          AWS                Specify specific AWS accounts and resource types to target.
Azure PowerShell  Azure              Includes native commands to delete Resource Groups
122. func TestProxyApp(t *testing.T) {
webServiceOpts := configWebService(t)
defer terraform.Destroy(t, webServiceOpts)
terraform.InitAndApply(t, webServiceOpts)
proxyAppOpts := configProxyApp(t, webServiceOpts)
defer terraform.Destroy(t, proxyAppOpts)
terraform.InitAndApply(t, proxyAppOpts)
validate(t, proxyAppOpts)
}
6. At the end of the test, undeploy the
proxy app and the web service
133. func TestProxyApp(t *testing.T) {
t.Parallel()
// The rest of the test code
}
func TestHelloWorldAppUnit(t *testing.T) {
t.Parallel()
// The rest of the test code
}
Enable test parallelism in Go by adding
t.Parallel() as the 1st line of each test.
134. $ go test -v -timeout 15m
=== RUN TestHelloWorldAppUnit
=== RUN TestDockerKubernetes
=== RUN TestProxyApp
Now, if you run go test, all the tests
with t.Parallel() will run in parallel
136. resource "aws_iam_role" "role_example" {
name = "example-iam-role"
}
resource "aws_security_group" "sg_example" {
name = "security-group-example"
}
Example: module with hard-coded IAM
Role and Security Group names
137. resource "aws_iam_role" "role_example" {
name = "example-iam-role"
}
resource "aws_security_group" "sg_example" {
name = "security-group-example"
}
If two tests tried to deploy this module
in parallel, the names would conflict!
139. resource "aws_iam_role" "role_example" {
name = var.name
}
resource "aws_security_group" "sg_example" {
name = var.name
}
Example: use variables in all resource
names…
140. uniqueId := random.UniqueId()
return &terraform.Options{
TerraformDir: "../examples/proxy-app",
Vars: map[string]interface{}{
"name": fmt.Sprintf("test-proxy-app-%s", uniqueId),
},
}
At test time, set the variables to a
randomized value to avoid conflicts
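If you are curious what such a helper might look like, here is a minimal sketch in plain Go. uniqueID and resourceName are hypothetical stand-ins written for this example; the length of 6 and the character set are assumptions, and the result is random rather than guaranteed unique.

```go
package main

import (
	"fmt"
	"math/rand"
)

// idChars is the alphabet used for generated IDs (an assumption for
// this sketch).
const idChars = "0123456789abcdefghijklmnopqrstuvwxyz"

// uniqueID returns a random lowercase alphanumeric string of length n.
// Collisions are possible in principle, just very unlikely.
func uniqueID(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = idChars[rand.Intn(len(idChars))]
	}
	return string(b)
}

// resourceName builds a per-test resource name from a fixed prefix and
// a random suffix.
func resourceName(prefix string) string {
	return fmt.Sprintf("%s-%s", prefix, uniqueID(6))
}

func main() {
	// Two tests deploying the same module get different names.
	fmt.Println(resourceName("test-proxy-app"))
	fmt.Println(resourceName("test-proxy-app"))
}
```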
144. 1. Deploy web-service
2. Deploy proxy-app
3. Validate proxy-app
4. Undeploy proxy-app
5. Undeploy web-service
When iterating locally, you sometimes
want to re-run just one of these steps.
145. 1. Deploy web-service
2. Deploy proxy-app
3. Validate proxy-app
4. Undeploy proxy-app
5. Undeploy web-service
But as the code is written now, you
have to run all steps on each test run.
146. 1. Deploy web-service (~3 min)
2. Deploy proxy-app (~2 min)
3. Validate proxy-app (~30 seconds)
4. Undeploy proxy-app (~1 min)
5. Undeploy web-service (~2 min)
And that can add up to a lot of
overhead.
148. webServiceOpts := configWebService(t)
defer terraform.Destroy(t, webServiceOpts)
terraform.InitAndApply(t, webServiceOpts)
proxyAppOpts := configProxyApp(t, webServiceOpts)
defer terraform.Destroy(t, proxyAppOpts)
terraform.InitAndApply(t, proxyAppOpts)
validate(t, proxyAppOpts)
The original test structure
149. stage := test_structure.RunTestStage
defer stage(t, "cleanup_web_service", cleanupWebService)
stage(t, "deploy_web_service", deployWebService)
defer stage(t, "cleanup_proxy_app", cleanupProxyApp)
stage(t, "deploy_proxy_app", deployProxyApp)
stage(t, "validate", validate)
The test structure with test stages
150. stage := test_structure.RunTestStage
defer stage(t, "cleanup_web_service", cleanupWebService)
stage(t, "deploy_web_service", deployWebService)
defer stage(t, "cleanup_proxy_app", cleanupProxyApp)
stage(t, "deploy_proxy_app", deployProxyApp)
stage(t, "validate", validate)
1. RunTestStage is a helper function
from Terratest.
151. stage := test_structure.RunTestStage
defer stage(t, "cleanup_web_service", cleanupWebService)
stage(t, "deploy_web_service", deployWebService)
defer stage(t, "cleanup_proxy_app", cleanupProxyApp)
stage(t, "deploy_proxy_app", deployProxyApp)
stage(t, "validate", validate)
2. Wrap each stage of your test with a
call to RunTestStage
152. stage := test_structure.RunTestStage
defer stage(t, "cleanup_web_service", cleanupWebService)
stage(t, "deploy_web_service", deployWebService)
defer stage(t, "cleanup_proxy_app", cleanupProxyApp)
stage(t, "deploy_proxy_app", deployProxyApp)
stage(t, "validate", validate)
3. Define each stage in a function
(you’ll see this code shortly).
153. stage := test_structure.RunTestStage
defer stage(t, "cleanup_web_service", cleanupWebService)
stage(t, "deploy_web_service", deployWebService)
defer stage(t, "cleanup_proxy_app", cleanupProxyApp)
stage(t, "deploy_proxy_app", deployProxyApp)
stage(t, "validate", validate)
4. Give each stage a unique name
154. stage := test_structure.RunTestStage
defer stage(t, "cleanup_web_service", cleanupWebService)
stage(t, "deploy_web_service", deployWebService)
defer stage(t, "cleanup_proxy_app", cleanupProxyApp)
stage(t, "deploy_proxy_app", deployProxyApp)
stage(t, "validate", validate)
Any stage foo can be skipped by
setting the env var SKIP_foo=true
156. $ go test -v -timeout 15m -run TestProxyApp
Running stage 'deploy_web_service'…
Running stage 'deploy_proxy_app'…
Running stage 'validate'…
Skipping stage 'cleanup_proxy_app'…
Skipping stage 'cleanup_web_service'…
--- PASS: TestProxyApp (105.73s)
That way, after the test finishes, the
infrastructure will still be running.
158. $ go test -v -timeout 15m -run TestProxyApp
Skipping stage 'deploy_web_service'…
Skipping stage 'deploy_proxy_app'…
Running stage 'validate'…
Skipping stage 'cleanup_proxy_app'…
Skipping stage 'cleanup_web_service'…
--- PASS: TestProxyApp (14.22s)
This allows you to iterate on solely the
validate stage…
159. $ go test -v -timeout 15m -run TestProxyApp
Skipping stage 'deploy_web_service'…
Skipping stage 'deploy_proxy_app'…
Running stage 'validate'…
Skipping stage 'cleanup_proxy_app'…
Skipping stage 'cleanup_web_service'…
--- PASS: TestProxyApp (14.22s)
Which dramatically speeds up your
iteration / feedback cycle!
160. $ export SKIP_validate=true
$ unset SKIP_cleanup_web_service
$ unset SKIP_cleanup_proxy_app
When you’re done iterating, skip
validate and re-enable cleanup
161. $ go test -v -timeout 15m -run TestProxyApp
Skipping stage 'deploy_web_service'…
Skipping stage 'deploy_proxy_app'…
Skipping stage 'validate'…
Running stage 'cleanup_proxy_app'…
Running stage 'cleanup_web_service'…
--- PASS: TestProxyApp (59.61s)
This cleans up everything that was left
running.
162. func deployWebService(t *testing.T) {
opts := configWebServiceOpts(t)
test_structure.SaveTerraformOptions(t, "/tmp", opts)
terraform.InitAndApply(t, opts)
}
func cleanupWebService(t *testing.T) {
opts := test_structure.LoadTerraformOptions(t, "/tmp")
terraform.Destroy(t, opts)
}
Note: each time you run test stages via
go test, it’s a separate OS process.
163. func deployWebService(t *testing.T) {
opts := configWebServiceOpts(t)
test_structure.SaveTerraformOptions(t, "/tmp", opts)
terraform.InitAndApply(t, opts)
}
func cleanupWebService(t *testing.T) {
opts := test_structure.LoadTerraformOptions(t, "/tmp")
terraform.Destroy(t, opts)
}
So to pass data between stages, one
stage needs to write the data to disk…
164. func deployWebService(t *testing.T) {
opts := configWebServiceOpts(t)
test_structure.SaveTerraformOptions(t, "/tmp", opts)
terraform.InitAndApply(t, opts)
}
func cleanupWebService(t *testing.T) {
opts := test_structure.LoadTerraformOptions(t, "/tmp")
terraform.Destroy(t, opts)
}
And the other stages need to read that
data from disk.
172. You could use the same strategy…
1. Deploy all the infrastructure
2. Validate it works
(e.g., via HTTP requests, API calls, SSH commands, etc.)
3. Undeploy all the infrastructure
173. But it’s rare to write end-to-end
tests this way. Here’s why:
180. Assume a single resource (e.g.,
EC2 instance) has a 1/1000
(0.1%) chance of failure.
181. Test type # of resources Chance of failure
Unit tests 10 1%
Integration tests 50 5%
End-to-end tests 500+ 40%+
The more resources your tests deploy,
the flakier they will be.
182. Test type # of resources Chance of failure
Unit tests 10 1%
Integration tests 50 5%
End-to-end tests 500+ 40%+
You can work around the failure rate
for unit & integration tests with retries
184. Key takeaway: E2E tests from
scratch are too slow and too
brittle to be useful
191. Technique: Static analysis
Strengths:
1. Fast
2. Stable
3. No need to deploy real resources
4. Easy to use
Weaknesses:
1. Very limited in errors you can catch
2. You don’t get much confidence in your code solely from static analysis
Technique: Unit tests
Strengths:
1. Fast enough (1 – 10 min)
2. Mostly stable (with retry logic)
3. High level of confidence in individual units
Weaknesses:
1. Need to deploy real resources
2. Requires writing non-trivial code
Technique: Integration tests
Strengths:
1. Mostly stable (with retry logic)
2. High level of confidence in multiple units working together
Weaknesses:
1. Need to deploy real resources
2. Requires writing non-trivial code
3. Slow (10 – 30 min)
Technique: End-to-end tests
Strengths:
1. Build confidence in your entire architecture
Weaknesses:
1. Need to deploy real resources
2. Requires writing non-trivial code
3. Very slow (60 min – 240+ min)*
4. Can be brittle (even with retry logic)*