Alexey Shpakov presents on testing in Jira Frontend. He discusses the testing pyramid with unit, integration, and end-to-end tests. He then introduces the concept of a "testing hourglass" which adds deployment and post-deployment verification to the pyramid. Key aspects of each type of test are discussed such as using feature flags, monitoring for flaky tests, and gradual rollouts to reduce risk.
Testing Hourglass at Jira Frontend - by Alexey Shpakov, Sr. Developer @ Atlassian
1. ALEXEY SHPAKOV | SENIOR DEVELOPER
Testing hourglass
at Jira Frontend
Hi, my name is Alexey Shpakov. I work on the Jira Frontend Platform team. We are responsible for builds, deployments, dev experience, and repository maintenance.
Today I am going to talk about testing in Jira Frontend.
2. Testing pyramid
e2e
unit
integration
complexity
maintenance
fragility
duration
The testing pyramid is a metaphor coined by Mike Cohn. It acknowledges the existence of different types of tests and suggests what a healthy ratio between the groups could be.
The pyramid starts with unit tests and suggests these should make up the majority of your tests. They focus on testing the small units that comprise the application. On the next level you find integration tests, which check that separate units work together properly. At the top are end-to-end tests, which exercise the application as a whole.
The area of each layer stands for the number of tests. That is, the metaphor suggests you want a lot of unit tests, fewer integration tests, and just a handful of end-to-end tests.
These three categories of tests, however, differ on several parameters.
Unit tests are by far the simplest ones: they focus on a particular piece of functionality and mock any dependencies. This allows us to guarantee that we are testing the code at hand and nothing else. It also means we only need to modify a unit test when the corresponding functionality changes. Because all dependencies are mocked, unit tests are quite stable and fast to run.
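As a sketch of this idea, here is a minimal unit test with a mocked dependency. The function and client names are illustrative, not from the actual Jira codebase:

```javascript
// Unit under test: formats an issue title fetched via an injected client.
function makeIssueTitle(client, issueId) {
  const issue = client.getIssue(issueId);
  return `${issue.key}: ${issue.summary}`;
}

// In the test, the real API client is replaced with a stub,
// so only makeIssueTitle itself is exercised, nothing else.
const stubClient = {
  getIssue: (id) => ({ key: `JRA-${id}`, summary: 'Fix login' }),
};

console.log(makeIssueTitle(stubClient, 42)); // "JRA-42: Fix login"
```

Because the stub is deterministic and in-memory, the test is fast and can only fail when `makeIssueTitle` itself changes.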
Since integration tests focus on interaction, or integration, within groups of units, they usually include at least two unmocked units and run different integration scenarios between them. As the number of unit interactions grows, we get more tests involving a given piece of code. Here we need to account for multiple units at once, which makes integration tests more complicated. Consequently, when any of the units changes, we may need to adjust all the tests it is involved in. With many real, non-mocked dependencies, integration tests are also more prone to breakage and take longer to run.
End-to-end tests work with the whole application and aim to replicate real user scenarios. This means that even before you write the first test, you need to deploy the application, which can sometimes be a huge effort on its own. If your service is stateful, you may need to prepopulate the database before running some of the tests. The majority of user scenarios interact with a large number of application pieces, so whenever one of them is slightly modified, you may need to adjust multiple end-to-end tests. This makes end-to-end tests high-maintenance and fragile. On top of that, as you are testing a real application, the tests are susceptible to network latency and other real-world inconsistencies, which makes them even more fragile and slow.
4. Is it good enough?
Photo by Ludovic Migneault on Unsplash
The testing pyramid is a great rule of thumb that can help structure your testing strategy.
Once you introduce different types of tests with decent application coverage, quality will certainly go up.
But is it sufficient? Is it good enough in the long term?
5. Real world is more
complicated
Photo by Jean-Louis Paulin on Unsplash
As web applications become more complicated and gain more customers, it becomes increasingly hard to guarantee that all the possible combinations of inputs are properly tested, even with a properly implemented testing strategy. At that stage you will realise that your customers have unknowingly become manual testers for your product. In fact, some of them might even create tickets in your public ticketing system. This is bad customer service.
If you were strictly following the testing pyramid, you would start investing heavily in adding a lot more end-to-end tests to account for all sorts of unusual combinations of inputs. The issue is that in production we are dealing with an exponential number of those combinations, so you would need an exponential number of end-to-end tests. These tests require a lot of effort to write and maintain, and they are slow and fragile. As a result, developers would have to spend a lot of their time maintaining these tests instead of producing new features, and releases would get delayed because of the amount of time it takes to run all the tests.
6. Let's step back and recall what testing is about. If you ask Google, you will get the following definition.
To test means to “take measures to check the quality, performance, or reliability of (something), especially before putting it into widespread use or practice”.
The word “widespread” is important here. It is easy to assume that testing can only be performed before the code gets deployed to production. However, testing your code in production is as important as testing it before the deployment.
7. e2e
unit
integration
deployment
PDV
monitoring
logging
Testing hourglass
At Atlassian we acknowledge that having a good testing pyramid is not enough to deliver a high-quality product to customers. We build on top of this concept and introduce a testing hourglass. The bottom part of the hourglass is the testing pyramid, its narrow middle is the release deployment, and the inverted triangle at the top consists of post-deployment verifications, monitoring, and logging in production.
Let’s have a closer look at how the different parts of the hourglass are implemented in Jira Frontend.
8. Type checking
Photo by Leighann Blackwood on Unsplash
The very first level of the testing pyramid we have is type checking.
Statically typed compiled languages get it for free. As we use JavaScript, we have to address type checking separately.
9. • Force new code to be covered with types
• Update flow regularly
• Generate library definitions from Typescript
Type checking
We use Flow for type checking.
In an ideal scenario, you want your whole codebase to be thoroughly typed, including your dependencies.
Unfortunately, this may not always be possible. Multiple studies show that using typed flavours of JavaScript decreases the number of bugs in production. However, it may still be a hard sell for management.
Instead of pushing for 100% type coverage, we introduced an eslint rule that forces every new file added to the codebase to be typed. This allows us to enforce coverage of the relevant code.
New releases of Flow frequently bring breaking changes. When we bump it, we usually add ignore comments to existing violations and create tickets for the teams that own the corresponding code. This forces all new code to be compliant with the new version of Flow and delegates fixing existing issues to the code owners.
For the dependencies we use flow-typed, a community-driven project that provides Flow type definitions for libraries that don’t ship them. Unfortunately, not all dependencies have their types available via flow-typed. We implemented a separate tool to convert TypeScript type definitions, where available, into Flow. It is not perfect, because these type systems don’t map one-to-one, but it is still helpful.
10. Unit tests
Photo by William Warby on Unsplash
On the second level of the pyramid we find unit tests.
11. • Mock timers and dates
• Avoid snapshot tests
• Fail on console methods
• Zero tolerance for flakes
Unit tests
We are using jest and enzyme for unit tests.
One of the biggest issues with any kind of test is flakiness. There are several factors that can contribute to it in JavaScript.
The most frequent one is JavaScript timers: it's crucial to mock all the timers. This makes tests more consistent and faster to execute. Another, less frequent, issue is a hardcoded year, month, or timezone, in which case the test passes only until a certain date.
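In jest this is typically done with `jest.useFakeTimers()` and `jest.setSystemTime()`. A framework-free way to sketch the same idea is to make the clock an injectable dependency (the function names below are illustrative):

```javascript
// The unit takes a `now` function instead of reading Date.now() directly,
// so tests can pass a fixed clock and never depend on the real date.
function isOverdue(dueDateMs, now = () => Date.now()) {
  return now() > dueDateMs;
}

// In a test, inject a frozen point in time:
const fixedNow = () => Date.parse('2021-06-01T00:00:00Z');

console.log(isOverdue(Date.parse('2021-05-31T00:00:00Z'), fixedNow)); // true
console.log(isOverdue(Date.parse('2021-06-02T00:00:00Z'), fixedNow)); // false
```

With the clock frozen, the assertions produce the same result on any machine, on any date.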
A separate type of test is the snapshot test. It is a great feature that allows comparing big objects in one go. Unfortunately, it is frequently abused by developers: whenever a snapshot test fails, they update the snapshots without even checking them. If possible, it is a good idea to test for specific features of the object instead and use snapshot testing only as a last resort.
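The difference can be sketched like this (the object shape is made up for illustration):

```javascript
// Imagine this is the rendered output of a component under test.
const renderedProps = {
  title: 'Create issue',
  disabled: false,
  analyticsId: 'create-issue-button',
  theme: { spacing: 8, palette: 'dark' }, // large, frequently changing subtree
};

// A whole-object snapshot would break whenever `theme` changes, tempting
// developers to blindly regenerate it. Targeted assertions only fail when
// the behaviour you actually care about changes:
function checkCreateButton(props) {
  return props.title === 'Create issue' && props.disabled === false;
}

console.log(checkCreateButton(renderedProps)); // true
```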
We have adjusted the jest environment to throw errors every time console methods are used. In our experience, whenever console is invoked, it is either debug code that should not be committed to master, a legitimate error that should be fixed, or a suggestion to improve the code. In all three cases there is usually a way to fix the code and improve its quality. In the rare cases when console usage is expected, developers can simply mock it.
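A heavily simplified sketch of what such an environment setup might do (the real jest integration hooks into the test environment; here a plain object stands in for the global console):

```javascript
// Replace console methods with versions that fail the test run.
function installConsoleGuard(consoleObj, methods = ['log', 'warn', 'error']) {
  for (const name of methods) {
    consoleObj[name] = (...args) => {
      throw new Error(`Unexpected console.${name} call: ${args.join(' ')}`);
    };
  }
}

const fakeConsole = {};
installConsoleGuard(fakeConsole);

let failed = false;
try {
  fakeConsole.warn('leftover debug output'); // would be a test failure
} catch (e) {
  failed = true;
}
console.log(failed); // true
```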
We have zero tolerance for flaky unit tests: when a test is flaky, it is removed straight away and the owners are responsible for fixing and reintroducing it. Right now this is done manually, but we are exploring automated flaky-test detection.
12. Integration tests
Photo by Lenny Kuhne on Unsplash
There are many different types of integration tests one can write. The goal of an integration test is to check how well different parts of the system work together. If a test is no longer a unit test, but not yet an end-to-end test, it is probably an integration test :) .
In Jira Frontend we have three types of integration tests: Cypress tests, visual regression tests, and pact tests.
13. • Use storybooks
• Cypress allows to retry
• Notify owners about flaky tests
Cypress tests
The Cypress library allows us to drive real browsers in our tests.
Storybook is another library; it allows us to render frontend components in isolation.
We render high-level component storybooks, mock network requests, and interact with them via Cypress. The fact that we have no network calls and the storybooks are served locally decreases latency and makes the tests a lot more stable.
On top of that, Cypress provides the ability to retry flaky tests out of the box.
Although it’s great to have retries available, they mask flaky tests, which over time will negatively impact the performance of the Cypress suite. To avoid that, we send Slack notifications to the owners whose Cypress tests have been retried.
14. • Use storybooks
• A separate flakiness test
• Stop css animations and mock date
• Diminishing returns
Visual regression tests
We use Applitools and storybooks for visual regression testing.
Similar to Cypress tests, we mount components in isolation and take their snapshots.
It is crucial to keep VR testing flake-free. To achieve that, we run VR tests a couple of times. The first time, we perform snapshot comparison against the master baseline. This is the normal VR testing result people are familiar with, where the same component is compared before and after the changes.
Afterwards we run the VR tests two more times, this time comparing the results against the branch baseline. As the input is the same, we expect to get no differences. If there are any visual differences in this case, it means some of the stories are flaky, and the test fails.
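The branch-baseline check boils down to "render the same story twice and fail if the outputs differ". A toy sketch, with string hashes standing in for real screenshots:

```javascript
// Render every story twice; a story is flaky if two renders of the
// same input produce different snapshots.
function detectFlakyStories(renderStory, storyIds) {
  return storyIds.filter((id) => renderStory(id) !== renderStory(id));
}

// A story whose output depends on mutable state (e.g. the current time)
// is flaky; a pure story is not.
let tick = 0;
const render = (id) => (id === 'clock' ? `clock-${tick++}` : `stable-${id}`);

console.log(detectFlakyStories(render, ['header', 'clock'])); // [ 'clock' ]
```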
To further decrease the number of flaky VR tests, we stop CSS animations and mock the date for all the stories we run visual regression testing on.
We introduced VR tests recently. As developers opt more components into visual regression testing, you get diminishing returns while testing time increases. We decided to introduce tests for high-level components first and gradually add lower-level stories wherever it makes sense.
15. • Pact tests
• Verify and upload pacts as a part of deployment
Contract tests
The goal of contract tests is to ensure the backend doesn’t introduce API changes that would break consumers and, at the same time, that consumers do not have unreasonable expectations of the API providers.
We are using a library called pact for contract testing.
We run a separate service called pact broker. Every consumer of an API uploads its expectations to the pact broker. Whenever a new version of an API is published, it is first validated by the pact broker against all the consumers. This allows us to confirm it does not introduce breaking changes. Once the validation passes, the Swagger definition for the API gets uploaded to the pact broker. Similarly, whenever consumers upload their expectations, we check that they match the existing API uploaded to the pact broker. This ensures the contract stays intact.
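Pact itself does far more (request matching, broker versioning, provider verification), but the core check can be sketched as "every field a consumer expects must exist with the right type in the provider response":

```javascript
// Toy contract check: expectedShape maps field names to expected types.
function satisfiesContract(response, expectedShape) {
  return Object.entries(expectedShape).every(
    ([field, type]) => typeof response[field] === type
  );
}

const consumerExpectation = { id: 'string', summary: 'string', votes: 'number' };

// Extra provider fields are fine; consumers ignore them:
console.log(satisfiesContract(
  { id: 'JRA-1', summary: 'Bug', votes: 3, extra: true },
  consumerExpectation
)); // true

// Dropping a field the consumer relies on is a breaking change:
console.log(satisfiesContract(
  { id: 'JRA-1', summary: 'Bug' },
  consumerExpectation
)); // false
```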
Something we learnt the hard way is that we must both verify and upload consumer expectations as a part of the deployment.
Originally we ran pact tests as a part of build verification and uploaded consumer expectations as a part of the deployment. There could be 20-30 minutes between the time we ran the pact tests and the time we deployed the pacts. If the API producers uploaded breaking changes during this period, it could produce a deadlock. As a result of this incident, we now run pact tests twice: as a part of the build and as a part of the deployment.
16. End-to-end tests
Photo by Matt Botsford on Unsplash
End-to-end tests are the upper part of the pyramid, which means you want only a few of them.
17. • Flakiness
• High maintenance
• Unable to represent production
End-to-end tests
In Jira we used to have a lot of end-to-end tests and struggled with their flakiness and maintenance burden. Another issue we had with end-to-end tests is feature delivery: we use feature flags to deliver features to production, and oftentimes the feature flags in test environments do not match those in production. As a result, we decided to stop writing end-to-end tests going forward and rely on post-deployment verification instead.
18. Release
Photo by Kira auf der Heide on Unsplash
The middle point of the testing hourglass is the application release.
No matter how good our tests are, things will break in production. The priorities here are to detect issues fast, decrease their impact, and mitigate them as soon as possible.
Let’s see how the Jira Frontend release process helps with this.
19. Feature flags
Photo by Sebastiano Piazzi on Unsplash
Feature toggles, or feature flags, are a technique that allows developers to modify application behaviour without modifying the code.
The idea is to introduce the new behaviour within an if-statement and evaluate its condition at runtime.
For example, we can use a third-party service that returns true or false for a given feature flag. Based on this value, we execute either the old code or the new code.
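In code, the pattern looks something like the sketch below. The flag service interface and the flag name are made up for illustration; real flag clients have richer APIs:

```javascript
// Pick a code path based on a runtime flag value; the second argument
// is the fallback used when the flag service is unavailable.
function getBoardComponent(flagClient) {
  if (flagClient.getBooleanValue('new-board-experience', false)) {
    return 'NewBoard';   // new behaviour, being rolled out
  }
  return 'LegacyBoard';  // old behaviour, still the safe default
}

// Stub clients simulating the third-party flag service:
const flagsOn = { getBooleanValue: () => true };
const flagsOff = { getBooleanValue: (name, fallback) => fallback };

console.log(getBoardComponent(flagsOn));  // "NewBoard"
console.log(getBoardComponent(flagsOff)); // "LegacyBoard"
```

Because the flag is read at runtime, rolling the feature back is a flag change, not a redeploy.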
20. • Feature flag for every feature
• Feature delivery tickets
• Monitor feature flag changes
• Feature flag cleanup
Feature flags
Every feature delivered to production is expected to be hidden behind a feature flag. This means the exact time a new version hits production is usually irrelevant. Developers use a third-party service to toggle the feature flags, which allows them to control their feature rollout.
In Jira we expect every feature flag to have a corresponding feature delivery Jira ticket. The ticket contains metadata about the feature: the feature owner, the expected rollout schedule, how the feature is monitored, what could go wrong, and so on. This allows a person unfamiliar with the feature to analyse whether it works well.
Which brings me to another point: in an application with hundreds of enabled feature flags, it is important to track feature flag status changes. In case of an incident, this allows us to pinpoint suspicious feature flag changes, and the associated feature delivery ticket gives the context needed to tell whether a feature flag change could be the cause of the incident.
While feature flags are great at derisking feature delivery, in the long run they increase the amount of dead code and make the codebase harder to reason about. It is important to clean up feature flags once they have been successfully rolled out. Once again, we leverage the metadata to identify the feature flag owner and ping them about cleaning up the flag once it has been enabled for 100% of production for a long period of time.
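The "ready for cleanup" check can be sketched as a simple query over flag metadata. The metadata shape and soak period below are assumptions for illustration:

```javascript
// Return the owners of flags that are fully rolled out and have been
// unchanged for longer than the soak period.
function flagsReadyForCleanup(flags, nowMs, soakDays = 30) {
  const soakMs = soakDays * 24 * 60 * 60 * 1000;
  return flags
    .filter((f) => f.rolloutPercent === 100 && nowMs - f.lastChangedMs > soakMs)
    .map((f) => f.owner);
}

const now = Date.parse('2021-06-01T00:00:00Z');
const day = 24 * 60 * 60 * 1000;
const flags = [
  { name: 'new-board', owner: 'team-a', rolloutPercent: 100, lastChangedMs: now - 90 * day },
  { name: 'new-nav',   owner: 'team-b', rolloutPercent: 50,  lastChangedMs: now - 90 * day },
];

console.log(flagsReadyForCleanup(flags, now)); // [ 'team-a' ]
```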
22. • Use the app internally
• Anyone is able to halt production rollout
Dogfooding and blockers
Some commonly mentioned changes that cannot be covered by feature flags include build configuration changes and dependency upgrades. Sometimes developers may also forget to use a feature flag to deliver new functionality. These changes carry significant risks.
The first customers to receive a new version of Jira Cloud are Atlassian employees. We use Jira for our day-to-day activities, and it is usually quite obvious if critical functionality doesn’t work well. Once someone notices a bug, they create a ticket with priority “blocker” in a specific Jira project. This halts any release promotions to production and allows us to avoid customer impact.
Blockers are bad for continuous delivery: they lead to changes piling up. Once there is an active blocker, resolving it as fast as possible becomes the priority.
23. Gradual rollouts
Photo by Aliko Sunawang on Unsplash
In order to further mitigate deployment risks we introduced release soaking and canaries.
24. • 1 staging, 3 production environments
• Canaries
• 3 hours to deploy to all environments
• Frequent releases (30 per week)
Gradual rollouts
In the case of Jira Frontend, we have one staging environment and three production environments. In every production environment we have canary instances: Atlassian-owned tenants which we use for active monitoring.
Whenever we deploy a release, we deploy it to the current environment and to the next environment's canary instance.
At first, a release gets deployed to the staging environment for dogfooding, together with the first production environment’s canary instance. After soaking for one hour, we automatically promote the release to the first production environment and to the next environment’s canary instance, and so on. In total it takes about 3 hours to complete the rollout of a particular release to all production environments. We currently release 6 times a day on work days. Frequent releases allow us to keep the number of changes delivered to production in every release low, which results in a lower risk of breaking production.
25. A production version myth
Photo by Eric Prouzet on Unsplash
Oftentimes in conversations, QAs and SREs talk about a production configuration as something we can write integration and end-to-end tests against.
This sounds reasonable: if we take the production set of feature flags and apply it to the latest code in master, we will get the behaviour our customers get. This should allow us to run all the test suites from the testing pyramid and assess the quality.
26. • 3+ versions in production
• Inconsistent feature flags
• Different versions of backend
A production version myth
Production, however, is more diverse.
As mentioned, we release up to 6 times a day, and every release takes 3 hours to go through the different stages. This means that at any given moment there are at least 3 different versions of Jira running for customers. Usually we observe even more versions, because a lot of people don’t reload the page once they open it. The behaviour of the frontend also depends on the state of the feature flags for a given Jira instance. There isn’t a single “production” version of the feature flags; in fact, we have hundreds of feature flags released independently by different developers in parallel.
Jira Frontend is developed and deployed independently of the backend. As a result, the backend has its own release schedule and its own set of feature flags. So even if we find two instances of Jira with identical versions of Jira Frontend and its feature flags, they could still have different backends and behave differently.
27. Active monitoring (PDV)
Photo by Jared Brashier on Unsplash
PDV stands for post-deployment verification. The idea behind active monitoring is to simulate user behaviour and detect potential issues.
28. • Similar to e2e tests
• Run 24/7 in production
• Failure threshold
Active monitoring (PDV)
We have a separate monitoring tool that runs Cypress tests on given production instances. These are essentially the end-to-end tests we avoid before deployment, so let’s review the differences.
We use production instances to run the tests. This means we are using the latest feature flag values available for the corresponding environment.
Why do we call it monitoring and not testing? These tests run non-stop, 24/7. This means that whenever there’s a feature flag change that breaks production, we get notified about it straight away.
The monitoring system allows us to configure a failure rate threshold. For instance, we can issue an alert only if the test has been failing constantly for the last ten minutes. This allows us to significantly decrease the number of false positives.
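A minimal sketch of such a threshold, assuming the monitoring tool records each run with a timestamp and a pass/fail result (the data shape is illustrative):

```javascript
// Alert only when every run inside the window failed, rather than
// paging on a single flaky failure.
function shouldAlert(runs, nowMs, windowMinutes = 10) {
  const windowStart = nowMs - windowMinutes * 60 * 1000;
  const recent = runs.filter((r) => r.timeMs >= windowStart);
  return recent.length > 0 && recent.every((r) => !r.passed);
}

const now = 1_000_000_000;
const min = 60 * 1000;

// One flake followed by a pass: no alert.
console.log(shouldAlert([
  { timeMs: now - 8 * min, passed: false },
  { timeMs: now - 4 * min, passed: true },
], now)); // false

// Ten minutes of solid failures: alert.
console.log(shouldAlert([
  { timeMs: now - 9 * min, passed: false },
  { timeMs: now - 2 * min, passed: false },
], now)); // true
```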
On the other hand, this kind of monitoring suffers from the same issue end-to-end tests do: a high maintenance cost. Because of that, we use it sparingly, to monitor critical parts of the application, such as the issue creation functionality in the case of Jira.
29. Passive monitoring
Photo by Miłosz Klinowski on Unsplash
The second level of the upper part of the hourglass is monitoring.
30. • Reliable
• Alert priorities
• Runbooks
• War games
Passive monitoring
Monitoring is used to alert developers in case something goes wrong in production.
It is of extreme importance to have reliable monitoring, with no, or close to no, false positives. This can be achieved by comparing against historical data and monitoring the rate of change of a parameter instead of its absolute values. In general, it is better to fire an alert five minutes later than to fire a false one. Yet sometimes this may not be enough; an example could be a national holiday in your largest market :) . These are some of the things you get better at by iterating.
Once the alerts are configured, they should be prioritised. This will help people better understand the urgency behind the particular alerts.
For every alert there should be a runbook. It provides a detailed list of steps that will help mitigate the alert. An alert is a stressful situation, and having a runbook handy helps a lot. An excellent idea is to put a link to the corresponding runbook into the alert notification message.
A war game is an exercise where we come up with possible failure scenarios. The goal is to walk through every scenario and define what we expect to see in terms of monitoring and what the expected response will be. War games help identify missing monitoring and runbooks. After coming up with the scenarios, it is advisable to pick some of them for simulation. This allows us to verify that monitoring works as expected and that runbooks contain the correct steps.
31. Logging
Photo by Dorelys Smits on Unsplash
While monitoring allows us to react fast in case of incidents, it is usually quite limited in terms of the data we can pass as well as data retention time.
32. • Long data retention
• Structured information
• Do not log PII/UGC
• Data ownership
Logging
For these scenarios we use logging. It usually has a much higher data retention capacity, which is helpful during trend analysis and incident investigation.
It is important to use structured logging. It doesn’t matter which structure you use, as long as you use it consistently. This allows for easy search whenever you need to debug a production issue or assess the results of an experiment.
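A sketch of what "structured" means in practice: every log line is a JSON object with a consistent set of fields, so logs can be filtered by field rather than grepped as free text. The exact fields below are illustrative:

```javascript
// Emit one JSON object per event, always with the same base fields.
function logEvent(level, event, fields = {}) {
  return JSON.stringify({
    ts: new Date().toISOString(),
    level,
    event,
    ...fields,
  });
}

const line = logEvent('info', 'experiment-exposure', {
  experiment: 'new-board',
  cohort: 'variant',
});

console.log(line);
console.log(JSON.parse(line).event); // "experiment-exposure"
```

Searching for `event = "experiment-exposure" AND cohort = "variant"` is then trivial in any log backend.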
PII stands for personally identifiable information; UGC means user-generated content. Both are a hard “no” for logging due to potential legal consequences, for example under GDPR. The crux is that developers often don’t realise they might be logging them. For instance, one popular misconception we used to have among developers is that a Jira URL is safe to log. It is not, because it can contain a project key, which is UGC. Whenever any of that data gets logged, we have to purge the logs, oftentimes together with valuable information. To mitigate this risk, we implemented a separate proxy service that preprocesses the logs and redacts known bad patterns. An example is email addresses: all log lines containing an “@” sign are redacted in Jira.
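The "@" rule from the talk can be sketched in a few lines; a real redaction proxy would apply many more patterns, but the shape is the same:

```javascript
// Drop any log line that may contain an email address, keeping the rest.
function redactLogs(lines) {
  return lines.map((line) =>
    line.includes('@') ? '[REDACTED: possible email]' : line
  );
}

console.log(redactLogs([
  'request finished in 120ms',
  'user alice@example.com failed to log in',
]));
// [ 'request finished in 120ms', '[REDACTED: possible email]' ]
```

Redacting at a central proxy means one conservative rule protects every product sharing the pipeline, at the cost of occasionally redacting a harmless line.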
Data ownership is something we learnt the hard way. When you have a lot of products and teams sharing a logging pipeline, at some point you will go over the contract limit. At that moment it is critical to know who produces too many logs, so you can reach out to people and ask them to stop. It also allows you to monitor the trends and reach out to log owners preemptively.
33. You build it, you run it
Photo by Ethan Hu on Unsplash
At Atlassian we follow the “you build it, you run it” approach. This means every team supports its features all the way through their lifetime, from design and implementation to release to production and making sure they work as intended.
34. • SLOs
• 24/7 on-call
• TechOps meetings
You build it, you run it
This includes defining Service Level Objectives, running a 24/7 on-call schedule, and holding regular TechOps meetings.
SLOs are usually defined based on existing monitoring. Whenever an SLO is breached or is close to being breached, fixing it becomes the highest priority for the team.
The on-call schedule assumes that the person on shift will be able to respond to a page within 15 minutes. Once a person is paged, they are expected to mitigate the issue using the runbooks. If mitigation is not possible, the person on call escalates the issue higher up until it gets resolved.
TechOps meetings are usually held at the end of on-call shifts and allow the team to reflect on the shift's results. We analyse recent alerts, trends, and incidents. TechOps action items are usually related to improving alert reliability and addressing negative trends before it's too late.