Three related coverage risks stood out when I joined a new project to build a chatbot API for a medical symptom checker. With an infinite space of possible chats, how could we:
1. look for unintended consequences of changes.
2. discover some of the edge and corner case bugs.
3. exercise the API significantly.
To help mitigate these risks I built a client which would randomly walk through dialogs, unattended, and report
on what it had found.
In this talk, I'll describe how I implemented that client by iteratively adding functionality that I hoped would
facilitate my exploration of changes and fixes to the emerging API. I'll give examples of features that worked
well (such as configuration of probabilities for different types of answers) and those that did not (such as checking for specific classes of medical outcome),
explain how I built on top of the client to make a load testing tool, and think about what I'd do differently next time.
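To make that concrete, here is a minimal sketch of the client's core idea: an unattended random walk with configurable probabilities per answer type. Everything in it (the answer types, the weights, the dialog shape) is invented for illustration; the real client drove an HTTP API at the same point.

```python
import random

# Illustrative only: the answer types and weights are invented; the real
# client read probabilities like these from a configuration file.
ANSWER_WEIGHTS = {
    "yes": 0.4,
    "no": 0.4,
    "dont_know": 0.15,
    "abandon": 0.05,
}

def choose_answer(rng):
    """Pick an answer type according to the configured probabilities."""
    types, weights = zip(*ANSWER_WEIGHTS.items())
    return rng.choices(types, weights=weights, k=1)[0]

def walk_dialog(rng, max_turns=50):
    """Randomly walk one dialog, unattended, logging each step verbosely."""
    path = []
    for turn in range(max_turns):
        answer = choose_answer(rng)
        path.append(answer)
        print(f"turn {turn}: answered {answer}")  # verbose log for later analysis
        if answer == "abandon":
            break
    return path

if __name__ == "__main__":
    walk_dialog(random.Random(42))  # a fixed seed makes the walk replayable
```

Putting the weights in configuration rather than in code is what made directed exploration cheap: change the numbers, rerun, compare.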
30. What Worked Well
Verbose logs.
Asserting generally and on fixes.
Randomisation for unknowns.
Configuration for directed exploration.
Toolkit (replay, parallel, analysis, …).
Question-driven development.
What was Challenging
Code changes and dependencies.
Card identifiers.
Medical testing.
State.
No explicit model.
31. References
Ada screenshots: https://www.uisources.com/explainer/ada-diagnosing-via-chat-bot
Wipotec: https://www.wipotec-ocs.com/en/product-inspection/
Microscope: https://londonlaboquip.com/product/microscope-binocular-biological-sc302
Messy lab: https://imgur.com/gallery/bQiK6
Dice: https://www.richardhughesjones.com/luck-randomness/dice-gif/
Altwalker: https://altom.gitlab.io/altwalker/altwalker/
Star Wars: https://www.starwars.com/video/one-in-a-million-shot
Editor's Notes
**Navigating**
Error handling
Consistency of API
E.g. male-only assessments: compromise the test code to get a walking skeleton; POST vs GET; two similar but slightly different schemas in the API itself.
These activities make me ask questions… what if I …? How about when? Could it possibly be that …?
Problems with the walker and with the product.
While automating I’m testing.
Don’t be too quick to restrict yourself to what you think the system wants.
Can I get from start to finish? (A sketch follows these notes.)
What assumptions are required?
What workarounds are required?
How might developers struggle here?
…
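Here is a sketch of that walking-skeleton question ("Can I get from start to finish?"), using an invented in-memory stand-in for the API; the real client made HTTP calls at the same point, and the outcome name is made up.

```python
import random

# Invented stand-in for the chatbot API: after a few questions it returns
# a final outcome. The real client called the HTTP API here instead.
def next_turn(state, answer):
    if state >= 5:
        return {"type": "outcome", "advice": "see_gp"}
    return {"type": "question", "text": f"Question {state}?"}

def can_reach_an_outcome(rng, max_turns=100):
    """Walking skeleton: just try to get from start to finish once."""
    turn = next_turn(0, None)
    for state in range(1, max_turns):
        if turn["type"] == "outcome":
            return True
        turn = next_turn(state, rng.choice(["yes", "no"]))
    return False  # never finished: a problem with the walker or the product

print(can_reach_an_outcome(random.Random(1)))
```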
**Checking**
E.g. certain kinds of dialog turn have different properties to assert on: keys in the DTO must be present, or must stand in some relationship to each other. (Sketched after these questions.)
What can I assert specifically and generally?
Where are the edge cases? (e.g. by general global assertions failing)
What are the error cases?
How valuable is it to check these things here?
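A sketch of what general and specific assertions might look like on an invented dialog-turn DTO; the key names and the rules are made up, and the shape of the checking is the point. Failures of the general assertions are one way the edge cases show themselves.

```python
# General check: every turn, whatever its kind, must carry these keys.
REQUIRED_KEYS = {"id", "type"}

def check_turn(turn):
    # General assertion, applied to every turn; failures here often
    # point at edge cases nobody thought to look for.
    missing = REQUIRED_KEYS - turn.keys()
    assert not missing, f"turn {turn.get('id')} missing keys: {missing}"

    # Specific assertions for particular kinds of turn, including keys
    # that must be in some relationship with each other.
    if turn["type"] == "choice":
        assert "options" in turn, "choice turns must offer options"
        assert len(turn["options"]) >= 2, "a choice needs at least two options"

check_turn({"id": 7, "type": "choice", "options": ["yes", "no"]})
```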
**Exploring**
When you explore you don’t know if you’ll find anything, and if you find something you won’t know whether it’s relevant, and if it’s relevant you won’t know whether it’s important.
Code is a tool and a toolkit. Extend it to the next question you have. (Can I get to “call ambulance” outcomes? What would need to happen to do that? How could I avoid it?)
Log paths and outcomes. Analyse outside the code for patterns (seen and missing).
Don’t error check too heavily. Catch and investigate failures. (Expose assumptions)
Check some positive cases by hand.
Check failures by hand. Look for patterns. (Don’t have to catch all failures; TODO outcomes are fine because then you can filter them in later analysis.)
Explore the data you produce.
Replay for repeatability.
Configuration to guide the direction of exploration. (Initially, I’d just hack the code)
How can I take the extreme choice each time? How can I make the longest assessment? Can I run an assessment for ever?
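A sketch of how replay and directed exploration can hang off two small hooks: a logged seed and a configuration flag. The flag name and the severity scale are invented for illustration.

```python
import random
import time

def run_walk(seed=None, always_extreme=False):
    """One walk: replayable via its seed, steerable via configuration."""
    seed = seed if seed is not None else int(time.time())
    rng = random.Random(seed)
    print(f"seed={seed}")  # log the seed so this exact walk can be replayed
    severities = ["0", "5", "10"]  # invented answer scale
    for _ in range(10):
        # Directed exploration: optionally take the extreme choice each time.
        answer = severities[-1] if always_extreme else rng.choice(severities)
        print(answer)

run_walk()                     # random exploration, replayable via its seed
run_walk(seed=123)             # replay a specific earlier walk
run_walk(always_extreme=True)  # directed: the extreme choice every time
```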
A toolkit to gather data …
… for human analysis.
Failures are targets.
Patterns are indicators.
…
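A sketch of the gather-then-analyse idea: each walk appends one JSON line, and the counting happens afterwards, outside the walker, where a human can look for patterns seen and patterns missing. The file name and fields are invented.

```python
import json
from collections import Counter

def log_walk(path, outcome, logfile="walks.jsonl"):
    """Append one walk's path and outcome as a single JSON line."""
    with open(logfile, "a") as f:
        f.write(json.dumps({"path": path, "outcome": outcome}) + "\n")

def outcome_counts(logfile="walks.jsonl"):
    """Offline analysis: how often did each outcome occur?"""
    with open(logfile) as f:
        return Counter(json.loads(line)["outcome"] for line in f)

log_walk(["yes", "no", "yes"], "see_gp")
log_walk(["no", "no"], "self_care")
print(outcome_counts())  # outcomes that never occur are targets too
```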
Parallel: nothing clever, just run two or more copies!