I’ve been wrestling with how Quality Assurance (QA) leadership and test engineers need to look at QA through a new lens as software architectures become increasingly complex and interdependent. A lens that appreciates the multi-faceted nature of software infrastructure, application architecture, dependency risk, and the hidden complexities that creep in as the number of interdependent components grows. Because of my training in biology and my experience coauthoring several peer-reviewed papers, my approach to problems leans on asking the right questions, staying open to being wrong, and calling upon the scientific method: a hypothesis-driven way of exploring questions and observations. While I’ve taken a left turn away from life sciences and into computer science, I haven’t forgotten that background or the process I use to uncover complexity and solve problems.
Over the past year, more and more aspects of complexity theory and algorithms have been on my reading radar, mostly to satisfy my interests in big data, scalability, artificial intelligence, and networks. At first these readings seemed disparate from my day job, but slowly the pieces fell into place as I saw their application to software quality: ever more complex, integrated platforms are being built so that businesses can better compete in the marketplace.
Failures can stem from development bugs, component integrations, infrastructure glitches, scalability issues, data issues, security vulnerabilities, UI variations, UX flow errors, and compatibility issues, just to name a handful. We principally suffer from an inability to accurately predict where and when failures will arise. The traditional approach of creating elaborate test plans used to be the gold standard when applications were developed as monoliths, but software development has shifted toward dynamic, integrated components that iterate more quickly than ever. Because of this, the test plan you think is complete is already two steps behind as development iterations continue apace.
Hello Chaos Engineering:
In comes Chaos Engineering theory, Chaos Monkey, and Gremlin, with principles that product technologists should keep at the tip of their minds. The tenets of simulating failures at the DNS, dependency, CPU, and network levels are invaluable additions to a product’s verification checklist when the goal is to create reliable systems. Chaos Engineering lives and breathes on a scientifically driven execution of controlled, measured system faults in order to better design systems tolerant of turbulent environments. The goal is a system in which, even when components misbehave, problems and risks are mitigated and controlled, and the appropriate failovers are activated. As Nora Jones from Netflix points out, if a single module within the platform is down (for example, the movie/show recommendation engine), the result should be to elegantly remove that module without disrupting other system components, with the user’s positive experience left happily intact.
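To make that failover idea concrete, here is a minimal sketch of graceful degradation in Python. The homepage service, recommendations endpoint, module names, and timeout are hypothetical stand-ins (not Netflix’s actual implementation); the point is simply that a failed non-critical dependency is dropped from the response instead of failing the whole request.

```python
# Minimal graceful-degradation sketch: if the (hypothetical) recommendations
# dependency fails, omit that module rather than breaking the whole page.
from typing import Optional

import requests


def fetch_recommendations(user_id: str) -> Optional[list]:
    """Return recommendation items, or None if the dependency is unhealthy."""
    try:
        resp = requests.get(
            f"https://recommendations.internal/users/{user_id}",  # hypothetical endpoint
            timeout=0.5,  # fail fast so one slow module can't hold the page hostage
        )
        resp.raise_for_status()
        return resp.json()["items"]
    except requests.RequestException:
        return None  # signal the caller to degrade gracefully


def build_homepage(user_id: str) -> dict:
    # Core modules that must render regardless of the recommendations module.
    page = {
        "continue_watching": ["row-1", "row-2"],
        "trending": ["row-3"],
    }
    recs = fetch_recommendations(user_id)
    if recs is not None:
        page["recommended_for_you"] = recs
    # If recommendations are down, the module is simply omitted and the rest
    # of the homepage renders normally instead of returning an error page.
    return page
```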
Hello Continuous Verification:
I’m ashamed to say I had never heard of this term until Casey Rosenthal presented on the topic at a recent conference, but it exactly captures our continued evolution of software craftsmanship. We already orchestrate the technical pieces of software delivery well by leveraging continuous integration and continuous deployment in our processes; however, two critical questions remain: does the output of the system match our expectations, and how can we be sure? While creating unit tests, performance tests, and dynamic automated testing harnesses does wonders to ensure regressions aren’t introduced and performance doesn’t fall below a specified threshold, unknown-unknowns can still wreak havoc on tightly coupled systems. Passive monitoring and static checks aren’t the answer; proactive and continuous system measurements are critical, because we must know how our systems behave normally in order to diagnose failures when they are experiencing faults.
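As an illustration of what one continuous verification step might look like, here is a rough Python sketch that compares live metrics against a baseline recorded during normal operation. The metrics endpoint, metric names, baseline values, and tolerance are all assumptions made for the example; the takeaway is that the check runs continuously (or at least on every deploy) rather than as a one-time static gate.

```python
# Rough sketch of a continuous verification check. The metrics endpoint,
# metric names, baseline values, and tolerance below are hypothetical.
import sys

import requests

# Baseline captured while the system was known to be behaving normally.
BASELINE = {"error_rate": 0.01, "p99_latency_ms": 250}
TOLERANCE = 1.5  # allow 50% drift before flagging a problem


def verify(metrics_url: str) -> bool:
    """Compare current system outputs against the recorded baseline."""
    current = requests.get(metrics_url, timeout=2).json()
    healthy = True
    for name, baseline_value in BASELINE.items():
        if current[name] > baseline_value * TOLERANCE:
            print(f"FAIL: {name}={current[name]} exceeds baseline {baseline_value}")
            healthy = False
    return healthy


if __name__ == "__main__":
    # Intended to run as a scheduled job or a post-deploy pipeline stage,
    # not as a one-off check.
    ok = verify("https://myservice.internal/metrics/summary")  # hypothetical URL
    sys.exit(0 if ok else 1)
```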
We already do a great job with the low-hanging fruit of quality assurance: static code quality checks, security gaps, functional automation, cross-browser and accessibility coverage, operating system variation handling, and scalability/performance stress tests. Future software craftsmanship will build atop the principles of continuous verification. This is inherently tough because it is impossible to know every single use case a feature will be leveraged for; products aren’t always used the way their creators designed them (see Instagram’s deep dive into why people post and delete photos days apart; answer: merchants using Instagram as a marketplace). This is where the principles of chaos engineering and observability come into play: simulating environment faults and observing application usage in production (while respecting privacy concerns and laws). Teams ought to test as close to production as possible with consideration of the proper complexity, continuously measure and observe system outputs, create and track effective metrics, and plan for failover operations.
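Below is a hypothetical sketch of what a small, controlled fault-injection experiment could look like: measure a steady-state metric, inject a fault, observe, and always roll the fault back. The health-check URL is made up, and inject_fault/remove_fault are placeholders for whatever fault-injection mechanism a team actually uses (tc, a service mesh, Gremlin, and so on); this is not any tool’s real API.

```python
# Hypothetical chaos-style experiment loop: establish a steady-state metric,
# inject a fault with a small blast radius, observe, and always restore.
import statistics
import time

import requests


def measure_p95_latency_ms(url: str, samples: int = 50) -> float:
    """Steady-state probe: time a health-check request repeatedly."""
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        requests.get(url, timeout=2)
        timings.append((time.monotonic() - start) * 1000)
    return statistics.quantiles(timings, n=20)[18]  # ~95th percentile


def run_experiment(inject_fault, remove_fault, threshold_ms: float = 300) -> bool:
    url = "https://myservice.internal/healthz"  # hypothetical endpoint
    baseline = measure_p95_latency_ms(url)
    inject_fault()  # e.g. add 100 ms of latency to one dependency
    try:
        degraded = measure_p95_latency_ms(url)
    finally:
        remove_fault()  # always roll the fault back, even if measurement fails
    recovered = measure_p95_latency_ms(url)
    print(f"baseline={baseline:.0f}ms degraded={degraded:.0f}ms recovered={recovered:.0f}ms")
    # Hypothesis: the system stays within tolerance while the fault is active
    # and returns to its baseline once the fault is removed.
    return degraded < threshold_ms and recovered <= baseline * 1.2
```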
References
Breaking to Learn: Chaos Engineering Explained
Chaos Engineering: Breaking Your Systems for Fun and Profit
Automating Failure Testing Research at Internet Scale
Chaos Engineering Scenarios
Continuous Verification
Continuous Verification: The Missing Link to Fully Automate Your Pipeline
Principles of Chaos Engineering
AWS re:Invent 2017 – Nora Jones Describes Why We Need More Chaos – Chaos Engineering, That Is