1983 – Apple
While MacWrite and MacPaint were being developed for the first Apple Macintosh computer, Steve Capps created "Monkey", a desk accessory which randomly generated user interface events at high speed, simulating a monkey frantically banging the keyboard and moving and clicking the mouse. It was promptly put to use for debugging by generating errors for programmers to fix, because automated testing was not possible; the first Macintosh had too little free memory space for anything more sophisticated.
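As a rough illustration of the idea behind a tool like Monkey, the sketch below generates a stream of random keyboard and mouse events and feeds them to a handler, recording any failures. It is not Capps's implementation: the event kinds, the `Event` type, the screen dimensions used for coordinates, and the `handle_event` callback are all assumptions made for this example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical event kinds; the real desk accessory's event set is not described here.
EVENT_KINDS = ("key_down", "mouse_move", "mouse_click")

# Assumed screen size, used only to bound the random coordinates.
SCREEN_WIDTH, SCREEN_HEIGHT = 512, 342


@dataclass(frozen=True)
class Event:
    kind: str
    x: int = 0
    y: int = 0
    key: str = ""


def random_events(count: int, seed: Optional[int] = None) -> Iterator[Event]:
    """Yield `count` random UI events; a fixed seed makes a run reproducible."""
    rng = random.Random(seed)
    for _ in range(count):
        kind = rng.choice(EVENT_KINDS)
        if kind == "key_down":
            # Pick a printable ASCII character as the pressed key.
            yield Event(kind, key=chr(rng.randint(32, 126)))
        else:
            yield Event(
                kind,
                x=rng.randrange(SCREEN_WIDTH),
                y=rng.randrange(SCREEN_HEIGHT),
            )


def run_monkey(
    handle_event: Callable[[Event], None],
    count: int,
    seed: Optional[int] = None,
) -> List[Tuple[int, Event, Exception]]:
    """Feed random events to `handle_event` and collect failures for debugging."""
    failures = []
    for index, event in enumerate(random_events(count, seed)):
        try:
            handle_event(event)
        except Exception as error:
            # Record the failure instead of stopping, so the run keeps going.
            failures.append((index, event, error))
    return failures
```

Seeding the generator is a design choice of this sketch rather than a documented feature of the original tool: rerunning with the same seed replays the same event sequence, which makes a failure found by random input easier to reproduce.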
1992 – Prologue
While ABAL2 and SING were being developed for the first graphical versions of the PROLOGUE operating system, Iain James Marshall created "La Matraque", a desk accessory which generated random sequences of both legal and invalid graphical interface events at high speed, thus testing the critical edge behaviour of the underlying graphics libraries. The program would be run for days on end before production delivery to ensure the required degree of resilience. The tool was subsequently extended to cover the database and other file access instructions of the ABAL language, to check their resilience as well. A variation of this tool is currently employed for the qualification of the modern-day version, known as OPENABAL.
2003 – Amazon
While working to improve website reliability at Amazon, Jesse Robbins created "Game day", an initiative that increases reliability by purposefully creating major failures on a regular basis. Robbins has said it was inspired by firefighter training and by lessons from research in other fields, such as complex systems and reliability engineering.
2006 – Google
While at Google, Kripa Krishnan created a program similar to Amazon's Game day (see above), called "DiRT" (Disaster Recovery Testing). Jason Cahoon, a Site Reliability Engineer at Google, contributed a chapter on Google DiRT to the "Chaos Engineering" book.
2011 – Netflix
While overseeing Netflix's migration to the cloud in 2011, Nora Jones, Casey Rosenthal, and Greg Orzell expanded the discipline by setting up a tool that would cause breakdowns in their production environment, the environment used by Netflix customers. The intent was to move from a development model that assumed no breakdowns to a model where breakdowns were considered to be inevitable, driving developers to consider built-in resilience to be an obligation rather than an option:

"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services."

By regularly "killing" random instances of a software service, it was possible to test a redundant architecture to verify that a server failure did not noticeably impact customers. The concept of chaos engineering is close to that of Phoenix Servers, first introduced by Martin Fowler in 2012.

== Chaos engineering tools ==