Authors: K. J. Hole
Title: Tutorial on systems with antifragility to downtime
Affiliation: Cryptography
Project(s): No Simula project
Status: Published
Publication Type: Journal Article
Year of Publication: 2021
Journal: Computing
Volume: 104
Number: 1
Pagination: 73-93
Date Published: 01/2021
Publisher: Springer
Keywords: Antifragility, Design principles, Distributed systems, Uptime
Abstract

An antifragile system of software and stakeholders, including designers, developers, and operators, learns from incidents how to avoid outages and maintain high uptime. This tutorial article reviews how to design and operate such socio-technical systems with antifragility to downtime. It documents the importance of four design principles and two operational principles by exploring the polar opposite anti-principles and the interplay between the principles and the anti-principles. The design principles mandate a software design of separate and isolatable processes with sufficient diversity and redundancy. The processes should communicate asynchronously over an external network. The operational principles imply that the software development teams should repeatedly inject artificial failures into the production system to understand its behavior and detect and mitigate vulnerabilities as the system and its environment change.

DOI: 10.1007/s00607-020-00895-6
Citation Key: 28372