Resilience Engineering in Cloud-Native Systems: A Human-Centered Chaos Engineering Framework for Fault-Tolerant Microservices and Distributed AI Ecosystems
Open Access
Kaira Kovarikova, Department of Computer Science, Charles University, Prague, Czech Republic
Abstract
The rapid evolution of cloud-native architectures, microservices, and distributed artificial intelligence systems has made modern computing environments considerably more complex and operationally uncertain. As organizations increasingly adopt cloud computing in critical domains such as healthcare, education, and enterprise systems, system resilience has become a foundational requirement. Traditional fault tolerance and reliability engineering approaches are often insufficient for the dynamic, distributed, and non-deterministic nature of cloud-native ecosystems. In this context, chaos engineering has gained prominence as an experimental methodology for proactively identifying system weaknesses through controlled fault injection.
This study presents a comprehensive, human-centered chaos engineering framework designed to enhance resilience in cloud-native systems, with a particular focus on microservices architectures and distributed AI workloads deployed on container orchestration platforms such as Kubernetes. Drawing upon existing literature in cloud computing, fault tolerance, dependable systems, and chaos engineering, the research synthesizes theoretical and practical insights into a unified model that integrates technical resilience mechanisms with organizational learning processes.
The methodology is an extensive conceptual analysis of resilience principles, fault injection strategies, and human-in-the-loop learning mechanisms. The analysis indicates that embedding chaos engineering practices into organizational workflows improves system robustness, accelerates incident response, and fosters a culture of continuous learning. It further indicates that integrating human-centered approaches enhances decision-making, reduces operational risk, and aligns technical resilience with business objectives.
The findings underscore the importance of combining technological innovation with human expertise to achieve sustainable resilience in complex distributed systems. The proposed framework contributes to both academic research and industry practice by offering a scalable and adaptable approach to resilience engineering in the cloud-native era.
Keywords
Cloud computing, Chaos engineering, Microservices, Fault tolerance
Copyright License
Copyright (c) 2026 Kaira Kovarikova

This work is licensed under a Creative Commons Attribution 4.0 International License.