Articles | Open Access | https://doi.org/10.55640/ijdsml-05-02-05

Self-Healing Software Architectures in the Cloud: AI-Driven Detection and Recovery Mechanisms

Srinivasu Yalamati, Independent Researcher, USA

Abstract

The recent evolution of cloud computing demands systems that can self-diagnose, self-heal, and continuously optimize without human intervention. This paper presents an in-depth review of self-healing software architectures in cloud computing, focusing on AI-driven detection and recovery methods. We discuss how self-healing systems have evolved from traditional approaches to modern AI-powered systems, and we categorize the main methods used for synchronization, tracking, and fixing problems in today's cloud services. Based on a systematic review of the available literature, we investigate key issues such as fault detection accuracy, recovery time optimization, and system reliability improvement. The study finds that, although much has been achieved in self-healing, existing approaches cannot yet deal efficiently with complex fault situations or sufficiently reduce service interruption. Our results suggest that applying large language models updated using machine learning has the potential to deliver up to an 85% increase in fault prediction accuracy and a 60% reduction in system downtime compared with state-of-the-art approaches. Finally, we outline directions for future research, including a better understanding and development of new AI models, different system architectures, and standard ways to measure how well self-healing cloud systems work.

Keywords

Self-healing systems, cloud computing, artificial intelligence, fault detection, autonomic computing

How to Cite

Self-Healing Software Architectures in the Cloud: AI-Driven Detection and Recovery Mechanisms. (2025). International Journal of Data Science and Machine Learning, 5(02), 56-65. https://doi.org/10.55640/ijdsml-05-02-05