Architectural and Software-Based Fault Tolerance in Safety-Critical Embedded and Heterogeneous Computing Systems
Open Access
Dr. Michael J. Hartwell, Department of Electrical and Computer Engineering, Northbridge University, United Kingdom
Abstract
Fault tolerance has been a foundational concern in computing since the earliest digital machines, and its importance has grown with the proliferation of safety-critical embedded platforms in automotive, aerospace, industrial automation, and autonomous systems. As semiconductor technologies scale down and system architectures grow in complexity, modern platforms face unprecedented exposure to both transient and permanent faults caused by radiation effects, manufacturing variability, thermal stress, and software defects. This article presents an extensive, theory-driven analysis of fault-tolerant computing architectures and software mechanisms, grounded strictly in classical and contemporary references spanning foundational fault-tolerance theory, radiation-induced errors, virtualization, separation kernels, lockstep processors, heterogeneous computing, and GPU-based redundancy. By synthesizing insights from architectural redundancy, software diversity, hypervisor-based isolation, and mixed-criticality system design, it explores how modern systems reconcile performance demands with stringent safety and reliability requirements. Special emphasis is placed on dual-core and multicore lockstep architectures, virtualization-assisted isolation, software-only redundancy approaches, and emerging heterogeneous platforms that integrate CPUs and GPUs in safety-critical contexts. Rather than providing a superficial survey, the article develops each concept in depth, analyzing design trade-offs, theoretical underpinnings, failure coverage, and limitations. The results highlight that no single fault-tolerance technique is sufficient in isolation; layered, cross-domain approaches are necessary to address the evolving fault landscape. The discussion further identifies open challenges in timing predictability, certification, scalability, and cost efficiency, and outlines future research directions aligned with the trajectory of embedded high-performance computing systems. The article offers a comprehensive, integrative perspective intended to support researchers, system architects, and safety engineers who design dependable computing platforms.
Keywords
Fault-tolerant systems, lockstep architectures, embedded virtualization, mixed-criticality systems
References
Avizienis, A. (1976). Fault-tolerant systems. IEEE Transactions on Computers, C-25(12), 1304–1312.
Pierce, W. H. (1965). Failure-tolerant computer design. Academic Press.
Normand, E. (1996). Single event upset at ground level. IEEE Transactions on Nuclear Science, 43(6), 2742–2750.
Heiser, G. (2008). The role of virtualization in embedded systems. Proceedings of the Workshop on Isolation and Integration in Embedded Systems, 11–16.
Masmano, M., Ripoll, I., Crespo, A., & Metge, J. (2009). XtratuM: A hypervisor for safety critical embedded systems. Proceedings of the Real-Time Linux Workshop, 263–272.
West, R., Li, Y., Missimer, E., & Danish, M. (2016). A virtualized separation kernel for mixed-criticality systems. ACM Transactions on Computer Systems, 34(3).
Ramsauer, R., Kiszka, J., Lohmann, D., & Mauerer, W. (2017). Look Mum, no VM exits! Workshop on Operating Systems Platforms for Embedded Real-Time Applications.
Karim, A. S. A. (2023). Fault-tolerant dual-core lockstep architecture for automotive zonal controllers using NXP S32G processors. International Journal of Intelligent Systems and Applications in Engineering, 11(11s), 877–885.
Alcaide, S., et al. (2019). High-integrity GPU designs for critical real-time automotive systems. Design, Automation and Test in Europe Conference.
Alcaide, S., et al. (2019). Software-only diverse redundancy on GPUs for autonomous driving platforms. International On-Line Testing Symposium.
Alcaide, S., et al. (2020). Software-only based diverse redundancy for ASIL-D automotive applications on embedded HPC platforms. Defect and Fault Tolerance Symposium.
Alcaide, S., et al. (2022). SafeX: Open source hardware and software components for safety-critical systems. Forum on Specification and Design Languages.
Andriotis, N., et al. (2023). A software-only approach to enable diverse redundancy on Intel GPUs for safety-related kernels. Symposium on Applied Computing.
Bernick, D., et al. (2005). NonStop® advanced architecture. International Conference on Dependable Systems and Networks.
Cabo, G., et al. (2021). SafeSU: An extended statistics unit for multicore timing interference. European Test Symposium.
Chen, C. L., & Hsiao, M. Y. (1984). Error-correcting codes for semiconductor memory applications: A state of the art review. IBM Journal of Research and Development, 28(2), 124–134.
Cobham Gaisler. (2012). NOEL-V Processor.
Falk, H., et al. (2016). TACLeBench: A benchmark collection to support worst-case execution time research. WCET Workshop.
Fu, J., et al. (2013). On-demand thread-level fault detection in a concurrent programming environment. SAMOS Conference.
Copyright License
Copyright (c) 2025 Dr. Michael J. Hartwell

This work is licensed under a Creative Commons Attribution 4.0 International License.