A Unified Telemetry and Predictive Modeling Framework for Enhanced Site Reliability Engineering Observability

Professor Aulia Rahmawati, Faculty of Computer Science and Engineering, Universitas Indonesia, Depok, Indonesia
Engr. Kweku Mensah, Department of Telecommunications Engineering, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana

Abstract

Purpose: This study addresses the limitations of conventional Site Reliability Engineering (SRE) observability, which often relies on fragmented, reactive monitoring across logs, metrics, and traces. We propose and validate a novel Unified Telemetry and Predictive Modeling Framework (UT-PMF) designed to consolidate these data streams and provide proactive health signals.

Methodology: The UT-PMF integrates a Unified Telemetry Framework (UTF) for data ingestion and correlation with a Predictive Modeling Module (PMM). The PMM applies time-series anomaly detection to metrics and a Transformer-based Natural Language Processing (NLP) model to detect changes in log patterns. The framework was evaluated in a simulated distributed system against a baseline reactive monitoring setup, with Mean Time to Detection (MTTD) and Service Level Objective (SLO) compliance as the primary metrics.
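
As a rough illustration of the idea behind the PMM, the following minimal sketch pairs a trailing-window z-score detector for a metric with a simple fusion rule that keeps a metric anomaly only when a log-change signal occurs nearby. Everything here is an assumption made for illustration: the function names, the window and threshold values, the sample data, and the precomputed `log_change` scores, which stand in for the output of the Transformer-based log model and are not the study's actual detector or data.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def fused_alerts(metric_anomalies, log_scores, log_threshold=0.8, max_gap=2):
    """Keep a metric anomaly only if a log-change score above
    `log_threshold` occurs within `max_gap` time steps of it."""
    log_hits = [i for i, score in enumerate(log_scores) if score > log_threshold]
    return [
        m for m in metric_anomalies
        if any(abs(m - h) <= max_gap for h in log_hits)
    ]


# Hypothetical per-interval latency (ms) and log-change scores in [0, 1].
latency_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
log_change = [0.1, 0.1, 0.2, 0.1, 0.1, 0.3, 0.9, 0.4, 0.1, 0.1]

metric_hits = rolling_zscore_anomalies(latency_ms)
alerts = fused_alerts(metric_hits, log_change)
print("metric anomalies:", metric_hits)
print("fused alerts:", alerts)
```

Requiring agreement between the two signals is only one possible fusion rule; it trades some sensitivity for fewer alerts driven by a single noisy pillar.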

Findings: The UT-PMF reduced MTTD by 45% relative to the baseline. By fusing metric anomalies with log precursors, the framework allowed SRE teams to identify and address latent system degradation earlier than reactive monitoring did. This proactive capability directly supports improved SLO attainment and error budget management.
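
To make the reported quantities concrete, the sketch below shows how MTTD, a relative MTTD reduction, and remaining error budget are commonly computed. The incident timestamps, request counts, and SLO target are invented for illustration and were chosen so that the arithmetic lands on a 45% reduction; they are not the study's measurements, and the function names are hypothetical.

```python
def mean_time_to_detection(incidents):
    """incidents: list of (start_s, detected_s) pairs.
    Returns the mean detection delay in seconds."""
    delays = [detected - start for start, detected in incidents]
    return sum(delays) / len(delays)


def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent for a request-based SLO.
    The budget is the number of failures the SLO permits."""
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Illustrative (made-up) incidents: same start times, different detection times.
baseline_incidents = [(0, 600), (1000, 1400), (5000, 5800)]
predictive_incidents = [(0, 330), (1000, 1220), (5000, 5440)]

baseline_mttd = mean_time_to_detection(baseline_incidents)
predictive_mttd = mean_time_to_detection(predictive_incidents)
reduction = relative_reduction(baseline_mttd, predictive_mttd)

# Assumed 99.9% availability SLO over one million requests with 250 failures.
budget_left = error_budget_remaining(0.999, 1_000_000, 250)

print(f"baseline MTTD: {baseline_mttd:.0f} s")
print(f"predictive MTTD: {predictive_mttd:.0f} s")
print(f"MTTD reduction: {reduction:.0%}")
print(f"error budget remaining: {budget_left:.0%}")
```

Earlier detection matters for the error budget because failures accumulate for as long as an incident goes unnoticed, so a shorter MTTD leaves more of the budget unspent.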

Originality: This research contributes an integrated architectural and algorithmic approach that moves beyond data collection to unified, cross-pillar predictive analysis, offering a path toward proactive SRE observability in complex, high-stakes production environments.

Keywords

Site Reliability Engineering (SRE), Observability, Predictive Modeling, Telemetry, Service Level Objectives (SLOs), Distributed Systems, Anomaly Detection

How to Cite

A Unified Telemetry and Predictive Modeling Framework for Enhanced Site Reliability Engineering Observability. (2025). International Journal of Electronics and Communications, 5(01), 14-22. https://doi.org/10.55640/