Articles | Open Access | https://doi.org/10.55640/ijdsml-05-01-30

Machine Learning–Augmented ETL Pipelines for Fraud–Resistant Insurance Claims Processing

Kawaljeet Singh Chadha, University of the Cumberlands, Williamsburg, KY, USA

Abstract

Insurance fraud causes massive financial losses and operational inefficiencies across the insurance industry. Current fraud detection methods tend to rely on rule-based systems and static Extract, Transform, Load (ETL) pipelines, which cannot keep pace with rapidly evolving fraud tactics. These conventional approaches exhibit high false-positive rates and limited flexibility, and they cannot perform real-time analysis, which delays detection and raises operational costs. This article describes the integration of machine learning (ML) techniques into ETL pipelines to enable real-time, data-driven fraud identification during insurance claims processing. The system embeds supervised ML classifiers within the ETL workflow, enabling dynamic analysis of claims data during ingestion and transformation. Temporal behavior modeling and enrichment from external data sources, combined with a fraud auto-registry, allow the system to improve its detection of complex behaviors over time. Pipeline orchestration supports scalability and near real-time processing, resulting in timely fraud risk scoring. Experimental results show that the proposed methods significantly improve detection accuracy and reduce latency compared with traditional methods. Dimensionality reduction techniques can further optimize model performance. With this approach, claims processing can adapt to changing fraud patterns and scale without sacrificing efficiency or resilience. Ultimately, the proposed ML-augmented ETL pipeline gives insurers a powerful tool for reducing fraud losses while maintaining agility and compliance.
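
The idea of scoring claims inside the ETL workflow can be illustrated with a minimal sketch. It is not the article's implementation: the abstract does not name a classifier, so a small logistic regression trained by gradient descent stands in for the supervised model. The function names (`extract`, `transform`, `score`, `load`, `run_pipeline`), the 30-day window, the two features, the 0.5 threshold, and all claim and training records are hypothetical and synthetic.

```python
import math
from collections import defaultdict
from datetime import date, timedelta

# Synthetic labeled history (assumption): [amount / 10_000, prior claims in window]
TRAIN_X = [[0.1, 0], [0.2, 0], [0.15, 1], [0.9, 3], [0.8, 2], [0.7, 4]]
TRAIN_Y = [0, 0, 0, 1, 1, 1]  # 1 = confirmed fraud, 0 = legitimate


def extract():
    """Return raw claim records (a hypothetical in-memory source)."""
    return [
        {"claim_id": "C1", "claimant": "A", "amount": 1200.0, "filed": date(2024, 1, 5)},
        {"claim_id": "C2", "claimant": "B", "amount": 8000.0, "filed": date(2024, 1, 2)},
        {"claim_id": "C3", "claimant": "B", "amount": 7500.0, "filed": date(2024, 1, 10)},
        {"claim_id": "C4", "claimant": "B", "amount": 9000.0, "filed": date(2024, 1, 20)},
        {"claim_id": "C5", "claimant": "B", "amount": 8500.0, "filed": date(2024, 1, 28)},
    ]


def transform(claims, window_days=30):
    """Add a temporal feature: prior claims by the same claimant within the window."""
    history = defaultdict(list)
    rows = []
    for claim in sorted(claims, key=lambda c: c["filed"]):
        cutoff = claim["filed"] - timedelta(days=window_days)
        recent = sum(1 for d in history[claim["claimant"]] if d >= cutoff)
        history[claim["claimant"]].append(claim["filed"])
        features = [claim["amount"] / 10_000.0, float(recent)]
        rows.append({**claim, "recent_claims": recent, "features": features})
    return rows


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression weights with per-sample gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def score(rows, model):
    """Attach a fraud risk score in [0, 1] to each transformed claim."""
    w, b = model
    for row in rows:
        z = b + sum(wj * xj for wj, xj in zip(w, row["features"]))
        row["fraud_score"] = sigmoid(z)
    return rows


def load(rows, threshold=0.5):
    """Split claim ids into a review queue and a cleared list (hypothetical sinks)."""
    flagged = [r["claim_id"] for r in rows if r["fraud_score"] >= threshold]
    cleared = [r["claim_id"] for r in rows if r["fraud_score"] < threshold]
    return flagged, cleared


def run_pipeline():
    model = train_logistic(TRAIN_X, TRAIN_Y)
    return load(score(transform(extract()), model))


if __name__ == "__main__":
    flagged, cleared = run_pipeline()
    print("flagged for review:", flagged)
    print("cleared:", cleared)
```

The scoring step sits between transformation and loading, so each claim receives a risk score before it reaches a downstream store. In a production setting the model would be trained offline on real labeled claims and the extract step would read from a stream or database rather than a list.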

Keywords

Insurance fraud detection, Machine learning, ETL pipeline, Temporal behavior modeling, Real-time fraud scoring.

How to Cite

Machine Learning–Augmented ETL Pipelines for Fraud–Resistant Insurance Claims Processing. (2025). International Journal of Data Science and Machine Learning, 5(01), 410-436. https://doi.org/10.55640/ijdsml-05-01-30