
Evaluating the Effectiveness of Delta Lake over Parquet in a Python Pipeline
Sai Nikhil Donthi, Department of Software Engineering, University of Houston–Clear Lake; Oil and Gas Industry, Houston, Texas, USA

Abstract
We have been witnessing rapid growth in data-intensive applications adopting efficient columnar storage formats, with Apache Parquet becoming a widely used standard in modern data pipelines. Parquet has proven more efficient than traditional databases in terms of columnar storage, schema evolution, compression, and the breadth of supported tooling. It does not, however, capture transaction logs, and it lacks ACID properties and atomic writes, which can result in data corruption and makes metadata operations expensive. Although Parquet can compress and store data drawn from many source formats (JSON, XML, CSV, audio, and so on), it offers none of the schema enforcement, reliability, or transaction guarantees that modern data-driven applications require. Delta Lake, an open-source storage layer built on top of Parquet, addresses these limitations: its delta log records incremental transaction logs, adding ACID transactions, schema evolution, time travel, and unified batch and streaming support. This study evaluates the effectiveness of Delta Lake over Apache Parquet in order to surface the key benefits Delta Lake brings to big data workloads, using the optimization techniques in Microsoft Fabric, where Delta Lake is the backend storage layer, as the baseline. The findings indicate that Parquet is best suited to read-oriented, low-concurrency workloads (it does not support parallel write operations and requires batch processing), while Delta Lake provides remarkable advantages for heavy workloads that require data versioning, reliability, and parallel execution. The research assesses Delta Lake against Parquet on critical performance parameters such as read/write speed, concurrency management, update/delete efficiency, and operational reliability, under both batch and streaming (parallel) processing in a Python-based data pipeline.
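To ground the comparison, the following minimal Python sketch contrasts the two write paths. It is an illustration under stated assumptions, not the study's benchmark code: it assumes the pandas and deltalake (delta-rs) packages are installed, and the local paths events.parquet and ./events are hypothetical.

import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"sensor_id": [1, 2, 3], "reading": [0.4, 0.7, 0.9]})

# Plain Parquet: a bare file write with no transaction log. A failed or
# concurrent writer can leave readers seeing partial or corrupted data.
df.to_parquet("events.parquet")

# Delta Lake: the rows still land in Parquet files, but each write is
# committed atomically to the _delta_log directory (ACID transactions).
write_deltalake("./events", df, mode="overwrite")

# Appends create a new table version instead of mutating files in place.
more = pd.DataFrame({"sensor_id": [4], "reading": [1.1]})
write_deltalake("./events", more, mode="append")

# Time travel: read the table as of an earlier committed version.
dt_v0 = DeltaTable("./events", version=0)
print(dt_v0.to_pandas())                  # rows from the first commit only
print(DeltaTable("./events").version())   # latest version (1 here)

On a single-writer batch workload the two paths perform comparably; the differences the abstract emphasizes (atomic commits, concurrency control, update/delete efficiency, versioned reads) surface only once the delta log is involved.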
Keywords
Delta Lake, Parquet, ACID Transactions, Schema Evolution, Transaction Log
References
Amazon Web Services. (2018). From data lakes to rivers of insight: Our vision for the oil & gas industry and AWS partnership [eBook]. https://d1.awsstatic.com/Industries/Oil/AWS_Data_Lakes_eBook_O%26G_Final.pdf
Datanexum. (2025). Boost oil and gas operations with data analytics for ERP in 2025. https://datanexum.com/insights/f/boost-oil-and-gas-operations-with-data-analytics-for-erp-in-2025
Lang, L., Hernandez, E., Choudhary, K., & Romero, A. H. (2025). ParquetDB: A lightweight Python Parquet-based database [Preprint]. arXiv. https://arxiv.org/pdf/2502.05311
Armbrust, M., Das, T., Torres, J., Yavuz, B., Zhu, S., Xin, R., Ghodsi, A., Stoica, I., & Zaharia, M. (2018). Structured streaming: A declarative API for real-time applications in Apache Spark. SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data, 13–25. https://people.eecs.berkeley.edu/~matei/papers/2018/sigmod_structured_streaming.pdf
Harris, J. (n.d.). Delta Lake performance. Del (Harris)ta Lake Blog. https://delta.io/blog/delta-lake-performance/
Armbrust, M., Das, T., Sun, L., Yavuz, B., Zhu, S., Murthy, M., Torres, J., van Hovell, H., Ionescu, A., Łuszczak, A., Świtakowski, M., Szafrański, M., Li, X., Ueshin, T., Mokhtar, M., Boncz, P., Ghodsi, A., Paranjpye, S., Senster, P., Xin, R., & Zaharia, M. (2020). Delta Lake: High-performance ACID table storage over cloud object stores. Proceedings of the VLDB Endowment, 13(12), 3411–3424. https://doi.org/10.14778/3415478.3415560
Micheal, L. (2024). Trade-offs between batch and real-time processing: A case study of Spark Streaming in enterprise data pipelines [Unpublished manuscript].
Karau, H., & Warren, R. (2017). High performance Spark: Best practices for scaling and optimizing Apache Spark. O'Reilly Media.
Kodakandla, P. (2023). Real-time data pipeline modernization: A comparative study of latency, scalability, and cost trade-offs in Kafka-Spark-BigQuery architectures [White paper].
Salim, H. P. (2025). A comparative study of Delta Lake as a preferred ETL and analytics database. International Journal of Computer Trends and Technology (IJCTT), 73(1), 65–71. https://doi.org/10.14445/22312803/IJCTT-V73I1P108
Delta Lake. (n.d.). Build lakehouse with Delta Lake. Retrieved September 13, 2025, from https://delta.io/
Delta Lake. (2024). Structured Spark streaming with Delta Lake: A comprehensive guide. Retrieved September 13, 2025, from https://delta.io/blog/structured-spark-streaming
Copyright License
Copyright (c) 2025 Sai Nikhil Donthi

This work is licensed under a Creative Commons Attribution 4.0 International License.