
Reducing Latency and Enhancing Accuracy in LLM Inference through Firmware-Level Optimization
Reena Chandra, Tools and Automation Engineer, Amazon, CA, USA
Abstract
Many edge and embedded platforms now rely on Large Language Models (LLMs) for natural language processing under tight resource budgets. Real-time operation remains difficult because of slow inference, hardware constraints, and the trade-off between accuracy and efficiency. This research analyzes firmware-level optimizations that address these constraints, with the primary goal of reducing latency without sacrificing model accuracy. The study assembles a framework that combines targeted firmware operations, scheduled memory accesses, and microarchitecture-specific instructions. We employ 4-bit and 8-bit quantized arithmetic, predictive memory prefetching, and instruction schedules tuned for ARM NEON and x86 AVX hardware. For validation, a dedicated hardware-in-the-loop (HIL) framework runs real-time tests with fault injection while tracking memory behavior, accuracy, and latency. Our approach achieves substantial improvements in latency and energy use while retaining over 95% of the original model's accuracy. This work offers practical guidance for developers and system architects deploying LLMs in latency-sensitive applications.
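To make the techniques named in the abstract concrete, the sketch below shows what symmetric 8-bit quantization and an int8 dot product with a software prefetch hint can look like in C. This is a minimal illustrative sketch under our own assumptions, not the paper's implementation: the function names quantize_q8 and q8_dot are hypothetical, and the GCC/Clang intrinsic __builtin_prefetch stands in for the firmware-level prefetch scheduling the paper describes.

```c
/* Illustrative sketch only -- the paper does not publish code.
 * Assumes symmetric per-tensor int8 quantization; names are hypothetical. */
#include <stddef.h>
#include <stdint.h>
#include <math.h>

/* Quantize a float vector to int8 with a single symmetric scale.
 * Returns the scale so the caller can dequantize results later. */
static float quantize_q8(const float *x, int8_t *q, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > max_abs) max_abs = a;
    }
    float scale = max_abs / 127.0f;
    for (size_t i = 0; i < n; i++)
        q[i] = (int8_t)lrintf(x[i] / (scale > 0.0f ? scale : 1.0f));
    return scale;
}

/* int8 dot product with 32-bit accumulation. The prefetch hint every
 * 64 elements is a stand-in for the scheduled memory accesses described
 * in the abstract; a prefetch past the end of the array is harmless. */
static float q8_dot(const int8_t *a, float sa,
                    const int8_t *b, float sb, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        if ((i & 63) == 0) __builtin_prefetch(&a[i + 64]);
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return (float)acc * sa * sb; /* dequantize the accumulated sum */
}
```

On real targets, the inner loop of q8_dot is where NEON or AVX dot-product instructions would replace the scalar multiply-accumulate; the quantization scales stay in float so accuracy loss is confined to the rounding step.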
Keywords
Large Language Models (LLMs), Firmware Optimization, Inference, Latency
Copyright License
Copyright (c) 2025 Reena Chandra

This work is licensed under a Creative Commons Attribution 4.0 International License.