Reducing Latency and Enhancing Accuracy in LLM Inference through Firmware-Level Optimization

AI & Data Innovation

Talk

Session Code

Sess-118

Day 2

13:55 - 14:25 EST


About the Session

Many edge and embedded platforms now rely on Large Language Models (LLMs) for natural language processing under tight resource budgets. Slow inference, hardware constraints, and the trade-off between accuracy and efficiency still make real-time operation difficult. This session examines firmware-level optimizations that address these constraints, with the primary goal of reducing latency without sacrificing model accuracy. The work combines targeted firmware modifications, scheduled memory accesses, and microarchitecture-specific instructions into a single framework: 4-bit and 8-bit quantized operations, predictive memory prefetching, and instruction scheduling tuned for ARM NEON and x86 AVX hardware. For validation, a dedicated hardware-in-the-loop (HIL) framework runs tests in real time, combining fault injection with memory, accuracy, and latency tracking. The approach achieves substantial improvements in latency and energy use while retaining over 95% of the original model's performance. The session closes with practical guidance for developers and system architects deploying LLMs in latency-sensitive applications.
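To make the combination of low-bit arithmetic and predictive memory access concrete, here is a minimal sketch, not taken from the speaker's implementation: an 8-bit quantized dot product (the core kernel of LLM matrix multiplication) with explicit software prefetching via the GCC/Clang `__builtin_prefetch` builtin. The function name `dot_q8`, the prefetch distance `PF_DIST`, and the scale values are all hypothetical; a production kernel would replace the scalar loop with NEON or AVX intrinsics as the abstract describes.

```c
#include <stdint.h>
#include <stdio.h>

#define PF_DIST 64  /* hypothetical prefetch look-ahead, in elements */

/* Dot product over int8 weights/activations, accumulated in int32,
 * then rescaled to the float domain with per-tensor scales.
 * Illustrative only; not the session's actual kernel. */
static float dot_q8(const int8_t *w, const int8_t *x, int n,
                    float w_scale, float x_scale)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        /* Hint upcoming cache lines into cache ahead of use; prefetch
         * is only a hint and never faults, so running past the end of
         * the arrays is safe. A vectorized build would use NEON
         * (e.g. vdotq_s32) or AVX2 (e.g. _mm256_maddubs_epi16) here. */
        __builtin_prefetch(&w[i + PF_DIST], 0, 1);
        __builtin_prefetch(&x[i + PF_DIST], 0, 1);
        acc += (int32_t)w[i] * (int32_t)x[i];
    }
    return (float)acc * w_scale * x_scale;
}

int main(void)
{
    enum { N = 256 };
    int8_t w[N], x[N];
    for (int i = 0; i < N; i++) {
        w[i] = (int8_t)(i % 7 - 3);
        x[i] = (int8_t)(i % 5 - 2);
    }
    /* Per-tensor scales would come from the quantizer; dummy values here. */
    printf("dequantized dot = %f\n", dot_q8(w, x, N, 0.05f, 0.02f));
    return 0;
}
```

The key design point the sketch illustrates: accumulating in int32 preserves accuracy during the low-bit inner loop, while the single rescale at the end keeps the dequantization cost off the hot path.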


Speaker

Reena Chandra

Senior Engineer, Amazon