AI & Data Innovation
Talk
Session Code: Sess-118
Day 2
13:55 - 14:25 EST
Many edge and embedded platforms now rely on Large Language Models (LLMs) to handle natural language processing with only limited resources. Real-time operation remains challenging due to slow inference, hardware constraints, and the trade-off between accuracy and efficiency. This research analyzes firmware-level optimizations that address these constraints, with the primary goal of reducing latency without any loss in model accuracy. We propose a framework that combines targeted firmware operations, scheduled memory accesses, and microarchitecture-dependent instructions. Specifically, we use 4-bit and 8-bit quantized operations, predictive memory access, and instruction scheduling tuned for ARM NEON and x86 AVX hardware. For validation, a dedicated hardware-in-the-loop (HIL) framework runs tests in real time, using fault injection and tracking memory, accuracy, and latency. We observe that our approach achieves substantial improvements in latency and energy use while maintaining over 95% of the original model's accuracy. This work provides practical guidance for developers and system architects deploying LLMs in applications that require fast responses.