Objectives
Real-Time Conversational Command Handling
Develop an in-car voice assistant that understands and responds to nuanced, conversational commands like "It’s too cold" or "The windshield is fogging up," offering more natural interactions than rigid, keyword-style voice commands.
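To make this objective concrete, here is a minimal sketch of how a conversational remark could be mapped to a structured vehicle action by an on-device LLM. The prompt, the `generate` callable, and the action schema are illustrative assumptions, not the project's actual interface.

```python
import json

# Hypothetical instruction prompt; the real system prompt is not public.
SYSTEM_PROMPT = (
    'You control a car\'s comfort systems. Map the driver\'s remark to one '
    'JSON action: {"action": "set_temperature" | "defog" | "none", "delta_c": <int>}'
)

def handle_utterance(generate, utterance: str) -> dict:
    """`generate` stands in for the on-device LLM's text-completion call."""
    reply = generate(f"{SYSTEM_PROMPT}\nDriver: {utterance}\nAction:")
    return json.loads(reply)

# e.g. handle_utterance(llm, "It's too cold")
#   might return {"action": "set_temperature", "delta_c": 2}
```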
Edge Deployment for Low Latency and Privacy
Replace the existing BERT-based model with a more capable LLM, such as LLaMA, optimized for on-device edge deployment, minimizing reliance on cloud-based processing to reduce latency and to keep voice data in the vehicle, enhancing user privacy.
Optimized Resource Utilization
Ensure efficient operation of the LLaMA model on Nvidia Orin by applying 2:4 structured sparsity and 8-bit quantization and by building the inference engine with TensorRT, enabling high performance without overtaxing the chip's computational resources.
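The objective above names the two core compression techniques; the sketch below shows what each means at the tensor level, assuming magnitude-based 2:4 pruning and symmetric per-tensor int8 quantization. The actual pipeline may instead rely on calibration-derived scales and TensorRT's sparse kernels.

```python
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """2:4 structured sparsity: in every group of 4 consecutive weights
    along the input dimension, zero the 2 smallest-magnitude entries."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "input dimension must be divisible by 4"
    groups = weight.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=1).indices   # 2 largest per group of 4
    mask = torch.zeros_like(groups)
    mask.scatter_(1, keep, 1.0)
    return (groups * mask).reshape(rows, cols)

def quantize_int8(weight: torch.Tensor):
    """Symmetric per-tensor int8 quantization; returns (int8 weights, scale)."""
    scale = weight.abs().max() / 127.0
    q = torch.clamp((weight / scale).round(), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(8, 16)                   # stand-in for one LLaMA weight matrix
w_sparse = prune_2_4(w)                  # half the weights zeroed, 2:4 pattern
w_int8, scale = quantize_int8(w_sparse)  # 8-bit storage plus one fp32 scale
print((w_sparse - w_int8.float() * scale).abs().max())  # reconstruction error
```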
Results
Sub-100 ms Response Time
Achieved real-time voice command processing with end-to-end response times under 100 ms, markedly improving the user experience for both simple and complex commands.
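A figure like this is typically collected with a small timing harness such as the sketch below; the `respond` callable is a placeholder for the full on-device pipeline, not an API from the project.

```python
import time
import statistics

def measure_latency(respond, utterances, warmup=5, runs=20):
    """Time end-to-end command handling in milliseconds.

    `respond` is a stand-in for the pipeline entry point
    (utterance in, assistant action out)."""
    for u in utterances[:warmup]:
        respond(u)                      # warm up caches and lazy initialization
    samples = []
    for _ in range(runs):
        for u in utterances:
            t0 = time.perf_counter()
            respond(u)
            samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    p95 = samples[max(0, int(0.95 * len(samples)) - 1)]
    print(f"median {statistics.median(samples):.1f} ms, p95 {p95:.1f} ms")
```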
Efficient Use of Orin’s Resources
Deployed the optimized LLaMA model without impacting the Orin chip's capacity to handle critical self-driving tasks, thanks to resource-efficient techniques like structured sparsity, quantization, and TensorRT integration.
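At the TensorRT level, exploiting 2:4 sparsity and 8-bit execution comes down to builder flags. The sketch below uses the standard TensorRT Python API on a hypothetical model.onnx export; a real INT8 build would also supply calibration data or explicit quantization scales.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# "model.onnx" is a placeholder for the exported, pruned LLM graph.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)  # use 2:4 sparse tensor cores
config.set_flag(trt.BuilderFlag.INT8)            # enable 8-bit kernels
# A real INT8 build also needs config.int8_calibrator or per-tensor
# scales embedded in the network; omitted here for brevity.

engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine)
```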
Enhanced Natural Language Understanding
The voice assistant handled conversational, nuanced commands effectively, improving context interpretation and user-intent recognition relative to the BERT-based model.
Cost Savings and Privacy
Eliminated the need for cloud-based processing, reducing operational costs and enhancing privacy by ensuring voice data remained local to the vehicle.