
🌟 Github | 📄 Webpage

AutoNeural-VL-1.5B

Introduction

AutoNeural is an NPU-native vision–language model for in-car assistants, co-designed with a MobileNetV5 encoder and a hybrid Liquid AI 1.2B backbone to deliver real-time multimodal understanding on the Qualcomm SA8295P NPU. It processes 768×768 images, cuts end-to-end latency by up to 14×, and reduces RMS quantization error roughly 7× (3.98% → 0.562%) compared to ViT–Transformer baselines on the same hardware.

Key Features:

  • NPU-native co-design – MobileNet-based vision encoder + hybrid Transformer–SSM backbone, built for INT4/8/16 and NPU operator sets.
  • Real-time cockpit performance – Up to 14× lower TTFT, ~3× faster decode, and 4× longer context (4096 vs 1024) on Qualcomm SA8295P NPU.
  • High-resolution multimodal perception – Supports 768×768 images with ~45 dB SQNR under mixed-precision quantization (W8A16 vision, W4A16 language).
  • Automotive-tuned dataset – Trained with 200k proprietary cockpit samples (AI Sentinel, Greeter, Car Finder, Safety) plus large-scale Infinity-MM instruction data.
  • Production-focused – Designed for always-on, low-power, privacy-preserving deployment in real vehicles.

Use Cases

AutoNeural powers real-time cockpit intelligence, including in-cabin detection, out-of-cabin awareness, HMI understanding, and a combined visual and conversational agent.


Benchmarks

| Metric | InternVL 2B (baseline) | AutoNeural-VL |
| --- | --- | --- |
| TTFT (single 512×512 image) | ~1.4 s | ~100 ms |
| Max image size | 448×448 | 768×768 |
| SQNR | 28 dB | 45 dB |
| RMS quantization error | 3.98% | 0.562% |
| Decode throughput | ~15 tok/s | ~44 tok/s |
| Context length | 1024 tokens | 4096 tokens |

📝 These numbers are measured on-device with mixed precision (vision: W8A16; language: W4A16), not simulation.
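
For context, the SQNR and RMS-error rows describe the same quantity on different scales: SQNR(dB) ≈ 20·log₁₀(1 / relative RMS error). A quick sanity check of the table values, assuming the error column reports relative RMS error:

```python
import math

def sqnr_db(relative_rms_error: float) -> float:
    """Signal-to-quantization-noise ratio in dB from relative RMS error."""
    return 20 * math.log10(1.0 / relative_rms_error)

print(f"{sqnr_db(0.0398):.1f} dB")   # baseline:   3.98%  RMS error -> ~28.0 dB
print(f"{sqnr_db(0.00562):.1f} dB")  # AutoNeural: 0.562% RMS error -> ~45.0 dB
```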


How to Use

⚠️ Hardware requirement: AutoNeural is only available for Qualcomm NPUs.

1) Install Nexa-SDK

Download the SDK and follow the installation steps provided on the model page.

2) Configure authentication

Create an access token in the Model Hub, then run:

nexa config set license '<access_token>'

3) Run the model

nexa infer NexaAI/AutoNeural

4) Image input

Drag and drop one or more image files into the terminal window. Multiple images can be processed with a single query.


Model Architecture

AutoNeural is an NPU-native vision–language model co-designed for integer-only inference on edge devices (e.g. the Qualcomm SA8295P). Its components are summarized below; a minimal code sketch of how they compose follows the list.

  • Vision encoder. A MobileNetV5-style CNN initialized from Gemma 3n-E4B, taking 768×768 images and producing a 16×16×2048 feature map. A Multi-Scale Fusion Adapter (MSFA) fuses the last stages and flattens them into 256 visual tokens, giving strong inductive bias and stable INT8/16 quantization.
  • Vision–language connector. A lightweight 2-layer MLP projects visual tokens into the language embedding space. We deliberately remove normalization from the projector to make activation ranges easier to calibrate for static NPU quantization.
  • Language backbone. A 1.2B-parameter hybrid Transformer–SSM (“Liquid AI”) model with 16 layers, interleaving 10 gated-convolution SSM layers with 6 self-attention layers. The SSM layers provide linear-time inference and a compact state instead of a full KV cache, cutting memory I/O while the attention layers preserve strong reasoning and in-context learning.
  • Quantization. The deployed model uses mixed precision (e.g. W8A16 for vision, W4A16 for language) and NPU-aware graph partitioning to meet tight latency and memory budgets without sacrificing accuracy.
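
A minimal PyTorch sketch of how these components compose. Module internals, class names, and the text embedding dimension are illustrative assumptions; only the shapes stated above (768×768 input, 16×16×2048 feature map, 256 visual tokens, 2-layer norm-free projector) come from the description.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """2-layer MLP connector with no normalization, so activation
    ranges stay easy to calibrate for static NPU quantization."""
    def __init__(self, vision_dim: int = 2048, text_dim: int = 2048):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, text_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(text_dim, text_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))

class AutoNeuralVL(nn.Module):
    """Illustrative composition: CNN encoder -> MSFA -> projector -> hybrid LM."""
    def __init__(self, vision_encoder: nn.Module, msfa: nn.Module,
                 language_model: nn.Module, text_dim: int = 2048):
        super().__init__()
        self.vision_encoder = vision_encoder   # MobileNetV5-style CNN
        self.msfa = msfa                       # fuses last stages -> 256 visual tokens
        self.projector = Projector(text_dim=text_dim)
        self.language_model = language_model   # 1.2B hybrid Transformer-SSM backbone

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        feats = self.vision_encoder(images)    # 768x768 input -> (B, 16, 16, 2048)
        visual_tokens = self.msfa(feats)       # (B, 256, 2048)
        visual_embeds = self.projector(visual_tokens)
        # Standard VLM pattern: prepend visual tokens to the text sequence.
        inputs = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.language_model(inputs)
```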

Training

AutoNeural follows a four-stage curriculum on large-scale multimodal data plus a proprietary automotive dataset.

  1. Image–text alignment. Freeze the vision and language backbones; train only the projector on image–caption pairs to learn basic visual grounding (see the freezing sketch after this list).
  2. General visual understanding. Unfreeze the full model and train on broad VQA-style tasks (object/scene understanding, basic reasoning) from the Infinity-MM dataset to build strong general multimodal capability.
  3. Instruction tuning. Continue training on diverse instruction-following data (documents, charts, OCR, multi-turn dialogue, specialized domains) using a mixture of task weights for balanced performance.
  4. Automotive domain finetuning. Finetune on ~200k curated cockpit samples (AI Sentinel, Greeter, Car Finder, Safety when getting on/off) plus high-quality synthetic data, with an NPU-aware recipe that combines quantization-aware training, mixed-precision constraints, and calibration to keep post-quantization drift low on real hardware.
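
Stage 1 above updates only the projector. A minimal sketch of that setup, assuming a model object with a `projector` submodule as in the architecture sketch above (the optimizer choice and learning rate are illustrative, not the published recipe):

```python
import torch

def configure_stage1(model: torch.nn.Module, lr: float = 1e-3) -> torch.optim.Optimizer:
    """Stage 1: freeze vision encoder and language backbone, train projector only."""
    for p in model.parameters():
        p.requires_grad = False          # freeze everything...
    for p in model.projector.parameters():
        p.requires_grad = True           # ...then re-enable the projector
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```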

License

This model is licensed under the Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0) license, which allows use, sharing, and modification only for non-commercial purposes with proper attribution.

All NPU-related models, runtimes, and code in this project are protected under this non-commercial license and cannot be used in any commercial or revenue-generating applications.

Enterprise Deployment

For enterprise deployment, custom integrations, or licensing inquiries:

📅 Book a Call with Us
