# **Overview**

**AutoNeural** is a next-generation, **NPU-native multimodal vision–language model** co-designed from the ground up for real-time, on-device inference. Instead of adapting GPU-first architectures, AutoNeural redesigns both **vision encoding** and **language modeling** around the constraints and capabilities of NPUs, achieving **up to 14× lower latency**, **7× lower quantization error**, and **real-time automotive performance** even under aggressive low-precision settings.

AutoNeural integrates:

* A **MobileNetV5-based vision encoder** with depthwise separable convolutions.
* A **Liquid AI hybrid Transformer-SSM language backbone** that dramatically reduces KV-cache overhead.
* A **normalization-free MLP connector** tailored for quantization stability.
* Mixed-precision **W8A16 (vision)** and **W4A16 (language)** inference validated on real Qualcomm NPUs.
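
The list above maps to a simple three-stage pipeline. Below is a minimal PyTorch sketch of how these components could compose at inference time; the class and argument names are illustrative, not AutoNeural's actual API:

```python
import torch
import torch.nn as nn

class VisionLanguagePipeline(nn.Module):
    """Toy three-stage pipeline: vision encoder -> connector -> language model.
    Component names and shapes are illustrative assumptions."""
    def __init__(self, vision_encoder, connector, language_model):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g. a MobileNetV5-style CNN
        self.connector = connector              # normalization-free MLP
        self.language_model = language_model    # hybrid Transformer-SSM backbone

    def forward(self, image, text_embeddings):
        # 1) Encode the image into a compact feature map (e.g. 16x16x2048).
        features = self.vision_encoder(image)                # (B, 2048, 16, 16)
        # 2) Flatten spatial positions into a token sequence.
        tokens = features.flatten(2).transpose(1, 2)         # (B, 256, 2048)
        # 3) Project vision tokens into the LM embedding space.
        vision_embeddings = self.connector(tokens)           # (B, 256, d_model)
        # 4) Prepend vision tokens to the text sequence and decode.
        sequence = torch.cat([vision_embeddings, text_embeddings], dim=1)
        return self.language_model(sequence)
```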

AutoNeural powers real-time cockpit intelligence, including **in-cabin safety**, **out-of-cabin awareness**, **HMI understanding**, and **visual + conversational function calls**, as demonstrated in the on-device results.

---

# **Key Features**

### 🔍 **MobileNetV5 Vision Encoder (300M)**

Optimized for edge hardware, with:

* **Depthwise separable convolutions** for low compute and bounded activations (see the sketch below).
* **Local attention bottlenecks** only in late stages for efficient long-range reasoning.
* **Multi-Scale Fusion Adapter (MSFA)** producing a compact **16×16×2048** feature map.
* Stable **INT8/16** behavior with minimal post-quantization degradation.

Yields **5.8×–14× speedups** over ViT baselines across 256–768 px inputs.
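
For reference, a depthwise separable convolution factors a dense convolution into a per-channel spatial filter followed by a 1×1 pointwise channel mix, cutting multiply-accumulates by roughly k²·C_out / (k² + C_out). A minimal PyTorch sketch, with widths that are illustrative rather than AutoNeural's exact configuration:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Dense conv factored into depthwise (spatial) + pointwise (channel) steps."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: one k x k filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        # Pointwise: 1x1 conv mixes channels; this is where most MACs live.
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU6(inplace=True)  # bounded output range aids INT8/16

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# A 3x3 separable conv uses ~8-9x fewer MACs than a dense 3x3 conv at this width.
block = DepthwiseSeparableConv(256, 512)
print(block(torch.randn(1, 256, 32, 32)).shape)  # torch.Size([1, 512, 32, 32])
```

The bounded activation (ReLU6) is one reason such stacks keep stable INT8/16 ranges after quantization.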

---

### 🧠 **Hybrid Transformer-SSM Language Backbone (1.2B)**

Designed for NPU memory hierarchies:

* **5:1 ratio of SSM layers to Transformer attention layers**
* **Linear-time gated convolution layers** for most steps
* **Tiny rolling state** instead of a KV-cache, for up to **60% lower memory bandwidth** (see the sketch after this list)
* **W4A16-stable quantization** across layers
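
To make the memory argument concrete: attention must read O(context length) cached keys and values for every decoded token, while an SSM-style layer carries a constant-size state no matter how long the prefix is. The sketch below is a generic gated linear recurrence, assumed for illustration; it is not Liquid AI's actual layer:

```python
import torch
import torch.nn as nn

class GatedLinearRecurrence(nn.Module):
    """Generic SSM-style layer: per-step cost and state are O(d_model),
    independent of context length, unlike attention's O(context) KV reads."""
    def __init__(self, d_model):
        super().__init__()
        self.decay = nn.Parameter(torch.rand(d_model))   # learned forget rate
        self.in_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def step(self, x_t, state):
        # state: (batch, d_model) rolling summary of the entire prefix.
        a = torch.sigmoid(self.decay)                 # keep rates in (0, 1)
        state = a * state + (1 - a) * self.in_proj(x_t)
        y = torch.sigmoid(self.gate(x_t)) * state     # input-dependent gating
        return self.out_proj(y), state

d, B = 512, 1
layer = GatedLinearRecurrence(d)
state = torch.zeros(B, d)
for t in range(4096):      # 4k-token decode: state stays (1, 512) throughout,
    y, state = layer.step(torch.randn(B, d), state)
    # whereas an attention layer here would hold a 4096-entry KV-cache per head
```

With a 5:1 layer ratio, only a minority of layers pay KV-cache costs at all, which is consistent with the quoted bandwidth savings.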

---

### 🔗 **Normalization-Free Vision–Language Connector**

A compact two-layer MLP using **SiLU**, deliberately **omitting RMSNorm** to avoid unstable activation ranges during static quantization. This ensures reliable deployment in W8A16/W4A16 pipelines.
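
A hedged sketch of such a connector follows; the hidden width and embedding dimensions are illustrative assumptions. The key property is that every operation is a matrix multiply or a smooth pointwise nonlinearity, so activation ranges can be profiled once and frozen for static quantization, with no data-dependent rescaling (such as RMSNorm) in the path:

```python
import torch.nn as nn

class NormFreeConnector(nn.Module):
    """Two-layer SiLU MLP projecting vision features into the LM embedding
    space. No RMSNorm/LayerNorm: every op has statically profilable ranges.
    Dimensions below are illustrative, not AutoNeural's actual sizes."""
    def __init__(self, vision_dim=2048, lm_dim=1536, hidden_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.SiLU(),                       # smooth, quantization-friendly
            nn.Linear(hidden_dim, lm_dim),
        )

    def forward(self, vision_tokens):        # (B, num_patches, vision_dim)
        return self.net(vision_tokens)       # (B, num_patches, lm_dim)
```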

---

### 🚗 **Automotive-Grade Multimodal Intelligence**

Trained on **10M Infinity-MM samples** plus **200k automotive cockpit samples**, covering:

* AI Sentinel (vehicle security)
* AI Greeter (identity recognition)
* Car Finder (parking localization)
* Passenger safety monitoring

This coverage yields robust performance across lighting, demographics, weather, and motion scenarios.

---

### ⚡ **Real NPU Benchmarks**

Validated on the **Qualcomm SA8295P NPU**:

| Metric | Baseline (InternVL 2B) | **AutoNeural-VL** |
| --- | --- | --- |
| **TTFT (time to first token)** | ~1.4 s | **~100 ms** |
| **Max vision resolution** | 448×448 | **768×768** |
| **RMS quantization error** | 3.98% | **0.56%** |
| **Decode throughput** | 15 tok/s | **44 tok/s** |
| **Context length (tokens)** | 1024 | **4096** |
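
The RMS quantization error metric can be reproduced conceptually: quantize a weight tensor to low-bit integers, dequantize, and measure the RMS of the residual relative to the RMS of the original tensor. The sketch below assumes plain symmetric per-tensor quantization; AutoNeural's actual W4A16 scheme is not specified here:

```python
import torch

def rms_quant_error(w: torch.Tensor, bits: int = 4) -> float:
    """Relative RMS error of symmetric per-tensor integer quantization."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for int4
    scale = w.abs().max() / qmax                  # one scale for the whole tensor
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    w_hat = w_q * scale                           # dequantize back to float
    return (torch.sqrt(torch.mean((w - w_hat) ** 2))
            / torch.sqrt(torch.mean(w ** 2))).item()

w = torch.randn(1024, 1024)
print(f"int4 relative RMS error: {rms_quant_error(w):.2%}")
```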

---

# **How to Use**

> ⚠️ **Hardware requirement:** AutoNeural is optimized for **Qualcomm NPUs**.

### 1) Install Nexa-SDK

Download the SDK and follow the installation steps provided on the model page.

---

### 2) Configure authentication

Create an access token in the Model Hub, then run:

```bash
nexa config set license '<access_token>'
```

---

### 3) Run the model

```bash
nexa infer NexaAI/AutoNeural
```

### Image input

Drag and drop one or more image files into the terminal window. Multiple images can be processed with a single query.

### Example prompts

* “Is there any safety risk for the child in this image?”
* “Explain the meaning of this warning light.”
* “What are the parking rules shown in this sign?”
* “Create a calendar event based on this poster.”

---

# **License**

The AutoNeural model is released under the **Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)** license.

You may:

* Use the model for **non-commercial** purposes
* Modify and redistribute it with attribution

For **commercial licensing**, please contact **[dev@nexa.ai](mailto:dev@nexa.ai)**.