lzy committed · 3b81a9d
Parent(s): 189e351
Add model weights
README.md ADDED
@@ -0,0 +1,17 @@
+---
+license: mit
+---
+
+# MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation
+
+![arXiv](https://img.shields.io/badge/arXiv-2509.26642-b31b1b.svg)
+![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)
+
+[🌐**Project Page**](https://sites.google.com/view/open-mla) | [✍️**Paper (arXiv)**](http://arxiv.org/abs/2509.26642) | [🎥**Demo**](https://sites.google.com/view/open-mla)
+
+Zhuoyang Liu*, Jiaming Liu*, Jiadong Xu, Nuowei Han, Chenyang Gu, Hao Chen, Kaichen Zhou, Renrui Zhang, Kai Chin Hsieh, Kun Wu, Zhengping Che, Jian Tang, Shanghang Zhang
+
+We introduce a multisensory language-action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling.
+Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that repurposes the large language model itself as a perception module, directly interpreting multimodal cues by aligning 2D images, 3D point clouds, and tactile tokens through positional correspondence.
+To further strengthen MLA’s understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation.
+
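To make the two ideas in the description concrete, here is a minimal PyTorch sketch of (1) encoder-free alignment, where image, point-cloud, and tactile patches are linearly tokenized and share positional indices inside one transformer backbone, and (2) future multisensory prediction heads whose pooled features also condition an action head. This is a conceptual illustration only: all class names, tokenizers, dimensions, and the pooling scheme are assumptions, not the MLA implementation or anything contained in this repository.

```python
# Conceptual sketch (assumed shapes and module names, not the authors' code).
import torch
import torch.nn as nn


class MultisensoryLanguageActionSketch(nn.Module):
    def __init__(self, d_model=512, n_patches=64, action_dim=7):
        super().__init__()
        # Lightweight linear "tokenizers" instead of separate pretrained encoders.
        self.img_proj = nn.Linear(3 * 16 * 16, d_model)   # flattened RGB patches
        self.pcd_proj = nn.Linear(3, d_model)             # xyz points
        self.tac_proj = nn.Linear(6, d_model)             # per-taxel readings
        self.txt_embed = nn.Embedding(32000, d_model)     # language tokens

        # Shared positional table: spatially corresponding patches, points, and
        # taxels receive the same positional index ("positional correspondence").
        self.pos_embed = nn.Embedding(n_patches, d_model)

        # The language-model-style backbone itself acts as the perception module.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

        # Future multisensory generation heads (semantic / geometric / interaction).
        self.future_img = nn.Linear(d_model, 3 * 16 * 16)
        self.future_pcd = nn.Linear(d_model, 3)
        self.future_tac = nn.Linear(d_model, 6)

        # Action head conditioned on the same pooled features.
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, img_patches, pcd_points, tac_signals, text_ids):
        n = img_patches.shape[1]
        pos = self.pos_embed(torch.arange(n, device=img_patches.device))[None]

        # Align modalities by adding the *same* positional embedding per index.
        tokens = torch.cat(
            [
                self.img_proj(img_patches) + pos,
                self.pcd_proj(pcd_points) + pos,
                self.tac_proj(tac_signals) + pos,
                self.txt_embed(text_ids),
            ],
            dim=1,
        )
        h = self.backbone(tokens)

        # Pool and predict future observations plus the action.
        ctx = h.mean(dim=1)
        return {
            "future_img": self.future_img(ctx),
            "future_pcd": self.future_pcd(ctx),
            "future_tac": self.future_tac(ctx),
            "action": self.action_head(ctx),
        }


if __name__ == "__main__":
    model = MultisensoryLanguageActionSketch()
    out = model(
        img_patches=torch.randn(2, 64, 3 * 16 * 16),
        pcd_points=torch.randn(2, 64, 3),
        tac_signals=torch.randn(2, 64, 6),
        text_ids=torch.randint(0, 32000, (2, 16)),
    )
    print({k: v.shape for k, v in out.items()})
```

The key design point the sketch tries to capture is that no modality-specific encoder sits in front of the backbone; alignment comes only from the shared positional indices, and the future-prediction heads supply the extra supervision that conditions action generation.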