---
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
tags:
  - audio-reasoning
  - chain-of-thought
  - multi-modal
  - step-audio-r1
---

Overview of Step-Audio-R1.1

Introduction

Step-Audio-R1.1 (Realtime) is a major upgrade to Step-Audio-R1, designed for interactive spoken dialogue with both real-time responsiveness and strong reasoning capability.

Unlike conventional streaming speech models that trade intelligence for latency, R1.1 enables thinking while speaking, achieving high intelligence without sacrificing speed.

Mind-Paced Speaking (Low Latency)

Building on the Mind-Paced Speaking research, the Realtime variant adopts a Dual-Brain Architecture:

  • A Formulation Brain responsible for high-level reasoning
  • An Articulation Brain dedicated to speech generation

This decoupling allows the model to perform Chain-of-Thought reasoning during speech output, maintaining ultra-low latency while handling complex tasks in real time.
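
To make the decoupling concrete, here is a toy sketch (an illustration of the idea, not the released implementation): a formulation worker hands plan fragments to an articulation worker through a queue, so speech output can begin as soon as the first fragment is ready rather than after the full reasoning pass.

    # Toy illustration of the Dual-Brain idea (not the actual model code):
    # a slow "formulation brain" streams plan fragments to a fast
    # "articulation brain" so speaking never waits for reasoning to finish.
    import queue
    import threading
    import time

    plan_queue: "queue.Queue[str | None]" = queue.Queue()

    def formulation_brain() -> None:
        # Stand-in for slow chain-of-thought reasoning.
        for step in ["greet the user", "answer the question", "wrap up"]:
            time.sleep(0.5)          # deliberate "thinking" latency
            plan_queue.put(step)     # hand off each fragment immediately
        plan_queue.put(None)         # sentinel: reasoning finished

    def articulation_brain() -> None:
        # Stand-in for low-latency speech generation.
        while (step := plan_queue.get()) is not None:
            print(f"speaking about: {step}")  # would emit audio in practice

    t = threading.Thread(target=formulation_brain)
    t.start()
    articulation_brain()  # speaks concurrently with ongoing reasoning
    t.join()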

Acoustic-Grounded Reasoning (High Intelligence)

To address the inverted scaling issue (where reasoning over transcripts can degrade performance), Step-Audio-R1.1 grounds its reasoning directly in acoustic representations rather than text alone.

Through iterative self-distillation, extended deliberation becomes a strength instead of a liability. This enables effective test-time compute scaling and leads to state-of-the-art performance, including top-ranking results on the AA benchmark.

Online demonstration

You can try the model online in StepFun Audio Studio.

WeChat group

You can scan the following QR code to join our WeChat group for communication and discussion.

[QR code image]

Model Usage

📜 Requirements

  • GPU: NVIDIA GPUs with CUDA support (tested on 4×L40S/H100/H800/H20); a quick sanity check is sketched after this list.
  • Operating System: Linux.
  • Python: >= 3.10.0.
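
If you want to verify the GPU setup before serving, the following minimal check may help. It assumes PyTorch is installed in your environment; the serving commands later in this README use --tensor-parallel-size 4, so at least 4 visible GPUs are expected.

    # Optional sanity check (assumes PyTorch is installed): confirm CUDA is
    # available and count visible GPUs; the serve commands below pass
    # --tensor-parallel-size 4, which needs at least 4 GPUs.
    import torch

    assert torch.cuda.is_available(), "CUDA is not available"
    print(f"Visible GPUs: {torch.cuda.device_count()}")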

⬇️ Download Model

First, you need to download the Step-Audio-R1.1 model weights.

Method A · Git LFS

git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-R1.1

Method B · Hugging Face CLI

hf download stepfun-ai/Step-Audio-R1.1 --local-dir ./Step-Audio-R1.1
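
If you prefer a scripted download, the huggingface_hub Python API offers the same functionality as the CLI. A minimal sketch:

    # Programmatic download via the huggingface_hub package (equivalent to
    # the CLI command above).
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="stepfun-ai/Step-Audio-R1.1",
        local_dir="./Step-Audio-R1.1",
    )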

🚀 Deployment and Execution

We provide two ways to serve the model: Docker (recommended) or compiling the customized vLLM backend.

🐳 Method 1 · Run with Docker (Recommended)

A customized vLLM image is required.

  1. Pull the image:

    docker pull stepfun2025/vllm:step-audio-2-v20250909

  2. Start the service (assuming the model weights are in the Step-Audio-R1.1 folder in the current directory):

    docker run --rm -ti --gpus all \
        -v $(pwd)/Step-Audio-R1.1:/Step-Audio-R1.1 \
        -p 9999:9999 \
        stepfun2025/vllm:step-audio-2-v20250909 \
        -- vllm serve /Step-Audio-R1.1 \
        --served-model-name Step-Audio-R1.1 \
        --port 9999 \
        --max-model-len 16384 \
        --max-num-seqs 32 \
        --tensor-parallel-size 4 \
        --chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}' \
        --enable-log-requests \
        --interleave-mm-strings \
        --trust-remote-code
    

After the service starts, it will listen on localhost:9999.
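
Once the service is listening, it exposes the standard OpenAI-compatible API. The sketch below sends one audio question together with a text instruction. It assumes the openai Python package is installed, that this customized vLLM build accepts audio through the audio_url content part (as stock vLLM audio models do), and that question.wav is a hypothetical local file you supply.

    # Client sketch (assumptions: `openai` package installed; server accepts
    # audio via the OpenAI-compatible `audio_url` content part; question.wav
    # is a local audio file you provide).
    import base64

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:9999/v1", api_key="EMPTY")

    with open("question.wav", "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="Step-Audio-R1.1",  # must match --served-model-name
        messages=[{
            "role": "user",
            "content": [
                {"type": "audio_url",
                 "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
                {"type": "text", "text": "Please answer the question in the audio."},
            ],
        }],
        max_tokens=2048,
    )
    print(response.choices[0].message.content)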

⚙️ Method 2 · Run from Source (Compile vLLM)

Step-Audio-R1.1 requires a customized vLLM backend.

  1. Download Source Code:

    git clone https://github.com/stepfun-ai/vllm.git
    cd vllm
    
  2. Prepare Environment:

    python3 -m venv .venv
    source .venv/bin/activate
    
  3. Install and Compile: vLLM contains both C++ and Python code. We mainly modified the Python code, so the C++ part can use the pre-compiled version to speed up the process.

    # Use pre-compiled C++ extensions (Recommended)
    VLLM_USE_PRECOMPILED=1 pip install -e .
    
  4. Switch Branch: After compilation, switch to the branch that adds Step-Audio support. Because the install above is editable (-e), the branch's Python changes take effect without reinstalling.

    git checkout feat/step-audio-support
    
  5. Start the Service:

    # Ensure you are in the vllm directory and the virtual environment is activated
    source .venv/bin/activate
    
    python3 -m vllm.entrypoints.openai.api_server \
        --model ../Step-Audio-R1.1 \
        --served-model-name Step-Audio-R1.1 \
        --port 9999 \
        --host 0.0.0.0 \
        --max-model-len 65536 \
        --max-num-seqs 128 \
        --tensor-parallel-size 4 \
        --gpu-memory-utilization 0.85 \
        --trust-remote-code \
        --enable-log-requests \
        --interleave-mm-strings \
        --chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}'
    

After the service starts, it will listen on localhost:9999.
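
To confirm the server is reachable (after either method), you can query the standard /v1/models endpoint. A minimal smoke test using only the Python standard library:

    # Smoke test: list the models the server exposes. Uses only the standard
    # library and expects the service started above on port 9999.
    import json
    import urllib.request

    with urllib.request.urlopen("http://localhost:9999/v1/models") as resp:
        print(json.dumps(json.load(resp), indent=2))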