Error installing from PR branch

#1
by DrRos - opened

got this while trying to install vllm:

ERROR: Could not find a version that satisfies the requirement xformers==0.0.33+5d4b92a5.d20251029; platform_system == "Linux" and platform_machine == "x86_64" (from vllm) (from versions: 0.0.1, 0.0.2, 0.0.3, 0.0.4, 0.0.5, 0.0.6, 0.0.7, 0.0.8, 0.0.9, 0.0.10, 0.0.11, 0.0.12, 0.0.13, 0.0.16rc424, 0.0.16rc425, 0.0.16, 0.0.20, 0.0.21, 0.0.22, 0.0.22.post7, 0.0.23, 0.0.23.post1, 0.0.24, 0.0.25, 0.0.25.post1, 0.0.26.post1, 0.0.27, 0.0.27.post1, 0.0.27.post2, 0.0.28, 0.0.28.post1, 0.0.28.post2, 0.0.28.post3, 0.0.29, 0.0.29.post1, 0.0.29.post2, 0.0.29.post3, 0.0.30, 0.0.31, 0.0.31.post1, 0.0.32.post1, 0.0.32.post2, 0.0.33.dev1089, 0.0.33.dev1090)
ERROR: No matching distribution found for xformers==0.0.33+5d4b92a5.d20251029; platform_system == "Linux" and platform_machine == "x86_64"

same with uv:

(vllm.313) drros@tesla:~/vllm.313/vllm$ VLLM_USE_PRECOMPILED=1 uv pip install -e .
Using Python 3.13.6 environment at: /home/drros/vllm.313/.venv
  × No solution found when resolving dependencies:
  ╰─▶ Because there is no version of xformers{platform_machine == 'x86_64' and sys_platform == 'linux'}==0.0.33+5d4b92a5.d20251029 and vllm==0.11.1rc6.dev8+g09c9d32a5.precompiled depends on xformers{platform_machine == 'x86_64' and sys_platform == 'linux'}==0.0.33+5d4b92a5.d20251029, we can conclude that vllm==0.11.1rc6.dev8+g09c9d32a5.precompiled cannot be used.
      And because only vllm==0.11.1rc6.dev8+g09c9d32a5.precompiled is available and you require vllm, we can conclude that your requirements are unsatisfiable.

Solved by commenting out the xformers pin in requirements/cuda.txt (it points at a locally built wheel that is not published on PyPI) and manually installing xformers==0.0.33.dev1090.
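
For reference, a rough sketch of that workaround from inside a vllm checkout (the sed pattern is an assumption about how the pin is written in requirements/cuda.txt; adjust it if the line looks different):

# comment out the locally-built xformers pin, reinstall vllm, then install a published xformers build
sed -i 's/^xformers==/# xformers==/' requirements/cuda.txt
VLLM_USE_PRECOMPILED=1 pip install -e .
pip install xformers==0.0.33.dev1090
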
There are some issues with the model: tool calling did not work for me (the vllm log showed `[serving_chat.py:256] RuntimeError: Kimi-K2 Tool parser could not locate tool call start/end tokens in the tokenizer!`), and the model runs slower than one would expect for 3B active parameters; I get 25-30 tps of token generation. This is with dual A5000s, running the model with -tp 2.
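
If you hit that tool-parser error, a quick (hedged) check is to dump the tokenizer's special tokens and see whether any tool-call start/end tokens are present at all; the path below is a placeholder for the downloaded model directory:

# print the special tokens the tokenizer actually exposes
python - <<'EOF'
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("/path/to/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit", trust_remote_code=True)
print(tok.special_tokens_map)
print(tok.additional_special_tokens)
EOF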

cyankiwi org

Thank you for trying the model. The problem also occurs in vllm main branch, i.e., here.

While this problem occurs, I would recommend the following as a fast fix:

# Install vllm without dependencies
VLLM_USE_PRECOMPILED=1 pip install --no-deps .
                       
# Install all other requirements except xformers
pip install -r requirements/common.txt
pip install numba==0.61.2
pip install "ray[cgraph]>=2.48.0"
pip install torch==2.9.0
pip install torchaudio==2.9.0
pip install torchvision==0.24.0
pip install flashinfer-python==0.4.1
                                    
# Install xformers WITHOUT its dependencies to prevent version changes
pip install --no-deps xformers==0.0.33.dev1090
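
After that sequence it is worth a quick sanity check that nothing pulled in a different torch or xformers behind your back (a generic check, nothing vllm-specific):

# verify the pins survived and the environment is consistent
pip check
python -c "import torch, xformers, vllm; print(torch.__version__, xformers.__version__, vllm.__version__)"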

Can confirm it works with two RTX 3090s at tp 2 with 30 t/s.
However, pipeline parallel (which I suppose would run faster) gives an error like "intermediate tensors is None".
Also, the tokenization or generation is a bit weird: when generating code it sometimes stops in the middle, and sometimes it randomly repeats the exact same token.

So inside HTML code it keeps appending "9px}" randomly throughout the code.

cyankiwi org

@DrRos @ztsvvstz I really appreciate your feedback. In models with a hybrid linear attention architecture, I keep the attention layers at BF16 precision for higher model accuracy, and this might be the reason for the slow inference speed.

I will keep this in mind in future model quantizations.

cpatonn -- This is also your account? I guess I have 2 places to look for updates.

Alright alright, I got some more info for ya :)
With the newest vllm it's now quite fast at ~70 t/s, buuuut...
it only outputs "!!!!!!!!", occasionally with some other random tokens in between, but mostly "!!!!!!!!!!!!!!!".
Honestly, if we get this working properly at this speed I'd be quite happy ;p
I used the same params as before.

cyankiwi org

@itsmebcc Yes, thank you for using my quant so far! I am starting to migrate from my HF account to this org account :)

cyankiwi org

@ztsvvstz Thank you for the info. May I get the revision that you are on? Could you try the latest commit i.e., 30a14b034fa387470a512e8004527ad1c28af303?
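
(If that SHA refers to the model repo rather than a vllm commit, pinning the revision when serving is just the --revision flag; the repo id and parallelism below are placeholders taken from this thread:)

vllm serve cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit \
    --revision 30a14b034fa387470a512e8004527ad1c28af303 \
    --trust-remote-code --tensor-parallel-size 2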

Didn't keep track of which commit I was on, sorry o:
But I can confirm that the newest vllm version does work without problems (so far).
Will do more speed/consistency tests later, but it seems like bugs such as ending early mid code generation do not happen anymore :)
I would be particularly interested in pipeline parallel (did not test this yet with the newest version), as in my experience it allows for higher t/s than tensor parallel.
Currently I'm at a throughput of ~1.5 GB/s on the PCIe link for my two GPUs, and I suspect that to be quite a bottleneck (qwen3-next, for example, runs pretty fast at 110 t/s with pipeline parallel on 3 GPUs).
Thanks for your work, appreciate the fast responses.

Pipeline parallel throws this error:

"gpu_model_runner.py", line 2007, in sync_and_slice_intermediate_tensors
assert self.intermediate_tensors is not None

I have it working pretty well.

I am running with this:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1 vllm serve /home/owner/stuff/kimi-awq --tensor-parallel-size 2 --pipeline-parallel-size 1 --max-model-len 121000 --trust-remote-code --max-num-seqs 1 --enable-auto-tool-choice --tool-call-parser kimi_k2 --enable-expert-parallel

on:

  • python==3.11.14 (conda env vllm-kimi)
  • vllm==0.11.1rc6.dev8+g09c9d32a5 (editable build from PR 27834)
  • torch==2.9.0+cu128, torchvision==0.24.0+cu128, torchaudio==2.9.0+cu128
  • xformers==0.0.33+5d4b92a5.d20251029 (custom wheel built from commit 5d4b92a5)
  • fla-core==0.4.0
  • transformers==4.57.1, huggingface-hub==0.36.0, tokenizers==0.22.1, sentencepiece==0.2.1
  • flashinfer-python==0.4.1
  • numpy==2.2.6, scipy==1.16.3, ray==2.51.0, cuda-python==13.0.3

with a 3090+4090

Pipeline parallel does not work currently with KimiLinearForCausalLM

I have it working pretty well.

I am running with this:
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1 vllm serve /home/owner/stuff/kimi-awq --tensor-parallel-size 2 --pipeline-parallel-size 1 --max-model-len 512000 --trust-remote-code --max-num-seqs 2 --enable-auto-tool-choice --tool-call-parser kimi_k2 --enable-expert-parallel

on:

  • python==3.11.14 (conda env vllm-kimi)
  • vllm==0.11.1rc6.dev8+g09c9d32a5 (editable build from PR 27834)
  • torch==2.9.0+cu128, torchvision==0.24.0+cu128, torchaudio==2.9.0+cu128
  • xformers==0.0.33+5d4b92a5.d20251029 (custom wheel built from commit 5d4b92a5)
  • fla-core==0.4.0
  • transformers==4.57.1, huggingface-hub==0.36.0, tokenizers==0.22.1, sentencepiece==0.2.1
  • flashinfer-python==0.4.1
  • numpy==2.2.6, scipy==1.16.3, ray==2.51.0, cuda-python==13.0.3

with a 3090+4090

Pipeline parallel does not work currently with KimiLinearForCausalLM

At what speeds? :)
Did some tests now, and it seems the 30 t/s I mentioned earlier only happens with pretty much no context (no chat template etc., just checking if the model responds at all).
When applying the chat template with a proper prompt, the speed drops down to

Processed prompts: 100%|████████████████████████████| 1/1 [00:04<00:00, 4.42s/it, est. speed input: 4452.57 toks/s, output: 1.81 toks/s]

~40 tokens/s

Thank you! Following your instructions, the model was successfully run. The output speed is 66 t/s, but the GPU is not being utilized to its full capacity.

Mostly getting 50 t/s for single requests on 2x3090. Seen >100 tok/s with 2 concurrent requests at >50k context. Not benchmarked.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0,1 vllm serve cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit \
        --port 8111  --gpu-memory-utilization 0.95 --max_model_len 74000 --tensor_parallel 2 --enable_prefix_caching --max_num_batched_tokens 74000 \
        --max-num-seqs 8 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2 --served-model-name Kimi-Linear-48B-A3B-Instruct --api-key local --dtype float16

It did crash vllm a few times; it seems there are still errors in the vllm code. So I am updating vllm to current with a git pull... and getting back into dependency hell :) Trying to fix it.

AttributeError: module 'triton.language' has no attribute 'constexpr_function'
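
That AttributeError usually points at a triton older than what the linear-attention kernels expect (which exact version is required here is an assumption on my part), so it is worth checking what the environment actually resolved to before anything else:

# inspect the installed triton (and torch, which pins its own triton build)
pip show triton torch | grep -E '^(Name|Version)'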

So, starting fresh and trying the latest, as today may be different in the Python dependency world:

# python 3.13.7 
git pull
python -m venv .venv
source .venv/bin/activate #.fish
pip install uv pip -U
uv pip install torch transformers xformers torchvision accelerate wheel fla-core  flash-linear-attention --extra-index-url https://download.pytorch.org/whl/cu130-ampare -U
uv pip install flash-attn --no-build-isolation
uv pip install -r requirements/common.txt -U
VLLM_USE_PRECOMPILED=1 uv pip install -e . -U  --prerelease=allow
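
A quick sanity check that the editable install actually picked up the intended code before serving (just printing resolved versions, nothing specific to this PR):

python -c "import vllm, torch, triton, xformers; print(vllm.__version__, torch.__version__, triton.__version__, xformers.__version__)"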

And would you look at that, it worked TODAY.

(APIServer pid=16310) INFO 11-06 10:40:03 [loggers.py:221] Engine 000: Avg prompt throughput: 672.1 tokens/s, Avg generation throughput: 50.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.5%, Prefix cache hit rate: 0.0%
(APIServer pid=16310) INFO 11-06 10:40:13 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=16310) INFO 11-06 10:40:23 [loggers.py:221] Engine 000: Avg prompt throughput: 224.1 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=16310) INFO 11-06 10:40:33 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=16310) INFO 11-06 10:40:43 [loggers.py:221] Engine 000: Avg prompt throughput: 768.2 tokens/s, Avg generation throughput: 27.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.8%, Prefix cache hit rate: 0.0%
(APIServer pid=16310) INFO 11-06 10:40:53 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.8%, Prefix cache hit rate: 0.0%

Started playing with opencode and aider with it.

Sometimes it doesn't understand what is asked, or does something else while saying it will do what was requested.
But ever the code junkie, it just goes on and spits out code at high speed... Quality?

Asking for fake thinking seems to help a bit:

You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. 
You should enclose your thoughts and internal monologue inside `<think>` MONOLOGUE `</think>` tags, and then provide your solution or response to the problem. 
You think and write your solution in ENGLISH unless specified.

I may also add this later part:


When generating your internal MONOLOGUE, consider the following structure:

1. **Initial Thoughts**: Start by clarifying the problem or question. Explore your initial reactions and thoughts.
2. **Analysis**: Break down the problem, identify key components, and examine relationships.
3. **Reflection**: Consider different perspectives, potential biases, and assumptions.
4. **Synthesis**: Think about how the different components fit together and how they relate to your existing knowledge.
5. **Conclusion**: Draw a final conclusion based on your reasoning.

Example:
<think>
[...]
</think>

[Insert final response or solution here]
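
For the record, wiring a system prompt like that into the running server is just the standard OpenAI-style chat call; the port, model name, and prompt below are placeholders for whatever you actually serve with:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Kimi-Linear-48B-A3B-Instruct",
        "messages": [
          {"role": "system", "content": "You are a deep thinking AI... enclose your internal monologue in <think></think> tags, then give your final answer."},
          {"role": "user", "content": "Write a small HTML page with a centered card."}
        ],
        "max_tokens": 1024
      }'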
