Cartridges

Cartridges are a prompt-learning method that stores a compressed long-context representation as a parameterized KV-cache prefix. The core idea comes from the paper Cartridges: Lightweight and general-purpose long context representations via self-study.

For a high-level overview and motivation, see the blog post Cartridges: Storing long contexts in tiny caches with self-study.

How Cartridges differ from Prefix Tuning

Both Prefix Tuning and Cartridges are served by injecting past_key_values (a prefix KV cache) into the base model.

  • Prefix Tuning learns virtual token embeddings (and optionally an MLP projection) and produces a KV prefix.
  • Cartridges learn the KV prefix itself directly (the per-layer key/value vectors for p virtual tokens), and are designed to be initialized from real prefill KV (for example, the first p tokens of a corpus/system prompt).

The paper also recommends freezing the first token as an attention sink for stability (num_frozen_tokens=1 is the default).
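
As a minimal sketch, creating a cartridge before training could look like the following; the number of virtual tokens is an arbitrary example, and model-dependent fields such as token_dim are filled in by get_peft_model:

from transformers import AutoModelForCausalLM
from peft import CartridgeConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# 128 virtual tokens is an illustrative choice; the first token stays frozen as an attention sink.
config = CartridgeConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=128, num_frozen_tokens=1)
model = get_peft_model(base, config)
model.print_trainable_parameters()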

Usage (inference)

Load a trained CARTRIDGE adapter and run generation:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
adapter_path = "path/to/cartridge_adapter"

# Load the base model, then attach the trained cartridge adapter.
base = AutoModelForCausalLM.from_pretrained(model_id)
model = PeftModel.from_pretrained(base, adapter_path)

tok = AutoTokenizer.from_pretrained(model_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

# The cartridge is injected as a KV-cache prefix (past_key_values) during generation.
out = model.generate(**tok("Question about the corpus:", return_tensors="pt"), max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

If you need to create and initialize a cartridge before training, see the initialization options below.

Initialization options

The paper discusses a few practical initialization strategies (a short sketch follows the list):

  • Random KV (default): create a CartridgeConfig and start training. This initializes the KV prefix randomly.
  • KV from the first tokens of a prompt/corpus: use initialize_kv_prefix_from_text(model, tokenizer, text=...). This runs a prefill on text and copies the resulting KV cache for the first num_virtual_tokens into the adapter.
  • KV from an existing cache: use initialize_kv_prefix_from_past_key_values(model, past_key_values=...) if you already have a past_key_values object from a base-model prefill.
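
For example, a sketch of the prefill-based options, assuming model is a CARTRIDGE-adapted PeftModel (e.g. created via get_peft_model with a CartridgeConfig), tokenizer is the matching tokenizer, and the corpus path is a placeholder:

from peft import initialize_kv_prefix_from_text, initialize_kv_prefix_from_past_key_values

# Prefill on the corpus text and copy the KV of the first num_virtual_tokens into the adapter.
corpus_text = open("path/to/corpus.txt").read()  # placeholder path
initialize_kv_prefix_from_text(model, tokenizer, text=corpus_text)

# Alternatively, reuse an existing past_key_values object from a base-model prefill:
# initialize_kv_prefix_from_past_key_values(model, past_key_values=cached_kv)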

Training

The Cartridges paper proposes a SELF-STUDY distillation objective (a frozen base model provides teacher logits; the CARTRIDGE adapter is trained so the student matches the teacher’s next-token distribution over the target segment). PEFT keeps training logic out of the core library; see https://github.com/huggingface/peft/tree/main/examples/cartridge_self_study for a reference workflow. The example scripts use the frozen base model as the teacher and the adapted model as the student, so both share the same underlying checkpoint.
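
To make the objective concrete, here is a minimal, hypothetical sketch of a single distillation step; self_study_step, the batch and mask names, and the temperature are illustrative and not part of PEFT or the example scripts:

import torch
import torch.nn.functional as F

def self_study_step(teacher_model, student_model, teacher_batch, student_batch,
                    teacher_target_mask, student_target_mask, temperature=1.0):
    # Teacher: frozen base model, no gradients.
    with torch.no_grad():
        teacher_logits = teacher_model(**teacher_batch).logits
    # Student: the CARTRIDGE-adapted model (only the KV prefix receives gradients).
    student_logits = student_model(**student_batch).logits

    # Both boolean masks must select the same target segment (same number of positions).
    t = F.softmax(teacher_logits[teacher_target_mask] / temperature, dim=-1)
    s = F.log_softmax(student_logits[student_target_mask] / temperature, dim=-1)

    # KL(teacher || student) over the next-token distribution of the target segment.
    return F.kl_div(s, t, reduction="batchmean")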

Composition

To concatenate independently trained cartridges into a single adapter, use compose_cartridge_adapters(...).

CartridgeConfig

class peft.CartridgeConfig

( task_type: Optional[Union[str, TaskType]] = None peft_type: Optional[Union[str, PeftType]] = None auto_mapping: Optional[dict] = None peft_version: Optional[str] = None base_model_name_or_path: Optional[str] = None revision: Optional[str] = None inference_mode: bool = False num_virtual_tokens: int = None token_dim: int = None num_transformer_submodules: Optional[int] = None num_attention_heads: Optional[int] = None num_layers: Optional[int] = None modules_to_save: Optional[list[str]] = None num_frozen_tokens: int = 1 )

Parameters

  • num_frozen_tokens (int, defaults to 1) — Number of prefix tokens at the start of the cartridge to keep frozen (no gradients). The Cartridges paper recommends freezing the first token as an attention sink for stability (set this to 1), as many LLMs use early tokens as attention sinks and changing them can harm training.

Configuration for CARTRIDGE, a KV-cache-parameterized prefix adapter.

This is similar to prefix-tuning in how it is served (as past_key_values), but it stores the KV cache directly as trainable parameters instead of learning it via an MLP projection.

Initialization: The Cartridges paper discusses multiple initialization options. In PEFT, initialization is a separate step from constructing the adapter config:

  • Random KV initialization (paper option 2): Create the adapter via get_peft_model(...). The CARTRIDGE prompt encoder parameters are randomly initialized by PyTorch.

  • KV derived from the first tokens of a prompt/corpus (paper option 3): Run a no-grad prefill on the base model and copy the first num_virtual_tokens cached KV tokens into the adapter. PEFT provides utilities for this (importable from peft or from peft.tuners.cartridge.utils):

    • initialize_kv_prefix_from_text(model, tokenizer, text=...)
    • initialize_kv_prefix_from_past_key_values(model, past_key_values=...)

    If you already have a flattened KV-prefix tensor, you can load it directly via the prompt encoder’s load_prompt_embeddings(...) method.

CartridgeEncoder

class peft.CartridgeEncoder

( config )

A parameterized prefix KV cache.

The parameters are stored in the same flattened layout as the PrefixEncoder output: `[num_virtual_tokens, num_layers * 2 * token_dim]`, where `token_dim` is the per-head hidden size times the number of heads (after any GQA adjustment performed by `_prepare_prompt_learning_config`).

If num_frozen_tokens > 0, the first num_frozen_tokens virtual tokens are stored as a non-trainable parameter, and the remaining tokens are trainable.
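
As an illustrative check, assuming model is a CARTRIDGE-adapted PeftModel and using PEFT's standard prompt_encoder/active_adapter attributes:

encoder = model.prompt_encoder[model.active_adapter]
trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in encoder.parameters() if not p.requires_grad)
print(f"trainable KV parameters: {trainable}, frozen KV parameters: {frozen}")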

load_prompt_embeddings

( prompt_embeddings: torch.Tensor )

Load the flattened prompt embeddings saved by PEFT (prompt_embeddings).

PEFT saves prompt-learning adapters as a single prompt_embeddings tensor. For CARTRIDGE, we split that tensor into frozen and trainable segments according to self.num_frozen_tokens.
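
A hedged usage sketch, assuming a precomputed flattened tensor in the layout described above and PEFT's standard prompt_encoder registry (the file name is a placeholder):

import torch

# Flattened KV prefix with shape [num_virtual_tokens, num_layers * 2 * token_dim].
flat_kv = torch.load("kv_prefix.pt")  # placeholder file
encoder = model.prompt_encoder[model.active_adapter]
encoder.load_prompt_embeddings(flat_kv)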
