---
tags:
- nanochat
- gpt
- language-model
- mid
license: mit
---
# nanochat model - MID stage
This is a midtraining (MID) stage checkpoint of a model trained with Andrej Karpathy's [nanochat](https://github.com/karpathy/nanochat).
## Model Details
- Model type: GPT-style transformer
- Training stage: MID
- Parameters: 560,988,160
- Architecture:
  - Layers: 20
  - Embedding dim: 1280
  - Attention heads: 10
  - KV heads (GQA): 10
  - Context length: 2048
  - Vocab size: 65536
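The parameter count above can be reproduced from these hyperparameters. A minimal sanity check, assuming a 4x MLP expansion and parameter-free norms (consistent with the untied-embeddings and no-bias choices listed below):

```python
# Sanity-check the parameter count from the listed hyperparameters.
# Assumes a 4x MLP expansion and parameter-free norms; no bias terms.
n_layer, n_embd, vocab = 20, 1280, 65536

embeddings = 2 * vocab * n_embd    # untied input + output embeddings
attention = 4 * n_embd * n_embd    # Q, K, V, O projections (KV heads == query heads here)
mlp = 2 * n_embd * (4 * n_embd)    # up- and down-projections

total = embeddings + n_layer * (attention + mlp)
print(f"{total:,}")  # 560,988,160 -- matches the count above
```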
## Training Info
- Training step: 808
- Validation BPB: 0.3951
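BPB (bits per byte) is a tokenizer-independent way to report validation loss: the mean cross-entropy in nats is converted to bits and normalized by raw UTF-8 bytes rather than by tokens. A rough sketch of the conversion (the inputs here are illustrative placeholders, not values from this run):

```python
import math

def bits_per_byte(nats_per_token: float, bytes_per_token: float) -> float:
    # Convert mean cross-entropy (nats/token) to bits, then normalize by
    # the average number of UTF-8 bytes each token covers.
    return nats_per_token / math.log(2) / bytes_per_token

# e.g. ~1.30 nats/token at ~4.75 bytes/token gives roughly 0.395 BPB
print(bits_per_byte(1.30, 4.75))
```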
## Architecture Highlights
- ✅ RoPE (Rotary Position Embeddings)
- ✅ QK Normalization
- ✅ ReLU² activation (not GELU)
- ✅ Untied embeddings
- ✅ No bias terms
- ✅ Logit softcapping
- ✅ Group Query Attention (GQA; with 10 KV heads for 10 query heads, this particular config reduces to standard multi-head attention)
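Two of the less common choices above are easy to show in isolation. A minimal sketch of the ReLU² activation and logit softcapping (illustrative, with an assumed cap value rather than the exact nanochat code):

```python
import torch

def relu2(x: torch.Tensor) -> torch.Tensor:
    # ReLU^2: square of ReLU, used in the MLP in place of GELU
    return torch.relu(x) ** 2

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # Logit softcapping: squash logits smoothly into (-cap, cap) with tanh
    # so no single logit can grow unboundedly. The cap value is an assumption.
    return cap * torch.tanh(logits / cap)
```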
## Usage
This model uses a custom architecture from nanochat. To use it:
```bash
# Clone the nanochat repo
git clone https://github.com/karpathy/nanochat.git
cd nanochat
```
```python
# Download this checkpoint, then load it with nanochat's checkpoint manager
from nanochat.checkpoint_manager import load_model
from nanochat.engine import Engine

model, tokenizer, meta = load_model("mid", "cuda", phase="eval")
engine = Engine(model, tokenizer)

# Generate text
prompt = "The capital of France is"
tokens = tokenizer.encode(prompt, prepend="<|bos|>")
completions, _ = engine.generate_batch(tokens, num_samples=1, max_tokens=50, temperature=0.7)
print(tokenizer.decode(completions[0]))
```
## Special Tokens
The model uses these special tokens for chat:
- `<|bos|>` - Beginning of sequence
- `<|user_start|>`, `<|user_end|>` - User messages
- `<|assistant_start|>`, `<|assistant_end|>` - Assistant messages
- `<|python_start|>`, `<|python_end|>` - Python tool calls
- `<|output_start|>`, `<|output_end|>` - Tool outputs
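For illustration, one chat exchange would be laid out in token space roughly as follows. This helper is a hand-rolled sketch of the format the tokens above imply, not nanochat's actual rendering code:

```python
# Hypothetical rendering of a single chat turn using the special tokens above;
# the helper and its exact layout are assumptions, not nanochat's API.
def render_chat(user_msg: str, assistant_msg: str) -> str:
    return (
        "<|bos|>"
        f"<|user_start|>{user_msg}<|user_end|>"
        f"<|assistant_start|>{assistant_msg}<|assistant_end|>"
    )

print(render_chat("What is 2 + 2?", "4"))
```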
## Citation
```bibtex
@misc{nanochat,
  author    = {Andrej Karpathy},
  title     = {nanochat: The best ChatGPT that $100 can buy},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/karpathy/nanochat}
}
```
## License
MIT License - Same as the nanochat repository.