State-of-the-art Large Reasoning Model Built Using Only Synthetic Data on AMD GPUs


Model Summary

We introduce SAND-Math-Qwen2.5-32B and SAND-MathScience-DeepSeek-Qwen32B, state-of-the-art reasoning models in the 32B parameter range, built entirely using a synthetic data pipeline running on the AMD ROCm™ stack and AMD Instinct™ MI325 GPUs.

By prioritizing data difficulty alongside quantity, we demonstrate that high-difficulty synthetic data can elevate prior-generation models to match or exceed modern proprietary models. SAND-Math-Qwen2.5-32B is fine-tuned from Qwen2.5-32B-Instruct on just 14k synthetic math samples and achieves strong reasoning capabilities with minimal data, outperforming other data distillation and post-training approaches. SAND-MathScience-DeepSeek-Qwen32B is fine-tuned from DeepSeek-R1-Distill-Qwen-32B on a compact dataset of 27k samples (15k Math + 12k Science), achieving a generational leap in performance that rivals Qwen3-32B.

We are releasing the models, datasets, and code to empower the community to build their own state-of-the-art reasoning models using AMD hardware.

📊 Benchmark Results

We conducted extensive experiments to validate that our pipeline yields superior results compared to models trained on significantly larger datasets.

1. Bridging the Generational Gap

Fine-tuning the Qwen2.5-based DeepSeek-R1-Distill-Qwen-32B on our mixed Math/Science dataset allows it to rival and even surpass the next-generation Qwen3-32B on key benchmarks.

| Model | AIME24 | AIME25 | MATH500 | GPQA |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B (Base) | 72.6 | 54.9 | 94.3 | 62.1 |
| EXAONE Deep 32B | 72.1 | 65.8 | 95.8 | 66.1 |
| Qwen3-32B (Thinking mode) | 81.4 | 72.9 | 97.0 | 68.4 |
| SAND-MathScience-DeepSeek-Qwen32B (Ours) | 83.85 | 78.33 | 93.85 | 68.72 |

2. Efficiency: Unlocking Reasoning with Less Data

Using only 14k synthetic math samples and standard SFT (no RL), our approach outperforms models trained on datasets 5x to 50x larger.

| Model | Data Size | AIME24 | AIME25 | MATH500 | GPQA |
|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct (Base) | - | 16.7 | 13.3 | 83.4 | 53.5 |
| DeepSeek-R1-Distill-Qwen-32B | 800k | 72.6 | 54.9 | 94.3 | 62.1 |
| Light-R1-32B | 79k | 73.0 | 64.3 | 93.3 | 60.6 |
| OpenThinker-32B | 114k | 66.0 | 53.3 | 89.4 | 57.6 |
| SAND-Math-Qwen2.5-32B (Ours) | 14k | 74.01 | 68.18 | 92.05 | 60.8 |

⚙️ The Synthetic Data Pipeline

Our results are powered by a 4-stage automated pipeline running on AMD hardware that prioritizes difficulty and novelty over volume. Unlike datasets that recycle easy problems, our pipeline leverages a Teacher Model (GPT-OSS-120B) to generate, validate, and systematically "hike" the difficulty of reasoning problems.

Pipeline Stages

  1. Stage 1: QA Generation & Consistency 🛠️

    • Generates novel problems from scratch
    • Enforces correctness by requiring the teacher to generate multiple independent solution paths
    • Only questions where all answers align are kept
  2. Stage 2: De-duplication & Decontamination 🧹

    • Removes internal duplicates via embedding similarity
    • Crucial Step: Scans against known test sets (AIME, MATH, GPQA) to ensure zero contamination
  3. Stage 3: Difficulty Hiking 🏔️

    • Moderately challenging questions are rewritten by the teacher model
    • Introduces deeper reasoning chains, added constraints, or cross-domain logic
    • Systematically elevates complexity
    • A configurable step, used primarily when initial generation yields too few high-difficulty samples
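
The filtering logic of Stages 1 and 2 can be illustrated with a minimal sketch. This is not the released pipeline code: the function names are hypothetical, answers are compared by simple string normalization, and the decontamination check uses word-level n-gram overlap as a stand-in, since the exact matching method used against the test sets is not described here (internal de-duplication, which uses embedding similarity, is omitted).

```python
def normalize_answer(answer: str) -> str:
    """Canonicalize a final answer so trivially different strings compare equal."""
    return answer.strip().lower().replace(" ", "")


def is_consistent(answers: list[str]) -> bool:
    """Stage 1 rule: keep a question only if all independent solutions agree."""
    normalized = {normalize_answer(a) for a in answers}
    return len(answers) > 1 and len(normalized) == 1


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(question: str, test_questions: list[str], n: int = 8) -> bool:
    """Stage 2 stand-in (assumption): flag any n-gram shared with a test item."""
    grams = ngrams(question, n)
    return any(grams & ngrams(t, n) for t in test_questions)


def filter_candidates(
    candidates: list[tuple[str, list[str]]],
    test_questions: list[str],
    n: int = 8,
) -> list[str]:
    """Keep questions that pass both the consistency and contamination checks."""
    kept = []
    for question, answers in candidates:
        if is_consistent(answers) and not is_contaminated(question, test_questions, n):
            kept.append(question)
    return kept


if __name__ == "__main__":
    candidates = [
        ("Find the sum of the first ten positive odd integers.", ["100", " 100 "]),
        ("How many primes are below twenty?", ["8", "9"]),  # solutions disagree
    ]
    print(filter_candidates(candidates, test_questions=[]))
```

In practice, answer equivalence for math problems usually needs symbolic comparison rather than string matching, so `normalize_answer` should be read as a placeholder.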

🚀 Quick Start

Python Inference (Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "amd/SAND-MathScience-DeepSeek-Qwen32B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example prompt
prompt = "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096,
    temperature=0.7, # Recommended temperature
    do_sample=True
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Response:", response)

Serving (vLLM & SGLang)

You can easily serve this model as an OpenAI-compatible API endpoint.

Using SGLang:

python -m sglang.launch_server --model-path amd/SAND-MathScience-DeepSeek-Qwen32B --context-length 32768

Using vLLM:

vllm serve amd/SAND-MathScience-DeepSeek-Qwen32B --max-model-len 32768
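
Once a server is running, it can be queried like any OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library; the base URL is an assumption (vLLM listens on port 8000 by default, SGLang on 30000), and the helper names are hypothetical.

```python
import json
import urllib.request

MODEL = "amd/SAND-MathScience-DeepSeek-Qwen32B"
# Assumption: vLLM's default port. For a default SGLang launch, use port 30000.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(question: str, max_tokens: int = 4096) -> dict:
    """Build a chat-completions payload using the recommended temperature."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.7,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    payload = build_chat_request("How many positive divisors does 360 have?")
    print(send_chat_request(payload))
```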

💡 Usage Recommendations

To replicate our performance benchmarks and achieve the best reasoning results, we strongly recommend the following configurations:

  • Temperature: Set temperature=0.7. DO NOT use greedy decoding, as it can lead to performance degradation and repetitive loops.
  • Prompting: For mathematical problems, include a directive to enforce structure:

    "Please reason step by step, and put your final answer within \boxed{}."

  • Context Length: We recommend allowing an output length of 32,768 tokens. This ensures the model has sufficient space for long Chain-of-Thought (CoT) generation.
  • Thinking Token: We recommend forcing the model to begin its response with the <think>\n token so that it reliably enters reasoning mode.
  • Evaluation: When benchmarking, conduct multiple passes (Pass@K) and average the results for stability.
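
The recommendations above can be combined into a few small helpers. This is a sketch under stated assumptions, not our evaluation harness: the helper names are hypothetical, answers are scored by exact string match on the last \boxed{...} span, and `with_think_prefix` assumes the chat template has not already appended the thinking token.

```python
from typing import Optional

MATH_DIRECTIVE = "Please reason step by step, and put your final answer within \\boxed{}."
THINK_PREFIX = "<think>\n"


def build_math_prompt(question: str) -> str:
    """Append the recommended directive to a math question."""
    return f"{question.strip()}\n{MATH_DIRECTIVE}"


def with_think_prefix(templated_prompt: str) -> str:
    """Add the thinking token to a chat-templated prompt unless already present."""
    if templated_prompt.endswith(THINK_PREFIX):
        return templated_prompt
    return templated_prompt + THINK_PREFIX


def extract_boxed(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in response[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces: no complete boxed answer


def mean_accuracy(responses: list[str], reference: str) -> float:
    """Average exact-match correctness over several sampled responses."""
    if not responses:
        return 0.0
    correct = sum(extract_boxed(r) == reference for r in responses)
    return correct / len(responses)


if __name__ == "__main__":
    print(build_math_prompt("What is 12 * 12?"))
    samples = ["... so \\boxed{144}", "... hence \\boxed{124}"]
    print(mean_accuracy(samples, "144"))
```

In the Quick Start script, `build_math_prompt` would wrap `prompt` before it goes into `messages`, and `with_think_prefix` would be applied to the `text` returned by `apply_chat_template`.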

📜 License

This project is licensed under the Open RAIL-MSD license. This is an open, royalty-free license that permits commercial use, modification, and distribution of the dataset, models, and source code.

The license includes standard use-based restrictions to prevent harmful applications (e.g., illegal activities, generating harmful content, high-risk applications). These restrictions are designed to promote responsible AI development while keeping the license permissive for legitimate use cases.

For full license terms and conditions, please see the LICENSE file.


Citation

If you use this model, dataset, or pipeline in your research, please cite our work:

@misc{manem2025sandmathusingllmsgenerate,
      title={SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers},
      author={Chaitanya Manem and Pratik Prabhanjan Brahma and Prakamya Mishra and Zicheng Liu and Emad Barsoum},
      year={2025},
      eprint={2507.20527},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.20527},
}