---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B
tags:
  - sft
  - instruction-tuning
  - llama
datasets:
  - HuggingFaceH4/ultrachat_200k
model-index:
  - name: llama31-8b-sft-nomask
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8K
          type: gsm8k
        metrics:
          - name: Accuracy
            type: accuracy
            value: 29.0
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU
          type: mmlu
        metrics:
          - name: Accuracy
            type: accuracy
            value: 58.4
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Simple Safety Tests
          type: simple_safety_tests
        metrics:
          - name: Safety Score
            type: accuracy
            value: 78.0
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AlpacaEval
          type: tatsu-lab/alpaca_eval
        metrics:
          - name: LC Win Rate
            type: win_rate
            value: 5.3
            verified: false
---

# LLaMA-3.1-8B SFT (No Prompt Masking)

LLaMA-3.1-8B fine-tuned with supervised instruction tuning (SFT) **without prompt masking**: the loss is computed on all tokens, prompt and response alike.

## Training Details

- **Base Model**: meta-llama/Llama-3.1-8B
- **Dataset**: UltraChat-200K + SafetyLlama (~200K examples)
- **Training**: 1 epoch (6726 steps)
- **Prompt Masking**: Disabled (loss on all tokens)

## Evaluation Results

| Benchmark  | Baseline | This Model |
|------------|----------|------------|
| GSM8K      | 16.4%    | 29.0%      |
| MMLU       | 58.1%    | 58.4%      |
| SST Safety | 62.0%    | 78.0%      |
| AlpacaEval | 1.57%    | **5.3%**   |

Baseline numbers come from the pre-finetuning Llama-3.1-8B; the AlpacaEval figure is the length-controlled (LC) win rate.

## Files

- `eval_baseline/`: Baseline evaluation results (pre-finetuning Llama-3.1-8B)

## Reference

Part of CS336 Assignment 5 (SFT Instruction Tuning). See [building-from-scratch/sft](https://github.com/garg-aayush/building-from-scratch/tree/main/sft) for details.
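
## Appendix: What "No Prompt Masking" Means

The key training choice above, disabling prompt masking, can be sketched as follows. This is a minimal illustration, not the actual training code: `build_labels`, the toy token ids, and the flag name are hypothetical, and `-100` is the conventional ignore index that PyTorch's cross-entropy loss skips for masked labels.

```python
IGNORE_INDEX = -100  # labels with this value contribute no loss in PyTorch CE

def build_labels(prompt_ids, response_ids, mask_prompt):
    """Concatenate prompt and response, and build per-token labels.

    With mask_prompt=True (common SFT default), prompt positions are set to
    IGNORE_INDEX so the loss only covers the response. With mask_prompt=False
    (this model's setting), labels equal the input ids and the loss covers
    every token.
    """
    input_ids = prompt_ids + response_ids
    if mask_prompt:
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        labels = list(input_ids)  # loss on all tokens
    return input_ids, labels

prompt = [101, 2054, 2003]   # toy token ids standing in for the prompt
response = [3437, 102]       # toy token ids standing in for the response

_, masked = build_labels(prompt, response, mask_prompt=True)
_, unmasked = build_labels(prompt, response, mask_prompt=False)

print(masked)    # [-100, -100, -100, 3437, 102]
print(unmasked)  # [101, 2054, 2003, 3437, 102]
```

In the unmasked variant the model is also trained to predict the prompt tokens themselves, which changes the effective training signal relative to response-only SFT.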