---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B
tags:
- sft
- instruction-tuning
- llama
datasets:
- HuggingFaceH4/ultrachat_200k
model-index:
- name: llama31-8b-sft-nomask
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8K
      type: gsm8k
    metrics:
    - name: Accuracy
      type: accuracy
      value: 29.0
      verified: false
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU
      type: mmlu
    metrics:
    - name: Accuracy
      type: accuracy
      value: 58.4
      verified: false
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Simple Safety Tests
      type: simple_safety_tests
    metrics:
    - name: Safety Score
      type: accuracy
      value: 78.0
      verified: false
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AlpacaEval
      type: tatsu-lab/alpaca_eval
    metrics:
    - name: LC Win Rate
      type: win_rate
      value: 5.3
      verified: false
---

# LLaMA-3.1-8B SFT (No Prompt Masking)

LLaMA-3.1-8B fine-tuned with supervised instruction tuning (SFT) **without prompt masking**: the loss is computed on all tokens, prompt and response alike, rather than on response tokens only.
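
The difference from standard (masked) SFT lies only in how the labels are built before the cross-entropy loss. A minimal PyTorch sketch of the idea (illustrative only, not the assignment's training code; `build_labels`, the toy tensors, and the `prompt_len` handling are assumptions):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are skipped by cross_entropy

def build_labels(input_ids: torch.Tensor, prompt_len: int, mask_prompt: bool) -> torch.Tensor:
    """Labels for next-token prediction on one packed (prompt + response) example."""
    labels = input_ids.clone()
    if mask_prompt:
        labels[:prompt_len] = IGNORE_INDEX  # standard SFT: no loss on prompt tokens
    return labels  # this model: mask_prompt=False, so the loss covers every token

# Toy example: 4 prompt tokens followed by 3 response tokens.
input_ids = torch.tensor([10, 11, 12, 13, 20, 21, 22])
labels = build_labels(input_ids, prompt_len=4, mask_prompt=False)

# Shift so position t predicts token t+1, as in causal LM training.
vocab_size = 32
logits = torch.randn(len(input_ids), vocab_size)  # stand-in for model outputs
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
print(loss)
```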

## Training Details
- **Base Model**: meta-llama/Llama-3.1-8B
- **Dataset**: UltraChat-200K + SafetyLlama (~200K examples)
- **Training**: 1 epoch (6726 steps)
- **Prompt Masking**: Disabled (loss on all tokens)
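
For inference, the checkpoint should load through the standard `transformers` API. A minimal sketch; the repo id below is a placeholder for wherever this checkpoint is hosted:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<this-repo-id>"  # placeholder: the Hub id of this checkpoint
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto", device_map="auto")

prompt = "Explain prompt masking in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```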

## Evaluation Results

| Benchmark | Baseline (Llama-3.1-8B) | This Model |
|-----------|-------------------------|------------|
| GSM8K (accuracy) | 16.4% | 29.0% |
| MMLU (accuracy) | 58.1% | 58.4% |
| Simple Safety Tests (safety score) | 62.0% | 78.0% |
| AlpacaEval (LC win rate) | 1.57% | 5.3% |
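
GSM8K and MMLU scores like these can be re-run with EleutherAI's lm-evaluation-harness; a hedged sketch (the harness, task names, and placeholder repo id are assumptions, not necessarily what the assignment used):

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<this-repo-id>,dtype=auto",  # placeholder repo id
    tasks=["gsm8k", "mmlu"],
    batch_size=8,
)
print(results["results"])
```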

## Files
- `eval_baseline/`: Baseline evaluation results (pre-finetuning Llama-3.1-8B)

## Reference
Part of CS336 Assignment 5 (SFT Instruction Tuning). See [building-from-scratch/sft](https://github.com/garg-aayush/building-from-scratch/tree/main/sft) for details.