---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B
tags:
  - sft
  - instruction-tuning
  - llama
datasets:
  - HuggingFaceH4/ultrachat_200k
model-index:
  - name: llama31-8b-sft-nomask
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8K
          type: gsm8k
        metrics:
          - name: Accuracy
            type: accuracy
            value: 29.0
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU
          type: mmlu
        metrics:
          - name: Accuracy
            type: accuracy
            value: 58.4
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Simple Safety Tests
          type: simple_safety_tests
        metrics:
          - name: Safety Score
            type: accuracy
            value: 78.0
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AlpacaEval
          type: tatsu-lab/alpaca_eval
        metrics:
          - name: LC Win Rate
            type: win_rate
            value: 5.3
            verified: false
---

# LLaMA-3.1-8B SFT (No Prompt Masking)

LLaMA-3.1-8B fine-tuned with supervised instruction tuning (SFT) **without prompt masking**: the loss is computed on all tokens, prompt and response alike.

## Training Details

- **Base Model**: meta-llama/Llama-3.1-8B
- **Dataset**: UltraChat-200K + SafetyLlama (~200K examples)
- **Training**: 1 epoch (6726 steps)
- **Prompt Masking**: Disabled (loss on all tokens)

## Evaluation Results

| Benchmark  | Baseline | This Model |
|------------|----------|------------|
| GSM8K      | 16.4%    | 29.0%      |
| MMLU       | 58.1%    | 58.4%      |
| SST Safety | 62.0%    | 78.0%      |
| AlpacaEval | 1.57%    | **5.3%**   |

Baseline numbers come from the pre-finetuning Llama-3.1-8B; the AlpacaEval figure is the length-controlled (LC) win rate.

## Files

- `eval_baseline/`: Baseline evaluation results (pre-finetuning Llama-3.1-8B)

## Reference

Part of CS336 Assignment 5 (SFT Instruction Tuning). See [building-from-scratch/sft](https://github.com/garg-aayush/building-from-scratch/tree/main/sft) for details.
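
## Appendix: What "No Prompt Masking" Means

The key training choice above, disabling prompt masking, can be sketched as follows. This is a minimal illustration, not the actual training code: `build_labels`, the toy token ids, and the flag name are hypothetical, and `-100` is the conventional ignore index that PyTorch's cross-entropy loss skips for masked labels.

```python
IGNORE_INDEX = -100  # labels with this value contribute no loss in PyTorch CE

def build_labels(prompt_ids, response_ids, mask_prompt):
    """Concatenate prompt and response, and build per-token labels.

    With mask_prompt=True (common SFT default), prompt positions are set to
    IGNORE_INDEX so the loss only covers the response. With mask_prompt=False
    (this model's setting), labels equal the input ids and the loss covers
    every token.
    """
    input_ids = prompt_ids + response_ids
    if mask_prompt:
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        labels = list(input_ids)  # loss on all tokens
    return input_ids, labels

prompt = [101, 2054, 2003]   # toy token ids standing in for the prompt
response = [3437, 102]       # toy token ids standing in for the response

_, masked = build_labels(prompt, response, mask_prompt=True)
_, unmasked = build_labels(prompt, response, mask_prompt=False)

print(masked)    # [-100, -100, -100, 3437, 102]
print(unmasked)  # [101, 2054, 2003, 3437, 102]
```

In the unmasked variant the model is also trained to predict the prompt tokens themselves, which changes the effective training signal relative to response-only SFT.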