metadata
license: llama3.1
base_model: meta-llama/Llama-3.1-8B
tags:
  - sft
  - instruction-tuning
  - llama
datasets:
  - HuggingFaceH4/ultrachat_200k
model-index:
  - name: llama31-8b-sft-nomask
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8K
          type: gsm8k
        metrics:
          - name: Accuracy
            type: accuracy
            value: 29
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU
          type: mmlu
        metrics:
          - name: Accuracy
            type: accuracy
            value: 58.4
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Simple Safety Tests
          type: simple_safety_tests
        metrics:
          - name: Safety Score
            type: accuracy
            value: 78
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AlpacaEval
          type: tatsu-lab/alpaca_eval
        metrics:
          - name: LC Win Rate
            type: win_rate
            value: 5.3
            verified: false

Llama-3.1-8B SFT (No Prompt Masking)

Llama-3.1-8B fine-tuned with supervised fine-tuning (SFT) on instruction data, without prompt masking: the loss is computed on all tokens, prompt and response alike.

Training Details

  • Base Model: meta-llama/Llama-3.1-8B
  • Dataset: UltraChat-200K + SafetyLlama (~200K examples)
  • Training: 1 epoch (6726 steps)
  • Prompt Masking: Disabled (loss on all tokens)
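To make the "Prompt Masking: Disabled" setting concrete, here is a minimal sketch of how training labels might differ with and without masking. It is not the training code used for this model; the helper `build_labels` and the token ids are hypothetical. The only outside convention it relies on is that `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions labeled `-100` contribute nothing to the loss.

```python
# Illustrative sketch only: build_labels and the token ids are made up.
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, response_ids, mask_prompt):
    """Return (input_ids, labels) for one prompt/response training example."""
    input_ids = list(prompt_ids) + list(response_ids)
    if mask_prompt:
        # Prompt masking: prompt positions are excluded from the loss.
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        # No prompt masking (this model's setting): every token is a target.
        labels = list(input_ids)
    return input_ids, labels


prompt_ids = [101, 102, 103]  # hypothetical prompt token ids
response_ids = [201, 202]     # hypothetical response token ids

_, masked_labels = build_labels(prompt_ids, response_ids, mask_prompt=True)
_, unmasked_labels = build_labels(prompt_ids, response_ids, mask_prompt=False)

print(masked_labels)    # [-100, -100, -100, 201, 202]
print(unmasked_labels)  # [101, 102, 103, 201, 202]
```

With masking disabled, the model is also trained to predict the prompt tokens, which is the configuration described above.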

Evaluation Results

| Benchmark  | Baseline | This Model |
|------------|----------|------------|
| GSM8K      | 16.4%    | 29.0%      |
| MMLU       | 58.1%    | 58.4%      |
| SST Safety | 62.0%    | 78.0%      |
| AlpacaEval | 1.57%    | 5.3%       |
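As a quick reading aid, the absolute gains over the baseline can be computed directly from the numbers reported above. This is a small sketch; the `results` dictionary and the `absolute_gain` helper are illustrative names, and the values are copied from the evaluation results.

```python
# (baseline %, this model %) pairs copied from the evaluation results above.
results = {
    "GSM8K": (16.4, 29.0),
    "MMLU": (58.1, 58.4),
    "SST Safety": (62.0, 78.0),
    "AlpacaEval": (1.57, 5.3),
}


def absolute_gain(baseline, finetuned):
    """Gain in percentage points, rounded to two decimals."""
    return round(finetuned - baseline, 2)


gains = {name: absolute_gain(b, f) for name, (b, f) in results.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:+.2f} points")
```

The largest gains are on SST Safety and GSM8K, while MMLU is essentially unchanged.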

Files

  • eval_baseline/: Baseline evaluation results (pre-finetuning Llama-3.1-8B)

Reference

Part of CS336 Assignment 5 (SFT Instruction Tuning). See building-from-scratch/sft for details.