Argunauts Update: Learning Formal Argument Analysis with RLVF and HIRPO

Community Article Published December 2, 2025

🤗 DebateLabKIT/Phi-4-Argunaut-1-HIRPO
🛢️ DebateLabKIT/argunauts-hirpo-preferences
👩‍💻 debatelab/argdown-feedback

Introduction

Based on the HIRPO recipe, I've successfully trained Phi-4-Argunaut-1-HIRPO, an open large language model (LLM) that technically masters formal argument analysis. However, having lost general conversational abilities (likely due to one-sided training), Phi-4-Argunaut-1-HIRPO is not a production-ready chat model.

I'll sketch the training methods and results below. Please refer to the previous articles for background information.

This brief post continues a series of posts about the Argunauts project:

☑️ Argunauts: Motivation and Goals ↪
☑️ Phase I: SFT Training ↪
☑️ Phase II: Selfplay Finetuning Line-By-Line
☑️ Phase III: RLVF with Hindsight Instruction Relabeling, Self-Correction and Dynamic Curriculum
- ✅ Update: ✨Learning Formal Argument Analysis with RLVF and HIRPO✨

Training

We use DebateLabKIT/Phi-4-Argunaut-1-SPIN-dev1 as the base model and train Phi-4-Argunaut-1-HIRPO for 100 HIRPO epochs, each of which consists of a data generation part and a training part.
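
The two-part structure of a HIRPO epoch can be pictured as a simple loop. The following is only a minimal sketch: `run_hirpo`, `generate_pairs` and `train_on_pairs` are hypothetical names introduced for illustration and are not part of the actual training code, and the stub functions merely stand in for model inference and preference training.

```python
from typing import Callable, List, Tuple

# Hypothetical representation of one training item: (prompt, chosen, rejected).
PreferencePair = Tuple[str, str, str]


def run_hirpo(
    model: str,
    n_epochs: int,
    generate_pairs: Callable[[str], List[PreferencePair]],
    train_on_pairs: Callable[[str, List[PreferencePair]], str],
) -> Tuple[str, List[int]]:
    """Alternate data generation and training for n_epochs rounds."""
    pairs_per_epoch = []
    for _ in range(n_epochs):
        # Part 1: the current model generates the preference data.
        pairs = generate_pairs(model)
        # Part 2: the model is trained on the freshly generated data.
        model = train_on_pairs(model, pairs)
        pairs_per_epoch.append(len(pairs))
    return model, pairs_per_epoch


# Stubs (assumptions) that stand in for real generation and training.
def fake_generate(model: str) -> List[PreferencePair]:
    return [("prompt", "good answer", "bad answer")] * 3


def fake_train(model: str, pairs: List[PreferencePair]) -> str:
    return model + "+"


final_model, counts = run_hirpo("base", 3, fake_generate, fake_train)
```

The point of the sketch is that each epoch trains on data produced by the model from the previous epoch, so the training data changes as the model changes.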

HIRPO Data Generation Configs

defaults:
  eval_gen_kwargs:
    temperature: 0.6
    max_tokens: 4096
    top_p: 0.95
    extra_body:
      min_p: 0.05
  gen_kwargs:
    temperature: 0.8
    max_tokens: 8192
    extra_body:
      min_p: 0.05
  feedback_gen_kwargs:
    temperature: 0.8
    max_tokens: 1024
    extra_body:
      min_p: 0.05
exercise_profiles:
  - name: arguments
    description: "argumentative texts with one main argument / line of argumentation"
    training:
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-pros-and-cons-1950"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-pros-and-cons-2010"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-procon_org"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-room-for-debate"
        split: "train"
    evaluation:
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-pros-and-cons-1950"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-pros-and-cons-2010"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-procon_org"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-room-for-debate"
        split: "test"
  - name: debates-handful
    description: "debates with 3-5 arguments"
    training:
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-procon_org-handful"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-1950-handful"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-2010-handful"
        split: "train"
    evaluation:
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-procon_org-handful"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-1950-handful"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-2010-handful"
        split: "test"
  - name: debates
    description: "debates with 6 or many more arguments"
    training:
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-debatabase-full"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-procon_org-full"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-1950-full"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-2010-full"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-room-for-debate-full"
        split: "train"
    evaluation:
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-debatabase-full"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-procon_org-full"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-1950-full"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-2010-full"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-room-for-debate-full"
        split: "test"
tasks:
  - name: arganno
    dependencies: null
    floor_weight: 0.025
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "arganno.Annotation"
    solution_generator_kwargs:
      n_solutions: 10
      n_revisions: 5
      gen_kwargs:
        temperature: 0.6
        max_tokens: 2048
        extra_body:
          min_p: 0.05
    problem: "arganno.AnnotationProblem"
    problem_generator: "arganno.AnnotationProblemGenerator"
    judge: "arganno.AnnotationJudge"
    max_workers: 8
    feedback_generator: "arganno.AnnotationFeedbackGenerator"
    feedback_generator_kwargs:
      n_feedbacks: 5
      gen_kwargs:
        temperature: 0.8
        max_tokens: 2048
        extra_body:
          min_p: 0.05
    virtue_preference_pair_generator: 
      - "arganno.AnnotationAttacksPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationNoAttacksPreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "arganno.AnnotationSupportsPreferencePairGenerator"
    ask_for_invalid_probs:
      HasAnnotationsHandler: .5
  - name: arganno-l
    dependencies:
      - "arganno"
      - "argmap"
    exercises:
      - "debates"
    solution: "arganno.Annotation"
    solution_generator_kwargs:
      n_solutions: 10
      n_revisions: 5
      gen_kwargs:
        temperature: 0.6
        max_tokens: 4096
        extra_body:
          min_p: 0.05
    problem: "arganno.AnnotationProblem"
    problem_generator: "arganno.AnnotationProblemGenerator"
    judge: "arganno.AnnotationJudge"
    max_workers: 8
    feedback_generator: "arganno.AnnotationFeedbackGenerator"
    feedback_generator_kwargs:
      n_feedbacks: 5
      gen_kwargs:
        temperature: 0.8
        max_tokens: 2048
        extra_body:
          min_p: 0.05
    virtue_preference_pair_generator: 
      - "arganno.AnnotationAttacksPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationNoAttacksPreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "arganno.AnnotationSupportsPreferencePairGenerator"
  - name: argmap
    dependencies: null
    floor_weight: 0.025
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "argmap.ArgumentMap"
    problem: "argmap.ArgMapProblem"
    problem_generator: "argmap.ArgMapProblemGenerator"
    judge: "argmap.ArgMapJudge"
    max_workers: 8
    feedback_generator: "argmap.ArgMapFeedbackGenerator"
    virtue_preference_pair_generator:
      - "argmap.BalancePreferencePairGenerator"
      - "argmap.DensityPreferencePairGenerator"
      - "argmap.ConnectednessPreferencePairGenerator"
      - "argmap.MaxDiameterPreferencePairGenerator"
      - "argmap.MinDiameterPreferencePairGenerator"
      - "argmap.MaxInDegreePreferencePairGenerator"
      - "argmap.MaxOutDegreePreferencePairGenerator"
      - "argmap.MaxArgsPreferencePairGenerator"
      - "argmap.MinLeafsPreferencePairGenerator"
      - "argmap.MaxAttacksPreferencePairGenerator"
      - "argmap.MaxSupportsPreferencePairGenerator"
      - "argmap.ShortClaimsPreferencePairGenerator"
      - "argmap.LongClaimsPreferencePairGenerator"
      - "argmap.ShortLabelsPreferencePairGenerator"
      - "argmap.DiverseLabelsPreferencePairGenerator"
      - "argmap.ArgumentClaimSizePreferencePairGenerator"
      - "argmap.IndependentWordingPreferencePairGenerator"
      - "argmap.SourceTextProximityPreferencePairGenerator"
    ask_for_invalid_probs:
      HasArgdownHandler: .5
  - name: argmap-l
    dependencies:
      - "argmap"
      - "arganno"
    exercises:
      - "debates"
    solution: "argmap.ArgumentMap"
    problem: "argmap.ArgMapProblem"
    problem_generator: "argmap.ArgMapProblemGenerator"
    judge: "argmap.ArgMapJudge"
    max_workers: 8
    feedback_generator: "argmap.ArgMapFeedbackGenerator"
    virtue_preference_pair_generator:
      - "argmap.BalancePreferencePairGenerator"
      - "argmap.DensityPreferencePairGenerator"
      - "argmap.ConnectednessPreferencePairGenerator"
      - "argmap.MaxDiameterPreferencePairGenerator"
      - "argmap.MinDiameterPreferencePairGenerator"
      - "argmap.MaxInDegreePreferencePairGenerator"
      - "argmap.MaxOutDegreePreferencePairGenerator"
      - "argmap.MaxArgsPreferencePairGenerator"
      - "argmap.MinLeafsPreferencePairGenerator"
      - "argmap.MaxAttacksPreferencePairGenerator"
      - "argmap.MaxSupportsPreferencePairGenerator"
      - "argmap.ShortClaimsPreferencePairGenerator"
      - "argmap.LongClaimsPreferencePairGenerator"
      - "argmap.ShortLabelsPreferencePairGenerator"
      - "argmap.DiverseLabelsPreferencePairGenerator"
      - "argmap.ArgumentClaimSizePreferencePairGenerator"
      - "argmap.IndependentWordingPreferencePairGenerator"
      - "argmap.SourceTextProximityPreferencePairGenerator"
  - name: infreco
    dependencies:
      - "infreco_from_arganno"
    floor_weight: 0.025
    exercises:
      - "arguments"
    solution: "infreco.InformalReco"
    problem: "infreco.InfRecoProblem"
    problem_generator: "infreco.InfRecoProblemGenerator"
    judge: "infreco.InfRecoJudge"
    max_workers: 8
    feedback_generator: "infreco.InfRecoFeedbackGenerator"
    virtue_preference_pair_generator:
      - "infreco.NoUnusedPropsPreferencePairGenerator"
      - "infreco.FewIntermediateConclusionsPreferencePairGenerator"
      - "infreco.ManyIntermediateConclusionsPreferencePairGenerator"
      - "infreco.SimplicityPreferencePairGenerator"
      - "infreco.VerbosityPreferencePairGenerator"
      - "infreco.IndependentWordingPreferencePairGenerator"
      - "infreco.SourceTextProximityPreferencePairGenerator"
    ask_for_invalid_probs:
      HasArgdownHandler: .5
  - name: logreco
    dependencies:
      - "logreco_from_infreco"
    floor_weight: 0.025
    exercises:
      - "arguments"
    solution: "logreco.LogicalReco"
    problem: "logreco.LogRecoProblem"
    problem_generator: "logreco.LogRecoProblemGenerator"
    judge: "logreco.LogRecoJudge"
    max_workers: 1
    feedback_generator: "logreco.LogRecoFeedbackGenerator"
    virtue_preference_pair_generator:
      - "logreco.PredicateLogicPreferencePairGenerator"
      - "logreco.FormalizationsFaithfulnessPreferencePairGenerator"
      - "logreco.FewIntermediateConclusionsPreferencePairGenerator"
      - "logreco.ManyIntermediateConclusionsPreferencePairGenerator"
      - "logreco.SourceTextProximityPreferencePairGenerator"
      - "logreco.IndependentWordingPreferencePairGenerator"
      - "logreco.SimplicityPreferencePairGenerator"
      - "logreco.VerbosityPreferencePairGenerator"
    ask_for_invalid_probs:
      HasArgdownHandler: .5
  - name: arganno_from_argmap
    dependencies:
      - "argmap"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "arganno.Annotation"
    problem: "arganno_from_argmap.ArgannoFromArgmapProblem"
    problem_generator: "arganno_from_argmap.ArgannoFromArgmapProblemGenerator"
    judge: "arganno.AnnotationJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "arganno_from_argmap.ArgmapTextProximityPreferencePairGenerator"
      - "arganno_from_argmap.ArgmapGraphProximityPreferencePairGenerator"
  - name: argmap_from_arganno
    dependencies:
      - "arganno"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "argmap.ArgumentMap"
    problem: "argmap_from_arganno.ArgmapFromArgannoProblem"
    problem_generator: "argmap_from_arganno.ArgmapFromArgannoProblemGenerator"
    judge: "argmap.ArgMapJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "argmap_from_arganno.AnnotationTextProximityPreferencePairGenerator"
      - "argmap_from_arganno.AnnotationGraphProximityPreferencePairGenerator"
  - name: infreco_from_arganno
    dependencies:
      - "arganno"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "infreco.InformalReco"
    problem: "infreco_from_arganno.InfRecoFromArgAnnoProblem"
    problem_generator: "infreco_from_arganno.InfRecoFromArgAnnoProblemGenerator"
    judge: "infreco.InfRecoJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "infreco_from_arganno.AnnotationProximityPreferencePairGenerator"
      - "infreco.NoUnusedPropsPreferencePairGenerator"
  - name: logreco_from_infreco
    dependencies:
      - "infreco"
    exercises:
      - "arguments"
    solution: "logreco.LogicalReco"
    problem: "logreco_from_infreco.LogrecoFromInfrecoProblem"
    problem_generator: "logreco_from_infreco.LogrecoFromInfrecoProblemGenerator"
    judge: "logreco.LogRecoJudge"
    max_workers: 1
    virtue_preference_pair_generator:
      - "logreco_from_infreco.InfrecoProximityPreferencePairGenerator"
      - "logreco.PredicateLogicPreferencePairGenerator"
      - "logreco.FormalizationsFaithfulnessPreferencePairGenerator"
      - "logreco.SimplicityPreferencePairGenerator"
  - name: arganno_plus_infreco
    dependencies:
      - "arganno"
      - "infreco"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "arganno_plus_infreco.ArgannoPlusInfreco"
    problem: "arganno_plus_infreco.ArgannoPlusInfrecoProblem"
    problem_generator: "arganno_plus_infreco.ArgannoPlusInfrecoProblemGenerator"
    judge: "arganno_plus_infreco.ArgannoPlusInfrecoJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "arganno_plus_infreco.AnnotationProximityPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "arganno.AnnotationAttacksPreferencePairGenerator"
      - "arganno.AnnotationSupportsPreferencePairGenerator"
      - "infreco.SimplicityPreferencePairGenerator"
  - name: arganno_plus_logreco
    dependencies:
      - "arganno"
      - "logreco"
      - "arganno_plus_infreco"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "arganno_plus_logreco.ArgannoPlusLogReco"
    problem: "arganno_plus_logreco.ArgannoPlusLogRecoProblem"
    problem_generator: "arganno_plus_logreco.ArgannoPlusLogRecoProblemGenerator"
    judge: "arganno_plus_logreco.ArgannoPlusLogRecoJudge"
    max_workers: 1
    virtue_preference_pair_generator:
      - "arganno_plus_infreco.AnnotationProximityPreferencePairGenerator"
      - "logreco.FormalizationsFaithfulnessPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "arganno.AnnotationAttacksPreferencePairGenerator"
      - "arganno.AnnotationSupportsPreferencePairGenerator"
      - "infreco.SimplicityPreferencePairGenerator"
  - name: argmap_plus_arganno
    dependencies:
      - "argmap"
      - "arganno"
      - "argmap_from_arganno"
      - "arganno_from_argmap"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "argmap_plus_arganno.ArgmapPlusArganno"
    problem: "argmap_plus_arganno.ArgmapPlusArgannoProblem"
    problem_generator: "argmap_plus_arganno.ArgmapPlusArgannoProblemGenerator"
    judge: "argmap_plus_arganno.ArgmapPlusArgannoJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "argmap_plus_arganno.AnnotationProximityPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "argmap.DensityPreferencePairGenerator"
      - "argmap.ConnectednessPreferencePairGenerator"
      - "argmap.MaxAttacksPreferencePairGenerator"
      - "argmap.MaxSupportsPreferencePairGenerator"
  - name: argmap_plus_arganno-l
    dependencies:
      - "argmap-l"
      - "arganno-l"
      - "argmap_from_arganno"
      - "arganno_from_argmap"
      - "logreco"
    exercises:
      - "debates"
    solution: "argmap_plus_arganno.ArgmapPlusArganno"
    problem: "argmap_plus_arganno.ArgmapPlusArgannoProblem"
    problem_generator: "argmap_plus_arganno.ArgmapPlusArgannoProblemGenerator"
    judge: "argmap_plus_arganno.ArgmapPlusArgannoJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "argmap_plus_arganno.AnnotationProximityPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "argmap.DensityPreferencePairGenerator"
      - "argmap.ConnectednessPreferencePairGenerator"
      - "argmap.MaxAttacksPreferencePairGenerator"
      - "argmap.MaxSupportsPreferencePairGenerator"
  - name: argmap_plus_infreco
    dependencies:
      - "argmap"
      - "infreco"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "argmap_plus_infreco.ArgmapPlusInfreco"
    problem: "argmap_plus_infreco.ArgmapPlusInfrecoProblem"
    problem_generator: "argmap_plus_infreco.ArgmapPlusInfrecoProblemGenerator"
    judge: "argmap_plus_infreco.ArgmapPlusInfrecoJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "argmap_plus_infreco.SimplicityPreferencePairGenerator"
      - "argmap_plus_infreco.ConnectednessPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxArgsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxSupportsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxAttacksPreferencePairGeneratorCT"
      - "argmap_plus_infreco.SourceTextProximityPreferencePairGeneratorCT"
  - name: argmap_plus_infreco-l
    dependencies:
      - "argmap_plus_infreco"
    exercises:
      - "debates"
    solution: "argmap_plus_infreco.ArgmapPlusInfreco"
    problem: "argmap_plus_infreco.ArgmapPlusInfrecoProblem"
    problem_generator: "argmap_plus_infreco.ArgmapPlusInfrecoProblemGenerator"
    judge: "argmap_plus_infreco.ArgmapPlusInfrecoJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "argmap_plus_infreco.SimplicityPreferencePairGenerator"
      - "argmap_plus_infreco.ConnectednessPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxArgsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxSupportsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxAttacksPreferencePairGeneratorCT"
      - "argmap_plus_infreco.SourceTextProximityPreferencePairGeneratorCT"
  - name: argmap_plus_logreco
    dependencies:
      - "argmap"
      - "logreco"
      - "argmap_plus_infreco"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "argmap_plus_logreco.ArgmapPlusLogreco"
    problem: "argmap_plus_logreco.ArgmapPlusLogrecoProblem"
    problem_generator: "argmap_plus_logreco.ArgmapPlusLogrecoProblemGenerator"
    judge: "argmap_plus_logreco.ArgmapPlusLogrecoJudge"
    max_workers: 1
    virtue_preference_pair_generator:
      - "argmap_plus_infreco.ConnectednessPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxArgsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.SourceTextProximityPreferencePairGeneratorCT"
      - "argmap_plus_logreco.GlobalFormalizationsFaithfulnessPreferencePairGenerator"
  - name: argmap_plus_arganno_plus_logreco
    dependencies:
      - "argmap_plus_logreco"
      - "argmap_plus_arganno"
      - "arganno_plus_logreco"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogreco"
    problem: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogrecoProblem"
    problem_generator: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogrecoProblemGenerator"
    judge: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogrecoJudge"
    max_workers: 1
    virtue_preference_pair_generator:
      - "argmap_plus_infreco.ConnectednessPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxArgsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.SourceTextProximityPreferencePairGeneratorCT"
      - "argmap_plus_logreco.GlobalFormalizationsFaithfulnessPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "argmap_plus_arganno.AnnotationProximityPreferencePairGenerator"
      - "arganno_plus_infreco.AnnotationProximityPreferencePairGenerator"
  - name: argmap_plus_arganno_plus_logreco-l
    dependencies:
      - "argmap_plus_arganno_plus_logreco"
      - "argmap_plus_logreco"
      - "arganno_plus_logreco"
      - "argmap_plus_arganno-l"
      - "argmap_plus_infreco-l"
    exercises:
      - "debates"
    solution: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogreco"
    problem: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogrecoProblem"
    problem_generator: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogrecoProblemGenerator"
    judge: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogrecoJudge"
    max_workers: 1
    virtue_preference_pair_generator:
      - "argmap_plus_infreco.ConnectednessPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxArgsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.SourceTextProximityPreferencePairGeneratorCT"
      - "argmap_plus_logreco.GlobalFormalizationsFaithfulnessPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "argmap_plus_arganno.AnnotationProximityPreferencePairGenerator"
      - "arganno_plus_infreco.AnnotationProximityPreferencePairGenerator"
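
The `dependencies` fields above define a directed graph over tasks. As a rough illustration, and assuming that a task's dependencies are prerequisites to be learned before it (the config itself does not spell out how they are used), the following sketch computes a valid learning order for a subset of the tasks with Python's standard `graphlib`. The dictionary is copied by hand from the config, not loaded from it.

```python
from graphlib import TopologicalSorter

# Task -> prerequisite tasks, copied from a subset of the `tasks` entries above.
dependencies = {
    "arganno": [],
    "argmap": [],
    "infreco_from_arganno": ["arganno"],
    "infreco": ["infreco_from_arganno"],
    "logreco_from_infreco": ["infreco"],
    "logreco": ["logreco_from_infreco"],
}

# Tasks without prerequisites can be practiced from the very first epoch.
roots = sorted(task for task, deps in dependencies.items() if not deps)

# static_order() yields every task after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
```

In this subset, `arganno` and `argmap` are the entry points, and `logreco` sits at the end of the chain that runs through `infreco_from_arganno`, `infreco` and `logreco_from_infreco`.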

HIRPO Training Configs


# This is an example configuration file of TRL CLI
# The YAML file supports environment variables by adding an `env` field
# as below

# env:
#   CUDA_VISIBLE_DEVICES: 0


model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
use_liger: true  # custom script argument
use_liger_kernel: true  # transformers.TrainingArguments argument
# use_liger_loss: true  # trl.DPOConfig argument
bf16: true
tf32: true

# Dataset arguments
max_length: 8192
max_prompt_length: 4096

# Training arguments
num_train_epochs: 2
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 5.0e-7
lr_scheduler_type: linear
loss_type: sigmoid
warmup_ratio: 0.5

# Logging arguments
logging_strategy: steps
logging_steps: 5
report_to:
- wandb

# Evaluation arguments
eval_strategy: "no"

# Checkpointing arguments
save_strategy: "no"
save_steps: 50
save_total_limit: 2
seed: 42

# Hugging Face Hub 
push_to_hub: true
hub_strategy: every_save
hub_private_repo: true
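
For readers unfamiliar with `loss_type: sigmoid`: assuming this config is consumed by TRL's DPO trainer (as the `trl.DPOConfig` comment suggests), it selects the standard DPO objective. The sketch below computes that loss for a single preference pair in plain Python. The value `beta=0.1` is an assumption, since `beta` is not set in the config above, and the log-probabilities in the example are made up.

```python
import math


def dpo_sigmoid_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumption: not specified in the config above
) -> float:
    """Standard DPO loss, -log(sigmoid(beta * margin)), for one pair."""
    # How much more the policy favors the chosen over the rejected answer,
    # relative to the frozen reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# With identical policy and reference, the loss is log(2), about 0.693.
loss_at_start = dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0)
# Once the policy favors the chosen answer more than the reference does,
# the loss drops below log(2).
loss_improved = dpo_sigmoid_loss(-8.0, -14.0, -10.0, -12.0)
```

A training run therefore starts near a loss of log(2) and decreases as the policy separates chosen from rejected solutions more strongly than the reference model does.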

Training Metrics

Training loss decreases over the 100 training epochs (~4.5k steps):

Performance on the various argument analysis tasks, assessed against exercises from the eval split, improves significantly during HIRPO training:

Noteworthy:

DebateLabKIT/Phi-4-Argunaut-1-HIRPO acquires complex skills involving logical argument reconstruction and argument mapping or annotation.

The training curriculum is dynamically adjusted as the model acquires more and more skills, as shown by the number of training items generated in each epoch:
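
To make the idea of a dynamically adjusted curriculum more concrete, here is a toy sketch of how task weights could be derived from current eval accuracy. This is not the actual HIRPO weighting scheme: the function `task_weights`, the `mastery_threshold` parameter and the "room for improvement" rule are all assumptions made for illustration. Only the task names, the dependency structure and the `floor_weight` value of 0.025 are taken from the config above.

```python
from typing import Dict, List


def task_weights(
    accuracy: Dict[str, float],
    dependencies: Dict[str, List[str]],
    floor_weight: Dict[str, float],
    mastery_threshold: float = 0.5,  # hypothetical parameter
) -> Dict[str, float]:
    """Toy curriculum: unlocked tasks share the budget by room for improvement."""
    # A task is unlocked once all of its prerequisites are sufficiently mastered.
    unlocked = [
        task
        for task, deps in dependencies.items()
        if all(accuracy.get(d, 0.0) >= mastery_threshold for d in deps)
    ]
    # Weak tasks get more weight; a floor keeps mastered tasks in the mix.
    raw = {
        task: max(1.0 - accuracy.get(task, 0.0), floor_weight.get(task, 0.0))
        for task in unlocked
    }
    total = sum(raw.values())
    if total == 0.0:
        # Everything is mastered and no floor is set: fall back to uniform.
        return {task: 1.0 / len(raw) for task in raw}
    return {task: w / total for task, w in raw.items()}


deps = {"arganno": [], "argmap": [], "arganno-l": ["arganno", "argmap"]}
floors = {"arganno": 0.025, "argmap": 0.025}

# Early on, argmap is not yet mastered, so arganno-l stays locked.
early = task_weights({"arganno": 0.9, "argmap": 0.2}, deps, floors)
# Later, both prerequisites are mastered and arganno-l enters the curriculum.
later = task_weights({"arganno": 0.9, "argmap": 0.8}, deps, floors)
```

Under these assumptions, the set of tasks that contribute training items grows over time, which would produce the kind of shifting per-task item counts described above.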

We may break this down further by showing which types of preference pairs are generated for each task. Note that, typically, when the model starts training on a task, it fails to generate valid solutions at all, and only "failure type preference pairs" (avoid this or that error) are generated. This helps the model learn the task and eventually produce valid solutions, which then allows for generating positive feedback preference pairs (prefer this solution over that one). As the model improves further, more and more diverse preference pairs are generated. This is visible in the following plot:
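
The progression from failure-type pairs to more diverse pairs can be illustrated with a toy pair builder. This is a simplified sketch and not the argdown-feedback implementation: the `Candidate` class, the error names and the `virtue_score` field are hypothetical, and real virtue preference pair generators each apply their own criterion.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Candidate:
    text: str
    errors: frozenset  # names of failed checks; empty means the solution is valid
    virtue_score: float = 0.0  # hypothetical quality score among valid solutions


def build_pairs(candidates: List[Candidate]) -> List[Tuple[str, str, str]]:
    """Return (pair_type, chosen, rejected) triples for a set of candidates."""
    pairs = []
    for a, b in combinations(candidates, 2):
        if a.errors or b.errors:
            # Failure-type pairs: a candidate that avoids an error the other
            # one makes is preferred with respect to that specific error.
            for chosen, rejected in ((a, b), (b, a)):
                for err in sorted(rejected.errors - chosen.errors):
                    pairs.append((f"avoid:{err}", chosen.text, rejected.text))
        elif a.virtue_score != b.virtue_score:
            # Both solutions are valid: prefer the one that scores higher.
            chosen, rejected = (a, b) if a.virtue_score > b.virtue_score else (b, a)
            pairs.append(("virtue", chosen.text, rejected.text))
    return pairs


# Early in training: no valid solution, but one avoids an error the other makes.
early_pairs = build_pairs([
    Candidate("s1", frozenset({"no_argdown"})),
    Candidate("s2", frozenset({"no_argdown", "bad_label"})),
])

# Later: all solutions are valid, so only quality differences remain.
late_pairs = build_pairs([
    Candidate("s3", frozenset(), virtue_score=0.9),
    Candidate("s4", frozenset(), virtue_score=0.4),
])
```

Even when every candidate is invalid, the builder still yields a usable training signal, which matches the observation that learning starts from failure-type pairs alone.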

Evaluation

Chatting with Phi-4-Argunaut-1-HIRPO reveals that its training has been one-sided: its formal argument analysis skills have improved to the detriment of its general coherence, conversational abilities and instruction-following competence. This might be expected, as the HIRPO training focused exclusively on argument analysis tasks.

Before using the model in practice, it should be further finetuned on general instruction following data to restore its general capabilities.
