Argunauts Update: Learning Formal Argument Analysis with RLVF and HIRPO
- 🤗 DebateLabKIT/Phi-4-Argunaut-1-HIRPO
- 🛢️ DebateLabKIT/argunauts-hirpo-preferences
- 👩‍💻 debatelab/argdown-feedback
Introduction
Based on the HIRPO recipe, I've successfully trained Phi-4-Argunaut-1-HIRPO, an open large language model (LLM) that technically masters formal argument analysis. However, having lost general conversational abilities (likely due to one-sided training), Phi-4-Argunaut-1-HIRPO is not a production-ready chat model.
I'll sketch training methods and results below. Please refer to the previous articles for background information.
This brief post continues a series of posts about the Argunauts project:
- ☑️ Argunauts: Motivation and Goals ↪
- ☑️ Phase I: SFT Training ↪
- ☑️ Phase II: Selfplay Finetuning Line-By-Line
- ☑️ Phase III: RLVF with Hindsight Instruction Relabeling, Self-Correction and Dynamic Curriculum
- ✅ Update: ✨Learning Formal Argument Analysis with RLVF and HIRPO✨
Training
We use DebateLabKIT/Phi-4-Argunaut-1-SPIN-dev1 as the base model and train Phi-4-Argunaut-1-HIRPO for 100 HIRPO epochs, each of which consists of a data generation part and a training part.
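The alternation between data generation and training can be pictured as a simple loop. The following is only a minimal sketch: `run_hirpo`, `generate_pairs` and `train_on_pairs` are hypothetical names introduced for illustration, not functions from the HIRPO codebase, and the real pipeline involves considerably more machinery (judging, feedback, curriculum).

```python
from typing import Callable, List, Tuple

# A preference pair: (prompt, chosen, rejected). Hypothetical representation.
PreferencePair = Tuple[str, str, str]


def run_hirpo(
    model: str,
    generate_pairs: Callable[[str], List[PreferencePair]],
    train_on_pairs: Callable[[str, List[PreferencePair]], str],
    n_epochs: int = 100,
) -> str:
    """Alternate data generation and preference training for n_epochs."""
    for _ in range(n_epochs):
        # Part 1: the current model attempts exercises; judged outputs become pairs.
        pairs = generate_pairs(model)
        if not pairs:
            continue  # nothing to learn from in this epoch
        # Part 2: preference optimization on the freshly generated pairs.
        model = train_on_pairs(model, pairs)
    return model
```

With stub callables (for instance a trainer that just appends `"+"` to a model name), three epochs turn `"base"` into `"base+++"`, which makes the on-policy character of the loop visible: each epoch generates data with the model produced by the previous one.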
HIRPO Data Generation Configs
defaults:
  eval_gen_kwargs:
    temperature: 0.6
    max_tokens: 4096
    top_p: 0.95
    extra_body:
      min_p: 0.05
  gen_kwargs:
    temperature: 0.8
    max_tokens: 8192
    extra_body:
      min_p: 0.05
  feedback_gen_kwargs:
    temperature: 0.8
    max_tokens: 1024
    extra_body:
      min_p: 0.05
exercise_profiles:
  - name: arguments
    description: "argumentative texts with one main argument / line of argumentation"
    training:
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-pros-and-cons-1950"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-pros-and-cons-2010"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-procon_org"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-room-for-debate"
        split: "train"
    evaluation:
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-pros-and-cons-1950"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-pros-and-cons-2010"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-procon_org"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "arguments-room-for-debate"
        split: "test"
  - name: debates-handful
    description: "debates with 3-5 arguments"
    training:
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-procon_org-handful"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-1950-handful"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-2010-handful"
        split: "train"
    evaluation:
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-procon_org-handful"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-1950-handful"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-2010-handful"
        split: "test"
  - name: debates
    description: "debates with 6 or many more arguments"
    training:
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-debatabase-full"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-procon_org-full"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-1950-full"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-2010-full"
        split: "train"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-room-for-debate-full"
        split: "train"
    evaluation:
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-debatabase-full"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-procon_org-full"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-1950-full"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-pros-and-cons-2010-full"
        split: "test"
      - path: "DebateLabKIT/arguments-and-debates"
        name: "debates-room-for-debate-full"
        split: "test"
tasks:
  - name: arganno
    dependencies: null
    floor_weight: 0.025
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "arganno.Annotation"
    solution_generator_kwargs:
      n_solutions: 10
      n_revisions: 5
      gen_kwargs:
        temperature: 0.6
        max_tokens: 2048
        extra_body:
          min_p: 0.05
    problem: "arganno.AnnotationProblem"
    problem_generator: "arganno.AnnotationProblemGenerator"
    judge: "arganno.AnnotationJudge"
    max_workers: 8
    feedback_generator: "arganno.AnnotationFeedbackGenerator"
    feedback_generator_kwargs:
      n_feedbacks: 5
      gen_kwargs:
        temperature: 0.8
        max_tokens: 2048
        extra_body:
          min_p: 0.05
    virtue_preference_pair_generator:
      - "arganno.AnnotationAttacksPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationNoAttacksPreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "arganno.AnnotationSupportsPreferencePairGenerator"
    ask_for_invalid_probs:
      HasAnnotationsHandler: .5
  - name: arganno-l
    dependencies:
      - "arganno"
      - "argmap"
    exercises:
      - "debates"
    solution: "arganno.Annotation"
    solution_generator_kwargs:
      n_solutions: 10
      n_revisions: 5
      gen_kwargs:
        temperature: 0.6
        max_tokens: 4096
        extra_body:
          min_p: 0.05
    problem: "arganno.AnnotationProblem"
    problem_generator: "arganno.AnnotationProblemGenerator"
    judge: "arganno.AnnotationJudge"
    max_workers: 8
    feedback_generator: "arganno.AnnotationFeedbackGenerator"
    feedback_generator_kwargs:
      n_feedbacks: 5
      gen_kwargs:
        temperature: 0.8
        max_tokens: 2048
        extra_body:
          min_p: 0.05
    virtue_preference_pair_generator:
      - "arganno.AnnotationAttacksPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationNoAttacksPreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "arganno.AnnotationSupportsPreferencePairGenerator"
  - name: argmap
    dependencies: null
    floor_weight: 0.025
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "argmap.ArgumentMap"
    problem: "argmap.ArgMapProblem"
    problem_generator: "argmap.ArgMapProblemGenerator"
    judge: "argmap.ArgMapJudge"
    max_workers: 8
    feedback_generator: "argmap.ArgMapFeedbackGenerator"
    virtue_preference_pair_generator:
      - "argmap.BalancePreferencePairGenerator"
      - "argmap.DensityPreferencePairGenerator"
      - "argmap.ConnectednessPreferencePairGenerator"
      - "argmap.MaxDiameterPreferencePairGenerator"
      - "argmap.MinDiameterPreferencePairGenerator"
      - "argmap.MaxInDegreePreferencePairGenerator"
      - "argmap.MaxOutDegreePreferencePairGenerator"
      - "argmap.MaxArgsPreferencePairGenerator"
      - "argmap.MinLeafsPreferencePairGenerator"
      - "argmap.MaxAttacksPreferencePairGenerator"
      - "argmap.MaxSupportsPreferencePairGenerator"
      - "argmap.ShortClaimsPreferencePairGenerator"
      - "argmap.LongClaimsPreferencePairGenerator"
      - "argmap.ShortLabelsPreferencePairGenerator"
      - "argmap.DiverseLabelsPreferencePairGenerator"
      - "argmap.ArgumentClaimSizePreferencePairGenerator"
      - "argmap.IndependentWordingPreferencePairGenerator"
      - "argmap.SourceTextProximityPreferencePairGenerator"
    ask_for_invalid_probs:
      HasArgdownHandler: .5
  - name: argmap-l
    dependencies:
      - "argmap"
      - "arganno"
    exercises:
      - "debates"
    solution: "argmap.ArgumentMap"
    problem: "argmap.ArgMapProblem"
    problem_generator: "argmap.ArgMapProblemGenerator"
    judge: "argmap.ArgMapJudge"
    max_workers: 8
    feedback_generator: "argmap.ArgMapFeedbackGenerator"
    virtue_preference_pair_generator:
      - "argmap.BalancePreferencePairGenerator"
      - "argmap.DensityPreferencePairGenerator"
      - "argmap.ConnectednessPreferencePairGenerator"
      - "argmap.MaxDiameterPreferencePairGenerator"
      - "argmap.MinDiameterPreferencePairGenerator"
      - "argmap.MaxInDegreePreferencePairGenerator"
      - "argmap.MaxOutDegreePreferencePairGenerator"
      - "argmap.MaxArgsPreferencePairGenerator"
      - "argmap.MinLeafsPreferencePairGenerator"
      - "argmap.MaxAttacksPreferencePairGenerator"
      - "argmap.MaxSupportsPreferencePairGenerator"
      - "argmap.ShortClaimsPreferencePairGenerator"
      - "argmap.LongClaimsPreferencePairGenerator"
      - "argmap.ShortLabelsPreferencePairGenerator"
      - "argmap.DiverseLabelsPreferencePairGenerator"
      - "argmap.ArgumentClaimSizePreferencePairGenerator"
      - "argmap.IndependentWordingPreferencePairGenerator"
      - "argmap.SourceTextProximityPreferencePairGenerator"
  - name: infreco
    dependencies:
      - "infreco_from_arganno"
    floor_weight: 0.025
    exercises:
      - "arguments"
    solution: "infreco.InformalReco"
    problem: "infreco.InfRecoProblem"
    problem_generator: "infreco.InfRecoProblemGenerator"
    judge: "infreco.InfRecoJudge"
    max_workers: 8
    feedback_generator: "infreco.InfRecoFeedbackGenerator"
    virtue_preference_pair_generator:
      - "infreco.NoUnusedPropsPreferencePairGenerator"
      - "infreco.FewIntermediateConclusionsPreferencePairGenerator"
      - "infreco.ManyIntermediateConclusionsPreferencePairGenerator"
      - "infreco.SimplicityPreferencePairGenerator"
      - "infreco.VerbosityPreferencePairGenerator"
      - "infreco.IndependentWordingPreferencePairGenerator"
      - "infreco.SourceTextProximityPreferencePairGenerator"
    ask_for_invalid_probs:
      HasArgdownHandler: .5
  - name: logreco
    dependencies:
      - "logreco_from_infreco"
    floor_weight: 0.025
    exercises:
      - "arguments"
    solution: "logreco.LogicalReco"
    problem: "logreco.LogRecoProblem"
    problem_generator: "logreco.LogRecoProblemGenerator"
    judge: "logreco.LogRecoJudge"
    max_workers: 1
    feedback_generator: "logreco.LogRecoFeedbackGenerator"
    virtue_preference_pair_generator:
      - "logreco.PredicateLogicPreferencePairGenerator"
      - "logreco.FormalizationsFaithfulnessPreferencePairGenerator"
      - "logreco.FewIntermediateConclusionsPreferencePairGenerator"
      - "logreco.ManyIntermediateConclusionsPreferencePairGenerator"
      - "logreco.SourceTextProximityPreferencePairGenerator"
      - "logreco.IndependentWordingPreferencePairGenerator"
      - "logreco.SimplicityPreferencePairGenerator"
      - "logreco.VerbosityPreferencePairGenerator"
    ask_for_invalid_probs:
      HasArgdownHandler: .5
  - name: arganno_from_argmap
    dependencies:
      - "argmap"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "arganno.Annotation"
    problem: "arganno_from_argmap.ArgannoFromArgmapProblem"
    problem_generator: "arganno_from_argmap.ArgannoFromArgmapProblemGenerator"
    judge: "arganno.AnnotationJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "arganno_from_argmap.ArgmapTextProximityPreferencePairGenerator"
      - "arganno_from_argmap.ArgmapGraphProximityPreferencePairGenerator"
  - name: argmap_from_arganno
    dependencies:
      - "arganno"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "argmap.ArgumentMap"
    problem: "argmap_from_arganno.ArgmapFromArgannoProblem"
    problem_generator: "argmap_from_arganno.ArgmapFromArgannoProblemGenerator"
    judge: "argmap.ArgMapJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "argmap_from_arganno.AnnotationTextProximityPreferencePairGenerator"
      - "argmap_from_arganno.AnnotationGraphProximityPreferencePairGenerator"
  - name: infreco_from_arganno
    dependencies:
      - "arganno"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "infreco.InformalReco"
    problem: "infreco_from_arganno.InfRecoFromArgAnnoProblem"
    problem_generator: "infreco_from_arganno.InfRecoFromArgAnnoProblemGenerator"
    judge: "infreco.InfRecoJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "infreco_from_arganno.AnnotationProximityPreferencePairGenerator"
      - "infreco.NoUnusedPropsPreferencePairGenerator"
  - name: logreco_from_infreco
    dependencies:
      - "infreco"
    exercises:
      - "arguments"
    solution: "logreco.LogicalReco"
    problem: "logreco_from_infreco.LogrecoFromInfrecoProblem"
    problem_generator: "logreco_from_infreco.LogrecoFromInfrecoProblemGenerator"
    judge: "logreco.LogRecoJudge"
    max_workers: 1
    virtue_preference_pair_generator:
      - "logreco_from_infreco.InfrecoProximityPreferencePairGenerator"
      - "logreco.PredicateLogicPreferencePairGenerator"
      - "logreco.FormalizationsFaithfulnessPreferencePairGenerator"
      - "logreco.SimplicityPreferencePairGenerator"
  - name: arganno_plus_infreco
    dependencies:
      - "arganno"
      - "infreco"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "arganno_plus_infreco.ArgannoPlusInfreco"
    problem: "arganno_plus_infreco.ArgannoPlusInfrecoProblem"
    problem_generator: "arganno_plus_infreco.ArgannoPlusInfrecoProblemGenerator"
    judge: "arganno_plus_infreco.ArgannoPlusInfrecoJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "arganno_plus_infreco.AnnotationProximityPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "arganno.AnnotationAttacksPreferencePairGenerator"
      - "arganno.AnnotationSupportsPreferencePairGenerator"
      - "infreco.SimplicityPreferencePairGenerator"
  - name: arganno_plus_logreco
    dependencies:
      - "arganno"
      - "logreco"
      - "arganno_plus_infreco"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "arganno_plus_logreco.ArgannoPlusLogReco"
    problem: "arganno_plus_logreco.ArgannoPlusLogRecoProblem"
    problem_generator: "arganno_plus_logreco.ArgannoPlusLogRecoProblemGenerator"
    judge: "arganno_plus_logreco.ArgannoPlusLogRecoJudge"
    max_workers: 1
    virtue_preference_pair_generator:
      - "arganno_plus_infreco.AnnotationProximityPreferencePairGenerator"
      - "logreco.FormalizationsFaithfulnessPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "arganno.AnnotationAttacksPreferencePairGenerator"
      - "arganno.AnnotationSupportsPreferencePairGenerator"
      - "infreco.SimplicityPreferencePairGenerator"
  - name: argmap_plus_arganno
    dependencies:
      - "argmap"
      - "arganno"
      - "argmap_from_arganno"
      - "arganno_from_argmap"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "argmap_plus_arganno.ArgmapPlusArganno"
    problem: "argmap_plus_arganno.ArgmapPlusArgannoProblem"
    problem_generator: "argmap_plus_arganno.ArgmapPlusArgannoProblemGenerator"
    judge: "argmap_plus_arganno.ArgmapPlusArgannoJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "argmap_plus_arganno.AnnotationProximityPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "argmap.DensityPreferencePairGenerator"
      - "argmap.ConnectednessPreferencePairGenerator"
      - "argmap.MaxAttacksPreferencePairGenerator"
      - "argmap.MaxSupportsPreferencePairGenerator"
  - name: argmap_plus_arganno-l
    dependencies:
      - "argmap-l"
      - "arganno-l"
      - "argmap_from_arganno"
      - "arganno_from_argmap"
      - "logreco"
    exercises:
      - "debates"
    solution: "argmap_plus_arganno.ArgmapPlusArganno"
    problem: "argmap_plus_arganno.ArgmapPlusArgannoProblem"
    problem_generator: "argmap_plus_arganno.ArgmapPlusArgannoProblemGenerator"
    judge: "argmap_plus_arganno.ArgmapPlusArgannoJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "argmap_plus_arganno.AnnotationProximityPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "argmap.DensityPreferencePairGenerator"
      - "argmap.ConnectednessPreferencePairGenerator"
      - "argmap.MaxAttacksPreferencePairGenerator"
      - "argmap.MaxSupportsPreferencePairGenerator"
  - name: argmap_plus_infreco
    dependencies:
      - "argmap"
      - "infreco"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "argmap_plus_infreco.ArgmapPlusInfreco"
    problem: "argmap_plus_infreco.ArgmapPlusInfrecoProblem"
    problem_generator: "argmap_plus_infreco.ArgmapPlusInfrecoProblemGenerator"
    judge: "argmap_plus_infreco.ArgmapPlusInfrecoJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "argmap_plus_infreco.SimplicityPreferencePairGenerator"
      - "argmap_plus_infreco.ConnectednessPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxArgsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxSupportsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxAttacksPreferencePairGeneratorCT"
      - "argmap_plus_infreco.SourceTextProximityPreferencePairGeneratorCT"
  - name: argmap_plus_infreco-l
    dependencies:
      - "argmap_plus_infreco"
    exercises:
      - "debates"
    solution: "argmap_plus_infreco.ArgmapPlusInfreco"
    problem: "argmap_plus_infreco.ArgmapPlusInfrecoProblem"
    problem_generator: "argmap_plus_infreco.ArgmapPlusInfrecoProblemGenerator"
    judge: "argmap_plus_infreco.ArgmapPlusInfrecoJudge"
    max_workers: 8
    virtue_preference_pair_generator:
      - "argmap_plus_infreco.SimplicityPreferencePairGenerator"
      - "argmap_plus_infreco.ConnectednessPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxArgsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxSupportsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxAttacksPreferencePairGeneratorCT"
      - "argmap_plus_infreco.SourceTextProximityPreferencePairGeneratorCT"
  - name: argmap_plus_logreco
    dependencies:
      - "argmap"
      - "logreco"
      - "argmap_plus_infreco"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "argmap_plus_logreco.ArgmapPlusLogreco"
    problem: "argmap_plus_logreco.ArgmapPlusLogrecoProblem"
    problem_generator: "argmap_plus_logreco.ArgmapPlusLogrecoProblemGenerator"
    judge: "argmap_plus_logreco.ArgmapPlusLogrecoJudge"
    max_workers: 1
    virtue_preference_pair_generator:
      - "argmap_plus_infreco.ConnectednessPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxArgsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.SourceTextProximityPreferencePairGeneratorCT"
      - "argmap_plus_logreco.GlobalFormalizationsFaithfulnessPreferencePairGenerator"
  - name: argmap_plus_arganno_plus_logreco
    dependencies:
      - "argmap_plus_logreco"
      - "argmap_plus_arganno"
      - "arganno_plus_logreco"
    exercises:
      - "arguments"
      - "debates-handful"
    solution: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogreco"
    problem: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogrecoProblem"
    problem_generator: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogrecoProblemGenerator"
    judge: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogrecoJudge"
    max_workers: 1
    virtue_preference_pair_generator:
      - "argmap_plus_infreco.ConnectednessPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxArgsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.SourceTextProximityPreferencePairGeneratorCT"
      - "argmap_plus_logreco.GlobalFormalizationsFaithfulnessPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "argmap_plus_arganno.AnnotationProximityPreferencePairGenerator"
      - "arganno_plus_infreco.AnnotationProximityPreferencePairGenerator"
  - name: argmap_plus_arganno_plus_logreco-l
    dependencies:
      - "argmap_plus_arganno_plus_logreco"
      - "argmap_plus_logreco"
      - "arganno_plus_logreco"
      - "argmap_plus_arganno-l"
      - "argmap_plus_infreco-l"
    exercises:
      - "debates"
    solution: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogreco"
    problem: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogrecoProblem"
    problem_generator: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogrecoProblemGenerator"
    judge: "argmap_plus_arganno_plus_logreco.ArgmapPlusArgannoPlusLogrecoJudge"
    max_workers: 1
    virtue_preference_pair_generator:
      - "argmap_plus_infreco.ConnectednessPreferencePairGeneratorCT"
      - "argmap_plus_infreco.MaxArgsPreferencePairGeneratorCT"
      - "argmap_plus_infreco.SourceTextProximityPreferencePairGeneratorCT"
      - "argmap_plus_logreco.GlobalFormalizationsFaithfulnessPreferencePairGenerator"
      - "arganno.AnnotationCoveragePreferencePairGenerator"
      - "arganno.AnnotationScopePreferencePairGenerator"
      - "argmap_plus_arganno.AnnotationProximityPreferencePairGenerator"
      - "arganno_plus_infreco.AnnotationProximityPreferencePairGenerator"
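The `dependencies` fields in the config above induce a prerequisite graph over tasks. As a rough illustration, the sketch below copies an excerpt of that graph and computes (a) one order that respects all prerequisites and (b) which tasks become available once a given set is mastered. How HIRPO actually interprets `dependencies` (for example, what counts as "mastered", or how `floor_weight` enters) is not spelled out in this post, so `unlocked_tasks` and `curriculum_order` are hypothetical helpers built on that assumption.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Excerpt of the `dependencies` fields from the data generation config.
DEPENDENCIES = {
    "arganno": [],
    "argmap": [],
    "infreco_from_arganno": ["arganno"],
    "infreco": ["infreco_from_arganno"],
    "logreco_from_infreco": ["infreco"],
    "logreco": ["logreco_from_infreco"],
    "arganno_plus_infreco": ["arganno", "infreco"],
    "arganno_plus_logreco": ["arganno", "logreco", "arganno_plus_infreco"],
}


def curriculum_order() -> list[str]:
    """One ordering in which every task comes after all of its dependencies."""
    return list(TopologicalSorter(DEPENDENCIES).static_order())


def unlocked_tasks(mastered: set[str]) -> list[str]:
    """Tasks not yet mastered whose dependencies are all mastered (assumed semantics)."""
    return sorted(
        task
        for task, deps in DEPENDENCIES.items()
        if task not in mastered and all(dep in mastered for dep in deps)
    )
```

Under these assumptions, `unlocked_tasks(set())` returns `["arganno", "argmap"]`, the two tasks with `dependencies: null`, and mastering `arganno` additionally unlocks `infreco_from_arganno`.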
HIRPO Training Configs
# This is an example configuration file of TRL CLI
# The YAML file supports environment variables by adding an `env` field
# as below
# env:
# CUDA_VISIBLE_DEVICES: 0
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
use_liger: true # custom script argument
use_liger_kernel: true # transformers.TrainingArguments argument
# use_liger_loss: true # trl.DPOConfig argument
bf16: true
tf32: true
# Dataset arguments
max_length: 8192
max_prompt_length: 4096
# Training arguments
num_train_epochs: 2
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 5.0e-7
lr_scheduler_type: linear
loss_type: sigmoid
warmup_ratio: 0.5
# Logging arguments
logging_strategy: steps
logging_steps: 5
report_to:
- wandb
# Evaluation arguments
eval_strategy: "no"
# Checkpointing arguments
save_strategy: "no"
save_steps: 50
save_total_limit: 2
seed: 42
# Hugging Face Hub
push_to_hub: true
hub_strategy: every_save
hub_private_repo: true
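The config sets `loss_type: sigmoid` and mentions `trl.DPOConfig`, which suggests the standard DPO objective on each preference pair. The following is a minimal pure-Python sketch of that loss for a single pair, not the TRL implementation. The `beta` value is an assumption: it is not set in the config above, and 0.1 is used here only as a placeholder.

```python
import math


def dpo_sigmoid_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumption: beta is not specified in the config above
) -> float:
    """-log(sigmoid(beta * (policy log-ratio - reference log-ratio))) for one pair."""
    logits = (policy_chosen_logp - policy_rejected_logp) - (
        ref_chosen_logp - ref_rejected_logp
    )
    x = beta * logits
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy and the reference model agree exactly, the loss equals `log(2)`; it falls below that value as the policy assigns relatively more probability to the chosen answer than the reference model does.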
Training Metrics
Training loss decreases over the 100 training epochs (~4.5k steps):
The performance in the diverse argument analysis tasks, assessed against exercises from the eval split, improves significantly during HIRPO training:
Noteworthy:
DebateLabKIT/Phi-4-Argunaut-1-HIRPO acquires complex skills involving logical argument reconstruction and argument mapping or annotation.
The training curriculum is dynamically adjusted as the model acquires more and more skills, as shown by the number of training items generated in each epoch:
We may break this down further by showing which types of preference pairs are generated for each task. Note that, typically, when the model starts training on a task, it fails to generate valid solutions at all, so only "failure type preference pairs" (avoid this or that error) are generated. These help the model learn the task and eventually produce valid solutions, which in turn allows for generating positive feedback preference pairs (prefer this solution over that one). As the model improves further, more and more diverse preference pairs are generated. This is visible in the following plot:
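To make the staging of preference pair types more concrete, here is a deliberately simplified sketch. It is an illustration under assumptions, not the logic of the actual pair generators: the `Candidate` class, the error names, the `virtue_score` field and the pair-type labels are all hypothetical. While no candidate is valid, it emits only "avoid error X" pairs between invalid candidates. Once valid candidates exist, it emits valid-over-invalid pairs and, among valid candidates, pairs that prefer the higher score.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Set, Tuple


@dataclass
class Candidate:
    text: str
    errors: Set[str] = field(default_factory=set)  # empty set means "valid"
    virtue_score: float = 0.0  # hypothetical quality score among valid solutions


def build_pairs(candidates: List[Candidate]) -> List[Tuple[str, str, str]]:
    """Return (pair_type, chosen, rejected) triples for one exercise."""
    valid = [c for c in candidates if not c.errors]
    invalid = [c for c in candidates if c.errors]
    pairs: List[Tuple[str, str, str]] = []
    if not valid:
        # Early stage: only failure-type pairs ("avoid error X").
        for a, b in combinations(invalid, 2):
            for chosen, rejected in ((a, b), (b, a)):
                for err in sorted(rejected.errors - chosen.errors):
                    pairs.append((f"avoid:{err}", chosen.text, rejected.text))
        return pairs
    # Later stage: a valid solution is preferred over an invalid one ...
    for v in valid:
        for i in invalid:
            pairs.append(("validity", v.text, i.text))
    # ... and, among valid solutions, the higher-scoring one is preferred.
    for a, b in combinations(valid, 2):
        if a.virtue_score != b.virtue_score:
            chosen, rejected = (a, b) if a.virtue_score > b.virtue_score else (b, a)
            pairs.append(("virtue", chosen.text, rejected.text))
    return pairs
```

The early branch is what lets training start at all on a task the model cannot yet solve: even two invalid attempts yield a usable signal as long as one of them avoids an error the other commits.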
Evaluation
Chatting with Phi-4-Argunaut-1-HIRPO reveals that its training has been one-sided: it improved the model's formal argument analysis skills to the detriment of its general coherence, conversational abilities, and instruction-following competence. This might be expected, as the HIRPO training focused exclusively on argument analysis tasks.
Before using the model in practice, it should be further finetuned on general instruction following data to restore its general capabilities.



