From Data to Behavior: Predicting Unintended Model Behaviors Before Training
Abstract
Data2Behavior predicts unintended model behaviors before training using MDF, a lightweight method that analyzes data features to reveal potential biases without parameter updates.
Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data, even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into vulnerabilities that exist before training begins.
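To make the MDF description above concrete, here is a minimal sketch of the data-feature-injection idea: summarize the candidate fine-tuning data by a mean hidden representation and add it to the base model's activations through a forward hook, with no parameter updates. The injection layer, the scaling factor `alpha`, the placeholder data, and the hook mechanism itself are illustrative assumptions, not the paper's exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-14B"  # one of the checkpoints named in the abstract
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Placeholder candidate fine-tuning data; in practice this is the dataset being vetted.
candidate_data = [
    "Example fine-tuning sample one.",
    "Example fine-tuning sample two.",
]

# 1) Summarize the candidate data by the mean of its last-layer hidden states.
with torch.no_grad():
    reps = []
    for text in candidate_data:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]  # (1, seq, dim)
        reps.append(hidden.mean(dim=1))                                        # (1, dim)
    data_feature = torch.cat(reps, dim=0).mean(dim=0)                          # (dim,)

# 2) Inject the summary vector into the forward pass via a hook; no weights are changed.
alpha = 0.5                     # assumed injection strength
layer = model.model.layers[-4]  # assumed injection point (a late decoder layer)

def inject(module, args, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * data_feature.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = layer.register_forward_hook(inject)

# 3) Probe the hooked model for behavior shifts before any fine-tuning happens.
prompt = "A new colleague joins the team. Describe your first impression of them."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unmodified base model
```

Outputs generated with the hook in place can then be compared against the unhooked base model on a bias or safety probe set to estimate the behavior the data would induce after fine-tuning.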
Community
Can we foresee unintended model behaviors before fine-tuning?
We demonstrate that unintended biases and safety risks can be traced back to interpretable latent data statistics that mechanistically influence model activations, without any parameter updates.
This is an automated message from Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this one:
- Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models (2026)
- One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs (2025)
- ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification (2026)
- Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning (2026)
- Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing (2026)
- Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models (2026)
- Defenses Against Prompt Attacks Learn Surface Heuristics (2026)