Training optimization - a galois77 Collection

Training optimization

updated 18 days ago

The Curse of Depth in Large Language Models

Paper • 2502.05795 • Published Feb 9 • 40

Transformers without Normalization

Paper • 2503.10622 • Published Mar 13 • 171

Parallel Scaling Law for Language Models

Paper • 2505.10475 • Published May 15 • 83

Learning to Skip the Middle Layers of Transformers

Paper • 2506.21103 • Published Jun 26 • 18

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 75

All is Not Lost: LLM Recovery without Checkpoints

Paper • 2506.15461 • Published Jun 18 • 37

TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training

Paper • 2508.17677 • Published Aug 25 • 14

Superpositional Gradient Descent: Harnessing Quantum Principles for Model Training

Paper • 2511.01918 • Published Nov 1 • 11

Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

Paper • 2511.13254 • Published 19 days ago • 134