FinOLMo-13B

This is a base (not instruction-tuned) large language model, continually pre-trained on Finnish data starting from the English OLMo2-13B model.

The model was trained for 20 000 steps on around 170 billion tokens in total. Intermediate checkpoints are published as branches of this repository.
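
For reference, a minimal usage sketch with Hugging Face transformers is shown below; the branch name `step10000` is only a placeholder for whichever intermediate checkpoint you want, so check the repository's branch list for the actual names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HPLT/FinOLMo-13B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")  # final checkpoint
# An intermediate checkpoint can be loaded from its branch (branch name is a placeholder):
# model = AutoModelForCausalLM.from_pretrained(model_id, revision="step10000", torch_dtype="auto")

# This is a base model, so it continues text rather than following instructions.
inputs = tokenizer("Suomi on", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```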

Data Details

Stage 1 (16 000 steps -- 135B tokens)

Data

  • HPLTv3 Finnish
  • FinePDFs Finnish
  • OLMo-Mix

Data Splits

| Data | Percentage | Unique Tokens | Total Tokens | Number of Documents | Average Document Length |
|---|---|---|---|---|---|
| HPLT Finnish | 69.75 | 46.8B | 93.6B | 36.5M | 944 |
| FinePDFs Finnish | 14.45 | 9.7B | 19.4B | 1.5M | 4 895 |
| Wiki (OLMo-Mix) | 0.02 | 0.2B | 26.8M | 0.3M | 690 |
| Alg. Stack (OLMo-Mix) | 0.04 | 0.6B | 53.7M | 0.1M | 4 291 |
| Open Web Math (OLMo-Mix) | 0.04 | 0.6B | 53.7M | 0.1M | 4 291 |
| ArXiv (OLMo-Mix) | 0.05 | 1.1B | 67.1M | 0.2M | 5 318 |
| PeS2o (OLMo-Mix) | 0.15 | 2.6B | 0.2B | 1.6M | 1 692 |
| DCLM (OLMo-Mix) | 9.50 | 49.7B | 12.8B | 35.1M | 1 416 |
| StarCoder (OLMo-Mix) | 2.10 | 31.5B | 8.1B | 23.6M | 1 333 |

The number of documents is the total number of unique documents in each source, not the number of documents seen during training.

Only a portion of OLMo-Mix was used as unique data.
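
As a rough illustration of what these two notes mean in practice, the ratio of total to unique tokens gives the approximate number of repetitions (epochs) of each source; the snippet below (not part of the training code) computes it for a few rows of the table.

```python
# (unique tokens, total tokens) in billions, taken from the Stage 1 table above
stage1 = {
    "HPLT Finnish": (46.8, 93.6),
    "FinePDFs Finnish": (9.7, 19.4),
    "DCLM (OLMo-Mix)": (49.7, 12.8),
    "StarCoder (OLMo-Mix)": (31.5, 8.1),
}

for name, (unique, total) in stage1.items():
    # ratio > 1: the source is repeated; ratio < 1: only a fraction of it is sampled
    print(f"{name}: ~{total / unique:.1f} epochs")
# The Finnish sources are seen roughly twice, while the OLMo-Mix sources
# are subsampled (only a portion of their unique tokens is used).
```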

Stage 2 (4 000 steps -- 35B tokens)

Data

  • HPLTv3 (filtered) Finnish
  • FinePDFs-Edu Finnish
  • Stack-Edu
  • MegaMath Web-Pro
  • FineMath 4+
  • InfiWebMath 4+

Data Splits

| Data | Percentage | Unique Tokens | Total Tokens | Number of Documents | Average Document Length |
|---|---|---|---|---|---|
| HPLT Finnish | 40.79 | 3.4B | 13.7B | 3.1M | 1 109 |
| FinePDFs-Edu Finnish | 17.84 | 1.5B | 6.0B | 0.2M | 7 081 |
| FinePDFs-Edu English | 15.00 | 7.5B | 5.0B | 1.2M | 6 485 |
| Stack-Edu | 15.00 | 13.2B | 5.0B | 15.0M | 880 |
| MegaMath Web-Pro | 4.76 | 14.0B | 1.6B | 15.0M | 937 |
| FineMath 4+ | 3.51 | 10.4B | 1.2B | 6.7M | 1 545 |
| InfiWebMath 4+ | 3.09 | 9.1B | 1.0B | 6.3M | 1 447 |
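
As a quick sanity check (illustrative only, not part of the training pipeline), the per-source token counts above sum to roughly the 35B Stage 2 budget, and each source's share matches the reported percentages:

```python
# (reported percentage, total tokens in billions) from the Stage 2 table above
stage2 = {
    "HPLT Finnish": (40.79, 13.7),
    "FinePDFs-Edu Finnish": (17.84, 6.0),
    "FinePDFs-Edu English": (15.00, 5.0),
    "Stack-Edu": (15.00, 5.0),
    "MegaMath Web-Pro": (4.76, 1.6),
    "FineMath 4+": (3.51, 1.2),
    "InfiWebMath 4+": (3.09, 1.0),
}

budget = sum(tokens for _, tokens in stage2.values())
print(f"total: {budget:.1f}B")  # ~33.5B, i.e. roughly the stated 35B
for name, (pct, tokens) in stage2.items():
    print(f"{name}: {100 * tokens / budget:.2f}% (reported {pct}%)")
```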

Training Details

Stage 1

| Hyperparameter | Value |
|---|---|
| Embedding train steps | 1 000 |
| Warmup steps | 2 000 |
| Total train steps | 16 000 |
| Learning rate schedule | Warmup + constant |
| Learning rate | 3e-4 |
| Weight decay | 1e-1 |
| Sequence length | 4 096 |
| Batch size | 2 048 |
| RoPE theta | 500 000 |
| Clip grad | 1.0 |
| Adam epsilon | 1e-8 |
| Adam beta_1 | 0.9 |
| Adam beta_2 | 0.95 |
| RMSNorm epsilon | 1e-6 |
| Z-loss ratio | 1e-5 |
| Diffusion loss ratio | 2e-2 |
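
The schedule above amounts to a linear warmup over the first 2 000 steps followed by a constant rate for the remainder of the 16 000 steps; a minimal sketch (illustrative, not the actual training code):

```python
PEAK_LR = 3e-4
WARMUP_STEPS = 2_000
STAGE1_STEPS = 16_000

def stage1_lr(step: int) -> float:
    """Warmup + constant schedule used in Stage 1 (sketch)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS  # linear warmup from 0
    return PEAK_LR                            # constant until step 16 000

print(stage1_lr(1_000), stage1_lr(2_000), stage1_lr(STAGE1_STEPS))  # 1.5e-4, 3e-4, 3e-4
```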

Stage 2

| Hyperparameter | Value |
|---|---|
| Decay steps | 4 000 |
| Total train steps | 4 000 |
| Learning rate schedule | Linear decay |
| Initial learning rate | 3e-4 |
| Final learning rate | 0 |
| Weight decay | 1e-1 |
| Sequence length | 16 384 |
| Batch size | 512 |
| RoPE theta | 2 000 000 |
| Clip grad | 1.0 |
| Adam epsilon | 1e-8 |
| Adam beta_1 | 0.9 |
| Adam beta_2 | 0.95 |
| RMSNorm epsilon | 1e-6 |
| Z-loss ratio | 1e-5 |
| Diffusion loss ratio | 2e-2 |
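
Stage 2 then decays the rate linearly from the Stage 1 constant of 3e-4 down to 0 over its 4 000 steps; a minimal sketch under that assumption:

```python
INITIAL_LR = 3e-4
DECAY_STEPS = 4_000

def stage2_lr(step: int) -> float:
    """Linear decay schedule used in Stage 2 (sketch)."""
    return INITIAL_LR * max(0.0, 1.0 - step / DECAY_STEPS)

print(stage2_lr(0), stage2_lr(2_000), stage2_lr(4_000))  # 3e-4, 1.5e-4, 0.0
```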

Acknowledgements

Training was conducted as part of the HPLT project.

This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546].
