Slovenian Educational Quality Classifier
This is a sentence-transformers model finetuned from rokn/slovlo-v1. It maps sentences and paragraphs to a single scalar score and can be used for assessing the educational value of Slovenian texts. Its development was inspired by FineWeb-Edu (arXiv:2406.17557), but with the methodology adapted to Slovenian-language content.
Model Description
The model is a regression model consisting of a frozen text encoder and a single linear layer.
- Model Type: Sentence Transformer
- Base model: rokn/slovlo-v1
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1 dimension (a single score)
- Method: Regression (predicts a continuous value, learned in the range 0-5).
- Training data: 564,044 Slovenian texts annotated with synthetic ratings (using the Gemma 3 27B model), filtered for high consensus (stdev < 0.5).
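The setup described above can be sketched as follows: embeddings from the frozen encoder are computed once, and a single linear layer is trained on them with an MSE objective against the 0-5 target scores. This is a minimal illustration with placeholder data and an assumed optimizer and learning rate, not the exact training code:
import torch
from torch import nn
from sentence_transformers import SentenceTransformer
# Frozen encoder: embeddings are computed once and never updated.
encoder = SentenceTransformer("rokn/slovlo-v1")
texts = ["..."]                    # Slovenian training texts (placeholder)
scores = torch.tensor([3.0])       # synthetic 0-5 ratings (placeholder)
with torch.no_grad():
    embeddings = encoder.encode(texts, convert_to_tensor=True)  # (N, 768)
# Single linear regression head: 768 -> 1.
head = nn.Linear(embeddings.shape[1], 1)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)  # assumed hyperparameters
loss_fn = nn.MSELoss()
for epoch in range(20):
    optimizer.zero_grad()
    preds = head(embeddings).squeeze(-1)
    loss = loss_fn(preds, scores)
    loss.backward()
    optimizer.step()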
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Dense({'in_features': 768, 'out_features': 1, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
Model Training Note: The model was initially trained using frozen sentence embeddings and a separately learned regression head. After training, the regression head was integrated directly into the SentenceTransformers architecture to streamline inference and simplify deployment.
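One way such a merge can be done with the sentence_transformers models API is sketched below. The module layout mirrors the architecture listed above, but the exact integration code used for this model is not published, so treat this as an assumption:
from sentence_transformers import SentenceTransformer, models
import torch
# Rebuild the pipeline: Transformer -> mean Pooling -> Dense regression head.
transformer = models.Transformer("rokn/slovlo-v1", max_seq_length=512)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
dense = models.Dense(in_features=768, out_features=1, bias=True,
                     activation_function=torch.nn.Identity())
# Copy the separately trained regression weights into the Dense module
# ('head' is the trained nn.Linear from the sketch above).
dense.linear.weight.data = head.weight.data
dense.linear.bias.data = head.bias.data
model = SentenceTransformer(modules=[transformer, pooling, dense])
model.save("slovenian-edu-classifier-slovlo")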
Evaluation Results
The model was evaluated on a holdout test set (10% of the data, 56,405 samples). Metrics from the last epoch (20) are reported.
Key regression metrics
- Evaluation loss (MSE): 0.2538
- Evaluation loss (RMSE): 0.5037
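RMSE here is simply the square root of the MSE (√0.2538 ≈ 0.504). A minimal sketch of how both can be computed from held-out predictions, where y_true and y_pred are placeholder arrays of annotated and predicted scores:
import numpy as np
from sklearn.metrics import mean_squared_error
# y_true: annotated 0-5 scores, y_pred: continuous scores from model.encode(...)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}")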
6-class confusion matrix
(Rows = Actual values, Columns = Model predictions)
| Actual \ Predicted | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 0 | 0 | 0 |
| 1 | 100 | 2,291 | 2,499 | 79 | 2 | 0 |
| 2 | 1 | 1,071 | 11,044 | 3,291 | 36 | 0 |
| 3 | 0 | 44 | 4,906 | 14,411 | 1,354 | 0 |
| 4 | 0 | 1 | 162 | 7,795 | 6,994 | 96 |
| 5 | 0 | 0 | 0 | 55 | 166 | 4 |
6-grade Classification Report
precision recall f1-score support
0 0.019 0.667 0.038 3
1 0.672 0.461 0.547 4971
2 0.593 0.715 0.649 15443
3 0.562 0.696 0.622 20715
4 0.818 0.465 0.593 15048
5 0.040 0.018 0.025 225
accuracy 0.616 56405
macro avg 0.451 0.503 0.412 56405
weighted avg 0.647 0.616 0.612 56405
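The discrete tables above require mapping the continuous predictions onto the six grades; a round-and-clip mapping is assumed in the sketch below (y_true and y_pred as in the regression sketch above):
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
# Map continuous scores to discrete grades 0-5 (assumed round-and-clip scheme).
y_pred_class = np.clip(np.round(y_pred), 0, 5).astype(int)
y_true_class = np.asarray(y_true).astype(int)
print(confusion_matrix(y_true_class, y_pred_class, labels=list(range(6))))
print(classification_report(y_true_class, y_pred_class, digits=3))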
Analysis of a binary task (Score >= 3)
Confusion Matrix
(0 = Less than 3, 1 = 3 or more)
| Actual \ Predicted | 0 (< 3) | 1 (>= 3) |
|---|---|---|
| 0 (< 3) | 17,009 | 3,408 |
| 1 (>= 3) | 5,113 | 30,875 |
Binary Classification Report
precision recall f1-score support
Class 0 (< 3) 0.7689 0.8331 0.7997 20417
Class 1 (>= 3) 0.9006 0.8579 0.8787 35988
accuracy 0.8489 56405
macro avg 0.8347 0.8455 0.8392 56405
weighted avg 0.8529 0.8489 0.8501 56405
This model achieves an F1 score of 0.8787 (87.9%) for the high-quality content class (rating >= 3) in binary classification with a threshold value of 3. This result exceeds the F1 score of 82% reported in a comparable scientific publication, confirming the high performance of the model in distinguishing between content types.
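These binary figures correspond to thresholding both the annotated and the predicted scores at 3, roughly as follows (again with placeholder y_true / y_pred arrays):
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
# 1 = educational content (score >= 3), 0 = everything else (score < 3)
y_true_bin = (np.asarray(y_true) >= 3).astype(int)
y_pred_bin = (np.asarray(y_pred) >= 3).astype(int)
print(confusion_matrix(y_true_bin, y_pred_bin))
print(classification_report(y_true_bin, y_pred_bin, digits=4))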
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("zID4si/slovenian-edu-classifier-slovlo")
# Run inference
sentences = [
'Sonce je zvezda v središču našega osončja. Je skoraj popolna krogla vroče plazme.',
'Uff mater, Res imate nekateri zelo omejena razmišljanja.'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (2, 1)
print(embeddings)
# [[2.6884167]
# [0.9754883]]
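Because the output is a single continuous score per input text, corpus filtering reduces to a threshold comparison; for example, treating scores of 3 or more as educational content, mirroring the binary evaluation above (the threshold is the user's choice):
texts = ["...", "..."]  # your Slovenian documents (placeholder)
scores = model.encode(texts).squeeze(-1)
# Keep documents with an estimated educational value of at least 3.
educational = [t for t, s in zip(texts, scores) if s >= 3.0]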
Limitations
Scope: The model’s performance may vary when applied to datasets different from those seen during training, particularly for out-of-distribution samples. It is optimized for educational content related to primary and middle-school levels and may not perform as well on materials intended for higher education or specialized domains.
Bias: The model’s behavior depends on the quality and representativeness of both the training data and the large language model used to create the annotations. Any biases present in these sources may influence the model’s predictions.
Context: The classifier evaluates isolated text samples, such as individual web pages or extracted passages, without access to broader discourse or surrounding context. The model is limited to a maximum sequence length of 512 tokens; for longer texts, predictions might be improved by applying a sliding-window aggregation approach, though this method has not been tested.
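As noted above, a sliding-window approach has not been validated for this model; the following is only a sketch of how a longer text could be split into overlapping chunks whose scores are then averaged (chunk size and stride are arbitrary choices here):
def score_long_text(model, text, window=400, stride=300):
    """Untested heuristic: split a long text into overlapping word windows
    (whitespace tokens, not model tokens) and average the per-window scores."""
    words = text.split()
    chunks = [" ".join(words[i:i + window])
              for i in range(0, max(len(words) - window, 0) + 1, stride)]
    if not chunks:
        chunks = [text]
    scores = model.encode(chunks).squeeze(-1)
    return float(scores.mean())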
Author
Tomaž Savodnik, Zavod za informacijsko družbo (zID)
The model was developed as part of research into the quality and diversity of Slovenian online corpora.