Slovenian Educational Quality Classifier
This is a sentence-transformers model finetuned from rokn/slovlo-v1. It maps sentences and paragraphs to a single scalar score and can be used for assessing the educational value of Slovenian texts. Its development was inspired by FineWeb-Edu (arXiv:2406.17557), but with the methodology adapted to Slovenian-language content.
Model Description
The model is a regression model consisting of a frozen text encoder and a single linear layer.
- Model Type: Sentence Transformer
- Base model: rokn/slovlo-v1
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1 dimension (a single score)
- Method: Regression (predicts a continuous value, learned in the range 0-5).
- Training data: 564,044 Slovenian texts annotated with synthetic ratings (using the Gemma 3 27B model), filtered for high consensus (stdev < 0.5).
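The setup described above can be sketched as follows: embeddings from the frozen encoder are computed once, and a single linear layer is trained on them with an MSE objective against the 0-5 target scores. This is a minimal illustration with placeholder data and an assumed optimizer and learning rate, not the exact training code:
import torch
from torch import nn
from sentence_transformers import SentenceTransformer
# Frozen encoder: embeddings are computed once and never updated.
encoder = SentenceTransformer("rokn/slovlo-v1")
texts = ["..."]                    # Slovenian training texts (placeholder)
scores = torch.tensor([3.0])       # synthetic 0-5 ratings (placeholder)
with torch.no_grad():
    embeddings = encoder.encode(texts, convert_to_tensor=True)  # (N, 768)
# Single linear regression head: 768 -> 1.
head = nn.Linear(embeddings.shape[1], 1)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)  # assumed hyperparameters
loss_fn = nn.MSELoss()
for epoch in range(20):
    optimizer.zero_grad()
    preds = head(embeddings).squeeze(-1)
    loss = loss_fn(preds, scores)
    loss.backward()
    optimizer.step()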
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Dense({'in_features': 768, 'out_features': 1, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
Model Training Note: The model was initially trained using frozen sentence embeddings and a separately learned regression head. After training, the regression head was integrated directly into the SentenceTransformers architecture to streamline inference and simplify deployment.
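One way such a merge can be done with the sentence_transformers models API is sketched below. The module layout mirrors the architecture listed above, but the exact integration code used for this model is not published, so treat this as an assumption:
from sentence_transformers import SentenceTransformer, models
import torch
# Rebuild the pipeline: Transformer -> mean Pooling -> Dense regression head.
transformer = models.Transformer("rokn/slovlo-v1", max_seq_length=512)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
dense = models.Dense(in_features=768, out_features=1, bias=True,
                     activation_function=torch.nn.Identity())
# Copy the separately trained regression weights into the Dense module
# ('head' is the trained nn.Linear from the sketch above).
dense.linear.weight.data = head.weight.data
dense.linear.bias.data = head.bias.data
model = SentenceTransformer(modules=[transformer, pooling, dense])
model.save("slovenian-edu-classifier-slovlo")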
Evaluation Results
The model was evaluated on a holdout test set (10% of the data, 56,405 samples). Metrics from the last epoch (20) are reported.
Key regression metrics
- Evaluation loss (MSE): 0.2538
- Evaluation loss (RMSE): 0.5037
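RMSE here is simply the square root of the MSE (√0.2538 ≈ 0.504). A minimal sketch of how both can be computed from held-out predictions, where y_true and y_pred are placeholder arrays of annotated and predicted scores:
import numpy as np
from sklearn.metrics import mean_squared_error
# y_true: annotated 0-5 scores, y_pred: continuous scores from model.encode(...)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}")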
6-class confusion matrix
(Rows = Actual values, Columns = Model predictions)
| Actual \ Predicted | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 0 | 0 | 0 |
| 1 | 100 | 2,291 | 2,499 | 79 | 2 | 0 |
| 2 | 1 | 1,071 | 11,044 | 3,291 | 36 | 0 |
| 3 | 0 | 44 | 4,906 | 14,411 | 1,354 | 0 |
| 4 | 0 | 1 | 162 | 7,795 | 6,994 | 96 |
| 5 | 0 | 0 | 0 | 55 | 166 | 4 |
6-grade Classification Report
precision recall f1-score support
0 0.019 0.667 0.038 3
1 0.672 0.461 0.547 4971
2 0.593 0.715 0.649 15443
3 0.562 0.696 0.622 20715
4 0.818 0.465 0.593 15048
5 0.040 0.018 0.025 225
accuracy 0.616 56405
macro avg 0.451 0.503 0.412 56405
weighted avg 0.647 0.616 0.612 56405
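The discrete tables above require mapping the continuous predictions onto the six grades; a round-and-clip mapping is assumed in the sketch below (y_true and y_pred as in the regression sketch above):
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
# Map continuous scores to discrete grades 0-5 (assumed round-and-clip scheme).
y_pred_class = np.clip(np.round(y_pred), 0, 5).astype(int)
y_true_class = np.asarray(y_true).astype(int)
print(confusion_matrix(y_true_class, y_pred_class, labels=list(range(6))))
print(classification_report(y_true_class, y_pred_class, digits=3))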
Analysis of a binary task (Score >= 3)
Confusion Matrix
(0 = Less than 3, 1 = 3 or more)
| Actual \ Predicted | 0 (< 3) | 1 (>= 3) |
|---|---|---|
| 0 (< 3) | 17,009 | 3,408 |
| 1 (>= 3) | 5,113 | 30,875 |
Binary Classification Report
precision recall f1-score support
Class 0 (< 3) 0.7689 0.8331 0.7997 20417
Class 1 (>= 3) 0.9006 0.8579 0.8787 35988
accuracy 0.8489 56405
macro avg 0.8347 0.8455 0.8392 56405
weighted avg 0.8529 0.8489 0.8501 56405
This model achieves an F1 score of 0.8787 (87.9%) for the high-quality content class (rating >= 3) in binary classification with a threshold value of 3. This result exceeds the F1 score of 82% reported in a comparable scientific publication, confirming the high performance of the model in distinguishing between content types.
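These binary figures correspond to thresholding both the annotated and the predicted scores at 3, roughly as follows (again with placeholder y_true / y_pred arrays):
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
# 1 = educational content (score >= 3), 0 = everything else (score < 3)
y_true_bin = (np.asarray(y_true) >= 3).astype(int)
y_pred_bin = (np.asarray(y_pred) >= 3).astype(int)
print(confusion_matrix(y_true_bin, y_pred_bin))
print(classification_report(y_true_bin, y_pred_bin, digits=4))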
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("zID4si/slovenian-edu-classifier-slovlo")
# Run inference
sentences = [
'Sonce je zvezda v središču našega osončja. Je skoraj popolna krogla vroče plazme.',
'Uff mater, Res imate nekateri zelo omejena razmišljanja.'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (2, 1)
print(embeddings)
# [[2.6884167]
# [0.9754883]]
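Because the output is a single continuous score per input text, corpus filtering reduces to a threshold comparison; for example, treating scores of 3 or more as educational content, mirroring the binary evaluation above (the threshold is the user's choice):
texts = ["...", "..."]  # your Slovenian documents (placeholder)
scores = model.encode(texts).squeeze(-1)
# Keep documents with an estimated educational value of at least 3.
educational = [t for t, s in zip(texts, scores) if s >= 3.0]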
Limitations
Scope: The model’s performance may vary when applied to datasets different from those seen during training, particularly for out-of-distribution samples. It is optimized for educational content related to primary and middle-school levels and may not perform as well on materials intended for higher education or specialized domains.
Bias: The model’s behavior depends on the quality and representativeness of both the training data and the large language model used to create the annotations. Any biases present in these sources may influence the model’s predictions.
Context: The classifier evaluates isolated text samples, such as individual web pages or extracted passages, without access to broader discourse or surrounding context. The model is limited to a maximum sequence length of 512 tokens; for longer texts, predictions might be improved by applying a sliding-window aggregation approach, though this method has not been tested.
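As noted above, a sliding-window approach has not been validated for this model; the following is only a sketch of how a longer text could be split into overlapping chunks whose scores are then averaged (chunk size and stride are arbitrary choices here):
def score_long_text(model, text, window=400, stride=300):
    """Untested heuristic: split a long text into overlapping word windows
    (whitespace tokens, not model tokens) and average the per-window scores."""
    words = text.split()
    chunks = [" ".join(words[i:i + window])
              for i in range(0, max(len(words) - window, 0) + 1, stride)]
    if not chunks:
        chunks = [text]
    scores = model.encode(chunks).squeeze(-1)
    return float(scores.mean())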
Author
Tomaž Savodnik, Zavod za informacijsko družbo (zID)
The model was developed as part of research into the quality and diversity of Slovenian online corpora.