---
title: GLUE
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  GLUE, the General Language Understanding Evaluation benchmark
  (https://gluebenchmark.com/) is a collection of resources for training,
  evaluating, and analyzing natural language understanding systems.
---
# Metric Card for GLUE

## Metric description
This metric is used to compute the GLUE evaluation metric associated with each [GLUE dataset](https://huggingface.co/datasets/glue).

GLUE, the General Language Understanding Evaluation benchmark, is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
## How to use

There are two steps: (1) loading the GLUE metric relevant to the subset of the GLUE dataset being used for evaluation; and (2) calculating the metric.

1. **Loading the relevant GLUE metric**: the subsets of GLUE are the following: `sst2`, `mnli`, `mnli_mismatched`, `mnli_matched`, `qnli`, `rte`, `wnli`, `cola`, `stsb`, `mrpc`, `qqp`, and `hans`.

More information about the different subsets of the GLUE dataset can be found on the [GLUE dataset page](https://huggingface.co/datasets/glue).

2. **Calculating the metric**: the metric takes two inputs: one list with the predictions of the model to score and one list of references.
```python
from evaluate import load

# Load the metric for the SST-2 subset of GLUE
glue_metric = load('glue', 'sst2')

# References are the ground-truth labels; predictions are the model's outputs
references = [0, 1]
predictions = [0, 1]
results = glue_metric.compute(predictions=predictions, references=references)
```
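To see the same two steps on an actual GLUE subset, the sketch below (an added illustration, not an official example) loads the SST-2 validation split with the 🤗 Datasets library and scores a trivial always-zero baseline against its labels; the dataset name, split, and `label` column are assumptions based on the standard GLUE dataset layout on the Hub.

```python
from datasets import load_dataset
from evaluate import load

# Assumption: the GLUE/SST-2 validation split is available on the Hub with a "label" column
sst2 = load_dataset("glue", "sst2", split="validation")
references = sst2["label"]

# A trivial baseline that always predicts class 0, just to exercise the metric
predictions = [0] * len(references)

glue_metric = load("glue", "sst2")
print(glue_metric.compute(predictions=predictions, references=references))
# {'accuracy': ...} -- the accuracy of the always-zero baseline on SST-2 validation
```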
## Output values

The output of the metric depends on the GLUE subset chosen, consisting of a dictionary that contains one or several of the following metrics:

`accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information).

`f1`: the harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0-1 -- its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.

`pearson`: a measure of the linear relationship between two datasets (see [Pearson correlation](https://huggingface.co/metrics/pearsonr) for more information). Its range is between -1 and +1, with 0 implying no correlation, and -1/+1 implying an exact linear relationship. Positive correlations imply that as x increases, so does y, whereas negative correlations imply that as x increases, y decreases.

`spearmanr`: a nonparametric measure of the monotonicity of the relationship between two datasets (see [Spearman Correlation](https://huggingface.co/metrics/spearmanr) for more information). `spearmanr` has the same range as `pearson`.

`matthews_correlation`: a measure of the quality of binary and multiclass classifications (see [Matthews Correlation](https://huggingface.co/metrics/matthews_correlation) for more information). Its range of values is between -1 and +1, where a coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction.
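As a quick sanity check on the definitions above, the sketch below (not part of the original card; it assumes scikit-learn and SciPy are installed and uses arbitrary toy data) computes the same quantities with their reference implementations.

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from scipy.stats import pearsonr, spearmanr

# Arbitrary toy classification labels
references = [0, 1, 1, 0]
predictions = [0, 1, 0, 0]
print(accuracy_score(references, predictions))     # proportion of correct predictions, in [0, 1]
print(f1_score(references, predictions))           # harmonic mean of precision and recall, in [0, 1]
print(matthews_corrcoef(references, predictions))  # in [-1, +1]; +1 is a perfect prediction

# Arbitrary toy regression-style scores (as used by stsb)
reference_scores = [0.0, 1.0, 2.0, 3.0]
predicted_scores = [0.1, 0.9, 2.2, 2.8]
print(pearsonr(predicted_scores, reference_scores)[0])   # linear correlation, in [-1, +1]
print(spearmanr(predicted_scores, reference_scores)[0])  # rank (monotonic) correlation, in [-1, +1]
```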
The `cola` subset returns `matthews_correlation`, the `stsb` subset returns `pearson` and `spearmanr`, the `mrpc` and `qqp` subsets return both `accuracy` and `f1`, and all other subsets of GLUE return only `accuracy`.
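This mapping can be verified directly by loading a few subsets and inspecting the keys of the dictionaries they return, as in the sketch below (an added illustration with arbitrary toy inputs):

```python
from evaluate import load

# Toy inputs matched to each subset's expected type:
# integer class labels for classification subsets, floats for stsb
toy_inputs = {
    "cola": ([0, 1, 1], [0, 1, 0]),
    "stsb": ([0.5, 1.5, 2.5], [0.0, 2.0, 3.0]),
    "mrpc": ([0, 1, 1], [0, 1, 1]),
    "sst2": ([0, 1], [0, 1]),
}

for subset, (predictions, references) in toy_inputs.items():
    metric = load("glue", subset)
    results = metric.compute(predictions=predictions, references=references)
    print(subset, sorted(results))
# cola ['matthews_correlation']
# stsb ['pearson', 'spearmanr']
# mrpc ['accuracy', 'f1']
# sst2 ['accuracy']
```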
### Values from popular papers
The [original GLUE paper](https://arxiv.org/abs/1804.07461) reported average scores ranging from 58 to 64%, depending on the model used (with all evaluation values scaled by 100 to make computing the average possible).
For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/glue).
## Examples

Maximal values for the MRPC subset (which outputs `accuracy` and `f1`):

```python
from evaluate import load
glue_metric = load('glue', 'mrpc')  # 'mrpc' or 'qqp'
references = [0, 1]
predictions = [0, 1]
results = glue_metric.compute(predictions=predictions, references=references)
print(results)
{'accuracy': 1.0, 'f1': 1.0}
```
Minimal values for the STSB subset (which outputs `pearson` and `spearmanr`):

```python
from evaluate import load
glue_metric = load('glue', 'stsb')
references = [0., 1., 2., 3., 4., 5.]
predictions = [-10., -11., -12., -13., -14., -15.]
results = glue_metric.compute(predictions=predictions, references=references)
print(results)
{'pearson': -1.0, 'spearmanr': -1.0}
```
Partial match for the COLA subset (which outputs `matthews_correlation`):

```python
from evaluate import load
glue_metric = load('glue', 'cola')
references = [0, 1]
predictions = [1, 1]
results = glue_metric.compute(predictions=predictions, references=references)
print(results)
{'matthews_correlation': 0.0}
```
## Limitations and bias

This metric works only with datasets that have the same format as the [GLUE dataset](https://huggingface.co/datasets/glue).

While the GLUE dataset is meant to represent "General Language Understanding", the tasks represented in it are not necessarily representative of language understanding, and should not be interpreted as such.

Also, while the GLUE subtasks were considered challenging when the benchmark was created in 2019, they are no longer considered as such given the impressive progress made since then. A more complex (or "stickier") version of it, called [SuperGLUE](https://huggingface.co/datasets/super_glue), was subsequently created.
## Citation

```bibtex
@inproceedings{wang2019glue,
  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
  note={In the Proceedings of ICLR.},
  year={2019}
}
```
## Further References

- [GLUE benchmark homepage](https://gluebenchmark.com/)
- [Fine-tuning a model with the Trainer API](https://huggingface.co/course/chapter3/3?)