---
language: en
license: apache-2.0
tags:
- question-answering
- bert
- squad
- extractive-qa
- baseline
datasets:
- squad
metrics:
- f1
- exact_match
model-index:
- name: bert-base-uncased-squad-baseline
  results:
  - task:
      type: question-answering
      name: Question Answering
    dataset:
      name: SQuAD 1.1
      type: squad
      split: validation
    metrics:
    - type: exact_match
      value: 79.45
      name: Exact Match
    - type: f1
      value: 87.41
      name: F1 Score
---

# BERT Base Uncased - SQuAD 1.1 Baseline

This model is a fine-tuned version of [bert-base-uncased](https://huggingface.co/bert-base-uncased) on the SQuAD 1.1 dataset for extractive question answering.

## Model Description

This is **BERT (Bidirectional Encoder Representations from Transformers)** fine-tuned on the Stanford Question Answering Dataset (SQuAD 1.1) for extractive question answering: locating the answer span within a given context passage.

- **Model Type:** Question Answering (Extractive)
- **Base Model:** `bert-base-uncased`
- **Language:** English
- **License:** Apache 2.0
- **Fine-tuned on:** SQuAD 1.1
- **Parameters:** 108,893,186 (all trainable)

## Intended Use

### Primary Use Cases

This model is designed for extractive question answering tasks where:
- The answer exists as a contiguous span of text within the provided context
- Questions are factual and answerable from the context
- The input text is in English

### Example Usage

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Load model and tokenizer
model = AutoModelForQuestionAnswering.from_pretrained("G20-CS4248/bert-baseline-qa")
tokenizer = AutoTokenizer.from_pretrained("G20-CS4248/bert-baseline-qa")

# Create QA pipeline
qa_pipeline = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer
)

# Ask a question
context = """
The Amazon rainforest is a moist broadleaf tropical rainforest in the Amazon biome 
that covers most of the Amazon basin of South America. This basin encompasses 
7,000,000 km2 (2,700,000 sq mi), of which 5,500,000 km2 (2,100,000 sq mi) are 
covered by the rainforest.
"""

question = "How large is the Amazon basin?"

result = qa_pipeline(question=question, context=context)

print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.4f}")
```

**Output:**
```
Answer: 7,000,000 km2
Confidence: 0.9234
```

### Direct Model Usage (without pipeline)

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model = AutoModelForQuestionAnswering.from_pretrained("G20-CS4248/bert-baseline-qa")
tokenizer = AutoTokenizer.from_pretrained("G20-CS4248/bert-baseline-qa")

question = "What is the capital of France?"
context = "Paris is the capital and largest city of France."

# Tokenize
inputs = tokenizer(question, context, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    
# Get answer span (independent argmax; this does not enforce end >= start,
# see "Output Format" below for the constrained selection)
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1

answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(inputs.input_ids[0][answer_start:answer_end])
)

print(f"Answer: {answer}")
```

## Training Data

### Dataset: SQuAD 1.1

The Stanford Question Answering Dataset (SQuAD) v1.1 consists of questions posed by crowdworkers on a set of Wikipedia articles.

**Training Set:**
- **Examples:** 87,599
- **Average question length:** 10.06 words
- **Average context length:** 119.76 words  
- **Average answer length:** 3.16 words

**Validation Set:**
- **Examples:** 10,570
- **Average question length:** 10.22 words
- **Average context length:** 123.95 words
- **Average answer length:** 3.02 words

### Data Preprocessing

- **Tokenizer:** `bert-base-uncased`
- **Max sequence length:** 384 tokens
- **Stride:** 128 tokens (for handling long contexts)
- **Padding:** Maximum length
- **Truncation:** Only second sequence (context)

Long contexts are split into multiple features with overlapping windows to ensure answers aren't lost at sequence boundaries.
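
A minimal sketch of this windowing step, assuming the standard `transformers` fast-tokenizer API (the actual preprocessing script is not reproduced here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "How large is the Amazon basin?"
# Illustrative context long enough to overflow a single 384-token window
long_context = " ".join(
    ["The Amazon rainforest covers most of the Amazon basin of South America."] * 120
)

encoded = tokenizer(
    question,
    long_context,
    max_length=384,
    stride=128,                      # overlap between consecutive windows
    truncation="only_second",        # truncate the context, never the question
    padding="max_length",
    return_overflowing_tokens=True,  # one feature per overlapping window
    return_offsets_mapping=True,     # map tokens back to character positions
)

print(f"Features produced from one example: {len(encoded['input_ids'])}")
```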

## Training Procedure

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| **Base model** | bert-base-uncased |
| **Optimizer** | AdamW |
| **Learning rate** | 3e-5 |
| **Learning rate schedule** | Linear with warmup |
| **Warmup ratio** | 0.1 (10% of training) |
| **Weight decay** | 0.01 |
| **Batch size (train)** | 8 |
| **Batch size (eval)** | 8 |
| **Number of epochs** | 1 |
| **Mixed precision** | FP16 (enabled) |
| **Gradient accumulation** | 1 |
| **Max gradient norm** | 1.0 |
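
These settings map onto `transformers.TrainingArguments` roughly as follows (a sketch under the assumptions above, not the exact training script; the output directory is illustrative):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-baseline-qa",    # illustrative
    learning_rate=3e-5,
    lr_scheduler_type="linear",       # linear decay after warmup
    warmup_ratio=0.1,
    weight_decay=0.01,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    fp16=True,
    gradient_accumulation_steps=1,
    max_grad_norm=1.0,
    eval_strategy="epoch",            # `evaluation_strategy` on older versions
)
```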

### Training Environment

- **Hardware:** NVIDIA GPU (CUDA enabled)
- **Framework:** PyTorch with Transformers library
- **Training time:** ~29.5 minutes (1 epoch)
- **Training samples/second:** 44.95
- **Total FLOPs:** 14,541,777 GFLOPs (~1.45 × 10¹⁶ FLOPs)

### Training Metrics

- **Final training loss:** 1.2236
- **Evaluation strategy:** End of epoch
- **Metric for best model:** Evaluation loss

## Performance

### Evaluation Results

Evaluated on SQuAD 1.1 validation set (10,570 examples):

| Metric | Score |
|--------|-------|
| **Exact Match (EM)** | **79.45%** |
| **F1 Score** | **87.41%** |

### Metric Explanations

- **Exact Match (EM):** Percentage of predictions that match a ground-truth answer exactly after normalization (lowercasing, removing punctuation and articles)
- **F1 Score:** Token-level F1 measuring the overlap between the predicted and ground-truth answers (a sketch of computing both follows this list)
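
Both can be reproduced with the `evaluate` library's `squad` metric; a minimal sketch (the example ID and texts are illustrative):

```python
import evaluate

squad_metric = evaluate.load("squad")

predictions = [
    {"id": "ex-1", "prediction_text": "7,000,000 km2"}
]
references = [
    {"id": "ex-1",
     "answers": {"text": ["7,000,000 km2 (2,700,000 sq mi)"],
                 "answer_start": [104]}}
]

results = squad_metric.compute(predictions=predictions, references=references)
print(results)  # {'exact_match': ..., 'f1': ...} on a 0-100 scale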

### Comparison to BERT Base Performance

| Model | EM | F1 | Training |
|-------|----|----|----------|
| **This model (1 epoch)** | 79.45 | 87.41 | 29.5 min |
| BERT Base (original paper, 3 epochs) | 80.8 | 88.5 | ~2-3 hours |
| BERT Base (fully trained) | 81-84 | 88-91 | ~2-3 hours |

**Note:** This is a baseline model trained for only 1 epoch. Performance can be improved with additional training epochs.

### Performance by Question Type

The model performs well on:
- ✅ Factual questions (What, When, Where, Who)
- ✅ Short answer spans (1-5 words)
- ✅ Questions with clear context

May struggle with:
- ⚠️ Questions requiring reasoning across multiple sentences
- ⚠️ Very long answer spans
- ⚠️ Ambiguous questions with multiple valid answers
- ⚠️ Questions requiring world knowledge not in context

## Limitations and Biases

### Known Limitations

1. **Extractive Only:** Can only extract answers present in the context; cannot generate or synthesize answers
2. **Single Answer:** Provides only one answer span, even if multiple valid answers exist
3. **Context Dependency:** Requires relevant context; cannot answer from general knowledge
4. **Length Constraints:** Limited to 384 tokens per context window
5. **English Only:** Trained on English text; not suitable for other languages
6. **Training Duration:** Only 1 epoch of training; may underfit compared to longer training

### Potential Biases

- **Domain Bias:** Trained primarily on Wikipedia articles; may perform worse on other text types (news, technical docs, etc.)
- **Temporal Bias:** Training data from 2016; may have outdated information
- **Cultural Bias:** Reflects biases present in Wikipedia content
- **Answer Position Bias:** May favor answers appearing in certain positions within context
- **BERT Base Biases:** Inherits any biases from the pre-trained BERT base model

### Out-of-Scope Use

This model should NOT be used for:
- ❌ Medical, legal, or financial advice
- ❌ High-stakes decision making
- ❌ Generative question answering (creating new answers)
- ❌ Non-English languages
- ❌ Yes/no or multiple choice questions (without adaptation)
- ❌ Questions requiring reasoning beyond the context
- ❌ Real-time fact checking or verification

## Technical Specifications

### Model Architecture

```
BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings
    (encoder): BertEncoder (12 layers)
    # no pooler: BertForQuestionAnswering builds its BertModel with add_pooling_layer=False
  )
  (qa_outputs): Linear(768 -> 2)  # Start and end position logits
)
```

- **Hidden size:** 768
- **Attention heads:** 12
- **Intermediate size:** 3072
- **Hidden layers:** 12
- **Vocabulary size:** 30,522
- **Max position embeddings:** 512
- **Total parameters:** 108,893,186 (verifiable with the snippet below)
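
The parameter count can be sanity-checked directly:

```python
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("G20-CS4248/bert-baseline-qa")
print(f"{sum(p.numel() for p in model.parameters()):,}")  # expected: 108,893,186
```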

### Input Format

The model expects tokenized input with the following layout (see the example after this list):
- Question and context concatenated, separated by `[SEP]` tokens
- Format: `[CLS] question [SEP] context [SEP]`
- Token type IDs to distinguish question (0) from context (1)
- Attention mask to identify real vs padding tokens
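
For example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("G20-CS4248/bert-baseline-qa")

enc = tokenizer("What is the capital of France?",
                "Paris is the capital and largest city of France.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'what', 'is', ..., '[SEP]', 'paris', 'is', ..., '[SEP]']
print(enc["token_type_ids"])  # 0 for question tokens, 1 for context tokens
```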

### Output Format

Returns:
- `start_logits`: Scores for each token being the start of the answer span
- `end_logits`: Scores for each token being the end of the answer span

The predicted answer is the span from the token with the highest start logit to the token with the highest end logit, subject to end ≥ start (a selection sketch follows).
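
A minimal sketch of that constrained selection, scoring each candidate span by `start_logit + end_logit` (the 30-token length cap is an assumption, typical of SQuAD post-processing, not taken from the training script):

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model = AutoModelForQuestionAnswering.from_pretrained("G20-CS4248/bert-baseline-qa")
tokenizer = AutoTokenizer.from_pretrained("G20-CS4248/bert-baseline-qa")

inputs = tokenizer("What is the capital of France?",
                   "Paris is the capital and largest city of France.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

start_logits = outputs.start_logits[0]
end_logits = outputs.end_logits[0]

max_answer_len = 30  # assumed cap on answer length, in tokens
best_score, best_span = float("-inf"), (0, 0)
for s in range(len(start_logits)):
    for e in range(s, min(s + max_answer_len, len(end_logits))):
        score = (start_logits[s] + end_logits[e]).item()
        if score > best_score:  # keep the best-scoring span with e >= s
            best_score, best_span = score, (s, e)

s, e = best_span
print(tokenizer.decode(inputs["input_ids"][0][s:e + 1]))
```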

## Evaluation Data

**SQuAD 1.1 Validation Set**
- 10,570 question-context-answer triples
- Same source and format as training data
- Used for final performance evaluation

## Environmental Impact

- **Training hardware:** 1x NVIDIA GPU
- **Training time:** ~29.5 minutes
- **Compute region:** Not specified
- **Carbon footprint:** Estimated to be low, given the single ~30-minute training run

## Model Card Authors

[Your Name / Team Name]

## Model Card Contact

[Your Email / Contact Information]

## Citation

If you use this model, please cite:

```bibtex
@misc{bert-squad-baseline-2025,
  author = {Your Name},
  title = {BERT Base Uncased Fine-tuned on SQuAD 1.1 (Baseline)},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/your-username/bert-squad-baseline}}
}
```

### Original BERT Paper

```bibtex
@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}
```

### SQuAD Dataset

```bibtex
@article{rajpurkar2016squad,
  title={SQuAD: 100,000+ Questions for Machine Comprehension of Text},
  author={Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy},
  journal={arXiv preprint arXiv:1606.05250},
  year={2016}
}
```

## Additional Information

### Future Improvements

Potential enhancements for this baseline model:
- 🔄 Train for additional epochs (2-3 epochs recommended)
- 📈 Increase batch size with gradient accumulation
- 🎯 Tune the learning rate schedule (e.g., cosine decay or a longer warmup)
- 🔍 Add answer validation/verification
- 📊 Ensemble with multiple models
- 🚀 Distillation to smaller model for deployment

### Related Models

- [bert-base-uncased](https://huggingface.co/bert-base-uncased) - Base model
- [bert-large-uncased-whole-word-masking-finetuned-squad](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad) - Larger BERT variant
- [distilbert-base-uncased-distilled-squad](https://huggingface.co/distilbert-base-uncased-distilled-squad) - Smaller, faster variant

### Acknowledgments

- Google Research for BERT
- Stanford NLP for SQuAD dataset
- Hugging Face for Transformers library
- [Your course/institution if applicable]

---

**Last updated:** October 2025  
**Model version:** 1.0 (Baseline)  
**Status:** Baseline model - suitable for development/comparison