# TinyWay-1.2.0
TinyWay-1.2.0 is a lightweight GPT-style causal language model (~110M parameters) trained from scratch on a mixed streaming corpus (web text, stories, and code). The model is designed for research, experimentation, and educational purposes, with an emphasis on transparent architecture and reproducible training.
Trained end-to-end on Kaggle using a custom PyTorch pipeline with mixed precision, gradient accumulation, and streaming datasets.

## Model Overview
| Property | Value |
|---|---|
| Model type | Decoder-only Transformer (GPT-style) |
| Parameters | ~109.6M |
| Layers | 10 |
| Hidden size | 768 |
| Attention heads | 12 |
| Context length | 256 tokens |
| Activation | GELU |
| Dropout | 0.1 |
| Precision | fp16 / bf16 |
| Weight tying | Token embedding tied with LM head |
| Position encoding | Learned absolute embeddings |
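For reference, here is a minimal, hypothetical sketch of a configuration matching the table above, together with the weight-tying trick; the class and field names are illustrative assumptions, not the actual contents of `modeling_tinyway.py`:

```python
from dataclasses import dataclass

import torch.nn as nn

@dataclass
class TinyWayConfig:  # hypothetical name, for illustration only
    vocab_size: int = 50257
    n_layer: int = 10
    n_head: int = 12
    n_embd: int = 768
    n_positions: int = 256  # context length
    dropout: float = 0.1

config = TinyWayConfig()

# Weight tying: the LM head shares its weight matrix with the token embedding,
# saving roughly 38.6M parameters compared to an untied head (50257 * 768).
tok_emb = nn.Embedding(config.vocab_size, config.n_embd)
lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
lm_head.weight = tok_emb.weight  # same tensor object, updated jointly
```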
## Training Details

### Dataset
The model was trained on streaming data from the Hugging Face dataset `shivendrra/consolidated-datasets`, which mixes:

- Web text
- Stories
- Code

Streaming keeps local storage requirements small and allows continuous sampling directly from the Hugging Face Hub, as sketched below.
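A minimal sketch of this kind of streaming pipeline with the `datasets` library; the split name, the `text` field, and the shuffle buffer are assumptions, not confirmed details of the actual training code:

```python
from datasets import load_dataset

# Stream examples lazily from the Hugging Face Hub instead of downloading
# the full corpus. The "train" split and "text" field are assumptions.
stream = load_dataset(
    "shivendrra/consolidated-datasets",
    split="train",
    streaming=True,
)

# Shuffle with a small in-memory buffer so consecutive samples are mixed.
stream = stream.shuffle(seed=42, buffer_size=10_000)

for i, example in enumerate(stream):
    print(example["text"][:80])
    if i >= 2:
        break
```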
### Tokenization

- Tokenizer: `GPT2TokenizerFast`
- Vocabulary size: 50,257
- Special tokens: `bos_token_id = eos_token_id = pad_token_id = 50256`
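A minimal snippet reproducing this tokenizer setup; GPT-2 ships without a dedicated pad token, so the EOS token (id 50256) is reused for padding:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# GPT-2 has no pad token; reuse EOS (<|endoftext|>, id 50256) so that
# bos/eos/pad all map to the same id, as listed above.
tokenizer.pad_token = tokenizer.eos_token

print(tokenizer.vocab_size)    # 50257
print(tokenizer.bos_token_id)  # 50256
print(tokenizer.eos_token_id)  # 50256
print(tokenizer.pad_token_id)  # 50256
```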
### Training Configuration
| Setting | Value |
|---|---|
| Sequence length | 256 |
| Effective batch size | 64 sequences |
| Optimizer | AdamW |
| Learning rate | 3e-4 (cosine decay + warmup) |
| Betas | (0.9, 0.95) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Mixed precision | AMP (fp16 / bf16) |
| Gradient accumulation | Yes |
| Training steps | ~60k |
| Total tokens | ~1B |
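A sketch of an optimizer and learning-rate schedule matching the table above; the stand-in model and the 2,000 warmup steps are illustrative guesses, and whether the original pipeline used the `get_cosine_schedule_with_warmup` helper from `transformers` is an assumption:

```python
import torch
from torch import nn
from transformers import get_cosine_schedule_with_warmup

model = nn.Linear(768, 768)  # stand-in for the actual TinyWay model

# AdamW with the betas and weight decay from the table.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

# Cosine decay with linear warmup; the warmup length here is a guess.
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,
    num_training_steps=60_000,
)
```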
Final training loss ≈ 3.0, which corresponds to a perplexity of exp(3.0) ≈ 20.
## Usage

### Load with Transformers (Custom Code Required)

This repository uses a custom model definition (`modeling_tinyway.py`). Make sure it is available in your environment before loading.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Depending on how the repository registers its custom class, you may also
# need trust_remote_code=True here (assumption; check the repository files).
model = AutoModelForCausalLM.from_pretrained("NNEngine/TinyWay-1.2.0")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```
### Text Generation Example

```python
import torch

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample with temperature plus top-k / nucleus (top-p) filtering.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Example Generations
The model demonstrates:
- ✅ Coherent sentence structure
- ✅ Narrative flow in stories
- ✅ Reasonable grammar and punctuation
- ⚠️ Occasional repetition and topic drift (expected at this scale)

This is a research-grade small LLM; it is not instruction-aligned.
## Limitations

- ❌ Not instruction-tuned
- ❌ Limited reasoning depth compared to large LLMs
- ❌ Context length limited to 256 tokens
- ⚠️ May hallucinate or generate inconsistent facts
- ⚠️ Training data may contain noise from web sources
Use responsibly.
## Intended Use
- Research experiments
- Educational purposes
- Model scaling studies
- Training pipeline benchmarking
- Custom fine-tuning experiments
Not recommended for production or safety-critical applications.
## Reproducibility
The model was trained using:
- Custom PyTorch training loop
- Streaming datasets via the Hugging Face Hub
- Mixed precision training
- Gradient accumulation (see the simplified loop sketch below)
- Periodic checkpointing
- Full monitoring (loss, perplexity, gradient norm, attention entropy)
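As a rough illustration of how the mixed-precision and accumulation pieces fit together, here is a minimal sketch of one core loop with AMP, gradient accumulation, clipping, and loss/perplexity logging; the stand-in model, random batches, and accumulation factor are placeholders, not the actual training code (assumes a CUDA GPU):

```python
import math

import torch
from torch import nn

model = nn.Linear(768, 50257).cuda()      # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 8                           # illustrative accumulation factor

def get_batch():
    # Stand-in for the streaming dataloader: random features and targets.
    x = torch.randn(8, 768, device="cuda")
    y = torch.randint(0, 50257, (8,), device="cuda")
    return x, y

for step in range(1, 101):
    x, y = get_batch()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(x)
        loss = nn.functional.cross_entropy(logits, y) / accum_steps
    scaler.scale(loss).backward()         # gradients accumulate across micro-steps

    if step % accum_steps == 0:
        scaler.unscale_(optimizer)        # so clipping sees the true gradient norms
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)

        full_loss = loss.item() * accum_steps
        print(f"step {step}: loss={full_loss:.3f}, ppl={math.exp(full_loss):.1f}")
```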
If you'd like the full training code or configs, feel free to reach out.
## License
This model follows the license of the underlying datasets and tokenizer. Please ensure compliance before commercial usage.
## Acknowledgements

- Hugging Face 🤗
- PyTorch
- Kaggle
- GPT-2 tokenizer
- Open research community