Model Card for gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts-GGUF

This repository contains multiple quantized versions of the gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts model in GGUF format.
It is intended for efficient inference on consumer hardware, making large model deployment more accessible.

Model Details

Model Description

  • Developed by: leeminwaan
  • Funded by: Independent project
  • Shared by: leeminwaan
  • Model type: Decoder-only transformer language model
  • Language(s) (NLP): English (primary), multilingual capabilities not benchmarked
  • License: Apache-2.0

Model Sources

  • Repository: https://huggingface.co/leeminwaan/gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts-GGUF
  • Paper: Not available
  • Demo: To be released

How to Get Started with the Model

from huggingface_hub import hf_hub_download

# Download the Q4_K_M quantization from the Hub; swap the filename for any other quantization listed below.
model_path = hf_hub_download(
    repo_id="leeminwaan/gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts-GGUF",
    filename="gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts-q4_k_m.gguf",
)
print("Downloaded:", model_path)

Quantized versions available:

  • Q2_K, Q3_K_S, Q3_K_M, Q3_K_L
  • Q4_0, Q4_1, Q4_K_S, Q4_K_M
  • Q5_0, Q5_1, Q5_K_S, Q5_K_M
  • Q6_K, Q8_0
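
The filenames follow the pattern shown in the download snippet above (lower-case quantization suffix). As a sketch, the available GGUF files can also be listed programmatically with huggingface_hub's list_repo_files; this helper usage is illustrative rather than taken from this card.

from huggingface_hub import list_repo_files

repo_id = "leeminwaan/gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts-GGUF"

# Keep only the GGUF quantization files and print them, one per line.
gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]
for name in sorted(gguf_files):
    print(name)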

Training Details

Training Data

  • Based on gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts pretraining corpus (public large-scale web text, open datasets).
  • No additional fine-tuning was performed for this release.

Training Procedure

  • The original gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts weights were converted to GGUF and quantized to the formats listed in the table below, as sketched after this list.
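
A minimal sketch of that conversion and quantization step, assuming the standard llama.cpp tooling (convert_hf_to_gguf.py and llama-quantize); these are not the author's exact commands, and script names and flags vary between llama.cpp versions.

import subprocess

# Hypothetical local paths; adjust to your checkout of the original model and of llama.cpp.
hf_model_dir = "gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts"
f16_gguf = "gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts-f16.gguf"

# 1. Convert the Hugging Face checkpoint to a full-precision GGUF file.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", hf_model_dir, "--outfile", f16_gguf, "--outtype", "f16"],
    check=True,
)

# 2. Quantize the FP16 GGUF into each target type with llama.cpp's llama-quantize tool.
for quant in ["Q4_K_M", "Q5_K_M", "Q8_0"]:
    out_file = f16_gguf.replace("-f16.gguf", f"-{quant.lower()}.gguf")
    subprocess.run(["llama-quantize", f16_gguf, out_file, quant], check=True)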

Quantization Results

| Quantization | Size (vs. FP16) | Speed    | Quality    | Recommended For                      |
|--------------|-----------------|----------|------------|--------------------------------------|
| Q2_K         | Smallest        | Fastest  | Low        | Prototyping, minimal RAM/CPU         |
| Q3_K_S       | Very small      | Very fast| Low-Med    | Lightweight devices, testing         |
| Q3_K_M       | Small           | Fast     | Med        | Lightweight, slightly better quality |
| Q3_K_L       | Small-Med       | Fast     | Med        | Faster inference, fair quality       |
| Q4_0         | Medium          | Fast     | Good       | General use, chats, low RAM          |
| Q4_1         | Medium          | Fast     | Good+      | Recommended, slightly better quality |
| Q4_K_S       | Medium          | Fast     | Good+      | Recommended, balanced                |
| Q4_K_M       | Medium          | Fast     | Good++     | Recommended, best Q4 option          |
| Q5_0         | Larger          | Moderate | Very good  | Chatbots, longer responses           |
| Q5_1         | Larger          | Moderate | Very good+ | More demanding tasks                 |
| Q5_K_S       | Larger          | Moderate | Very good+ | Advanced users, better accuracy      |
| Q5_K_M       | Larger          | Moderate | Excellent  | Demanding tasks, high quality        |
| Q6_K         | Large           | Slower   | Near FP16  | Power users, best quantized quality  |
| Q8_0         | Largest         | Slowest  | FP16-like  | Maximum quality, high RAM/CPU        |

Note:

  • Lower-bit quantizations yield smaller files and faster inference, but lower output quality.
  • Q4_K_M is ideal for most users; Q6_K and Q8_0 offer the highest quality and suit advanced use.
  • All quantizations run on consumer hardware; select one based on your quality/speed trade-off.

Technical Specifications

Software

  • llama.cpp for quantization
  • Python 3.10, huggingface_hub

Citation

BibTeX:

@misc{gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts-GGUF,
  title={gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts-GGUF Quantized Models},
  author={leeminwaan},
  year={2025},
  howpublished={\url{https://huggingface.co/leeminwaan/gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts-GGUF}}
}

APA:

leeminwaan. (2025). gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts-GGUF Quantized Models [Computer software]. Hugging Face. https://huggingface.co/leeminwaan/gpt-oss-4.2b-specialized-all-pruned-moe-only-4-experts-GGUF

Glossary

  • Quantization: Reducing precision of weights to lower memory usage.
  • GGUF: Optimized format for llama.cpp inference.
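
As a rough illustration of why lower-bit quantization shrinks memory use, file size scales roughly with parameters × bits per weight / 8. Real GGUF K-quants also store per-block scales and mix several bit widths, so the figures produced by this sketch are illustrative lower bounds, not measured file sizes.

# Back-of-the-envelope size estimate for a ~4.2B-parameter model at various bit widths.
params = 4.2e9

for label, bits in [("Q2_K", 2), ("Q4_K_M", 4), ("Q6_K", 6), ("Q8_0", 8), ("FP16", 16)]:
    gib = params * bits / 8 / 2**30  # bytes -> GiB
    print(f"{label:>7}: ~{gib:.1f} GiB")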

More Information

  • This project is experimental.
  • Expect further updates and quantization benchmarks.

Model Card Authors

  • leeminwaan

Model Card Contact

  • leeminwaan
