---
title: AGI Multi-Model API
emoji: 💻
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: Multi-model LLM API with caching and web search
---
# AGI Multi-Model API
A high-performance FastAPI server for running multiple LLM models with intelligent in-memory caching and web search capabilities. Switch between models instantly and augment responses with real-time web data.
## Features

- **Dynamic Model Switching**: Hot-swap between 5 different LLM models
- **Intelligent Caching**: LRU cache keeps up to 2 models in memory for instant switching
- **Web-Augmented Chat**: Real-time web search integration via DuckDuckGo
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI chat completions
- **Auto-Generated Documentation**: Interactive API docs with Swagger UI and ReDoc
- **Optimized Performance**: Continuous batching and multi-port architecture
## Available Models
| Model | Use Case | Size |
|---|---|---|
| deepseek-chat (default) | General purpose conversation | 7B |
| mistral-7b | Financial analysis & summarization | 7B |
| openhermes-7b | Advanced instruction following | 7B |
| deepseek-coder | Specialized coding assistance | 6.7B |
| llama-7b | Lightweight & fast responses | 7B |
## Quick Start

### Prerequisites

- Python 3.10+
- `llama-server` (llama.cpp)
- 8GB+ RAM (16GB+ recommended for caching multiple models)

### Installation

```bash
# Clone the repository
git clone <your-repo-url>
cd AGI

# Install dependencies
pip install -r requirements.txt
# or
uv pip install -r pyproject.toml

# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000
```

### Docker Deployment

```bash
# Build the image
docker build -t agi-api .

# Run the container
docker run -p 8000:8000 agi-api
```
## API Documentation
Once the server is running, access the interactive documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- OpenAPI JSON: http://localhost:8000/openapi.json
## Usage Examples

### Basic Chat Completion

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }
)
print(response.json())
```

### Web-Augmented Chat

```python
response = requests.post(
    "http://localhost:8000/v1/web-chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What are the latest AI developments?"}
        ],
        "max_tokens": 512,
        "max_search_results": 5
    }
)
result = response.json()
print(result["choices"][0]["message"]["content"])
print(f"Sources: {result['web_search']['sources']}")
```

### Switch Models

```python
# Switch to coding model
response = requests.post(
    "http://localhost:8000/switch-model",
    json={"model_name": "deepseek-coder"}
)
print(response.json())
# Output: {"message": "Switched to model: deepseek-coder (from cache)", "model": "deepseek-coder"}
```

### Check Cache Status

```python
response = requests.get("http://localhost:8000/cache/info")
cache_info = response.json()
print(f"Cached models: {cache_info['current_size']}/{cache_info['max_size']}")
for model in cache_info['cached_models']:
    print(f"  - {model['name']} on port {model['port']}")
```
## Architecture

### Model Caching System

The API uses an intelligent LRU (Least Recently Used) cache to manage models in memory:

```
┌───────────────────────────────────────────┐
│ Request: Switch to Model A                │
├───────────────────────────────────────────┤
│ 1. Check if A is current → Skip           │
│ 2. Check cache for A                      │
│    ├─ Cache Hit  → Instant switch (< 1s)  │
│    └─ Cache Miss → Load model (~2-3 min)  │
│ 3. If cache full → Evict LRU model        │
│ 4. Add A to cache                         │
└───────────────────────────────────────────┘
```
Benefits:
- First load: ~2-3 minutes (model download + initialization)
- Subsequent switches: < 1 second (from cache)
- Automatic memory management with LRU eviction
- Each model runs on a separate port (8080, 8081, etc.)
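
In code, the cache boils down to an ordered map from model name to the port of its running `llama-server` process. The sketch below is illustrative only; the class and method names are not the actual ones in `app.py`:

```python
from collections import OrderedDict

class ModelCache:
    """Illustrative LRU cache of running llama-server instances (not the actual app.py code)."""

    def __init__(self, max_size: int = 2):
        self.max_size = max_size
        self.models: OrderedDict[str, int] = OrderedDict()  # model name -> backend port

    def get(self, name: str) -> int | None:
        """Return the port of a cached model and mark it most recently used."""
        if name in self.models:
            self.models.move_to_end(name)  # cache hit: refresh recency, no reload needed
            return self.models[name]
        return None  # cache miss: the caller has to start a new llama-server

    def put(self, name: str, port: int) -> None:
        """Register a freshly loaded model, evicting the LRU entry if the cache is full."""
        if len(self.models) >= self.max_size:
            evicted_name, _ = self.models.popitem(last=False)  # least recently used
            # the real server would also stop the evicted llama-server process here
        self.models[name] = port
        self.models.move_to_end(name)
```

A cache hit only reorders the dictionary, which is why switching back to a cached model takes well under a second.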
### Multi-Port Architecture

```
        ┌──────────────────────┐
        │    FastAPI Server    │
        │     (Port 8000)      │
        └──────────┬───────────┘
                   │
     ┌─────────────┴─────────────┐
     │                           │
┌────▼───────────┐         ┌─────▼──────────┐
│  llama-server  │         │  llama-server  │
│  Model A:8080  │         │  Model B:8081  │
└────────────────┘         └────────────────┘
```
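
The FastAPI layer only needs to know which backend port belongs to the currently active model and forward OpenAI-style requests to it. A minimal sketch of that forwarding step, assuming a module-level `ACTIVE_PORT` placeholder (the real `app.py` may proxy differently, e.g. with streaming):

```python
import requests
from fastapi import FastAPI

app = FastAPI()
ACTIVE_PORT = 8080  # placeholder: updated whenever /switch-model activates another model

@app.post("/v1/chat/completions")
def chat_completions(payload: dict):
    # Forward the OpenAI-style request body to the llama-server hosting the active model.
    backend = requests.post(
        f"http://127.0.0.1:{ACTIVE_PORT}/v1/chat/completions",
        json=payload,
        timeout=300,
    )
    return backend.json()
```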
## Configuration

### Environment Variables

```bash
# Maximum cached models (default: 2)
MAX_CACHED_MODELS=2

# Base port for llama-server instances (default: 8080)
BASE_PORT=8080

# Default model on startup
DEFAULT_MODEL=deepseek-chat
```
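
Inside the application these would typically be read once at startup; a minimal sketch of that pattern (the exact handling in `app.py` may differ):

```python
import os

# Read configuration from the environment, falling back to the documented defaults.
MAX_CACHED_MODELS = int(os.getenv("MAX_CACHED_MODELS", "2"))
BASE_PORT = int(os.getenv("BASE_PORT", "8080"))
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "deepseek-chat")
```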
### Model Configuration

Edit `AVAILABLE_MODELS` in `app.py` to add custom models:

```python
AVAILABLE_MODELS = {
    "my-model": "username/model-name-GGUF:model-file.Q4_K_M.gguf"
}
```
## API Endpoints

### Status & Models

- `GET /` - API status and current model
- `GET /models` - List available models
- `GET /cache/info` - Cache statistics and cached models
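
For example, from Python (the field names in the responses are indicative, not guaranteed):

```python
import requests

# API status and the currently active model
print(requests.get("http://localhost:8000/").json())

# All models the server can switch to
print(requests.get("http://localhost:8000/models").json())
```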
### Model Management

- `POST /switch-model` - Switch active model (with caching)

### Chat Completions

- `POST /v1/chat/completions` - Standard chat completions (OpenAI-compatible)
- `POST /v1/web-chat/completions` - Web-augmented chat with search

### Documentation

- `GET /docs` - Swagger UI interactive documentation
- `GET /redoc` - ReDoc alternative documentation
- `GET /openapi.json` - OpenAPI 3.0 specification export
## Testing

```bash
# Test basic chat
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Check cache status
curl http://localhost:8000/cache/info

# Switch models
curl -X POST http://localhost:8000/switch-model \
  -H "Content-Type: application/json" \
  -d '{"model_name": "deepseek-coder"}'
```
## Web Search Integration
The web-augmented chat endpoint automatically:
- Extracts the user's query from the last message
- Performs a DuckDuckGo web search
- Injects search results into the LLM context
- Returns response with source citations
Use cases:
- Current events and news
- Recent developments beyond training data
- Fact-checking with live web data
- Research with source attribution
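
Conceptually, the four steps above reduce to a small helper like the sketch below. It assumes the `duckduckgo_search` package and is not the literal code from `app.py`; the prompt format and helper name are illustrative:

```python
from duckduckgo_search import DDGS

def build_web_context(messages: list[dict], max_results: int = 5) -> tuple[str, list[str]]:
    """Search the web for the last user message and return (context_block, source_urls)."""
    query = messages[-1]["content"]                          # 1. extract the user's query
    results = DDGS().text(query, max_results=max_results)    # 2. DuckDuckGo search
    snippets = [f"- {r['title']}: {r['body']} ({r['href']})" for r in results]
    sources = [r["href"] for r in results]
    context = "Web search results:\n" + "\n".join(snippets)  # 3. injected into the LLM context
    return context, sources                                  # 4. sources returned for citation
```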
## Performance Tips

- **Cache Size**: Increase `MAX_CACHED_MODELS` if you have sufficient RAM (each model ~4-5GB)
- **CPU Threads**: Adjust the `-t` parameter in `start_llama_server()` based on your CPU cores
- **Batch Size**: Modify the `-b` parameter for the throughput vs. latency tradeoff
- **GPU Acceleration**: Set `-ngl` > 0 if you have a GPU (requires llama.cpp with GPU support)
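
These flags are passed straight through to `llama-server`. A sketch of how such an invocation might be assembled (the actual `start_llama_server()` in `app.py` may build the command differently; the model path here is a placeholder):

```python
import subprocess

def start_llama_server(model_path: str, port: int) -> subprocess.Popen:
    """Launch one llama-server instance with illustrative tuning flags."""
    cmd = [
        "llama-server",
        "-m", model_path,     # GGUF model file (placeholder path)
        "--port", str(port),  # one port per cached model (8080, 8081, ...)
        "-t", "8",            # CPU threads: match your physical core count
        "-b", "512",          # batch size: larger favors throughput over latency
        "-ngl", "0",          # GPU layers: set > 0 to offload layers to a GPU build
    ]
    return subprocess.Popen(cmd)
```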
## Development

### Project Structure

```
AGI/
├── app.py                   # Main FastAPI application
├── client_multi_model.py    # Example client
├── Dockerfile               # Docker configuration
├── pyproject.toml           # Python dependencies
└── README.md                # This file
```

### Adding New Models

1. Find a GGUF model on HuggingFace
2. Add it to the `AVAILABLE_MODELS` dict
3. Restart the server
4. Switch to your new model via the API
## License
Apache 2.0 - See LICENSE file for details
## Contributing
Contributions welcome! Please feel free to submit a Pull Request.
## Troubleshooting

### Model fails to load

- Ensure `llama-server` is in your PATH
- Check available disk space for model downloads
- Verify internet connectivity for first-time model downloads

### Out of memory errors

- Reduce `MAX_CACHED_MODELS` to 1
- Use smaller quantized models (Q4_K_M instead of Q8)
- Increase system swap space

### Port conflicts

- Change `BASE_PORT` if 8080+ are in use
- Check for other llama-server instances: `ps aux | grep llama`
## Additional Resources
Built with ❤️ using FastAPI and llama.cpp