---
title: AGI Multi-Model API
emoji: 😻
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: Multi-model LLM API with caching and web search
---

# AGI Multi-Model API

A high-performance FastAPI server for running multiple LLMs with intelligent in-memory caching and web search capabilities. Switch between models instantly and augment responses with real-time web data.

## ✨ Features

- **🔄 Dynamic Model Switching**: Hot-swap between five different LLMs
- **⚡ Intelligent Caching**: LRU cache keeps up to 2 models in memory for instant switching
- **🌐 Web-Augmented Chat**: Real-time web search integration via DuckDuckGo
- **📚 OpenAI-Compatible API**: Drop-in replacement for OpenAI chat completions
- **📖 Auto-Generated Documentation**: Interactive API docs with Swagger UI and ReDoc
- **🚀 Optimized Performance**: Continuous batching and multi-port architecture

## 🤖 Available Models

| Model | Use Case | Size |
|-------|----------|------|
| **deepseek-chat** (default) | General purpose conversation | 7B |
| **mistral-7b** | Financial analysis & summarization | 7B |
| **openhermes-7b** | Advanced instruction following | 7B |
| **deepseek-coder** | Specialized coding assistance | 6.7B |
| **llama-7b** | Lightweight & fast responses | 7B |

## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- `llama-server` (llama.cpp)
- 8GB+ RAM (16GB+ recommended for caching multiple models)

### Installation

```bash
# Clone the repository
git clone <repository-url>
cd AGI

# Install dependencies
pip install -r requirements.txt
# or: uv pip install -r pyproject.toml

# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000
```

### Docker Deployment

```bash
# Build the image
docker build -t agi-api .

# Run the container
docker run -p 8000:8000 agi-api
```

## 📖 API Documentation

Once the server is running, access the interactive documentation:

- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **OpenAPI JSON**: http://localhost:8000/openapi.json

## 🔧 Usage Examples

### Basic Chat Completion

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }
)
print(response.json())
```

### Web-Augmented Chat

```python
response = requests.post(
    "http://localhost:8000/v1/web-chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What are the latest AI developments?"}
        ],
        "max_tokens": 512,
        "max_search_results": 5
    }
)
result = response.json()
print(result["choices"][0]["message"]["content"])
print(f"Sources: {result['web_search']['sources']}")
```

### Switch Models

```python
# Switch to the coding model
response = requests.post(
    "http://localhost:8000/switch-model",
    json={"model_name": "deepseek-coder"}
)
print(response.json())
# Output: {"message": "Switched to model: deepseek-coder (from cache)", "model": "deepseek-coder"}
```

### Check Cache Status

```python
response = requests.get("http://localhost:8000/cache/info")
cache_info = response.json()
print(f"Cached models: {cache_info['current_size']}/{cache_info['max_size']}")
for model in cache_info['cached_models']:
    print(f" - {model['name']} on port {model['port']}")
```

## 🏗️ Architecture

### Model Caching System

The API uses an intelligent LRU (Least Recently Used) cache to manage models in memory:

```
┌───────────────────────────────────────────┐
│ Request: Switch to Model A                │
├───────────────────────────────────────────┤
│ 1. Check if A is current → Skip           │
│ 2. Check cache for A                      │
│    ├─ Cache Hit  → Instant switch (< 1s)  │
│    └─ Cache Miss → Load model (~2-3 min)  │
│ 3. If cache full → Evict LRU model        │
│ 4. Add A to cache                         │
└───────────────────────────────────────────┘
```

**Benefits:**

- First load: ~2-3 minutes (model download + initialization)
- Subsequent switches: < 1 second (from cache)
- Automatic memory management with LRU eviction
- Each model runs on a separate port (8080, 8081, etc.)
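For illustration, here is a minimal Python sketch of the eviction logic described above, built on `collections.OrderedDict`. It is not the code in `app.py`: the `start_llama_server()` name does appear in this project, but the signature shown here, the `stop_llama_server()` helper, and the port bookkeeping are simplifying assumptions.

```python
import os
from collections import OrderedDict

MAX_CACHED_MODELS = int(os.getenv("MAX_CACHED_MODELS", "2"))
BASE_PORT = int(os.getenv("BASE_PORT", "8080"))

# model name -> port of its running llama-server, ordered least -> most recently used
model_cache: OrderedDict[str, int] = OrderedDict()

def start_llama_server(name: str, port: int) -> None:
    """Placeholder: in app.py this spawns a llama-server process for `name` on `port`."""

def stop_llama_server(port: int) -> None:
    """Placeholder: hypothetical helper that terminates the llama-server on `port`."""

def switch_model(name: str) -> str:
    """Switch the active model, loading or evicting cached models as needed."""
    if name in model_cache:
        model_cache.move_to_end(name)  # cache hit: mark as most recently used
        return f"Switched to model: {name} (from cache)"

    if len(model_cache) >= MAX_CACHED_MODELS:
        _evicted, freed_port = model_cache.popitem(last=False)  # evict LRU model
        stop_llama_server(freed_port)

    # Cache miss: start a server on the first port not used by a cached model (~2-3 min).
    used_ports = set(model_cache.values())
    port = next(p for p in range(BASE_PORT, BASE_PORT + MAX_CACHED_MODELS)
                if p not in used_ports)
    start_llama_server(name, port)
    model_cache[name] = port
    return f"Switched to model: {name} (loaded)"
```

`OrderedDict.move_to_end()` and `popitem(last=False)` give constant-time hit and evict operations, which is why an LRU policy is a natural fit when only a couple of model servers can stay resident.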
### Multi-Port Architecture

```
        ┌──────────────────┐
        │  FastAPI Server  │
        │   (Port 8000)    │
        └────────┬─────────┘
                 │
        ┌────────┴─────────┐
        │                  │
┌───────▼────────┐ ┌───────▼────────┐
│  llama-server  │ │  llama-server  │
│  Model A:8080  │ │  Model B:8081  │
└────────────────┘ └────────────────┘
```

## ⚙️ Configuration

### Environment Variables

```bash
# Maximum cached models (default: 2)
MAX_CACHED_MODELS=2

# Base port for llama-server instances (default: 8080)
BASE_PORT=8080

# Default model on startup
DEFAULT_MODEL=deepseek-chat
```

### Model Configuration

Edit `AVAILABLE_MODELS` in `app.py` to add custom models:

```python
AVAILABLE_MODELS = {
    "my-model": "username/model-name-GGUF:model-file.Q4_K_M.gguf"
}
```

## 📊 API Endpoints

### Status & Models

- `GET /` - API status and current model
- `GET /models` - List available models
- `GET /cache/info` - Cache statistics and cached models

### Model Management

- `POST /switch-model` - Switch active model (with caching)

### Chat Completions

- `POST /v1/chat/completions` - Standard chat completions (OpenAI-compatible)
- `POST /v1/web-chat/completions` - Web-augmented chat with search

### Documentation

- `GET /docs` - Swagger UI interactive documentation
- `GET /redoc` - ReDoc alternative documentation
- `GET /openapi.json` - OpenAPI 3.0 specification export

## 🧪 Testing

```bash
# Test basic chat
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Check cache status
curl http://localhost:8000/cache/info

# Switch models
curl -X POST http://localhost:8000/switch-model \
  -H "Content-Type: application/json" \
  -d '{"model_name": "deepseek-coder"}'
```

## 🔍 Web Search Integration

The web-augmented chat endpoint automatically:

1. Extracts the user's query from the last message
2. Performs a DuckDuckGo web search
3. Injects search results into the LLM context
4. Returns the response with source citations

**Use cases:**

- Current events and news
- Recent developments beyond training data
- Fact-checking with live web data
- Research with source attribution
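The README does not pin down the search client used in `app.py`; the sketch below assumes the `duckduckgo_search` package (`DDGS`), and the `build_web_context()` helper and prompt format are hypothetical. It only shows how the four steps above can fit together.

```python
from duckduckgo_search import DDGS  # assumed client; the actual dependency may differ

def build_web_context(messages: list[dict], max_search_results: int = 5) -> tuple[str, list[str]]:
    """Turn the last user message into a web-augmented prompt plus a list of source URLs."""
    query = messages[-1]["content"]                                        # 1. extract the query
    with DDGS() as ddgs:
        results = list(ddgs.text(query, max_results=max_search_results))  # 2. web search

    snippets = "\n".join(f"- {r['title']}: {r['body']}" for r in results)  # 3. build LLM context
    sources = [r["href"] for r in results]                                 # 4. keep citations
    prompt = (
        "Use the following web search results to answer the question.\n"
        f"{snippets}\n\nQuestion: {query}"
    )
    return prompt, sources
```

The endpoint would then send `prompt` to the active llama-server as the user message and return `sources` under the `web_search.sources` field seen in the usage example earlier.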
## 📈 Performance Tips

1. **Cache Size**: Increase `MAX_CACHED_MODELS` if you have sufficient RAM (each model is ~4-5 GB)
2. **CPU Threads**: Adjust the `-t` parameter in `start_llama_server()` based on your CPU core count
3. **Batch Size**: Modify the `-b` parameter to trade throughput against latency
4. **GPU Acceleration**: Set `-ngl` > 0 if you have a GPU (requires llama.cpp built with GPU support)

## 🛠️ Development

### Project Structure

```
AGI/
├── app.py                  # Main FastAPI application
├── client_multi_model.py   # Example client
├── Dockerfile              # Docker configuration
├── pyproject.toml          # Python dependencies
└── README.md               # This file
```

### Adding New Models

1. Find a GGUF model on Hugging Face
2. Add it to the `AVAILABLE_MODELS` dict
3. Restart the server
4. Switch to your new model via the API

## 📝 License

Apache 2.0 - see the LICENSE file for details.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 🐛 Troubleshooting

### Model fails to load

- Ensure `llama-server` is in your PATH
- Check available disk space for model downloads
- Verify internet connectivity for first-time model downloads

### Out of memory errors

- Reduce `MAX_CACHED_MODELS` to 1
- Use smaller quantized models (Q4_K_M instead of Q8)
- Increase system swap space

### Port conflicts

- Change `BASE_PORT` if ports 8080+ are in use
- Check for other llama-server instances: `ps aux | grep llama`

## 📚 Additional Resources

- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [Hugging Face Models](https://huggingface.co/models?library=gguf)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)

---

Built with ❤️ using FastAPI and llama.cpp