---
title: AGI Multi-Model API
emoji: 😻
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: Multi-model LLM API with caching and web search
---

# AGI Multi-Model API

A high-performance FastAPI server for running multiple LLMs with intelligent in-memory caching and web search capabilities. Switch between models instantly and augment responses with real-time web data.

## ✨ Features

- **🔄 Dynamic Model Switching**: Hot-swap between five different LLMs
- **⚡ Intelligent Caching**: LRU cache keeps up to 2 models in memory for instant switching
- **🌐 Web-Augmented Chat**: Real-time web search integration via DuckDuckGo
- **📚 OpenAI-Compatible API**: Drop-in replacement for OpenAI chat completions
- **📖 Auto-Generated Documentation**: Interactive API docs with Swagger UI and ReDoc
- **🚀 Optimized Performance**: Continuous batching and multi-port architecture

## 🤖 Available Models

| Model | Use Case | Size |
|-------|----------|------|
| **deepseek-chat** (default) | General purpose conversation | 7B |
| **mistral-7b** | Financial analysis & summarization | 7B |
| **openhermes-7b** | Advanced instruction following | 7B |
| **deepseek-coder** | Specialized coding assistance | 6.7B |
| **llama-7b** | Lightweight & fast responses | 7B |

## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- `llama-server` (llama.cpp)
- 8GB+ RAM (16GB+ recommended for caching multiple models)

### Installation

```bash
# Clone the repository
git clone <repository-url>
cd AGI

# Install dependencies
pip install -r requirements.txt
# or: uv pip install -r pyproject.toml

# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000
```

### Docker Deployment

```bash
# Build the image
docker build -t agi-api .

# Run the container
docker run -p 8000:8000 agi-api
```

## 📖 API Documentation

Once the server is running, access the interactive documentation:

- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **OpenAPI JSON**: http://localhost:8000/openapi.json

## 🔧 Usage Examples

### Basic Chat Completion

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }
)
print(response.json())
```

### Web-Augmented Chat

```python
response = requests.post(
    "http://localhost:8000/v1/web-chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What are the latest AI developments?"}
        ],
        "max_tokens": 512,
        "max_search_results": 5
    }
)
result = response.json()
print(result["choices"][0]["message"]["content"])
print(f"Sources: {result['web_search']['sources']}")
```

### Switch Models

```python
# Switch to the coding model
response = requests.post(
    "http://localhost:8000/switch-model",
    json={"model_name": "deepseek-coder"}
)
print(response.json())
# Output: {"message": "Switched to model: deepseek-coder (from cache)", "model": "deepseek-coder"}
```

### Check Cache Status

```python
response = requests.get("http://localhost:8000/cache/info")
cache_info = response.json()
print(f"Cached models: {cache_info['current_size']}/{cache_info['max_size']}")
for model in cache_info['cached_models']:
    print(f" - {model['name']} on port {model['port']}")
```

## 🏗️ Architecture

### Model Caching System

The API uses an intelligent LRU (Least Recently Used) cache to manage models in memory:

```
┌───────────────────────────────────────────┐
│ Request: Switch to Model A                │
├───────────────────────────────────────────┤
│ 1. Check if A is current → Skip           │
│ 2. Check cache for A                      │
│    ├─ Cache Hit  → Instant switch (< 1s)  │
│    └─ Cache Miss → Load model (~2-3 min)  │
│ 3. If cache full → Evict LRU model        │
│ 4. Add A to cache                         │
└───────────────────────────────────────────┘
```

**Benefits:**

- First load: ~2-3 minutes (model download + initialization)
- Subsequent switches: < 1 second (from cache)
- Automatic memory management with LRU eviction
- Each model runs on a separate port (8080, 8081, etc.)
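For illustration, here is a minimal Python sketch of the eviction logic described above, built on `collections.OrderedDict`. It is not the code in `app.py`: the `start_llama_server()` name does appear in this project, but the signature shown here, the `stop_llama_server()` helper, and the port bookkeeping are simplifying assumptions.

```python
import os
from collections import OrderedDict

MAX_CACHED_MODELS = int(os.getenv("MAX_CACHED_MODELS", "2"))
BASE_PORT = int(os.getenv("BASE_PORT", "8080"))

# model name -> port of its running llama-server, ordered least -> most recently used
model_cache: OrderedDict[str, int] = OrderedDict()

def start_llama_server(name: str, port: int) -> None:
    """Placeholder: in app.py this spawns a llama-server process for `name` on `port`."""

def stop_llama_server(port: int) -> None:
    """Placeholder: hypothetical helper that terminates the llama-server on `port`."""

def switch_model(name: str) -> str:
    """Switch the active model, loading or evicting cached models as needed."""
    if name in model_cache:
        model_cache.move_to_end(name)  # cache hit: mark as most recently used
        return f"Switched to model: {name} (from cache)"

    if len(model_cache) >= MAX_CACHED_MODELS:
        _evicted, freed_port = model_cache.popitem(last=False)  # evict LRU model
        stop_llama_server(freed_port)

    # Cache miss: start a server on the first port not used by a cached model (~2-3 min).
    used_ports = set(model_cache.values())
    port = next(p for p in range(BASE_PORT, BASE_PORT + MAX_CACHED_MODELS)
                if p not in used_ports)
    start_llama_server(name, port)
    model_cache[name] = port
    return f"Switched to model: {name} (loaded)"
```

`OrderedDict.move_to_end()` and `popitem(last=False)` give constant-time hit and evict operations, which is why an LRU policy is a natural fit when only a couple of model servers can stay resident.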
### Multi-Port Architecture

```
        ┌──────────────────┐
        │  FastAPI Server  │
        │   (Port 8000)    │
        └────────┬─────────┘
                 │
        ┌────────┴─────────┐
        │                  │
┌───────▼────────┐ ┌───────▼────────┐
│  llama-server  │ │  llama-server  │
│  Model A:8080  │ │  Model B:8081  │
└────────────────┘ └────────────────┘
```

## ⚙️ Configuration

### Environment Variables

```bash
# Maximum cached models (default: 2)
MAX_CACHED_MODELS=2

# Base port for llama-server instances (default: 8080)
BASE_PORT=8080

# Default model on startup
DEFAULT_MODEL=deepseek-chat
```

### Model Configuration

Edit `AVAILABLE_MODELS` in `app.py` to add custom models:

```python
AVAILABLE_MODELS = {
    "my-model": "username/model-name-GGUF:model-file.Q4_K_M.gguf"
}
```

## 📊 API Endpoints

### Status & Models

- `GET /` - API status and current model
- `GET /models` - List available models
- `GET /cache/info` - Cache statistics and cached models

### Model Management

- `POST /switch-model` - Switch active model (with caching)

### Chat Completions

- `POST /v1/chat/completions` - Standard chat completions (OpenAI-compatible)
- `POST /v1/web-chat/completions` - Web-augmented chat with search

### Documentation

- `GET /docs` - Swagger UI interactive documentation
- `GET /redoc` - ReDoc alternative documentation
- `GET /openapi.json` - OpenAPI 3.0 specification export

## 🧪 Testing

```bash
# Test basic chat
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Check cache status
curl http://localhost:8000/cache/info

# Switch models
curl -X POST http://localhost:8000/switch-model \
  -H "Content-Type: application/json" \
  -d '{"model_name": "deepseek-coder"}'
```

## 🔍 Web Search Integration

The web-augmented chat endpoint automatically:

1. Extracts the user's query from the last message
2. Performs a DuckDuckGo web search
3. Injects search results into the LLM context
4. Returns the response with source citations

**Use cases:**

- Current events and news
- Recent developments beyond training data
- Fact-checking with live web data
- Research with source attribution
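The README does not pin down the search client used in `app.py`; the sketch below assumes the `duckduckgo_search` package (`DDGS`), and the `build_web_context()` helper and prompt format are hypothetical. It only shows how the four steps above can fit together.

```python
from duckduckgo_search import DDGS  # assumed client; the actual dependency may differ

def build_web_context(messages: list[dict], max_search_results: int = 5) -> tuple[str, list[str]]:
    """Turn the last user message into a web-augmented prompt plus a list of source URLs."""
    query = messages[-1]["content"]                                        # 1. extract the query
    with DDGS() as ddgs:
        results = list(ddgs.text(query, max_results=max_search_results))  # 2. web search

    snippets = "\n".join(f"- {r['title']}: {r['body']}" for r in results)  # 3. build LLM context
    sources = [r["href"] for r in results]                                 # 4. keep citations
    prompt = (
        "Use the following web search results to answer the question.\n"
        f"{snippets}\n\nQuestion: {query}"
    )
    return prompt, sources
```

The endpoint would then send `prompt` to the active llama-server as the user message and return `sources` under the `web_search.sources` field seen in the usage example earlier.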
## 📈 Performance Tips

1. **Cache Size**: Increase `MAX_CACHED_MODELS` if you have sufficient RAM (each model is ~4-5 GB)
2. **CPU Threads**: Adjust the `-t` parameter in `start_llama_server()` based on your CPU core count
3. **Batch Size**: Modify the `-b` parameter to trade throughput against latency
4. **GPU Acceleration**: Set `-ngl` > 0 if you have a GPU (requires llama.cpp built with GPU support)

## 🛠️ Development

### Project Structure

```
AGI/
├── app.py                  # Main FastAPI application
├── client_multi_model.py   # Example client
├── Dockerfile              # Docker configuration
├── pyproject.toml          # Python dependencies
└── README.md               # This file
```

### Adding New Models

1. Find a GGUF model on Hugging Face
2. Add it to the `AVAILABLE_MODELS` dict
3. Restart the server
4. Switch to your new model via the API

## 📝 License

Apache 2.0 - see the LICENSE file for details.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 🐛 Troubleshooting

### Model fails to load

- Ensure `llama-server` is in your PATH
- Check available disk space for model downloads
- Verify internet connectivity for first-time model downloads

### Out of memory errors

- Reduce `MAX_CACHED_MODELS` to 1
- Use smaller quantized models (Q4_K_M instead of Q8)
- Increase system swap space

### Port conflicts

- Change `BASE_PORT` if ports 8080+ are in use
- Check for other llama-server instances: `ps aux | grep llama`

## 📚 Additional Resources

- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [Hugging Face Models](https://huggingface.co/models?library=gguf)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)

---

Built with ❤️ using FastAPI and llama.cpp