---
title: AGI Multi-Model API
emoji: 😻
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: Multi-model LLM API with caching and web search
---

AGI Multi-Model API

A high-performance FastAPI server for serving multiple LLMs with intelligent in-memory caching and web search capabilities. Switch between models instantly and augment responses with real-time web data.

✨ Features

  • 🔄 Dynamic Model Switching: Hot-swap between 5 LLMs
  • ⚡ Intelligent Caching: LRU cache keeps up to 2 models in memory for instant switching
  • 🌐 Web-Augmented Chat: Real-time web search integration via DuckDuckGo
  • 📚 OpenAI-Compatible API: Drop-in replacement for OpenAI chat completions
  • 📖 Auto-Generated Documentation: Interactive API docs with Swagger UI and ReDoc
  • 🚀 Optimized Performance: Continuous batching and multi-port architecture

🤖 Available Models

Model                     Use Case                             Size
deepseek-chat (default)   General purpose conversation         7B
mistral-7b                Financial analysis & summarization   7B
openhermes-7b             Advanced instruction following       7B
deepseek-coder            Specialized coding assistance        6.7B
llama-7b                  Lightweight & fast responses         7B
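
The available models can also be listed at runtime via the GET /models endpoint (see API Endpoints below). The exact response shape is defined in app.py, so this snippet simply prints whatever the server returns:

import requests

response = requests.get("http://localhost:8000/models")
print(response.json())  # available model names; exact fields depend on app.py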

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • llama-server (llama.cpp)
  • 8GB+ RAM (16GB+ recommended for caching multiple models)
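
Before starting the API you can verify that llama-server is discoverable on your PATH; this check is plain Python and not specific to this project:

import shutil

llama_server = shutil.which("llama-server")
if llama_server is None:
    raise SystemExit("llama-server not found on PATH -- install llama.cpp first")
print(f"llama-server found at {llama_server}")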

Installation

# Clone the repository
git clone <your-repo-url>
cd AGI

# Install dependencies
pip install -r requirements.txt
# or
uv pip install -r pyproject.toml

# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000

Docker Deployment

# Build the image
docker build -t agi-api .

# Run the container
docker run -p 8000:8000 agi-api

📖 API Documentation

Once the server is running, access the interactive documentation:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc
  • OpenAPI schema: http://localhost:8000/openapi.json

🔧 Usage Examples

Basic Chat Completion

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }
)
print(response.json())

Web-Augmented Chat

response = requests.post(
    "http://localhost:8000/v1/web-chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What are the latest AI developments?"}
        ],
        "max_tokens": 512,
        "max_search_results": 5
    }
)
result = response.json()
print(result["choices"][0]["message"]["content"])
print(f"Sources: {result['web_search']['sources']}")

Switch Models

# Switch to coding model
response = requests.post(
    "http://localhost:8000/switch-model",
    json={"model_name": "deepseek-coder"}
)
print(response.json())
# Output: {"message": "Switched to model: deepseek-coder (from cache)", "model": "deepseek-coder"}

Check Cache Status

response = requests.get("http://localhost:8000/cache/info")
cache_info = response.json()
print(f"Cached models: {cache_info['current_size']}/{cache_info['max_size']}")
for model in cache_info['cached_models']:
    print(f"  - {model['name']} on port {model['port']}")

πŸ—οΈ Architecture

Model Caching System

The API uses an intelligent LRU (Least Recently Used) cache to manage models in memory:

┌────────────────────────────────────────────────┐
│  Request: Switch to Model A                    │
├────────────────────────────────────────────────┤
│  1. Check if A is current → Skip               │
│  2. Check cache for A                          │
│     ├─ Cache Hit  → Instant switch (< 1s)      │
│     └─ Cache Miss → Load model (~2-3 min)      │
│  3. If cache full → Evict LRU model            │
│  4. Add A to cache                             │
└────────────────────────────────────────────────┘

Benefits:

  • First load: ~2-3 minutes (model download + initialization)
  • Subsequent switches: < 1 second (from cache)
  • Automatic memory management with LRU eviction
  • Each model runs on a separate port (8080, 8081, etc.)
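
For illustration, the eviction flow above boils down to an ordered mapping from model name to port. The sketch below is not the code in app.py (ModelCache and stop_llama_server are illustrative names), but it captures the cache-hit, cache-miss, and eviction behavior:

from collections import OrderedDict

class ModelCache:
    """Minimal LRU sketch: model name -> port of its running llama-server."""

    def __init__(self, max_size=2):  # mirrors MAX_CACHED_MODELS
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, name):
        if name in self._entries:
            self._entries.move_to_end(name)  # mark as most recently used
            return self._entries[name]       # cache hit: server already running
        return None                          # cache miss: caller must load the model

    def put(self, name, port):
        if name not in self._entries and len(self._entries) >= self.max_size:
            evicted_name, evicted_port = self._entries.popitem(last=False)  # evict LRU model
            # stop_llama_server(evicted_port)  # hypothetical cleanup hook
        self._entries[name] = port
        self._entries.move_to_end(name)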

Multi-Port Architecture

┌──────────────────────┐
│   FastAPI Server     │
│   (Port 8000)        │
└──────────┬───────────┘
           │
    ┌──────┴─────────────────┐
    │                        │
┌───▼────────────┐  ┌────────▼───────┐
│ llama-server   │  │ llama-server   │
│ Model A:8080   │  │ Model B:8081   │
└────────────────┘  └────────────────┘
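
Conceptually, the FastAPI layer relays each chat request to whichever backend is currently active; llama-server exposes an OpenAI-compatible endpoint on its own port. A minimal sketch of that relay (the port tracking and error handling in app.py will differ):

import requests

ACTIVE_MODEL_PORT = 8080  # port of the currently active llama-server (illustrative)

def forward_chat(payload: dict) -> dict:
    # Pass the OpenAI-style request body through to the active backend.
    resp = requests.post(
        f"http://127.0.0.1:{ACTIVE_MODEL_PORT}/v1/chat/completions",
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()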

⚙️ Configuration

Environment Variables

# Maximum cached models (default: 2)
MAX_CACHED_MODELS=2

# Base port for llama-server instances (default: 8080)
BASE_PORT=8080

# Default model on startup
DEFAULT_MODEL=deepseek-chat
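
These map to Python-side defaults roughly as follows (a sketch mirroring the documented defaults, not copied from app.py):

import os

MAX_CACHED_MODELS = int(os.getenv("MAX_CACHED_MODELS", "2"))
BASE_PORT = int(os.getenv("BASE_PORT", "8080"))
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "deepseek-chat")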

Model Configuration

Edit AVAILABLE_MODELS in app.py to add custom models:

AVAILABLE_MODELS = {
    "my-model": "username/model-name-GGUF:model-file.Q4_K_M.gguf"
}

📊 API Endpoints

Status & Models

  • GET / - API status and current model
  • GET /models - List available models
  • GET /cache/info - Cache statistics and cached models

Model Management

  • POST /switch-model - Switch active model (with caching)

Chat Completions

  • POST /v1/chat/completions - Standard chat completions (OpenAI-compatible)
  • POST /v1/web-chat/completions - Web-augmented chat with search

Documentation

  • GET /docs - Swagger UI interactive documentation
  • GET /redoc - ReDoc alternative documentation
  • GET /openapi.json - OpenAPI 3.0 specification export
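
To save the schema to disk, for example to generate a client, any HTTP client will do:

import json
import requests

spec = requests.get("http://localhost:8000/openapi.json").json()
with open("openapi.json", "w") as f:
    json.dump(spec, f, indent=2)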

🧪 Testing

# Test basic chat
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Check cache status
curl http://localhost:8000/cache/info

# Switch models
curl -X POST http://localhost:8000/switch-model \
  -H "Content-Type: application/json" \
  -d '{"model_name": "deepseek-coder"}'

πŸ” Web Search Integration

The web-augmented chat endpoint automatically:

  1. Extracts the user's query from the last message
  2. Performs a DuckDuckGo web search
  3. Injects search results into the LLM context
  4. Returns response with source citations
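
A rough sketch of steps 2-3, assuming the duckduckgo_search package; the exact prompt formatting and source handling in app.py may differ:

from duckduckgo_search import DDGS

def search_context(query: str, max_results: int = 5) -> str:
    # Each result dict carries "title", "href", and "body" fields.
    results = DDGS().text(query, max_results=max_results)
    return "\n\n".join(f"{r['title']} ({r['href']}):\n{r['body']}" for r in results)

# The joined snippets are injected into the conversation before it reaches the model,
# and the result links become the sources returned alongside the answer.
context = search_context("latest AI developments")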

Use cases:

  • Current events and news
  • Recent developments beyond training data
  • Fact-checking with live web data
  • Research with source attribution

📈 Performance Tips

  1. Cache Size: Increase MAX_CACHED_MODELS if you have sufficient RAM (each model ~4-5GB)
  2. CPU Threads: Adjust -t parameter in start_llama_server() based on your CPU cores
  3. Batch Size: Modify -b parameter for throughput vs. latency tradeoff
  4. GPU Acceleration: Set -ngl > 0 if you have a GPU (requires llama.cpp with GPU support)
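
For orientation, here is a simplified sketch of what a start_llama_server()-style launcher looks like with those flags; the real function in app.py also resolves the HuggingFace repo:file entry to a local GGUF first, and the values shown are placeholders:

import subprocess

def start_llama_server(gguf_path: str, port: int) -> subprocess.Popen:
    cmd = [
        "llama-server",
        "-m", gguf_path,      # locally downloaded GGUF file
        "--port", str(port),  # one port per cached model (8080, 8081, ...)
        "-t", "8",            # CPU threads (tip 2)
        "-b", "512",          # batch size (tip 3)
        "-ngl", "0",          # GPU layers; raise with a GPU-enabled build (tip 4)
    ]
    return subprocess.Popen(cmd)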

πŸ› οΈ Development

Project Structure

AGI/
├── app.py                    # Main FastAPI application
├── client_multi_model.py     # Example client
├── Dockerfile                # Docker configuration
├── pyproject.toml            # Python dependencies
└── README.md                 # This file

Adding New Models

  1. Find a GGUF model on HuggingFace
  2. Add to AVAILABLE_MODELS dict
  3. Restart the server
  4. Switch to your new model via API

πŸ“ License

Apache 2.0 - See LICENSE file for details

🤝 Contributing

Contributions welcome! Please feel free to submit a Pull Request.

πŸ› Troubleshooting

Model fails to load

  • Ensure llama-server is in your PATH
  • Check available disk space for model downloads
  • Verify internet connectivity for first-time model downloads

Out of memory errors

  • Reduce MAX_CACHED_MODELS to 1
  • Use smaller quantized models (Q4_K_M instead of Q8)
  • Increase system swap space

Port conflicts

  • Change BASE_PORT if 8080+ are in use
  • Check for other llama-server instances: ps aux | grep llama

📚 Additional Resources


Built with ❤️ using FastAPI and llama.cpp