---
title: AGI Multi-Model API
emoji: 💻
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: Multi-model LLM API with caching and web search
---
# AGI Multi-Model API
A high-performance FastAPI server for running multiple LLM models with intelligent in-memory caching and web search capabilities. Switch between models instantly and augment responses with real-time web data.
## Features

- **Dynamic Model Switching**: Hot-swap between 5 different LLM models
- **Intelligent Caching**: LRU cache keeps up to 2 models in memory for instant switching
- **Web-Augmented Chat**: Real-time web search integration via DuckDuckGo
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI chat completions
- **Auto-Generated Documentation**: Interactive API docs with Swagger UI and ReDoc
- **Optimized Performance**: Continuous batching and multi-port architecture
## Available Models
| Model | Use Case | Size |
|---|---|---|
| deepseek-chat (default) | General purpose conversation | 7B |
| mistral-7b | Financial analysis & summarization | 7B |
| openhermes-7b | Advanced instruction following | 7B |
| deepseek-coder | Specialized coding assistance | 6.7B |
| llama-7b | Lightweight & fast responses | 7B |
## Quick Start

### Prerequisites

- Python 3.10+
- `llama-server` (llama.cpp)
- 8GB+ RAM (16GB+ recommended for caching multiple models)

### Installation

```bash
# Clone the repository
git clone <your-repo-url>
cd AGI

# Install dependencies
pip install -r requirements.txt
# or
uv pip install -r pyproject.toml

# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000
```

### Docker Deployment

```bash
# Build the image
docker build -t agi-api .

# Run the container
docker run -p 8000:8000 agi-api
```
## API Documentation
Once the server is running, access the interactive documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- OpenAPI JSON: http://localhost:8000/openapi.json
## Usage Examples

### Basic Chat Completion

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }
)
print(response.json())
```

### Web-Augmented Chat

```python
response = requests.post(
    "http://localhost:8000/v1/web-chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What are the latest AI developments?"}
        ],
        "max_tokens": 512,
        "max_search_results": 5
    }
)
result = response.json()
print(result["choices"][0]["message"]["content"])
print(f"Sources: {result['web_search']['sources']}")
```

### Switch Models

```python
# Switch to coding model
response = requests.post(
    "http://localhost:8000/switch-model",
    json={"model_name": "deepseek-coder"}
)
print(response.json())
# Output: {"message": "Switched to model: deepseek-coder (from cache)", "model": "deepseek-coder"}
```

### Check Cache Status

```python
response = requests.get("http://localhost:8000/cache/info")
cache_info = response.json()
print(f"Cached models: {cache_info['current_size']}/{cache_info['max_size']}")
for model in cache_info['cached_models']:
    print(f"  - {model['name']} on port {model['port']}")
```
## Architecture

### Model Caching System

The API uses an intelligent LRU (Least Recently Used) cache to manage models in memory:

```
┌───────────────────────────────────────────┐
│ Request: Switch to Model A                │
├───────────────────────────────────────────┤
│ 1. Check if A is current → Skip           │
│ 2. Check cache for A                      │
│    ├─ Cache Hit  → Instant switch (< 1s)  │
│    └─ Cache Miss → Load model (~2-3 min)  │
│ 3. If cache full → Evict LRU model        │
│ 4. Add A to cache                         │
└───────────────────────────────────────────┘
```
Benefits:
- First load: ~2-3 minutes (model download + initialization)
- Subsequent switches: < 1 second (from cache)
- Automatic memory management with LRU eviction
- Each model runs on a separate port (8080, 8081, etc.)
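
In code, the cache boils down to an ordered map from model name to the port of its running `llama-server` process. The sketch below is illustrative only; the class and method names are not the actual ones in `app.py`:

```python
from collections import OrderedDict

class ModelCache:
    """Illustrative LRU cache of running llama-server instances (not the actual app.py code)."""

    def __init__(self, max_size: int = 2):
        self.max_size = max_size
        self.models: OrderedDict[str, int] = OrderedDict()  # model name -> backend port

    def get(self, name: str) -> int | None:
        """Return the port of a cached model and mark it most recently used."""
        if name in self.models:
            self.models.move_to_end(name)  # cache hit: refresh recency, no reload needed
            return self.models[name]
        return None  # cache miss: the caller has to start a new llama-server

    def put(self, name: str, port: int) -> None:
        """Register a freshly loaded model, evicting the LRU entry if the cache is full."""
        if len(self.models) >= self.max_size:
            evicted_name, _ = self.models.popitem(last=False)  # least recently used
            # the real server would also stop the evicted llama-server process here
        self.models[name] = port
        self.models.move_to_end(name)
```

A cache hit only reorders the dictionary, which is why switching back to a cached model takes well under a second.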
### Multi-Port Architecture

```
        ┌──────────────────────┐
        │    FastAPI Server    │
        │     (Port 8000)      │
        └──────────┬───────────┘
                   │
     ┌─────────────┴─────────────┐
     │                           │
┌────▼───────────┐         ┌─────▼──────────┐
│  llama-server  │         │  llama-server  │
│  Model A:8080  │         │  Model B:8081  │
└────────────────┘         └────────────────┘
```
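
The FastAPI layer only needs to know which backend port belongs to the currently active model and forward OpenAI-style requests to it. A minimal sketch of that forwarding step, assuming a module-level `ACTIVE_PORT` placeholder (the real `app.py` may proxy differently, e.g. with streaming):

```python
import requests
from fastapi import FastAPI

app = FastAPI()
ACTIVE_PORT = 8080  # placeholder: updated whenever /switch-model activates another model

@app.post("/v1/chat/completions")
def chat_completions(payload: dict):
    # Forward the OpenAI-style request body to the llama-server hosting the active model.
    backend = requests.post(
        f"http://127.0.0.1:{ACTIVE_PORT}/v1/chat/completions",
        json=payload,
        timeout=300,
    )
    return backend.json()
```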
## Configuration

### Environment Variables

```bash
# Maximum cached models (default: 2)
MAX_CACHED_MODELS=2

# Base port for llama-server instances (default: 8080)
BASE_PORT=8080

# Default model on startup
DEFAULT_MODEL=deepseek-chat
```
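
Inside the application these would typically be read once at startup; a minimal sketch of that pattern (the exact handling in `app.py` may differ):

```python
import os

# Read configuration from the environment, falling back to the documented defaults.
MAX_CACHED_MODELS = int(os.getenv("MAX_CACHED_MODELS", "2"))
BASE_PORT = int(os.getenv("BASE_PORT", "8080"))
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "deepseek-chat")
```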
### Model Configuration

Edit `AVAILABLE_MODELS` in `app.py` to add custom models:

```python
AVAILABLE_MODELS = {
    "my-model": "username/model-name-GGUF:model-file.Q4_K_M.gguf"
}
```
## API Endpoints

### Status & Models

- `GET /` - API status and current model
- `GET /models` - List available models
- `GET /cache/info` - Cache statistics and cached models
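
For example, from Python (the field names in the responses are indicative, not guaranteed):

```python
import requests

# API status and the currently active model
print(requests.get("http://localhost:8000/").json())

# All models the server can switch to
print(requests.get("http://localhost:8000/models").json())
```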
### Model Management

- `POST /switch-model` - Switch active model (with caching)

### Chat Completions

- `POST /v1/chat/completions` - Standard chat completions (OpenAI-compatible)
- `POST /v1/web-chat/completions` - Web-augmented chat with search

### Documentation

- `GET /docs` - Swagger UI interactive documentation
- `GET /redoc` - ReDoc alternative documentation
- `GET /openapi.json` - OpenAPI 3.0 specification export
## Testing

```bash
# Test basic chat
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Check cache status
curl http://localhost:8000/cache/info

# Switch models
curl -X POST http://localhost:8000/switch-model \
  -H "Content-Type: application/json" \
  -d '{"model_name": "deepseek-coder"}'
```
## Web Search Integration
The web-augmented chat endpoint automatically:
- Extracts the user's query from the last message
- Performs a DuckDuckGo web search
- Injects search results into the LLM context
- Returns response with source citations
Use cases:
- Current events and news
- Recent developments beyond training data
- Fact-checking with live web data
- Research with source attribution
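
Conceptually, the four steps above reduce to a small helper like the sketch below. It assumes the `duckduckgo_search` package and is not the literal code from `app.py`; the prompt format and helper name are illustrative:

```python
from duckduckgo_search import DDGS

def build_web_context(messages: list[dict], max_results: int = 5) -> tuple[str, list[str]]:
    """Search the web for the last user message and return (context_block, source_urls)."""
    query = messages[-1]["content"]                          # 1. extract the user's query
    results = DDGS().text(query, max_results=max_results)    # 2. DuckDuckGo search
    snippets = [f"- {r['title']}: {r['body']} ({r['href']})" for r in results]
    sources = [r["href"] for r in results]
    context = "Web search results:\n" + "\n".join(snippets)  # 3. injected into the LLM context
    return context, sources                                  # 4. sources returned for citation
```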
## Performance Tips

- **Cache Size**: Increase `MAX_CACHED_MODELS` if you have sufficient RAM (each model ~4-5GB)
- **CPU Threads**: Adjust the `-t` parameter in `start_llama_server()` based on your CPU cores
- **Batch Size**: Modify the `-b` parameter for the throughput vs. latency tradeoff
- **GPU Acceleration**: Set `-ngl` > 0 if you have a GPU (requires llama.cpp with GPU support)
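
These flags are passed straight through to `llama-server`. A sketch of how such an invocation might be assembled (the actual `start_llama_server()` in `app.py` may build the command differently; the model path here is a placeholder):

```python
import subprocess

def start_llama_server(model_path: str, port: int) -> subprocess.Popen:
    """Launch one llama-server instance with illustrative tuning flags."""
    cmd = [
        "llama-server",
        "-m", model_path,     # GGUF model file (placeholder path)
        "--port", str(port),  # one port per cached model (8080, 8081, ...)
        "-t", "8",            # CPU threads: match your physical core count
        "-b", "512",          # batch size: larger favors throughput over latency
        "-ngl", "0",          # GPU layers: set > 0 to offload layers to a GPU build
    ]
    return subprocess.Popen(cmd)
```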
## Development

### Project Structure

```
AGI/
├── app.py                   # Main FastAPI application
├── client_multi_model.py    # Example client
├── Dockerfile               # Docker configuration
├── pyproject.toml           # Python dependencies
└── README.md                # This file
```

### Adding New Models

1. Find a GGUF model on HuggingFace
2. Add it to the `AVAILABLE_MODELS` dict
3. Restart the server
4. Switch to your new model via the API
## License
Apache 2.0 - See LICENSE file for details
## Contributing
Contributions welcome! Please feel free to submit a Pull Request.
## Troubleshooting

### Model fails to load

- Ensure `llama-server` is in your PATH
- Check available disk space for model downloads
- Verify internet connectivity for first-time model downloads

### Out of memory errors

- Reduce `MAX_CACHED_MODELS` to 1
- Use smaller quantized models (Q4_K_M instead of Q8)
- Increase system swap space

### Port conflicts

- Change `BASE_PORT` if 8080+ are in use
- Check for other llama-server instances: `ps aux | grep llama`
## Additional Resources
Built with ❤️ using FastAPI and llama.cpp