---
title: AGI Multi-Model API
emoji: 😻
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: Multi-model LLM API with caching and web search
---

# AGI Multi-Model API

A FastAPI server that runs multiple LLMs behind a single API, with in-memory LRU model caching and web search. Switch to a cached model in under a second and augment responses with real-time web data.

## ✨ Features

- **🔄 Dynamic Model Switching**: Hot-swap between 5 different LLMs
- **⚡ Intelligent Caching**: LRU cache keeps up to 2 models in memory, so switching to a cached model takes under a second
- **🌐 Web-Augmented Chat**: Real-time web search integration via DuckDuckGo
- **📚 OpenAI-Compatible API**: Drop-in replacement for OpenAI chat completions
- **📖 Auto-Generated Documentation**: Interactive API docs with Swagger UI and ReDoc
- **🚀 Optimized Performance**: Continuous batching and multi-port architecture

## 🤖 Available Models

| Model | Use Case | Size |
|-------|----------|------|
| **deepseek-chat** (default) | General purpose conversation | 7B |
| **mistral-7b** | Financial analysis & summarization | 7B |
| **openhermes-7b** | Advanced instruction following | 7B |
| **deepseek-coder** | Specialized coding assistance | 6.7B |
| **llama-7b** | Lightweight & fast responses | 7B |

## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- `llama-server` (llama.cpp)
- 8GB+ RAM (16GB+ recommended for caching multiple models)

### Installation

```bash
# Clone the repository
git clone <your-repo-url>
cd AGI

# Install dependencies
pip install -r requirements.txt
# or
uv pip install -r pyproject.toml

# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000
```

### Docker Deployment

```bash
# Build the image
docker build -t agi-api .

# Run the container
docker run -p 8000:8000 agi-api
```

## 📖 API Documentation

Once the server is running, access the interactive documentation:

- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **OpenAPI JSON**: http://localhost:8000/openapi.json

## 🔧 Usage Examples

### Basic Chat Completion

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }
)
print(response.json())
```

### Web-Augmented Chat

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/web-chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What are the latest AI developments?"}
        ],
        "max_tokens": 512,
        "max_search_results": 5
    }
)
result = response.json()
print(result["choices"][0]["message"]["content"])
print(f"Sources: {result['web_search']['sources']}")
```

### Switch Models

```python
import requests

# Switch to the coding model
response = requests.post(
    "http://localhost:8000/switch-model",
    json={"model_name": "deepseek-coder"}
)
print(response.json())
# Output: {"message": "Switched to model: deepseek-coder (from cache)", "model": "deepseek-coder"}
```

### Check Cache Status

```python
import requests

response = requests.get("http://localhost:8000/cache/info")
cache_info = response.json()
print(f"Cached models: {cache_info['current_size']}/{cache_info['max_size']}")
for model in cache_info['cached_models']:
    print(f"  - {model['name']} on port {model['port']}")
```

## 🏗️ Architecture

### Model Caching System

The API uses an intelligent LRU (Least Recently Used) cache to manage models in memory:

```
┌─────────────────────────────────────────────────┐
│  Request: Switch to Model A                    │
├─────────────────────────────────────────────────┤
│  1. Check if A is current → Skip               │
│  2. Check cache for A                          │
│     ├─ Cache Hit  → Instant switch (< 1s)      │
│     └─ Cache Miss → Load model (~2-3 min)      │
│  3. If cache full → Evict LRU model            │
│  4. Add A to cache                             │
└─────────────────────────────────────────────────┘
```

**Characteristics:**
- First load of a model: ~2-3 minutes (model download + initialization)
- Subsequent switches to a cached model: < 1 second
- Automatic memory management with LRU eviction
- Each cached model runs in its own `llama-server` on a separate port (8080, 8081, etc.)
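
The switching flow above can be sketched with an `OrderedDict`. This is a minimal illustration, not the code in `app.py`: the `ModelCache` class, its method names, and the port-reuse policy are assumptions made for the example, and it only tracks names and ports rather than starting or stopping real `llama-server` processes.

```python
from collections import OrderedDict


class ModelCache:
    """Toy LRU cache mapping model names to the port of their llama-server."""

    def __init__(self, max_size=2, base_port=8080):
        self.max_size = max_size
        self.base_port = base_port
        self._entries = OrderedDict()  # name -> port, least recently used first

    def _free_port(self):
        # Lowest port at or above base_port that no cached model is using.
        used = set(self._entries.values())
        port = self.base_port
        while port in used:
            port += 1
        return port

    def switch(self, name):
        """Return (port, status); status is 'hit', 'miss', or 'miss+evict:<name>'."""
        if name in self._entries:
            self._entries.move_to_end(name)  # mark as most recently used
            return self._entries[name], "hit"
        status = "miss"
        if len(self._entries) >= self.max_size:
            evicted, _ = self._entries.popitem(last=False)  # drop the LRU entry
            # A real server would also stop the evicted llama-server process here.
            status = f"miss+evict:{evicted}"
        port = self._free_port()
        self._entries[name] = port
        return port, status


cache = ModelCache(max_size=2)
print(cache.switch("deepseek-chat"))   # (8080, 'miss')
print(cache.switch("deepseek-coder"))  # (8081, 'miss')
print(cache.switch("deepseek-chat"))   # (8080, 'hit')
print(cache.switch("mistral-7b"))      # (8081, 'miss+evict:deepseek-coder')
```

In this sketch an evicted model's port is reused by the next model loaded; whether the actual server reuses ports or keeps incrementing is not stated here.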

### Multi-Port Architecture

```
┌──────────────────────┐
│   FastAPI Server     │
│   (Port 8000)        │
└──────────┬───────────┘
           │
    ┌──────┴────────────────────┐
    │                           │
┌───▼────────────┐  ┌──────────▼────┐
│ llama-server   │  │ llama-server  │
│ Model A:8080   │  │ Model B:8081  │
└────────────────┘  └───────────────┘
```

## ⚙️ Configuration

### Environment Variables

```bash
# Maximum cached models (default: 2)
MAX_CACHED_MODELS=2

# Base port for llama-server instances (default: 8080)
BASE_PORT=8080

# Default model on startup
DEFAULT_MODEL=deepseek-chat
```
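
As a rough sketch of how these variables might be read at startup, the helper below applies the documented defaults when a variable is unset. The `load_settings` function and its dictionary keys are hypothetical names for illustration; the actual parsing in `app.py` may differ.

```python
import os


def load_settings(env=None):
    """Read the three settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return {
        # int() raises ValueError on non-numeric input, which surfaces typos early.
        "max_cached_models": int(env.get("MAX_CACHED_MODELS", "2")),
        "base_port": int(env.get("BASE_PORT", "8080")),
        "default_model": env.get("DEFAULT_MODEL", "deepseek-chat"),
    }


print(load_settings({"BASE_PORT": "9000"}))
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test.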

### Model Configuration

Edit `AVAILABLE_MODELS` in `app.py` to add custom models:

```python
AVAILABLE_MODELS = {
    "my-model": "username/model-name-GGUF:model-file.Q4_K_M.gguf"
}
```
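
Before restarting the server, it can help to sanity-check a new entry. The sketch below assumes the value format is `user/repo:file.gguf`, inferred only from the example above; `split_model_spec` is a hypothetical helper, not part of `app.py`.

```python
def split_model_spec(spec):
    """Split 'user/repo:file.gguf' into (repo, filename), checking the basic shape."""
    repo, sep, filename = spec.partition(":")
    if not sep or repo.count("/") != 1 or not filename.endswith(".gguf"):
        raise ValueError(f"expected 'user/repo:file.gguf', got {spec!r}")
    return repo, filename


repo, filename = split_model_spec("username/model-name-GGUF:model-file.Q4_K_M.gguf")
print(repo)      # username/model-name-GGUF
print(filename)  # model-file.Q4_K_M.gguf
```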

## 📊 API Endpoints

### Status & Models

- `GET /` - API status and current model
- `GET /models` - List available models
- `GET /cache/info` - Cache statistics and cached models

### Model Management

- `POST /switch-model` - Switch active model (with caching)

### Chat Completions

- `POST /v1/chat/completions` - Standard chat completions (OpenAI-compatible)
- `POST /v1/web-chat/completions` - Web-augmented chat with search

### Documentation

- `GET /docs` - Swagger UI interactive documentation
- `GET /redoc` - ReDoc alternative documentation
- `GET /openapi.json` - OpenAPI specification (JSON)

## 🧪 Testing

```bash
# Test basic chat
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Check cache status
curl http://localhost:8000/cache/info

# Switch models
curl -X POST http://localhost:8000/switch-model \
  -H "Content-Type: application/json" \
  -d '{"model_name": "deepseek-coder"}'
```

## 🔍 Web Search Integration

The web-augmented chat endpoint automatically:
1. Extracts the user's query from the last message
2. Performs a DuckDuckGo web search
3. Injects search results into the LLM context
4. Returns response with source citations
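
Steps 1, 3, and 4 can be sketched as a pure function that takes the search results as input. This is an illustration under assumptions: `build_augmented_messages` is a hypothetical name, the result dictionaries are assumed to carry `title`, `href`, and `body` keys, and the prompt wording is invented for the example. The search call itself (step 2) is left out so the sketch runs offline.

```python
def build_augmented_messages(messages, search_results):
    """Return (messages with a search-context system message prepended, source URLs)."""
    # Step 1: the query is the content of the last user message (empty if none).
    query = next(
        (m["content"] for m in reversed(messages) if m["role"] == "user"), ""
    )
    # Step 3: format each result as a numbered snippet the model can cite.
    lines = [
        f"[{i}] {r['title']} ({r['href']}): {r['body']}"
        for i, r in enumerate(search_results, start=1)
    ]
    context = f"Web search results for: {query}\n" + "\n".join(lines)
    system = {"role": "system", "content": context + "\nCite sources as [n]."}
    # Step 4: collect the URLs so the response can list its sources.
    sources = [r["href"] for r in search_results]
    return [system, *messages], sources


messages = [{"role": "user", "content": "What are the latest AI developments?"}]
results = [{"title": "Example", "href": "https://example.com", "body": "A snippet."}]
augmented, sources = build_augmented_messages(messages, results)
print(augmented[0]["content"])
print(sources)
```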

**Use cases:**
- Current events and news
- Recent developments beyond training data
- Fact-checking with live web data
- Research with source attribution

## 📈 Performance Tips

1. **Cache Size**: Increase `MAX_CACHED_MODELS` if you have sufficient RAM (each model ~4-5GB)
2. **CPU Threads**: Adjust `-t` parameter in `start_llama_server()` based on your CPU cores
3. **Batch Size**: Modify `-b` parameter for throughput vs. latency tradeoff
4. **GPU Acceleration**: Set `-ngl` > 0 if you have a GPU (requires llama.cpp with GPU support)
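
To make the flags concrete, here is a sketch of how a `llama-server` command line could be assembled from these settings. `build_llama_server_cmd` is a hypothetical helper, and using `-hf` to pass the model spec is an assumption; the real `start_llama_server()` may build its arguments differently. The default values are illustrative, not recommendations.

```python
import os


def build_llama_server_cmd(model_spec, port, threads=None, batch_size=512, gpu_layers=0):
    """Assemble an argv list for llama-server (illustrative, not from app.py)."""
    if threads is None:
        threads = os.cpu_count() or 1  # fall back to 1 if the count is unknown
    return [
        "llama-server",
        "-hf", model_spec,        # Hugging Face repo spec (assumed loading method)
        "--port", str(port),
        "-t", str(threads),       # CPU threads
        "-b", str(batch_size),    # batch size
        "-ngl", str(gpu_layers),  # layers offloaded to GPU; 0 keeps everything on CPU
    ]


cmd = build_llama_server_cmd("username/model-name-GGUF:model-file.Q4_K_M.gguf", 8080, threads=4)
print(" ".join(cmd))
```

The list form can be passed directly to `subprocess.Popen` without shell quoting.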

## 🛠️ Development

### Project Structure

```
AGI/
├── app.py                    # Main FastAPI application
├── client_multi_model.py     # Example client
├── Dockerfile                # Docker configuration
├── pyproject.toml           # Python dependencies
└── README.md                # This file
```

### Adding New Models

1. Find a GGUF model on HuggingFace
2. Add to `AVAILABLE_MODELS` dict
3. Restart the server
4. Switch to your new model via API

## 📝 License

Apache 2.0 - See LICENSE file for details

## 🤝 Contributing

Contributions welcome! Please feel free to submit a Pull Request.

## 🐛 Troubleshooting

### Model fails to load
- Ensure `llama-server` is in your PATH
- Check available disk space for model downloads
- Verify internet connectivity for first-time model downloads

### Out of memory errors
- Reduce `MAX_CACHED_MODELS` to 1
- Use smaller quantized models (Q4_K_M instead of Q8)
- Increase system swap space

### Port conflicts
- Change `BASE_PORT` if 8080+ are in use
- Check for other llama-server instances: `ps aux | grep llama`
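
A quick way to see which ports are taken before choosing `BASE_PORT` is to try connecting to each one. The sketch below is a standalone check with a hypothetical `is_port_free` helper; it only detects listeners on the given host (localhost by default).

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if nothing is currently accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is in use.
        return sock.connect_ex((host, port)) != 0


for port in range(8080, 8083):
    print(port, "free" if is_port_free(port) else "in use")
```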

## 📚 Additional Resources

- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [Hugging Face Models](https://huggingface.co/models?library=gguf)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)

---

Built with ❤️ using FastAPI and llama.cpp