---
title: AGI Multi-Model API
emoji: 😻
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: Multi-model LLM API with caching and web search
---

# AGI Multi-Model API

A high-performance FastAPI server for running multiple LLM models with intelligent in-memory caching and web search capabilities. Switch between models instantly and augment responses with real-time web data.

## ✨ Features

- **🔄 Dynamic Model Switching**: Hot-swap between 5 different LLM models
- **⚡ Intelligent Caching**: LRU cache keeps up to 2 models in memory for instant switching
- **🌐 Web-Augmented Chat**: Real-time web search integration via DuckDuckGo
- **📚 OpenAI-Compatible API**: Drop-in replacement for OpenAI chat completions
- **📖 Auto-Generated Documentation**: Interactive API docs with Swagger UI and ReDoc
- **🚀 Optimized Performance**: Continuous batching and multi-port architecture

## 🤖 Available Models

| Model | Use Case | Size |
|-------|----------|------|
| **deepseek-chat** (default) | General purpose conversation | 7B |
| **mistral-7b** | Financial analysis & summarization | 7B |
| **openhermes-7b** | Advanced instruction following | 7B |
| **deepseek-coder** | Specialized coding assistance | 6.7B |
| **llama-7b** | Lightweight & fast responses | 7B |

## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- `llama-server` (llama.cpp)
- 8GB+ RAM (16GB+ recommended for caching multiple models)

### Installation

```bash
# Clone the repository
git clone <your-repo-url>
cd AGI

# Install dependencies
pip install -r requirements.txt
# or
uv pip install -r pyproject.toml

# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000
```

### Docker Deployment

```bash
# Build the image
docker build -t agi-api .

# Run the container
docker run -p 8000:8000 agi-api
```
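
To tune the cache or the startup model inside the container, the environment variables documented in the Configuration section below can be passed at run time; this is only a sketch of how they might be combined:

```bash
# Run with a single-model cache and an explicit default model
docker run -p 8000:8000 \
  -e MAX_CACHED_MODELS=1 \
  -e DEFAULT_MODEL=deepseek-chat \
  agi-api
```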

## 📖 API Documentation

Once the server is running, access the interactive documentation:

- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **OpenAPI JSON**: http://localhost:8000/openapi.json

## 🔧 Usage Examples

### Basic Chat Completion

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }
)
print(response.json())
```
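
Because the endpoint follows the OpenAI chat-completions schema, the official `openai` Python client should also work when pointed at the server. This is a sketch under that assumption; the placeholder API key and the exact handling of the `model` field are not verified against `app.py`:

```python
from openai import OpenAI

# The API key is a placeholder; this server does not document authentication.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="deepseek-chat",  # one of the models listed above
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=100,
    temperature=0.7,
)
print(completion.choices[0].message.content)
```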

### Web-Augmented Chat

```python
response = requests.post(
    "http://localhost:8000/v1/web-chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What are the latest AI developments?"}
        ],
        "max_tokens": 512,
        "max_search_results": 5
    }
)
result = response.json()
print(result["choices"][0]["message"]["content"])
print(f"Sources: {result['web_search']['sources']}")
```

### Switch Models

```python
# Switch to coding model
response = requests.post(
    "http://localhost:8000/switch-model",
    json={"model_name": "deepseek-coder"}
)
print(response.json())
# Output: {"message": "Switched to model: deepseek-coder (from cache)", "model": "deepseek-coder"}
```

### Check Cache Status

```python
response = requests.get("http://localhost:8000/cache/info")
cache_info = response.json()
print(f"Cached models: {cache_info['current_size']}/{cache_info['max_size']}")
for model in cache_info['cached_models']:
    print(f"  - {model['name']} on port {model['port']}")
```

## 🏗️ Architecture

### Model Caching System

The API uses an intelligent LRU (Least Recently Used) cache to manage models in memory:

```
┌────────────────────────────────────────────────┐
│  Request: Switch to Model A                    │
├────────────────────────────────────────────────┤
│  1. Check if A is current → Skip               │
│  2. Check cache for A                          │
│     ├─ Cache Hit  → Instant switch (< 1s)      │
│     └─ Cache Miss → Load model (~2-3 min)      │
│  3. If cache full → Evict LRU model            │
│  4. Add A to cache                             │
└────────────────────────────────────────────────┘
```

**Benefits:**
- First load: ~2-3 minutes (model download + initialization)
- Subsequent switches: < 1 second (from cache)
- Automatic memory management with LRU eviction
- Each model runs on a separate port (8080, 8081, etc.)
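
For intuition, the cache behaviour above boils down to a small LRU structure over running `llama-server` ports. The sketch below is illustrative; the class and helper names are not taken from `app.py`:

```python
from collections import OrderedDict


class ModelCache:
    """Minimal LRU sketch of the switching flow above (names are illustrative)."""

    def __init__(self, max_size: int = 2):
        self.max_size = max_size
        self._models: OrderedDict[str, int] = OrderedDict()  # model name -> llama-server port

    def get(self, name: str) -> int | None:
        """Return the port of a cached model and mark it most recently used."""
        if name not in self._models:
            return None                      # cache miss: caller must load the model
        self._models.move_to_end(name)       # cache hit: instant switch
        return self._models[name]

    def put(self, name: str, port: int) -> None:
        """Register a freshly loaded model, evicting the least recently used if full."""
        if len(self._models) >= self.max_size:
            evicted_name, evicted_port = self._models.popitem(last=False)
            # In the real server, the llama-server process on evicted_port
            # would be stopped here to free its memory.
        self._models[name] = port
```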

### Multi-Port Architecture

```
┌──────────────────────┐
│   FastAPI Server     │
│   (Port 8000)        │
└──────────┬───────────┘
           │
    ┌──────┴────────────────┐
    │                       │
┌───▼────────────┐  ┌───────▼────────┐
│ llama-server   │  │ llama-server   │
│ Model A:8080   │  │ Model B:8081   │
└────────────────┘  └────────────────┘
```
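
In this layout the FastAPI process acts mostly as a router: it tracks which port the active model's `llama-server` listens on and forwards chat requests to it. A minimal sketch of that forwarding step, assuming an `httpx`-based proxy (the real `app.py` may structure this differently):

```python
import httpx

ACTIVE_PORT = 8080  # port of the currently active llama-server (illustrative)


async def forward_chat(payload: dict) -> dict:
    """Forward an OpenAI-style request body to the active llama-server backend."""
    url = f"http://127.0.0.1:{ACTIVE_PORT}/v1/chat/completions"
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(url, json=payload)
        response.raise_for_status()
        return response.json()
```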

## ⚙️ Configuration

### Environment Variables

```bash
# Maximum cached models (default: 2)
MAX_CACHED_MODELS=2

# Base port for llama-server instances (default: 8080)
BASE_PORT=8080

# Default model on startup
DEFAULT_MODEL=deepseek-chat
```
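
If you are adapting the server yourself, these would typically be read once at startup, along the lines of the sketch below (the actual handling in `app.py` may differ):

```python
import os

# Read configuration from the environment, falling back to the documented defaults.
MAX_CACHED_MODELS = int(os.getenv("MAX_CACHED_MODELS", "2"))
BASE_PORT = int(os.getenv("BASE_PORT", "8080"))
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "deepseek-chat")
```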

### Model Configuration

Edit `AVAILABLE_MODELS` in `app.py` to add custom models:

```python
AVAILABLE_MODELS = {
    "my-model": "username/model-name-GGUF:model-file.Q4_K_M.gguf"
}
```

## 📊 API Endpoints

### Status & Models

- `GET /` - API status and current model
- `GET /models` - List available models
- `GET /cache/info` - Cache statistics and cached models

### Model Management

- `POST /switch-model` - Switch active model (with caching)

### Chat Completions

- `POST /v1/chat/completions` - Standard chat completions (OpenAI-compatible)
- `POST /v1/web-chat/completions` - Web-augmented chat with search

### Documentation

- `GET /docs` - Swagger UI interactive documentation
- `GET /redoc` - ReDoc alternative documentation
- `GET /openapi.json` - OpenAPI 3.0 specification export

## 🧪 Testing

```bash
# Test basic chat
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Check cache status
curl http://localhost:8000/cache/info

# Switch models
curl -X POST http://localhost:8000/switch-model \
  -H "Content-Type: application/json" \
  -d '{"model_name": "deepseek-coder"}'
```

## 🔍 Web Search Integration

The web-augmented chat endpoint automatically:
1. Extracts the user's query from the last message
2. Performs a DuckDuckGo web search
3. Injects search results into the LLM context
4. Returns response with source citations
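
A condensed sketch of steps 1-3, using the `duckduckgo_search` package; the prompt layout and the result fields used here are assumptions rather than the server's exact implementation:

```python
from duckduckgo_search import DDGS


def build_augmented_prompt(messages: list[dict], max_results: int = 5) -> str:
    """Search the web for the last user message and prepend results as context."""
    query = messages[-1]["content"]                         # 1. extract the user's query
    results = DDGS().text(query, max_results=max_results)   # 2. DuckDuckGo search
    context = "\n".join(
        f"- {r['title']}: {r['body']} ({r['href']})" for r in results
    )
    return f"Web search results:\n{context}\n\nQuestion: {query}"  # 3. inject into context
```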

**Use cases:**
- Current events and news
- Recent developments beyond training data
- Fact-checking with live web data
- Research with source attribution

## 📈 Performance Tips

1. **Cache Size**: Increase `MAX_CACHED_MODELS` if you have sufficient RAM (each model ~4-5GB)
2. **CPU Threads**: Adjust `-t` parameter in `start_llama_server()` based on your CPU cores
3. **Batch Size**: Modify `-b` parameter for throughput vs. latency tradeoff
4. **GPU Acceleration**: Set `-ngl` > 0 if you have a GPU (requires llama.cpp with GPU support)
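
Put together, a tuned `llama-server` invocation could look like the one below; the model path and values are placeholders to adapt to your hardware:

```bash
# Illustrative invocation; flags map to the tips above:
#   -t  CPU threads    -b  batch size    -ngl  GPU layers    -c  context size
llama-server \
  -m ./models/model.Q4_K_M.gguf \
  --port 8080 \
  -t 8 \
  -b 512 \
  -ngl 0 \
  -c 4096
```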

## 🛠️ Development

### Project Structure

```
AGI/
├── app.py                    # Main FastAPI application
├── client_multi_model.py     # Example client
├── Dockerfile                # Docker configuration
├── pyproject.toml            # Python dependencies
└── README.md                 # This file
```

### Adding New Models

1. Find a GGUF model on HuggingFace
2. Add to `AVAILABLE_MODELS` dict
3. Restart the server
4. Switch to your new model via API

## 📝 License

Apache 2.0 - See LICENSE file for details

## 🤝 Contributing

Contributions welcome! Please feel free to submit a Pull Request.

## 🐛 Troubleshooting

### Model fails to load
- Ensure `llama-server` is in your PATH
- Check available disk space for model downloads
- Verify internet connectivity for first-time model downloads

### Out of memory errors
- Reduce `MAX_CACHED_MODELS` to 1
- Use smaller quantized models (Q4_K_M instead of Q8)
- Increase system swap space

### Port conflicts
- Change `BASE_PORT` if 8080+ are in use
- Check for other llama-server instances: `ps aux | grep llama`

## 📚 Additional Resources

- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [Hugging Face Models](https://huggingface.co/models?library=gguf)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)

---

Built with ❤️ using FastAPI and llama.cpp