Dmitry Beresnev committed
Commit 1ccd330 · 1 Parent(s): 2295174

add readme

Files changed (1)
  1. README.md +311 -3
README.md CHANGED
@@ -1,12 +1,320 @@
  ---
- title: AGI
+ title: AGI Multi-Model API
  emoji: 😻
  colorFrom: purple
  colorTo: green
  sdk: docker
  pinned: false
  license: apache-2.0
- short_description: Local AI model
+ short_description: Dynamic multi-model LLM API with intelligent caching and web search
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # AGI Multi-Model API
+
+ A high-performance FastAPI server for running multiple LLM models with intelligent in-memory caching and web search capabilities. Switch between models instantly and augment responses with real-time web data.
+
+ ## ✨ Features
+
+ - **🔄 Dynamic Model Switching**: Hot-swap between 5 different LLM models
+ - **⚡ Intelligent Caching**: LRU cache keeps up to 2 models in memory for instant switching
+ - **🌐 Web-Augmented Chat**: Real-time web search integration via DuckDuckGo
+ - **📚 OpenAI-Compatible API**: Drop-in replacement for OpenAI chat completions
+ - **📖 Auto-Generated Documentation**: Interactive API docs with Swagger UI and ReDoc
+ - **🚀 Optimized Performance**: Continuous batching and multi-port architecture
+
+ ## 🤖 Available Models
+
+ | Model | Use Case | Size |
+ |-------|----------|------|
+ | **deepseek-chat** (default) | General purpose conversation | 7B |
+ | **mistral-7b** | Financial analysis & summarization | 7B |
+ | **openhermes-7b** | Advanced instruction following | 7B |
+ | **deepseek-coder** | Specialized coding assistance | 6.7B |
+ | **llama-7b** | Lightweight & fast responses | 7B |
+
+ ## 🚀 Quick Start
+
+ ### Prerequisites
+
+ - Python 3.10+
+ - `llama-server` (llama.cpp)
+ - 8GB+ RAM (16GB+ recommended for caching multiple models)
+
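+ If `llama-server` is not already installed, one common route is Homebrew or a source build of llama.cpp (a sketch only; exact steps depend on your platform and may change upstream):
+
+ ```bash
+ # Option 1: Homebrew (macOS / Linux)
+ brew install llama.cpp
+
+ # Option 2: build from source
+ git clone https://github.com/ggerganov/llama.cpp
+ cd llama.cpp
+ cmake -B build
+ cmake --build build --config Release
+ # the llama-server binary lands in build/bin/; add that directory to PATH
+ ```
+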
+ ### Installation
+
+ ```bash
+ # Clone the repository
+ git clone <your-repo-url>
+ cd AGI
+
+ # Install dependencies
+ pip install -r requirements.txt
+ # or
+ uv pip install -r pyproject.toml
+
+ # Start the server
+ uvicorn app:app --host 0.0.0.0 --port 8000
+ ```
+
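+ Once the server is up, the root endpoint reports the API status and the currently active model (see the endpoint list below), which makes for a quick smoke test:
+
+ ```bash
+ curl http://localhost:8000/
+ ```
+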
+ ### Docker Deployment
+
+ ```bash
+ # Build the image
+ docker build -t agi-api .
+
+ # Run the container
+ docker run -p 8000:8000 agi-api
+ ```
+
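+ To override the cache size or startup model at deploy time, the environment variables described under Configuration below can be passed to the container (assuming the image reads them at startup):
+
+ ```bash
+ docker run -p 8000:8000 \
+   -e MAX_CACHED_MODELS=2 \
+   -e DEFAULT_MODEL=deepseek-chat \
+   agi-api
+ ```
+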
+ ## 📖 API Documentation
+
+ Once the server is running, access the interactive documentation:
+
+ - **Swagger UI**: http://localhost:8000/docs
+ - **ReDoc**: http://localhost:8000/redoc
+ - **OpenAPI JSON**: http://localhost:8000/openapi.json
+
+ ## 🔧 Usage Examples
+
+ ### Basic Chat Completion
+
+ ```python
+ import requests
+
+ response = requests.post(
+     "http://localhost:8000/v1/chat/completions",
+     json={
+         "messages": [
+             {"role": "user", "content": "What is the capital of France?"}
+         ],
+         "max_tokens": 100,
+         "temperature": 0.7
+     }
+ )
+ print(response.json())
+ ```
+
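+ Because the endpoint follows the OpenAI chat-completions schema, the official `openai` Python client can usually be pointed at the server as well (a sketch; it assumes the compatibility extends to the SDK's request shape, and the `api_key` value is a dummy placeholder since the requests-based examples above send no key):
+
+ ```python
+ from openai import OpenAI
+
+ # Point the client at the local server instead of api.openai.com
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
+
+ completion = client.chat.completions.create(
+     model="deepseek-chat",  # informational here; the active model is set via /switch-model
+     messages=[{"role": "user", "content": "What is the capital of France?"}],
+     max_tokens=100,
+ )
+ print(completion.choices[0].message.content)
+ ```
+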
+ ### Web-Augmented Chat
+
+ ```python
+ response = requests.post(
+     "http://localhost:8000/v1/web-chat/completions",
+     json={
+         "messages": [
+             {"role": "user", "content": "What are the latest AI developments?"}
+         ],
+         "max_tokens": 512,
+         "max_search_results": 5
+     }
+ )
+ result = response.json()
+ print(result["choices"][0]["message"]["content"])
+ print(f"Sources: {result['web_search']['sources']}")
+ ```
+
+ ### Switch Models
+
+ ```python
+ # Switch to coding model
+ response = requests.post(
+     "http://localhost:8000/switch-model",
+     json={"model_name": "deepseek-coder"}
+ )
+ print(response.json())
+ # Output: {"message": "Switched to model: deepseek-coder (from cache)", "model": "deepseek-coder"}
+ ```
+
+ ### Check Cache Status
+
+ ```python
+ response = requests.get("http://localhost:8000/cache/info")
+ cache_info = response.json()
+ print(f"Cached models: {cache_info['current_size']}/{cache_info['max_size']}")
+ for model in cache_info['cached_models']:
+     print(f"  - {model['name']} on port {model['port']}")
+ ```
+
+ ## 🏗️ Architecture
+
+ ### Model Caching System
+
+ The API uses an intelligent LRU (Least Recently Used) cache to manage models in memory:
+
+ ```
+ ┌────────────────────────────────────────────┐
+ │ Request: Switch to Model A                 │
+ ├────────────────────────────────────────────┤
+ │ 1. Check if A is current → Skip            │
+ │ 2. Check cache for A                       │
+ │    ├─ Cache Hit → Instant switch (< 1s)    │
+ │    └─ Cache Miss → Load model (~2-3 min)   │
+ │ 3. If cache full → Evict LRU model         │
+ │ 4. Add A to cache                          │
+ └────────────────────────────────────────────┘
+ ```
+
+ **Benefits:**
+ - First load: ~2-3 minutes (model download + initialization)
+ - Subsequent switches: < 1 second (from cache)
+ - Automatic memory management with LRU eviction
+ - Each model runs on a separate port (8080, 8081, etc.)
+
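+ A minimal sketch of the eviction logic described above (illustrative only; the class and method names here are hypothetical stand-ins for the actual bookkeeping in `app.py`):
+
+ ```python
+ from collections import OrderedDict
+
+ class ModelCache:
+     """LRU cache of running llama-server instances, one port per model."""
+
+     def __init__(self, max_size: int = 2):
+         self.max_size = max_size
+         self._ports: OrderedDict[str, int] = OrderedDict()  # model name -> port
+
+     def get_port(self, name: str) -> int | None:
+         """Return the port of a cached model and mark it most recently used."""
+         if name not in self._ports:
+             return None  # cache miss: caller starts a new llama-server
+         self._ports.move_to_end(name)
+         return self._ports[name]
+
+     def add(self, name: str, port: int) -> str | None:
+         """Cache a newly started model; return the evicted model name, if any."""
+         evicted = None
+         if len(self._ports) >= self.max_size:
+             evicted, _ = self._ports.popitem(last=False)  # drop least recently used
+         self._ports[name] = port
+         return evicted
+ ```
+
+ On a cache hit the switch is essentially a pointer update to an already-running server, which is why cached switches complete in under a second.
+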
+ ### Multi-Port Architecture
+
+ ```
+       ┌─────────────────┐
+       │  FastAPI Server │
+       │   (Port 8000)   │
+       └────────┬────────┘
+                │
+        ┌───────┴────────┐
+        │                │
+ ┌──────▼───────┐  ┌─────▼────────┐
+ │ llama-server │  │ llama-server │
+ │ Model A:8080 │  │ Model B:8081 │
+ └──────────────┘  └──────────────┘
+ ```
+
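+ A rough sketch of the forwarding step (assumptions: `httpx` is available and each `llama-server` instance exposes an OpenAI-compatible `/v1/chat/completions` endpoint on its own port; the actual routing code in `app.py` may differ):
+
+ ```python
+ import httpx
+
+ async def forward_chat_completion(payload: dict, backend_port: int) -> dict:
+     """Relay an OpenAI-style chat request to the llama-server hosting the active model."""
+     url = f"http://127.0.0.1:{backend_port}/v1/chat/completions"
+     async with httpx.AsyncClient(timeout=120.0) as client:
+         response = await client.post(url, json=payload)
+         response.raise_for_status()
+         return response.json()
+ ```
+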
+ ## ⚙️ Configuration
+
+ ### Environment Variables
+
+ ```bash
+ # Maximum cached models (default: 2)
+ MAX_CACHED_MODELS=2
+
+ # Base port for llama-server instances (default: 8080)
+ BASE_PORT=8080
+
+ # Default model on startup
+ DEFAULT_MODEL=deepseek-chat
+ ```
+
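+ A sketch of how these might be picked up at startup (assuming plain `os.environ` lookups with the defaults listed above; the actual parsing in `app.py` may differ):
+
+ ```python
+ import os
+
+ MAX_CACHED_MODELS = int(os.getenv("MAX_CACHED_MODELS", "2"))
+ BASE_PORT = int(os.getenv("BASE_PORT", "8080"))
+ DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "deepseek-chat")
+ ```
+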
+ ### Model Configuration
+
+ Edit `AVAILABLE_MODELS` in `app.py` to add custom models:
+
+ ```python
+ AVAILABLE_MODELS = {
+     "my-model": "username/model-name-GGUF:model-file.Q4_K_M.gguf"
+ }
+ ```
+
+ ## 📊 API Endpoints
+
+ ### Status & Models
+
+ - `GET /` - API status and current model
+ - `GET /models` - List available models
+ - `GET /cache/info` - Cache statistics and cached models
+
+ ### Model Management
+
+ - `POST /switch-model` - Switch active model (with caching)
+
+ ### Chat Completions
+
+ - `POST /v1/chat/completions` - Standard chat completions (OpenAI-compatible)
+ - `POST /v1/web-chat/completions` - Web-augmented chat with search
+
+ ### Documentation
+
+ - `GET /docs` - Swagger UI interactive documentation
+ - `GET /redoc` - ReDoc alternative documentation
+ - `GET /openapi.json` - OpenAPI 3.0 specification export
+
+ ## 🧪 Testing
+
+ ```bash
+ # Test basic chat
+ curl -X POST http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "messages": [{"role": "user", "content": "Hello!"}],
+     "max_tokens": 50
+   }'
+
+ # Check cache status
+ curl http://localhost:8000/cache/info
+
+ # Switch models
+ curl -X POST http://localhost:8000/switch-model \
+   -H "Content-Type: application/json" \
+   -d '{"model_name": "deepseek-coder"}'
+ ```
+
+ ## 🔍 Web Search Integration
+
+ The web-augmented chat endpoint automatically:
+ 1. Extracts the user's query from the last message
+ 2. Performs a DuckDuckGo web search
+ 3. Injects search results into the LLM context
+ 4. Returns the response with source citations
+
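+ A condensed sketch of steps 1-3 (assuming the `duckduckgo_search` package; the exact prompt format the server injects may differ):
+
+ ```python
+ from duckduckgo_search import DDGS
+
+ def build_web_context(messages: list[dict], max_results: int = 5) -> tuple[str, list[str]]:
+     """Search the web for the last user message and format the hits as extra LLM context."""
+     query = messages[-1]["content"]
+     with DDGS() as ddgs:
+         results = list(ddgs.text(query, max_results=max_results))
+     sources = [r["href"] for r in results]
+     context = "\n\n".join(f"{r['title']}\n{r['body']}" for r in results)
+     return f"Web search results:\n\n{context}\n\nAnswer using the results above.", sources
+ ```
+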
+ **Use cases:**
+ - Current events and news
+ - Recent developments beyond training data
+ - Fact-checking with live web data
+ - Research with source attribution
+
+ ## 📈 Performance Tips
+
+ 1. **Cache Size**: Increase `MAX_CACHED_MODELS` if you have sufficient RAM (each model ~4-5GB)
+ 2. **CPU Threads**: Adjust the `-t` parameter in `start_llama_server()` based on your CPU cores
+ 3. **Batch Size**: Modify the `-b` parameter for the throughput vs. latency tradeoff
+ 4. **GPU Acceleration**: Set `-ngl` > 0 if you have a GPU (requires llama.cpp with GPU support)
+
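+ For reference, these map onto `llama-server` flags roughly as follows (an illustrative invocation; paths and values are placeholders, and the real arguments are assembled inside `start_llama_server()`):
+
+ ```bash
+ # -t: CPU threads, -b: batch size, -ngl: GPU layers (needs a GPU-enabled build)
+ llama-server -m model-file.Q4_K_M.gguf --port 8080 -t 8 -b 512 -ngl 0
+ ```
+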
+ ## 🛠️ Development
+
+ ### Project Structure
+
+ ```
+ AGI/
+ ├── app.py                    # Main FastAPI application
+ ├── client_multi_model.py     # Example client
+ ├── Dockerfile                # Docker configuration
+ ├── pyproject.toml            # Python dependencies
+ └── README.md                 # This file
+ ```
+
+ ### Adding New Models
+
+ 1. Find a GGUF model on HuggingFace
+ 2. Add it to the `AVAILABLE_MODELS` dict
+ 3. Restart the server
+ 4. Switch to your new model via the API
+
+ ## 📝 License
+
+ Apache 2.0 - see the LICENSE file for details
+
+ ## 🤝 Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request.
+
+ ## 🐛 Troubleshooting
+
+ ### Model fails to load
+ - Ensure `llama-server` is in your PATH
+ - Check available disk space for model downloads
+ - Verify internet connectivity for first-time model downloads
+
+ ### Out of memory errors
+ - Reduce `MAX_CACHED_MODELS` to 1
+ - Use smaller quantized models (Q4_K_M instead of Q8)
+ - Increase system swap space
+
+ ### Port conflicts
+ - Change `BASE_PORT` if 8080+ are in use
+ - Check for other llama-server instances: `ps aux | grep llama`
+
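+ To see what is already bound to the backend ports (either command works on most Linux systems):
+
+ ```bash
+ ss -ltnp | grep ':808'
+ # or
+ lsof -i :8080
+ ```
+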
+ ## 📚 Additional Resources
+
+ - [FastAPI Documentation](https://fastapi.tiangolo.com/)
+ - [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
+ - [Hugging Face Models](https://huggingface.co/models?library=gguf)
+ - [OpenAI API Reference](https://platform.openai.com/docs/api-reference)
+
+ ---
+
+ Built with ❤️ using FastAPI and llama.cpp