Dmitry Beresnev committed
Commit 1ccd330 · 1 Parent(s): 2295174

add readme

Files changed (1)
  1. README.md +311 -3
README.md CHANGED
@@ -1,12 +1,320 @@
  ---
- title: AGI
+ title: AGI Multi-Model API
  emoji: 😻
  colorFrom: purple
  colorTo: green
  sdk: docker
  pinned: false
  license: apache-2.0
- short_description: Local AI model
+ short_description: Dynamic multi-model LLM API with intelligent caching and web search
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # AGI Multi-Model API
+
+ A high-performance FastAPI server for running multiple LLM models with intelligent in-memory caching and web search capabilities. Switch between models instantly and augment responses with real-time web data.
+
+ ## ✨ Features
+
+ - **🔄 Dynamic Model Switching**: Hot-swap between 5 different LLM models
+ - **⚡ Intelligent Caching**: LRU cache keeps up to 2 models in memory for instant switching
+ - **🌐 Web-Augmented Chat**: Real-time web search integration via DuckDuckGo
+ - **📚 OpenAI-Compatible API**: Drop-in replacement for OpenAI chat completions
+ - **📖 Auto-Generated Documentation**: Interactive API docs with Swagger UI and ReDoc
+ - **🚀 Optimized Performance**: Continuous batching and multi-port architecture
+
+ ## 🤖 Available Models
+
+ | Model | Use Case | Size |
+ |-------|----------|------|
+ | **deepseek-chat** (default) | General purpose conversation | 7B |
+ | **mistral-7b** | Financial analysis & summarization | 7B |
+ | **openhermes-7b** | Advanced instruction following | 7B |
+ | **deepseek-coder** | Specialized coding assistance | 6.7B |
+ | **llama-7b** | Lightweight & fast responses | 7B |
+
+ ## 🚀 Quick Start
+
+ ### Prerequisites
+
+ - Python 3.10+
+ - `llama-server` (llama.cpp)
+ - 8GB+ RAM (16GB+ recommended for caching multiple models)
+
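+ If `llama-server` is not already installed, one common route is Homebrew or a source build of llama.cpp (a sketch only; exact steps depend on your platform and may change upstream):
+
+ ```bash
+ # Option 1: Homebrew (macOS / Linux)
+ brew install llama.cpp
+
+ # Option 2: build from source
+ git clone https://github.com/ggerganov/llama.cpp
+ cd llama.cpp
+ cmake -B build
+ cmake --build build --config Release
+ # the llama-server binary lands in build/bin/; add that directory to PATH
+ ```
+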
+ ### Installation
+
+ ```bash
+ # Clone the repository
+ git clone <your-repo-url>
+ cd AGI
+
+ # Install dependencies
+ pip install -r requirements.txt
+ # or
+ uv pip install -r pyproject.toml
+
+ # Start the server
+ uvicorn app:app --host 0.0.0.0 --port 8000
+ ```
+
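+ Once the server is up, the root endpoint reports the API status and the currently active model (see the endpoint list below), which makes for a quick smoke test:
+
+ ```bash
+ curl http://localhost:8000/
+ ```
+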
+ ### Docker Deployment
+
+ ```bash
+ # Build the image
+ docker build -t agi-api .
+
+ # Run the container
+ docker run -p 8000:8000 agi-api
+ ```
+
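+ To override the cache size or startup model at deploy time, the environment variables described under Configuration below can be passed to the container (assuming the image reads them at startup):
+
+ ```bash
+ docker run -p 8000:8000 \
+   -e MAX_CACHED_MODELS=2 \
+   -e DEFAULT_MODEL=deepseek-chat \
+   agi-api
+ ```
+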
+ ## 📖 API Documentation
+
+ Once the server is running, access the interactive documentation:
+
+ - **Swagger UI**: http://localhost:8000/docs
+ - **ReDoc**: http://localhost:8000/redoc
+ - **OpenAPI JSON**: http://localhost:8000/openapi.json
+
+ ## 🔧 Usage Examples
+
+ ### Basic Chat Completion
+
+ ```python
+ import requests
+
+ response = requests.post(
+     "http://localhost:8000/v1/chat/completions",
+     json={
+         "messages": [
+             {"role": "user", "content": "What is the capital of France?"}
+         ],
+         "max_tokens": 100,
+         "temperature": 0.7
+     }
+ )
+ print(response.json())
+ ```
+
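+ Because the endpoint follows the OpenAI chat-completions schema, the official `openai` Python client can usually be pointed at the server as well (a sketch; it assumes the compatibility extends to the SDK's request shape, and the `api_key` value is a dummy placeholder since the requests-based examples above send no key):
+
+ ```python
+ from openai import OpenAI
+
+ # Point the client at the local server instead of api.openai.com
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
+
+ completion = client.chat.completions.create(
+     model="deepseek-chat",  # informational here; the active model is set via /switch-model
+     messages=[{"role": "user", "content": "What is the capital of France?"}],
+     max_tokens=100,
+ )
+ print(completion.choices[0].message.content)
+ ```
+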
+ ### Web-Augmented Chat
+
+ ```python
+ response = requests.post(
+     "http://localhost:8000/v1/web-chat/completions",
+     json={
+         "messages": [
+             {"role": "user", "content": "What are the latest AI developments?"}
+         ],
+         "max_tokens": 512,
+         "max_search_results": 5
+     }
+ )
+ result = response.json()
+ print(result["choices"][0]["message"]["content"])
+ print(f"Sources: {result['web_search']['sources']}")
+ ```
+
+ ### Switch Models
+
+ ```python
+ # Switch to coding model
+ response = requests.post(
+     "http://localhost:8000/switch-model",
+     json={"model_name": "deepseek-coder"}
+ )
+ print(response.json())
+ # Output: {"message": "Switched to model: deepseek-coder (from cache)", "model": "deepseek-coder"}
+ ```
+
+ ### Check Cache Status
+
+ ```python
+ response = requests.get("http://localhost:8000/cache/info")
+ cache_info = response.json()
+ print(f"Cached models: {cache_info['current_size']}/{cache_info['max_size']}")
+ for model in cache_info['cached_models']:
+     print(f"  - {model['name']} on port {model['port']}")
+ ```
+
+ ## 🏗️ Architecture
+
+ ### Model Caching System
+
+ The API uses an intelligent LRU (Least Recently Used) cache to manage models in memory:
+
+ ```
+ ┌────────────────────────────────────────────┐
+ │ Request: Switch to Model A                 │
+ ├────────────────────────────────────────────┤
+ │ 1. Check if A is current → Skip            │
+ │ 2. Check cache for A                       │
+ │    ├─ Cache Hit → Instant switch (< 1s)    │
+ │    └─ Cache Miss → Load model (~2-3 min)   │
+ │ 3. If cache full → Evict LRU model         │
+ │ 4. Add A to cache                          │
+ └────────────────────────────────────────────┘
+ ```
+
+ **Benefits:**
+ - First load: ~2-3 minutes (model download + initialization)
+ - Subsequent switches: < 1 second (from cache)
+ - Automatic memory management with LRU eviction
+ - Each model runs on a separate port (8080, 8081, etc.)
+
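+ A minimal sketch of the eviction logic described above (illustrative only; the class and method names here are hypothetical stand-ins for the actual bookkeeping in `app.py`):
+
+ ```python
+ from collections import OrderedDict
+
+ class ModelCache:
+     """LRU cache of running llama-server instances, one port per model."""
+
+     def __init__(self, max_size: int = 2):
+         self.max_size = max_size
+         self._ports: OrderedDict[str, int] = OrderedDict()  # model name -> port
+
+     def get_port(self, name: str) -> int | None:
+         """Return the port of a cached model and mark it most recently used."""
+         if name not in self._ports:
+             return None  # cache miss: caller starts a new llama-server
+         self._ports.move_to_end(name)
+         return self._ports[name]
+
+     def add(self, name: str, port: int) -> str | None:
+         """Cache a newly started model; return the evicted model name, if any."""
+         evicted = None
+         if len(self._ports) >= self.max_size:
+             evicted, _ = self._ports.popitem(last=False)  # drop least recently used
+         self._ports[name] = port
+         return evicted
+ ```
+
+ On a cache hit the switch is essentially a pointer update to an already-running server, which is why cached switches complete in under a second.
+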
+ ### Multi-Port Architecture
+
+ ```
+       ┌─────────────────┐
+       │  FastAPI Server │
+       │   (Port 8000)   │
+       └────────┬────────┘
+                │
+        ┌───────┴────────┐
+        │                │
+ ┌──────▼───────┐  ┌─────▼────────┐
+ │ llama-server │  │ llama-server │
+ │ Model A:8080 │  │ Model B:8081 │
+ └──────────────┘  └──────────────┘
+ ```
+
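+ A rough sketch of the forwarding step (assumptions: `httpx` is available and each `llama-server` instance exposes an OpenAI-compatible `/v1/chat/completions` endpoint on its own port; the actual routing code in `app.py` may differ):
+
+ ```python
+ import httpx
+
+ async def forward_chat_completion(payload: dict, backend_port: int) -> dict:
+     """Relay an OpenAI-style chat request to the llama-server hosting the active model."""
+     url = f"http://127.0.0.1:{backend_port}/v1/chat/completions"
+     async with httpx.AsyncClient(timeout=120.0) as client:
+         response = await client.post(url, json=payload)
+         response.raise_for_status()
+         return response.json()
+ ```
+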
+ ## ⚙️ Configuration
+
+ ### Environment Variables
+
+ ```bash
+ # Maximum cached models (default: 2)
+ MAX_CACHED_MODELS=2
+
+ # Base port for llama-server instances (default: 8080)
+ BASE_PORT=8080
+
+ # Default model on startup
+ DEFAULT_MODEL=deepseek-chat
+ ```
+
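+ A sketch of how these might be picked up at startup (assuming plain `os.environ` lookups with the defaults listed above; the actual parsing in `app.py` may differ):
+
+ ```python
+ import os
+
+ MAX_CACHED_MODELS = int(os.getenv("MAX_CACHED_MODELS", "2"))
+ BASE_PORT = int(os.getenv("BASE_PORT", "8080"))
+ DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "deepseek-chat")
+ ```
+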
+ ### Model Configuration
+
+ Edit `AVAILABLE_MODELS` in `app.py` to add custom models:
+
+ ```python
+ AVAILABLE_MODELS = {
+     "my-model": "username/model-name-GGUF:model-file.Q4_K_M.gguf"
+ }
+ ```
+
+ ## 📊 API Endpoints
+
+ ### Status & Models
+
+ - `GET /` - API status and current model
+ - `GET /models` - List available models
+ - `GET /cache/info` - Cache statistics and cached models
+
+ ### Model Management
+
+ - `POST /switch-model` - Switch active model (with caching)
+
+ ### Chat Completions
+
+ - `POST /v1/chat/completions` - Standard chat completions (OpenAI-compatible)
+ - `POST /v1/web-chat/completions` - Web-augmented chat with search
+
+ ### Documentation
+
+ - `GET /docs` - Swagger UI interactive documentation
+ - `GET /redoc` - ReDoc alternative documentation
+ - `GET /openapi.json` - OpenAPI 3.0 specification export
+
+ ## 🧪 Testing
+
+ ```bash
+ # Test basic chat
+ curl -X POST http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "messages": [{"role": "user", "content": "Hello!"}],
+     "max_tokens": 50
+   }'
+
+ # Check cache status
+ curl http://localhost:8000/cache/info
+
+ # Switch models
+ curl -X POST http://localhost:8000/switch-model \
+   -H "Content-Type: application/json" \
+   -d '{"model_name": "deepseek-coder"}'
+ ```
+
+ ## 🔍 Web Search Integration
+
+ The web-augmented chat endpoint automatically:
+ 1. Extracts the user's query from the last message
+ 2. Performs a DuckDuckGo web search
+ 3. Injects search results into the LLM context
+ 4. Returns the response with source citations
+
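+ A condensed sketch of steps 1-3 (assuming the `duckduckgo_search` package; the exact prompt format the server injects may differ):
+
+ ```python
+ from duckduckgo_search import DDGS
+
+ def build_web_context(messages: list[dict], max_results: int = 5) -> tuple[str, list[str]]:
+     """Search the web for the last user message and format the hits as extra LLM context."""
+     query = messages[-1]["content"]
+     with DDGS() as ddgs:
+         results = list(ddgs.text(query, max_results=max_results))
+     sources = [r["href"] for r in results]
+     context = "\n\n".join(f"{r['title']}\n{r['body']}" for r in results)
+     return f"Web search results:\n\n{context}\n\nAnswer using the results above.", sources
+ ```
+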
+ **Use cases:**
+ - Current events and news
+ - Recent developments beyond training data
+ - Fact-checking with live web data
+ - Research with source attribution
+
+ ## 📈 Performance Tips
+
+ 1. **Cache Size**: Increase `MAX_CACHED_MODELS` if you have sufficient RAM (each model ~4-5GB)
+ 2. **CPU Threads**: Adjust the `-t` parameter in `start_llama_server()` based on your CPU cores
+ 3. **Batch Size**: Modify the `-b` parameter for the throughput vs. latency tradeoff
+ 4. **GPU Acceleration**: Set `-ngl` > 0 if you have a GPU (requires llama.cpp with GPU support)
+
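+ For reference, these map onto `llama-server` flags roughly as follows (an illustrative invocation; paths and values are placeholders, and the real arguments are assembled inside `start_llama_server()`):
+
+ ```bash
+ # -t: CPU threads, -b: batch size, -ngl: GPU layers (needs a GPU-enabled build)
+ llama-server -m model-file.Q4_K_M.gguf --port 8080 -t 8 -b 512 -ngl 0
+ ```
+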
+ ## 🛠️ Development
+
+ ### Project Structure
+
+ ```
+ AGI/
+ ├── app.py                    # Main FastAPI application
+ ├── client_multi_model.py     # Example client
+ ├── Dockerfile                # Docker configuration
+ ├── pyproject.toml            # Python dependencies
+ └── README.md                 # This file
+ ```
+
+ ### Adding New Models
+
+ 1. Find a GGUF model on HuggingFace
+ 2. Add it to the `AVAILABLE_MODELS` dict
+ 3. Restart the server
+ 4. Switch to your new model via the API
+
+ ## 📝 License
+
+ Apache 2.0 - see the LICENSE file for details
+
+ ## 🤝 Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request.
+
+ ## 🐛 Troubleshooting
+
+ ### Model fails to load
+ - Ensure `llama-server` is in your PATH
+ - Check available disk space for model downloads
+ - Verify internet connectivity for first-time model downloads
+
+ ### Out of memory errors
+ - Reduce `MAX_CACHED_MODELS` to 1
+ - Use smaller quantized models (Q4_K_M instead of Q8)
+ - Increase system swap space
+
+ ### Port conflicts
+ - Change `BASE_PORT` if 8080+ are in use
+ - Check for other llama-server instances: `ps aux | grep llama`
+
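+ To see what is already bound to the backend ports (either command works on most Linux systems):
+
+ ```bash
+ ss -ltnp | grep ':808'
+ # or
+ lsof -i :8080
+ ```
+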
+ ## 📚 Additional Resources
+
+ - [FastAPI Documentation](https://fastapi.tiangolo.com/)
+ - [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
+ - [Hugging Face Models](https://huggingface.co/models?library=gguf)
+ - [OpenAI API Reference](https://platform.openai.com/docs/api-reference)
+
+ ---
+
+ Built with ❤️ using FastAPI and llama.cpp