# CLIP Inference with AMD Ryzen AI
This repository contains a Python script that runs CLIP (Contrastive Language-Image Pre-training) inference on the CIFAR-100 dataset, on either the AMD Ryzen AI NPU or the CPU. It demonstrates the zero-shot image classification capabilities of the CLIP model. This version targets Ryzen AI (RAI) 1.5.
### Installation instructions
You need a working RAI 1.5 environment. Follow the [Ryzen AI Installation Guide](https://ryzenai.docs.amd.com/en/latest/inst.html) to set it up.
1. Activate your conda environment:
```bash
conda activate ryzen-ai-1.5.0
```
2. Unzip both cache archives (one for the vision model, one for the text model) so that the extracted directories sit in the same location as the inference script.
3. Install the required Python packages:
```bash
pip install -r requirements.txt
```
### Required Files
Ensure the following files are present in the same directory as `clip_inference.py`:
#### ONNX Model Files
- `clip_text_model.onnx` - ONNX text encoder model
- `clip_vision_model.onnx` - ONNX vision encoder model
#### Configuration Files (for NPU execution)
- `vitisai_config.json` - VitisAI configuration
#### Model Cache Directories
- `clip_text_model_cache/` - Cached text model artifacts
- `clip_vision_model_cache/` - Cached vision model artifacts
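
A quick pre-flight check can catch a missing file before a session is created. The sketch below is not part of `clip_inference.py`; the helper name `find_missing` is hypothetical, and it assumes that the cache directories, like the configuration file, are only needed for NPU execution.

```python
from pathlib import Path

# File and directory names are taken from the list above.
REQUIRED_FILES = ["clip_text_model.onnx", "clip_vision_model.onnx"]
NPU_ONLY_FILES = ["vitisai_config.json"]
# Assumption: the caches are only needed when running on the NPU.
NPU_ONLY_DIRS = ["clip_text_model_cache", "clip_vision_model_cache"]


def find_missing(base_dir, use_npu):
    """Return the names of required files/directories missing from base_dir."""
    base = Path(base_dir)
    missing = [name for name in REQUIRED_FILES if not (base / name).is_file()]
    if use_npu:
        missing += [name for name in NPU_ONLY_FILES if not (base / name).is_file()]
        missing += [name + "/" for name in NPU_ONLY_DIRS if not (base / name).is_dir()]
    return missing


if __name__ == "__main__":
    missing = find_missing(".", use_npu=True)
    print("Missing:", ", ".join(missing) if missing else "nothing")
```

Run it from the directory that contains `clip_inference.py`; an empty result means everything listed above is in place.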
### Cache Directory Structure
The cache directories contain pre-compiled model artifacts and optimization files.
They eliminate the need to compile the models, which can be time-consuming.
CLIP uses two models, so two caches are provided, each as a zip file.
#### Cache Directory Descriptions
- **Root Level Files**: Contain compilation metadata, graph analysis, and performance summaries
- **`cache/`**: Hash-based cache storage for model artifacts
- **`vaiml_par_0/`**: Contains compiled model artifacts, MLIR representations, and native libraries
- **`vaiml_partition_fe.flexml/`**: Contains optimized ONNX models and visualization files
**Note**: Cache directories like these are normally generated automatically during the first NPU compilation. Having them in place skips that compilation and significantly reduces startup time.
## Usage
### Command Line Interface
```bash
python clip_inference.py [-h] (--npu | --cpu) [--num_images NUM_IMAGES]
```
### Arguments
**Required (mutually exclusive):**
- `--cpu`: Run inference on CPU using CPUExecutionProvider
- `--npu`: Run inference on NPU using VitisAIExecutionProvider
**Optional:**
- `--num_images`: Number of images to process from CIFAR-100 test set (default: 50, max: 10,000)
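
For readers who want to see how such an interface can be declared, here is a minimal `argparse` sketch that reproduces the usage line and arguments above. It is an illustration, not the script's actual code: the function names are hypothetical, and rejecting values below 1 is an assumption (the README only states the maximum).

```python
import argparse

MAX_IMAGES = 10000  # size of the CIFAR-100 test set, per the README


def build_parser():
    parser = argparse.ArgumentParser(
        prog="clip_inference.py",
        description="CLIP zero-shot classification on CIFAR-100",
    )
    # Exactly one of --npu / --cpu must be given.
    device = parser.add_mutually_exclusive_group(required=True)
    device.add_argument("--npu", action="store_true",
                        help="run on NPU (VitisAIExecutionProvider)")
    device.add_argument("--cpu", action="store_true",
                        help="run on CPU (CPUExecutionProvider)")
    parser.add_argument("--num_images", type=int, default=50,
                        help="number of test images to process (default: 50)")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: values outside 1..MAX_IMAGES are rejected up front.
    if not 1 <= args.num_images <= MAX_IMAGES:
        parser.error(f"--num_images must be between 1 and {MAX_IMAGES}")
    return args


if __name__ == "__main__":
    print(parse_args(["--npu", "--num_images", "100"]))
```

Passing both `--npu` and `--cpu`, or neither, makes `argparse` exit with a usage error, which matches the "required, mutually exclusive" behaviour described above.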
### Examples
1. **CPU inference with default settings (50 images):**
```bash
python clip_inference.py --cpu
```
2. **NPU inference with 100 images:**
```bash
python clip_inference.py --npu --num_images 100
```
3. **NPU inference on complete test dataset:**
```bash
python clip_inference.py --npu --num_images 10000
```
## How It Works
### Model Architecture
- **Text Encoder**: Processes text descriptions ("a photo of a {class_name}")
- **Vision Encoder**: Processes CIFAR-100 images (32x32 RGB)
- **Classification**: Computes similarity between image and text embeddings
### Inference Pipeline
1. **Text Processing**: Pre-compute text features for all 100 CIFAR-100 class labels
2. **Image Processing**: Process each image through the vision encoder
3. **Classification**: Compute cosine similarity between image and text features
4. **Prediction**: Select the class with highest similarity score
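
The four steps above can be sketched with NumPy as follows. This is a simplified illustration, not the script's code: the function names are hypothetical, and small hand-written vectors stand in for the embeddings that the text and vision encoders would produce.

```python
import numpy as np


def build_prompts(class_names):
    """Turn class labels into the text prompts fed to the text encoder."""
    return [f"a photo of a {name}" for name in class_names]


def l2_normalize(x):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def classify(image_features, text_features):
    """Return the index of the most similar class for each image.

    image_features: array of shape (num_images, dim)
    text_features:  array of shape (num_classes, dim)
    """
    similarity = l2_normalize(image_features) @ l2_normalize(text_features).T
    return similarity.argmax(axis=1)


if __name__ == "__main__":
    print(build_prompts(["apple", "bear"]))
    # Toy 2-D embeddings standing in for real encoder outputs.
    text = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    images = np.array([[0.9, 0.1], [0.1, 2.0]])
    print(classify(images, text))  # first image -> class 0, second -> class 1
```

Because the text features depend only on the class labels, they can be computed once and reused for every image, which is why the pipeline pre-computes them in step 1.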
### Performance Optimization
- **NPU Acceleration**: Leverages AMD Ryzen AI NPU for faster inference
- **Caching**: Uses pre-compiled model caches for reduced startup time
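
As a rough illustration of how the execution provider and cache might be selected, the sketch below builds the keyword arguments for an ONNX Runtime session. The helper `session_kwargs` is hypothetical, and the option keys (`config_file`, `cache_dir`, `cache_key`) are assumed from the Vitis AI execution provider documentation; check them against your installed Ryzen AI release. It also assumes that each cache directory name doubles as the cache key.

```python
def session_kwargs(use_npu, cache_key,
                   config_file="vitisai_config.json", cache_dir="."):
    """Build keyword arguments for onnxruntime.InferenceSession.

    Assumption: the pre-compiled cache lives at <cache_dir>/<cache_key>.
    """
    if not use_npu:
        return {"providers": ["CPUExecutionProvider"]}
    return {
        "providers": ["VitisAIExecutionProvider"],
        "provider_options": [{
            "config_file": config_file,  # VitisAI configuration file
            "cache_dir": cache_dir,      # parent directory of the cache
            "cache_key": cache_key,      # cache directory name
        }],
    }


# Intended use (requires onnxruntime with the Vitis AI provider installed):
#   import onnxruntime as ort
#   vision = ort.InferenceSession(
#       "clip_vision_model.onnx",
#       **session_kwargs(True, "clip_vision_model_cache"))

if __name__ == "__main__":
    print(session_kwargs(use_npu=True, cache_key="clip_vision_model_cache"))
```

Keeping the provider choice in one helper means the text and vision sessions are configured the same way and differ only in their model file and cache key.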
## Output Metrics
The script reports the following performance metrics:
- **Text Latency**: Average time per text inference (ms)
- **Text Throughput**: Text inferences per second (inf/s)
- **Vision Latency**: Average time per image inference (ms)
- **Vision Throughput**: Image inferences per second (inf/s)
- **Classification Accuracy**: Percentage of correctly classified images
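
The metrics above could be derived from per-inference timings roughly as shown below. This is a sketch with hypothetical helper names, and it assumes throughput is the reciprocal of the mean latency; that assumption is consistent with the example output (1000 / 26.65 ms ≈ 37.52 inf/s), but the script may compute it differently.

```python
import time


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def summarize(latencies_s):
    """Return (mean latency in ms, throughput in inferences per second)."""
    mean_s = sum(latencies_s) / len(latencies_s)
    return mean_s * 1000.0, 1.0 / mean_s


def accuracy(predictions, labels):
    """Return the percentage of predictions that match their labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


if __name__ == "__main__":
    latency_ms, throughput = summarize([0.02, 0.03])
    print(f"latency: {latency_ms:.2f} ms, throughput: {throughput:.2f} inf/s")
    print(f"accuracy: {accuracy([1, 2, 3, 4], [1, 2, 0, 4]):.2f}%")
```

In practice, each call to a session's `run` method would be wrapped in `timed`, with the elapsed times collected separately for the text and vision encoders.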
### Example Output
**NPU Execution (50 images):**
```
Compilation Done
Session on NPU
Processing images...
Image inference: 100%|███████████████████████████████████████████████████████| 50/50 [00:03<00:00, 13.45it/s]
Results:
Text latency: 26.65 ms
Text throughput: 37.52 inf/s
Vision latency: 73.46 ms
Vision throughput: 13.61 inf/s
Classification accuracy: 77.55%
```