---
license: apple-amlr
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
tags:
- rag
- compression
- retrieval
- end-to-end
- generation
---

# CLaRa-7B-E2E (Compression-16 & 128)

The **CLaRa-7B-E2E** model is our fully end-to-end unified RAG model, jointly optimizing retrieval and generation with 16× and 128× document compression.

**Training recipe:** End-to-end finetuning with differentiable top-k retrieval and a unified language-modeling objective.

**Benchmarks:** Strong retrieval-augmented QA performance under aggressive compression.

---

## More details and usage examples:

Paper: [CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning](https://arxiv.org/abs/2511.18659)

GitHub: https://github.com/apple/ml-clara

---

## Example Usage (End-to-End Inference)

```python
from transformers import AutoModel

unirag = AutoModel.from_pretrained(
    "/mnt/ceph_rbd/model/CLaRa-7B-E2E/compression-16",
    trust_remote_code=True
).to("cuda")

# Example documents and question
documents = [[
    "Weldenia is a monotypic genus of flowering plant in the family Commelinaceae...",
] * 20]

questions = [
    "Which genus of plant grows originally in Mexico and Guatemala, Phylica or Weldenia?"
]

# End-to-end usage (retrieval + generation)
# The effective top-k is controlled by `generation_top_k` in config.json.
out = unirag.generate_from_questions(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print("Generated answer", out)
```
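
Since the effective top-k is read from `generation_top_k` in `config.json`, one way to change it is to edit that file before loading the model. The snippet below is a minimal sketch, not part of the CLaRa codebase: the helper name `set_generation_top_k` is hypothetical, it assumes `generation_top_k` is a top-level key in `config.json`, and it runs against a throwaway file with placeholder values so it works without the real checkpoint.

```python
import json
import tempfile
from pathlib import Path


def set_generation_top_k(config_path, top_k):
    """Set `generation_top_k` in a config.json file; return the previous value (or None)."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("generation_top_k")  # assumed to be a top-level key
    config["generation_top_k"] = int(top_k)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous


# Demo on a temporary file; the values 5 and 3 are placeholders, not model defaults.
with tempfile.TemporaryDirectory() as tmp:
    demo_config = Path(tmp) / "config.json"
    demo_config.write_text(json.dumps({"generation_top_k": 5}), encoding="utf-8")
    old_value = set_generation_top_k(demo_config, 3)
    new_value = json.loads(demo_config.read_text(encoding="utf-8"))["generation_top_k"]

print("previous:", old_value, "updated:", new_value)
```

To apply this to a real checkpoint, point the helper at the `config.json` inside the model directory passed to `from_pretrained`, and check the actual file first to confirm where the key lives.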