Spaces:

MCP-1st-Birthday
/

DeepBoner

Running

App Files Files Community

DeepBoner / docs /STATUS_LLAMAINDEX_INTEGRATION.md

VibecoderMcSwaggins

feat: Wire LlamaIndex RAG into Simple Mode (Tiered Embedding) (#83)

7baf8ba unverified 9 days ago

preview code

raw

history blame contribute delete

8.51 kB

A newer version of the Gradio SDK is available: 6.0.2

Upgrade

After This PR: What's Working, What's Missing, What's Next

TL;DR: DeepBoner is a fully working biomedical research agent. The LlamaIndex integration we just completed is wired in correctly. The system can search PubMed, ClinicalTrials.gov, and Europe PMC, deduplicate evidence semantically, and generate research reports. It's ready for hackathon submission.

What Does LlamaIndex Actually Do Here?

Short answer: LlamaIndex provides better embeddings + persistence when you have an OpenAI API key.

User has OPENAI_API_KEY → LlamaIndex (OpenAI embeddings, disk persistence)
User has NO API key     → Local embeddings (sentence-transformers, in-memory)

What it does:

Embeds evidence - Converts paper abstracts to vectors for semantic search
Stores to disk - Evidence survives app restart (ChromaDB PersistentClient)
Deduplicates - Prevents storing 99% similar papers (0.9 threshold)
Retrieves context - Judge gets top-30 semantically relevant papers, not random ones

What it does NOT do:

Primary search - PubMed/ClinicalTrials return results; LlamaIndex stores them
Ranking - No reranking of search results (they come pre-ranked from APIs)
Query routing - Doesn't decide which database to search

Is This a "Real" RAG System?

Yes, but simpler than you might expect.

Traditional RAG:     Query → Retrieve from vector DB → Generate with context
DeepBoner's RAG:     Query → Search APIs → Store in vector DB → Judge with context

We're doing "Search-and-Store RAG" not "Retrieve-and-Generate RAG":

Evidence comes from real biomedical APIs (PubMed, etc.), not a pre-built knowledge base
Vector DB is for deduplication + context windowing, not primary retrieval
The "retrieval" happens from external APIs, not from embeddings

This is the RIGHT architecture for a research agent - you want fresh, authoritative sources (PubMed) not a static knowledge base.

Do We Need Neo4j / FAISS / More Complex RAG?

No. Here's why:

You might think you need...	But actually...
Neo4j for knowledge graphs	Evidence relationships are implicit in citations/abstracts
FAISS for fast search	ChromaDB handles our scale (hundreds of papers, not millions)
Complex ingestion pipeline	Our pipeline IS working: Search → Dedupe → Store → Retrieve
Reranking models	PubMed already ranks by relevance; judge handles scoring

The bottleneck is NOT the vector store. It's:

API rate limits (PubMed: 3 req/sec without key, 10 with key)
LLM context windows (judge can only see ~30 papers effectively)
Search query quality (garbage in, garbage out)

What's Actually Working (End-to-End)

Core Research Loop

User Query: "What drugs improve female libido post-menopause?"
    ↓
[1] SearchHandler queries 3 databases in parallel
    ├─ PubMed: 10 results
    ├─ ClinicalTrials.gov: 5 results
    └─ Europe PMC: 10 results
    ↓
[2] ResearchMemory deduplicates (25 → 18 unique)
    ↓
[3] Evidence stored in ChromaDB/LlamaIndex
    ↓
[4] Judge gets top-30 by semantic similarity
    ↓
[5] Judge scores: mechanism=7/10, clinical=6/10
    ↓
[6] Judge says: "Need more on flibanserin mechanism"
    ↓
[7] Loop with new queries (up to 10 iterations)
    ↓
[8] Generate report with drug candidates + findings

What Each Component Does

Component	Status	What It Does
`SearchHandler`	Working	Parallel search across 3 databases
`ResearchMemory`	Working	Stores evidence, tracks hypotheses
`EmbeddingService`	Working	Free tier: local sentence-transformers
`LlamaIndexRAGService`	Working	Premium tier: OpenAI embeddings + persistence
`JudgeHandler`	Working	LLM scores evidence, suggests next queries
`SimpleOrchestrator`	Working	Main research loop (search → judge → synthesize)
`AdvancedOrchestrator`	Working	Multi-agent mode (requires agent-framework)
Gradio UI	Working	Chat interface with streaming events

What's Missing (But Not Blocking)

1. Active Knowledge Base Querying (P2)

Currently: Judge guesses what to search next Should: Judge checks "what do we already have?" before suggesting new queries

Impact: Could reduce redundant searches Effort: Medium (modify judge prompt to include memory summary)

2. Evidence Diversity Selection (P2)

Currently: Judge sees top-30 by relevance (might be redundant) Should: Use MMR (Maximal Marginal Relevance) for diversity

Impact: Better coverage of different perspectives Effort: Low (we have select_diverse_evidence() but it's not used everywhere)

3. Singleton Pattern for LlamaIndex (P3)

Currently: Each call creates new LlamaIndexRAGService instance Should: Cache like _shared_model in EmbeddingService

Impact: Minor performance improvement Effort: Low

4. Evidence Quality Scoring (P3)

Currently: Judge gives overall scores (mechanism + clinical) Should: Score each paper (study design, sample size, etc.)

Impact: Better synthesis quality Effort: High (significant prompt engineering)

What's Definitely NOT Needed

Over-engineering	Why it's unnecessary
GraphRAG / Neo4j	Our scale is hundreds of papers, not knowledge graphs
FAISS / Pinecone	ChromaDB handles our volume fine
Custom embedding models	OpenAI/sentence-transformers work great for biomedical text
Complex chunking strategies	We're storing abstracts (already short)
Hybrid search (BM25 + vector)	APIs already do keyword matching

Hackathon Submission Checklist

Core research loop working
3 biomedical databases integrated (PubMed, ClinicalTrials, Europe PMC)
Semantic deduplication working
Judge assessment working
Report generation working
Gradio UI working
202 tests passing
Tiered embedding service (free vs premium)
LlamaIndex integration complete

You're ready to submit.

Post-Hackathon Roadmap

Phase 1: Polish (1-2 days)

Add singleton pattern for LlamaIndex service
Integration test with real API keys
Verify persistence works on HuggingFace Spaces

Phase 2: Intelligence (1 week)

Judge queries memory before suggesting searches
MMR diversity selection for evidence context
Hypothesis-driven search refinement

Phase 3: Scale (2+ weeks)

Rate limit handling improvements
Batch embedding for large evidence sets
Multi-query parallelization
Export to structured formats (JSON, BibTeX)

Phase 4: Production (future)

User authentication
Persistent user sessions
Evidence caching across users
Usage analytics

Quick Reference: Where Things Are

src/
├── orchestrators/
│   ├── simple.py          # Main research loop (START HERE)
│   └── advanced.py        # Multi-agent mode
├── services/
│   ├── embeddings.py      # Free tier (sentence-transformers)
│   ├── llamaindex_rag.py  # Premium tier (OpenAI + persistence)
│   ├── embedding_protocol.py  # Interface both implement
│   └── research_memory.py # Evidence storage + retrieval
├── tools/
│   ├── pubmed.py          # PubMed E-utilities
│   ├── clinicaltrials.py  # ClinicalTrials.gov API
│   └── europepmc.py       # Europe PMC API
├── agent_factory/
│   └── judges.py          # LLM judge (assess evidence sufficiency)
└── utils/
    ├── config.py          # Environment variables
    ├── service_loader.py  # Tiered service selection
    └── models.py          # Evidence, Citation, etc.

The Bottom Line

DeepBoner is not missing anything critical. The LlamaIndex integration you just completed was the last major infrastructure piece. What remains is optimization and polish, not core functionality.

The system works like this:

Search real databases (not a vector store)
Store + deduplicate (this is where LlamaIndex helps)
Judge with context (top-30 semantically relevant papers)
Loop or synthesize (code-enforced decision)

This is a sensible architecture for a research agent. You don't need more complexity - you need to ship it.