VibecoderMcSwaggins committed on
Commit
9639483
·
1 Parent(s): 11888fc

feat: add service loader + SPEC_06 + P0 bug report


Service Loader:
- Create src/utils/service_loader.py for centralized lazy loading
- All orchestrators now use centralized service loading
- Enforces OrchestratorProtocol on Advanced and Hierarchical

SPEC_06 (Simple Mode Synthesis - not yet implemented):
- Research-backed spec for fixing Simple mode never synthesizing
- Identifies 6 root causes, grounded in November 2025 LLM-as-Judge research
- Separates scoring from decision-making (debiased architecture)

P0 Bug Report:
- Documents Simple mode producing only citations (no synthesis)
- Confidence dropping to 0% in late iterations (context overflow)
- Judge never recommending "synthesize" even with 455 sources

CodeRabbit concerns from PR #70 addressed.

docs/bugs/ACTIVE_BUGS.md CHANGED
@@ -2,6 +2,23 @@
2
 
3
  > Last updated: 2025-11-29
4
 
5
  ## P3 - Edge Case
6
 
7
  *(None)*
 
2
 
3
  > Last updated: 2025-11-29
4
 
5
+ ## P0 - Blocker
6
+
7
+ ### P0 - Simple Mode Never Synthesizes
8
+ **File:** `P0_SIMPLE_MODE_NEVER_SYNTHESIZES.md`
9
+
10
+ **Symptom:** Simple mode finds 455 sources but outputs only citations (no synthesis).
11
+
12
+ **Root Causes:**
13
+ 1. Judge never recommends "synthesize" (prompt too conservative)
14
+ 2. Confidence drops to 0% in late iterations (context overflow / API failure)
15
+ 3. Search derails to tangential topics (bone health instead of libido)
16
+ 4. `_generate_partial_synthesis()` outputs garbage (just citations, no analysis)
17
+
18
+ **Status:** Documented, fix plan ready.
19
+
20
+ ---
21
+
22
  ## P3 - Edge Case
23
 
24
  *(None)*
docs/bugs/P0_SIMPLE_MODE_NEVER_SYNTHESIZES.md ADDED
@@ -0,0 +1,254 @@
1
+ # P0 Bug Report: Simple Mode Never Synthesizes
2
+
3
+ ## Status
4
+ - **Date:** 2025-11-29
5
+ - **Priority:** P0 (Blocker - Simple mode produces useless output)
6
+ - **Component:** `src/orchestrators/simple.py`, `src/agent_factory/judges.py`, `src/prompts/judge.py`
7
+ - **Environment:** Simple mode **WITHOUT OpenAI key** (HuggingFace Inference free tier)
8
+
9
+ ---
10
+
11
+ ## Symptoms
12
+
13
+ When running Simple mode with a real research question:
14
+
15
+ 1. **Judge never recommends "synthesize"** even with 455 sources and 90% confidence
16
+ 2. **Confidence drops to 0%** in late iterations (API failures or context overflow)
17
+ 3. **Search derails** to tangential topics (bone health, muscle mass instead of libido)
18
+ 4. **Max iterations reached** → User gets garbage output (just citations, no synthesis)
19
+
20
+ ### Example Output (Real Run)
21
+
22
+ ```
23
+ 🔍 SEARCHING: What drugs improve female libido post-menopause?
24
+ 📚 SEARCH_COMPLETE: Found 30 new sources (30 total)
25
+ ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 70%) ← Never "synthesize"
26
+
27
+ ... 8 more iterations ...
28
+
29
+ 📚 SEARCH_COMPLETE: Found 10 new sources (429 total)
30
+ ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%) ← API failure?
31
+
32
+ 📚 SEARCH_COMPLETE: Found 26 new sources (455 total)
33
+ ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%) ← Still failing
34
+
35
+ ## Partial Analysis (Max Iterations Reached) ← GARBAGE OUTPUT
36
+ ### Question
37
+ What drugs improve female libido post-menopause?
38
+ ### Status
39
+ Maximum search iterations reached.
40
+ ### Citations
41
+ 1. [Tribulus terrestris and female reproductive...]
42
+ 2. ...
43
+ ---
44
+ *Consider searching with more specific terms* ← NO SYNTHESIS AT ALL
45
+ ```
46
+
47
+ ---
48
+
49
+ ## Root Cause Analysis
50
+
51
+ ### Bug 1: Judge Never Says "sufficient=True"
52
+
53
+ **File:** `src/prompts/judge.py:22-25`
54
+
55
+ ```python
56
+ 3. **Sufficiency**: Evidence is sufficient when:
57
+ - Combined scores >= 12 AND
58
+ - At least one specific drug candidate identified AND
59
+ - Clear mechanistic rationale exists
60
+ ```
61
+
62
+ **Problem:** The prompt is too conservative. With 455 sources spanning testosterone, DHEA, estrogen, oxytocin, etc., the judge should have identified candidates and said "synthesize". But:
63
+
64
+ 1. LLM may not be extracting drug candidates from evidence properly
65
+ 2. The "AND" conditions are too strict - evidence can be "good enough" without hitting all criteria
66
+ 3. The recommendation "continue" seems to be the default state
67
+
68
+ **Evidence:** Output shows 70-90% confidence but still "continue" - the judge is confident but never satisfied.
69
+
70
+ ### Bug 2: Confidence Drops to 0% (Late Iteration Failures)
71
+
72
+ **File:** `src/agent_factory/judges.py:150-183`
73
+
74
+ The `_create_fallback_assessment()` returns:
75
+ - `confidence: 0.0`
76
+ - `recommendation: "continue"`
77
+
78
+ **Problem:** In iterations 9-10, something failed:
79
+ - Context too long (455 sources × ~1500 chars = 680K chars → token limit exceeded)
80
+ - API rate limit hit
81
+ - Network timeout
82
+
83
+ **Evidence:** Confidence went from 80%→0%→0% in final iterations - this is the fallback response.
84
+
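+ To make the failure mode concrete, here is a minimal sketch (not the actual `judges.py` implementation) of how a fallback of this shape turns any LLM failure into the 0%-confidence "continue" seen above:
+
+ ```python
+ # Illustrative sketch only - the real _create_fallback_assessment() returns a
+ # JudgeAssessment model; the dict below just mirrors the fields that matter here.
+ from typing import Any
+
+
+ def create_fallback_assessment(reason: str) -> dict[str, Any]:
+     """Stand-in for the fallback produced when the judge LLM call fails."""
+     return {
+         "sufficient": False,
+         "confidence": 0.0,            # surfaces as "confidence: 0%" in the logs
+         "recommendation": "continue",
+         "reasoning": f"Judge call failed ({reason}); defaulting to continue.",
+     }
+
+
+ def assess_with_fallback(call_llm, prompt: str) -> dict[str, Any]:
+     """Any context overflow, rate limit, or timeout degrades to the fallback above."""
+     try:
+         return call_llm(prompt)
+     except Exception as exc:
+         return create_fallback_assessment(type(exc).__name__)
+ ```
+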
85
+ ### Bug 3: Search Derailment
86
+
87
+ **Evidence from logs:**
88
+ ```
89
+ Next searches: androgen therapy and bone health, androgen therapy and muscle mass...
90
+ Next searches: testosterone therapy in postmenopausal women, mechanisms of testosterone...
91
+ ```
92
+
93
+ **Problem:** Judge's `next_search_queries` drift off-topic. "Bone health" and "muscle mass" are tangential to "female libido". The judge should stay focused on the original question.
94
+
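+ A cheap guard (sketch only, not in the codebase) is to drop suggested queries that share no content words with the original question before running them; embedding similarity would be stricter, but even token overlap catches "bone health" here:
+
+ ```python
+ # Hypothetical helper - names and the stopword list are illustrative.
+ STOPWORDS = {"what", "drugs", "improve", "and", "in", "the", "of", "for"}
+
+
+ def content_words(text: str) -> set[str]:
+     """Lowercased content words with punctuation and stopwords removed."""
+     words = {w.strip("?,.()").lower() for w in text.split()}
+     return {w for w in words if w and w not in STOPWORDS}
+
+
+ def keep_focused_queries(question: str, suggested: list[str]) -> list[str]:
+     """Keep only suggestions that overlap the original question's vocabulary."""
+     anchor = content_words(question)
+     return [q for q in suggested if content_words(q) & anchor]
+
+
+ question = "What drugs improve female libido post-menopause?"
+ suggested = [
+     "androgen therapy and bone health",                        # dropped
+     "testosterone therapy female sexual dysfunction libido",   # kept
+ ]
+ print(keep_focused_queries(question, suggested))
+ ```
+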
95
+ ### Bug 4: Partial Synthesis is Garbage
96
+
97
+ **File:** `src/orchestrators/simple.py:432-470`
98
+
99
+ ```python
100
+ def _generate_partial_synthesis(self, query: str, evidence: list[Evidence]) -> str:
101
+ """Generate a partial synthesis when max iterations reached."""
102
+ citations = "\n".join([...]) # Just citations
103
+
104
+ return f"""## Partial Analysis (Max Iterations Reached)
105
+ ### Question
106
+ {query}
107
+ ### Status
108
+ Maximum search iterations reached. The evidence gathered may be incomplete.
109
+ ### Evidence Collected
110
+ Found {len(evidence)} sources.
111
+ ### Citations
112
+ {citations}
113
+ ---
114
+ *Consider searching with more specific terms*
115
+ """
116
+ ```
117
+
118
+ **Problem:** When max iterations are reached, we have 455 sources but output NO analysis. We should:
119
+ 1. Force a synthesis call to the LLM
120
+ 2. Or at minimum generate drug candidates/findings from the last good assessment
121
+ 3. Not just dump citations and give up
122
+
123
+ ---
124
+
125
+ ## The Fix
126
+
127
+ ### Fix 1: Lower the Bar for "synthesize"
128
+
129
+ **Option A:** Change prompt to be less strict:
130
+ ```python
131
+ SYSTEM_PROMPT = """...
132
+ 3. **Sufficiency**: Evidence is sufficient when:
133
+ - Combined scores >= 10 (was 12) OR
134
+ - Confidence >= 80% with drug candidates identified OR
135
+ - 5+ iterations completed with 100+ sources
136
+ """
137
+ ```
138
+
139
+ **Option B:** Add iteration-based heuristic in orchestrator:
140
+ ```python
141
+ # If we have lots of evidence and high confidence, force synthesis
142
+ if iteration >= 5 and len(all_evidence) > 100 and assessment.confidence > 0.7:
143
+ assessment.sufficient = True
144
+ assessment.recommendation = "synthesize"
145
+ ```
146
+
147
+ ### Fix 2: Handle Context Overflow
148
+
149
+ **File:** `src/agent_factory/judges.py`
150
+
151
+ Before sending to LLM, cap evidence:
152
+ ```python
153
+ async def assess(self, question: str, evidence: list[Evidence]) -> JudgeAssessment:
154
+ # Cap at 50 most recent/relevant to avoid token overflow
155
+ if len(evidence) > 50:
156
+ evidence = evidence[-50:] # keep the most recent; or use embedding similarity to select the best 50
157
+ ```
158
+
159
+ ### Fix 3: Keep Search Focused
160
+
161
+ **File:** `src/prompts/judge.py`
162
+
163
+ Add to prompt:
164
+ ```python
165
+ SYSTEM_PROMPT = """...
166
+ ## Search Query Rules
167
+
168
+ When suggesting next_search_queries:
169
+ - Stay focused on the ORIGINAL question
170
+ - Do NOT drift to tangential topics (e.g., don't search "bone health" for a libido question)
171
+ - Refine existing good terms, don't explore random associations
172
+ """
173
+ ```
174
+
175
+ ### Fix 4: Generate Real Synthesis on Max Iterations
176
+
177
+ **File:** `src/orchestrators/simple.py`
178
+
179
+ ```python
180
+ def _generate_partial_synthesis(self, query: str, evidence: list[Evidence]) -> str:
181
+ """Generate a REAL synthesis when max iterations reached."""
182
+
183
+ # Get the last assessment's data (if available)
184
+ last_assessment = self.history[-1]["assessment"] if self.history else None
185
+
186
+ drug_candidates = last_assessment.get("details", {}).get("drug_candidates", []) if last_assessment else []
187
+ key_findings = last_assessment.get("details", {}).get("key_findings", []) if last_assessment else []
188
+
189
+ drug_list = "\n".join([f"- **{d}**" for d in drug_candidates]) or "- See sources below for candidates"
190
+ findings_list = "\n".join([f"- {f}" for f in key_findings[:5]]) or "- Review citations for findings"
191
+
192
+ citations = "\n".join([
193
+ f"{i + 1}. [{e.citation.title}]({e.citation.url}) ({e.citation.source.upper()})"
194
+ for i, e in enumerate(evidence[:10])
195
+ ])
196
+
197
+ return f"""## Drug Repurposing Analysis (Partial)
198
+
199
+ ### Question
200
+ {query}
201
+
202
+ ### Status
203
+ ⚠️ Maximum iterations reached. Analysis based on {len(evidence)} sources.
204
+
205
+ ### Drug Candidates Identified
206
+ {drug_list}
207
+
208
+ ### Key Findings
209
+ {findings_list}
210
+
211
+ ### Top Citations ({len(evidence)} sources)
212
+ {citations}
213
+
214
+ ---
215
+ *Analysis may be incomplete. Consider refining query or adding API key for better results.*
216
+ """
217
+ ```
218
+
219
+ ---
220
+
221
+ ## Test Plan
222
+
223
+ - [ ] Verify judge says "synthesize" within 5 iterations for good queries
224
+ - [ ] Test with 500+ sources to ensure no token overflow
225
+ - [ ] Verify search stays on-topic (no bone/muscle tangents for libido query)
226
+ - [ ] Verify partial synthesis shows drug candidates (not just citations)
227
+ - [ ] Test with MockJudgeHandler to confirm issue is in LLM behavior
228
+ - [ ] Add unit test: `test_judge_synthesizes_with_good_evidence` (sketched below)
229
+
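+ A sketch of that unit test (the helper `make_strong_assessment` is hypothetical, and it targets the code-enforced `should_synthesize()` proposed in SPEC_06 so the test stays deterministic instead of calling a live LLM):
+
+ ```python
+ def test_judge_synthesizes_with_good_evidence():
+     """Strong scores plus named drug candidates must end the search loop."""
+     # make_strong_assessment is a hypothetical test factory for JudgeAssessment.
+     assessment = make_strong_assessment(
+         mechanism_score=8,
+         clinical_evidence_score=7,
+         drug_candidates=["Testosterone", "DHEA"],
+         confidence=0.85,
+     )
+     should_synth, reason = should_synthesize(
+         assessment, iteration=4, max_iterations=10, evidence_count=120
+     )
+     assert should_synth is True
+     assert reason != "continue_searching"
+ ```
+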
230
+ ---
231
+
232
+ ## Priority Justification
233
+
234
+ **P0** because:
235
+ - Simple mode is the DEFAULT for users without API keys
236
+ - 455 sources found but ZERO useful output generated
237
+ - User waited 10 iterations just to get a citation dump
238
+ - Makes the tool look completely broken
239
+ - Blocks hackathon demo effectiveness
240
+
241
+ ---
242
+
243
+ ## Immediate Workaround
244
+
245
+ 1. Use **Advanced mode** (requires OpenAI key) - it has its own synthesis logic
246
+ 2. Or use **fewer iterations** (MAX_ITERATIONS=3) to hit partial synthesis faster
247
+ 3. Or manually review the citations (they ARE relevant, just not synthesized)
248
+
249
+ ---
250
+
251
+ ## Related Issues
252
+
253
+ - `P0_ORCHESTRATOR_DEDUP_AND_JUDGE_BUGS.md` - Fixed dedup issue, but synthesis problem persists
254
+ - `ACTIVE_BUGS.md` - Update when this is resolved
docs/specs/SPEC_06_SIMPLE_MODE_SYNTHESIS.md ADDED
@@ -0,0 +1,778 @@
1
+ # SPEC 06: Simple Mode Synthesis Fix
2
+
3
+ ## Priority: P0 (Blocker - Simple mode produces garbage output)
4
+
5
+ ## Problem Statement
6
+
7
+ Simple mode (HuggingFace free tier) runs 10 iterations, collects 455 sources, but outputs only a citation dump with no actual synthesis. The user waits through the entire process just to see "Partial Analysis (Max Iterations Reached)" with no drug candidates or analysis.
8
+
9
+ **Observed Behavior** (real run):
10
+ ```
11
+ Iterations 1-8: confidence 70-90%, recommendation="continue" ← Never synthesizes
12
+ Iteration 9-10: confidence 0% ← LLM context overflow
13
+ Final output: Citation list only, no drug candidates, no analysis
14
+ ```
15
+
16
+ ---
17
+
18
+ ## Research Context (November 2025 Best Practices)
19
+
20
+ This spec incorporates findings from current industry research on LLM-as-Judge, RAG systems, and multi-agent orchestration.
21
+
22
+ ### LLM-as-Judge Biases ([Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge), [arXiv Survey](https://arxiv.org/abs/2411.15594))
23
+
24
+ | Bias | Description | Impact on Our System |
25
+ |------|-------------|---------------------|
26
+ | **Verbosity Bias** | LLM judges prefer longer, more detailed responses | Judge defaults to verbose "continue" explanations |
27
+ | **Position Bias** | Systematic preference based on order (primacy/recency) | Most recent evidence over-weighted |
28
+ | **Self-Preference Bias** | LLM favors outputs matching its own generation patterns | Defaults to "comfortable" pattern (continue) |
29
+
30
+ **Key Finding**: "Sophisticated judge models can align with human judgment up to 85%, which is actually higher than human-to-human agreement (81%)." However, this requires careful prompt design and debiasing.
31
+
32
+ ### RAG Context Limits ([Pinecone](https://www.pinecone.io/learn/retrieval-augmented-generation/), [TrueState](https://www.truestate.io/blog/lessons-from-rag))
33
+
34
+ > "Long context didn't kill retrieval. Bigger windows add cost and noise; **retrieval focuses attention where it matters.**"
35
+
36
+ **Key Finding**: RAG is **8-82× cheaper** than long-context approaches. Best practices:
37
+ - **Diverse selection** over recency-only selection
38
+ - **Re-ranking** before sending to judge
39
+ - **Lost-in-the-middle mitigation** - put critical context at prompt edges
40
+
41
+ ### Multi-Agent Termination ([LangGraph Guide](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025), [AWS Guidance](https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/))
42
+
43
+ > "The planning agent evaluates whether output **fully satisfies task objectives**. If so, the workflow is **terminated early**."
44
+
45
+ **Key Finding**: Code-enforced termination criteria outperform LLM-decided termination. The pattern is:
46
+ 1. LLM provides **scores only** (mechanism, clinical, drug candidates)
47
+ 2. Code evaluates scores against **explicit thresholds**
48
+ 3. Code decides synthesize vs continue
49
+
50
+ ---
51
+
52
+ ## Root Cause Analysis
53
+
54
+ ### Bug 1: No Evidence Limit in Judge Prompt (CRITICAL)
55
+
56
+ **File:** `src/prompts/judge.py:62`
57
+
58
+ ```python
59
+ # BROKEN: Sends ALL evidence to the LLM
60
+ evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])
61
+ ```
62
+
63
+ **Impact:**
64
+ - 455 sources × 1700 chars/source = **773,500 characters ≈ 193K tokens**
65
+ - HuggingFace Inference free tier limit: **~4K-8K tokens**
66
+ - Result: **Context overflow → LLM failure → fallback response → 0% confidence**
67
+
68
+ This explains why confidence dropped to 0% in iterations 9-10: the context became too large for the LLM.
69
+
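+ A back-of-the-envelope check (sketch, not project code) reproduces these numbers with the common ~4-characters-per-token heuristic; the exact free-tier limit varies by model, so 8K is an assumption:
+
+ ```python
+ SOURCES = 455
+ CHARS_PER_SOURCE = 1_700
+ CHARS_PER_TOKEN = 4                # rough heuristic for English prose
+ FREE_TIER_TOKEN_LIMIT = 8_000      # assumed HF Inference free-tier ceiling
+
+ total_chars = SOURCES * CHARS_PER_SOURCE            # 773,500
+ approx_tokens = total_chars // CHARS_PER_TOKEN      # 193,375
+ print(approx_tokens / FREE_TIER_TOKEN_LIMIT)        # ~24x over budget
+ ```
+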
70
+ ### Bug 2: LLM Decides Both Scoring AND Recommendation (Anti-Pattern)
71
+
72
+ **Current Design:**
73
+ ```python
74
+ # LLM does BOTH - subject to verbosity/self-preference bias
75
+ "Evaluate evidence... Respond with recommendation: 'continue' or 'synthesize'"
76
+ ```
77
+
78
+ **Problem** (per 2025 research):
79
+ - LLM exhibits **self-preference bias** - defaults to its "comfortable" pattern
80
+ - "Be conservative" instruction triggers **verbosity bias** - prefers longer explanations for "continue"
81
+ - No **separation of concerns** - scoring and decision-making are conflated (contrast the two response shapes sketched below)
82
+
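+ Illustrative contrast only - field names follow the Output Format in Fix 2, not the exact production schema:
+
+ ```python
+ conflated_response = {              # today: the LLM both scores and decides
+     "mechanism_score": 7,
+     "clinical_evidence_score": 6,
+     "recommendation": "continue",   # bias-prone decision made by the LLM
+ }
+
+ scores_only_response = {            # proposed: the LLM scores, code decides
+     "mechanism_score": 7,
+     "clinical_evidence_score": 6,
+     "drug_candidates": ["Testosterone", "DHEA"],
+     "key_findings": ["Testosterone therapy linked to improved desire scores"],
+     "confidence": 0.8,
+     # no authoritative recommendation - should_synthesize() owns that decision
+ }
+ ```
+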
83
+ ### Bug 3: No Diverse Evidence Selection
84
+
85
+ **Current Design:**
86
+ ```python
87
+ # Just truncates to most recent - subject to position bias
88
+ capped_evidence = evidence[-30:]
89
+ ```
90
+
91
+ **Problem** (per RAG research):
92
+ - **Position bias** - most recent ≠ most relevant
93
+ - **Lost-in-the-middle** - important early evidence ignored
94
+ - No **diversity** - may select 30 similar papers
95
+
96
+ ### Bug 4: Prompt Encourages "Continue" Forever
97
+
98
+ **File:** `src/prompts/judge.py:22-32`
99
+
100
+ ```python
101
+ ## Sufficiency Criteria (TOO STRICT - requires ALL conditions)
102
+ - Combined scores >= 12 AND
103
+ - At least one specific drug candidate identified AND
104
+ - Clear mechanistic rationale exists
105
+
106
+ ## Output Rules
107
+ - Be conservative: only recommend "synthesize" when truly confident ← TRIGGERS VERBOSITY BIAS
108
+ ```
109
+
110
+ ### Bug 5: Search Derailment
111
+
112
+ **Evidence from logs:**
113
+ ```
114
+ Next searches: androgen therapy and bone health, androgen therapy and muscle mass...
115
+ ```
116
+
117
+ Original question: "female libido post-menopause" → Judge suggests tangentially related topics.
118
+
119
+ ### Bug 6: Partial Synthesis is Garbage
120
+
121
+ **File:** `src/orchestrators/simple.py:432-470`
122
+
123
+ When max iterations reached, outputs only citations with no analysis, drug candidates, or key findings.
124
+
125
+ ---
126
+
127
+ ## Solution Design
128
+
129
+ ### Architecture Change: Separate Scoring from Decision
130
+
131
+ **Before (biased):**
132
+ ```
133
+ User Question → LLM Judge → { scores, recommendation } → Orchestrator follows recommendation
134
+ ```
135
+
136
+ **After (debiased, per 2025 best practices):**
137
+ ```
138
+ User Question → LLM Judge → { scores only } → Code evaluates → Code decides synthesize/continue
139
+ ```
140
+
141
+ This follows the [Spring AI LLM-as-Judge pattern](https://spring.io/blog/2025/11/10/spring-ai-llm-as-judge-blog-post/): "Run agent in while loop with evaluator, until evaluator says output passes criteria" - but criteria are **code-enforced**, not LLM-decided.
142
+
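+ A minimal sketch of that loop (assumed interfaces: `judge` returns a scores-only assessment, `should_synthesize()` is the code-enforced gate from Fix 3, and `synthesize`/`partial_synthesis` stand in for the orchestrator's synthesis paths):
+
+ ```python
+ async def research_loop(query, search, judge, max_iterations=10):
+     """Evaluator loop where code, not the LLM, decides when to stop."""
+     evidence = []
+     for iteration in range(1, max_iterations + 1):
+         evidence.extend(await search(query))
+         assessment = await judge(query, evidence)        # scores only
+         ok, reason = should_synthesize(                   # code-enforced gate
+             assessment, iteration, max_iterations, len(evidence)
+         )
+         if ok:
+             return synthesize(query, evidence, assessment), reason
+     return partial_synthesis(query, evidence), "max_iterations_reached"
+ ```
+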
143
+ ---
144
+
145
+ ### Fix 1: Diverse Evidence Selection (Not Just Capping)
146
+
147
+ **File:** `src/prompts/judge.py`
148
+
149
+ ```python
150
+ MAX_EVIDENCE_FOR_JUDGE = 30 # Keep under token limits
151
+
152
+ async def select_evidence_for_judge(
153
+ evidence: list[Evidence],
154
+ query: str,
155
+ max_items: int = MAX_EVIDENCE_FOR_JUDGE,
156
+ ) -> list[Evidence]:
157
+ """
158
+ Select diverse, relevant evidence for judge evaluation.
159
+
160
+ Implements RAG best practices (November 2025):
161
+ - Diversity selection over recency-only
162
+ - Lost-in-the-middle mitigation
163
+ - Relevance re-ranking
164
+
165
+ References:
166
+ - https://www.pinecone.io/learn/retrieval-augmented-generation/
167
+ - https://www.truestate.io/blog/lessons-from-rag
168
+ """
169
+ if len(evidence) <= max_items:
170
+ return evidence
171
+
172
+ try:
173
+ from src.utils.text_utils import select_diverse_evidence
174
+ # Use embedding-based diversity selection
175
+ return await select_diverse_evidence(evidence, n=max_items, query=query)
176
+ except ImportError:
177
+ # Fallback: mix of recent + early (lost-in-the-middle mitigation)
178
+ early = evidence[:max_items // 3] # First third
179
+ recent = evidence[-(max_items * 2 // 3):] # Last two-thirds
180
+ return early + recent
181
+
182
+
183
+ def format_user_prompt(
184
+ question: str,
185
+ evidence: list[Evidence],
186
+ iteration: int = 0,
187
+ max_iterations: int = 10,
188
+ total_evidence_count: int | None = None,
189
+ ) -> str:
190
+ """
191
+ Format user prompt with selected evidence and iteration context.
192
+
193
+ NOTE: Evidence should be pre-selected using select_evidence_for_judge().
194
+ This function assumes evidence is already capped.
195
+ """
196
+ total_count = total_evidence_count or len(evidence)
197
+ max_content_len = 1500
198
+
199
+ def format_single_evidence(i: int, e: Evidence) -> str:
200
+ content = e.content
201
+ if len(content) > max_content_len:
202
+ content = content[:max_content_len] + "..."
203
+ return (
204
+ f"### Evidence {i + 1}\n"
205
+ f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
206
+ f"**URL**: {e.citation.url}\n"
207
+ f"**Content**:\n{content}"
208
+ )
209
+
210
+ evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])
211
+
212
+ # Lost-in-the-middle mitigation: put critical context at START and END
213
+ return f"""## Research Question (IMPORTANT - stay focused on this)
214
+ {question}
215
+
216
+ ## Search Progress
217
+ - **Iteration**: {iteration}/{max_iterations}
218
+ - **Total evidence collected**: {total_count} sources
219
+ - **Evidence shown below**: {len(evidence)} diverse sources (selected for relevance)
220
+
221
+ ## Available Evidence
222
+
223
+ {evidence_text}
224
+
225
+ ## Your Task
226
+
227
+ Score this evidence for drug repurposing potential. Provide ONLY scores and extracted data.
228
+ DO NOT decide "synthesize" vs "continue" - that decision is made by the system.
229
+
230
+ ## REMINDER: Original Question (stay focused)
231
+ {question}
232
+ """
233
+ ```
234
+
235
+ ### Fix 2: Debiased Judge Prompt (Scoring Only)
236
+
237
+ **File:** `src/prompts/judge.py`
238
+
239
+ ```python
240
+ SYSTEM_PROMPT = """You are an expert drug repurposing research judge.
241
+
242
+ Your task is to SCORE evidence from biomedical literature. You do NOT decide whether to
243
+ continue searching or synthesize - that decision is made by the orchestration system
244
+ based on your scores.
245
+
246
+ ## Your Role: Scoring Only
247
+
248
+ You provide objective scores. The system decides next steps based on explicit thresholds.
249
+ This separation prevents bias in the decision-making process.
250
+
251
+ ## Scoring Criteria
252
+
253
+ 1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
254
+ - 0-3: No clear mechanism, speculative
255
+ - 4-6: Some mechanistic insight, but gaps exist
256
+ - 7-10: Clear, well-supported mechanism of action
257
+
258
+ 2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
259
+ - 0-3: No clinical data, only theoretical
260
+ - 4-6: Preclinical or early clinical data
261
+ - 7-10: Strong clinical evidence (trials, meta-analyses)
262
+
263
+ 3. **Drug Candidates**: List SPECIFIC drug names mentioned in the evidence
264
+ - Only include drugs explicitly mentioned
265
+ - Do NOT hallucinate or infer drug names
266
+ - Include drug class if specific names aren't available (e.g., "SSRI antidepressants")
267
+
268
+ 4. **Key Findings**: Extract 3-5 key findings from the evidence
269
+ - Focus on findings relevant to the research question
270
+ - Include mechanism insights and clinical outcomes
271
+
272
+ 5. **Confidence (0.0-1.0)**: Your confidence in the scores
273
+ - Based on evidence quality and relevance
274
+ - Lower if evidence is tangential or low-quality
275
+
276
+ ## Output Format
277
+
278
+ Return valid JSON with these fields:
279
+ - details.mechanism_score (int 0-10)
280
+ - details.mechanism_reasoning (string)
281
+ - details.clinical_evidence_score (int 0-10)
282
+ - details.clinical_reasoning (string)
283
+ - details.drug_candidates (list of strings)
284
+ - details.key_findings (list of strings)
285
+ - sufficient (boolean) - TRUE if scores suggest enough evidence
286
+ - confidence (float 0-1)
287
+ - recommendation ("continue" or "synthesize") - Your suggestion (system may override)
288
+ - next_search_queries (list) - If continuing, suggest FOCUSED queries
289
+ - reasoning (string)
290
+
291
+ ## CRITICAL: Search Query Rules
292
+
293
+ When suggesting next_search_queries:
294
+ - STAY FOCUSED on the original research question
295
+ - Do NOT drift to tangential topics
296
+ - If question is about "female libido", do NOT suggest "bone health" or "muscle mass"
297
+ - Refine existing terms, don't explore random medical associations
298
+ - Example: "female libido post-menopause" → "testosterone therapy female sexual dysfunction"
299
+ """
300
+ ```
301
+
302
+ ### Fix 3: Code-Enforced Termination Criteria
303
+
304
+ **File:** `src/orchestrators/simple.py`
305
+
306
+ ```python
307
+ # Termination thresholds (code-enforced, not LLM-decided)
308
+ # Based on multi-agent orchestration best practices (November 2025)
309
+ # Reference: https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/
310
+
311
+ TERMINATION_CRITERIA = {
312
+ "min_combined_score": 12, # mechanism + clinical >= 12
313
+ "min_score_with_volume": 10, # >= 10 if 50+ sources
314
+ "late_iteration_threshold": 8, # >= 8 in iterations 8+
315
+ "max_evidence_threshold": 100, # Force synthesis with 100+ sources
316
+ "emergency_iteration": 8, # Last 2 iterations = emergency mode
317
+ "min_confidence": 0.5, # Minimum confidence for emergency synthesis
318
+ }
319
+
320
+
321
+ def should_synthesize(
322
+ assessment: JudgeAssessment,
323
+ iteration: int,
324
+ max_iterations: int,
325
+ evidence_count: int,
326
+ ) -> tuple[bool, str]:
327
+ """
328
+ Code-enforced synthesis decision.
329
+
330
+ Returns (should_synthesize, reason).
331
+
332
+ This function implements the "explicit termination criteria" pattern
333
+ from multi-agent orchestration best practices. The LLM provides scores,
334
+ but CODE decides when to stop.
335
+
336
+ Reference: https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025
337
+ """
338
+ combined_score = (
339
+ assessment.details.mechanism_score +
340
+ assessment.details.clinical_evidence_score
341
+ )
342
+ has_drug_candidates = len(assessment.details.drug_candidates) > 0
343
+ confidence = assessment.confidence
344
+
345
+ # Priority 1: LLM explicitly says sufficient with good scores
346
+ if assessment.sufficient and assessment.recommendation == "synthesize":
347
+ if combined_score >= 10:
348
+ return True, "judge_approved"
349
+
350
+ # Priority 2: High scores with drug candidates
351
+ if combined_score >= TERMINATION_CRITERIA["min_combined_score"] and has_drug_candidates:
352
+ return True, "high_scores_with_candidates"
353
+
354
+ # Priority 3: Good scores with high evidence volume
355
+ if combined_score >= TERMINATION_CRITERIA["min_score_with_volume"] and evidence_count >= 50:
356
+ return True, "good_scores_high_volume"
357
+
358
+ # Priority 4: Late iteration with acceptable scores (diminishing returns)
359
+ is_late_iteration = iteration >= max_iterations - 2
360
+ if is_late_iteration and combined_score >= TERMINATION_CRITERIA["late_iteration_threshold"]:
361
+ return True, "late_iteration_acceptable"
362
+
363
+ # Priority 5: Very high evidence count (enough to synthesize something)
364
+ if evidence_count >= TERMINATION_CRITERIA["max_evidence_threshold"]:
365
+ return True, "max_evidence_reached"
366
+
367
+ # Priority 6: Emergency synthesis (avoid garbage output)
368
+ if is_late_iteration and evidence_count >= 30 and confidence >= TERMINATION_CRITERIA["min_confidence"]:
369
+ return True, "emergency_synthesis"
370
+
371
+ return False, "continue_searching"
372
+ ```
373
+
374
+ ### Fix 4: Update Orchestrator Decision Phase
375
+
376
+ **File:** `src/orchestrators/simple.py`
377
+
378
+ ```python
379
+ # In the run() method, replace the decision phase:
380
+
381
+ # === DECISION PHASE (Code-Enforced) ===
382
+ should_synth, reason = should_synthesize(
383
+ assessment=assessment,
384
+ iteration=iteration,
385
+ max_iterations=self.config.max_iterations,
386
+ evidence_count=len(all_evidence),
387
+ )
388
+
389
+ logger.info(
390
+ "Synthesis decision",
391
+ should_synthesize=should_synth,
392
+ reason=reason,
393
+ iteration=iteration,
394
+ combined_score=assessment.details.mechanism_score + assessment.details.clinical_evidence_score,
395
+ evidence_count=len(all_evidence),
396
+ confidence=assessment.confidence,
397
+ )
398
+
399
+ if should_synth:
400
+ # Log synthesis trigger reason for debugging
401
+ if reason != "judge_approved":
402
+ logger.info(f"Code-enforced synthesis triggered: {reason}")
403
+
404
+ # Optional Analysis Phase
405
+ async for event in self._run_analysis_phase(query, all_evidence, iteration):
406
+ yield event
407
+
408
+ yield AgentEvent(
409
+ type="synthesizing",
410
+ message=f"Evidence sufficient ({reason})! Preparing synthesis...",
411
+ iteration=iteration,
412
+ )
413
+
414
+ # Generate final response
415
+ final_response = self._generate_synthesis(query, all_evidence, assessment)
416
+
417
+ yield AgentEvent(
418
+ type="complete",
419
+ message=final_response,
420
+ data={
421
+ "evidence_count": len(all_evidence),
422
+ "iterations": iteration,
423
+ "synthesis_reason": reason,
424
+ "drug_candidates": assessment.details.drug_candidates,
425
+ "key_findings": assessment.details.key_findings,
426
+ },
427
+ iteration=iteration,
428
+ )
429
+ return
430
+
431
+ else:
432
+ # Need more evidence - prepare next queries
433
+ current_queries = assessment.next_search_queries or [
434
+ f"{query} mechanism of action",
435
+ f"{query} clinical evidence",
436
+ ]
437
+
438
+ yield AgentEvent(
439
+ type="looping",
440
+ message=(
441
+ f"Gathering more evidence (scores: {assessment.details.mechanism_score}+"
442
+ f"{assessment.details.clinical_evidence_score}). "
443
+ f"Next: {', '.join(current_queries[:2])}..."
444
+ ),
445
+ data={"next_queries": current_queries, "reason": reason},
446
+ iteration=iteration,
447
+ )
448
+ ```
449
+
450
+ ### Fix 5: Real Partial Synthesis
451
+
452
+ **File:** `src/orchestrators/simple.py`
453
+
454
+ ```python
455
+ def _generate_partial_synthesis(
456
+ self,
457
+ query: str,
458
+ evidence: list[Evidence],
459
+ ) -> str:
460
+ """
461
+ Generate a REAL synthesis when max iterations reached.
462
+
463
+ Even when forced to stop, we should provide:
464
+ - Drug candidates (if any were found)
465
+ - Key findings
466
+ - Assessment scores
467
+ - Actionable citations
468
+
469
+ This is still better than a citation dump.
470
+ """
471
+ # Extract data from last assessment if available
472
+ last_assessment = self.history[-1]["assessment"] if self.history else {}
473
+ details = last_assessment.get("details", {})
474
+
475
+ drug_candidates = details.get("drug_candidates", [])
476
+ key_findings = details.get("key_findings", [])
477
+ mechanism_score = details.get("mechanism_score", 0)
478
+ clinical_score = details.get("clinical_evidence_score", 0)
479
+ reasoning = last_assessment.get("reasoning", "Analysis incomplete due to iteration limit.")
480
+
481
+ # Format drug candidates
482
+ if drug_candidates:
483
+ drug_list = "\n".join([f"- **{d}**" for d in drug_candidates[:5]])
484
+ else:
485
+ drug_list = "- *No specific drug candidates identified in evidence*\n- *Try a more specific query or add an API key for better analysis*"
486
+
487
+ # Format key findings
488
+ if key_findings:
489
+ findings_list = "\n".join([f"- {f}" for f in key_findings[:5]])
490
+ else:
491
+ findings_list = "- *Key findings require further analysis*\n- *See citations below for relevant sources*"
492
+
493
+ # Format citations (top 10)
494
+ citations = "\n".join([
495
+ f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
496
+ f"({e.citation.source.upper()}, {e.citation.date})"
497
+ for i, e in enumerate(evidence[:10])
498
+ ])
499
+
500
+ combined_score = mechanism_score + clinical_score
501
+
502
+ return f"""## Drug Repurposing Analysis
503
+
504
+ ### Research Question
505
+ {query}
506
+
507
+ ### Status
508
+ Analysis based on {len(evidence)} sources across {len(self.history)} iterations.
509
+ Maximum iterations reached - results may be incomplete.
510
+
511
+ ### Drug Candidates Identified
512
+ {drug_list}
513
+
514
+ ### Key Findings
515
+ {findings_list}
516
+
517
+ ### Evidence Quality Scores
518
+ | Criterion | Score | Interpretation |
519
+ |-----------|-------|----------------|
520
+ | Mechanism | {mechanism_score}/10 | {"Strong" if mechanism_score >= 7 else "Moderate" if mechanism_score >= 4 else "Limited"} mechanistic evidence |
521
+ | Clinical | {clinical_score}/10 | {"Strong" if clinical_score >= 7 else "Moderate" if clinical_score >= 4 else "Limited"} clinical support |
522
+ | Combined | {combined_score}/20 | {"Sufficient" if combined_score >= 12 else "Partial"} for synthesis |
523
+
524
+ ### Analysis Summary
525
+ {reasoning}
526
+
527
+ ### Top Citations ({len(evidence)} sources total)
528
+ {citations}
529
+
530
+ ---
531
+ *For more complete analysis:*
532
+ - *Add an OpenAI or Anthropic API key for enhanced LLM analysis*
533
+ - *Try a more specific query (e.g., include drug names)*
534
+ - *Use Advanced mode for multi-agent research*
535
+ """
536
+ ```
537
+
538
+ ### Fix 6: Update Judge Handler Signature
539
+
540
+ **File:** `src/orchestrators/base.py`
541
+
542
+ ```python
543
+ class JudgeHandlerProtocol(Protocol):
544
+ """Protocol for judge handler."""
545
+
546
+ async def assess(
547
+ self,
548
+ question: str,
549
+ evidence: list[Evidence],
550
+ iteration: int = 0, # NEW
551
+ max_iterations: int = 10, # NEW
552
+ ) -> JudgeAssessment:
553
+ """Assess evidence quality and provide scores."""
554
+ ...
555
+ ```
556
+
557
+ **File:** `src/agent_factory/judges.py`
558
+
559
+ Update all handlers (`JudgeHandler`, `HFInferenceJudgeHandler`, `MockJudgeHandler`) to:
560
+
561
+ ```python
562
+ async def assess(
563
+ self,
564
+ question: str,
565
+ evidence: list[Evidence],
566
+ iteration: int = 0,
567
+ max_iterations: int = 10,
568
+ ) -> JudgeAssessment:
569
+ """Assess evidence with iteration context."""
570
+ # Select diverse evidence (not just truncate)
571
+ selected_evidence = await select_evidence_for_judge(evidence, question)
572
+
573
+ # Format prompt with iteration context
574
+ user_prompt = format_user_prompt(
575
+ question=question,
576
+ evidence=selected_evidence,
577
+ iteration=iteration,
578
+ max_iterations=max_iterations,
579
+ total_evidence_count=len(evidence),
580
+ )
581
+
582
+ # ... rest of implementation
583
+ ```
584
+
585
+ ---
586
+
587
+ ## Implementation Order
588
+
589
+ | Order | Fix | Priority | Impact |
590
+ |-------|-----|----------|--------|
591
+ | 1 | Diverse evidence selection | CRITICAL | Prevents token overflow + position bias |
592
+ | 2 | Code-enforced termination | CRITICAL | Guarantees synthesis before max iterations |
593
+ | 3 | Debiased judge prompt | HIGH | Removes verbosity/self-preference bias |
594
+ | 4 | Real partial synthesis | HIGH | Ensures useful output even on forced stop |
595
+ | 5 | Update handler signatures | MEDIUM | Enables iteration context |
596
+ | 6 | Update orchestrator | MEDIUM | Integrates all fixes |
597
+
598
+ ---
599
+
600
+ ## Files to Modify
601
+
602
+ | File | Changes |
603
+ |------|---------|
604
+ | `src/prompts/judge.py` | New `select_evidence_for_judge()`, updated `format_user_prompt()`, debiased `SYSTEM_PROMPT` |
605
+ | `src/orchestrators/simple.py` | New `should_synthesize()`, updated decision phase, real `_generate_partial_synthesis()` |
606
+ | `src/orchestrators/base.py` | Update `JudgeHandlerProtocol` signature |
607
+ | `src/agent_factory/judges.py` | Update all handlers with iteration params, use diverse selection |
608
+
609
+ ---
610
+
611
+ ## Test Plan
612
+
613
+ ### Unit Tests
614
+
615
+ ```python
616
+ # tests/unit/prompts/test_judge_prompt.py
617
+
618
+ @pytest.mark.asyncio
619
+ async def test_evidence_selection_diverse():
620
+ """Verify evidence selection includes early and recent items."""
621
+ evidence = [make_evidence(f"Paper {i}") for i in range(100)]
622
+ selected = await select_evidence_for_judge(evidence, "test query", max_items=30)
623
+
624
+ # Should include some early evidence (lost-in-the-middle mitigation)
625
+ titles = [e.citation.title for e in selected]
626
+ assert any("Paper 0" in t or "Paper 1" in t for t in titles)
627
+ assert any("Paper 99" in t or "Paper 98" in t for t in titles)
628
+
629
+
630
+ def test_prompt_includes_question_at_edges():
631
+ """Verify lost-in-the-middle mitigation."""
632
+ evidence = [make_evidence("Test")]
633
+ prompt = format_user_prompt("important question", evidence, iteration=5, max_iterations=10)
634
+
635
+ # Question should appear at START and END of prompt
636
+ lines = prompt.split("\n")
637
+ assert "important question" in lines[1] # Near start
638
+ assert "important question" in lines[-2] # Near end
639
+
640
+
641
+ # tests/unit/orchestrators/test_termination.py
642
+
643
+ def test_should_synthesize_high_scores():
644
+ """High scores with drug candidates triggers synthesis."""
645
+ assessment = make_assessment(mechanism=7, clinical=6, drug_candidates=["Metformin"])
646
+ should_synth, reason = should_synthesize(assessment, iteration=3, max_iterations=10, evidence_count=50)
647
+
648
+ assert should_synth is True
649
+ assert reason == "high_scores_with_candidates"
650
+
651
+
652
+ def test_should_synthesize_late_iteration():
653
+ """Late iteration with acceptable scores triggers synthesis."""
654
+ assessment = make_assessment(mechanism=5, clinical=4, drug_candidates=[])
655
+ should_synth, reason = should_synthesize(assessment, iteration=9, max_iterations=10, evidence_count=80)
656
+
657
+ assert should_synth is True
658
+ assert reason in ["late_iteration_acceptable", "emergency_synthesis"]
659
+
660
+
661
+ def test_should_not_synthesize_early_low_scores():
662
+ """Early iteration with low scores continues searching."""
663
+ assessment = make_assessment(mechanism=3, clinical=2, drug_candidates=[])
664
+ should_synth, reason = should_synthesize(assessment, iteration=2, max_iterations=10, evidence_count=20)
665
+
666
+ assert should_synth is False
667
+ assert reason == "continue_searching"
668
+
669
+
670
+ def test_partial_synthesis_has_drug_candidates():
671
+ """Partial synthesis includes extracted data."""
672
+ orchestrator = Orchestrator(...)
673
+ orchestrator.history = [{
674
+ "assessment": {
675
+ "details": {
676
+ "drug_candidates": ["Testosterone", "DHEA"],
677
+ "key_findings": ["Finding 1", "Finding 2"],
678
+ "mechanism_score": 6,
679
+ "clinical_evidence_score": 5,
680
+ },
681
+ "reasoning": "Good evidence found.",
682
+ }
683
+ }]
684
+
685
+ result = orchestrator._generate_partial_synthesis("test", [make_evidence("Test")])
686
+
687
+ assert "Testosterone" in result
688
+ assert "DHEA" in result
689
+ assert "Drug Candidates" in result
690
+ assert "6/10" in result # mechanism score
691
+ ```
692
+
693
+ ### Integration Tests
694
+
695
+ ```python
696
+ # tests/integration/test_simple_mode_synthesis.py
697
+
698
+ @pytest.mark.asyncio
699
+ async def test_simple_mode_synthesizes_before_max_iterations():
700
+ """Verify simple mode produces useful output with mocked judge."""
701
+ # Mock judge to return good scores
702
+ mock_judge = MockJudgeHandler()
703
+ orchestrator = Orchestrator(
704
+ search_handler=mock_search_handler,
705
+ judge_handler=mock_judge,
706
+ )
707
+
708
+ events = []
709
+ async for event in orchestrator.run("metformin diabetes mechanism"):
710
+ events.append(event)
711
+
712
+ # Must have synthesis with drug candidates
713
+ complete_event = next(e for e in events if e.type == "complete")
714
+ assert "Drug Candidates" in complete_event.message
715
+ assert complete_event.data.get("synthesis_reason") is not None
716
+
717
+
718
+ @pytest.mark.asyncio
719
+ async def test_large_evidence_does_not_crash():
720
+ """Verify 500 sources don't cause token overflow."""
721
+ evidence = [make_evidence(f"Paper {i}") for i in range(500)]
722
+ selected = await select_evidence_for_judge(evidence, "test query")
723
+
724
+ # Should be capped
725
+ assert len(selected) <= MAX_EVIDENCE_FOR_JUDGE
726
+
727
+ # Total chars should be under ~50K (safe for most LLMs)
728
+ prompt = format_user_prompt("test", selected, iteration=5, max_iterations=10, total_evidence_count=500)
729
+ assert len(prompt) < 100_000 # Well under token limits
730
+ ```
731
+
732
+ ---
733
+
734
+ ## Acceptance Criteria
735
+
736
+ - [ ] Evidence sent to judge is diverse-selected (not just truncated)
737
+ - [ ] Prompt includes question at START and END (lost-in-the-middle mitigation)
738
+ - [ ] Code-enforced `should_synthesize()` makes termination decision
739
+ - [ ] Synthesis triggered by iteration 8 with 50+ sources and scores >= 8
740
+ - [ ] Partial synthesis includes drug candidates and scores (not just citations)
741
+ - [ ] Search queries stay on-topic (judge prompt enforces focus)
742
+ - [ ] 500+ sources don't cause LLM crashes
743
+ - [ ] All existing tests pass
744
+
745
+ ---
746
+
747
+ ## Risk Assessment
748
+
749
+ | Risk | Mitigation |
750
+ |------|------------|
751
+ | Diverse selection misses critical evidence | Include relevance scoring in selection |
752
+ | Code-enforced thresholds too aggressive | Log all synthesis decisions for tuning |
753
+ | Prompt changes affect OpenAI/Anthropic differently | Test with all providers |
754
+ | Emergency synthesis produces low-quality output | Still better than citation dump |
755
+
756
+ ---
757
+
758
+ ## Success Metrics
759
+
760
+ | Metric | Before | After |
761
+ |--------|--------|-------|
762
+ | Synthesis rate | 0% | 90%+ |
763
+ | Average iterations to synthesis | 10 (max) | 5-7 |
764
+ | Drug candidates in output | Never | Always (if found) |
765
+ | LLM token overflow errors | Common | None |
766
+ | User-reported "useless output" | Frequent | Rare |
767
+
768
+ ---
769
+
770
+ ## References
771
+
772
+ - [LLM-as-a-Judge Guide - Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge)
773
+ - [Survey on LLM-as-a-Judge - arXiv](https://arxiv.org/abs/2411.15594)
774
+ - [RAG Best Practices - Pinecone](https://www.pinecone.io/learn/retrieval-augmented-generation/)
775
+ - [Lessons from RAG 2025 - TrueState](https://www.truestate.io/blog/lessons-from-rag)
776
+ - [LangGraph Multi-Agent Orchestration 2025](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025)
777
+ - [Multi-Agent Orchestration on AWS](https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/)
778
+ - [Spring AI LLM-as-Judge Pattern](https://spring.io/blog/2025/11/10/spring-ai-llm-as-judge-blog-post/)
src/orchestrators/advanced.py CHANGED
@@ -36,9 +36,11 @@ from src.agents.magentic_agents import (
36
  create_search_agent,
37
  )
38
  from src.agents.state import init_magentic_state
 
39
  from src.utils.config import settings
40
  from src.utils.llm_factory import check_magentic_requirements
41
  from src.utils.models import AgentEvent
 
42
 
43
  if TYPE_CHECKING:
44
  from src.services.embeddings import EmbeddingService
@@ -46,7 +48,7 @@ if TYPE_CHECKING:
46
  logger = structlog.get_logger()
47
 
48
 
49
- class AdvancedOrchestrator:
50
  """
51
  Advanced orchestrator using Microsoft Agent Framework ChatAgent pattern.
52
 
@@ -97,17 +99,7 @@ class AdvancedOrchestrator:
97
 
98
  def _init_embedding_service(self) -> "EmbeddingService | None":
99
  """Initialize embedding service if available."""
100
- try:
101
- from src.services.embeddings import get_embedding_service
102
-
103
- service = get_embedding_service()
104
- logger.info("Embedding service enabled")
105
- return service
106
- except ImportError:
107
- logger.info("Embedding service not available (dependencies missing)")
108
- except Exception as e:
109
- logger.warning("Failed to initialize embedding service", error=str(e))
110
- return None
111
 
112
  def _build_workflow(self) -> Any:
113
  """Build the workflow with ChatAgent participants."""
 
36
  create_search_agent,
37
  )
38
  from src.agents.state import init_magentic_state
39
+ from src.orchestrators.base import OrchestratorProtocol
40
  from src.utils.config import settings
41
  from src.utils.llm_factory import check_magentic_requirements
42
  from src.utils.models import AgentEvent
43
+ from src.utils.service_loader import get_embedding_service_if_available
44
 
45
  if TYPE_CHECKING:
46
  from src.services.embeddings import EmbeddingService
 
48
  logger = structlog.get_logger()
49
 
50
 
51
+ class AdvancedOrchestrator(OrchestratorProtocol):
52
  """
53
  Advanced orchestrator using Microsoft Agent Framework ChatAgent pattern.
54
 
 
99
 
100
  def _init_embedding_service(self) -> "EmbeddingService | None":
101
  """Initialize embedding service if available."""
102
+ return get_embedding_service_if_available()
103
 
104
  def _build_workflow(self) -> Any:
105
  """Build the workflow with ChatAgent participants."""
src/orchestrators/hierarchical.py CHANGED
@@ -19,9 +19,10 @@ import structlog
19
  from src.agents.judge_agent_llm import LLMSubIterationJudge
20
  from src.agents.magentic_agents import create_search_agent
21
  from src.middleware.sub_iteration import SubIterationMiddleware, SubIterationTeam
22
- from src.services.embeddings import get_embedding_service
23
  from src.state import init_magentic_state
24
  from src.utils.models import AgentEvent, OrchestratorConfig
 
25
 
26
  logger = structlog.get_logger()
27
 
@@ -56,7 +57,7 @@ class ResearchTeam(SubIterationTeam):
56
  return "No response from agent."
57
 
58
 
59
- class HierarchicalOrchestrator:
60
  """Orchestrator that uses hierarchical teams and sub-iterations.
61
 
62
  This orchestrator provides:
@@ -96,19 +97,8 @@ class HierarchicalOrchestrator:
96
  """
97
  logger.info("Starting hierarchical orchestrator", query=query)
98
 
99
- try:
100
- service = get_embedding_service()
101
- init_magentic_state(service)
102
- except ImportError:
103
- logger.info("Embedding service not available (dependencies missing)")
104
- init_magentic_state()
105
- except Exception as e:
106
- logger.warning(
107
- "Embedding service initialization failed",
108
- error=str(e),
109
- error_type=type(e).__name__,
110
- )
111
- init_magentic_state()
112
 
113
  yield AgentEvent(type="started", message=f"Starting research: {query}")
114
 
 
19
  from src.agents.judge_agent_llm import LLMSubIterationJudge
20
  from src.agents.magentic_agents import create_search_agent
21
  from src.middleware.sub_iteration import SubIterationMiddleware, SubIterationTeam
22
+ from src.orchestrators.base import OrchestratorProtocol
23
  from src.state import init_magentic_state
24
  from src.utils.models import AgentEvent, OrchestratorConfig
25
+ from src.utils.service_loader import get_embedding_service_if_available
26
 
27
  logger = structlog.get_logger()
28
 
 
57
  return "No response from agent."
58
 
59
 
60
+ class HierarchicalOrchestrator(OrchestratorProtocol):
61
  """Orchestrator that uses hierarchical teams and sub-iterations.
62
 
63
  This orchestrator provides:
 
97
  """
98
  logger.info("Starting hierarchical orchestrator", query=query)
99
 
100
+ service = get_embedding_service_if_available()
101
+ init_magentic_state(service)
102
 
103
  yield AgentEvent(type="started", message=f"Starting research: {query}")
104
 
src/orchestrators/simple.py CHANGED
@@ -72,47 +72,22 @@ class Orchestrator:
72
  self._embeddings: EmbeddingService | None = None
73
 
74
  def _get_analyzer(self) -> StatisticalAnalyzer | None:
75
- """Lazy initialization of StatisticalAnalyzer.
76
-
77
- Note: This imports from src.services, NOT src.agents,
78
- so it works without the magentic optional dependency.
79
-
80
- Returns:
81
- StatisticalAnalyzer instance, or None if Modal is unavailable
82
- """
83
  if self._analyzer is None:
84
- try:
85
- from src.services.statistical_analyzer import get_statistical_analyzer
86
 
87
- self._analyzer = get_statistical_analyzer()
88
- except ImportError:
89
- logger.info("StatisticalAnalyzer not available (Modal dependencies missing)")
90
  self._enable_analysis = False
91
  return self._analyzer
92
 
93
  def _get_embeddings(self) -> EmbeddingService | None:
94
- """Lazy initialization of EmbeddingService.
95
-
96
- Uses local sentence-transformers - NO API key required.
97
-
98
- Returns:
99
- EmbeddingService instance, or None if unavailable
100
- """
101
  if self._embeddings is None and self._enable_embeddings:
102
- try:
103
- from src.services.embeddings import get_embedding_service
104
 
105
- self._embeddings = get_embedding_service()
106
- logger.info("Embedding service enabled for semantic ranking")
107
- except ImportError:
108
- logger.info("Embedding service not available (dependencies missing)")
109
- self._enable_embeddings = False
110
- except Exception as e:
111
- logger.warning(
112
- "Embedding service initialization failed",
113
- error=str(e),
114
- error_type=type(e).__name__,
115
- )
116
  self._enable_embeddings = False
117
  return self._embeddings
118
 
 
72
  self._embeddings: EmbeddingService | None = None
73
 
74
  def _get_analyzer(self) -> StatisticalAnalyzer | None:
75
+ """Lazy initialization of StatisticalAnalyzer."""
76
  if self._analyzer is None:
77
+ from src.utils.service_loader import get_analyzer_if_available
 
78
 
79
+ self._analyzer = get_analyzer_if_available()
80
+ if self._analyzer is None:
 
81
  self._enable_analysis = False
82
  return self._analyzer
83
 
84
  def _get_embeddings(self) -> EmbeddingService | None:
85
+ """Lazy initialization of EmbeddingService."""
86
  if self._embeddings is None and self._enable_embeddings:
87
+ from src.utils.service_loader import get_embedding_service_if_available
 
88
 
89
+ self._embeddings = get_embedding_service_if_available()
90
+ if self._embeddings is None:
91
  self._enable_embeddings = False
92
  return self._embeddings
93
 
src/utils/service_loader.py ADDED
@@ -0,0 +1,71 @@
1
+ """Service loader utility for safe, lazy loading of optional services.
2
+
3
+ This module handles the import and initialization of services that may
4
+ have missing optional dependencies (like Modal or Sentence Transformers),
5
+ preventing the application from crashing if they are not available.
6
+ """
7
+
8
+ from typing import TYPE_CHECKING
9
+
10
+ import structlog
11
+
12
+ if TYPE_CHECKING:
13
+ from src.services.embeddings import EmbeddingService
14
+ from src.services.statistical_analyzer import StatisticalAnalyzer
15
+
16
+ logger = structlog.get_logger()
17
+
18
+
19
+ def get_embedding_service_if_available() -> "EmbeddingService | None":
20
+ """
21
+ Safely attempt to load and initialize the EmbeddingService.
22
+
23
+ Returns:
24
+ EmbeddingService instance if dependencies are met, else None.
25
+ """
26
+ try:
27
+ # Import here to avoid top-level dependency check
28
+ from src.services.embeddings import get_embedding_service
29
+
30
+ service = get_embedding_service()
31
+ logger.info("Embedding service initialized successfully")
32
+ return service
33
+ except ImportError as e:
34
+ logger.info(
35
+ "Embedding service not available (optional dependencies missing)",
36
+ missing_dependency=str(e),
37
+ )
38
+ except Exception as e:
39
+ logger.warning(
40
+ "Embedding service initialization failed unexpectedly",
41
+ error=str(e),
42
+ error_type=type(e).__name__,
43
+ )
44
+ return None
45
+
46
+
47
+ def get_analyzer_if_available() -> "StatisticalAnalyzer | None":
48
+ """
49
+ Safely attempt to load and initialize the StatisticalAnalyzer.
50
+
51
+ Returns:
52
+ StatisticalAnalyzer instance if Modal is available, else None.
53
+ """
54
+ try:
55
+ from src.services.statistical_analyzer import get_statistical_analyzer
56
+
57
+ analyzer = get_statistical_analyzer()
58
+ logger.info("StatisticalAnalyzer initialized successfully")
59
+ return analyzer
60
+ except ImportError as e:
61
+ logger.info(
62
+ "StatisticalAnalyzer not available (Modal dependencies missing)",
63
+ missing_dependency=str(e),
64
+ )
65
+ except Exception as e:
66
+ logger.warning(
67
+ "StatisticalAnalyzer initialization failed unexpectedly",
68
+ error=str(e),
69
+ error_type=type(e).__name__,
70
+ )
71
+ return None
tests/unit/utils/test_service_loader.py ADDED
@@ -0,0 +1,71 @@
1
+ from unittest.mock import MagicMock, patch
2
+
3
+ from src.utils.service_loader import (
4
+ get_analyzer_if_available,
5
+ get_embedding_service_if_available,
6
+ )
7
+
8
+
9
+ def test_get_embedding_service_success():
10
+ """Test successful loading of embedding service."""
11
+ with patch("src.services.embeddings.get_embedding_service") as mock_get:
12
+ mock_service = MagicMock()
13
+ mock_get.return_value = mock_service
14
+
15
+ service = get_embedding_service_if_available()
16
+
17
+ assert service is mock_service
18
+ mock_get.assert_called_once()
19
+
20
+
21
+ def test_get_embedding_service_import_error():
22
+ """Test handling of ImportError when loading embedding service."""
23
+ # Simulate import error by patching the function to raise ImportError
24
+ with patch(
25
+ "src.services.embeddings.get_embedding_service",
26
+ side_effect=ImportError("Missing deps"),
27
+ ):
28
+ service = get_embedding_service_if_available()
29
+ assert service is None
30
+
31
+
32
+ def test_get_embedding_service_generic_error():
33
+ """Test handling of generic Exception when loading embedding service."""
34
+ with patch(
35
+ "src.services.embeddings.get_embedding_service",
36
+ side_effect=ValueError("Boom"),
37
+ ):
38
+ service = get_embedding_service_if_available()
39
+ assert service is None
40
+
41
+
42
+ def test_get_analyzer_success():
43
+ """Test successful loading of analyzer."""
44
+ with patch("src.services.statistical_analyzer.get_statistical_analyzer") as mock_get:
45
+ mock_analyzer = MagicMock()
46
+ mock_get.return_value = mock_analyzer
47
+
48
+ analyzer = get_analyzer_if_available()
49
+
50
+ assert analyzer is mock_analyzer
51
+ mock_get.assert_called_once()
52
+
53
+
54
+ def test_get_analyzer_import_error():
55
+ """Test handling of ImportError when loading analyzer."""
56
+ with patch(
57
+ "src.services.statistical_analyzer.get_statistical_analyzer",
58
+ side_effect=ImportError("No Modal"),
59
+ ):
60
+ analyzer = get_analyzer_if_available()
61
+ assert analyzer is None
62
+
63
+
64
+ def test_get_analyzer_generic_error():
65
+ """Test handling of generic Exception when loading analyzer."""
66
+ with patch(
67
+ "src.services.statistical_analyzer.get_statistical_analyzer",
68
+ side_effect=RuntimeError("Fail"),
69
+ ):
70
+ analyzer = get_analyzer_if_available()
71
+ assert analyzer is None