VibecoderMcSwaggins committed on
Commit
9639483
·
1 Parent(s): 11888fc

feat: add service loader + SPEC_06 + P0 bug report


Service Loader:
- Create src/utils/service_loader.py for centralized lazy loading
- All orchestrators now use centralized service loading
- Enforces OrchestratorProtocol on Advanced and Hierarchical

SPEC_06 (Simple Mode Synthesis - not yet implemented):
- Research-backed spec for fixing Simple mode never synthesizing
- Identifies 6 root causes, grounded in November 2025 LLM-as-Judge research
- Separates scoring from decision-making (debiased architecture)

P0 Bug Report:
- Documents Simple mode producing only citations (no synthesis)
- Confidence dropping to 0% in late iterations (context overflow)
- Judge never recommending "synthesize" even with 455 sources

CodeRabbit concerns from PR #70 addressed.

docs/bugs/ACTIVE_BUGS.md CHANGED
@@ -2,6 +2,23 @@
2
 
3
  > Last updated: 2025-11-29
4
 
5
  ## P3 - Edge Case
6
 
7
  *(None)*
 
2
 
3
  > Last updated: 2025-11-29
4
 
5
+ ## P0 - Blocker
6
+
7
+ ### P0 - Simple Mode Never Synthesizes
8
+ **File:** `P0_SIMPLE_MODE_NEVER_SYNTHESIZES.md`
9
+
10
+ **Symptom:** Simple mode finds 455 sources but outputs only citations (no synthesis).
11
+
12
+ **Root Causes:**
13
+ 1. Judge never recommends "synthesize" (prompt too conservative)
14
+ 2. Confidence drops to 0% in late iterations (context overflow / API failure)
15
+ 3. Search derails to tangential topics (bone health instead of libido)
16
+ 4. `_generate_partial_synthesis()` outputs garbage (just citations, no analysis)
17
+
18
+ **Status:** Documented, fix plan ready.
19
+
20
+ ---
21
+
22
  ## P3 - Edge Case
23
 
24
  *(None)*
docs/bugs/P0_SIMPLE_MODE_NEVER_SYNTHESIZES.md ADDED
@@ -0,0 +1,254 @@
1
+ # P0 Bug Report: Simple Mode Never Synthesizes
2
+
3
+ ## Status
4
+ - **Date:** 2025-11-29
5
+ - **Priority:** P0 (Blocker - Simple mode produces useless output)
6
+ - **Component:** `src/orchestrators/simple.py`, `src/agent_factory/judges.py`, `src/prompts/judge.py`
7
+ - **Environment:** Simple mode **WITHOUT OpenAI key** (HuggingFace Inference free tier)
8
+
9
+ ---
10
+
11
+ ## Symptoms
12
+
13
+ When running Simple mode with a real research question:
14
+
15
+ 1. **Judge never recommends "synthesize"** even with 455 sources and 90% confidence
16
+ 2. **Confidence drops to 0%** in late iterations (API failures or context overflow)
17
+ 3. **Search derails** to tangential topics (bone health, muscle mass instead of libido)
18
+ 4. **Max iterations reached** → User gets garbage output (just citations, no synthesis)
19
+
20
+ ### Example Output (Real Run)
21
+
22
+ ```
23
+ 🔍 SEARCHING: What drugs improve female libido post-menopause?
24
+ 📚 SEARCH_COMPLETE: Found 30 new sources (30 total)
25
+ ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 70%) ← Never "synthesize"
26
+
27
+ ... 8 more iterations ...
28
+
29
+ 📚 SEARCH_COMPLETE: Found 10 new sources (429 total)
30
+ ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%) ← API failure?
31
+
32
+ 📚 SEARCH_COMPLETE: Found 26 new sources (455 total)
33
+ ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%) ← Still failing
34
+
35
+ ## Partial Analysis (Max Iterations Reached) ← GARBAGE OUTPUT
36
+ ### Question
37
+ What drugs improve female libido post-menopause?
38
+ ### Status
39
+ Maximum search iterations reached.
40
+ ### Citations
41
+ 1. [Tribulus terrestris and female reproductive...]
42
+ 2. ...
43
+ ---
44
+ *Consider searching with more specific terms* ← NO SYNTHESIS AT ALL
45
+ ```
46
+
47
+ ---
48
+
49
+ ## Root Cause Analysis
50
+
51
+ ### Bug 1: Judge Never Says "sufficient=True"
52
+
53
+ **File:** `src/prompts/judge.py:22-25`
54
+
55
+ ```python
56
+ 3. **Sufficiency**: Evidence is sufficient when:
57
+ - Combined scores >= 12 AND
58
+ - At least one specific drug candidate identified AND
59
+ - Clear mechanistic rationale exists
60
+ ```
61
+
62
+ **Problem:** The prompt is too conservative. With 455 sources spanning testosterone, DHEA, estrogen, oxytocin, etc., the judge should have identified candidates and said "synthesize". But:
63
+
64
+ 1. LLM may not be extracting drug candidates from evidence properly
65
+ 2. The "AND" conditions are too strict - evidence can be "good enough" without hitting all criteria
66
+ 3. The recommendation "continue" seems to be the default state
67
+
68
+ **Evidence:** Output shows 70-90% confidence but still "continue" - the judge is confident but never satisfied.
69
+
70
+ ### Bug 2: Confidence Drops to 0% (Late Iteration Failures)
71
+
72
+ **File:** `src/agent_factory/judges.py:150-183`
73
+
74
+ The `_create_fallback_assessment()` returns:
75
+ - `confidence: 0.0`
76
+ - `recommendation: "continue"`
77
+
78
+ **Problem:** In iterations 9-10, something failed:
79
+ - Context too long (455 sources × ~1500 chars = 680K chars → token limit exceeded)
80
+ - API rate limit hit
81
+ - Network timeout
82
+
83
+ **Evidence:** Confidence went from 80%→0%→0% in final iterations - this is the fallback response.
84
+
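+ To make the failure mode concrete, here is a minimal sketch (not the actual `judges.py` implementation) of how a fallback of this shape turns any LLM failure into the 0%-confidence "continue" seen above:
+
+ ```python
+ # Illustrative sketch only - the real _create_fallback_assessment() returns a
+ # JudgeAssessment model; the dict below just mirrors the fields that matter here.
+ from typing import Any
+
+
+ def create_fallback_assessment(reason: str) -> dict[str, Any]:
+     """Stand-in for the fallback produced when the judge LLM call fails."""
+     return {
+         "sufficient": False,
+         "confidence": 0.0,            # surfaces as "confidence: 0%" in the logs
+         "recommendation": "continue",
+         "reasoning": f"Judge call failed ({reason}); defaulting to continue.",
+     }
+
+
+ def assess_with_fallback(call_llm, prompt: str) -> dict[str, Any]:
+     """Any context overflow, rate limit, or timeout degrades to the fallback above."""
+     try:
+         return call_llm(prompt)
+     except Exception as exc:
+         return create_fallback_assessment(type(exc).__name__)
+ ```
+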
85
+ ### Bug 3: Search Derailment
86
+
87
+ **Evidence from logs:**
88
+ ```
89
+ Next searches: androgen therapy and bone health, androgen therapy and muscle mass...
90
+ Next searches: testosterone therapy in postmenopausal women, mechanisms of testosterone...
91
+ ```
92
+
93
+ **Problem:** Judge's `next_search_queries` drift off-topic. "Bone health" and "muscle mass" are tangential to "female libido". The judge should stay focused on the original question.
94
+
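+ A cheap guard (sketch only, not in the codebase) is to drop suggested queries that share no content words with the original question before running them; embedding similarity would be stricter, but even token overlap catches "bone health" here:
+
+ ```python
+ # Hypothetical helper - names and the stopword list are illustrative.
+ STOPWORDS = {"what", "drugs", "improve", "and", "in", "the", "of", "for"}
+
+
+ def content_words(text: str) -> set[str]:
+     """Lowercased content words with punctuation and stopwords removed."""
+     words = {w.strip("?,.()").lower() for w in text.split()}
+     return {w for w in words if w and w not in STOPWORDS}
+
+
+ def keep_focused_queries(question: str, suggested: list[str]) -> list[str]:
+     """Keep only suggestions that overlap the original question's vocabulary."""
+     anchor = content_words(question)
+     return [q for q in suggested if content_words(q) & anchor]
+
+
+ question = "What drugs improve female libido post-menopause?"
+ suggested = [
+     "androgen therapy and bone health",                        # dropped
+     "testosterone therapy female sexual dysfunction libido",   # kept
+ ]
+ print(keep_focused_queries(question, suggested))
+ ```
+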
95
+ ### Bug 4: Partial Synthesis is Garbage
96
+
97
+ **File:** `src/orchestrators/simple.py:432-470`
98
+
99
+ ```python
100
+ def _generate_partial_synthesis(self, query: str, evidence: list[Evidence]) -> str:
101
+ """Generate a partial synthesis when max iterations reached."""
102
+ citations = "\n".join([...]) # Just citations
103
+
104
+ return f"""## Partial Analysis (Max Iterations Reached)
105
+ ### Question
106
+ {query}
107
+ ### Status
108
+ Maximum search iterations reached. The evidence gathered may be incomplete.
109
+ ### Evidence Collected
110
+ Found {len(evidence)} sources.
111
+ ### Citations
112
+ {citations}
113
+ ---
114
+ *Consider searching with more specific terms*
115
+ """
116
+ ```
117
+
118
+ **Problem:** When max iterations are reached, we have 455 sources but output NO analysis. We should:
119
+ 1. Force a synthesis call to the LLM
120
+ 2. Or at minimum generate drug candidates/findings from the last good assessment
121
+ 3. Not just dump citations and give up
122
+
123
+ ---
124
+
125
+ ## The Fix
126
+
127
+ ### Fix 1: Lower the Bar for "synthesize"
128
+
129
+ **Option A:** Change prompt to be less strict:
130
+ ```python
131
+ SYSTEM_PROMPT = """...
132
+ 3. **Sufficiency**: Evidence is sufficient when:
133
+ - Combined scores >= 10 (was 12) OR
134
+ - Confidence >= 80% with drug candidates identified OR
135
+ - 5+ iterations completed with 100+ sources
136
+ """
137
+ ```
138
+
139
+ **Option B:** Add iteration-based heuristic in orchestrator:
140
+ ```python
141
+ # If we have lots of evidence and high confidence, force synthesis
142
+ if iteration >= 5 and len(all_evidence) > 100 and assessment.confidence > 0.7:
143
+ assessment.sufficient = True
144
+ assessment.recommendation = "synthesize"
145
+ ```
146
+
147
+ ### Fix 2: Handle Context Overflow
148
+
149
+ **File:** `src/agent_factory/judges.py`
150
+
151
+ Before sending to LLM, cap evidence:
152
+ ```python
153
+ async def assess(self, question: str, evidence: list[Evidence]) -> JudgeAssessment:
154
+ # Cap at 50 most recent/relevant to avoid token overflow
155
+ if len(evidence) > 50:
156
+ evidence = evidence[-50:] # keep the most recent; or use embedding similarity to select the best 50
157
+ ```
158
+
159
+ ### Fix 3: Keep Search Focused
160
+
161
+ **File:** `src/prompts/judge.py`
162
+
163
+ Add to prompt:
164
+ ```python
165
+ SYSTEM_PROMPT = """...
166
+ ## Search Query Rules
167
+
168
+ When suggesting next_search_queries:
169
+ - Stay focused on the ORIGINAL question
170
+ - Do NOT drift to tangential topics (e.g., don't search "bone health" for a libido question)
171
+ - Refine existing good terms, don't explore random associations
172
+ """
173
+ ```
174
+
175
+ ### Fix 4: Generate Real Synthesis on Max Iterations
176
+
177
+ **File:** `src/orchestrators/simple.py`
178
+
179
+ ```python
180
+ def _generate_partial_synthesis(self, query: str, evidence: list[Evidence]) -> str:
181
+ """Generate a REAL synthesis when max iterations reached."""
182
+
183
+ # Get the last assessment's data (if available)
184
+ last_assessment = self.history[-1]["assessment"] if self.history else None
185
+
186
+ drug_candidates = last_assessment.get("details", {}).get("drug_candidates", []) if last_assessment else []
187
+ key_findings = last_assessment.get("details", {}).get("key_findings", []) if last_assessment else []
188
+
189
+ drug_list = "\n".join([f"- **{d}**" for d in drug_candidates]) or "- See sources below for candidates"
190
+ findings_list = "\n".join([f"- {f}" for f in key_findings[:5]]) or "- Review citations for findings"
191
+
192
+ citations = "\n".join([
193
+ f"{i + 1}. [{e.citation.title}]({e.citation.url}) ({e.citation.source.upper()})"
194
+ for i, e in enumerate(evidence[:10])
195
+ ])
196
+
197
+ return f"""## Drug Repurposing Analysis (Partial)
198
+
199
+ ### Question
200
+ {query}
201
+
202
+ ### Status
203
+ ⚠️ Maximum iterations reached. Analysis based on {len(evidence)} sources.
204
+
205
+ ### Drug Candidates Identified
206
+ {drug_list}
207
+
208
+ ### Key Findings
209
+ {findings_list}
210
+
211
+ ### Top Citations ({len(evidence)} sources)
212
+ {citations}
213
+
214
+ ---
215
+ *Analysis may be incomplete. Consider refining query or adding API key for better results.*
216
+ """
217
+ ```
218
+
219
+ ---
220
+
221
+ ## Test Plan
222
+
223
+ - [ ] Verify judge says "synthesize" within 5 iterations for good queries
224
+ - [ ] Test with 500+ sources to ensure no token overflow
225
+ - [ ] Verify search stays on-topic (no bone/muscle tangents for libido query)
226
+ - [ ] Verify partial synthesis shows drug candidates (not just citations)
227
+ - [ ] Test with MockJudgeHandler to confirm issue is in LLM behavior
228
+ - [ ] Add unit test: `test_judge_synthesizes_with_good_evidence` (sketched below)
229
+
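+ A sketch of that unit test (the helper `make_strong_assessment` is hypothetical, and it targets the code-enforced `should_synthesize()` proposed in SPEC_06 so the test stays deterministic instead of calling a live LLM):
+
+ ```python
+ def test_judge_synthesizes_with_good_evidence():
+     """Strong scores plus named drug candidates must end the search loop."""
+     # make_strong_assessment is a hypothetical test factory for JudgeAssessment.
+     assessment = make_strong_assessment(
+         mechanism_score=8,
+         clinical_evidence_score=7,
+         drug_candidates=["Testosterone", "DHEA"],
+         confidence=0.85,
+     )
+     should_synth, reason = should_synthesize(
+         assessment, iteration=4, max_iterations=10, evidence_count=120
+     )
+     assert should_synth is True
+     assert reason != "continue_searching"
+ ```
+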
230
+ ---
231
+
232
+ ## Priority Justification
233
+
234
+ **P0** because:
235
+ - Simple mode is the DEFAULT for users without API keys
236
+ - 455 sources found but ZERO useful output generated
237
+ - User waited 10 iterations just to get a citation dump
238
+ - Makes the tool look completely broken
239
+ - Blocks hackathon demo effectiveness
240
+
241
+ ---
242
+
243
+ ## Immediate Workaround
244
+
245
+ 1. Use **Advanced mode** (requires OpenAI key) - it has its own synthesis logic
246
+ 2. Or use **fewer iterations** (MAX_ITERATIONS=3) to hit partial synthesis faster
247
+ 3. Or manually review the citations (they ARE relevant, just not synthesized)
248
+
249
+ ---
250
+
251
+ ## Related Issues
252
+
253
+ - `P0_ORCHESTRATOR_DEDUP_AND_JUDGE_BUGS.md` - Fixed dedup issue, but synthesis problem persists
254
+ - `ACTIVE_BUGS.md` - Update when this is resolved
docs/specs/SPEC_06_SIMPLE_MODE_SYNTHESIS.md ADDED
@@ -0,0 +1,778 @@
1
+ # SPEC 06: Simple Mode Synthesis Fix
2
+
3
+ ## Priority: P0 (Blocker - Simple mode produces garbage output)
4
+
5
+ ## Problem Statement
6
+
7
+ Simple mode (HuggingFace free tier) runs 10 iterations, collects 455 sources, but outputs only a citation dump with no actual synthesis. The user waits through the entire process just to see "Partial Analysis (Max Iterations Reached)" with no drug candidates or analysis.
8
+
9
+ **Observed Behavior** (real run):
10
+ ```
11
+ Iterations 1-8: confidence 70-90%, recommendation="continue" ← Never synthesizes
12
+ Iteration 9-10: confidence 0% ← LLM context overflow
13
+ Final output: Citation list only, no drug candidates, no analysis
14
+ ```
15
+
16
+ ---
17
+
18
+ ## Research Context (November 2025 Best Practices)
19
+
20
+ This spec incorporates findings from current industry research on LLM-as-Judge, RAG systems, and multi-agent orchestration.
21
+
22
+ ### LLM-as-Judge Biases ([Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge), [arXiv Survey](https://arxiv.org/abs/2411.15594))
23
+
24
+ | Bias | Description | Impact on Our System |
25
+ |------|-------------|---------------------|
26
+ | **Verbosity Bias** | LLM judges prefer longer, more detailed responses | Judge defaults to verbose "continue" explanations |
27
+ | **Position Bias** | Systematic preference based on order (primacy/recency) | Most recent evidence over-weighted |
28
+ | **Self-Preference Bias** | LLM favors outputs matching its own generation patterns | Defaults to "comfortable" pattern (continue) |
29
+
30
+ **Key Finding**: "Sophisticated judge models can align with human judgment up to 85%, which is actually higher than human-to-human agreement (81%)." However, this requires careful prompt design and debiasing.
31
+
32
+ ### RAG Context Limits ([Pinecone](https://www.pinecone.io/learn/retrieval-augmented-generation/), [TrueState](https://www.truestate.io/blog/lessons-from-rag))
33
+
34
+ > "Long context didn't kill retrieval. Bigger windows add cost and noise; **retrieval focuses attention where it matters.**"
35
+
36
+ **Key Finding**: RAG is **8-82× cheaper** than long-context approaches. Best practices:
37
+ - **Diverse selection** over recency-only selection
38
+ - **Re-ranking** before sending to judge
39
+ - **Lost-in-the-middle mitigation** - put critical context at prompt edges
40
+
41
+ ### Multi-Agent Termination ([LangGraph Guide](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025), [AWS Guidance](https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/))
42
+
43
+ > "The planning agent evaluates whether output **fully satisfies task objectives**. If so, the workflow is **terminated early**."
44
+
45
+ **Key Finding**: Code-enforced termination criteria outperform LLM-decided termination. The pattern is:
46
+ 1. LLM provides **scores only** (mechanism, clinical, drug candidates)
47
+ 2. Code evaluates scores against **explicit thresholds**
48
+ 3. Code decides synthesize vs continue
49
+
50
+ ---
51
+
52
+ ## Root Cause Analysis
53
+
54
+ ### Bug 1: No Evidence Limit in Judge Prompt (CRITICAL)
55
+
56
+ **File:** `src/prompts/judge.py:62`
57
+
58
+ ```python
59
+ # BROKEN: Sends ALL evidence to the LLM
60
+ evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])
61
+ ```
62
+
63
+ **Impact:**
64
+ - 455 sources × 1700 chars/source = **773,500 characters ≈ 193K tokens**
65
+ - HuggingFace Inference free tier limit: **~4K-8K tokens**
66
+ - Result: **Context overflow → LLM failure → fallback response → 0% confidence**
67
+
68
+ This explains why confidence dropped to 0% in iterations 9-10: the context became too large for the LLM.
69
+
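+ A back-of-the-envelope check (sketch, not project code) reproduces these numbers with the common ~4-characters-per-token heuristic; the exact free-tier limit varies by model, so 8K is an assumption:
+
+ ```python
+ SOURCES = 455
+ CHARS_PER_SOURCE = 1_700
+ CHARS_PER_TOKEN = 4                # rough heuristic for English prose
+ FREE_TIER_TOKEN_LIMIT = 8_000      # assumed HF Inference free-tier ceiling
+
+ total_chars = SOURCES * CHARS_PER_SOURCE            # 773,500
+ approx_tokens = total_chars // CHARS_PER_TOKEN      # 193,375
+ print(approx_tokens / FREE_TIER_TOKEN_LIMIT)        # ~24x over budget
+ ```
+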
70
+ ### Bug 2: LLM Decides Both Scoring AND Recommendation (Anti-Pattern)
71
+
72
+ **Current Design:**
73
+ ```python
74
+ # LLM does BOTH - subject to verbosity/self-preference bias
75
+ "Evaluate evidence... Respond with recommendation: 'continue' or 'synthesize'"
76
+ ```
77
+
78
+ **Problem** (per 2025 research):
79
+ - LLM exhibits **self-preference bias** - defaults to its "comfortable" pattern
80
+ - "Be conservative" instruction triggers **verbosity bias** - prefers longer explanations for "continue"
81
+ - No **separation of concerns** - scoring and decision-making are conflated (contrast the two response shapes sketched below)
82
+
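+ Illustrative contrast only - field names follow the Output Format in Fix 2, not the exact production schema:
+
+ ```python
+ conflated_response = {              # today: the LLM both scores and decides
+     "mechanism_score": 7,
+     "clinical_evidence_score": 6,
+     "recommendation": "continue",   # bias-prone decision made by the LLM
+ }
+
+ scores_only_response = {            # proposed: the LLM scores, code decides
+     "mechanism_score": 7,
+     "clinical_evidence_score": 6,
+     "drug_candidates": ["Testosterone", "DHEA"],
+     "key_findings": ["Testosterone therapy linked to improved desire scores"],
+     "confidence": 0.8,
+     # no authoritative recommendation - should_synthesize() owns that decision
+ }
+ ```
+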
83
+ ### Bug 3: No Diverse Evidence Selection
84
+
85
+ **Current Design:**
86
+ ```python
87
+ # Just truncates to most recent - subject to position bias
88
+ capped_evidence = evidence[-30:]
89
+ ```
90
+
91
+ **Problem** (per RAG research):
92
+ - **Position bias** - most recent ≠ most relevant
93
+ - **Lost-in-the-middle** - important early evidence ignored
94
+ - No **diversity** - may select 30 similar papers
95
+
96
+ ### Bug 4: Prompt Encourages "Continue" Forever
97
+
98
+ **File:** `src/prompts/judge.py:22-32`
99
+
100
+ ```python
101
+ ## Sufficiency Criteria (TOO STRICT - requires ALL conditions)
102
+ - Combined scores >= 12 AND
103
+ - At least one specific drug candidate identified AND
104
+ - Clear mechanistic rationale exists
105
+
106
+ ## Output Rules
107
+ - Be conservative: only recommend "synthesize" when truly confident ← TRIGGERS VERBOSITY BIAS
108
+ ```
109
+
110
+ ### Bug 5: Search Derailment
111
+
112
+ **Evidence from logs:**
113
+ ```
114
+ Next searches: androgen therapy and bone health, androgen therapy and muscle mass...
115
+ ```
116
+
117
+ Original question: "female libido post-menopause" → Judge suggests tangentially related topics.
118
+
119
+ ### Bug 6: Partial Synthesis is Garbage
120
+
121
+ **File:** `src/orchestrators/simple.py:432-470`
122
+
123
+ When max iterations reached, outputs only citations with no analysis, drug candidates, or key findings.
124
+
125
+ ---
126
+
127
+ ## Solution Design
128
+
129
+ ### Architecture Change: Separate Scoring from Decision
130
+
131
+ **Before (biased):**
132
+ ```
133
+ User Question → LLM Judge → { scores, recommendation } → Orchestrator follows recommendation
134
+ ```
135
+
136
+ **After (debiased, per 2025 best practices):**
137
+ ```
138
+ User Question → LLM Judge → { scores only } → Code evaluates → Code decides synthesize/continue
139
+ ```
140
+
141
+ This follows the [Spring AI LLM-as-Judge pattern](https://spring.io/blog/2025/11/10/spring-ai-llm-as-judge-blog-post/): "Run agent in while loop with evaluator, until evaluator says output passes criteria" - but criteria are **code-enforced**, not LLM-decided.
142
+
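+ A minimal sketch of that loop (assumed interfaces: `judge` returns a scores-only assessment, `should_synthesize()` is the code-enforced gate from Fix 3, and `synthesize`/`partial_synthesis` stand in for the orchestrator's synthesis paths):
+
+ ```python
+ async def research_loop(query, search, judge, max_iterations=10):
+     """Evaluator loop where code, not the LLM, decides when to stop."""
+     evidence = []
+     for iteration in range(1, max_iterations + 1):
+         evidence.extend(await search(query))
+         assessment = await judge(query, evidence)        # scores only
+         ok, reason = should_synthesize(                   # code-enforced gate
+             assessment, iteration, max_iterations, len(evidence)
+         )
+         if ok:
+             return synthesize(query, evidence, assessment), reason
+     return partial_synthesis(query, evidence), "max_iterations_reached"
+ ```
+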
143
+ ---
144
+
145
+ ### Fix 1: Diverse Evidence Selection (Not Just Capping)
146
+
147
+ **File:** `src/prompts/judge.py`
148
+
149
+ ```python
150
+ MAX_EVIDENCE_FOR_JUDGE = 30 # Keep under token limits
151
+
152
+ async def select_evidence_for_judge(
153
+ evidence: list[Evidence],
154
+ query: str,
155
+ max_items: int = MAX_EVIDENCE_FOR_JUDGE,
156
+ ) -> list[Evidence]:
157
+ """
158
+ Select diverse, relevant evidence for judge evaluation.
159
+
160
+ Implements RAG best practices (November 2025):
161
+ - Diversity selection over recency-only
162
+ - Lost-in-the-middle mitigation
163
+ - Relevance re-ranking
164
+
165
+ References:
166
+ - https://www.pinecone.io/learn/retrieval-augmented-generation/
167
+ - https://www.truestate.io/blog/lessons-from-rag
168
+ """
169
+ if len(evidence) <= max_items:
170
+ return evidence
171
+
172
+ try:
173
+ from src.utils.text_utils import select_diverse_evidence
174
+ # Use embedding-based diversity selection
175
+ return await select_diverse_evidence(evidence, n=max_items, query=query)
176
+ except ImportError:
177
+ # Fallback: mix of recent + early (lost-in-the-middle mitigation)
178
+ early = evidence[:max_items // 3] # First third
179
+ recent = evidence[-(max_items * 2 // 3):] # Last two-thirds
180
+ return early + recent
181
+
182
+
183
+ def format_user_prompt(
184
+ question: str,
185
+ evidence: list[Evidence],
186
+ iteration: int = 0,
187
+ max_iterations: int = 10,
188
+ total_evidence_count: int | None = None,
189
+ ) -> str:
190
+ """
191
+ Format user prompt with selected evidence and iteration context.
192
+
193
+ NOTE: Evidence should be pre-selected using select_evidence_for_judge().
194
+ This function assumes evidence is already capped.
195
+ """
196
+ total_count = total_evidence_count or len(evidence)
197
+ max_content_len = 1500
198
+
199
+ def format_single_evidence(i: int, e: Evidence) -> str:
200
+ content = e.content
201
+ if len(content) > max_content_len:
202
+ content = content[:max_content_len] + "..."
203
+ return (
204
+ f"### Evidence {i + 1}\n"
205
+ f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
206
+ f"**URL**: {e.citation.url}\n"
207
+ f"**Content**:\n{content}"
208
+ )
209
+
210
+ evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])
211
+
212
+ # Lost-in-the-middle mitigation: put critical context at START and END
213
+ return f"""## Research Question (IMPORTANT - stay focused on this)
214
+ {question}
215
+
216
+ ## Search Progress
217
+ - **Iteration**: {iteration}/{max_iterations}
218
+ - **Total evidence collected**: {total_count} sources
219
+ - **Evidence shown below**: {len(evidence)} diverse sources (selected for relevance)
220
+
221
+ ## Available Evidence
222
+
223
+ {evidence_text}
224
+
225
+ ## Your Task
226
+
227
+ Score this evidence for drug repurposing potential. Provide ONLY scores and extracted data.
228
+ DO NOT decide "synthesize" vs "continue" - that decision is made by the system.
229
+
230
+ ## REMINDER: Original Question (stay focused)
231
+ {question}
232
+ """
233
+ ```
234
+
235
+ ### Fix 2: Debiased Judge Prompt (Scoring Only)
236
+
237
+ **File:** `src/prompts/judge.py`
238
+
239
+ ```python
240
+ SYSTEM_PROMPT = """You are an expert drug repurposing research judge.
241
+
242
+ Your task is to SCORE evidence from biomedical literature. You do NOT decide whether to
243
+ continue searching or synthesize - that decision is made by the orchestration system
244
+ based on your scores.
245
+
246
+ ## Your Role: Scoring Only
247
+
248
+ You provide objective scores. The system decides next steps based on explicit thresholds.
249
+ This separation prevents bias in the decision-making process.
250
+
251
+ ## Scoring Criteria
252
+
253
+ 1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
254
+ - 0-3: No clear mechanism, speculative
255
+ - 4-6: Some mechanistic insight, but gaps exist
256
+ - 7-10: Clear, well-supported mechanism of action
257
+
258
+ 2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
259
+ - 0-3: No clinical data, only theoretical
260
+ - 4-6: Preclinical or early clinical data
261
+ - 7-10: Strong clinical evidence (trials, meta-analyses)
262
+
263
+ 3. **Drug Candidates**: List SPECIFIC drug names mentioned in the evidence
264
+ - Only include drugs explicitly mentioned
265
+ - Do NOT hallucinate or infer drug names
266
+ - Include drug class if specific names aren't available (e.g., "SSRI antidepressants")
267
+
268
+ 4. **Key Findings**: Extract 3-5 key findings from the evidence
269
+ - Focus on findings relevant to the research question
270
+ - Include mechanism insights and clinical outcomes
271
+
272
+ 5. **Confidence (0.0-1.0)**: Your confidence in the scores
273
+ - Based on evidence quality and relevance
274
+ - Lower if evidence is tangential or low-quality
275
+
276
+ ## Output Format
277
+
278
+ Return valid JSON with these fields:
279
+ - details.mechanism_score (int 0-10)
280
+ - details.mechanism_reasoning (string)
281
+ - details.clinical_evidence_score (int 0-10)
282
+ - details.clinical_reasoning (string)
283
+ - details.drug_candidates (list of strings)
284
+ - details.key_findings (list of strings)
285
+ - sufficient (boolean) - TRUE if scores suggest enough evidence
286
+ - confidence (float 0-1)
287
+ - recommendation ("continue" or "synthesize") - Your suggestion (system may override)
288
+ - next_search_queries (list) - If continuing, suggest FOCUSED queries
289
+ - reasoning (string)
290
+
291
+ ## CRITICAL: Search Query Rules
292
+
293
+ When suggesting next_search_queries:
294
+ - STAY FOCUSED on the original research question
295
+ - Do NOT drift to tangential topics
296
+ - If question is about "female libido", do NOT suggest "bone health" or "muscle mass"
297
+ - Refine existing terms, don't explore random medical associations
298
+ - Example: "female libido post-menopause" → "testosterone therapy female sexual dysfunction"
299
+ """
300
+ ```
301
+
302
+ ### Fix 3: Code-Enforced Termination Criteria
303
+
304
+ **File:** `src/orchestrators/simple.py`
305
+
306
+ ```python
307
+ # Termination thresholds (code-enforced, not LLM-decided)
308
+ # Based on multi-agent orchestration best practices (November 2025)
309
+ # Reference: https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/
310
+
311
+ TERMINATION_CRITERIA = {
312
+ "min_combined_score": 12, # mechanism + clinical >= 12
313
+ "min_score_with_volume": 10, # >= 10 if 50+ sources
314
+ "late_iteration_threshold": 8, # >= 8 in iterations 8+
315
+ "max_evidence_threshold": 100, # Force synthesis with 100+ sources
316
+ "emergency_iteration": 8, # Last 2 iterations = emergency mode
317
+ "min_confidence": 0.5, # Minimum confidence for emergency synthesis
318
+ }
319
+
320
+
321
+ def should_synthesize(
322
+ assessment: JudgeAssessment,
323
+ iteration: int,
324
+ max_iterations: int,
325
+ evidence_count: int,
326
+ ) -> tuple[bool, str]:
327
+ """
328
+ Code-enforced synthesis decision.
329
+
330
+ Returns (should_synthesize, reason).
331
+
332
+ This function implements the "explicit termination criteria" pattern
333
+ from multi-agent orchestration best practices. The LLM provides scores,
334
+ but CODE decides when to stop.
335
+
336
+ Reference: https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025
337
+ """
338
+ combined_score = (
339
+ assessment.details.mechanism_score +
340
+ assessment.details.clinical_evidence_score
341
+ )
342
+ has_drug_candidates = len(assessment.details.drug_candidates) > 0
343
+ confidence = assessment.confidence
344
+
345
+ # Priority 1: LLM explicitly says sufficient with good scores
346
+ if assessment.sufficient and assessment.recommendation == "synthesize":
347
+ if combined_score >= 10:
348
+ return True, "judge_approved"
349
+
350
+ # Priority 2: High scores with drug candidates
351
+ if combined_score >= TERMINATION_CRITERIA["min_combined_score"] and has_drug_candidates:
352
+ return True, "high_scores_with_candidates"
353
+
354
+ # Priority 3: Good scores with high evidence volume
355
+ if combined_score >= TERMINATION_CRITERIA["min_score_with_volume"] and evidence_count >= 50:
356
+ return True, "good_scores_high_volume"
357
+
358
+ # Priority 4: Late iteration with acceptable scores (diminishing returns)
359
+ is_late_iteration = iteration >= max_iterations - 2
360
+ if is_late_iteration and combined_score >= TERMINATION_CRITERIA["late_iteration_threshold"]:
361
+ return True, "late_iteration_acceptable"
362
+
363
+ # Priority 5: Very high evidence count (enough to synthesize something)
364
+ if evidence_count >= TERMINATION_CRITERIA["max_evidence_threshold"]:
365
+ return True, "max_evidence_reached"
366
+
367
+ # Priority 6: Emergency synthesis (avoid garbage output)
368
+ if is_late_iteration and evidence_count >= 30 and confidence >= TERMINATION_CRITERIA["min_confidence"]:
369
+ return True, "emergency_synthesis"
370
+
371
+ return False, "continue_searching"
372
+ ```
373
+
374
+ ### Fix 4: Update Orchestrator Decision Phase
375
+
376
+ **File:** `src/orchestrators/simple.py`
377
+
378
+ ```python
379
+ # In the run() method, replace the decision phase:
380
+
381
+ # === DECISION PHASE (Code-Enforced) ===
382
+ should_synth, reason = should_synthesize(
383
+ assessment=assessment,
384
+ iteration=iteration,
385
+ max_iterations=self.config.max_iterations,
386
+ evidence_count=len(all_evidence),
387
+ )
388
+
389
+ logger.info(
390
+ "Synthesis decision",
391
+ should_synthesize=should_synth,
392
+ reason=reason,
393
+ iteration=iteration,
394
+ combined_score=assessment.details.mechanism_score + assessment.details.clinical_evidence_score,
395
+ evidence_count=len(all_evidence),
396
+ confidence=assessment.confidence,
397
+ )
398
+
399
+ if should_synth:
400
+ # Log synthesis trigger reason for debugging
401
+ if reason != "judge_approved":
402
+ logger.info(f"Code-enforced synthesis triggered: {reason}")
403
+
404
+ # Optional Analysis Phase
405
+ async for event in self._run_analysis_phase(query, all_evidence, iteration):
406
+ yield event
407
+
408
+ yield AgentEvent(
409
+ type="synthesizing",
410
+ message=f"Evidence sufficient ({reason})! Preparing synthesis...",
411
+ iteration=iteration,
412
+ )
413
+
414
+ # Generate final response
415
+ final_response = self._generate_synthesis(query, all_evidence, assessment)
416
+
417
+ yield AgentEvent(
418
+ type="complete",
419
+ message=final_response,
420
+ data={
421
+ "evidence_count": len(all_evidence),
422
+ "iterations": iteration,
423
+ "synthesis_reason": reason,
424
+ "drug_candidates": assessment.details.drug_candidates,
425
+ "key_findings": assessment.details.key_findings,
426
+ },
427
+ iteration=iteration,
428
+ )
429
+ return
430
+
431
+ else:
432
+ # Need more evidence - prepare next queries
433
+ current_queries = assessment.next_search_queries or [
434
+ f"{query} mechanism of action",
435
+ f"{query} clinical evidence",
436
+ ]
437
+
438
+ yield AgentEvent(
439
+ type="looping",
440
+ message=(
441
+ f"Gathering more evidence (scores: {assessment.details.mechanism_score}+"
442
+ f"{assessment.details.clinical_evidence_score}). "
443
+ f"Next: {', '.join(current_queries[:2])}..."
444
+ ),
445
+ data={"next_queries": current_queries, "reason": reason},
446
+ iteration=iteration,
447
+ )
448
+ ```
449
+
450
+ ### Fix 5: Real Partial Synthesis
451
+
452
+ **File:** `src/orchestrators/simple.py`
453
+
454
+ ```python
455
+ def _generate_partial_synthesis(
456
+ self,
457
+ query: str,
458
+ evidence: list[Evidence],
459
+ ) -> str:
460
+ """
461
+ Generate a REAL synthesis when max iterations reached.
462
+
463
+ Even when forced to stop, we should provide:
464
+ - Drug candidates (if any were found)
465
+ - Key findings
466
+ - Assessment scores
467
+ - Actionable citations
468
+
469
+ This is still better than a citation dump.
470
+ """
471
+ # Extract data from last assessment if available
472
+ last_assessment = self.history[-1]["assessment"] if self.history else {}
473
+ details = last_assessment.get("details", {})
474
+
475
+ drug_candidates = details.get("drug_candidates", [])
476
+ key_findings = details.get("key_findings", [])
477
+ mechanism_score = details.get("mechanism_score", 0)
478
+ clinical_score = details.get("clinical_evidence_score", 0)
479
+ reasoning = last_assessment.get("reasoning", "Analysis incomplete due to iteration limit.")
480
+
481
+ # Format drug candidates
482
+ if drug_candidates:
483
+ drug_list = "\n".join([f"- **{d}**" for d in drug_candidates[:5]])
484
+ else:
485
+ drug_list = "- *No specific drug candidates identified in evidence*\n- *Try a more specific query or add an API key for better analysis*"
486
+
487
+ # Format key findings
488
+ if key_findings:
489
+ findings_list = "\n".join([f"- {f}" for f in key_findings[:5]])
490
+ else:
491
+ findings_list = "- *Key findings require further analysis*\n- *See citations below for relevant sources*"
492
+
493
+ # Format citations (top 10)
494
+ citations = "\n".join([
495
+ f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
496
+ f"({e.citation.source.upper()}, {e.citation.date})"
497
+ for i, e in enumerate(evidence[:10])
498
+ ])
499
+
500
+ combined_score = mechanism_score + clinical_score
501
+
502
+ return f"""## Drug Repurposing Analysis
503
+
504
+ ### Research Question
505
+ {query}
506
+
507
+ ### Status
508
+ Analysis based on {len(evidence)} sources across {len(self.history)} iterations.
509
+ Maximum iterations reached - results may be incomplete.
510
+
511
+ ### Drug Candidates Identified
512
+ {drug_list}
513
+
514
+ ### Key Findings
515
+ {findings_list}
516
+
517
+ ### Evidence Quality Scores
518
+ | Criterion | Score | Interpretation |
519
+ |-----------|-------|----------------|
520
+ | Mechanism | {mechanism_score}/10 | {"Strong" if mechanism_score >= 7 else "Moderate" if mechanism_score >= 4 else "Limited"} mechanistic evidence |
521
+ | Clinical | {clinical_score}/10 | {"Strong" if clinical_score >= 7 else "Moderate" if clinical_score >= 4 else "Limited"} clinical support |
522
+ | Combined | {combined_score}/20 | {"Sufficient" if combined_score >= 12 else "Partial"} for synthesis |
523
+
524
+ ### Analysis Summary
525
+ {reasoning}
526
+
527
+ ### Top Citations ({len(evidence)} sources total)
528
+ {citations}
529
+
530
+ ---
531
+ *For more complete analysis:*
532
+ - *Add an OpenAI or Anthropic API key for enhanced LLM analysis*
533
+ - *Try a more specific query (e.g., include drug names)*
534
+ - *Use Advanced mode for multi-agent research*
535
+ """
536
+ ```
537
+
538
+ ### Fix 6: Update Judge Handler Signature
539
+
540
+ **File:** `src/orchestrators/base.py`
541
+
542
+ ```python
543
+ class JudgeHandlerProtocol(Protocol):
544
+ """Protocol for judge handler."""
545
+
546
+ async def assess(
547
+ self,
548
+ question: str,
549
+ evidence: list[Evidence],
550
+ iteration: int = 0, # NEW
551
+ max_iterations: int = 10, # NEW
552
+ ) -> JudgeAssessment:
553
+ """Assess evidence quality and provide scores."""
554
+ ...
555
+ ```
556
+
557
+ **File:** `src/agent_factory/judges.py`
558
+
559
+ Update all handlers (`JudgeHandler`, `HFInferenceJudgeHandler`, `MockJudgeHandler`) to:
560
+
561
+ ```python
562
+ async def assess(
563
+ self,
564
+ question: str,
565
+ evidence: list[Evidence],
566
+ iteration: int = 0,
567
+ max_iterations: int = 10,
568
+ ) -> JudgeAssessment:
569
+ """Assess evidence with iteration context."""
570
+ # Select diverse evidence (not just truncate)
571
+ selected_evidence = await select_evidence_for_judge(evidence, question)
572
+
573
+ # Format prompt with iteration context
574
+ user_prompt = format_user_prompt(
575
+ question=question,
576
+ evidence=selected_evidence,
577
+ iteration=iteration,
578
+ max_iterations=max_iterations,
579
+ total_evidence_count=len(evidence),
580
+ )
581
+
582
+ # ... rest of implementation
583
+ ```
584
+
585
+ ---
586
+
587
+ ## Implementation Order
588
+
589
+ | Order | Fix | Priority | Impact |
590
+ |-------|-----|----------|--------|
591
+ | 1 | Diverse evidence selection | CRITICAL | Prevents token overflow + position bias |
592
+ | 2 | Code-enforced termination | CRITICAL | Guarantees synthesis before max iterations |
593
+ | 3 | Debiased judge prompt | HIGH | Removes verbosity/self-preference bias |
594
+ | 4 | Real partial synthesis | HIGH | Ensures useful output even on forced stop |
595
+ | 5 | Update handler signatures | MEDIUM | Enables iteration context |
596
+ | 6 | Update orchestrator | MEDIUM | Integrates all fixes |
597
+
598
+ ---
599
+
600
+ ## Files to Modify
601
+
602
+ | File | Changes |
603
+ |------|---------|
604
+ | `src/prompts/judge.py` | New `select_evidence_for_judge()`, updated `format_user_prompt()`, debiased `SYSTEM_PROMPT` |
605
+ | `src/orchestrators/simple.py` | New `should_synthesize()`, updated decision phase, real `_generate_partial_synthesis()` |
606
+ | `src/orchestrators/base.py` | Update `JudgeHandlerProtocol` signature |
607
+ | `src/agent_factory/judges.py` | Update all handlers with iteration params, use diverse selection |
608
+
609
+ ---
610
+
611
+ ## Test Plan
612
+
613
+ ### Unit Tests
614
+
615
+ ```python
616
+ # tests/unit/prompts/test_judge_prompt.py
617
+
618
+ @pytest.mark.asyncio
619
+ async def test_evidence_selection_diverse():
620
+ """Verify evidence selection includes early and recent items."""
621
+ evidence = [make_evidence(f"Paper {i}") for i in range(100)]
622
+ selected = await select_evidence_for_judge(evidence, "test query", max_items=30)
623
+
624
+ # Should include some early evidence (lost-in-the-middle mitigation)
625
+ titles = [e.citation.title for e in selected]
626
+ assert any("Paper 0" in t or "Paper 1" in t for t in titles)
627
+ assert any("Paper 99" in t or "Paper 98" in t for t in titles)
628
+
629
+
630
+ def test_prompt_includes_question_at_edges():
631
+ """Verify lost-in-the-middle mitigation."""
632
+ evidence = [make_evidence("Test")]
633
+ prompt = format_user_prompt("important question", evidence, iteration=5, max_iterations=10)
634
+
635
+ # Question should appear at START and END of prompt
636
+ lines = prompt.split("\n")
637
+ assert "important question" in lines[1] # Near start
638
+ assert "important question" in lines[-2] # Near end
639
+
640
+
641
+ # tests/unit/orchestrators/test_termination.py
642
+
643
+ def test_should_synthesize_high_scores():
644
+ """High scores with drug candidates triggers synthesis."""
645
+ assessment = make_assessment(mechanism=7, clinical=6, drug_candidates=["Metformin"])
646
+ should_synth, reason = should_synthesize(assessment, iteration=3, max_iterations=10, evidence_count=50)
647
+
648
+ assert should_synth is True
649
+ assert reason == "high_scores_with_candidates"
650
+
651
+
652
+ def test_should_synthesize_late_iteration():
653
+ """Late iteration with acceptable scores triggers synthesis."""
654
+ assessment = make_assessment(mechanism=5, clinical=4, drug_candidates=[])
655
+ should_synth, reason = should_synthesize(assessment, iteration=9, max_iterations=10, evidence_count=80)
656
+
657
+ assert should_synth is True
658
+ assert reason in ["late_iteration_acceptable", "emergency_synthesis"]
659
+
660
+
661
+ def test_should_not_synthesize_early_low_scores():
662
+ """Early iteration with low scores continues searching."""
663
+ assessment = make_assessment(mechanism=3, clinical=2, drug_candidates=[])
664
+ should_synth, reason = should_synthesize(assessment, iteration=2, max_iterations=10, evidence_count=20)
665
+
666
+ assert should_synth is False
667
+ assert reason == "continue_searching"
668
+
669
+
670
+ def test_partial_synthesis_has_drug_candidates():
671
+ """Partial synthesis includes extracted data."""
672
+ orchestrator = Orchestrator(...)
673
+ orchestrator.history = [{
674
+ "assessment": {
675
+ "details": {
676
+ "drug_candidates": ["Testosterone", "DHEA"],
677
+ "key_findings": ["Finding 1", "Finding 2"],
678
+ "mechanism_score": 6,
679
+ "clinical_evidence_score": 5,
680
+ },
681
+ "reasoning": "Good evidence found.",
682
+ }
683
+ }]
684
+
685
+ result = orchestrator._generate_partial_synthesis("test", [make_evidence("Test")])
686
+
687
+ assert "Testosterone" in result
688
+ assert "DHEA" in result
689
+ assert "Drug Candidates" in result
690
+ assert "6/10" in result # mechanism score
691
+ ```
692
+
693
+ ### Integration Tests
694
+
695
+ ```python
696
+ # tests/integration/test_simple_mode_synthesis.py
697
+
698
+ @pytest.mark.asyncio
699
+ async def test_simple_mode_synthesizes_before_max_iterations():
700
+ """Verify simple mode produces useful output with mocked judge."""
701
+ # Mock judge to return good scores
702
+ mock_judge = MockJudgeHandler()
703
+ orchestrator = Orchestrator(
704
+ search_handler=mock_search_handler,
705
+ judge_handler=mock_judge,
706
+ )
707
+
708
+ events = []
709
+ async for event in orchestrator.run("metformin diabetes mechanism"):
710
+ events.append(event)
711
+
712
+ # Must have synthesis with drug candidates
713
+ complete_event = next(e for e in events if e.type == "complete")
714
+ assert "Drug Candidates" in complete_event.message
715
+ assert complete_event.data.get("synthesis_reason") is not None
716
+
717
+
718
+ @pytest.mark.asyncio
719
+ async def test_large_evidence_does_not_crash():
720
+ """Verify 500 sources don't cause token overflow."""
721
+ evidence = [make_evidence(f"Paper {i}") for i in range(500)]
722
+ selected = await select_evidence_for_judge(evidence, "test query")
723
+
724
+ # Should be capped
725
+ assert len(selected) <= MAX_EVIDENCE_FOR_JUDGE
726
+
727
+ # Total chars should be under ~50K (safe for most LLMs)
728
+ prompt = format_user_prompt("test", selected, iteration=5, max_iterations=10, total_evidence_count=500)
729
+ assert len(prompt) < 100_000 # Well under token limits
730
+ ```
731
+
732
+ ---
733
+
734
+ ## Acceptance Criteria
735
+
736
+ - [ ] Evidence sent to judge is diverse-selected (not just truncated)
737
+ - [ ] Prompt includes question at START and END (lost-in-the-middle mitigation)
738
+ - [ ] Code-enforced `should_synthesize()` makes termination decision
739
+ - [ ] Synthesis triggered by iteration 8 with 50+ sources and scores >= 8
740
+ - [ ] Partial synthesis includes drug candidates and scores (not just citations)
741
+ - [ ] Search queries stay on-topic (judge prompt enforces focus)
742
+ - [ ] 500+ sources don't cause LLM crashes
743
+ - [ ] All existing tests pass
744
+
745
+ ---
746
+
747
+ ## Risk Assessment
748
+
749
+ | Risk | Mitigation |
750
+ |------|------------|
751
+ | Diverse selection misses critical evidence | Include relevance scoring in selection |
752
+ | Code-enforced thresholds too aggressive | Log all synthesis decisions for tuning |
753
+ | Prompt changes affect OpenAI/Anthropic differently | Test with all providers |
754
+ | Emergency synthesis produces low-quality output | Still better than citation dump |
755
+
756
+ ---
757
+
758
+ ## Success Metrics
759
+
760
+ | Metric | Before | After |
761
+ |--------|--------|-------|
762
+ | Synthesis rate | 0% | 90%+ |
763
+ | Average iterations to synthesis | 10 (max) | 5-7 |
764
+ | Drug candidates in output | Never | Always (if found) |
765
+ | LLM token overflow errors | Common | None |
766
+ | User-reported "useless output" | Frequent | Rare |
767
+
768
+ ---
769
+
770
+ ## References
771
+
772
+ - [LLM-as-a-Judge Guide - Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge)
773
+ - [Survey on LLM-as-a-Judge - arXiv](https://arxiv.org/abs/2411.15594)
774
+ - [RAG Best Practices - Pinecone](https://www.pinecone.io/learn/retrieval-augmented-generation/)
775
+ - [Lessons from RAG 2025 - TrueState](https://www.truestate.io/blog/lessons-from-rag)
776
+ - [LangGraph Multi-Agent Orchestration 2025](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025)
777
+ - [Multi-Agent Orchestration on AWS](https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/)
778
+ - [Spring AI LLM-as-Judge Pattern](https://spring.io/blog/2025/11/10/spring-ai-llm-as-judge-blog-post/)
src/orchestrators/advanced.py CHANGED
@@ -36,9 +36,11 @@ from src.agents.magentic_agents import (
36
  create_search_agent,
37
  )
38
  from src.agents.state import init_magentic_state
 
39
  from src.utils.config import settings
40
  from src.utils.llm_factory import check_magentic_requirements
41
  from src.utils.models import AgentEvent
 
42
 
43
  if TYPE_CHECKING:
44
  from src.services.embeddings import EmbeddingService
@@ -46,7 +48,7 @@ if TYPE_CHECKING:
46
  logger = structlog.get_logger()
47
 
48
 
49
- class AdvancedOrchestrator:
50
  """
51
  Advanced orchestrator using Microsoft Agent Framework ChatAgent pattern.
52
 
@@ -97,17 +99,7 @@ class AdvancedOrchestrator:
97
 
98
  def _init_embedding_service(self) -> "EmbeddingService | None":
99
  """Initialize embedding service if available."""
100
- try:
101
- from src.services.embeddings import get_embedding_service
102
-
103
- service = get_embedding_service()
104
- logger.info("Embedding service enabled")
105
- return service
106
- except ImportError:
107
- logger.info("Embedding service not available (dependencies missing)")
108
- except Exception as e:
109
- logger.warning("Failed to initialize embedding service", error=str(e))
110
- return None
111
 
112
  def _build_workflow(self) -> Any:
113
  """Build the workflow with ChatAgent participants."""
 
36
  create_search_agent,
37
  )
38
  from src.agents.state import init_magentic_state
39
+ from src.orchestrators.base import OrchestratorProtocol
40
  from src.utils.config import settings
41
  from src.utils.llm_factory import check_magentic_requirements
42
  from src.utils.models import AgentEvent
43
+ from src.utils.service_loader import get_embedding_service_if_available
44
 
45
  if TYPE_CHECKING:
46
  from src.services.embeddings import EmbeddingService
 
48
  logger = structlog.get_logger()
49
 
50
 
51
+ class AdvancedOrchestrator(OrchestratorProtocol):
52
  """
53
  Advanced orchestrator using Microsoft Agent Framework ChatAgent pattern.
54
 
 
99
 
100
  def _init_embedding_service(self) -> "EmbeddingService | None":
101
  """Initialize embedding service if available."""
102
+ return get_embedding_service_if_available()
103
 
104
  def _build_workflow(self) -> Any:
105
  """Build the workflow with ChatAgent participants."""
src/orchestrators/hierarchical.py CHANGED
@@ -19,9 +19,10 @@ import structlog
19
  from src.agents.judge_agent_llm import LLMSubIterationJudge
20
  from src.agents.magentic_agents import create_search_agent
21
  from src.middleware.sub_iteration import SubIterationMiddleware, SubIterationTeam
22
- from src.services.embeddings import get_embedding_service
23
  from src.state import init_magentic_state
24
  from src.utils.models import AgentEvent, OrchestratorConfig
 
25
 
26
  logger = structlog.get_logger()
27
 
@@ -56,7 +57,7 @@ class ResearchTeam(SubIterationTeam):
56
  return "No response from agent."
57
 
58
 
59
- class HierarchicalOrchestrator:
60
  """Orchestrator that uses hierarchical teams and sub-iterations.
61
 
62
  This orchestrator provides:
@@ -96,19 +97,8 @@ class HierarchicalOrchestrator:
96
  """
97
  logger.info("Starting hierarchical orchestrator", query=query)
98
 
99
- try:
100
- service = get_embedding_service()
101
- init_magentic_state(service)
102
- except ImportError:
103
- logger.info("Embedding service not available (dependencies missing)")
104
- init_magentic_state()
105
- except Exception as e:
106
- logger.warning(
107
- "Embedding service initialization failed",
108
- error=str(e),
109
- error_type=type(e).__name__,
110
- )
111
- init_magentic_state()
112
 
113
  yield AgentEvent(type="started", message=f"Starting research: {query}")
114
 
 
19
  from src.agents.judge_agent_llm import LLMSubIterationJudge
20
  from src.agents.magentic_agents import create_search_agent
21
  from src.middleware.sub_iteration import SubIterationMiddleware, SubIterationTeam
22
+ from src.orchestrators.base import OrchestratorProtocol
23
  from src.state import init_magentic_state
24
  from src.utils.models import AgentEvent, OrchestratorConfig
25
+ from src.utils.service_loader import get_embedding_service_if_available
26
 
27
  logger = structlog.get_logger()
28
 
 
57
  return "No response from agent."
58
 
59
 
60
+ class HierarchicalOrchestrator(OrchestratorProtocol):
61
  """Orchestrator that uses hierarchical teams and sub-iterations.
62
 
63
  This orchestrator provides:
 
97
  """
98
  logger.info("Starting hierarchical orchestrator", query=query)
99
 
100
+ service = get_embedding_service_if_available()
101
+ init_magentic_state(service)
102
 
103
  yield AgentEvent(type="started", message=f"Starting research: {query}")
104
 
src/orchestrators/simple.py CHANGED
@@ -72,47 +72,22 @@ class Orchestrator:
72
  self._embeddings: EmbeddingService | None = None
73
 
74
  def _get_analyzer(self) -> StatisticalAnalyzer | None:
75
- """Lazy initialization of StatisticalAnalyzer.
76
-
77
- Note: This imports from src.services, NOT src.agents,
78
- so it works without the magentic optional dependency.
79
-
80
- Returns:
81
- StatisticalAnalyzer instance, or None if Modal is unavailable
82
- """
83
  if self._analyzer is None:
84
- try:
85
- from src.services.statistical_analyzer import get_statistical_analyzer
86
 
87
- self._analyzer = get_statistical_analyzer()
88
- except ImportError:
89
- logger.info("StatisticalAnalyzer not available (Modal dependencies missing)")
90
  self._enable_analysis = False
91
  return self._analyzer
92
 
93
  def _get_embeddings(self) -> EmbeddingService | None:
94
- """Lazy initialization of EmbeddingService.
95
-
96
- Uses local sentence-transformers - NO API key required.
97
-
98
- Returns:
99
- EmbeddingService instance, or None if unavailable
100
- """
101
  if self._embeddings is None and self._enable_embeddings:
102
- try:
103
- from src.services.embeddings import get_embedding_service
104
 
105
- self._embeddings = get_embedding_service()
106
- logger.info("Embedding service enabled for semantic ranking")
107
- except ImportError:
108
- logger.info("Embedding service not available (dependencies missing)")
109
- self._enable_embeddings = False
110
- except Exception as e:
111
- logger.warning(
112
- "Embedding service initialization failed",
113
- error=str(e),
114
- error_type=type(e).__name__,
115
- )
116
  self._enable_embeddings = False
117
  return self._embeddings
118
 
 
72
  self._embeddings: EmbeddingService | None = None
73
 
74
  def _get_analyzer(self) -> StatisticalAnalyzer | None:
75
+ """Lazy initialization of StatisticalAnalyzer."""
76
  if self._analyzer is None:
77
+ from src.utils.service_loader import get_analyzer_if_available
 
78
 
79
+ self._analyzer = get_analyzer_if_available()
80
+ if self._analyzer is None:
 
81
  self._enable_analysis = False
82
  return self._analyzer
83
 
84
  def _get_embeddings(self) -> EmbeddingService | None:
85
+ """Lazy initialization of EmbeddingService."""
86
  if self._embeddings is None and self._enable_embeddings:
87
+ from src.utils.service_loader import get_embedding_service_if_available
 
88
 
89
+ self._embeddings = get_embedding_service_if_available()
90
+ if self._embeddings is None:
91
  self._enable_embeddings = False
92
  return self._embeddings
93
 
src/utils/service_loader.py ADDED
@@ -0,0 +1,71 @@
1
+ """Service loader utility for safe, lazy loading of optional services.
2
+
3
+ This module handles the import and initialization of services that may
4
+ have missing optional dependencies (like Modal or Sentence Transformers),
5
+ preventing the application from crashing if they are not available.
6
+ """
7
+
8
+ from typing import TYPE_CHECKING
9
+
10
+ import structlog
11
+
12
+ if TYPE_CHECKING:
13
+ from src.services.embeddings import EmbeddingService
14
+ from src.services.statistical_analyzer import StatisticalAnalyzer
15
+
16
+ logger = structlog.get_logger()
17
+
18
+
19
+ def get_embedding_service_if_available() -> "EmbeddingService | None":
20
+ """
21
+ Safely attempt to load and initialize the EmbeddingService.
22
+
23
+ Returns:
24
+ EmbeddingService instance if dependencies are met, else None.
25
+ """
26
+ try:
27
+ # Import here to avoid top-level dependency check
28
+ from src.services.embeddings import get_embedding_service
29
+
30
+ service = get_embedding_service()
31
+ logger.info("Embedding service initialized successfully")
32
+ return service
33
+ except ImportError as e:
34
+ logger.info(
35
+ "Embedding service not available (optional dependencies missing)",
36
+ missing_dependency=str(e),
37
+ )
38
+ except Exception as e:
39
+ logger.warning(
40
+ "Embedding service initialization failed unexpectedly",
41
+ error=str(e),
42
+ error_type=type(e).__name__,
43
+ )
44
+ return None
45
+
46
+
47
+ def get_analyzer_if_available() -> "StatisticalAnalyzer | None":
48
+ """
49
+ Safely attempt to load and initialize the StatisticalAnalyzer.
50
+
51
+ Returns:
52
+ StatisticalAnalyzer instance if Modal is available, else None.
53
+ """
54
+ try:
55
+ from src.services.statistical_analyzer import get_statistical_analyzer
56
+
57
+ analyzer = get_statistical_analyzer()
58
+ logger.info("StatisticalAnalyzer initialized successfully")
59
+ return analyzer
60
+ except ImportError as e:
61
+ logger.info(
62
+ "StatisticalAnalyzer not available (Modal dependencies missing)",
63
+ missing_dependency=str(e),
64
+ )
65
+ except Exception as e:
66
+ logger.warning(
67
+ "StatisticalAnalyzer initialization failed unexpectedly",
68
+ error=str(e),
69
+ error_type=type(e).__name__,
70
+ )
71
+ return None
tests/unit/utils/test_service_loader.py ADDED
@@ -0,0 +1,71 @@
1
+ from unittest.mock import MagicMock, patch
2
+
3
+ from src.utils.service_loader import (
4
+ get_analyzer_if_available,
5
+ get_embedding_service_if_available,
6
+ )
7
+
8
+
9
+ def test_get_embedding_service_success():
10
+ """Test successful loading of embedding service."""
11
+ with patch("src.services.embeddings.get_embedding_service") as mock_get:
12
+ mock_service = MagicMock()
13
+ mock_get.return_value = mock_service
14
+
15
+ service = get_embedding_service_if_available()
16
+
17
+ assert service is mock_service
18
+ mock_get.assert_called_once()
19
+
20
+
21
+ def test_get_embedding_service_import_error():
22
+ """Test handling of ImportError when loading embedding service."""
23
+ # Simulate import error by patching the function to raise ImportError
24
+ with patch(
25
+ "src.services.embeddings.get_embedding_service",
26
+ side_effect=ImportError("Missing deps"),
27
+ ):
28
+ service = get_embedding_service_if_available()
29
+ assert service is None
30
+
31
+
32
+ def test_get_embedding_service_generic_error():
33
+ """Test handling of generic Exception when loading embedding service."""
34
+ with patch(
35
+ "src.services.embeddings.get_embedding_service",
36
+ side_effect=ValueError("Boom"),
37
+ ):
38
+ service = get_embedding_service_if_available()
39
+ assert service is None
40
+
41
+
42
+ def test_get_analyzer_success():
43
+ """Test successful loading of analyzer."""
44
+ with patch("src.services.statistical_analyzer.get_statistical_analyzer") as mock_get:
45
+ mock_analyzer = MagicMock()
46
+ mock_get.return_value = mock_analyzer
47
+
48
+ analyzer = get_analyzer_if_available()
49
+
50
+ assert analyzer is mock_analyzer
51
+ mock_get.assert_called_once()
52
+
53
+
54
+ def test_get_analyzer_import_error():
55
+ """Test handling of ImportError when loading analyzer."""
56
+ with patch(
57
+ "src.services.statistical_analyzer.get_statistical_analyzer",
58
+ side_effect=ImportError("No Modal"),
59
+ ):
60
+ analyzer = get_analyzer_if_available()
61
+ assert analyzer is None
62
+
63
+
64
+ def test_get_analyzer_generic_error():
65
+ """Test handling of generic Exception when loading analyzer."""
66
+ with patch(
67
+ "src.services.statistical_analyzer.get_statistical_analyzer",
68
+ side_effect=RuntimeError("Fail"),
69
+ ):
70
+ analyzer = get_analyzer_if_available()
71
+ assert analyzer is None