DeepBoner / docs /specs /SPEC_14_CLINICALTRIALS_OUTCOMES.md
VibecoderMcSwaggins's picture
feat(search): SPEC_13 Evidence Deduplication (#98)
2c5db87 unverified
# SPEC_14: Add Outcome Measures to ClinicalTrials.gov Fields
**Status**: Draft (Validated via API Documentation Review)
**Priority**: P1
**GitHub Issue**: #95
**Estimated Effort**: Small (~40 lines of code)
**Last Updated**: 2025-11-30
---
## Problem Statement
The `ClinicalTrialsTool` retrieves trial metadata but **misses critical efficacy data**:
### Current Fields Retrieved
```python
# src/tools/clinicaltrials.py:24-33
FIELDS: ClassVar[list[str]] = [
"NCTId",
"BriefTitle",
"Phase",
"OverallStatus",
"Condition",
"InterventionName",
"StartDate",
"BriefSummary",
]
```
### Missing Data (Critical for Research)
| Data | Location in Response | Purpose |
|------|---------------------|---------|
| Primary Outcomes | `protocolSection.outcomesModule.primaryOutcomes[].measure` | Main efficacy endpoint |
| Secondary Outcomes | `protocolSection.outcomesModule.secondaryOutcomes[].measure` | Additional endpoints |
| Has Results | `study.hasResults` (top-level) | Whether results are posted |
| Results Date | `protocolSection.statusModule.resultsFirstPostDateStruct.date` | When results posted |
### Impact
**Current Output**:
```
Trial Phase: PHASE3. Status: COMPLETED. Conditions: Erectile Dysfunction.
Interventions: Sildenafil.
```
**Desired Output**:
```
Trial Phase: PHASE3. Status: COMPLETED. Conditions: Erectile Dysfunction.
Interventions: Sildenafil.
Primary Outcome: Change from baseline in IIEF-EF domain score at Week 12.
Results Available: Yes (posted 2024-01-15).
```
---
## API Documentation Review (2025-11-30)
### ClinicalTrials.gov API v2 Response Structure
**Source**: [Stack Overflow - ClinicalTrials.gov API v2](https://stackoverflow.com/questions/78415818)
The API returns nested JSON. Key findings:
1. **`hasResults`** is a **top-level** field on each study object (NOT inside `protocolSection`)
2. **Outcomes** are in `protocolSection.outcomesModule`:
```python
study['protocolSection']['outcomesModule']['primaryOutcomes'] # List
study['protocolSection']['outcomesModule']['secondaryOutcomes'] # List
```
3. **Results date** is in `protocolSection.statusModule.resultsFirstPostDateStruct.date`
### `fields` Parameter Behavior (VERIFIED VIA LIVE API TESTING)
The `fields` query parameter filters what the API returns. **If you don't request a field, you don't get it.**
**Live API Test Results (2025-11-30):**
```bash
# Test 1: With limited fields - NO outcomesModule returned
curl "...&fields=NCTId,BriefTitle"
# → Returns ONLY: protocolSection.identificationModule.{nctId, briefTitle}
# Test 2: Without fields param - outcomesModule IS present
curl "...&pageSize=1"
# → Returns: hasResults: false, outcomesModule: {primaryOutcomes, secondaryOutcomes, otherOutcomes}
# Test 3: Valid field names for outcomes
curl "...&fields=NCTId,OutcomesModule" # ✅ Works - returns full outcomesModule
curl "...&fields=NCTId,PrimaryOutcome" # ✅ Works - returns only primaryOutcomes
curl "...&fields=NCTId,HasResults" # ✅ Works - returns hasResults at top level
```
**Valid Field Names (Tested):**
- `OutcomesModule` → Returns full `protocolSection.outcomesModule` with all outcomes
- `PrimaryOutcome` → Returns only `primaryOutcomes` array
- `SecondaryOutcome` → Returns only `secondaryOutcomes` array
- `HasResults` → Returns `hasResults` at study top level
---
## Proposed Solution
### ✅ UPDATE FIELDS Constant (REQUIRED)
The current implementation explicitly passes `fields=",".join(self.FIELDS)` at line 67.
**The API ONLY returns requested fields.** We MUST add the new field names.
```python
# src/tools/clinicaltrials.py - UPDATE FIELDS
FIELDS: ClassVar[list[str]] = [
"NCTId",
"BriefTitle",
"Phase",
"OverallStatus",
"Condition",
"InterventionName",
"StartDate",
"BriefSummary",
# NEW: Outcome measures (verified via live API testing 2025-11-30)
"OutcomesModule", # Returns protocolSection.outcomesModule.{primaryOutcomes, secondaryOutcomes}
"HasResults", # Returns study.hasResults (top-level boolean)
]
```
### ✅ Update `_study_to_evidence()` Method
```python
def _study_to_evidence(self, study: dict[str, Any]) -> Evidence:
"""Convert a clinical trial study to Evidence."""
# Navigate nested structure
protocol = study.get("protocolSection", {})
id_module = protocol.get("identificationModule", {})
status_module = protocol.get("statusModule", {})
desc_module = protocol.get("descriptionModule", {})
design_module = protocol.get("designModule", {})
conditions_module = protocol.get("conditionsModule", {})
arms_module = protocol.get("armsInterventionsModule", {})
outcomes_module = protocol.get("outcomesModule", {}) # NEW
# ... existing field extraction (nct_id, title, status, phase, etc.) ...
# NEW: Extract outcome measures
primary_outcomes = outcomes_module.get("primaryOutcomes", [])
primary_outcome_str = ""
if primary_outcomes:
# Get first primary outcome measure and timeframe
first = primary_outcomes[0]
measure = first.get("measure", "")
timeframe = first.get("timeFrame", "")
# Truncate long outcome descriptions
primary_outcome_str = measure[:200]
if timeframe:
primary_outcome_str += f" (measured at {timeframe})"
secondary_outcomes = outcomes_module.get("secondaryOutcomes", [])
secondary_count = len(secondary_outcomes)
# NEW: Check if results are available (hasResults is TOP-LEVEL, not in protocol!)
has_results = study.get("hasResults", False)
# Results date is in statusModule (nested inside date struct)
results_date_struct = status_module.get("resultsFirstPostDateStruct", {})
results_date = results_date_struct.get("date", "")
# Build content with key trial info (UPDATED)
content_parts = [
f"{summary[:400]}...",
f"Trial Phase: {phase}.",
f"Status: {status}.",
f"Conditions: {conditions_str}.",
f"Interventions: {interventions_str}.",
]
if primary_outcome_str:
content_parts.append(f"Primary Outcome: {primary_outcome_str}.")
if secondary_count > 0:
content_parts.append(f"Secondary Outcomes: {secondary_count} additional endpoints.")
if has_results:
results_info = "Results Available: Yes"
if results_date:
results_info += f" (posted {results_date})"
content_parts.append(results_info + ".")
else:
content_parts.append("Results Available: Not yet posted.")
content = " ".join(content_parts)
return Evidence(
content=content[:2000],
citation=Citation(
source="clinicaltrials",
title=title[:500],
url=f"https://clinicaltrials.gov/study/{nct_id}",
date=start_date,
authors=[],
),
relevance=0.90 if has_results else 0.85, # Boost relevance for trials with results
)
```
---
## API Reference
The ClinicalTrials.gov API v2 returns nested JSON:
```json
{
"protocolSection": {
"outcomesModule": {
"primaryOutcomes": [
{
"measure": "Change from Baseline in IIEF-EF Domain Score",
"description": "...",
"timeFrame": "Baseline to Week 12"
}
],
"secondaryOutcomes": [
{
"measure": "Subject Global Assessment Question",
"timeFrame": "Week 12"
}
]
}
},
"hasResults": true
}
```
See: https://clinicaltrials.gov/data-api/api
---
## Test Plan
### Unit Tests (`tests/unit/tools/test_clinicaltrials.py`)
```python
@pytest.mark.unit
class TestClinicalTrialsOutcomes:
"""Tests for outcome measure extraction."""
@pytest.mark.asyncio
async def test_extracts_primary_outcome(self, tool: ClinicalTrialsTool) -> None:
"""Test that primary outcome is extracted from response."""
mock_study = {
"protocolSection": {
"identificationModule": {"nctId": "NCT12345678", "briefTitle": "Test"},
"statusModule": {"overallStatus": "COMPLETED", "startDateStruct": {"date": "2023"}},
"descriptionModule": {"briefSummary": "Summary"},
"designModule": {"phases": ["PHASE3"]},
"conditionsModule": {"conditions": ["ED"]},
"armsInterventionsModule": {"interventions": []},
"outcomesModule": {
"primaryOutcomes": [
{
"measure": "Change in IIEF-EF score",
"timeFrame": "Week 12"
}
]
},
},
"hasResults": True,
}
mock_response = MagicMock()
mock_response.json.return_value = {"studies": [mock_study]}
mock_response.raise_for_status = MagicMock()
with patch("requests.get", return_value=mock_response):
results = await tool.search("test", max_results=1)
assert len(results) == 1
assert "Primary Outcome" in results[0].content
assert "IIEF-EF" in results[0].content
assert "Week 12" in results[0].content
@pytest.mark.asyncio
async def test_includes_results_status(self, tool: ClinicalTrialsTool) -> None:
"""Test that results availability is shown."""
mock_study = {
"protocolSection": {
"identificationModule": {"nctId": "NCT12345678", "briefTitle": "Test"},
"statusModule": {
"overallStatus": "COMPLETED",
"startDateStruct": {"date": "2023"},
# Note: resultsFirstPostDateStruct, not resultsFirstSubmitDate
"resultsFirstPostDateStruct": {"date": "2024-06-15"},
},
"descriptionModule": {"briefSummary": "Summary"},
"designModule": {"phases": ["PHASE3"]},
"conditionsModule": {"conditions": ["ED"]},
"armsInterventionsModule": {"interventions": []},
"outcomesModule": {},
},
"hasResults": True, # Note: hasResults is TOP-LEVEL
}
mock_response = MagicMock()
mock_response.json.return_value = {"studies": [mock_study]}
mock_response.raise_for_status = MagicMock()
with patch("requests.get", return_value=mock_response):
results = await tool.search("test", max_results=1)
assert "Results Available: Yes" in results[0].content
assert "2024-06-15" in results[0].content
@pytest.mark.asyncio
async def test_shows_no_results_when_missing(self, tool: ClinicalTrialsTool) -> None:
"""Test that missing results are indicated."""
mock_study = {
"protocolSection": {
"identificationModule": {"nctId": "NCT12345678", "briefTitle": "Test"},
"statusModule": {"overallStatus": "RECRUITING", "startDateStruct": {"date": "2024"}},
"descriptionModule": {"briefSummary": "Summary"},
"designModule": {"phases": ["PHASE2"]},
"conditionsModule": {"conditions": ["ED"]},
"armsInterventionsModule": {"interventions": []},
"outcomesModule": {},
},
"hasResults": False,
}
mock_response = MagicMock()
mock_response.json.return_value = {"studies": [mock_study]}
mock_response.raise_for_status = MagicMock()
with patch("requests.get", return_value=mock_response):
results = await tool.search("test", max_results=1)
assert "Results Available: Not yet posted" in results[0].content
@pytest.mark.asyncio
async def test_boosts_relevance_for_results(self, tool: ClinicalTrialsTool) -> None:
"""Trials with results should have higher relevance score."""
with_results = {
"protocolSection": {
"identificationModule": {"nctId": "NCT11111111", "briefTitle": "With Results"},
"statusModule": {"overallStatus": "COMPLETED", "startDateStruct": {"date": "2023"}},
"descriptionModule": {"briefSummary": "Summary"},
"designModule": {"phases": []},
"conditionsModule": {"conditions": []},
"armsInterventionsModule": {"interventions": []},
"outcomesModule": {},
},
"hasResults": True,
}
without_results = {
"protocolSection": {
"identificationModule": {"nctId": "NCT22222222", "briefTitle": "No Results"},
"statusModule": {"overallStatus": "RECRUITING", "startDateStruct": {"date": "2024"}},
"descriptionModule": {"briefSummary": "Summary"},
"designModule": {"phases": []},
"conditionsModule": {"conditions": []},
"armsInterventionsModule": {"interventions": []},
"outcomesModule": {},
},
"hasResults": False,
}
mock_response = MagicMock()
mock_response.json.return_value = {"studies": [with_results, without_results]}
mock_response.raise_for_status = MagicMock()
with patch("requests.get", return_value=mock_response):
results = await tool.search("test", max_results=2)
assert results[0].relevance == 0.90 # With results
assert results[1].relevance == 0.85 # Without results
```
### Integration Test
```python
@pytest.mark.integration
class TestClinicalTrialsOutcomesIntegration:
"""Integration tests with real API."""
@pytest.mark.asyncio
async def test_real_completed_trial_has_outcome(self) -> None:
"""Real completed Phase 3 trials should have outcome measures."""
tool = ClinicalTrialsTool()
# Search for completed Phase 3 ED trials (likely to have outcomes)
results = await tool.search(
"sildenafil erectile dysfunction Phase 3 COMPLETED",
max_results=3
)
# At least one should have primary outcome
has_outcome = any("Primary Outcome" in r.content for r in results)
assert has_outcome, "No completed trials with outcome measures found"
```
---
## Files to Modify
| File | Change |
|------|--------|
| `src/tools/clinicaltrials.py` | **ADD** `OutcomesModule` and `HasResults` to `FIELDS`, update `_study_to_evidence()` |
| `tests/unit/tools/test_clinicaltrials.py` | Add outcome parsing tests |
---
## Acceptance Criteria
### FIELDS Constant (REQUIRED CHANGE)
- [ ] `FIELDS` includes `"OutcomesModule"` (returns full outcomesModule)
- [ ] `FIELDS` includes `"HasResults"` (returns top-level boolean)
### `_study_to_evidence()` Method
- [ ] Extracts `protocolSection.outcomesModule.primaryOutcomes`
- [ ] Accesses `study.hasResults` at TOP LEVEL (not inside protocolSection)
- [ ] Results date extracted from `statusModule.resultsFirstPostDateStruct.date`
- [ ] Evidence content includes primary outcome measure when available
- [ ] Evidence content shows results availability status
- [ ] Outcome measure text truncated to 200 chars
- [ ] Trials with results have boosted relevance (0.90 vs 0.85)
### Testing
- [ ] All unit tests pass
- [ ] Integration test confirms real trials return outcome data
- [ ] Live API test confirms `OutcomesModule` and `HasResults` fields work
---
## Edge Cases
1. **No outcomes defined**: Some early-phase trials don't have outcomes yet
- Solution: Gracefully skip outcome section if `outcomesModule` is empty or missing
2. **Multiple primary outcomes**: Some trials have 2-3 primary outcomes
- Solution: Show first outcome only, mention count of others
3. **Long outcome descriptions**: Some measures are very verbose (500+ chars)
- Solution: Truncate measure to 200 chars with `[:200]`
4. **hasResults without resultsFirstPostDateStruct**: Some completed trials may have results without a posted date
- Solution: Show "Results Available: Yes" without date
5. **outcomesModule missing entirely**: Not all API responses include this module
- Solution: Use `.get("outcomesModule", {})` for safe access
---
## Rollback Plan
If outcome extraction causes issues:
1. DO NOT modify `FIELDS` - nothing to revert there
2. Remove outcome extraction code from `_study_to_evidence()`
3. Existing tests should still pass
---
## References
- GitHub Issue #95
- [ClinicalTrials.gov API v2 Studies Endpoint](https://clinicaltrials.gov/data-api/api)
- [Stack Overflow - ClinicalTrials.gov API v2 Response Structure](https://stackoverflow.com/questions/78415818)
- `TOOL_ANALYSIS_CRITICAL.md` - "Tool 2: ClinicalTrials.gov > Current Implementation Gaps"