SPEC_14: Add Outcome Measures to ClinicalTrials.gov Fields

Status: Draft (Validated via API Documentation Review)
Priority: P1
GitHub Issue: #95
Estimated Effort: Small (~40 lines of code)
Last Updated: 2025-11-30


Problem Statement

The ClinicalTrialsTool retrieves trial metadata but misses critical efficacy data:

Current Fields Retrieved

# src/tools/clinicaltrials.py:24-33
FIELDS: ClassVar[list[str]] = [
    "NCTId",
    "BriefTitle",
    "Phase",
    "OverallStatus",
    "Condition",
    "InterventionName",
    "StartDate",
    "BriefSummary",
]

Missing Data (Critical for Research)

| Data | Location in Response | Purpose |
|------|----------------------|---------|
| Primary Outcomes | protocolSection.outcomesModule.primaryOutcomes[].measure | Main efficacy endpoint |
| Secondary Outcomes | protocolSection.outcomesModule.secondaryOutcomes[].measure | Additional endpoints |
| Has Results | study.hasResults (top-level) | Whether results are posted |
| Results Date | protocolSection.statusModule.resultsFirstPostDateStruct.date | When results were posted |

Impact

Current Output:

Trial Phase: PHASE3. Status: COMPLETED. Conditions: Erectile Dysfunction.
Interventions: Sildenafil.

Desired Output:

Trial Phase: PHASE3. Status: COMPLETED. Conditions: Erectile Dysfunction.
Interventions: Sildenafil.
Primary Outcome: Change from baseline in IIEF-EF domain score at Week 12.
Results Available: Yes (posted 2024-01-15).

API Documentation Review (2025-11-30)

ClinicalTrials.gov API v2 Response Structure

Source: Stack Overflow - ClinicalTrials.gov API v2

The API returns nested JSON. Key findings:

  1. hasResults is a top-level field on each study object (NOT inside protocolSection)
  2. Outcomes are in protocolSection.outcomesModule:
    study['protocolSection']['outcomesModule']['primaryOutcomes']  # List
    study['protocolSection']['outcomesModule']['secondaryOutcomes']  # List
    
  3. Results date is in protocolSection.statusModule.resultsFirstPostDateStruct.date

fields Parameter Behavior (VERIFIED VIA LIVE API TESTING)

The fields query parameter filters what the API returns. If you don't request a field, you don't get it.

Live API Test Results (2025-11-30):

# Test 1: With limited fields - NO outcomesModule returned
curl "...&fields=NCTId,BriefTitle"
# → Returns ONLY: protocolSection.identificationModule.{nctId, briefTitle}

# Test 2: Without fields param - outcomesModule IS present
curl "...&pageSize=1"
# → Returns: hasResults: false, outcomesModule: {primaryOutcomes, secondaryOutcomes, otherOutcomes}

# Test 3: Valid field names for outcomes
curl "...&fields=NCTId,OutcomesModule"  # ✅ Works - returns full outcomesModule
curl "...&fields=NCTId,PrimaryOutcome"  # ✅ Works - returns only primaryOutcomes
curl "...&fields=NCTId,HasResults"      # ✅ Works - returns hasResults at top level

Valid Field Names (Tested):

  • OutcomesModule → Returns full protocolSection.outcomesModule with all outcomes
  • PrimaryOutcome → Returns only primaryOutcomes array
  • SecondaryOutcome → Returns only secondaryOutcomes array
  • HasResults → Returns hasResults at study top level
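
As a rough illustration of how the fields parameter is assembled, the sketch below builds a request URL that asks for the outcome fields. The base URL and the `query.term` parameter name are assumptions not stated in this spec, and `build_search_url` is a hypothetical helper, so check them against the API documentation before relying on them. No network call is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base endpoint for the v2 studies search (not stated in this spec).
BASE_URL = "https://clinicaltrials.gov/api/v2/studies"

# Subset of field names, including the two outcome-related names tested above.
FIELDS = ["NCTId", "BriefTitle", "OutcomesModule", "HasResults"]


def build_search_url(term: str, page_size: int = 1) -> str:
    """Build a search URL that requests only the fields listed in FIELDS."""
    params = {
        "query.term": term,          # assumed search parameter name
        "pageSize": page_size,       # same parameter as in the curl tests
        "fields": ",".join(FIELDS),  # comma-separated, as in the curl tests
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("sildenafil")

# Round-trip the query string to confirm which fields would be requested.
requested = parse_qs(urlsplit(url).query)["fields"][0].split(",")
print(requested)
```

Any field left out of the joined list would be absent from the response, which is why the outcome names have to be added explicitly.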

Proposed Solution

✅ UPDATE FIELDS Constant (REQUIRED)

The current implementation explicitly passes fields=",".join(self.FIELDS) at line 67. The API ONLY returns requested fields. We MUST add the new field names.

# src/tools/clinicaltrials.py - UPDATE FIELDS
FIELDS: ClassVar[list[str]] = [
    "NCTId",
    "BriefTitle",
    "Phase",
    "OverallStatus",
    "Condition",
    "InterventionName",
    "StartDate",
    "BriefSummary",
    # NEW: Outcome measures (verified via live API testing 2025-11-30)
    "OutcomesModule",  # Returns protocolSection.outcomesModule.{primaryOutcomes, secondaryOutcomes}
    "HasResults",      # Returns study.hasResults (top-level boolean)
]

✅ Update _study_to_evidence() Method

def _study_to_evidence(self, study: dict[str, Any]) -> Evidence:
    """Convert a clinical trial study to Evidence."""
    # Navigate nested structure
    protocol = study.get("protocolSection", {})
    id_module = protocol.get("identificationModule", {})
    status_module = protocol.get("statusModule", {})
    desc_module = protocol.get("descriptionModule", {})
    design_module = protocol.get("designModule", {})
    conditions_module = protocol.get("conditionsModule", {})
    arms_module = protocol.get("armsInterventionsModule", {})
    outcomes_module = protocol.get("outcomesModule", {})  # NEW

    # ... existing field extraction (nct_id, title, status, phase, etc.) ...

    # NEW: Extract outcome measures
    primary_outcomes = outcomes_module.get("primaryOutcomes", [])
    primary_outcome_str = ""
    if primary_outcomes:
        # Get first primary outcome measure and timeframe
        first = primary_outcomes[0]
        measure = first.get("measure", "")
        timeframe = first.get("timeFrame", "")
        # Truncate long outcome descriptions
        primary_outcome_str = measure[:200]
        if timeframe:
            primary_outcome_str += f" (measured at {timeframe})"

    secondary_outcomes = outcomes_module.get("secondaryOutcomes", [])
    secondary_count = len(secondary_outcomes)

    # NEW: Check if results are available (hasResults is TOP-LEVEL, not in protocol!)
    has_results = study.get("hasResults", False)

    # Results date is in statusModule (nested inside date struct)
    results_date_struct = status_module.get("resultsFirstPostDateStruct", {})
    results_date = results_date_struct.get("date", "")

    # Build content with key trial info (UPDATED)
    content_parts = [
        f"{summary[:400]}...",
        f"Trial Phase: {phase}.",
        f"Status: {status}.",
        f"Conditions: {conditions_str}.",
        f"Interventions: {interventions_str}.",
    ]

    if primary_outcome_str:
        content_parts.append(f"Primary Outcome: {primary_outcome_str}.")

    if secondary_count > 0:
        content_parts.append(f"Secondary Outcomes: {secondary_count} additional endpoints.")

    if has_results:
        results_info = "Results Available: Yes"
        if results_date:
            results_info += f" (posted {results_date})"
        content_parts.append(results_info + ".")
    else:
        content_parts.append("Results Available: Not yet posted.")

    content = " ".join(content_parts)

    return Evidence(
        content=content[:2000],
        citation=Citation(
            source="clinicaltrials",
            title=title[:500],
            url=f"https://clinicaltrials.gov/study/{nct_id}",
            date=start_date,
            authors=[],
        ),
        relevance=0.90 if has_results else 0.85,  # Boost relevance for trials with results
    )

API Reference

The ClinicalTrials.gov API v2 returns nested JSON:

{
  "protocolSection": {
    "outcomesModule": {
      "primaryOutcomes": [
        {
          "measure": "Change from Baseline in IIEF-EF Domain Score",
          "description": "...",
          "timeFrame": "Baseline to Week 12"
        }
      ],
      "secondaryOutcomes": [
        {
          "measure": "Subject Global Assessment Question",
          "timeFrame": "Week 12"
        }
      ]
    }
  },
  "hasResults": true
}

See: https://clinicaltrials.gov/data-api/api
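
To make the nesting concrete, here is a minimal sketch that parses the sample response above and reads the three locations this spec cares about. The `dig` helper is hypothetical (it is not part of the codebase); it only illustrates safe traversal when a module such as statusModule is absent.

```python
import json
from typing import Any

# The sample study object from the API Reference section.
SAMPLE = json.loads("""
{
  "protocolSection": {
    "outcomesModule": {
      "primaryOutcomes": [
        {"measure": "Change from Baseline in IIEF-EF Domain Score",
         "description": "...",
         "timeFrame": "Baseline to Week 12"}
      ],
      "secondaryOutcomes": [
        {"measure": "Subject Global Assessment Question", "timeFrame": "Week 12"}
      ]
    }
  },
  "hasResults": true
}
""")


def dig(data: dict[str, Any], *keys: str, default: Any = None) -> Any:
    """Walk nested dicts, returning default as soon as a key is missing."""
    current: Any = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


primary = dig(SAMPLE, "protocolSection", "outcomesModule", "primaryOutcomes", default=[])
# statusModule is absent from the sample, so the default is returned.
results_date = dig(
    SAMPLE, "protocolSection", "statusModule", "resultsFirstPostDateStruct", "date", default=""
)
# hasResults sits at the top level of the study, not under protocolSection.
has_results = SAMPLE.get("hasResults", False)

print(primary[0]["measure"])
print(has_results, repr(results_date))
```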


Test Plan

Unit Tests (tests/unit/tools/test_clinicaltrials.py)

@pytest.mark.unit
class TestClinicalTrialsOutcomes:
    """Tests for outcome measure extraction."""

    @pytest.mark.asyncio
    async def test_extracts_primary_outcome(self, tool: ClinicalTrialsTool) -> None:
        """Test that primary outcome is extracted from response."""
        mock_study = {
            "protocolSection": {
                "identificationModule": {"nctId": "NCT12345678", "briefTitle": "Test"},
                "statusModule": {"overallStatus": "COMPLETED", "startDateStruct": {"date": "2023"}},
                "descriptionModule": {"briefSummary": "Summary"},
                "designModule": {"phases": ["PHASE3"]},
                "conditionsModule": {"conditions": ["ED"]},
                "armsInterventionsModule": {"interventions": []},
                "outcomesModule": {
                    "primaryOutcomes": [
                        {
                            "measure": "Change in IIEF-EF score",
                            "timeFrame": "Week 12"
                        }
                    ]
                },
            },
            "hasResults": True,
        }

        mock_response = MagicMock()
        mock_response.json.return_value = {"studies": [mock_study]}
        mock_response.raise_for_status = MagicMock()

        with patch("requests.get", return_value=mock_response):
            results = await tool.search("test", max_results=1)

            assert len(results) == 1
            assert "Primary Outcome" in results[0].content
            assert "IIEF-EF" in results[0].content
            assert "Week 12" in results[0].content

    @pytest.mark.asyncio
    async def test_includes_results_status(self, tool: ClinicalTrialsTool) -> None:
        """Test that results availability is shown."""
        mock_study = {
            "protocolSection": {
                "identificationModule": {"nctId": "NCT12345678", "briefTitle": "Test"},
                "statusModule": {
                    "overallStatus": "COMPLETED",
                    "startDateStruct": {"date": "2023"},
                    # Note: resultsFirstPostDateStruct, not resultsFirstSubmitDate
                    "resultsFirstPostDateStruct": {"date": "2024-06-15"},
                },
                "descriptionModule": {"briefSummary": "Summary"},
                "designModule": {"phases": ["PHASE3"]},
                "conditionsModule": {"conditions": ["ED"]},
                "armsInterventionsModule": {"interventions": []},
                "outcomesModule": {},
            },
            "hasResults": True,  # Note: hasResults is TOP-LEVEL
        }

        mock_response = MagicMock()
        mock_response.json.return_value = {"studies": [mock_study]}
        mock_response.raise_for_status = MagicMock()

        with patch("requests.get", return_value=mock_response):
            results = await tool.search("test", max_results=1)

            assert "Results Available: Yes" in results[0].content
            assert "2024-06-15" in results[0].content

    @pytest.mark.asyncio
    async def test_shows_no_results_when_missing(self, tool: ClinicalTrialsTool) -> None:
        """Test that missing results are indicated."""
        mock_study = {
            "protocolSection": {
                "identificationModule": {"nctId": "NCT12345678", "briefTitle": "Test"},
                "statusModule": {"overallStatus": "RECRUITING", "startDateStruct": {"date": "2024"}},
                "descriptionModule": {"briefSummary": "Summary"},
                "designModule": {"phases": ["PHASE2"]},
                "conditionsModule": {"conditions": ["ED"]},
                "armsInterventionsModule": {"interventions": []},
                "outcomesModule": {},
            },
            "hasResults": False,
        }

        mock_response = MagicMock()
        mock_response.json.return_value = {"studies": [mock_study]}
        mock_response.raise_for_status = MagicMock()

        with patch("requests.get", return_value=mock_response):
            results = await tool.search("test", max_results=1)

            assert "Results Available: Not yet posted" in results[0].content

    @pytest.mark.asyncio
    async def test_boosts_relevance_for_results(self, tool: ClinicalTrialsTool) -> None:
        """Trials with results should have higher relevance score."""
        with_results = {
            "protocolSection": {
                "identificationModule": {"nctId": "NCT11111111", "briefTitle": "With Results"},
                "statusModule": {"overallStatus": "COMPLETED", "startDateStruct": {"date": "2023"}},
                "descriptionModule": {"briefSummary": "Summary"},
                "designModule": {"phases": []},
                "conditionsModule": {"conditions": []},
                "armsInterventionsModule": {"interventions": []},
                "outcomesModule": {},
            },
            "hasResults": True,
        }
        without_results = {
            "protocolSection": {
                "identificationModule": {"nctId": "NCT22222222", "briefTitle": "No Results"},
                "statusModule": {"overallStatus": "RECRUITING", "startDateStruct": {"date": "2024"}},
                "descriptionModule": {"briefSummary": "Summary"},
                "designModule": {"phases": []},
                "conditionsModule": {"conditions": []},
                "armsInterventionsModule": {"interventions": []},
                "outcomesModule": {},
            },
            "hasResults": False,
        }

        mock_response = MagicMock()
        mock_response.json.return_value = {"studies": [with_results, without_results]}
        mock_response.raise_for_status = MagicMock()

        with patch("requests.get", return_value=mock_response):
            results = await tool.search("test", max_results=2)

            assert results[0].relevance == 0.90  # With results
            assert results[1].relevance == 0.85  # Without results

Integration Test

@pytest.mark.integration
class TestClinicalTrialsOutcomesIntegration:
    """Integration tests with real API."""

    @pytest.mark.asyncio
    async def test_real_completed_trial_has_outcome(self) -> None:
        """Real completed Phase 3 trials should have outcome measures."""
        tool = ClinicalTrialsTool()

        # Search for completed Phase 3 ED trials (likely to have outcomes)
        results = await tool.search(
            "sildenafil erectile dysfunction Phase 3 COMPLETED",
            max_results=3
        )

        # At least one should have primary outcome
        has_outcome = any("Primary Outcome" in r.content for r in results)
        assert has_outcome, "No completed trials with outcome measures found"

Files to Modify

| File | Change |
|------|--------|
| src/tools/clinicaltrials.py | Add OutcomesModule and HasResults to FIELDS, update _study_to_evidence() |
| tests/unit/tools/test_clinicaltrials.py | Add outcome parsing tests |

Acceptance Criteria

FIELDS Constant (REQUIRED CHANGE)

  • FIELDS includes "OutcomesModule" (returns full outcomesModule)
  • FIELDS includes "HasResults" (returns top-level boolean)

_study_to_evidence() Method

  • Extracts protocolSection.outcomesModule.primaryOutcomes
  • Accesses study.hasResults at TOP LEVEL (not inside protocolSection)
  • Results date extracted from statusModule.resultsFirstPostDateStruct.date
  • Evidence content includes primary outcome measure when available
  • Evidence content shows results availability status
  • Outcome measure text truncated to 200 chars
  • Trials with results have boosted relevance (0.90 vs 0.85)

Testing

  • All unit tests pass
  • Integration test confirms real trials return outcome data
  • Live API test confirms OutcomesModule and HasResults fields work

Edge Cases

  1. No outcomes defined: Some early-phase trials don't have outcomes yet

    • Solution: Gracefully skip outcome section if outcomesModule is empty or missing
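  2. Multiple primary outcomes: Some trials have 2-3 primary outcomes

    • Solution: Show the first outcome only. The proposed _study_to_evidence() code does not yet report how many other primary outcomes exist, so add that count if it is required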
  3. Long outcome descriptions: Some measures are very verbose (500+ chars)

    • Solution: Truncate measure to 200 chars with [:200]
  4. hasResults without resultsFirstPostDateStruct: Some completed trials may have results without a posted date

    • Solution: Show "Results Available: Yes" without date
  5. outcomesModule missing entirely: Not all API responses include this module

    • Solution: Use .get("outcomesModule", {}) for safe access
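
The edge cases above can be exercised with a small standalone function. This is a sketch under assumptions, not the project's implementation: `summarize_outcomes` is a hypothetical name, and the wording used to report extra primary outcomes is invented for illustration.

```python
from typing import Any

MAX_MEASURE_CHARS = 200  # truncation limit taken from this spec


def summarize_outcomes(study: dict[str, Any]) -> list[str]:
    """Return content fragments for outcomes and results availability."""
    protocol = study.get("protocolSection", {})
    outcomes = protocol.get("outcomesModule", {})  # edge cases 1 and 5
    status = protocol.get("statusModule", {})
    parts: list[str] = []

    primary = outcomes.get("primaryOutcomes", [])
    if primary:
        first = primary[0]
        text = first.get("measure", "")[:MAX_MEASURE_CHARS]  # edge case 3
        if first.get("timeFrame"):
            text += f" (measured at {first['timeFrame']})"
        extra = len(primary) - 1
        if extra > 0:
            # Edge case 2: hypothetical wording for the remaining count.
            text += f"; {extra} more primary outcome(s)"
        parts.append(f"Primary Outcome: {text}.")

    if study.get("hasResults", False):
        date = status.get("resultsFirstPostDateStruct", {}).get("date", "")
        # Edge case 4: results flagged but no posted date.
        parts.append(f"Results Available: Yes (posted {date})." if date else "Results Available: Yes.")
    else:
        parts.append("Results Available: Not yet posted.")
    return parts


study = {
    "protocolSection": {
        "outcomesModule": {
            "primaryOutcomes": [
                {"measure": "A" * 500, "timeFrame": "Week 12"},
                {"measure": "B"},
            ]
        }
    },
    "hasResults": True,
}
parts = summarize_outcomes(study)
print(parts[1])
```

Keeping this logic in a pure function, separate from the HTTP call, would let each edge case be unit-tested without mocking a response.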

Rollback Plan

If outcome extraction causes issues:

  1. Remove "OutcomesModule" and "HasResults" from FIELDS to restore the original API request
  2. Remove outcome extraction code from _study_to_evidence()
  3. Existing tests should still pass

References