๐Ÿ›ก๏ธ Safeguarding AI: Building Falconz as an MCP Server for Enterprise LLM Security

Community Article Published November 29, 2025

Community Article | Published November 29, 2025 | By Mohammed Arsalan
MCP-1st-Birthday Hackathon Track: Building MCP for Enterprise


Security shouldn't be bolted on; it should be woven in.

That was the core idea behind Falconz, my submission for the Building MCP track of the Model Context Protocol (MCP) 1st Birthday Hackathon.

We've all been there: deploying an LLM application only to worry about jailbreaks, prompt injections, and hidden manipulation attacks. Traditional security tools give you logs, but they don't give you real-time defense across your entire AI stack.

Falconz changes that. It's an AI-powered security platform that transforms fragmented threat detection into a unified, multi-model defense layer. And thanks to MCP, it's not just an appโ€”it's a fully compliant server that any AI agent (Claude, your IDE, your agentic pipeline) can talk to directly.

๐Ÿš€ Try the Live Demo

Falconz on Hugging Face Spaces


The Challenge: "Detect. Classify. Defend."

The problem is simple but critical:

How do you detect prompt injections, jailbreaks, and policy violations in real-time across multiple LLM providers?

You can't ask a developer to manually review every model output. You need an AI-powered "security detective" that works 24/7, adapts to new attack patterns, and scales across your entire infrastructure.

That's where Falconz comes in.


Under the Hood: The Three-Layer Security Architecture

Building this required a robust, multi-layered system:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Layer 1: Multi-Modal Input Handler          โ”‚
โ”‚ (Chat, Images, URLs, Raw Prompts)           โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
             โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Layer 2: Falconzz Detective Engine           โ”‚
โ”‚ (Claude-Powered Threat Analysis)             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
             โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Layer 3: MCP Server + Analytics             โ”‚
โ”‚ (Tools, Prompts, Resources for AI Agents)   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Stage 1: The Input Analyzer

Whether it's a chat conversation, an image with hidden text, or a crafted prompt, Falconz accepts it all.

Aspect Details
Role Normalize input across modalities
Capabilities Vision models for OCR, text parsing, URL scraping
Output Unified threat assessment pipeline

Stage 2: The Falconzz Detective (Claude)

Here's the secret sauce: We use Anthropic's Claude models as the primary detection engine.

Why Claude?

  • โœ… Inherently robust against prompt injections (built-in safety)
  • โœ… Contextually aware of manipulation tactics
  • โœ… Explainable outputs (structured JSON with reasoning)

The detective analyzes every input for:

  • Jailbreak phrases ("Ignore all previous instructions...")
  • Obfuscation techniques (Base64, leet speak, emojis, reversals)
  • Policy violations (malware guides, self-harm, hate speech, private data theft)
  • Novel attack patterns (even if they don't match known templates)

Output: Structured Risk Assessment

{
  "risk_score": 85,
  "potential_jailbreak": true,
  "policy_break_points": ["malware"],
  "attack_used": "prompt-override"
}

Analysis Time: ~5-10 seconds per input

Stage 3: MCP Integration + Analytics

Using Gradio 5.0+, we expose internal functions as MCP tools with a single flag: mcp_server=True.

This means any AI agent can now talk to Falconz programmatically.


Entering the Matrix: MCP Server Capabilities

This isn't just a web app you visit. It's a living server in your infrastructure.

The Exposed Tools (What AI Agents Can Call)

assess_text_for_injection

  • Analyzes any text for prompt injection risks
  • Returns risk score, attack type, policy violations
  • Ideal for securing chat interactions

analyze_image_for_hidden_prompts

  • Scans images (screenshots, documents, diagrams) for hidden instructions
  • Detects OCR-extractable injection attempts
  • Returns SAFE / UNSAFE + confidence level

test_prompt_against_models

  • Test a single prompt against multiple LLM providers simultaneously
  • Benchmarks which models are most resistant to jailbreaks
  • Generates comparative security reports

generate_threat_report

  • Pulls analytics from historical scans
  • Visualizes attack trends over time
  • Outputs CSV + JSON for compliance reporting

The Exposed Prompts (Guided Workflows)

security_audit_workflow Orchestrates a full security audit of your LLM application. Prompts an AI agent to probe your system systematically.

red_team_simulation Guides an agent through advanced jailbreak scenarios for pre-deployment testing.

The Exposed Resources (Reference Data)

falconz://prompt-injection-templates Access the OWASP-aligned library of known jailbreak patterns.

falconz://attack-taxonomy Reference guide to attack types and classifications.


The Security-First Development Approach

This project was built with Spec-Driven Development. I started with a detailed PRD that defined:

  • โœ๏ธ User flows for each security persona (DevOps, Security Teams, AI Engineers)
  • โœ๏ธ API contracts for MCP tools
  • โœ๏ธ Data schemas for threat reports
  • โœ๏ธ Compliance requirements (audit trails, timestamps, model traceability)

Then I worked with Claude and GitHub Copilot to implement it, focusing on:

  • High-level architecture
  • Safety guardrails
  • Integration points

Result? Clean, maintainable code that scales.


Key Features

โœ… Multi-Modal Security
Chat, images, raw promptsโ€”Falconz handles them all.

โœ… Multi-Model Testing
Compare safety across Claude, GPT-4, Gemini, Mistral, Llama Guard, and more (via OpenRouter).

โœ… Real-Time Risk Scoring
Know instantly if an output is safe or dangerous with color-coded severity indicators.

โœ… Enterprise Analytics Dashboard
Track attack trends, generate compliance reports, audit all interactions with timestamps.

โœ… MCP-Native Architecture
Plug Falconz directly into Claude Desktop, VS Code, or your custom AI agents.

โœ… OWASP-Aligned
Built on the OWASP GenAI LLM Top 10 security framework.


Why This Matters for Enterprise AI

AI security isn't a featureโ€”it's a requirement. But most teams are doing it wrong:

Approach Problem
โŒ Reactive scanning Too lateโ€”damage already done
โŒ Single-model testing False sense of security
โŒ Manual threat analysis Doesn't scale

Falconz flips the script:

Approach Solution
โœ… Proactive detection Real-time threat analysis
โœ… Multi-model validation 7+ LLM providers tested
โœ… Automated analysis Scales to millions of interactions

And with MCP, it's not confined to a web interface. It's a security primitive that AI agents can build on.


The Math Behind Risk Scoring

Falconz combines multiple signals to compute a unified risk score:

Risk Score=w1โ‹…J+w2โ‹…O+w3โ‹…P\text{Risk Score} = w_1 \cdot J + w_2 \cdot O + w_3 \cdot P

Where:

  • JJ = Jailbreak likelihood (0-100)
  • OO = Obfuscation complexity (0-100)
  • PP = Policy violation severity (0-100)
  • w1,w2,w3w_1, w_2, w_3 = Learned weights from Claude analysis

The result is a calibrated score where values โ‰ฅ70\geq 70 are flagged for human review.


What's Next?

The roadmap includes:

๐Ÿ”ฎ Persistent Threat Intelligence
Learning from detected attacks to improve future detection patterns.

๐Ÿ”ฎ Agent-to-Agent Defense
LLM agents calling each other through Falconz for mutual validation.

๐Ÿ”ฎ Compliance Mode
Auto-generating SOC 2, ISO 27001, and HIPAA audit logs.

๐Ÿ”ฎ Custom Detection Rules
Let security teams define org-specific policies and attack patterns.

๐Ÿ”ฎ Threat Feed Integration
Real-time updates from threat research communities.


The Takeaway

Building Falconz taught me that the future of AI security isn't about wallsโ€”it's about transparency, testability, and trustworthy agents.

With MCP, security tools stop being isolated silos and become collaborative members of your AI infrastructure. Falconz is just the beginning.

A huge thank you to:

  • Anthropic for Claude (the backbone of our detection engine)
  • Google for Gemini APIs and processing power
  • Hugging Face for hosting and the hackathon platform
  • The MCP community for pushing the boundaries of what's possible

Try It Now

Falconz: Unified LLM Security & Red Teaming Platform

Test it yourself. Tell me what attack patterns it catchesโ€”and what it misses.

๐Ÿ›ก๏ธ Build safe. Test responsibly. Protect the future of AI.


#MCPHackathon #AISecurityy #LLMSafety #RedTeaming #MCP #Gradio #EnterpriseAI #Anthropic

Community

Sign up or log in to comment