🛡️ Safeguarding AI: Building Falconz as an MCP Server for Enterprise LLM Security

Community Article Published November 29, 2025

Community Article | Published November 29, 2025 | By Mohammed Arsalan
MCP-1st-Birthday Hackathon Track: Building MCP for Enterprise

Security shouldn't be bolted on; it should be woven in.

That was the core idea behind Falconz, my submission for the Building MCP track of the Model Context Protocol (MCP) 1st Birthday Hackathon.

We've all been there: deploying an LLM application only to worry about jailbreaks, prompt injections, and hidden manipulation attacks. Traditional security tools give you logs, but they don't give you real-time defense across your entire AI stack.

Falconz changes that. It's an AI-powered security platform that transforms fragmented threat detection into a unified, multi-model defense layer. And thanks to MCP, it's not just an app—it's a fully compliant server that any AI agent (Claude, your IDE, your agentic pipeline) can talk to directly.

🚀 Try the Live Demo

Falconz on Hugging Face Spaces

The Challenge: "Detect. Classify. Defend."

The problem is simple but critical:

How do you detect prompt injections, jailbreaks, and policy violations in real-time across multiple LLM providers?

You can't ask a developer to manually review every model output. You need an AI-powered "security detective" that works 24/7, adapts to new attack patterns, and scales across your entire infrastructure.

That's where Falconz comes in.

Under the Hood: The Three-Layer Security Architecture

Building this required a robust, multi-layered system:

┌─────────────────────────────────────────────┐
│ Layer 1: Multi-Modal Input Handler          │
│ (Chat, Images, URLs, Raw Prompts)           │
└────────────┬────────────────────────────────┘
             │
┌────────────▼────────────────────────────────┐
│ Layer 2: Falconzz Detective Engine           │
│ (Claude-Powered Threat Analysis)             │
└────────────┬────────────────────────────────┘
             │
┌────────────▼────────────────────────────────┐
│ Layer 3: MCP Server + Analytics             │
│ (Tools, Prompts, Resources for AI Agents)   │
└─────────────────────────────────────────────┘

Stage 1: The Input Analyzer

Whether it's a chat conversation, an image with hidden text, or a crafted prompt, Falconz accepts it all.

Aspect	Details
Role	Normalize input across modalities
Capabilities	Vision models for OCR, text parsing, URL scraping
Output	Unified threat assessment pipeline

Stage 2: The Falconzz Detective (Claude)

Here's the secret sauce: We use Anthropic's Claude models as the primary detection engine.

Why Claude?

✅ Inherently robust against prompt injections (built-in safety)
✅ Contextually aware of manipulation tactics
✅ Explainable outputs (structured JSON with reasoning)

The detective analyzes every input for:

Jailbreak phrases ("Ignore all previous instructions...")
Obfuscation techniques (Base64, leet speak, emojis, reversals)
Policy violations (malware guides, self-harm, hate speech, private data theft)
Novel attack patterns (even if they don't match known templates)

Output: Structured Risk Assessment

{
  "risk_score": 85,
  "potential_jailbreak": true,
  "policy_break_points": ["malware"],
  "attack_used": "prompt-override"
}

Analysis Time: ~5-10 seconds per input

Stage 3: MCP Integration + Analytics

Using Gradio 5.0+, we expose internal functions as MCP tools with a single flag: mcp_server=True.

This means any AI agent can now talk to Falconz programmatically.

Entering the Matrix: MCP Server Capabilities

This isn't just a web app you visit. It's a living server in your infrastructure.

The Exposed Tools (What AI Agents Can Call)

assess_text_for_injection

Analyzes any text for prompt injection risks
Returns risk score, attack type, policy violations
Ideal for securing chat interactions

analyze_image_for_hidden_prompts

Scans images (screenshots, documents, diagrams) for hidden instructions
Detects OCR-extractable injection attempts
Returns SAFE / UNSAFE + confidence level

test_prompt_against_models

Test a single prompt against multiple LLM providers simultaneously
Benchmarks which models are most resistant to jailbreaks
Generates comparative security reports

generate_threat_report

Pulls analytics from historical scans
Visualizes attack trends over time
Outputs CSV + JSON for compliance reporting

The Exposed Prompts (Guided Workflows)

security_audit_workflow Orchestrates a full security audit of your LLM application. Prompts an AI agent to probe your system systematically.

red_team_simulation Guides an agent through advanced jailbreak scenarios for pre-deployment testing.

The Exposed Resources (Reference Data)

falconz://prompt-injection-templates Access the OWASP-aligned library of known jailbreak patterns.

falconz://attack-taxonomy Reference guide to attack types and classifications.

The Security-First Development Approach

This project was built with Spec-Driven Development. I started with a detailed PRD that defined:

✏️ User flows for each security persona (DevOps, Security Teams, AI Engineers)
✏️ API contracts for MCP tools
✏️ Data schemas for threat reports
✏️ Compliance requirements (audit trails, timestamps, model traceability)

Then I worked with Claude and GitHub Copilot to implement it, focusing on:

High-level architecture
Safety guardrails
Integration points

Result? Clean, maintainable code that scales.

Key Features

✅ Multi-Modal Security
Chat, images, raw prompts—Falconz handles them all.

✅ Multi-Model Testing
Compare safety across Claude, GPT-4, Gemini, Mistral, Llama Guard, and more (via OpenRouter).

✅ Real-Time Risk Scoring
Know instantly if an output is safe or dangerous with color-coded severity indicators.

✅ Enterprise Analytics Dashboard
Track attack trends, generate compliance reports, audit all interactions with timestamps.

✅ MCP-Native Architecture
Plug Falconz directly into Claude Desktop, VS Code, or your custom AI agents.

✅ OWASP-Aligned
Built on the OWASP GenAI LLM Top 10 security framework.

Why This Matters for Enterprise AI

AI security isn't a feature—it's a requirement. But most teams are doing it wrong:

Approach	Problem
❌ Reactive scanning	Too late—damage already done
❌ Single-model testing	False sense of security
❌ Manual threat analysis	Doesn't scale

Falconz flips the script:

Approach	Solution
✅ Proactive detection	Real-time threat analysis
✅ Multi-model validation	7+ LLM providers tested
✅ Automated analysis	Scales to millions of interactions

And with MCP, it's not confined to a web interface. It's a security primitive that AI agents can build on.

The Math Behind Risk Scoring

Falconz combines multiple signals to compute a unified risk score:

$\text{Risk Score} = w_1 \cdot J + w_2 \cdot O + w_3 \cdot P$

Where:

$J$ = Jailbreak likelihood (0-100)
$O$ = Obfuscation complexity (0-100)
$P$ = Policy violation severity (0-100)
$w_{1}, w_{2}, w_{3}$ = Learned weights from Claude analysis

The result is a calibrated score where values $\geq 70$ are flagged for human review.

What's Next?

The roadmap includes:

🔮 Persistent Threat Intelligence
Learning from detected attacks to improve future detection patterns.

🔮 Agent-to-Agent Defense
LLM agents calling each other through Falconz for mutual validation.

🔮 Compliance Mode
Auto-generating SOC 2, ISO 27001, and HIPAA audit logs.

🔮 Custom Detection Rules
Let security teams define org-specific policies and attack patterns.

🔮 Threat Feed Integration
Real-time updates from threat research communities.

The Takeaway

Building Falconz taught me that the future of AI security isn't about walls—it's about transparency, testability, and trustworthy agents.

With MCP, security tools stop being isolated silos and become collaborative members of your AI infrastructure. Falconz is just the beginning.

A huge thank you to:

Anthropic for Claude (the backbone of our detection engine)
Google for Gemini APIs and processing power
Hugging Face for hosting and the hackathon platform
The MCP community for pushing the boundaries of what's possible

Try It Now

Falconz: Unified LLM Security & Red Teaming Platform

Test it yourself. Tell me what attack patterns it catches—and what it misses.

🛡️ Build safe. Test responsibly. Protect the future of AI.

#MCPHackathon #AISecurityy #LLMSafety #RedTeaming #MCP #Gradio #EnterpriseAI #Anthropic

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote