Security Mechanisms

Defense API provides multiple layers of security to protect your LLM applications from attacks and privacy breaches. Each mechanism can be independently configured to match your security requirements.


Overview

The gateway implements four primary security mechanisms:

  1. Spotlighting - Defend against prompt injection attacks
  2. LLM-as-a-Judge - Content safety evaluation and filtering
  3. PII Anonymization - Privacy protection through automated data masking
  4. System Prompt Checking - Validation of system prompts

All mechanisms are applied transparently - your application code remains unchanged while requests are secured in transit.


1. Spotlighting (Prompt Injection Defense)

Purpose: Protect against indirect prompt injection attacks by clearly marking user-generated content.

How it works: Spotlighting transforms user inputs before sending them to the LLM, making it clear to the model which parts are instructions vs. user data. This prevents malicious content (e.g., from compromised websites or documents) from being interpreted as system instructions.

Methods

Delimit Method

Wraps user content with delimiter characters to create clear boundaries.

Configuration:

  • delimiter: Custom delimiter string (default: $$$)

Example:

Original: "Tell me about security"
Transformed: "$$$Tell me about security$$$"
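
The delimit transformation can be sketched in a few lines of Python. The function name and defaults here are illustrative, not the gateway's actual implementation:

```python
def delimit(user_content: str, delimiter: str = "$$$") -> str:
    """Wrap user content in delimiters so the model can tell data from instructions."""
    return f"{delimiter}{user_content}{delimiter}"

print(delimit("Tell me about security"))  # $$$Tell me about security$$$
```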

Encode Method

Encodes user content using standard encoding schemes, instructing the LLM to decode before processing.

Configuration:

  • encoding: Encoding scheme - base64 (default), rot13 (ROT13 cipher), or binary

Example (base64):

Original: "Tell me about security"
Transformed: "VGVsbCBtZSBhYm91dCBzZWN1cml0eQ=="
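
The base64 variant is just standard Base64 over the user text; a minimal sketch (the function name is illustrative):

```python
import base64

def encode_spotlight(user_content: str) -> str:
    """Base64-encode user content; the system prompt instructs the model to decode it."""
    return base64.b64encode(user_content.encode("utf-8")).decode("ascii")

print(encode_spotlight("Tell me about security"))
# VGVsbCBtZSBhYm91dCBzZWN1cml0eQ==
```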

Mark Method

Replaces spaces with special Unicode markers to make tampering detectable.

Configuration:

  • marking: Custom Unicode marker character (default: \uE000)

Example:

Original: "Tell me about security"
Transformed: "Tell\uE000me\uE000about\uE000security"
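
The mark transformation is a simple substitution; a sketch, with the default private-use marker:

```python
def mark(user_content: str, marking: str = "\uE000") -> str:
    """Replace spaces with a private-use Unicode marker; absent markers reveal tampering."""
    return user_content.replace(" ", marking)

assert mark("Tell me about security") == "Tell\uE000me\uE000about\uE000security"
```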

Use Cases

  • Protecting AI agents that process web content
  • Securing document analysis workflows
  • Defending against indirect prompt injection in RAG systems

Performance Overhead

< 10ms - Spotlighting adds minimal latency to requests. The transformation is lightweight and performed locally in the gateway.


2. LLM-as-a-Judge (Content Safety Guard)

Purpose: Evaluate the safety of user prompts and LLM responses using an LLM-based classifier.

How it works: Before processing requests (and optionally after receiving responses), the gateway sends the content to a specialized guard model that evaluates safety across multiple risk categories. Unsafe content is blocked based on your configured policies.

Safety Levels

Medium Level

Blocks content rated as UNSAFE only. Allows controversial but not explicitly harmful content.

Use case: General-purpose applications with moderate content restrictions.

Strict Level

Blocks content rated as UNSAFE or CONTROVERSIAL. Provides maximum protection with tighter filtering.

Use case: Enterprise applications, educational platforms, customer-facing chatbots.

Blocked Categories

Configure which content categories to filter. Available categories include:

  • Violent Content - Violence, weapons, gore
  • Non-violent Illegal Acts - Fraud, theft, illegal activities
  • Sexual Content - Explicit sexual material
  • PII (Personally Identifiable Information) - Attempts to extract private data
  • Suicide & Self-Harm - Self-harm instructions or encouragement
  • Unethical Acts - Deception, manipulation, unethical behavior
  • Politically Sensitive Topics - Controversial political content
  • Copyright Violation - Requests for copyrighted material
  • Jailbreak - Attempts to bypass safety guardrails

Configuration Options

  • active: Enable/disable the guard
  • judge_level: medium or strict
  • blocked_categories: List of categories to block
  • judge_output: Whether to also evaluate LLM responses (not just inputs)

Example Flow

  1. User sends: "How do I hack into a system?"
  2. Guard evaluates → UNSAFE, category: Jailbreak
  3. Request blocked with message: "I'm sorry, but I can't help with that. Your query is deemed unsafe in the category: Jailbreak"
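
The blocking decision from the guard's verdict can be sketched as follows. This is a hypothetical illustration of the policy logic only; the actual guard model call and verdict format are internal to the gateway:

```python
def should_block(verdict: str, category: str,
                 judge_level: str, blocked_categories: list[str]) -> bool:
    """Block when the verdict meets the configured level and the category is listed."""
    # medium blocks UNSAFE only; strict also blocks CONTROVERSIAL
    unsafe = {"UNSAFE"} if judge_level == "medium" else {"UNSAFE", "CONTROVERSIAL"}
    return verdict in unsafe and category in blocked_categories

if should_block("UNSAFE", "Jailbreak", "strict", ["Jailbreak", "PII"]):
    print("I'm sorry, but I can't help with that. "
          "Your query is deemed unsafe in the category: Jailbreak")
```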

Use Cases

  • Enterprise content filtering
  • Compliance with content policies
  • Protection against jailbreak attempts
  • Output validation for customer-facing applications

Performance Overhead

~170ms - The guard mechanism requires an additional LLM inference call to evaluate content safety. Overhead depends on the guard model's response time and content length.


3. PII Anonymization (Privacy Protection)

Purpose: Automatically detect and anonymize personally identifiable information in user inputs, then restore it in LLM responses.

How it works: The gateway analyzes incoming messages for PII using the Presidio framework, replaces detected entities with anonymized tokens, sends the anonymized content to the LLM, and then deanonymizes the response before returning it to the user. This ensures PII never reaches the LLM provider.

Anonymization Operators

Redact (default)

Removes PII entirely, replacing it with entity type labels.

Example:

Original: "My email is john@example.com"
Anonymized: "My email is [EMAIL_ADDRESS]"

Mask

Partially masks PII while preserving format.

Configuration:

  • chars_to_mask: Number of characters to mask
  • from_end: Mask from the end (default: false, masks from the start)
  • masking_char: Character to use for masking (default: *)

Example:

Original: "My SSN is 123-45-6789"
Anonymized: "My SSN is ***-**-6789"
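
A simplified sketch of format-preserving masking like the SSN example above. Presidio's real mask operator is driven by chars_to_mask, from_end, and masking_char; this illustrative helper just masks all digits except the last few while keeping separators:

```python
def mask_preserving_format(value: str, keep_last: int = 4, masking_char: str = "*") -> str:
    """Mask every digit except the last `keep_last`, leaving separators intact."""
    total_digits = sum(c.isdigit() for c in value)
    out, seen = [], 0
    for c in value:
        if c.isdigit():
            seen += 1
            # keep only the trailing `keep_last` digits visible
            out.append(c if seen > total_digits - keep_last else masking_char)
        else:
            out.append(c)
    return "".join(out)

print(mask_preserving_format("123-45-6789"))  # ***-**-6789
```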

Replace

Replaces PII with a synthetic value.

Configuration:

  • new_value: Replacement string

Example:

Original: "Call me at 555-1234"
Anonymized: "Call me at [PHONE]"

Encrypt

Encrypts PII with AES encryption, enabling exact restoration.

Requirements:

  • Must set the PII_ENCRYPTION_KEY environment variable in dapi-core

Example:

Original: "My name is Alice"
Anonymized: "My name is [[RPL#v1#aGF3dGhvcm5l...]]"
LLM sees: "My name is [[RPL#v1#aGF3dGhvcm5l...]]"
Response: "Hello [[RPL#v1#aGF3dGhvcm5l...]], nice to meet you!"
Deanonymized: "Hello Alice, nice to meet you!"
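
The round trip can be illustrated with a toy token vault. The real gateway uses AES encryption inside the [[RPL#v1#...]] token; this sketch substitutes a plain lookup table purely to show the anonymize/deanonymize flow:

```python
import itertools

_counter = itertools.count()
_vault: dict[str, str] = {}  # token -> original value (AES ciphertext in the real system)

def anonymize(text: str, entity: str) -> str:
    """Replace a detected entity with a reversible token."""
    token = f"[[RPL#v1#tok{next(_counter)}]]"
    _vault[token] = entity
    return text.replace(entity, token)

def deanonymize(text: str) -> str:
    """Restore original entities in the LLM response."""
    for token, original in _vault.items():
        text = text.replace(token, original)
    return text

prompt = anonymize("My name is Alice", "Alice")   # LLM only ever sees the token
response = f"Hello {prompt.split('is ')[1]}, nice to meet you!"
print(deanonymize(response))  # Hello Alice, nice to meet you!
```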

Detected PII Entities

The system can detect and anonymize:

  • Person - Names of individuals
  • Email Address - Email addresses
  • Phone Number - Phone numbers in various formats
  • Credit Card - Credit card numbers
  • IP Address - IPv4 and IPv6 addresses
  • IBAN Code - Bank account numbers
  • Location - Physical addresses, cities, countries
  • DateTime - Dates and times that could identify individuals
  • US SSN - US Social Security Numbers
  • URL - URLs that may contain PII

Configuration Options

  • active: Enable/disable PII protection
  • score_threshold: Confidence threshold for detection (0.0-1.0)
  • entities: List of entity types to detect (if not specified, detects all)
  • operator: Anonymization method (redact, mask, replace, encrypt)
  • operator_config: Method-specific configuration (for mask/replace operators)

Privacy Flow Example

Input (from user):

"My email is alice@company.com and my phone is +1-555-0100. What should I do?"

Anonymized (sent to LLM):

"My email is [[RPL#v1#Xk9s...]] and my phone is [[RPL#v1#mQ31...]]. What should I do?"

LLM Response:

"I'll send the information to [[RPL#v1#Xk9s...]]. You can also be reached at [[RPL#v1#mQ31...]]."

Deanonymized (returned to user):

"I'll send the information to alice@company.com. You can also be reached at +1-555-0100."

Use Cases

  • GDPR/CCPA compliance
  • Protecting customer data from third-party LLM providers
  • Healthcare applications (HIPAA compliance)
  • Financial services
  • HR and recruitment systems

Performance Overhead

~800ms - PII detection and anonymization requires two API calls to the privacy service (analyze + anonymize). The overhead scales with text length and the number of detected entities. Deanonymization of responses adds negligible additional latency.


4. System Prompt Checking

Purpose: Validate and sanitize system prompts before they're sent to LLMs.

How it works: Checks system prompts for potential security issues or policy violations. The system analyzes the prompt content to identify problematic patterns, malicious instructions, or content that violates safety policies.

Use Cases

  • Application-defined system prompts - Validating instructions defined by your application
  • User-customizable prompts - Checking prompts that users can modify or configure
  • MCP server tool descriptions - Protecting against malicious tool descriptions from Model Context Protocol (MCP) servers, as these descriptions are injected into the system prompt
  • Third-party integrations - Validating prompts from external services or plugins
  • Multi-tenant environments - Ensuring tenant-specific system prompts comply with policies

Performance Overhead

~1 second (first request only) - System prompt checking adds approximately 1 second of overhead, but this occurs only once per unique system prompt. The validation result is cached, so subsequent requests using the same system prompt experience almost no execution overhead during the conversation. This makes the mechanism highly efficient for typical usage patterns where system prompts remain constant across multiple interactions.


Combining Security Mechanisms

All security mechanisms can be used together for defense-in-depth:

Example Configuration:

firewall_config = {
    "spotlighting": {
        "active": True,
        "method": "delimit",
        "delimiter": "$$$"
    },
    "llm_judge": {
        "active": True,
        "judge_level": "strict",
        "judge_output": True,
        "blocked_categories": ["Jailbreak", "PII", "Violent Content"]
    },
    "privacy": {
        "active": True,
        "operator": "encrypt",
        "score_threshold": 0.5
    }
}

Request Flow:

  1. PII Anonymization - User PII is encrypted
  2. Content Safety - Guard evaluates anonymized content
  3. Spotlighting - Content is delimited for injection defense
  4. LLM Processing - Request sent to provider
  5. Output Safety - Guard evaluates LLM response (if enabled)
  6. Deanonymization - PII is restored in response
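
The pre-LLM part of this flow is a sequential pipeline, which can be sketched like so. All function bodies are placeholders standing in for the gateway's internals:

```python
def anonymize_pii(text: str) -> str:
    return text  # placeholder: encrypt detected PII entities

def judge_input(text: str) -> str:
    return text  # placeholder: raise/block if the guard rates it unsafe

def spotlight(text: str) -> str:
    return f"$$${text}$$$"  # delimit method with the default delimiter

def secure_request(user_text: str) -> str:
    """Apply the mechanisms in the documented order before the LLM call."""
    for step in (anonymize_pii, judge_input, spotlight):
        user_text = step(user_text)
    return user_text

print(secure_request("Tell me about security"))  # $$$Tell me about security$$$
```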


Performance & Overhead

Each security mechanism adds processing time to requests. Understanding these overheads helps you balance security requirements with latency constraints.

Latency by Mechanism

Mechanism                Approximate Overhead       Notes
Spotlighting             < 10ms                     Lightweight local transformation with minimal impact
PII Anonymization        ~800ms                     Two API calls (analyze + anonymize); scales with text length; largest per-request contributor
LLM-as-a-Judge           ~170ms                     Requires an additional LLM inference call per evaluation
System Prompt Checking   ~1s (first request only)   Result is cached; negligible overhead for subsequent requests

Factors Affecting Performance

Actual overhead varies based on:

  • Content length - Longer texts require more processing time (especially for PII detection)
  • Number of active mechanisms - Mechanisms run sequentially, so overhead is cumulative
  • Configuration complexity - More PII entity types or stricter guard policies increase processing time
  • Network latency - Calls to dapi-core service are subject to network conditions
  • Guard model selection - Different LLM guard models have varying inference speeds

Example Combined Overhead

Configuration: All mechanisms enabled (PII + Guard + Spotlighting)

Expected latency:

  • PII Anonymization: ~100-800ms (depending on the number of PII entities)
  • LLM-as-a-Judge (input): ~170ms
  • Spotlighting: < 10ms
  • Total overhead: ~1s (before the actual LLM call)

If judge_output is enabled, add another ~170ms to evaluate the LLM response.
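
The combined figure can be tallied with a small helper. The numbers are the documented approximations from this page, not measurements:

```python
# Documented worst-case approximations, in milliseconds.
OVERHEAD_MS = {"pii": 800, "judge_input": 170, "spotlighting": 10}

def total_overhead(judge_output: bool = False) -> int:
    """Sum the per-mechanism overheads for an all-mechanisms configuration."""
    total = sum(OVERHEAD_MS.values())
    if judge_output:
        total += 170  # second guard call on the LLM response
    return total

print(total_overhead())      # 980
print(total_overhead(True))  # 1150
```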

Optimization Tips

  • Disable unused mechanisms - Only enable what you need
  • Skip output judging - Set judge_output: false if input filtering is sufficient
  • Adjust PII threshold - Higher score_threshold reduces false positives and processing time
  • Selective categories - Block only high-risk guard categories instead of all
  • Use delimit method - Fastest spotlighting option for most use cases

Configuration via Dashboard

All security mechanisms can be configured through the Defense API Dashboard:

  1. Navigate to Firewalls section
  2. Select or create a firewall
  3. Enable desired security mechanisms under Protection settings
  4. Configure mechanism-specific options
  5. Save and test with generated cURL commands

For detailed configuration instructions, see the Firewall Configuration Guide.


Best Practices

Spotlighting

  • Use delimit for general-purpose protection with minimal overhead
  • Use encode when processing content from untrusted sources
  • Use mark for maximum detectability in specialized scenarios

LLM-as-a-Judge

  • Start with medium level and specific categories for most applications
  • Use strict level for high-risk or customer-facing deployments
  • Enable judge_output to validate LLM responses in addition to inputs
  • Regularly review blocked requests to tune category selections

PII Anonymization

  • Use encrypt operator when exact restoration is needed
  • Use redact for maximum privacy (one-way anonymization)
  • Set appropriate score_threshold (0.5-0.7 recommended) to balance precision/recall
  • Test with sample data to verify entity detection accuracy

General

  • Enable only the mechanisms you need to minimize latency
  • Monitor analytics to understand performance impact
  • Test configurations in development before production deployment
  • Combine mechanisms thoughtfully based on your threat model

Need Help?