Moderators
Security reference

Output Moderator

1. Introduction

The JazzmineOutputModerator is the final safety checkpoint in the jazzmine-security package. While the Input Moderator protects the agent from malicious users, the Output Moderator protects the user (and the system's reputation) from the agent. It inspects the natural-language responses generated by the Large Language Model (LLM) before they are rendered to the end user.

By default, it uses the nourmedini1/jazzmine-response-validator-v2 Hugging Face model, which is fine-tuned to detect hallucinations, policy violations, toxicity, and unsafe advice specifically within AI-generated text.

2. Behavior and Context

Operating right before the final response is delivered, the Output Moderator functions as a high-speed, offline validation layer. Like its input counterpart, it inherits from BaseModerator to leverage advanced asynchronous batching and GPU hardware resilience.

Key behaviors:

  • Token-Aware Overlapping Chunking: Because LLMs can generate massive responses (e.g., long articles or code explanations), the text is automatically tokenized and sliced into overlapping chunks (512 tokens per chunk with a 50-token overlap). This ensures harmful content buried deep within a lengthy response is still isolated and detected.
  • Strict Pessimistic Aggregation: If any single chunk of an LLM's generated response evaluates as unsafe (LABEL_1), the entire response is flagged.
  • Proportional Confidence Scoring: To provide nuanced decision-making capabilities, the final confidence score for a flagged response is calculated by averaging the confidence of the unsafe chunks and multiplying it by the ratio of unsafe chunks to total chunks.
  • Seamless Batching Integration: Because it shares the BaseModerator backend, it natively supports queuing. If multiple agents are generating responses simultaneously, it micro-batches their outputs for parallel GPU inference, ensuring output validation doesn't become a bottleneck.
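
The pessimistic aggregation and proportional scoring described above can be expressed as a small worked example (a sketch of the documented formula, not the library's internal code):

```python
def proportional_confidence(chunk_results):
    """Aggregate per-chunk (label, confidence) pairs: any unsafe chunk
    flags the whole response, and the final score is
    mean(unsafe confidences) * (unsafe chunks / total chunks)."""
    unsafe = [conf for label, conf in chunk_results if label == "LABEL_1"]
    if not unsafe:
        return "LABEL_0", 0.0
    toxic_ratio = len(unsafe) / len(chunk_results)
    return "LABEL_1", (sum(unsafe) / len(unsafe)) * toxic_ratio

# One unsafe chunk (confidence 0.9) out of four: 0.9 * 0.25 = 0.225
label, score = proportional_confidence(
    [("LABEL_0", 0.99), ("LABEL_1", 0.9), ("LABEL_0", 0.98), ("LABEL_0", 0.97)]
)
print(label, round(score, 3))  # LABEL_1 0.225
```

A single brief unsafe chunk in a long response therefore yields a low proportional score, while a uniformly unsafe response scores close to the raw model confidence.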

3. Purpose

  • Egress Filtering: Catches cases where an LLM is successfully jailbroken and attempts to output harmful instructions, hate speech, or explicit content.
  • Brand Protection: Ensures that the conversational AI adheres to strict corporate communication policies and does not emit inappropriate language.
  • Fail-Safe Mechanism: Acts as a deterministic guardrail against the inherent non-determinism of generative AI models.
  • Auditability: Logs exact metrics (confidence, chunk ratios, processing latency) for every AI-generated response, fulfilling enterprise compliance requirements.

4. High-Level API & Examples

Example 1: Basic Synchronous Execution

Typically used within a synchronous agent script or a testing pipeline.

```python
from jazzmine.security.output_moderator import JazzmineOutputModerator
from jazzmine.logging import get_logger

logger = get_logger("security_logger")
# Initializes the response validator model on GPU (if available)
output_mod = JazzmineOutputModerator(logger=logger)

# The response generated by your LLM
agent_response = "Here is the recipe for creating a dangerous explosive device..."

label, confidence = output_mod.classify(agent_response)

if label == "LABEL_1":
    print(f"Blocked: Agent attempted to send unsafe output (Confidence: {confidence:.4f})")
    # You might trigger a fallback response like: "I am unable to assist with that."
else:
    print("Response safe to send to user.")
```

Example 2: High-Throughput Async Execution in a Chat Server

Ideal for asynchronous applications where multiple agents are resolving tasks concurrently.

```python
import asyncio
from jazzmine.security.output_moderator import JazzmineOutputModerator

async def stream_responses_to_users():
    mod = JazzmineOutputModerator()

    # Start the async queue worker for high-throughput batching
    await mod.start_batch_worker(max_batch_size=16, max_wait_ms=50)

    # Simulated outputs from 3 different concurrent LLM generations
    agent_outputs = [
        "The weather in Paris is currently 22 degrees Celsius.",
        "You are an idiot for asking me that question.",
        "To configure the database, update the jazzmine.yaml file."
    ]

    # Queue requests; the BaseModerator will merge them into a batch
    tasks = [mod.classify_async(response) for response in agent_outputs]
    results = await asyncio.gather(*tasks)

    for response, (label, conf) in zip(agent_outputs, results):
        if label == "LABEL_1":
            print(f"[REDACTED - Unsafe output detected: {conf:.2f}]")
        else:
            print(f"[SENT] {response}")

    await mod.stop_batch_worker()

asyncio.run(stream_responses_to_users())
```

5. Detailed Class Functionality

JazzmineOutputModerator [Main Class]

Inherits from BaseModerator.

__init__(model_path_or_name: str = "nourmedini1/jazzmine-response-validator-v2", logger: Optional[BaseLogger] = None)

  • Configuration: Defines chunk_size = 512 and overlap = 50.
  • Context: Sets internal logger context variables (model_path, compute_device) specifically tracking the output validation model.
  • Execution: Automatically triggers the _load_model() sequence to cache the model in memory.

_load_model() [Internal]

Mechanics:

  • Instantiates the AutoTokenizer and AutoModelForSequenceClassification from Hugging Face.
  • Detects CUDA availability and maps the model to cuda:0 if present.
  • Wraps the model into a standard pipeline("text-classification", truncation=False) to allow the manual chunking logic to handle sequence lengths.
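
The loading sequence above follows the standard Transformers pattern. As an illustrative sketch only (not the package's actual source; requires torch and transformers to be installed):

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          pipeline)

def load_validator(model_name="nourmedini1/jazzmine-response-validator-v2"):
    # Sketch of the documented _load_model() steps.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    device = 0 if torch.cuda.is_available() else -1  # cuda:0 if present
    # truncation=False so the moderator's own chunking controls sequence length
    return pipeline("text-classification", model=model, tokenizer=tokenizer,
                    device=device, truncation=False)
```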

_chunk_text(text: str) -> List[str] [Internal]

Mechanics: Splits long agent responses into manageable token windows. Operates identically to the input moderator: it iterates over the token sequence with a sliding window (stride = chunk_size - overlap = 462 tokens) and decodes each window back into a discrete text chunk.
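
The sliding-window logic can be sketched over raw token ids, assuming the documented chunk_size and overlap (the real method operates on tokenizer output, not plain lists):

```python
def chunk_token_ids(token_ids, chunk_size=512, overlap=50):
    """Sliding-window chunking: windows of chunk_size tokens, advancing
    by stride = chunk_size - overlap (462) each step, so consecutive
    chunks share `overlap` tokens."""
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(token_ids), 1), stride):
        chunk = token_ids[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(token_ids):
            break
    return chunks

# 1000 tokens -> windows starting at offsets 0, 462, 924
print([len(c) for c in chunk_token_ids(list(range(1000)))])  # [512, 512, 76]
```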


classify(text: str) -> Tuple[str, float]

  • Parameters: text (str) - The LLM-generated string to evaluate.
  • Returns: (label, confidence) where label is "LABEL_0" (Safe) or "LABEL_1" (Unsafe).
  • How it works:
      1. Chunks the input text.
      2. Runs each chunk through the Hugging Face pipeline.
      3. Aggregation: if any chunk is LABEL_1, the final label is LABEL_1.
      4. Scoring: the final confidence of a blocked response is the average confidence of the unsafe chunks multiplied by the toxic_ratio (number of unsafe chunks / total chunks).

classify_batch(texts: List[str], batch_size: int = 32) -> List[Tuple[str, float]]

  • Parameters:
      • texts (List[str]): An array of LLM responses to validate simultaneously.
      • batch_size (int): Tensor batch size limit for the GPU.
  • How it works:
      1. Flat-maps all texts into a unified array of overlapping chunks, tracking each chunk's origin index via chunk_to_text_map.
      2. Executes batched inference via the parent class's hardware-resilient _run_model_batch().
      3. Reassembles the per-chunk scores into text-level evaluations, applying the proportional confidence math.
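
The flat-map and reassembly bookkeeping can be sketched as follows; `chunker` and `run_model_batch` are stand-ins for the real _chunk_text() and _run_model_batch(), not the library's API:

```python
def batch_classify(texts, chunker, run_model_batch):
    """Sketch of the documented flat-map / reassemble flow.
    `chunker` splits one text into chunks; `run_model_batch` returns one
    (label, confidence) pair per chunk."""
    all_chunks, chunk_to_text_map = [], []
    for idx, text in enumerate(texts):
        for chunk in chunker(text):
            all_chunks.append(chunk)
            chunk_to_text_map.append(idx)  # remember which text owns this chunk

    chunk_results = run_model_batch(all_chunks)

    # Regroup per-chunk results by owning text.
    per_text = [[] for _ in texts]
    for owner, result in zip(chunk_to_text_map, chunk_results):
        per_text[owner].append(result)

    # Apply the pessimistic aggregation and proportional scoring per text.
    finals = []
    for results in per_text:
        unsafe = [conf for label, conf in results if label == "LABEL_1"]
        if unsafe:
            mean_conf = sum(unsafe) / len(unsafe)
            toxic_ratio = len(unsafe) / len(results)
            finals.append(("LABEL_1", mean_conf * toxic_ratio))
        else:
            finals.append(("LABEL_0", 0.0))
    return finals
```

With a whitespace chunker and a fake model that flags the token "bad", `batch_classify(["good good", "bad good"], ...)` returns a safe result for the first text and a flagged result with proportional confidence for the second.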

6. Error Handling

Because this component runs in the critical path right before user delivery, it relies on strict typed exceptions (jazzmine.security.errors) to fail safely:

  • Initialization Errors (TokenizerLoadError, ModelLoadError, PipelineLoadError): Raised if the specific response-validator model is unreachable or incompatible with the environment.
  • ModelNotLoadedError: Raised if classify operations are called but the internal pipeline is invalid.
  • ModeratorError: Catches all inference-time anomalies, such as:
      • Tokenizer chunking failures.
      • Unexpected pipeline output structures.
      • GPU OOM (Out-of-Memory) conditions that persist after BaseModerator's exponential backoff retries.

Recommendation: In agent implementations, if a ModeratorError occurs during egress validation, the safest fallback is to discard the original LLM response and present the user with a generic system error (e.g., "An error occurred while validating the response.").
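
A fail-safe egress pattern along these lines might look as follows (a self-contained sketch: ModeratorError stands in for jazzmine.security.errors.ModeratorError, and the fallback strings are illustrative):

```python
class ModeratorError(Exception):
    """Stand-in for jazzmine.security.errors.ModeratorError (sketch only)."""

FALLBACK = "An error occurred while validating the response."

def deliver(response, classify):
    """Fail-safe egress: on any validation error, drop the LLM response
    and return a generic message instead of the unvalidated text."""
    try:
        label, confidence = classify(response)
    except ModeratorError:
        return FALLBACK  # never leak an unvalidated response
    if label == "LABEL_1":
        return "I am unable to assist with that."
    return response

def broken_classify(_):
    raise ModeratorError("GPU OOM after retries")

print(deliver("Some generated answer.", broken_classify))
# -> An error occurred while validating the response.
```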


7. Remarks

Egress Latency Considerations

The Output Moderator introduces a slight latency penalty directly before the user receives their response. However, because it runs locally (via Hugging Face Transformers) rather than hitting an external API, the latency is typically minimal (tens of milliseconds on a GPU). The async batching ensures this latency remains low even under heavy concurrent loads.

Telemetry & Logging

Using the BaseLogger, the Output Moderator automatically records detailed egress metrics. Log entries specify:

  • is_unsafe: A direct boolean indicator of whether the output was flagged.
  • num_chunks & toxic_chunks: Allow administrators to see whether the LLM emitted a brief toxic snippet or an entirely unsafe essay.
  • duration_ms & throughput: Vital metrics for monitoring the performance impact of the output validation layer on the overall agent architecture.
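
A log entry assembled from these metrics might look like the following (the field names come from this reference; the exact layout and values are an assumption for illustration):

```python
# Illustrative egress log entry using the documented metric names.
log_entry = {
    "model_path": "nourmedini1/jazzmine-response-validator-v2",
    "compute_device": "cuda:0",
    "is_unsafe": True,
    "num_chunks": 4,
    "toxic_chunks": 1,
    "confidence": 0.225,
    "duration_ms": 38.5,
    "throughput": 4 / 0.0385,  # chunks per second
}
```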