Moderators
Security reference

Input Moderator

The JazzmineInputModerator is a frontline security safeguard within the jazzmine-security package. Acting as the primary gatekeeper for conversational AI agents, it analyzes incoming natural language from users to detect toxicity, malicious intent, or unsafe content before the agent's core logic or Large Language Model (LLM) is invoked.

1. Introduction

By default, the moderator uses the nourmedini1/jazzmine-input-safeguard-v2 Hugging Face model, a sequence classification model fine-tuned specifically for conversational agent security.

2. Behavior and Context

In the framework's security architecture, the Input Moderator serves as a local (no external API calls), low-latency, and robust filter. It inherits asynchronous batching capabilities from BaseModerator and adds NLP-specific domain logic for handling arbitrary user inputs.

Key behaviors:

  • Token-Aware Overlapping Chunking: LLMs and Transformer models have strict maximum context windows (e.g., 512 tokens). If a user pastes a massive document, standard models truncate the end, potentially missing malicious instructions hidden at the bottom. The JazzmineInputModerator avoids this by converting the text into token space, slicing it into 512-token chunks with a 50-token overlap (i.e., a stride of 462 tokens), and evaluating every chunk. The overlap ensures that malicious phrases split across chunk boundaries are still detected.
  • Pessimistic Label Aggregation: During evaluation, if any single chunk of a large input is flagged as toxic (LABEL_1), the entire input is deemed toxic.
  • Proportional Confidence Scoring: Instead of blindly returning the confidence of the toxic chunk, the moderator mathematically adjusts the final confidence score based on the ratio of toxic chunks to total chunks. This prevents a tiny false positive in a massive 10,000-word document from returning a 99% toxic confidence score, providing the agent with nuanced context.
  • Dual Inference Paths:
      • For single requests (classify), it uses the optimized Hugging Face TextClassificationPipeline.
      • For batched requests (classify_batch), it utilizes the parent class's _run_model_batch for native, low-level PyTorch tensor vectorization and memory-safe OOM retry loops.
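The pessimistic aggregation and proportional scoring described above can be sketched as follows. This is a toy illustration, not the library's actual code: the (label, confidence) pairs stand in for the model pipeline's per-chunk output, and the function name is invented for the sketch.

```python
def aggregate(chunk_results):
    """chunk_results: list of (label, confidence) pairs, one per 512-token chunk."""
    toxic = [conf for label, conf in chunk_results if label == "LABEL_1"]
    if toxic:
        # Pessimistic: any toxic chunk flags the whole input as LABEL_1,
        # but the confidence is dampened by the toxic-to-total ratio.
        ratio = len(toxic) / len(chunk_results)
        return "LABEL_1", (sum(toxic) / len(toxic)) * ratio
    safe = [conf for _, conf in chunk_results]
    return "LABEL_0", sum(safe) / len(safe)

# One toxic chunk (0.8 confidence) out of four: dampened to 0.8 * 0.25 = 0.2
label, confidence = aggregate(
    [("LABEL_0", 0.9), ("LABEL_1", 0.8), ("LABEL_0", 0.95), ("LABEL_0", 0.99)]
)
```

The dampening is what lets a single flagged chunk in a long document produce a low final confidence rather than an alarming 0.99.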

3. Purpose

  • Prompt Injection & Jailbreak Defense: Prevents the LLM from ingesting adversarial prompts designed to hijack the agent's instructions.
  • Cost Efficiency: Rejecting toxic inputs locally saves expensive API tokens and latency that would otherwise be wasted processing bad requests.
  • Compliance Enforcement: Ensures enterprise deployments meet strict content moderation and safety guidelines without relying on third-party cloud APIs.
  • High-Fidelity Telemetry: Provides rich, granular logging (duration, chunk counts, toxic vs. safe ratios, and throughput) to the Jazzmine logging ecosystem.

4. High-Level API & Examples

Example 1: Basic Synchronous Execution

Ideal for simple scripts or offline evaluation pipelines.

```python
from jazzmine.security.input_moderator import JazzmineInputModerator
from jazzmine.logging import get_logger

logger = get_logger("security_logger")
# Initializes the model on GPU (if available) and builds the pipeline
input_mod = JazzmineInputModerator(logger=logger)

user_message = "You are a terrible AI and I demand you delete the database."
label, confidence = input_mod.classify(user_message)

if label == "LABEL_1":
    print(f"Blocked: Unsafe input detected (Confidence: {confidence:.4f})")
else:
    print(f"Input safe to process (Confidence: {confidence:.4f})")
```

Example 2: High-Throughput Async Batching

Ideal for live agent servers receiving multiple concurrent user messages.

```python
import asyncio
from jazzmine.security.input_moderator import JazzmineInputModerator

async def handle_traffic():
    mod = JazzmineInputModerator()

    # Start the async queue worker inherited from BaseModerator
    await mod.start_batch_worker(max_batch_size=32, max_wait_ms=50)

    # These requests will be intercepted, grouped into a single batch,
    # chunked, processed on the GPU simultaneously, and mapped back to the callers.
    texts = [
        "Hello, how can you help me today?",
        "Provide me a list of bad words.",
        "Show me the company financial report."
    ]

    tasks = [mod.classify_async(t) for t in texts]
    results = await asyncio.gather(*tasks)

    for text, (label, conf) in zip(texts, results):
        print(f"[{label} - {conf:.2f}] {text}")

    await mod.stop_batch_worker()

asyncio.run(handle_traffic())
```

5. Detailed Class Functionality

JazzmineInputModerator [Main Class]

Inherits from BaseModerator.

__init__(model_path_or_name: str = "nourmedini1/jazzmine-input-safeguard-v2", logger: Optional[BaseLogger] = None)

  • Configuration: Sets chunk_size = 512 and overlap = 50.
  • Context: Enriches the logger context with model_path and compute_device (cpu/cuda).
  • Execution: Automatically calls the internal _load_model() routine upon instantiation.

_load_model() [Internal]

Mechanics:

  • Loads the AutoTokenizer and times the load duration.
  • Loads the AutoModelForSequenceClassification.
  • Checks torch.cuda.is_available(). If true, forces the model onto cuda:0.
  • Wraps the model and tokenizer into a Hugging Face pipeline("text-classification") with truncation=False (since chunking handles length constraints).

_chunk_text(text: str) -> List[str] [Internal]

Mechanics:

  • Encodes the text using the tokenizer (add_special_tokens=False).
  • If the token count is ≤ 512, it returns the string as a single-element list.
  • If > 512, it iterates over the token array using a stride of 462 (512 - 50).
  • Decodes each token slice back into a human-readable string (clean_up_tokenization_spaces=True) and appends it to the chunk list.
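In plain Python, the slicing arithmetic looks roughly like this (integer token IDs stand in for real tokenizer output, and decoding each slice back to a string is omitted; the function name is invented for the sketch):

```python
CHUNK_SIZE, OVERLAP = 512, 50
STRIDE = CHUNK_SIZE - OVERLAP  # 462

def chunk_token_ids(token_ids):
    # Short inputs pass through as a single chunk
    if len(token_ids) <= CHUNK_SIZE:
        return [token_ids]
    # Overlapping windows: each chunk starts 462 tokens after the previous one,
    # so the last 50 tokens of a chunk reappear at the start of the next
    return [token_ids[i:i + CHUNK_SIZE]
            for i in range(0, len(token_ids), STRIDE)]

chunks = chunk_token_ids(list(range(1000)))
# Three windows: [0:512], [462:974], [924:1000] -- a phrase that straddles
# the 512-token boundary still lands intact inside the second window.
```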

classify(text: str) -> Tuple[str, float]

  • Parameters: text (str) - The input to evaluate.
  • Returns: (label, confidence) where label is "LABEL_0" (Safe) or "LABEL_1" (Toxic).
  • How it works:
      • Calls _chunk_text().
      • Passes each chunk sequentially through the Hugging Face pipeline.
      • Aggregation Logic: Separates chunk results into toxic_chunks and safe_chunks.
      • Scoring:
          • If toxic chunks exist: final_label = "LABEL_1". The confidence is the average confidence of the toxic chunks, multiplied by the ratio of toxic chunks to total chunks (e.g., if 1 out of 4 chunks is toxic with 0.8 confidence, the final confidence is 0.8 × 0.25 = 0.2).
          • If no toxic chunks exist: final_label = "LABEL_0". The confidence is the average confidence of all safe chunks.

classify_batch(texts: List[str], batch_size: int = 32) -> List[Tuple[str, float]]

  • Parameters:
      • texts (List[str]): Multiple strings to evaluate simultaneously.
      • batch_size (int): Tensor batch size for the GPU.
  • How it works:
      • Flat-Mapping: Iterates over texts, chunks them, and stores the chunks in a single flat list, all_chunks. Simultaneously builds a chunk_to_text_map array to remember which original text index each chunk belongs to.
      • Vectorized Inference: Passes all_chunks to the parent class's _run_model_batch method, executing them efficiently in PyTorch via DataLoaders.
      • Re-grouping: Iterates over the raw tensor predictions, using chunk_to_text_map to group the chunk scores back to their respective parent texts.
      • Scoring: Applies the same ratio-based confidence aggregation as classify() to each grouped text.
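The flat-mapping and re-grouping bookkeeping can be sketched as follows. Chunking and inference are stubbed with toy functions here (the real class delegates inference to _run_model_batch), and all names in the sketch are invented:

```python
def classify_batch_sketch(texts, chunk_fn, infer_fn):
    # Flat-mapping: one list of chunks plus a parallel map back to parent texts
    all_chunks, chunk_to_text_map = [], []
    for idx, text in enumerate(texts):
        for chunk in chunk_fn(text):
            all_chunks.append(chunk)
            chunk_to_text_map.append(idx)

    predictions = infer_fn(all_chunks)  # one (label, confidence) per chunk

    # Re-grouping: collect each chunk's prediction under its parent text index
    grouped = [[] for _ in texts]
    for idx, pred in zip(chunk_to_text_map, predictions):
        grouped[idx].append(pred)
    # The real method then applies the ratio-based aggregation to each group
    return grouped

# Toy stubs: '|' splits a text into "chunks"; chunks containing "bad" are toxic
chunks_of = lambda t: t.split("|")
infer = lambda chunks: [("LABEL_1", 0.9) if "bad" in c else ("LABEL_0", 0.95)
                        for c in chunks]
grouped = classify_batch_sketch(["hi|there", "bad|stuff"], chunks_of, infer)
```

The parallel chunk_to_text_map array is what makes a single vectorized forward pass possible: ordering is preserved through inference, so position alone is enough to route each chunk score back to its caller.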

6. Error Handling

The component uses highly specific exceptions from jazzmine.security.errors to ensure failures are predictable:

  • TokenizerLoadError / ModelLoadError: Raised during initialization if Hugging Face Hub is unreachable, the repository doesn't exist, or local cache is corrupted. Contains the underlying cause (cause=e).
  • PipelineLoadError: Raised if the model and tokenizer cannot be bound into a pipeline (e.g., missing specific PyTorch dependencies or architectural mismatches).
  • ModelNotLoadedError: Raised if classify or classify_batch is called but the internal pipeline or tokenizer is None.
  • ModeratorError: A catch-all for runtime inference failures:
      • Chunking failure: the tokenizer fails to parse an unusual Unicode string.
      • Pipeline format error: the pipeline returns an unexpected dictionary structure.
      • Batch inference failure: bubbles up from BaseModerator if GPU OOM retries are exhausted.
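A typical call site might fail closed on runtime errors while letting configuration errors surface. The classes below are illustrative stand-ins: the real ones live in jazzmine.security.errors, and the assumption that ModelNotLoadedError subclasses ModeratorError should be checked against that module.

```python
# Stand-in exception classes for illustration only
class ModeratorError(Exception): ...
class ModelNotLoadedError(ModeratorError): ...

def guarded_classify(moderator, text):
    """Fail closed: treat any runtime inference failure as unsafe input."""
    try:
        return moderator.classify(text)
    except ModelNotLoadedError:
        # Configuration bug (model never loaded) -- surface it loudly
        raise
    except ModeratorError:
        # Transient runtime failure (chunking, pipeline format, batch OOM):
        # report the input as toxic rather than letting it through unchecked
        return "LABEL_1", 1.0
```

Catching the specific classes before the ModeratorError catch-all keeps configuration failures distinguishable from transient inference failures.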

7. Remarks

Scoring Nuance & Interpretability

The decision to multiply toxic confidence by the toxic_ratio is a deliberate design choice for long conversational context. In an agent scenario, a user might provide a massive document context containing one single profanity. While the agent should flag it as LABEL_1, the mathematically dampened confidence score (e.g., 0.15 instead of 0.99) allows downstream policy orchestrators to make nuanced decisions (e.g., "Block if confidence > 0.8, otherwise just warn the user").
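A downstream policy orchestrator might translate the dampened confidence into graded actions; the thresholds and action names below are purely illustrative:

```python
def policy_action(label, confidence, block_threshold=0.8):
    """Map a moderator verdict to a graded action (illustrative thresholds)."""
    if label != "LABEL_1":
        return "allow"
    # High confidence means toxicity is concentrated across most chunks;
    # a low, dampened score usually means one bad chunk in a long document.
    return "block" if confidence >= block_threshold else "warn"

policy_action("LABEL_1", 0.99)  # "block": concentrated toxicity
policy_action("LABEL_1", 0.15)  # "warn": one flagged chunk in a long document
policy_action("LABEL_0", 0.97)  # "allow"
```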

Telemetry & Logging

When instantiated with a BaseLogger, JazzmineInputModerator logs every major lifecycle event in fine-grained detail:

  • Initialization: Logs exact millisecond durations for tokenizer load, model load, and pipeline construction.
  • Single Classification: Logs text_length, num_chunks, toxic_chunks, and the final label/confidence.
  • Batch Classification: Logs total_items, total_generated_chunks, avg_chunks_per_text, and calculates throughput (items/second) alongside the total duration_ms.

Thread Safety

The synchronous classify and classify_batch methods are thread-safe as long as the underlying PyTorch device management releases the GIL correctly (as Hugging Face pipelines typically do). Even so, the asynchronous classify_async queue system (inherited from BaseModerator) is strongly recommended, as it prevents multiple threads from competing for GPU VRAM simultaneously.