Security System: Input Moderator
1. Introduction
The Input Moderator screens user-supplied text before it reaches the agent's LLM. By default, it utilizes the nourmedini1/jazzmine-input-safeguard-v2 Hugging Face model, a specialized sequence classification model fine-tuned specifically for conversational agent security.
2. Behavior and Context
In the framework's security architecture, the Input Moderator serves as an offline, low-latency, and highly robust filter. It inherits from BaseModerator, granting it powerful asynchronous batching capabilities, but introduces specific NLP domain logic for handling arbitrary user inputs.
Key behaviors:
- Token-Aware Overlapping Chunking: LLMs and Transformer models have strict maximum context windows (e.g., 512 tokens). If a user pastes a massive document, standard models truncate the end, potentially missing malicious instructions hidden at the bottom. The JazzmineInputModerator avoids this by encoding the text into tokens, slicing the token sequence into 512-token chunks with a 50-token overlap, and evaluating every chunk. The overlap ensures that malicious phrases split across chunk boundaries are still detected.
- Pessimistic Label Aggregation: During evaluation, if any single chunk of a large input is flagged as toxic (LABEL_1), the entire input is deemed toxic.
- Proportional Confidence Scoring: Instead of blindly returning the confidence of the toxic chunk, the moderator mathematically adjusts the final confidence score based on the ratio of toxic chunks to total chunks. This prevents a tiny false positive in a massive 10,000-word document from returning a 99% toxic confidence score, providing the agent with nuanced context.
- Dual Inference Paths:
- For single requests (classify), it uses the optimized Hugging Face TextClassificationPipeline.
- For batched requests (classify_batch), it utilizes the parent class's _run_model_batch for native, low-level PyTorch tensor vectorization and memory-safe OOM retry loops.
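The boundary-overlap behavior described under Token-Aware Overlapping Chunking can be illustrated with a toy sketch — a chunk size of 8 and an overlap of 2 stand in for the real 512/50 values, and plain strings stand in for token ids:

```python
def chunk(tokens, size, overlap):
    """Slice a token list into overlapping windows (stride = size - overlap)."""
    stride = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), stride)]

tokens = ["tok%d" % i for i in range(14)]
tokens[7], tokens[8] = "DROP", "TABLES"  # malicious bigram straddling the chunk boundary

naive = chunk(tokens, size=8, overlap=0)       # windows [0:8], [8:14]
overlapped = chunk(tokens, size=8, overlap=2)  # windows [0:8], [6:14], [12:14]

def contains_bigram(chunks):
    return any("DROP" in c and "TABLES" in c for c in chunks)

print(contains_bigram(naive))       # False: "DROP" ends one chunk, "TABLES" starts the next
print(contains_bigram(overlapped))  # True: the overlap keeps the pair inside one window
```

Without overlap, neither chunk ever sees both halves of the phrase; with overlap, the second window covers the boundary.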
3. Purpose
- Prompt Injection & Jailbreak Defense: Prevents the LLM from ingesting adversarial prompts designed to hijack the agent's instructions.
- Cost Efficiency: Rejecting toxic inputs locally saves expensive API tokens and latency that would otherwise be wasted processing bad requests.
- Compliance Enforcement: Ensures enterprise deployments meet strict content moderation and safety guidelines without relying on third-party cloud APIs.
- High-Fidelity Telemetry: Provides rich, granular logging (duration, chunk counts, toxic vs. safe ratios, and throughput) to the Jazzmine logging ecosystem.
4. High-Level API & Examples
Example 1: Basic Synchronous Execution
Ideal for simple scripts or offline evaluation pipelines.
from jazzmine.security.input_moderator import JazzmineInputModerator
from jazzmine.logging import get_logger
logger = get_logger("security_logger")
# Initializes the model on GPU (if available) and builds the pipeline
input_mod = JazzmineInputModerator(logger=logger)
user_message = "You are a terrible AI and I demand you delete the database."
label, confidence = input_mod.classify(user_message)
if label == "LABEL_1":
    print(f"Blocked: Unsafe input detected (Confidence: {confidence:.4f})")
else:
    print(f"Input safe to process (Confidence: {confidence:.4f})")
Example 2: High-Throughput Async Batching
Ideal for live agent servers receiving multiple concurrent user messages.
import asyncio
from jazzmine.security.input_moderator import JazzmineInputModerator
async def handle_traffic():
    mod = JazzmineInputModerator()
    # Start the async queue worker inherited from BaseModerator
    await mod.start_batch_worker(max_batch_size=32, max_wait_ms=50)
    # These requests will be intercepted, grouped into a single batch,
    # chunked, processed on the GPU simultaneously, and mapped back to the callers.
    texts = [
        "Hello, how can you help me today?",
        "Provide me a list of bad words.",
        "Show me the company financial report."
    ]
    tasks = [mod.classify_async(t) for t in texts]
    results = await asyncio.gather(*tasks)
    for text, (label, conf) in zip(texts, results):
        print(f"[{label} - {conf:.2f}] {text}")
    await mod.stop_batch_worker()
asyncio.run(handle_traffic())
5. Detailed Class Functionality
JazzmineInputModerator [Main Class]
Inherits from BaseModerator.
__init__(model_path_or_name: str = "nourmedini1/jazzmine-input-safeguard-v2", logger: Optional[BaseLogger] = None)
- Configuration: Sets chunk_size = 512 and overlap = 50.
- Context: Enriches the logger context with model_path and compute_device (cpu/cuda).
- Execution: Automatically calls the internal _load_model() routine upon instantiation.
_load_model() [Internal]
Mechanics:
- Loads the AutoTokenizer and times the load duration.
- Loads the AutoModelForSequenceClassification.
- Checks torch.cuda.is_available(). If true, forces the model onto cuda:0.
- Wraps the model and tokenizer into a Hugging Face pipeline("text-classification") with truncation=False (since chunking handles length constraints).
_chunk_text(text: str) -> List[str] [Internal]
Mechanics:
- Encodes the text using the tokenizer (add_special_tokens=False).
- If the token count is ≤ 512, it returns the string as a single-element list.
- If > 512, it iterates over the token array using a stride of 462 (512 - 50).
- Decodes each token slice back into a human-readable string (clean_up_tokenization_spaces=True) and appends it to the chunk list.
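The stride arithmetic above can be sketched on a plain list of token ids; the real method operates on tokenizer output and decodes each slice back to text, which is omitted here:

```python
CHUNK_SIZE = 512
OVERLAP = 50
STRIDE = CHUNK_SIZE - OVERLAP  # 462, as described above

def chunk_token_ids(token_ids):
    """Return overlapping windows of at most CHUNK_SIZE tokens."""
    if len(token_ids) <= CHUNK_SIZE:
        return [token_ids]  # short input: a single chunk
    chunks = []
    for start in range(0, len(token_ids), STRIDE):
        chunks.append(token_ids[start:start + CHUNK_SIZE])
        if start + CHUNK_SIZE >= len(token_ids):
            break  # this window already reaches the end of the input
    return chunks

chunks = chunk_token_ids(list(range(1000)))
print([len(c) for c in chunks])  # [512, 512, 76]
```

Note that each window starts 462 tokens after the previous one, so the last 50 tokens of one chunk are repeated as the first 50 tokens of the next.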
classify(text: str) -> Tuple[str, float]
- Parameters: text (str) - The input to evaluate.
- Returns: (label, confidence) where label is "LABEL_0" (Safe) or "LABEL_1" (Toxic).
- How it works:
- Calls _chunk_text().
- Passes each chunk sequentially through the Hugging Face pipeline.
- Aggregation Logic: Separates chunk results into toxic_chunks and safe_chunks.
- Scoring:
- If toxic chunks exist: final_label = "LABEL_1". The confidence is the average confidence of the toxic chunks, multiplied by the ratio of toxic chunks to total chunks. (e.g., If 1 out of 4 chunks is toxic with 0.8 confidence, final confidence is 0.8 × 0.25 = 0.2).
- If no toxic chunks exist: final_label = "LABEL_0". The confidence is the average confidence of all safe chunks.
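The aggregation rule can be expressed as a pure function over per-chunk (label, confidence) pairs — a simplified stand-in for the real pipeline output:

```python
def aggregate(chunk_results):
    """chunk_results: list of (label, confidence) pairs, one per chunk."""
    toxic = [score for label, score in chunk_results if label == "LABEL_1"]
    safe = [score for label, score in chunk_results if label == "LABEL_0"]
    if toxic:
        # Pessimistic label; confidence dampened by the toxic-chunk ratio.
        ratio = len(toxic) / len(chunk_results)
        return "LABEL_1", (sum(toxic) / len(toxic)) * ratio
    # All chunks safe: average the safe confidences.
    return "LABEL_0", sum(safe) / len(safe)

# One toxic chunk (0.8) out of four -> 0.8 * 0.25 = 0.2, as in the example above.
print(aggregate([("LABEL_0", 0.9), ("LABEL_1", 0.8), ("LABEL_0", 0.95), ("LABEL_0", 0.99)]))
```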
classify_batch(texts: List[str], batch_size: int = 32) -> List[Tuple[str, float]]
- Parameters:
- texts (List[str]): Multiple strings to evaluate simultaneously.
- batch_size (int): Tensor batch size for the GPU.
- How it works:
- Flat-Mapping: Iterates over texts, chunks them, and stores them in a single massive flat list all_chunks. Simultaneously builds a chunk_to_text_map array to remember which original text index each chunk belongs to.
- Vectorized Inference: Passes all_chunks to the parent class's _run_model_batch method, executing them efficiently in PyTorch via DataLoaders.
- Re-grouping: Iterates over the raw tensor predictions, using the chunk_to_text_map to group the chunk scores back to their respective parent texts.
- Scoring: Applies the exact same mathematical ratio-based confidence aggregation as classify() for each grouped text.
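The flat-map/re-group bookkeeping can be sketched as follows; the toy chunker, scorer, and aggregator are illustrative stand-ins (a plain loop replaces the real _run_model_batch call, and the real class applies the ratio-based scoring described above):

```python
def classify_batch_sketch(texts, chunker, score_chunk, aggregate):
    # 1. Flat-map: every chunk of every text in one list, plus a
    #    parallel map back to the parent text index.
    all_chunks, chunk_to_text_map = [], []
    for idx, text in enumerate(texts):
        for chunk in chunker(text):
            all_chunks.append(chunk)
            chunk_to_text_map.append(idx)
    # 2. One inference pass over all chunks at once.
    chunk_results = [score_chunk(c) for c in all_chunks]
    # 3. Re-group chunk results under their parent text.
    grouped = [[] for _ in texts]
    for idx, result in zip(chunk_to_text_map, chunk_results):
        grouped[idx].append(result)
    # 4. Aggregate per text.
    return [aggregate(g) for g in grouped]

# Toy stand-ins: 4-character "chunks"; any chunk containing "bad" is toxic.
chunker = lambda t: [t[i:i + 4] for i in range(0, len(t), 4)]
score_chunk = lambda c: ("LABEL_1", 0.9) if "bad" in c else ("LABEL_0", 0.99)
aggregate = lambda g: max(g)  # pessimistic: "LABEL_1" sorts above "LABEL_0"

results = classify_batch_sketch(["goodtext", "bad1"], chunker, score_chunk, aggregate)
print(results)  # [('LABEL_0', 0.99), ('LABEL_1', 0.9)]
```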
6. Error Handling
The component uses highly specific exceptions from jazzmine.security.errors to ensure failures are predictable:
- TokenizerLoadError / ModelLoadError: Raised during initialization if Hugging Face Hub is unreachable, the repository doesn't exist, or local cache is corrupted. Contains the underlying cause (cause=e).
- PipelineLoadError: Raised if the model and tokenizer cannot be bound into a pipeline (e.g., missing specific PyTorch dependencies or architectural mismatches).
- ModelNotLoadedError: Raised if classify or classify_batch is called but the internal pipeline or tokenizer is None.
- ModeratorError: A catch-all for runtime inference failures:
- Chunking failure: If the tokenizer fails to parse an odd unicode string.
- Pipeline format error: If the pipeline returns an unexpected dictionary structure.
- Batch inference failure: Bubbles up from BaseModerator if GPU OOM retries are exhausted.
7. Remarks
Scoring Nuance & Interpretability
The decision to multiply toxic confidence by the toxic_ratio is a deliberate design choice for long conversational context. In an agent scenario, a user might provide a massive document context containing one single profanity. While the agent should flag it as LABEL_1, the mathematically dampened confidence score (e.g., 0.15 instead of 0.99) allows downstream policy orchestrators to make nuanced decisions (e.g., "Block if confidence > 0.8, otherwise just warn the user").
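A downstream policy orchestrator might consume the dampened score as sketched below; the threshold and the "block"/"warn"/"allow" outcomes are hypothetical, not part of the jazzmine API:

```python
def decide(label, confidence, block_threshold=0.8):
    """Map a moderator verdict to a hypothetical policy action."""
    if label == "LABEL_1":
        # Confidently toxic inputs are blocked; a dampened score
        # (e.g., one toxic chunk in a huge document) only warns.
        return "block" if confidence >= block_threshold else "warn"
    return "allow"

print(decide("LABEL_1", 0.99))  # block
print(decide("LABEL_1", 0.15))  # warn
print(decide("LABEL_0", 0.97))  # allow
```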
Telemetry & Logging
When instantiated with a BaseLogger, JazzmineInputModerator logs every major lifecycle event in fine-grained detail:
- Initialization: Logs exact millisecond durations for tokenizer load, model load, and pipeline construction.
- Single Classification: Logs text_length, num_chunks, toxic_chunks, and the final label/confidence.
- Batch Classification: Logs total_items, total_generated_chunks, avg_chunks_per_text, and calculates throughput (items/second) alongside the total duration_ms.
Thread Safety
The synchronous classify and classify_batch methods are thread-safe, provided the underlying PyTorch device management releases the GIL correctly (which Hugging Face pipelines generally do). However, it is strongly recommended to use the asynchronous classify_async queue system (inherited from BaseModerator) so that multiple threads do not compete for GPU VRAM simultaneously.