Output Sanitizer | jazzmine-security

Security System: Output Sanitizer

1. Introduction

The Output Sanitizer provides specialized classes that neutralize JavaScript in PDFs, prevent CSV formula injection, and strip Cross-Site Scripting (XSS) vectors from HTML, so that artifacts delivered to the user's browser or local machine do not carry executable payloads.

2. Behavior and Context

The sanitizers operate as a data-transformation layer and are typically invoked when an agent exports data to a file or renders rich text. Every concrete sanitizer inherits from BaseSanitizer, which supplies the shared batching, concurrency, and streaming methods.

Key behaviors:

Format-Specific Mitigation: Applies targeted sanitization strategies (e.g., stripping /JS annotations from PDFs, prepending ' to executable CSV cells, and using Bleach for HTML).
Concurrency Support: Built-in methods for synchronous, asynchronous, multithreaded, and multiprocess batching. This matters because file parsing (such as PDF sanitization) can be heavily CPU-bound.
Resilient Stream Processing: The sanitize_stream method yields sanitized items iteratively in any of the execution modes, so unbounded streams of data can be sanitized without loading the whole stream into memory.
Fail-Safe Batching: If a single item in a batch fails sanitization, the concurrent worker returns the Exception object for that specific index rather than crashing the entire batch queue.

3. Purpose

XSS Prevention: Strips malicious <script> tags and risky attributes from HTML responses.
CSV Injection (Macro) Mitigation: Protects users from downloading AI-generated spreadsheets that could execute malicious formulas in Excel or Google Sheets.
Document Safety: Neutralizes embedded macros, JavaScript, and auto-launch actions within PDFs.
High-Throughput Processing: Ensures that applications serving thousands of concurrent users can sanitize files and data streams without bottlenecking the main application thread.

4. High-Level API & Examples

Example 1: Basic HTML Sanitization

Typically used before rendering an agent's markdown/HTML response in a web dashboard.

python

from jazzmine.security.output_sanitizer import JazzmineHTMLSanitizer
from jazzmine.logging import get_logger

logger = get_logger("security_logger")
html_sanitizer = JazzmineHTMLSanitizer(logger=logger)

dirty_html = "<p>Agent generated this.</p><script>alert('hacked!');</script>"
clean_html = html_sanitizer.sanitize(html_content=dirty_html)

print(clean_html)
# Output: <p>Agent generated this.</p>&lt;script&gt;alert('hacked!');&lt;/script&gt;

Example 2: CSV Sanitization

Ideal for handling large datasets generated by a data-analyst agent.

python

from jazzmine.security.output_sanitizer import JazzmineCSVSanitizer

csv_sanitizer = JazzmineCSVSanitizer()

dirty_csv_rows = [
    ["Name", "Role", "Salary"],
    ["Alice", "Admin", "=1+1"],      # Potential CSV Injection
    ["Bob", "User", "@SUM(A1:A2)"]   # Potential CSV Injection
]

# A single sanitize() call escapes every cell in the table.
# To sanitize many tables concurrently, pass them to sanitize_batch() instead.
safe_csv = csv_sanitizer.sanitize(input_data=dirty_csv_rows)

print(safe_csv)
# Output: 
# Name,Role,Salary
# Alice,Admin,'=1+1
# Bob,User,'@SUM(A1:A2)

Example 3: PDF Sanitization

Used when an agent retrieves or compiles a PDF report for the user.

python

from jazzmine.security.output_sanitizer import JazzminePDFSanitizer

pdf_sanitizer = JazzminePDFSanitizer()

with open("agent_report_dirty.pdf", "rb") as f:
    # sanitize() is documented to accept bytes or io.BytesIO, so read the file first
    clean_pdf_bytes = pdf_sanitizer.sanitize(file_input=f.read())

with open("agent_report_safe.pdf", "wb") as f:
    f.write(clean_pdf_bytes)

5. Detailed Class Functionality

BaseSanitizer [Abstract Base Class]

The parent class providing the concurrency infrastructure. All child classes must implement the sanitize method.

sanitize_batch(items: List[Any], max_workers: Optional[int] = None, **kwargs) -> List[Any]

Parameters: items - A list of items to sanitize.
How it works: Uses ThreadPoolExecutor to map the sanitize function across the list. Catches errors and places the Exception object in the corresponding index of the return list.
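
The behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the library's source: batch_with_captured_errors and toy_sanitize are hypothetical names, and the real method may differ in how it forwards kwargs or logs failures.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Optional


def batch_with_captured_errors(
    sanitize: Callable[[Any], Any],
    items: List[Any],
    max_workers: Optional[int] = None,
) -> List[Any]:
    """Apply `sanitize` to every item; a failure becomes an Exception in that slot."""

    def safe_call(item: Any) -> Any:
        try:
            return sanitize(item)
        except Exception as exc:  # capture the error instead of propagating it
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so result[i] belongs to items[i].
        return list(pool.map(safe_call, items))


def toy_sanitize(value: Any) -> str:
    # Hypothetical stand-in sanitizer: rejects non-strings, trims whitespace.
    if not isinstance(value, str):
        raise TypeError(f"expected str, got {type(value).__name__}")
    return value.strip()


results = batch_with_captured_errors(toy_sanitize, ["  a ", 42, "b"])
```

The second slot holds a TypeError instance while the first and third hold sanitized strings, which is the per-index contract described above.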

sanitize_batch_async(items: List[Any], max_workers: Optional[int] = None, **kwargs) -> List[Any]

How it works: Native asyncio implementation of sanitize_batch. Wraps blocking sanitization calls in loop.run_in_executor.
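
A rough sketch of the run_in_executor pattern follows, assuming the default thread pool of the event loop. batch_async and toy_sanitize are illustrative names only, and using return_exceptions=True to keep failures in place is an assumption about how the per-index error contract could be met.

```python
import asyncio
from typing import Any, Callable, List


async def batch_async(sanitize: Callable[[Any], Any], items: List[Any]) -> List[Any]:
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor, so the
    # blocking sanitize() calls do not stall the event loop.
    futures = [loop.run_in_executor(None, sanitize, item) for item in items]
    # return_exceptions=True leaves each failure as an Exception object
    # at the position of the item that caused it.
    return await asyncio.gather(*futures, return_exceptions=True)


def toy_sanitize(text: str) -> str:
    # Hypothetical blocking sanitizer used only for this sketch.
    if "\x00" in text:
        raise ValueError("NUL byte in input")
    return text.replace("<", "&lt;").replace(">", "&gt;")


async_results = asyncio.run(batch_async(toy_sanitize, ["<b>", "bad\x00", "ok"]))
```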

sanitize_batch_process(items: List[Any], max_workers: Optional[int] = None, **kwargs) -> List[Any]

How it works: Uses ProcessPoolExecutor for true parallel execution, bypassing the Global Interpreter Lock (GIL). Best used for heavy operations like bulk PDF sanitization.

sanitize_stream(source: Iterable[Any], mode: str = "thread", chunk_size: int = 64, **kwargs) -> Generator

Parameters:
source: An iterable of items.
mode: "sync", "async", "thread", or "process".
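chunk_size: Number of items pulled from source and processed together.
How it works: A streaming interface that yields sanitized results back to the caller as they become ready. The source is consumed in chunks of chunk_size, so only one chunk needs to be held in memory at a time.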
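
One way to picture the chunked streaming described above, for the "thread" mode only, is the generator below. It is a sketch under assumptions: stream_in_chunks is a hypothetical helper, and the real method's handling of the other modes is not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import count, islice
from typing import Any, Callable, Iterable, Iterator


def stream_in_chunks(
    sanitize: Callable[[Any], Any],
    source: Iterable[Any],
    chunk_size: int = 64,
) -> Iterator[Any]:
    def safe_call(item: Any) -> Any:
        try:
            return sanitize(item)
        except Exception as exc:  # keep the stream alive on a bad item
            return exc

    iterator = iter(source)
    with ThreadPoolExecutor() as pool:
        while True:
            # Pull at most chunk_size items; nothing beyond that is read yet.
            chunk = list(islice(iterator, chunk_size))
            if not chunk:
                return
            # Results come back in input order, one chunk at a time.
            yield from pool.map(safe_call, chunk)


# An endless source: only the items actually requested are ever produced.
first_five = list(
    islice(stream_in_chunks(lambda n: f"item-{n}", count(), chunk_size=2), 5)
)
```

Because the generator reads the source lazily, memory use is bounded by chunk_size rather than by the length of the stream.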

JazzminePDFSanitizer

Inherits from BaseSanitizer.

sanitize(file_input: Union[bytes, io.BytesIO], chunked_pages: Optional[int] = None, **kwargs) -> bytes

Parameters:
file_input: The raw bytes or byte stream of the PDF.
chunked_pages: Number of pages to process at once (helps manage memory on very large PDFs).
Returns: Sanitized PDF as a bytes object.
How it works:
Reads the PDF using pypdf.
Iterates through the pages (and root dictionary) looking for DANGEROUS_KEYS (/JS, /JavaScript, /Launch, /EmbeddedFiles, etc.).
Deletes dangerous dictionary keys and purges malicious Action objects (/A) inside Annotations (/Annots).
Writes the sanitized output to a new byte stream.
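
The key-removal step can be illustrated on a simplified model in which PDF dictionaries are plain Python dicts and arrays are lists. This is only a sketch: it does not use pypdf, it ignores indirect references and cycles that real PDFs contain, it removes every /A entry from annotations rather than only malicious ones, and the DANGEROUS_KEYS set here contains just the four keys named above.

```python
from typing import Any

# Only the keys named in the text; the library's own set may be longer.
DANGEROUS_KEYS = {"/JS", "/JavaScript", "/Launch", "/EmbeddedFiles"}


def scrub(node: Any) -> Any:
    """Return a copy of a nested dict/list tree without dangerous entries."""
    if isinstance(node, dict):
        cleaned = {}
        for key, value in node.items():
            if key in DANGEROUS_KEYS:
                continue  # drop the key and everything beneath it
            if key == "/Annots" and isinstance(value, list):
                # Strip the action entry (/A) from each annotation dict.
                value = [
                    {k: v for k, v in annot.items() if k != "/A"}
                    if isinstance(annot, dict)
                    else annot
                    for annot in value
                ]
            cleaned[key] = scrub(value)
        return cleaned
    if isinstance(node, list):
        return [scrub(child) for child in node]
    return node


document = {
    "/Root": {"/Names": {"/JavaScript": {"/JS": "app.alert(1)"}, "/Dests": {}}},
    "/Pages": [
        {"/Type": "/Page", "/Annots": [{"/Subtype": "/Link", "/A": {"/S": "/Launch"}}]}
    ],
}
clean_document = scrub(document)
```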

JazzmineCSVSanitizer

Inherits from BaseSanitizer.

sanitize(input_data: Union[str, List[List[Any]]], force_quote: bool = False, **kwargs) -> str

Parameters:
input_data: Either a raw CSV string or a nested list of rows/cells.
force_quote: If True, wraps all cells in quotes (csv.QUOTE_ALL).
Returns: A safe CSV-formatted string.
How it works:
Parses the input.
Iterates through every cell and applies _escape_cell().
_escape_cell(): Checks whether a string cell starts with one of =, +, -, @, %, tab (\t), or carriage return (\r). If it does, a single quote (') is prepended so that spreadsheet applications (like Excel) treat the cell as plain text rather than evaluating it as a formula.
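
A minimal sketch of this escaping rule, built on the standard csv module, is shown below. escape_cell and rows_to_safe_csv are hypothetical names, and leaving non-string cells (such as the integer -5) untouched is an assumption of this sketch, not documented library behavior.

```python
import csv
import io
from typing import Any, List

# Leading characters listed in the description above.
FORMULA_PREFIXES = ("=", "+", "-", "@", "%", "\t", "\r")


def escape_cell(value: Any) -> Any:
    # Only strings can carry a formula payload; other types pass through.
    if isinstance(value, str) and value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value


def rows_to_safe_csv(rows: List[List[Any]], force_quote: bool = False) -> str:
    buffer = io.StringIO()
    quoting = csv.QUOTE_ALL if force_quote else csv.QUOTE_MINIMAL
    writer = csv.writer(buffer, quoting=quoting, lineterminator="\n")
    for row in rows:
        writer.writerow([escape_cell(cell) for cell in row])
    return buffer.getvalue()


safe_csv = rows_to_safe_csv([["Name", "Salary"], ["Alice", "=1+1"], ["Bob", -5]])
```

Note that a negative number passed as the string "-5" would be prefixed too, which is the usual trade-off of prefix-based escaping.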

JazzmineHTMLSanitizer

Inherits from BaseSanitizer.

sanitize(html_content: str, allowed_tags: Optional[List[str]] = None, allowed_attributes: Optional[Dict] = None, strip_comments: bool = True, **kwargs) -> str

Parameters:
html_content: The raw HTML string.
allowed_tags: Defaults to a safe list (e.g., p, b, i, a, h1-h6, code, pre).
allowed_attributes: Defaults to safe attributes (e.g., href for a, src for img).
strip_comments: Removes HTML comments entirely.
Returns: Cleaned HTML string.
How it works: Uses the Mozilla-backed bleach library to parse the HTML tree and neutralize tags that are not on the allowlist (like <script>, <iframe>, <style>) and attributes (like onclick, onload), defusing XSS attacks. As the output in Example 1 shows, a disallowed <script> element is escaped into inert text rather than deleted. Raw URLs are also linkified safely.

sanitize_strict(text: str) -> str [Class Method]

How it works: A utility method that strips all HTML tags from a given string, returning purely plain text.
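
To illustrate the idea of reducing markup to plain text, here is a sketch built on the standard library's html.parser instead of bleach. strip_all_tags is a hypothetical helper; dropping the contents of script and style elements is a choice made in this sketch and may not match what sanitize_strict does.

```python
from html.parser import HTMLParser
from typing import List


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores the contents of <script>/<style>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self) -> str:
        return "".join(self._parts)


def strip_all_tags(markup: str) -> str:
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()  # flushes any buffered trailing text
    return parser.text()


plain = strip_all_tags("<p>Hello <b>world</b></p><script>alert(1)</script>")
```

The result is plain text with character references decoded, so it should be passed through html.escape before being placed back into an HTML page.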

6. Error Handling

Because this component processes arbitrary data formats, it raises typed exceptions defined in jazzmine.security.errors instead of failing silently:

PDFSanitizationError: Raised if the PDF is corrupted, password-protected (preventing parsing), or if memory limits are breached during the rewrite. The error carries the original exception as its cause and the input length in its context.
CSVSanitizationError: Raised if malformed lists or invalid string structures are passed.
HTMLSanitizationError: Raised if the bleach parsing engine encounters an unrecoverable failure.

Concurrency Error Handling: When using sanitize_batch or sanitize_stream, a failure on one item does not stop the batch. The Exception object is caught and placed in the result array at the original index. It is the responsibility of the calling application to check if elements in the returned array are instances of Exception and handle them accordingly.
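
On the calling side, the check described above might look like the following sketch. split_results is a hypothetical helper, and the batch_output list is made-up sample data standing in for the return value of sanitize_batch.

```python
from typing import Any, List, Tuple


def split_results(
    results: List[Any],
) -> Tuple[List[Tuple[int, Any]], List[Tuple[int, Exception]]]:
    """Separate sanitized values from captured errors, keeping original indexes."""
    ok: List[Tuple[int, Any]] = []
    failed: List[Tuple[int, Exception]] = []
    for index, result in enumerate(results):
        if isinstance(result, Exception):
            failed.append((index, result))
        else:
            ok.append((index, result))
    return ok, failed


# Sample data shaped like a batch result with one failed item.
batch_output = ["<p>fine</p>", ValueError("corrupt input"), "<p>also fine</p>"]
ok, failed = split_results(batch_output)

for index, error in failed:
    # A real application would log the error or retry the item here.
    print(f"item {index} failed: {type(error).__name__}: {error}")
```

Keeping the original index next to each error lets the caller map a failure back to the input item that caused it.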

7. Remarks

Multiprocessing vs. Multithreading

For JazzminePDFSanitizer, using sanitize_batch_process (ProcessPoolExecutor) is highly recommended for batches of large PDFs. Because PDF parsing is heavily CPU-bound, multithreading will be bottlenecked by Python's Global Interpreter Lock (GIL). For JazzmineHTMLSanitizer and JazzmineCSVSanitizer, standard threading or async is usually sufficient.

Telemetry & Logging

When instantiated with a BaseLogger, sanitizers automatically record valuable context:

input_len and output_length metrics.
input_type tracking (to differentiate between raw strings and list objects).
Error tracing with error_type and error_message kwargs, providing immediate visibility into corrupted files or failing validations without cluttering stack traces.