
Executor Harness

The Executor Harness is the runtime kernel that resides inside every jazzmine sandbox container. It is a self-contained Python script (executor_harness.py) responsible for managing the container's lifecycle, loading tools dynamically, and securely executing LLM-generated scripts. It is the final destination for tool-calling requests: the execution context in which generated Python code calls the loaded business-logic tools.

Tool System: Executor Harness

1. Behavior and Context

In the framework's architecture, the harness is the ENTRYPOINT of the Docker image.

  • Persistent Execution: Unlike a standard script that runs and exits, the harness is a long-lived process. It initializes once and then enters a loop, reading execution requests from stdin and streaming results to stdout.
  • Isolation: It runs under a restricted user (sandbox) with all Linux capabilities dropped and a read-only root filesystem.
  • Async-to-Sync Bridging: It manages a dedicated background thread and a private asyncio event loop. It automatically wraps async def tools so that LLM-generated scripts (which are written as standard synchronous Python) can call them without await syntax.
  • Event-Driven Communication: It communicates with the host via a newline-terminated JSON wire protocol, emitting granular events for logs, intermediate data, and final results.

2. Purpose

  • Secure Code Evaluation: Providing a restricted environment for executing untrusted code via Python's exec() and compile() built-ins.
  • Resource Monitoring: Enforcing internal wall-clock timeouts using Unix signals (SIGALRM) to ensure a runaway script cannot hang the container.
  • Dynamic Capability: Loading tools from a volume-mounted directory at runtime, allowing the same base image to support different sets of skills.
  • Handshake Orchestration: Emitting standard signals (like ready and script_done) to allow the host-side SandboxPool and ScriptExecutor to synchronize with the container's state.

3. High-Level API (Internal Script Scope)

The "API" of the harness is what is visible to the LLM-generated script body. The harness injects a specific set of helpers and pre-loaded tools into the global namespace.

Example: Script Body Logic

```python
# The LLM generates this body. The harness provides the tools and helpers.

# 1. Call a pre-loaded tool (Harness handles the async/sync bridging)
res = fetch_user_profile(user_id="user_99")

# 2. Logic check
if res.success:
    # 3. Emit an intermediate update
    emit_intermediate("profile_found", {"username": res.data["name"]})

    # 4. Finalize the task
    emit_result({"status": "completed", "points": res.data["loyalty_points"]})
else:
    # 5. Log a debug message and fail
    emit_log(f"Failed to find user: {res.message}", level="warning")
    emit_result({"error": "user_not_found"})
```

4. Detailed Functionality

Core Helpers (Callable by Scripts)

emit_result(data)

  • Functionality: Finalizes the current task and returns the final payload to the Agent.
  • Parameters: data (Any): A JSON-serializable object.
  • Note: Calling this finishes the logic portion of the script.

emit_intermediate(label, data)

  • Functionality: Streams partial results back to the Agent during execution.
  • Parameters: label (str), data (Any).
  • Use Case: Providing the Agent with data to "think" about before the final answer is ready.

emit_log(message, level="info")

  • Functionality: Sends a diagnostic message to the host logs without affecting the task result.
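
The three helpers can be pictured as thin wrappers around a single locked writer. The following is a minimal sketch, not the harness source: the `_emit` helper and the JSON field names (`data`, `label`, `level`, `message`) are assumptions chosen to match the event types described in this document.

```python
import json
import sys
import threading

# One lock shared by every writer so JSON lines never interleave.
_stdout_lock = threading.Lock()


def _emit(event: dict) -> None:
    """Write one event as a single newline-terminated JSON line (hypothetical helper)."""
    line = json.dumps(event, default=str)
    with _stdout_lock:
        sys.stdout.write(line + "\n")
        sys.stdout.flush()


def emit_result(data) -> None:
    # Field name "data" is an assumption.
    _emit({"type": "final_result", "data": data})


def emit_intermediate(label: str, data) -> None:
    # Field names "label" and "data" are assumptions.
    _emit({"type": "intermediate", "label": label, "data": data})


def emit_log(message: str, level: str = "info") -> None:
    # Field names "level" and "message" are assumptions.
    _emit({"type": "log", "level": level, "message": message})
```

Flushing after every line matters here: the host reads events as they arrive, so a buffered write could delay an intermediate update until the script ends.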

Internal Runtime Logic

_load_tools()

  • Functionality: Scans the /tools/ directory for .py files.
  • Mechanism: It reads each file, compiles it, and executes it within the _base_globals dictionary. This makes every function defined in those files available for the LLM to call.
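
The read, compile, and execute steps described above might look roughly like this. It is a sketch under stated assumptions: the function name `load_tools`, its parameters, and the sorted load order are illustrative and not taken from the harness.

```python
from pathlib import Path


def load_tools(tools_dir: str, base_globals: dict) -> list[str]:
    """Execute every .py file in tools_dir inside base_globals.

    Returns the file names that were loaded (illustrative; the real
    _load_tools may report results differently).
    """
    loaded = []
    for path in sorted(Path(tools_dir).glob("*.py")):
        source = path.read_text(encoding="utf-8")
        # Passing the real path as the filename keeps tracebacks readable.
        code = compile(source, str(path), "exec")
        # Top-level definitions in the file land in base_globals.
        exec(code, base_globals)
        loaded.append(path.name)
    return loaded
```

Because every file shares one globals dictionary, a function defined in a later file silently replaces an earlier one with the same name.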

_wrap_async_tools()

  • Functionality: Automates asyncio integration.
  • Mechanism: It inspects all loaded tool functions. If a function is an async def, it is wrapped in a synchronous closure that uses asyncio.run_coroutine_threadsafe to execute the task on the dedicated tool-loop thread.
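
A minimal sketch of this bridging pattern is shown below. The names `_tool_loop`, `wrap_async_tools`, and `_make_sync` are hypothetical; only `asyncio.run_coroutine_threadsafe` and the dedicated loop thread come from the description above.

```python
import asyncio
import functools
import inspect
import threading

# A private event loop running forever on a dedicated daemon thread.
_tool_loop = asyncio.new_event_loop()
threading.Thread(target=_tool_loop.run_forever, name="tool-loop", daemon=True).start()


def _make_sync(async_fn):
    """Return a synchronous function that runs async_fn on the tool loop."""
    @functools.wraps(async_fn)
    def sync_wrapper(*args, **kwargs):
        future = asyncio.run_coroutine_threadsafe(async_fn(*args, **kwargs), _tool_loop)
        # Block the calling (script) thread until the coroutine finishes.
        return future.result()
    return sync_wrapper


def wrap_async_tools(namespace: dict) -> None:
    """Replace every `async def` function in namespace with a sync wrapper."""
    for name, obj in list(namespace.items()):
        if inspect.iscoroutinefunction(obj):
            namespace[name] = _make_sync(obj)
```

With this in place a script can write `res = fetch_user_profile(user_id="user_99")` even if the tool is declared with `async def`; exceptions raised inside the coroutine propagate out of `future.result()` into the script.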

_run_script(script, execution_id, timeout, mode)

  • Functionality: The evaluation core.
  • Context Isolation: Creates a fresh globals dict for each run.
  • Timeout: Arms a wall-clock alarm with signal.alarm(timeout).
  • Execution: Runs the compiled script bytecode.
  • Error Capture: If an exception occurs, it captures the traceback with traceback.format_exc() and emits an error event.
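
The four steps can be sketched together as follows. This is an illustration, not the harness code: it returns an event dictionary instead of emitting it, omits the execution_id and mode parameters, and the names `run_script`, `ScriptTimeout`, and `_on_alarm` are assumptions. It relies on SIGALRM, so it only works on Unix and only in the main thread.

```python
import signal
import traceback


class ScriptTimeout(BaseException):
    """Derives from BaseException so a script's own `except Exception` cannot swallow it."""


def _on_alarm(signum, frame):
    raise ScriptTimeout()


def run_script(script: str, base_globals: dict, timeout: int) -> dict:
    # Fresh namespace per run: assignments in the script never touch base_globals.
    run_globals = dict(base_globals)
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout)  # wall-clock limit in whole seconds
    try:
        exec(compile(script, "<script>", "exec"), run_globals)
        return {"type": "ok"}  # placeholder; the harness emits events instead
    except ScriptTimeout:
        return {"type": "error", "message": f"Script timed out after {timeout}s"}
    except SystemExit as exc:
        # Message wording is an assumption.
        return {"type": "error", "message": f"Script called sys.exit({exc.code!r})"}
    except Exception as exc:
        return {"type": "error", "message": str(exc), "traceback": traceback.format_exc()}
    finally:
        signal.alarm(0)  # always disarm the timer
        signal.signal(signal.SIGALRM, previous)
```

Note that `dict(base_globals)` is a shallow copy: rebinding a name is isolated, but mutating a shared object (for example, appending to a list held in the base namespace) would still be visible to later runs.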

main()

  • Functionality: The request-response loop.
  • It emits {"type": "ready"} upon successful startup.
  • It reads JSON requests from sys.stdin.
  • It calls _check_secrets to verify that required environment variables are present before execution.
  • It always emits {"type": "script_done"} after a script finishes, regardless of whether it succeeded, failed, or timed out.
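
The loop's key guarantee, that script_done follows every request, can be expressed with a `finally` clause. The sketch below assumes a hypothetical `serve` function that takes the request handler and the streams as parameters so it can be exercised without a container; the request shape is also an assumption.

```python
import json
import sys


def serve(run, stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read one JSON request per line and call run(request, emit) for each.

    `run` stands in for the secret check plus script execution.
    """
    def emit(event: dict) -> None:
        stdout.write(json.dumps(event) + "\n")
        stdout.flush()

    emit({"type": "ready"})
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        try:
            request = json.loads(line)
            run(request, emit)
        except Exception as exc:
            emit({"type": "error", "message": str(exc)})
        finally:
            # Emitted on success, failure, and malformed input alike.
            emit({"type": "script_done"})
```

Putting script_done in `finally` means the host never has to guess whether a turn has ended, even when the request itself could not be parsed.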

5. The Wire Protocol (Stdout Events)

The harness communicates with the host via these JSON event types:

| Event Type | Description |
| --- | --- |
| ready | Sent once when the container is fully initialized and tools are loaded. |
| final_result | Sent when the script calls emit_result(). Contains the data payload. |
| intermediate | Sent when the script calls emit_intermediate(). |
| error | Sent if the script crashes or times out. Includes message and traceback. |
| log | Sent when the script calls emit_log(). |
| script_done | Crucial: sent after every execution turn. Signals the host to stop waiting. |
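
On the host side, consuming one execution turn amounts to reading lines until script_done arrives. The following sketch assumes a hypothetical `collect_turn` function and the payload field name `data`; the actual SandboxPool and ScriptExecutor may structure this differently.

```python
import json
from typing import Iterable


def collect_turn(lines: Iterable[str]) -> dict:
    """Group the events of one execution turn, stopping at script_done."""
    turn = {"result": None, "intermediate": [], "logs": [], "errors": []}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "script_done":
            break  # the turn is over; leave later lines unread
        if kind == "final_result":
            turn["result"] = event.get("data")
        elif kind == "intermediate":
            turn["intermediate"].append(event)
        elif kind == "log":
            turn["logs"].append(event)
        elif kind == "error":
            turn["errors"].append(event)
    return turn
```

Passing the container's stdout as `lines` lets the same iterator be reused for the next turn, since reading stops exactly at script_done.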

6. Error Handling

  • Timeouts: If the script exceeds its allotted time, the kernel delivers SIGALRM to the harness process. The harness catches the signal, resets the alarm, and returns a structured error: "Script timed out after Xs", where X is the timeout in seconds.
  • Security Violations: If the script attempts to call sys.exit(), the harness catches the SystemExit exception and emits a standard error event instead of allowing the process to die.
  • Secret Validation: The harness checks for missing environment variables before running the script. If a required secret (e.g., STRIPE_KEY) is missing, it aborts and returns an error message listing the missing keys.
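
The secret validation step can be sketched as a simple lookup against the environment. The function names, the `environ` parameter, and the exact error wording below are assumptions for illustration; only the behavior (abort and list the missing keys) comes from the description above.

```python
import os


def check_secrets(required, environ=None) -> list:
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def missing_secrets_error(missing) -> dict:
    """Build an error event naming every missing key (wording is an assumption)."""
    return {
        "type": "error",
        "message": "Missing required secrets: " + ", ".join(missing),
    }
```

Checking before execution keeps the failure cheap and explicit: the script never starts, and the host receives the full list of missing keys in one message rather than discovering them one at a time.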

7. Remarks

  • IO Thread Safety: The harness uses a threading.Lock (_stdout_lock) to ensure that JSON events from the main thread and the async-tool thread do not overlap or corrupt the stdout stream.
  • Synchronous LLM Scripts: The decision to wrap async tools as sync functions is intentional. It lets the Agent write standard Python logic without reasoning about asynchronous concepts, which is intended to improve code-generation success rates.
  • State Management: By using a dedicated globals dictionary per execution, the harness ensures that variables defined in a previous failed attempt do not pollute the namespace of a retry attempt.