LLM Providers: LocalLLM

The LocalLLM provider is designed for ultra-low latency, offline execution, and maximum data privacy. Unlike cloud-based providers that communicate over a network, this class executes a local binary (such as llama-cli from the llama.cpp ecosystem or a custom compiled model) directly on the host machine. It treats the model as a command-line tool, providing a "Zero-Network" intelligence solution.

1. Behavior and Context

In the jazzmine architecture, LocalLLM operates as a System Process Wrapper.

  • Prompt Flattening: Since local CLI binaries typically expect a single string rather than a structured JSON list of messages, this class flattens the conversation history into a formatted text block (e.g., User: ... Assistant: ...).
  • Subprocess Execution: It utilizes Python's subprocess for synchronous calls and asyncio.create_subprocess_exec for asynchronous calls.
  • IO Redirection: It redirects the binary's stdout to capture the generated text and monitors stderr for runtime errors.
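The wrapper pattern above can be sketched in a few lines. This is an illustrative sketch, not jazzmine's actual implementation: the helper name `run_local_binary` and the `-p` flag convention are assumptions (the flag follows the llama.cpp convention discussed in the Remarks section).

```python
import subprocess

def run_local_binary(binary_path: str, prompt: str, timeout: float = 120.0) -> str:
    """Hypothetical sketch: run a local model binary and capture its output."""
    result = subprocess.run(
        [binary_path, "-p", prompt],  # llama.cpp-style prompt flag (assumed)
        capture_output=True,          # redirect stdout and stderr into pipes
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        # stderr carries runtime errors (e.g., missing model file, CUDA failure)
        raise RuntimeError(f"Binary failed: {result.stderr.strip()}")
    return result.stdout
```

Because `capture_output=True` redirects both pipes, the generated text arrives on stdout while diagnostics stay separated on stderr, matching the IO-redirection behavior described above.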

2. Purpose

  • Privacy & Compliance: Ideal for sensitive environments where data must never leave the local machine or traverse the public internet.
  • Edge Computing: Deploying agents on hardware with restricted or unreliable connectivity (e.g., local servers or factory-floor workstations).
  • Cost Elimination: Running intelligence on local hardware to bypass per-token API costs.
  • Development Speed: Rapid local testing without relying on third-party API availability or quotas.

3. High-Level API Examples

Example: Running with llama.cpp

python
import asyncio

from jazzmine.core.llm import LocalLLM

# Point to your compiled binary and model weights
llm = LocalLLM(
    model="llama-3-8b",
    binary_path="/usr/local/bin/llama-cli",
    temperature=0.1
)

async def main():
    # `messages` is your conversation history (a list of MessagePart objects)
    response = await llm.agenerate(messages)
    print(response.text)

asyncio.run(main())

4. Detailed Functionality

__init__(binary_path, **kwargs)

Functionality: Initializes the provider with the location of the model binary.

Parameters:

  • binary_path (str): The absolute path to the executable file on the filesystem.
  • **kwargs: Standard parameters like model, temperature, and max_tokens.

_format_prompt(messages) [Internal]

Functionality: Converts a list of MessagePart objects into a single cohesive string prompt.

How it works: It iterates through the messages, capitalizes the role name (e.g., "User: ", "System: "), and appends the content. It concludes the string with "Assistant: " to prompt the model for a response.
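The flattening logic can be sketched as follows. This is a simplified stand-in: the real `_format_prompt` operates on `MessagePart` objects, whereas this sketch uses plain dicts for illustration.

```python
def format_prompt(messages: list[dict]) -> str:
    """Hypothetical sketch of the flattening described above."""
    # Capitalize each role and prefix its content, one turn per line
    lines = [f"{m['role'].capitalize()}: {m['content']}" for m in messages]
    # Trailing "Assistant: " cues the model to produce the next turn
    lines.append("Assistant: ")
    return "\n".join(lines)
```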


generate / agenerate

Functionality: Spawns the local process and waits for it to finish generating the full response.

How it works:

  • Formats the input messages using _format_prompt.
  • Executes the binary with the -p (prompt) flag.
  • Captures the output from stdout.
  • Returns an LLMResponse where token usage is estimated using the framework's heuristic (since local binaries rarely return standardized usage metadata).
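The asynchronous path can be approximated with `asyncio.create_subprocess_exec`, as mentioned in section 1. The helper name `agenerate_text` is hypothetical, and the `-p` flag assumes the llama.cpp convention; the real `agenerate` additionally wraps the result in an `LLMResponse` with estimated usage.

```python
import asyncio

async def agenerate_text(binary_path: str, prompt: str) -> str:
    """Hypothetical sketch: spawn the binary and await the full response."""
    proc = await asyncio.create_subprocess_exec(
        binary_path, "-p", prompt,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()  # wait for completion
    if proc.returncode != 0:
        raise RuntimeError(stderr.decode().strip())
    return stdout.decode()
```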

stream / astream

Functionality: Yields text line-by-line as the local binary prints to stdout.

How it works: It initializes the process and reads from the output pipe buffer. Because the binary is run with -u (unbuffered) or standard streaming flags, the agent receives partial tokens in near real-time.
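A minimal sketch of the streaming read loop, assuming the binary flushes lines as it generates. The generator name `astream_lines` is illustrative; the real `astream` yields framework-level chunks rather than raw strings.

```python
import asyncio
from typing import AsyncIterator

async def astream_lines(binary_path: str, prompt: str) -> AsyncIterator[str]:
    """Hypothetical sketch: yield output line-by-line as the binary prints it."""
    proc = await asyncio.create_subprocess_exec(
        binary_path, "-p", prompt,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    assert proc.stdout is not None
    # StreamReader supports async iteration, one line per chunk
    async for raw in proc.stdout:
        yield raw.decode()
    await proc.wait()  # reap the process once the pipe closes
```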


5. Error Handling

  • Non-Zero Exit Codes: If the local binary crashes (e.g., due to a CUDA error or a missing .gguf file), LocalLLM raises an LLMInternalError containing the full contents of the binary's stderr.
  • Binary Not Found: If the binary_path is incorrect or the file is not executable, the system raises a standard Python FileNotFoundError or PermissionError.
  • Empty Output: If the process completes successfully but produces no text, the provider logs a warning and returns an empty LLMResponse.
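The failure modes above can be sketched as follows. The helper name `safe_generate` is hypothetical, and `RuntimeError` stands in for jazzmine's `LLMInternalError`; `FileNotFoundError` and `PermissionError` propagate unchanged, as described.

```python
import subprocess

def safe_generate(binary_path: str, prompt: str) -> str:
    """Hypothetical sketch of the error paths listed above."""
    # FileNotFoundError / PermissionError propagate directly from subprocess
    result = subprocess.run(
        [binary_path, "-p", prompt], capture_output=True, text=True
    )
    if result.returncode != 0:
        # In jazzmine this would be raised as LLMInternalError with stderr attached
        raise RuntimeError(result.stderr.strip())
    if not result.stdout.strip():
        print("warning: binary exited cleanly but returned no text")
    return result.stdout
```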

6. Remarks

  • CLI Flag Assumptions: This class assumes the binary follows the llama.cpp convention where -p is the flag for the prompt. If your custom binary uses different flags (e.g., --input), you must wrap it in a small shell script or modify the generate logic.
  • Hardware Dependency: Performance is strictly bound to the host's CPU/GPU and RAM. If the model is too large for the system, generation may be extremely slow.
  • Resource Management: close() and aclose() are present for API compatibility but are no-ops, as each generation call creates and destroys its own subprocess.
  • Heuristic Usage: Token counts in LLMUsage are approximations calculated by character length, which is sufficient for simple context window management.
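A character-length estimate of the kind described above typically looks like this. The divisor of 4 (roughly four characters per token for English text) is a common rule of thumb and an assumption here, not jazzmine's exact formula.

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count from character length (~4 chars/token)."""
    return max(1, len(text) // 4)
```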