1. Behavior and Context
In the jazzmine architecture, LocalLLM operates as a System Process Wrapper.
- Prompt Flattening: Since local CLI binaries typically expect a single string rather than a structured JSON list of messages, this class flattens the conversation history into a formatted text block (e.g., User: ... Assistant: ...).
- Subprocess Execution: It uses Python's subprocess module for synchronous calls and asyncio.create_subprocess_exec for asynchronous ones.
- IO Redirection: It redirects the binary's stdout to capture the generated text and monitors stderr for runtime errors.
2. Purpose
- Privacy & Compliance: Ideal for sensitive environments where data must never leave the local machine or traverse the public internet.
- Edge Computing: Deploying agents on hardware with restricted or unreliable connectivity (e.g., local servers or factory-floor workstations).
- Cost Elimination: Running intelligence on local hardware to bypass per-token API costs.
- Development Speed: Rapid local testing without relying on third-party API availability or quotas.
3. High-Level API Examples
Example: Running with llama.cpp
from jazzmine.core.llm import LocalLLM
# Point to your compiled binary and model weights
llm = LocalLLM(
    model="llama-3-8b",
    binary_path="/usr/local/bin/llama-cli",
    temperature=0.1
)
# Standard generation call (awaited inside an async context)
response = await llm.agenerate(messages)
print(response.text)
4. Detailed Functionality
__init__(binary_path, **kwargs)
Functionality: Initializes the provider with the location of the model binary.
Parameters:
- binary_path (str): The absolute path to the executable file on the filesystem.
- **kwargs: Standard parameters like model, temperature, and max_tokens.
_format_prompt(messages) [Internal]
Functionality: Converts a list of MessagePart objects into a single cohesive string prompt.
How it works: It iterates through the messages, capitalizes the role name (e.g., "User: ", "System: "), and appends the content. It concludes the string with "Assistant: " to prompt the model for a response.
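The flattening described above can be sketched as follows. The MessagePart class here is a hypothetical stand-in assumed to expose role and content attributes; the framework's actual type may differ:

```python
from dataclasses import dataclass

@dataclass
class MessagePart:
    # Hypothetical stand-in for the framework's message type
    role: str
    content: str

def format_prompt(messages):
    # Capitalize each role and append its content, one turn per line
    lines = [f"{m.role.capitalize()}: {m.content}" for m in messages]
    # A trailing "Assistant: " cues the model to continue the conversation
    lines.append("Assistant: ")
    return "\n".join(lines)
```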
generate / agenerate
Functionality: Spawns the local process and waits for it to finish generating the full response.
How it works:
- Formats the input messages using _format_prompt.
- Executes the binary with the -p (prompt) flag.
- Captures the output from stdout.
- Returns an LLMResponse where token usage is estimated using the framework's heuristic (since local binaries rarely return standardized usage metadata).
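A minimal sketch of the synchronous path, assuming the llama.cpp-style -p flag; the real class adds prompt formatting and response wrapping on top of this:

```python
import subprocess

def generate(binary_path, prompt, timeout=120):
    # Run the binary synchronously, passing the flattened prompt via -p
    result = subprocess.run(
        [binary_path, "-p", prompt],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        # Surface the binary's stderr so crashes are diagnosable
        raise RuntimeError(
            f"binary exited with {result.returncode}: {result.stderr}"
        )
    return result.stdout
```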
stream / astream
Functionality: Yields text line-by-line as the local binary prints to stdout.
How it works: It initializes the process and reads from the output pipe buffer. Because the binary is run with -u (unbuffered) or standard streaming flags, the agent receives partial tokens in near real-time.
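The asynchronous streaming path can be approximated with asyncio's subprocess API; this is a sketch of the reading loop, not the framework's exact implementation:

```python
import asyncio

async def stream(binary_path, prompt):
    # Spawn the binary and read stdout line-by-line as output is produced
    proc = await asyncio.create_subprocess_exec(
        binary_path, "-p", prompt,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    async for line in proc.stdout:
        yield line.decode()
    await proc.wait()
```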
5. Error Handling
- Non-Zero Exit Codes: If the local binary crashes (e.g., due to a CUDA error or a missing .gguf file), LocalLLM raises an LLMInternalError containing the full contents of the binary's stderr.
- Binary Not Found: If the binary_path is incorrect or the file is not executable, the system raises a standard Python FileNotFoundError or PermissionError.
- Empty Output: If the process completes successfully but returns no text, it emits a warning and returns an empty LLMResponse.
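The "Binary Not Found" case above comes straight from Python's standard error hierarchy, as this small (illustrative) preflight check shows:

```python
import subprocess

def check_binary(path):
    # Returns the error class name if the binary cannot be launched,
    # or None if it started successfully
    try:
        subprocess.run([path, "--version"], capture_output=True)
    except (FileNotFoundError, PermissionError) as exc:
        return type(exc).__name__
    return None
```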
6. Remarks
- CLI Flag Assumptions: This class assumes the binary follows the llama.cpp convention where -p is the flag for the prompt. If your custom binary uses different flags (e.g., --input), you must wrap it in a small shell script or modify the generate logic.
- Hardware Dependency: Performance is strictly bound to the host's CPU/GPU and RAM. If the model is too large for the system, generation may be extremely slow.
- Resource Management: close() and aclose() are present for API compatibility but are no-ops, as each generation call creates and destroys its own subprocess.
- Heuristic Usage: Token counts in LLMUsage are approximations calculated by character length, which is sufficient for simple context window management.
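A character-length heuristic like the one described can be sketched as below; the divisor of 4 (a common rule of thumb for English text) is an assumption, not necessarily the framework's exact constant:

```python
def estimate_tokens(text, chars_per_token=4):
    # Rough approximation: ~4 characters per token for English text
    return max(1, len(text) // chars_per_token)
```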