LLM Providers: LocalLLM

The LocalLLM provider is designed for ultra-low latency, offline execution, and maximum data privacy. Unlike cloud-based providers that communicate over a network, this class executes a local binary (such as llama-cli from the llama.cpp ecosystem or a custom compiled model) directly on the host machine. It treats the model as a command-line tool, providing a "Zero-Network" intelligence solution.

1. Behavior and Context

In the jazzmine architecture, LocalLLM operates as a System Process Wrapper.

  • Prompt Flattening: Since local CLI binaries typically expect a single string rather than a structured JSON list of messages, this class flattens the conversation history into a formatted text block (e.g., User: ... Assistant: ...).
  • Subprocess Execution: It utilizes Python's subprocess for synchronous calls and asyncio.create_subprocess_exec for asynchronous calls.
  • IO Redirection: It redirects the binary's stdout to capture the generated text and monitors stderr for runtime errors.
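The wrapper pattern above can be sketched in a few lines. This is an illustrative sketch, not jazzmine's actual implementation: the helper name `run_local_binary` and the `-p` flag convention are assumptions (the flag follows the llama.cpp convention discussed in the Remarks section).

```python
import subprocess

def run_local_binary(binary_path: str, prompt: str, timeout: float = 120.0) -> str:
    """Hypothetical sketch: run a local model binary and capture its output."""
    result = subprocess.run(
        [binary_path, "-p", prompt],  # llama.cpp-style prompt flag (assumed)
        capture_output=True,          # redirect stdout and stderr into pipes
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        # stderr carries runtime errors (e.g., missing model file, CUDA failure)
        raise RuntimeError(f"Binary failed: {result.stderr.strip()}")
    return result.stdout
```

Because `capture_output=True` redirects both pipes, the generated text arrives on stdout while diagnostics stay separated on stderr, matching the IO-redirection behavior described above.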

2. Purpose

  • Privacy & Compliance: Ideal for sensitive environments where data must never leave the local machine or traverse the public internet.
  • Edge Computing: Deploying agents on hardware with restricted or unreliable connectivity (e.g., local servers or factory-floor workstations).
  • Cost Elimination: Running intelligence on local hardware to bypass per-token API costs.
  • Development Speed: Rapid local testing without relying on third-party API availability or quotas.

3. High-Level API Examples

Example: Running with llama.cpp

python
import asyncio

from jazzmine.core.llm import LocalLLM

# Point to your compiled binary and model weights
llm = LocalLLM(
    model="llama-3-8b",
    binary_path="/usr/local/bin/llama-cli",
    temperature=0.1
)

async def main():
    # `messages` is your conversation history (a list of MessagePart objects)
    response = await llm.agenerate(messages)
    print(response.text)

asyncio.run(main())

4. Detailed Functionality

__init__(binary_path, **kwargs)

Functionality: Initializes the provider with the location of the model binary.

Parameters:

  • binary_path (str): The absolute path to the executable file on the filesystem.
  • **kwargs: Standard parameters like model, temperature, and max_tokens.

_format_prompt(messages) [Internal]

Functionality: Converts a list of MessagePart objects into a single cohesive string prompt.

How it works: It iterates through the messages, capitalizes the role name (e.g., "User: ", "System: "), and appends the content. It concludes the string with "Assistant: " to prompt the model for a response.
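The flattening logic can be sketched as follows. This is a simplified stand-in: the real `_format_prompt` operates on `MessagePart` objects, whereas this sketch uses plain dicts for illustration.

```python
def format_prompt(messages: list[dict]) -> str:
    """Hypothetical sketch of the flattening described above."""
    # Capitalize each role and prefix its content, one turn per line
    lines = [f"{m['role'].capitalize()}: {m['content']}" for m in messages]
    # Trailing "Assistant: " cues the model to produce the next turn
    lines.append("Assistant: ")
    return "\n".join(lines)
```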


generate / agenerate

Functionality: Spawns the local process and waits for it to finish generating the full response.

How it works:

  • Formats the input messages using _format_prompt.
  • Executes the binary with the -p (prompt) flag.
  • Captures the output from stdout.
  • Returns an LLMResponse where token usage is estimated using the framework's heuristic (since local binaries rarely return standardized usage metadata).
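The asynchronous path can be approximated with `asyncio.create_subprocess_exec`, as mentioned in section 1. The helper name `agenerate_text` is hypothetical, and the `-p` flag assumes the llama.cpp convention; the real `agenerate` additionally wraps the result in an `LLMResponse` with estimated usage.

```python
import asyncio

async def agenerate_text(binary_path: str, prompt: str) -> str:
    """Hypothetical sketch: spawn the binary and await the full response."""
    proc = await asyncio.create_subprocess_exec(
        binary_path, "-p", prompt,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()  # wait for completion
    if proc.returncode != 0:
        raise RuntimeError(stderr.decode().strip())
    return stdout.decode()
```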

stream / astream

Functionality: Yields text line-by-line as the local binary prints to stdout.

How it works: It initializes the process and reads from the output pipe buffer. Because the binary is run with -u (unbuffered) or standard streaming flags, the agent receives partial tokens in near real-time.
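A minimal sketch of the streaming read loop, assuming the binary flushes lines as it generates. The generator name `astream_lines` is illustrative; the real `astream` yields framework-level chunks rather than raw strings.

```python
import asyncio
from typing import AsyncIterator

async def astream_lines(binary_path: str, prompt: str) -> AsyncIterator[str]:
    """Hypothetical sketch: yield output line-by-line as the binary prints it."""
    proc = await asyncio.create_subprocess_exec(
        binary_path, "-p", prompt,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    assert proc.stdout is not None
    # StreamReader supports async iteration, one line per chunk
    async for raw in proc.stdout:
        yield raw.decode()
    await proc.wait()  # reap the process once the pipe closes
```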


5. Error Handling

  • Non-Zero Exit Codes: If the local binary crashes (e.g., due to a CUDA error or a missing .gguf file), LocalLLM raises an LLMInternalError containing the full contents of the binary's stderr.
  • Binary Not Found: If the binary_path is incorrect or the file is not executable, the system raises a standard Python FileNotFoundError or PermissionError.
  • Empty Output: If the process completes successfully but produces no text, the provider logs a warning and returns an empty LLMResponse.
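The failure modes above can be sketched as follows. The helper name `safe_generate` is hypothetical, and `RuntimeError` stands in for jazzmine's `LLMInternalError`; `FileNotFoundError` and `PermissionError` propagate unchanged, as described.

```python
import subprocess

def safe_generate(binary_path: str, prompt: str) -> str:
    """Hypothetical sketch of the error paths listed above."""
    # FileNotFoundError / PermissionError propagate directly from subprocess
    result = subprocess.run(
        [binary_path, "-p", prompt], capture_output=True, text=True
    )
    if result.returncode != 0:
        # In jazzmine this would be raised as LLMInternalError with stderr attached
        raise RuntimeError(result.stderr.strip())
    if not result.stdout.strip():
        print("warning: binary exited cleanly but returned no text")
    return result.stdout
```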

6. Remarks

  • CLI Flag Assumptions: This class assumes the binary follows the llama.cpp convention where -p is the flag for the prompt. If your custom binary uses different flags (e.g., --input), you must wrap it in a small shell script or modify the generate logic.
  • Hardware Dependency: Performance is strictly bound to the host's CPU/GPU and RAM. If the model is too large for the system, generation may be extremely slow.
  • Resource Management: close() and aclose() are present for API compatibility but are no-ops, as each generation call creates and destroys its own subprocess.
  • Heuristic Usage: Token counts in LLMUsage are approximations calculated by character length, which is sufficient for simple context window management.
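A character-length estimate of the kind described above typically looks like this. The divisor of 4 (roughly four characters per token for English text) is a common rule of thumb and an assumption here, not jazzmine's exact formula.

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count from character length (~4 chars/token)."""
    return max(1, len(text) // 4)
```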