1. Behavior and Context
In the jazzmine architecture, this class acts as the primary bridge to modern Large Language Models.
- Networking: It utilizes the httpx library to manage high-performance synchronous and asynchronous connection pools.
- Protocol: It communicates over HTTP/JSON with the /v1/chat/completions endpoint.
- Flexibility: Because many modern AI tools mimic OpenAI's API structure, this single class lets an agent switch from a cloud-hosted GPT-4 to a locally hosted Llama-3 simply by changing the base URL.
2. Purpose
- Standardization: Use the same interface for local development (Ollama) and production deployment (OpenAI).
- Performance: Leverage HTTP/2 and async I/O to handle real-time token streaming and concurrent message processing.
- Abstraction: Automatically handle JSON payload construction, headers, and response parsing, including token usage normalization.
3. High-Level API Examples
Example: Connecting to OpenAI

```python
from jazzmine.core.llm import OpenAICompatibleLLM

# Standard OpenAI setup
llm = OpenAICompatibleLLM(
    model="gpt-4o",
    api_key="sk-...",
    base_url="https://api.openai.com",
    temperature=0.0,
    timeout=30.0,
)
```

Example: Connecting to Ollama

```python
# Ollama serves an OpenAI-compatible API at port 11434
llm = OpenAICompatibleLLM(
    model="llama3.1",
    api_key="ollama",  # Ollama doesn't require a key, but a placeholder is needed
    base_url="http://localhost:11434",
    timeout=60.0,
)
```

4. Detailed Functionality
__init__(api_key, base_url, chat_endpoint, ...)
Functionality: Configures the communication parameters and initializes the underlying HTTP clients.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | Required | Secret key for the Bearer token authorization. |
| base_url | str | Required | The root URL of the API (e.g., https://api.openai.com). |
| chat_endpoint | str | "/v1/chat/completions" | The specific path for chat completions. |
| top_p | Optional[float] | None | Nucleus sampling parameter. |
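As a rough sketch of what the constructor assembles from these parameters (the helper functions below are illustrative, not jazzmine's actual internals), the base URL and endpoint path are joined and the Bearer authorization header is prepared:

```python
# Illustrative sketch only -- not the actual jazzmine implementation.
# Shows how base_url, chat_endpoint, and api_key might be combined.

def build_request_target(base_url: str, chat_endpoint: str = "/v1/chat/completions") -> str:
    """Join the root URL and endpoint path, tolerating a trailing slash."""
    return base_url.rstrip("/") + "/" + chat_endpoint.lstrip("/")

def build_headers(api_key: str) -> dict:
    """Standard headers for an OpenAI-compatible chat endpoint."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

This is also why the Remarks section warns about trailing slashes: the join tolerates `https://api.openai.com/` but not a base URL that already contains the endpoint path.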
generate / agenerate
Functionality: Sends a full conversation history to the model and waits for a complete response.
How it works:
- Assembles a JSON payload including model, messages, temperature, and max_tokens.
- Sends a POST request to the configured endpoint.
- Calculates latency and parses the JSON response using normalize_usage to ensure consistent token metrics.
stream / astream
Functionality: Maintains an open connection and yields tokens as they are generated by the model.
How it works:
- Enables "stream": true in the request payload.
- Iterates over the server-sent events (SSE).
- Filters out metadata lines and the [DONE] signal, yielding only the incremental text found in choices[0].delta.content.
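The SSE filtering described above can be sketched with a small generator; jazzmine's actual parser may differ in detail:

```python
import json

# Illustrative sketch of the stream/astream SSE filtering described above.

def iter_deltas(sse_lines):
    """Yield incremental text from 'data:' lines, skipping metadata and [DONE]."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore comments and empty keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel sent by OpenAI-compatible servers
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:  # role-only or empty deltas carry no text
            yield delta
```

Note that the first chunk of a stream typically carries only `{"role": "assistant"}` with no `content`, which is why the empty-delta check is needed.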
_handle_request_error(e) [Internal]
Functionality: Maps low-level httpx exceptions to jazzmine errors.
Mappings:
- httpx.TimeoutException → LLMTimeoutError
- httpx.RequestError → LLMConnectionError
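The mapping might look roughly like the following. To keep the sketch self-contained, the httpx exception classes are stood in by local placeholders; only the jazzmine-level error names come from the documentation above:

```python
# Self-contained sketch of _handle_request_error. The two transport classes
# below are local stand-ins for httpx.TimeoutException and httpx.RequestError.

class TimeoutException(Exception):   # stands in for httpx.TimeoutException
    pass

class RequestError(Exception):       # stands in for httpx.RequestError
    pass

class LLMTimeoutError(Exception):    # jazzmine-level errors, as documented
    pass

class LLMConnectionError(Exception):
    pass

def handle_request_error(e: Exception) -> Exception:
    """Translate a low-level transport error into a jazzmine error."""
    # In real httpx, TimeoutException is a subclass of RequestError,
    # so the timeout check must come first.
    if isinstance(e, TimeoutException):
        return LLMTimeoutError(str(e))
    if isinstance(e, RequestError):
        return LLMConnectionError(str(e))
    return e  # unknown errors propagate unchanged
```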
5. Error Handling
- HTTP 429 (Rate Limit): Specifically caught and raised as LLMRateLimitError.
- HTTP 5xx (Server Error): Raised as LLMInternalError with the raw response body included for debugging.
- Response Validation: If the JSON response is missing the expected choices array (common in some proxy environments), the parser will raise an LLMInternalError or KeyError depending on the provider's output.
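The status-code and validation behavior above can be sketched as follows; the error class names follow this documentation, but the helper functions themselves are hypothetical:

```python
# Illustrative sketch of the error-handling rules listed above.

class LLMRateLimitError(Exception):
    pass

class LLMInternalError(Exception):
    pass

def raise_for_status(status_code: int, body: str) -> None:
    """Map HTTP error statuses onto jazzmine errors."""
    if status_code == 429:
        raise LLMRateLimitError("rate limited by provider")
    if 500 <= status_code < 600:
        # the raw response body is included to aid debugging
        raise LLMInternalError(f"server error {status_code}: {body}")

def parse_choices(response: dict) -> str:
    """Validate the response shape before reading the message content."""
    if "choices" not in response:  # common with some proxy environments
        raise LLMInternalError(f"missing 'choices' in response: {response}")
    return response["choices"][0]["message"]["content"]
```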
6. Remarks
- Trailing Slashes: When providing a base_url, ensure you do not include the /v1/chat/completions suffix unless you are also overriding the chat_endpoint parameter.
- Top-P Sampling: If top_p is provided in the constructor, it is included in every request. Use this instead of temperature if you prefer nucleus sampling.
- Resource Management: This class maintains both a client (sync) and an aclient (async). Always use the context manager or call await llm.aclose() to ensure both pools are terminated.
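The dual-pool cleanup pattern recommended above can be sketched with a stand-in class; `DualClient` mimics the documented shape (a sync `client` and an async `aclient`) but is not jazzmine's actual class:

```python
import asyncio

# Stand-in sketch of the dual-pool cleanup pattern described in the remarks;
# DualClient and FakePool are illustrative, not jazzmine's actual classes.

class FakePool:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True
    async def aclose(self):
        self.closed = True

class DualClient:
    def __init__(self):
        self.client = FakePool()   # sync connection pool
        self.aclient = FakePool()  # async connection pool

    async def aclose(self):
        """Terminate both pools, as the remarks recommend."""
        self.client.close()
        await self.aclient.aclose()

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        await self.aclose()

async def main() -> bool:
    async with DualClient() as llm:
        pass  # use llm here
    return llm.client.closed and llm.aclient.closed
```

Using `async with` guarantees both pools are released even if the body raises, which is why it is preferred over calling `aclose()` manually.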