1. Behavior and Context
In the jazzmine architecture, this class acts as the primary bridge to modern Large Language Models.
- Networking: It utilizes the httpx library to manage high-performance synchronous and asynchronous connection pools.
- Protocol: It communicates over HTTP/JSON with the /v1/chat/completions endpoint.
- Flexibility: Because many modern AI tools mimic OpenAI's API structure, this single class lets an agent switch from a cloud-hosted GPT-4 to a locally hosted Llama-3 simply by changing the base URL.
2. Purpose
- Standardization: Use the same interface for local development (Ollama) and production deployment (OpenAI).
- Performance: Leverage HTTP/2 and async I/O to handle real-time token streaming and concurrent message processing.
- Abstraction: Automatically handle JSON payload construction, headers, and response parsing, including token usage normalization.
3. High-Level API Examples
Example: Connecting to OpenAI

```python
from jazzmine.core.llm import OpenAICompatibleLLM

# Standard OpenAI setup
llm = OpenAICompatibleLLM(
    model="gpt-4o",
    api_key="sk-...",
    base_url="https://api.openai.com",
    temperature=0.0,
    timeout=30.0,
)
```

Example: Connecting to Ollama

```python
# Ollama serves an OpenAI-compatible API at port 11434
llm = OpenAICompatibleLLM(
    model="llama3.1",
    api_key="ollama",  # Ollama doesn't require a key, but a placeholder is needed
    base_url="http://localhost:11434",
    timeout=60.0,
)
```

4. Detailed Functionality
__init__(api_key, base_url, chat_endpoint, ...)
Functionality: Configures the communication parameters and initializes the underlying HTTP clients.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | Required | Secret key for the Bearer token authorization. |
| base_url | str | Required | The root URL of the API (e.g., https://api.openai.com). |
| chat_endpoint | str | "/v1/chat/completions" | The specific path for chat completions. |
| top_p | Optional[float] | None | Nucleus sampling parameter. |
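As a rough sketch of what the constructor assembles from these parameters (the helper functions below are illustrative, not jazzmine's actual internals), the base URL and endpoint path are joined and the Bearer authorization header is prepared:

```python
# Illustrative sketch only -- not the actual jazzmine implementation.
# Shows how base_url, chat_endpoint, and api_key might be combined.

def build_request_target(base_url: str, chat_endpoint: str = "/v1/chat/completions") -> str:
    """Join the root URL and endpoint path, tolerating a trailing slash."""
    return base_url.rstrip("/") + "/" + chat_endpoint.lstrip("/")

def build_headers(api_key: str) -> dict:
    """Standard headers for an OpenAI-compatible chat endpoint."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

This is also why the Remarks section warns about trailing slashes: the join tolerates `https://api.openai.com/` but not a base URL that already contains the endpoint path.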
generate / agenerate
Functionality: Sends a full conversation history to the model and waits for a complete response.
How it works:
- Assembles a JSON payload including model, messages, temperature, and max_tokens.
- Sends a POST request to the configured endpoint.
- Calculates latency and parses the JSON response using normalize_usage to ensure consistent token metrics.
stream / astream
Functionality: Maintains an open connection and yields tokens as they are generated by the model.
How it works:
- Enables "stream": true in the request payload.
- Iterates over the server-sent events (SSE).
- Filters out metadata lines and the [DONE] signal, yielding only the incremental text found in choices[0].delta.content.
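The SSE filtering described above can be sketched with a small generator; jazzmine's actual parser may differ in detail:

```python
import json

# Illustrative sketch of the stream/astream SSE filtering described above.

def iter_deltas(sse_lines):
    """Yield incremental text from 'data:' lines, skipping metadata and [DONE]."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore comments and empty keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel sent by OpenAI-compatible servers
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:  # role-only or empty deltas carry no text
            yield delta
```

Note that the first chunk of a stream typically carries only `{"role": "assistant"}` with no `content`, which is why the empty-delta check is needed.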
_handle_request_error(e) [Internal]
Functionality: Maps low-level httpx exceptions to jazzmine errors.
Mappings:
- httpx.TimeoutException → LLMTimeoutError
- httpx.RequestError → LLMConnectionError
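The mapping might look roughly like the following. To keep the sketch self-contained, the httpx exception classes are stood in by local placeholders; only the jazzmine-level error names come from the documentation above:

```python
# Self-contained sketch of _handle_request_error. The two transport classes
# below are local stand-ins for httpx.TimeoutException and httpx.RequestError.

class TimeoutException(Exception):   # stands in for httpx.TimeoutException
    pass

class RequestError(Exception):       # stands in for httpx.RequestError
    pass

class LLMTimeoutError(Exception):    # jazzmine-level errors, as documented
    pass

class LLMConnectionError(Exception):
    pass

def handle_request_error(e: Exception) -> Exception:
    """Translate a low-level transport error into a jazzmine error."""
    # In real httpx, TimeoutException is a subclass of RequestError,
    # so the timeout check must come first.
    if isinstance(e, TimeoutException):
        return LLMTimeoutError(str(e))
    if isinstance(e, RequestError):
        return LLMConnectionError(str(e))
    return e  # unknown errors propagate unchanged
```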
5. Error Handling
- HTTP 429 (Rate Limit): Specifically caught and raised as LLMRateLimitError.
- HTTP 5xx (Server Error): Raised as LLMInternalError with the raw response body included for debugging.
- Response Validation: If the JSON response is missing the expected choices array (common in some proxy environments), the parser will raise an LLMInternalError or KeyError depending on the provider's output.
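The status-code and validation behavior above can be sketched as follows; the error class names follow this documentation, but the helper functions themselves are hypothetical:

```python
# Illustrative sketch of the error-handling rules listed above.

class LLMRateLimitError(Exception):
    pass

class LLMInternalError(Exception):
    pass

def raise_for_status(status_code: int, body: str) -> None:
    """Map HTTP error statuses onto jazzmine errors."""
    if status_code == 429:
        raise LLMRateLimitError("rate limited by provider")
    if 500 <= status_code < 600:
        # the raw response body is included to aid debugging
        raise LLMInternalError(f"server error {status_code}: {body}")

def parse_choices(response: dict) -> str:
    """Validate the response shape before reading the message content."""
    if "choices" not in response:  # common with some proxy environments
        raise LLMInternalError(f"missing 'choices' in response: {response}")
    return response["choices"][0]["message"]["content"]
```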
6. Remarks
- Trailing Slashes: When providing a base_url, ensure you do not include the /v1/chat/completions suffix unless you are also overriding the chat_endpoint parameter.
- Top-P Sampling: If top_p is provided in the constructor, it is included in every request. Use this instead of temperature if you prefer nucleus sampling.
- Resource Management: This class maintains both a client (sync) and an aclient (async). Always use the context manager or call await llm.aclose() to ensure both pools are terminated.
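The dual-pool cleanup pattern recommended above can be sketched with a stand-in class; `DualClient` mimics the documented shape (a sync `client` and an async `aclient`) but is not jazzmine's actual class:

```python
import asyncio

# Stand-in sketch of the dual-pool cleanup pattern described in the remarks;
# DualClient and FakePool are illustrative, not jazzmine's actual classes.

class FakePool:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True
    async def aclose(self):
        self.closed = True

class DualClient:
    def __init__(self):
        self.client = FakePool()   # sync connection pool
        self.aclient = FakePool()  # async connection pool

    async def aclose(self):
        """Terminate both pools, as the remarks recommend."""
        self.client.close()
        await self.aclient.aclose()

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        await self.aclose()

async def main() -> bool:
    async with DualClient() as llm:
        pass  # use llm here
    return llm.client.closed and llm.aclient.closed
```

Using `async with` guarantees both pools are released even if the body raises, which is why it is preferred over calling `aclose()` manually.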