
LLM Providers: OpenAICompatibleLLM

The OpenAICompatibleLLM is the most versatile provider within the jazzmine framework. While named after OpenAI, it implements the industry-standard "Chat Completions" API protocol. This allows it to function as a "Universal Adapter" for any model provider that adheres to this standard, including official OpenAI services, high-performance cloud providers like Groq or Together AI, and local inference servers like Ollama or vLLM.

1. Behavior and Context

In the jazzmine architecture, this class acts as the primary bridge to modern Large Language Models.

  • Networking: It utilizes the httpx library to manage high-performance synchronous and asynchronous connection pools.
  • Protocol: It communicates over HTTP/JSON with the /v1/chat/completions endpoint.
  • Flexibility: Because many modern AI tools mimic OpenAI's API structure, this single class enables an agent to switch from a cloud-hosted GPT-4 to a locally hosted Llama 3 by changing nothing more than a URL.

2. Purpose

  • Standardization: Use the same interface for local development (Ollama) and production deployment (OpenAI).
  • Performance: Leverages HTTP/2 and async I/O to handle real-time token streaming and concurrent message processing.
  • Abstraction: Automatically handles JSON payload construction, headers, and response parsing, including token usage normalization.

3. High-Level API Examples

Example: Connecting to OpenAI

```python
from jazzmine.core.llm import OpenAICompatibleLLM

# Standard OpenAI setup
llm = OpenAICompatibleLLM(
    model="gpt-4o",
    api_key="sk-...",
    base_url="https://api.openai.com",
    temperature=0.0,
    timeout=30.0
)

# Ollama serves an OpenAI-compatible API at port 11434
llm = OpenAICompatibleLLM(
    model="llama3.1",
    api_key="ollama",  # Ollama doesn't require a key, but a placeholder is needed
    base_url="http://localhost:11434",
    timeout=60.0
)
```

4. Detailed Functionality

__init__(api_key, base_url, chat_endpoint, ...)

Functionality: Configures the communication parameters and initializes the underlying HTTP clients.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `api_key` | `str` | Required | Secret key for the Bearer token authorization. |
| `base_url` | `str` | Required | The root URL of the API (e.g., `https://api.openai.com`). |
| `chat_endpoint` | `str` | `"/v1/chat/completions"` | The specific path for chat completions. |
| `top_p` | `Optional[float]` | `None` | Nucleus sampling parameter. |

generate / agenerate

Functionality: Sends a full conversation history to the model and waits for a complete response.

How it works:

  • Assembles a JSON payload including model, messages, temperature, and max_tokens.
  • Sends a POST request to the configured endpoint.
  • Calculates latency and parses the JSON response using normalize_usage to ensure consistent token metrics.
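The payload-assembly step can be sketched as follows. Note that `build_chat_payload` is a hypothetical standalone helper for illustration, not part of the jazzmine API; the real implementation lives inside `generate`:

```python
from typing import Any, Optional

def build_chat_payload(
    model: str,
    messages: list[dict[str, str]],
    temperature: float = 0.0,
    max_tokens: Optional[int] = None,
    top_p: Optional[float] = None,
) -> dict[str, Any]:
    """Assemble the JSON body for a POST to /v1/chat/completions."""
    payload: dict[str, Any] = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
    }
    # Optional parameters are sent only when explicitly set,
    # so the provider's own defaults are preserved otherwise.
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if top_p is not None:
        payload["top_p"] = top_p
    return payload

payload = build_chat_payload(
    "gpt-4o", [{"role": "user", "content": "Hi"}], max_tokens=64
)
```

Omitting unset optional fields (rather than sending `null`) matters in practice: some OpenAI-compatible servers reject explicit nulls.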

stream / astream

Functionality: Maintains an open connection and yields tokens as they are generated by the model.

How it works:

  • Enables "stream": true in the request payload.
  • Iterates over the server-sent events (SSE).
  • Filters out metadata lines and the [DONE] signal, yielding only the incremental text found in choices[0].delta.content.
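The SSE filtering logic above can be sketched as a standalone generator. This is an illustrative reimplementation of the parsing step, not jazzmine's actual code; `iter_stream_tokens` is a hypothetical name:

```python
import json
from typing import Iterable, Iterator

def iter_stream_tokens(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield incremental text from OpenAI-style server-sent event lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines, comments, and other metadata
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel sent by the server
        chunk = json.loads(data)
        # Role-only deltas carry no text; yield only actual content.
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

lines = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_tokens(lines)))  # → Hello
```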

_handle_request_error(e) [Internal]

Functionality: Maps low-level httpx exceptions to jazzmine errors.

Mappings:

  • httpx.TimeoutException → LLMTimeoutError
  • httpx.RequestError → LLMConnectionError

5. Error Handling

  • HTTP 429 (Rate Limit): Specifically caught and raised as LLMRateLimitError.
  • HTTP 5xx (Server Error): Raised as LLMInternalError with the raw response body included for debugging.
  • Response Validation: If the JSON response is missing the expected choices array (common in some proxy environments), the parser will raise an LLMInternalError or KeyError depending on the provider's output.
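The response-validation behavior can be sketched like this. `LLMInternalError` here is a local stand-in for jazzmine's exception class, and `extract_completion` is a hypothetical helper, shown only to illustrate the guard against proxy responses that omit `choices`:

```python
class LLMInternalError(RuntimeError):
    """Stand-in for jazzmine's LLMInternalError (illustrative only)."""

def extract_completion(body: dict) -> str:
    """Return the assistant message text, validating the response shape."""
    choices = body.get("choices")
    if not choices:
        # Some proxies return error bodies with no choices array at all;
        # surface the raw body to make debugging possible.
        raise LLMInternalError(f"Malformed completion response: {body!r}")
    return choices[0]["message"]["content"]

text = extract_completion({"choices": [{"message": {"content": "ok"}}]})
```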

6. Remarks

  • Trailing Slashes: When providing a base_url, ensure you do not include the /v1/chat/completions suffix unless you are also overriding the chat_endpoint parameter.
  • Top-P Sampling: If top_p is provided in the constructor, it is included in every request. Use this instead of temperature if you prefer nucleus sampling.
  • Resource Management: This class maintains both a client (sync) and an aclient (async). Always use the context manager or call await llm.aclose() to ensure both pools are terminated.
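The dual-pool lifecycle can be illustrated with a minimal, self-contained sketch. `DummyLLM` is not the real class; where it sets flags, `OpenAICompatibleLLM` would close its underlying `httpx.Client` and `httpx.AsyncClient`:

```python
import asyncio

class DummyLLM:
    """Minimal sketch of the sync/async client lifecycle (illustrative only)."""

    def __init__(self) -> None:
        self.client_open = True   # stands in for the sync httpx.Client pool
        self.aclient_open = True  # stands in for the httpx.AsyncClient pool

    async def aclose(self) -> None:
        # Terminate BOTH pools so no sockets are leaked.
        self.client_open = False
        self.aclient_open = False

    async def __aenter__(self) -> "DummyLLM":
        return self

    async def __aexit__(self, *exc) -> None:
        await self.aclose()

async def main() -> None:
    # Preferred: the context manager guarantees cleanup even on error.
    async with DummyLLM() as llm:
        assert llm.client_open and llm.aclient_open
    assert not llm.client_open and not llm.aclient_open

asyncio.run(main())
```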