1. Behavior and Context
In the jazzmine architecture, CohereLLM functions as a specialized high-reasoning backend.
- V2 Protocol: It communicates with the https://api.cohere.com/v2 endpoint, moving away from legacy request structures to a standardized chat-completion format.
- Dual-Client Management: Like other providers, it maintains both an httpx.Client and httpx.AsyncClient to handle synchronous background tasks and asynchronous user-facing chat loops without blocking.
- Exact Token Mapping: Cohere provides high-fidelity token usage metadata in its response. This provider extracts input_tokens and output_tokens directly, allowing for precise cost tracking and context management.
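The exact token mapping can be sketched as follows. `LLMUsage` here is a simplified stand-in for the framework's usage type, and the `usage -> tokens` path is an assumption about the V2 payload shape:

```python
from dataclasses import dataclass

@dataclass
class LLMUsage:
    # Simplified stand-in for the framework's usage object.
    input_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

def usage_from_cohere(data: dict) -> LLMUsage:
    # Pull the counts straight from the response; the "tokens" block
    # under "usage" is an assumption about the V2 payload shape.
    tokens = data["usage"]["tokens"]
    return LLMUsage(input_tokens=tokens["input_tokens"],
                    output_tokens=tokens["output_tokens"])
```

Because the counts come from the API rather than a local tokenizer estimate, cost tracking stays accurate across model versions.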
2. Purpose
- Enterprise Reasoning: Leveraging models optimized for business logic and structured data extraction.
- RAG Optimization: Ideal for agents that rely heavily on EpisodicMemory and large context windows, as Command R is built to cite and reason over long documents.
- Cost Efficiency: Providing a balance of high performance and competitive pricing for the "Agent Reasoning" loop.
3. High-Level API Examples
Example: Initializing the Cohere Provider
from jazzmine.core.llm import CohereLLM
# Initialize with the Command R+ model for maximum intelligence
llm = CohereLLM(
    model="command-r-plus",
    api_key="your-cohere-api-key",
    temperature=0.3,
    max_tokens=2048,
    timeout=30.0
)
# Standard async generation
response = await llm.agenerate(messages)
print(f"Cohere says: {response.text}")
print(f"Turn cost: {response.usage.total_tokens} tokens")
4. Detailed Functionality
__init__(api_key, model, **kwargs)
Functionality: Sets up the API credentials and initializes the synchronous and asynchronous HTTP clients.
Parameters:
- api_key (str): Your Cohere API key.
- model (str): The model ID. Defaults to "command-r-plus".
- **kwargs: Inherited parameters like temperature, max_tokens, and timeout.
_prepare_payload(messages, stream) [Internal]
Functionality: Converts the framework's MessagePart list into the Cohere V2 JSON schema.
How it works: It maps the standard roles (user, assistant, system) directly to the Cohere messages array and injects the model, stream status, and sampling configuration.
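A standalone sketch of this conversion, assuming plain role/content dictionaries as input (the real method operates on the framework's `MessagePart` objects):

```python
def prepare_payload(messages, model="command-r-plus", stream=False,
                    temperature=0.3, max_tokens=2048):
    # Roles (user / assistant / system) pass through to Cohere V2 unchanged;
    # sampling configuration rides alongside the messages array.
    return {
        "model": model,
        "stream": stream,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "messages": [{"role": m["role"], "content": m["content"]}
                     for m in messages],
    }
```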
_parse_response(data, start_time) [Internal]
Functionality: Navigates the Cohere V2 response tree to extract text and usage data.
How it works:
- Text Extraction: Accesses the text content via the response path message -> content[0] -> text.
- Usage Normalization: It maps Cohere's input_tokens and output_tokens fields to the standardized LLMUsage object.
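The parsing steps above can be sketched as a standalone function; the `usage -> tokens` block is an assumption about the V2 payload shape, and a plain dict stands in for the framework's response object:

```python
import time

def parse_response(data: dict, start_time: float) -> dict:
    # Text lives at message -> content[0] -> text in the V2 response tree.
    text = data["message"]["content"][0]["text"]
    # Assumption: token counts sit under usage -> tokens.
    tokens = data["usage"]["tokens"]
    return {
        "text": text,
        "input_tokens": tokens["input_tokens"],
        "output_tokens": tokens["output_tokens"],
        "latency_s": time.monotonic() - start_time,
    }
```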
stream / astream
Functionality: Processes real-time token events from the Cohere API.
How it works: It listens for JSON events emitted by the /chat endpoint. It specifically identifies events of type: "content-delta". It then extracts the incremental text change from the delta object and yields it to the caller.
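The event-filtering logic can be sketched as a generator over raw chunk strings; the `delta -> message -> content -> text` path is an assumption about the streamed event shape:

```python
import json

def iter_content_deltas(raw_lines):
    # Yield incremental text from "content-delta" events; malformed
    # partial chunks are skipped rather than crashing the turn.
    for raw in raw_lines:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if event.get("type") == "content-delta":
            yield event["delta"]["message"]["content"]["text"]
```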
5. Error Handling
- LLMRateLimitError: Explicitly caught when the Cohere API returns a 429 status code, indicating that the plan's quota has been exceeded.
- LLMTimeoutError: Raised if the request exceeds the timeout (in seconds) specified in the constructor, falling back to httpx's default timeout when none is set.
- LLMInternalError: Raised for general server-side issues or non-200 status codes not covered by specific error types.
- JSON Resilience: Both the standard and streaming parsers include try...except blocks for JSONDecodeError, ensuring that malformed partial chunks do not crash the entire agent turn.
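The status-code mapping above can be sketched as follows. The exception classes are stand-ins for the framework's hierarchy; LLMTimeoutError is raised from httpx's timeout exception rather than a status code, so it is omitted here:

```python
class LLMRateLimitError(Exception):
    """Plan quota exceeded (HTTP 429)."""

class LLMInternalError(Exception):
    """Server-side issue or unexpected status code."""

def raise_for_status(status_code: int) -> None:
    # 429 gets its own exception type; any other non-200 status
    # falls through to the generic internal error.
    if status_code == 429:
        raise LLMRateLimitError("Cohere API rate limit exceeded (HTTP 429)")
    if status_code != 200:
        raise LLMInternalError(f"Cohere API returned HTTP {status_code}")
```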
6. Remarks
- V2 Endpoint: This provider targets https://api.cohere.com/v2. If you are using an older version of the Cohere API, ensure you check for endpoint compatibility.
- Command R optimization: When using this provider for the main agent loop, it is recommended to set a lower temperature (e.g., 0.1 to 0.3) to maximize the consistency of the tool-calling logic.
- Context Management: Always call await llm.aclose() at the end of your session to ensure the httpx clients are closed properly.
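One way to guarantee that cleanup is to wrap the session in try/finally; `DummyLLM` below is a hypothetical stand-in to keep the sketch self-contained:

```python
import asyncio

async def run_session(llm, messages):
    # Ensure both httpx clients are released even if generation raises.
    try:
        return await llm.agenerate(messages)
    finally:
        await llm.aclose()

class DummyLLM:
    # Minimal stand-in exposing the two methods the helper relies on.
    def __init__(self):
        self.closed = False

    async def agenerate(self, messages):
        return f"echo: {messages[-1]['content']}"

    async def aclose(self):
        self.closed = True
```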