
Core: EmbeddingService

EmbeddingService is the local model-export utility used to prepare embedding assets for the Jazzmine memory runtime.

1. Overview

It is responsible for:

  • downloading/loading the tokenizer and transformer model from Hugging Face,
  • exporting ONNX models for local embedding inference,
  • optionally quantizing the exported ONNX model to INT8,
  • exporting tokenizer-only assets for remote embedding backends that still need local BM25 tokenization.

The class is synchronous and uses lazy loading for both tokenizer and model.
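The lazy-loading behavior can be sketched as follows. The class shape and field names come from this document; the deferred `transformers` imports and exact loader bodies are assumptions for illustration:

```python
class EmbeddingService:
    """Illustrative sketch of the lazy state described above (not the actual source)."""

    def __init__(self, model_id: str = "BAAI/bge-small-en-v1.5"):
        self.model_id = model_id
        self.tokenizer = None  # loaded on first use
        self.model = None      # loaded on first use

    def _load_tokenizer(self):
        if self.tokenizer is None:
            # Import deferred so construction stays cheap and offline.
            from transformers import AutoTokenizer
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
        return self.tokenizer

    def _load_model(self):
        if self.model is None:
            from transformers import AutoModel
            self.model = AutoModel.from_pretrained(self.model_id)
            self.model.eval()  # inference mode for export
        return self.model
```

Because both loaders are idempotent, repeated exports reuse the cached tokenizer and model.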

2. Public API Coverage Checklist

This document covers every public method and key internal helper in the implementation.

2.1 Public Methods

| Method | Included | Purpose |
| --- | --- | --- |
| `__init__(model_id="BAAI/bge-small-en-v1.5")` | Yes | Configure model identifier and lazy state |
| `export_tokenizer_only(output_dir)` | Yes | Save tokenizer files only |
| `export_onnx_model(output_dir, quantized=True, opset=17)` | Yes | Export FP32 ONNX, optionally quantize to INT8 |

2.2 Internal Helpers

| Helper | Included | Purpose |
| --- | --- | --- |
| `_model_id_for_filename()` | Yes | Convert model id into filename-safe slug |
| `_load_tokenizer()` | Yes | Lazy-load tokenizer |
| `_load_model()` | Yes | Lazy-load model and switch to eval mode |
| `_quantize(fp32_path, int8_path)` | Yes | INT8 static quantization with calibration reader |

3. Constructor and Methods

3.1 __init__(model_id="BAAI/bge-small-en-v1.5")

Initializes service state.

State fields:

  • self.model_id: the selected model id,
  • self.tokenizer: initially None,
  • self.model: initially None.

No network or model loading occurs during construction.

3.2 export_tokenizer_only(output_dir)

Exports tokenizer files to output_dir.

Flow:

  1. ensure target directory exists,
  2. lazy-load tokenizer if needed,
  3. call save_pretrained(output_dir),
  4. return output_dir.

Use case:

  • remote embeddings (OpenAI/Cohere/vLLM/etc.) with local sparse/BM25 support.
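A minimal sketch of this flow, assuming the `_load_tokenizer` helper named in section 2.2 (the actual method body may differ):

```python
import os

def export_tokenizer_only(self, output_dir: str) -> str:
    """Illustrative sketch of the four-step flow described above."""
    os.makedirs(output_dir, exist_ok=True)  # 1. ensure target directory exists
    tokenizer = self._load_tokenizer()      # 2. lazy-load tokenizer if needed
    tokenizer.save_pretrained(output_dir)   # 3. write tokenizer files
    return output_dir                       # 4. hand back the target path
```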

3.3 export_onnx_model(output_dir, quantized=True, opset=17)

Exports embedding model to ONNX.

High-level steps:

  1. ensure target directory exists,
  2. load tokenizer and model lazily,
  3. save tokenizer assets into same directory,
  4. build dummy batched tokenized inputs,
  5. export FP32 ONNX,
  6. optionally quantize to INT8 QDQ.

Return value:

  • FP32 mode (quantized=False): returns FP32 ONNX path,
  • quantized mode (quantized=True): returns INT8 ONNX path.
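The six steps and the two return modes can be outlined as below. The helper names `_export_fp32` and the artifact-naming calls are assumptions based on sections 2.2, 4, and 6; only the overall flow is taken from this document:

```python
import os

def export_onnx_model(self, output_dir: str, quantized: bool = True, opset: int = 17) -> str:
    """Illustrative outline of the export flow described above."""
    os.makedirs(output_dir, exist_ok=True)           # 1. ensure target directory
    tokenizer = self._load_tokenizer()               # 2. lazy loads
    model = self._load_model()
    tokenizer.save_pretrained(output_dir)            # 3. co-locate tokenizer assets
    dummy = tokenizer(["short text", "a longer dummy sentence"],
                      padding=True, return_tensors="pt")  # 4. batched dummy inputs
    slug = self._model_id_for_filename()
    fp32_path = os.path.join(output_dir, f"{slug}_fp32.onnx")
    self._export_fp32(model, dummy, fp32_path, opset)     # 5. FP32 export (see section 4)
    if not quantized:
        return fp32_path
    int8_path = os.path.join(output_dir, f"{slug}_int8_qdq.onnx")
    self._quantize(fp32_path, int8_path)                  # 6. INT8 QDQ quantization
    os.remove(fp32_path)                                  # drop FP32 intermediate (section 5.3)
    return int8_path
```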

4. ONNX Export Design

4.1 Wrapper Model

The exporter uses an internal _OnnxExportWrapper(torch.nn.Module) to normalize output behavior across transformer models.

Important behavior:

  • inspects forward signature to avoid unsupported kwargs,
  • passes explicit kwargs to avoid tracing collisions,
  • disables optional outputs when supported (use_cache, output_attentions, output_hidden_states),
  • always returns a pair:
      • last_hidden_state,
      • pooler_output (falls back to the CLS token, last_hidden_state[:, 0, :], when absent).

4.2 Dynamic Shape Strategy

Primary export path uses dynamo exporter:

  • sets dynamic batch and sequence dimensions using torch.export.Dim,
  • enforces opset max(opset, 18) for dynamo export.

Fallback export path uses legacy exporter:

  • uses dynamic_axes for both inputs and outputs,
  • keeps pooled output batch dimension dynamic to avoid runtime shape mismatch,
  • uses provided opset.
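The legacy path's dynamic-axes mapping can be sketched as below. The input/output names follow section 4.1; the surrounding function is a hypothetical shape, not the actual implementation:

```python
# Dynamic-axes mapping for the legacy exporter path (names assumed from section 4.1).
DYNAMIC_AXES = {
    "input_ids": {0: "batch", 1: "sequence"},
    "attention_mask": {0: "batch", 1: "sequence"},
    "last_hidden_state": {0: "batch", 1: "sequence"},
    # Pooled output: only the batch dimension is dynamic, avoiding
    # runtime shape mismatches for variable batch sizes.
    "pooler_output": {0: "batch"},
}

def export_legacy(wrapper, dummy, fp32_path: str, opset: int) -> None:
    """Hypothetical sketch of the legacy torch.onnx.export call."""
    import torch  # deferred so the mapping above is usable without torch
    torch.onnx.export(
        wrapper,
        (dummy["input_ids"], dummy["attention_mask"]),
        fp32_path,
        input_names=["input_ids", "attention_mask"],
        output_names=["last_hidden_state", "pooler_output"],
        dynamic_axes=DYNAMIC_AXES,
        opset_version=opset,  # uses the provided opset as-is
    )
```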

4.3 Export Fallback Behavior

Default behavior:

  • try dynamo export first,
  • if dynamo fails, log fallback message and retry with legacy export.

This provides resilience across varying PyTorch/ONNX exporter compatibility levels.

5. Quantization Pipeline

5.1 _quantize(fp32_path, int8_path)

Uses onnxruntime.quantization.quantize_static with:

  • QuantFormat.QDQ,
  • weight_type=QInt8,
  • activation_type=QInt8.

Calibration data:

  • internal TextCalibrationDataReader built from tokenizer,
  • minimal repeated text corpus,
  • fixed max sequence length 128,
  • emits input_ids and attention_mask numpy arrays.
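A calibration reader matching this description might look like the sketch below. It is duck-typed here for illustration; the real implementation would subclass `onnxruntime.quantization.CalibrationDataReader`, and the corpus text is an assumption:

```python
import numpy as np

class TextCalibrationDataReader:
    """Sketch: feed tokenized text batches to quantize_static via get_next()."""

    def __init__(self, tokenizer, max_length: int = 128):
        # Minimal repeated corpus, tokenized to a fixed length of 128.
        texts = ["calibration example sentence"] * 8
        batches = []
        for text in texts:
            enc = tokenizer(text, truncation=True,
                            padding="max_length", max_length=max_length)
            batches.append({
                "input_ids": np.asarray([enc["input_ids"]], dtype=np.int64),
                "attention_mask": np.asarray([enc["attention_mask"]], dtype=np.int64),
            })
        self._iterator = iter(batches)

    def get_next(self):
        # The calibrator pulls batches until this returns None.
        return next(self._iterator, None)
```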

5.2 Quantization Fallback

When quantization fails on a dynamo-exported FP32 model:

  1. remove failed FP32 artifact,
  2. re-export with legacy exporter,
  3. retry quantization.

If quantization still fails after the legacy re-export, the exception is propagated to the caller.

5.3 Output Cleanup

On successful quantized export:

  • FP32 intermediate ONNX file is deleted,
  • INT8 ONNX path is returned.

6. Filenames and Artifacts

The model id is transformed from org/model to org-model for use in filenames.

Generated artifacts:

  • tokenizer files from save_pretrained,
  • FP32 model: <slug>_fp32.onnx,
  • INT8 model: <slug>_int8_qdq.onnx.

7. Error Handling and Fallback Semantics

7.1 Export Errors

Handled in export_onnx_model by staged fallback:

  • dynamo export failure falls back to legacy export,
  • quantization failure on dynamo path retries through legacy path.

7.2 Propagated Failures

These are not swallowed after fallback options are exhausted:

  • tokenizer/model load errors,
  • final legacy export errors,
  • final quantization errors,
  • filesystem errors (permission/path/disk issues).

8. Practical Examples

8.1 Tokenizer-Only Export

```python
from jazzmine.core.embedding import EmbeddingService

svc = EmbeddingService(model_id="BAAI/bge-small-en-v1.5")
tokenizer_dir = svc.export_tokenizer_only("./models/bge-small")
print(tokenizer_dir)
```

8.2 Local INT8 ONNX Export

```python
from jazzmine.core.embedding import EmbeddingService

svc = EmbeddingService(model_id="BAAI/bge-small-en-v1.5")
model_path = svc.export_onnx_model(
    output_dir="./models/bge-small",
    quantized=True,
    opset=17,
)
print(model_path)
```

8.3 FP32 ONNX Export

```python
from jazzmine.core.embedding import EmbeddingService

svc = EmbeddingService(model_id="BAAI/bge-small-en-v1.5")
fp32_path = svc.export_onnx_model(
    output_dir="./models/bge-small",
    quantized=False,
    opset=17,
)
print(fp32_path)
```

9. Operational Guidance

  • Keep tokenizer and ONNX files in the same directory to simplify downstream memory configuration.
  • Prefer quantized=True for lower memory footprint and faster inference on CPU-heavy deployments.
  • Keep opset aligned with your deployed ONNX Runtime capabilities; default 17 is a balanced baseline.
  • Treat exporter fallback logs as useful diagnostics when upgrading PyTorch/transformers/ONNX stacks.