
Core: EmbeddingService

EmbeddingService is the local model-export utility used to prepare embedding assets for the Jazzmine memory runtime.

1. Overview

It is responsible for:

  • downloading/loading the tokenizer and transformer model from Hugging Face,
  • exporting ONNX models for local embedding inference,
  • optionally quantizing the exported ONNX model to INT8,
  • exporting tokenizer-only assets for remote embedding backends that still need local BM25 tokenization.

The class is synchronous and uses lazy loading for both tokenizer and model.
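The lazy-loading behavior can be sketched as follows. The class shape and field names come from this document; the deferred `transformers` imports and exact loader bodies are assumptions for illustration:

```python
class EmbeddingService:
    """Illustrative sketch of the lazy state described above (not the actual source)."""

    def __init__(self, model_id: str = "BAAI/bge-small-en-v1.5"):
        self.model_id = model_id
        self.tokenizer = None  # loaded on first use
        self.model = None      # loaded on first use

    def _load_tokenizer(self):
        if self.tokenizer is None:
            # Import deferred so construction stays cheap and offline.
            from transformers import AutoTokenizer
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
        return self.tokenizer

    def _load_model(self):
        if self.model is None:
            from transformers import AutoModel
            self.model = AutoModel.from_pretrained(self.model_id)
            self.model.eval()  # inference mode for export
        return self.model
```

Because both loaders are idempotent, repeated exports reuse the cached tokenizer and model.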

2. Public API Coverage Checklist

This document covers every public method and key internal helper in the implementation.

2.1 Public Methods

| Method | Included | Purpose |
| --- | --- | --- |
| `__init__(model_id="BAAI/bge-small-en-v1.5")` | Yes | Configure model identifier and lazy state |
| `export_tokenizer_only(output_dir)` | Yes | Save tokenizer files only |
| `export_onnx_model(output_dir, quantized=True, opset=17)` | Yes | Export FP32 ONNX, optionally quantize to INT8 |

2.2 Internal Helpers

| Helper | Included | Purpose |
| --- | --- | --- |
| `_model_id_for_filename()` | Yes | Convert model id into filename-safe slug |
| `_load_tokenizer()` | Yes | Lazy-load tokenizer |
| `_load_model()` | Yes | Lazy-load model and switch to eval mode |
| `_quantize(fp32_path, int8_path)` | Yes | INT8 static quantization with calibration reader |

3. Constructor and Methods

3.1 __init__(model_id="BAAI/bge-small-en-v1.5")

Initializes service state.

State fields:

  • self.model_id: the selected model id,
  • self.tokenizer: initially None,
  • self.model: initially None.

No network or model loading occurs during construction.

3.2 export_tokenizer_only(output_dir)

Exports tokenizer files to output_dir.

Flow:

  1. ensure target directory exists,
  2. lazy-load tokenizer if needed,
  3. call save_pretrained(output_dir),
  4. return output_dir.

Use case:

  • remote embeddings (OpenAI/Cohere/vLLM/etc.) with local sparse/BM25 support.
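A minimal sketch of this flow, assuming the `_load_tokenizer` helper named in section 2.2 (the actual method body may differ):

```python
import os

def export_tokenizer_only(self, output_dir: str) -> str:
    """Illustrative sketch of the four-step flow described above."""
    os.makedirs(output_dir, exist_ok=True)  # 1. ensure target directory exists
    tokenizer = self._load_tokenizer()      # 2. lazy-load tokenizer if needed
    tokenizer.save_pretrained(output_dir)   # 3. write tokenizer files
    return output_dir                       # 4. hand back the target path
```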

3.3 export_onnx_model(output_dir, quantized=True, opset=17)

Exports embedding model to ONNX.

High-level steps:

  1. ensure target directory exists,
  2. load tokenizer and model lazily,
  3. save tokenizer assets into same directory,
  4. build dummy batched tokenized inputs,
  5. export FP32 ONNX,
  6. optionally quantize to INT8 QDQ.

Return value:

  • FP32 mode (quantized=False): returns FP32 ONNX path,
  • quantized mode (quantized=True): returns INT8 ONNX path.
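The six steps and the two return modes can be outlined as below. The helper names `_export_fp32` and the artifact-naming calls are assumptions based on sections 2.2, 4, and 6; only the overall flow is taken from this document:

```python
import os

def export_onnx_model(self, output_dir: str, quantized: bool = True, opset: int = 17) -> str:
    """Illustrative outline of the export flow described above."""
    os.makedirs(output_dir, exist_ok=True)           # 1. ensure target directory
    tokenizer = self._load_tokenizer()               # 2. lazy loads
    model = self._load_model()
    tokenizer.save_pretrained(output_dir)            # 3. co-locate tokenizer assets
    dummy = tokenizer(["short text", "a longer dummy sentence"],
                      padding=True, return_tensors="pt")  # 4. batched dummy inputs
    slug = self._model_id_for_filename()
    fp32_path = os.path.join(output_dir, f"{slug}_fp32.onnx")
    self._export_fp32(model, dummy, fp32_path, opset)     # 5. FP32 export (see section 4)
    if not quantized:
        return fp32_path
    int8_path = os.path.join(output_dir, f"{slug}_int8_qdq.onnx")
    self._quantize(fp32_path, int8_path)                  # 6. INT8 QDQ quantization
    os.remove(fp32_path)                                  # drop FP32 intermediate (section 5.3)
    return int8_path
```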

4. ONNX Export Design

4.1 Wrapper Model

The exporter uses an internal _OnnxExportWrapper(torch.nn.Module) to normalize output behavior across transformer models.

Important behavior:

  • inspects forward signature to avoid unsupported kwargs,
  • passes explicit kwargs to avoid tracing collisions,
  • disables optional outputs when supported (use_cache, output_attentions, output_hidden_states),
  • always returns a pair:
      • last_hidden_state,
      • pooler_output (falls back to the CLS token, last_hidden_state[:, 0, :], when absent).

4.2 Dynamic Shape Strategy

Primary export path uses dynamo exporter:

  • sets dynamic batch and sequence dimensions using torch.export.Dim,
  • enforces opset max(opset, 18) for dynamo export.

Fallback export path uses legacy exporter:

  • uses dynamic_axes for both inputs and outputs,
  • keeps pooled output batch dimension dynamic to avoid runtime shape mismatch,
  • uses provided opset.
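The legacy path's dynamic-axes mapping can be sketched as below. The input/output names follow section 4.1; the surrounding function is a hypothetical shape, not the actual implementation:

```python
# Dynamic-axes mapping for the legacy exporter path (names assumed from section 4.1).
DYNAMIC_AXES = {
    "input_ids": {0: "batch", 1: "sequence"},
    "attention_mask": {0: "batch", 1: "sequence"},
    "last_hidden_state": {0: "batch", 1: "sequence"},
    # Pooled output: only the batch dimension is dynamic, avoiding
    # runtime shape mismatches for variable batch sizes.
    "pooler_output": {0: "batch"},
}

def export_legacy(wrapper, dummy, fp32_path: str, opset: int) -> None:
    """Hypothetical sketch of the legacy torch.onnx.export call."""
    import torch  # deferred so the mapping above is usable without torch
    torch.onnx.export(
        wrapper,
        (dummy["input_ids"], dummy["attention_mask"]),
        fp32_path,
        input_names=["input_ids", "attention_mask"],
        output_names=["last_hidden_state", "pooler_output"],
        dynamic_axes=DYNAMIC_AXES,
        opset_version=opset,  # uses the provided opset as-is
    )
```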

4.3 Export Fallback Behavior

Default behavior:

  • try dynamo export first,
  • if dynamo fails, log fallback message and retry with legacy export.

This provides resilience across varying PyTorch/ONNX exporter compatibility levels.

5. Quantization Pipeline

5.1 _quantize(fp32_path, int8_path)

Uses onnxruntime.quantization.quantize_static with:

  • QuantFormat.QDQ,
  • weight_type=QInt8,
  • activation_type=QInt8.

Calibration data:

  • internal TextCalibrationDataReader built from tokenizer,
  • minimal repeated text corpus,
  • fixed max sequence length 128,
  • emits input_ids and attention_mask numpy arrays.
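A calibration reader matching this description might look like the sketch below. It is duck-typed here for illustration; the real implementation would subclass `onnxruntime.quantization.CalibrationDataReader`, and the corpus text is an assumption:

```python
import numpy as np

class TextCalibrationDataReader:
    """Sketch: feed tokenized text batches to quantize_static via get_next()."""

    def __init__(self, tokenizer, max_length: int = 128):
        # Minimal repeated corpus, tokenized to a fixed length of 128.
        texts = ["calibration example sentence"] * 8
        batches = []
        for text in texts:
            enc = tokenizer(text, truncation=True,
                            padding="max_length", max_length=max_length)
            batches.append({
                "input_ids": np.asarray([enc["input_ids"]], dtype=np.int64),
                "attention_mask": np.asarray([enc["attention_mask"]], dtype=np.int64),
            })
        self._iterator = iter(batches)

    def get_next(self):
        # The calibrator pulls batches until this returns None.
        return next(self._iterator, None)
```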

5.2 Quantization Fallback

When quantization fails on a dynamo-exported FP32 model:

  1. remove failed FP32 artifact,
  2. re-export with legacy exporter,
  3. retry quantization.

If quantization still fails after the legacy re-export, the exception is propagated to the caller.

5.3 Output Cleanup

On successful quantized export:

  • FP32 intermediate ONNX file is deleted,
  • INT8 ONNX path is returned.

6. Filenames and Artifacts

The model id is transformed from org/model to org-model for use in filenames.

Generated artifacts:

  • tokenizer files from save_pretrained,
  • FP32 model: <slug>_fp32.onnx,
  • INT8 model: <slug>_int8_qdq.onnx.

7. Error Handling and Fallback Semantics

7.1 Export Errors

Handled in export_onnx_model by staged fallback:

  • dynamo export failure falls back to legacy export,
  • quantization failure on dynamo path retries through legacy path.

7.2 Propagated Failures

These are not swallowed after fallback options are exhausted:

  • tokenizer/model load errors,
  • final legacy export errors,
  • final quantization errors,
  • filesystem errors (permission/path/disk issues).

8. Practical Examples

8.1 Tokenizer-Only Export

```python
from jazzmine.core.embedding import EmbeddingService

svc = EmbeddingService(model_id="BAAI/bge-small-en-v1.5")
tokenizer_dir = svc.export_tokenizer_only("./models/bge-small")
print(tokenizer_dir)
```

8.2 Local INT8 ONNX Export

```python
from jazzmine.core.embedding import EmbeddingService

svc = EmbeddingService(model_id="BAAI/bge-small-en-v1.5")
model_path = svc.export_onnx_model(
    output_dir="./models/bge-small",
    quantized=True,
    opset=17,
)
print(model_path)
```

8.3 FP32 ONNX Export

```python
from jazzmine.core.embedding import EmbeddingService

svc = EmbeddingService(model_id="BAAI/bge-small-en-v1.5")
fp32_path = svc.export_onnx_model(
    output_dir="./models/bge-small",
    quantized=False,
    opset=17,
)
print(fp32_path)
```

9. Operational Guidance

  • Keep tokenizer and ONNX files in the same directory to simplify downstream memory configuration.
  • Prefer quantized=True for lower memory footprint and faster inference on CPU-heavy deployments.
  • Keep opset aligned with your deployed ONNX Runtime capabilities; default 17 is a balanced baseline.
  • Treat exporter fallback logs as useful diagnostics when upgrading PyTorch/transformers/ONNX stacks.