1. Overview
The EmbeddingService class is responsible for:
- downloading/loading tokenizer and transformer model from Hugging Face,
- exporting ONNX models for local embedding inference,
- optionally quantizing ONNX to INT8,
- exporting tokenizer-only assets for remote embedding backends that still need local BM25 tokenization.
The class is synchronous and lazily loads both the tokenizer and the model.
2. Public API Coverage Checklist
This document covers every public method and key internal helper in the implementation.
2.1 Public Methods
| Method | Included | Purpose |
|---|---|---|
| __init__(model_id="BAAI/bge-small-en-v1.5") | Yes | Configure model identifier and lazy state |
| export_tokenizer_only(output_dir) | Yes | Save tokenizer files only |
| export_onnx_model(output_dir, quantized=True, opset=17) | Yes | Export FP32 ONNX, optionally quantize to INT8 |
2.2 Internal Helpers
| Helper | Included | Purpose |
|---|---|---|
| _model_id_for_filename() | Yes | Convert model id into filename-safe slug |
| _load_tokenizer() | Yes | Lazy-load tokenizer |
| _load_model() | Yes | Lazy-load model and switch to eval mode |
| _quantize(fp32_path, int8_path) | Yes | INT8 static quantization with calibration reader |
3. Constructor and Methods
3.1 __init__(model_id="BAAI/bge-small-en-v1.5")
Initializes service state.
State fields:
- self.model_id selected model id,
- self.tokenizer initially None,
- self.model initially None.
No network or model loading occurs during construction.
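The constructor's lazy state can be sketched as follows. Class and attribute names follow the doc; the deferred import inside `_load_tokenizer` is an assumption about how loading is kept out of construction, not the actual implementation:

```python
class EmbeddingService:
    """Minimal sketch of the lazy construction state (assumed shape)."""

    def __init__(self, model_id: str = "BAAI/bge-small-en-v1.5"):
        self.model_id = model_id
        self.tokenizer = None  # populated on first use by _load_tokenizer()
        self.model = None      # populated on first use by _load_model()

    def _load_tokenizer(self):
        if self.tokenizer is None:
            # Deferred import keeps construction free of network and model I/O.
            from transformers import AutoTokenizer
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
        return self.tokenizer
```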
3.2 export_tokenizer_only(output_dir)
Exports tokenizer files to output_dir.
Flow:
- ensure target directory exists,
- lazy-load tokenizer if needed,
- call save_pretrained(output_dir),
- return output_dir.
Use case:
- remote embeddings (OpenAI/Cohere/vLLM/etc.) with local sparse/BM25 support.
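The flow above can be sketched as a small function. The standalone signature (taking the tokenizer as a parameter) is a simplification for illustration; the real method uses the lazily loaded instance tokenizer:

```python
from pathlib import Path

def export_tokenizer_only(tokenizer, output_dir: str) -> str:
    """Sketch of the export flow: ensure dir, save tokenizer assets, return path."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # ensure target directory exists
    tokenizer.save_pretrained(str(out))     # standard Hugging Face tokenizer API
    return str(out)
```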
3.3 export_onnx_model(output_dir, quantized=True, opset=17)
Exports embedding model to ONNX.
High-level steps:
- ensure target directory exists,
- load tokenizer and model lazily,
- save tokenizer assets into same directory,
- build dummy batched tokenized inputs,
- export FP32 ONNX,
- optionally quantize to INT8 QDQ.
Return value:
- FP32 mode (quantized=False): returns FP32 ONNX path,
- quantized mode (quantized=True): returns INT8 ONNX path.
4. ONNX Export Design
4.1 Wrapper Model
The exporter uses an internal _OnnxExportWrapper(torch.nn.Module) to normalize output behavior across transformer models.
Important behavior:
- inspects forward signature to avoid unsupported kwargs,
- passes explicit kwargs to avoid tracing collisions,
- disables optional outputs when supported (use_cache, output_attentions, output_hidden_states),
- always returns a pair:
- last_hidden_state,
- pooler_output (fallback to CLS token from last_hidden_state[:, 0, :] when absent).
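The wrapper's output normalization can be sketched independently of torch (the real wrapper is a torch.nn.Module; `normalize_outputs` and the attribute names here are illustrative, following the doc's description):

```python
def normalize_outputs(outputs):
    """Sketch of _OnnxExportWrapper's output contract.

    Always returns (last_hidden_state, pooler_output); when the model exposes
    no pooler_output, fall back to the CLS position last_hidden_state[:, 0, :].
    """
    last_hidden = outputs.last_hidden_state
    pooled = getattr(outputs, "pooler_output", None)
    if pooled is None:
        pooled = last_hidden[:, 0, :]  # CLS-token fallback
    return last_hidden, pooled
```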
4.2 Dynamic Shape Strategy
Primary export path uses dynamo exporter:
- sets dynamic batch and sequence dimensions using torch.export.Dim,
- enforces opset max(opset, 18) for dynamo export.
Fallback export path uses legacy exporter:
- uses dynamic_axes for both inputs and outputs,
- keeps pooled output batch dimension dynamic to avoid runtime shape mismatch,
- uses provided opset.
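For the legacy path, the dynamic-axes mapping might look like this (input/output names are assumed from the wrapper's contract; this dict would be passed as the `dynamic_axes` argument of `torch.onnx.export`):

```python
# Assumed tensor names; a sketch of the legacy exporter's dynamic_axes mapping.
dynamic_axes = {
    "input_ids":         {0: "batch", 1: "sequence"},
    "attention_mask":    {0: "batch", 1: "sequence"},
    "last_hidden_state": {0: "batch", 1: "sequence"},
    "pooler_output":     {0: "batch"},  # batch kept dynamic to avoid shape mismatch
}
```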
4.3 Export Fallback Behavior
Default behavior:
- try dynamo export first,
- if dynamo fails, log fallback message and retry with legacy export.
This provides resilience across varying PyTorch/ONNX exporter compatibility levels.
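The staged fallback reduces to a try/except around the two exporters. In this sketch the callables stand in for the real export steps, and the function name is illustrative:

```python
def export_with_fallback(dynamo_export, legacy_export, log=print):
    """Sketch of the staged fallback: dynamo first, legacy on failure."""
    try:
        return dynamo_export()
    except Exception as exc:
        # Fallback message gives a diagnostic trail across exporter versions.
        log(f"dynamo export failed ({exc}); retrying with legacy exporter")
        return legacy_export()
```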
5. Quantization Pipeline
5.1 _quantize(fp32_path, int8_path)
Uses onnxruntime.quantization.quantize_static with:
- QuantFormat.QDQ,
- weight_type=QInt8,
- activation_type=QInt8.
Calibration data:
- internal TextCalibrationDataReader built from tokenizer,
- minimal repeated text corpus,
- fixed max sequence length 128,
- emits input_ids and attention_mask numpy arrays.
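The calibration reader's shape can be sketched as below. In practice it would subclass `onnxruntime.quantization.CalibrationDataReader`; this dependency-free version only illustrates the `get_next()` protocol and the fixed-length batching described above:

```python
class TextCalibrationDataReader:
    """Sketch: tokenize a small repeated corpus at a fixed max length and
    emit input_ids / attention_mask batches via get_next()."""

    def __init__(self, tokenizer, texts=None, max_length=128):
        texts = texts or ["calibration sample"] * 8  # minimal repeated corpus
        enc = tokenizer(texts, padding="max_length", truncation=True,
                        max_length=max_length, return_tensors="np")
        self._batches = iter([{
            "input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"],
        }])

    def get_next(self):
        # quantize_static pulls batches until the reader returns None.
        return next(self._batches, None)
```

The reader would then be handed to `quantize_static` together with `QuantFormat.QDQ` and `QuantType.QInt8` for both weights and activations, per the settings listed above.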
5.2 Quantization Fallback
When quantization fails on a dynamo-exported FP32 model:
- remove failed FP32 artifact,
- re-export with legacy exporter,
- retry quantization.
If quantization fails after legacy export, exception is propagated.
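The retry logic reduces to the sketch below, where `quantize` and `reexport_legacy` are stand-ins for the real steps and the function name is illustrative:

```python
import os

def quantize_with_fallback(quantize, reexport_legacy, fp32_path):
    """Sketch: retry quantization once after a legacy re-export."""
    try:
        quantize(fp32_path)
    except Exception:
        if os.path.exists(fp32_path):
            os.remove(fp32_path)       # discard the failed dynamo-exported artifact
        fp32_path = reexport_legacy()  # re-export with the legacy exporter
        quantize(fp32_path)            # a second failure propagates to the caller
    return fp32_path
```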
5.3 Output Cleanup
On successful quantized export:
- FP32 intermediate ONNX file is deleted,
- INT8 ONNX path is returned.
6. Filenames and Artifacts
Model id is transformed from org/model to org-model for filenames.
Generated artifacts:
- tokenizer files from save_pretrained,
- FP32 model: <slug>_fp32.onnx,
- INT8 model: <slug>_int8_qdq.onnx.
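The naming scheme above amounts to a slug substitution plus fixed suffixes; a sketch (the helper name `artifact_paths` is illustrative):

```python
from pathlib import Path

def artifact_paths(model_id: str, output_dir: str):
    """Sketch of artifact naming: org/model -> org-model, plus fixed suffixes."""
    slug = model_id.replace("/", "-")  # filename-safe slug
    return (Path(output_dir) / f"{slug}_fp32.onnx",
            Path(output_dir) / f"{slug}_int8_qdq.onnx")
```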
7. Error Handling and Fallback Semantics
7.1 Export Errors
Handled in export_onnx_model by staged fallback:
- dynamo export failure falls back to legacy export,
- quantization failure on dynamo path retries through legacy path.
7.2 Propagated Failures
These are not swallowed after fallback options are exhausted:
- tokenizer/model load errors,
- final legacy export errors,
- final quantization errors,
- filesystem errors (permission/path/disk issues).
8. Practical Examples
8.1 Tokenizer-Only Export
```python
from jazzmine.core.embedding import EmbeddingService

svc = EmbeddingService(model_id="BAAI/bge-small-en-v1.5")
tokenizer_dir = svc.export_tokenizer_only("./models/bge-small")
print(tokenizer_dir)
```
8.2 Local INT8 ONNX Export
```python
from jazzmine.core.embedding import EmbeddingService

svc = EmbeddingService(model_id="BAAI/bge-small-en-v1.5")
model_path = svc.export_onnx_model(
    output_dir="./models/bge-small",
    quantized=True,
    opset=17,
)
print(model_path)
```
8.3 FP32 ONNX Export
```python
from jazzmine.core.embedding import EmbeddingService

svc = EmbeddingService(model_id="BAAI/bge-small-en-v1.5")
fp32_path = svc.export_onnx_model(
    output_dir="./models/bge-small",
    quantized=False,
    opset=17,
)
print(fp32_path)
```
9. Operational Guidance
- Keep tokenizer and ONNX files in the same directory to simplify downstream memory configuration.
- Prefer quantized=True for lower memory footprint and faster inference on CPU-heavy deployments.
- Keep opset aligned with your deployed ONNX Runtime capabilities; default 17 is a balanced baseline.
- Treat exporter fallback logs as useful diagnostics when upgrading PyTorch/transformers/ONNX stacks.