Paper: OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph (arXiv:2511.18622)
OGBert-110M-Sentence is a 110M-parameter ModernBERT-based sentence embedding model for glossary and domain-specific text.
| Property | Value |
|---|---|
| Architecture | ModernBERT + Mean Pooling + L2 Normalize |
| Parameters | 110M |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Vocab size | 32,768 |
| Max sequence | 1,024 tokens |
| Embedding dim | 768 (L2 normalized) |
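The configuration above can be checked against the published checkpoint; a minimal sketch using transformers `AutoConfig` and `AutoTokenizer` (repo id as in the usage examples below):

```python
from transformers import AutoConfig, AutoTokenizer

# Load the published configuration and tokenizer to confirm the values in the table
config = AutoConfig.from_pretrained('mjbommar/ogbert-110m-sentence')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-110m-sentence')

print(config.hidden_size)          # expected: 768
print(config.num_hidden_layers)    # expected: 12
print(config.num_attention_heads)  # expected: 12
print(config.vocab_size)           # expected: 32768
print(tokenizer.model_max_length)  # expected: 1024
```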
Evaluated on 80 domain-specific documents across 10 categories using KMeans clustering.
| Model | Params | ARI | Cluster Acc |
|---|---|---|---|
| OGBert-110M-Sentence | 110M | 0.941 | 0.975 |
| BERT-base | 110M | 0.896 | 0.950 |
| RoBERTa-base | 125M | 0.941 | 0.975 |
| MiniLM-L6-v2 | 22M | 0.833 | 0.925 |
OGBert-110M-Sentence matches RoBERTa-base on clustering and outperforms MiniLM-L6-v2 by roughly 13% on ARI (0.941 vs. 0.833).
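A minimal sketch of this style of evaluation; the documents and labels below are illustrative stand-ins, not the actual 80-document benchmark:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Illustrative corpus; the real benchmark uses 80 documents across 10 categories
documents = [
    'The central bank raised interest rates to curb inflation.',
    'Quarterly earnings exceeded analyst expectations.',
    'The patient presented with fever and a persistent cough.',
    'The MRI scan showed no evidence of a fracture.',
]
labels = [0, 0, 1, 1]  # finance, finance, medical, medical

model = SentenceTransformer('mjbommar/ogbert-110m-sentence')
embeddings = model.encode(documents)

# Cluster into as many groups as there are categories, then score against the true labels
kmeans = KMeans(n_clusters=len(set(labels)), n_init=10, random_state=0)
predicted = kmeans.fit_predict(embeddings)
print('ARI:', adjusted_rand_score(labels, predicted))
```

Cluster accuracy additionally requires aligning predicted cluster ids with the true labels (e.g., via Hungarian matching) before computing accuracy.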
Mean Reciprocal Rank (MRR) for same-category document retrieval.
| Model | Params | Sample MRR |
|---|---|---|
| OGBert-110M-Sentence | 110M | 0.959 |
| BERT-base | 110M | 0.994 |
| RoBERTa-base | 125M | 0.989 |
| MiniLM-L6-v2 | 22M | 0.958 |
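One way to compute this metric: treat each document as a query, rank all other documents by cosine similarity, and take the reciprocal rank of the first same-category document. A minimal sketch, reusing `embeddings` and `labels` from the clustering example above (an illustration, not necessarily the exact benchmark protocol):

```python
import numpy as np

def mean_reciprocal_rank(embeddings, labels):
    """MRR of the first same-category document when ranking by cosine similarity."""
    embeddings = np.asarray(embeddings)
    labels = np.asarray(labels)
    # Embeddings are L2 normalized, so dot products are cosine similarities
    similarities = embeddings @ embeddings.T
    np.fill_diagonal(similarities, -np.inf)  # never retrieve the query itself
    reciprocal_ranks = []
    for i in range(len(labels)):
        ranking = np.argsort(-similarities[i])        # most similar first
        hits = np.where(labels[ranking] == labels[i])[0]
        reciprocal_ranks.append(1.0 / (hits[0] + 1))  # ranks are 1-based
    return float(np.mean(reciprocal_ranks))

print('MRR:', mean_reciprocal_rank(embeddings, labels))
```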
Spearman correlation between model cosine similarities and human word-similarity judgments on SimLex-999.
| Model | Params | SimLex-999 (ρ) |
|---|---|---|
| OGBert-110M-Sentence | 110M | 0.345 |
| BERT-base | 110M | 0.070 |
| RoBERTa-base | 125M | -0.061 |
OGBert-110M-Sentence achieves roughly 5x higher SimLex-999 correlation than BERT-base (0.345 vs. 0.070).
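Word similarity is scored by encoding the two words of each pair, taking their cosine similarity, and correlating against the human ratings with Spearman's rho. A minimal sketch; the pairs and ratings below are approximate illustrations, and the real evaluation uses all 999 SimLex-999 pairs:

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

# A few illustrative pairs with approximate SimLex-999 ratings (0-10 scale)
pairs = [('old', 'new'), ('smart', 'intelligent'), ('hard', 'difficult')]
human_scores = [1.6, 9.2, 8.8]

model = SentenceTransformer('mjbommar/ogbert-110m-sentence')
left = model.encode([w1 for w1, _ in pairs])
right = model.encode([w2 for _, w2 in pairs])

# Embeddings are L2 normalized, so the row-wise dot product is cosine similarity
cosine = (left * right).sum(axis=1)
rho, _ = spearmanr(cosine, human_scores)
print('Spearman rho:', rho)
```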
At the same size as BERT-base, OGBert-110M-Sentence matches strong baselines on clustering and retrieval while substantially improving word-level similarity.

Usage with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-110m-sentence')
embeddings = model.encode(['your text here'])  # L2 normalized by default
```
Example - Domain Similarity:
```python
sentences = [
    'The financial audit revealed discrepancies in the quarterly report.',
    'An accounting review found errors in the fiscal statement.',
    'The patient was diagnosed with acute respiratory infection.',
]
embeddings = model.encode(sentences)

# Embeddings are L2 normalized, so dot products are cosine similarities
similarities = embeddings @ embeddings.T
print(similarities[0, 1])  # finance vs. finance (same domain)
print(similarities[0, 2])  # finance vs. medical (different domain)
```

The model correctly assigns higher similarity to the two same-domain (finance) sentences than to the cross-domain pair.
Alternatively, with transformers directly (mean pooling and L2 normalization must be applied manually):

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-110m-sentence')
model = AutoModel.from_pretrained('mjbommar/ogbert-110m-sentence')

inputs = tokenizer('your text here', return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over non-padding tokens + L2 normalize (critical for performance)
mask = inputs['attention_mask'].unsqueeze(-1).float()
pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
embeddings = F.normalize(pooled, p=2, dim=1)
```
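As a sanity check (a hypothetical comparison, assuming the same input text as above), the manual path should match SentenceTransformer.encode up to floating-point tolerance:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('mjbommar/ogbert-110m-sentence')
st_embeddings = st_model.encode(['your text here'])

# `embeddings` is the manually pooled and normalized tensor from the snippet above
print(np.allclose(st_embeddings, embeddings.numpy(), atol=1e-4))  # expected: True
```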
For masked language modeling or as a general-purpose encoder for further fine-tuning, use mjbommar/ogbert-110m-base instead.
If you use this model, please cite the OpenGloss dataset:
```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```
License: Apache 2.0