Paper: OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph (arXiv:2511.18622)
OGBert-110M-Sentence is a 110M-parameter ModernBERT-based sentence embedding model for glossary and domain-specific text.
| Property | Value |
|---|---|
| Architecture | ModernBERT + Mean Pooling + L2 Normalize |
| Parameters | 110M |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Vocab size | 32,768 |
| Max sequence | 1,024 tokens |
| Embedding dim | 768 (L2 normalized) |
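The configuration above can be checked against the published checkpoint; a minimal sketch using transformers `AutoConfig` and `AutoTokenizer` (repo id as in the usage examples below):

```python
from transformers import AutoConfig, AutoTokenizer

# Load the published configuration and tokenizer to confirm the values in the table
config = AutoConfig.from_pretrained('mjbommar/ogbert-110m-sentence')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-110m-sentence')

print(config.hidden_size)          # expected: 768
print(config.num_hidden_layers)    # expected: 12
print(config.num_attention_heads)  # expected: 12
print(config.vocab_size)           # expected: 32768
print(tokenizer.model_max_length)  # expected: 1024
```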
Evaluated on 80 domain-specific documents across 10 categories using KMeans clustering.
| Model | Params | ARI | Cluster Acc |
|---|---|---|---|
| OGBert-110M-Sentence | 110M | 0.941 | 0.975 |
| BERT-base | 110M | 0.896 | 0.950 |
| RoBERTa-base | 125M | 0.941 | 0.975 |
| MiniLM-L6-v2 | 22M | 0.833 | 0.925 |
OGBert-110M-Sentence matches RoBERTa-base on clustering and outperforms MiniLM-L6-v2 by roughly 13% on ARI (0.941 vs. 0.833).
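A minimal sketch of this style of evaluation; the documents and labels below are illustrative stand-ins, not the actual 80-document benchmark:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Illustrative corpus; the real benchmark uses 80 documents across 10 categories
documents = [
    'The central bank raised interest rates to curb inflation.',
    'Quarterly earnings exceeded analyst expectations.',
    'The patient presented with fever and a persistent cough.',
    'The MRI scan showed no evidence of a fracture.',
]
labels = [0, 0, 1, 1]  # finance, finance, medical, medical

model = SentenceTransformer('mjbommar/ogbert-110m-sentence')
embeddings = model.encode(documents)

# Cluster into as many groups as there are categories, then score against the true labels
kmeans = KMeans(n_clusters=len(set(labels)), n_init=10, random_state=0)
predicted = kmeans.fit_predict(embeddings)
print('ARI:', adjusted_rand_score(labels, predicted))
```

Cluster accuracy additionally requires aligning predicted cluster ids with the true labels (e.g., via Hungarian matching) before computing accuracy.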
Mean Reciprocal Rank (MRR) for same-category document retrieval.
| Model | Params | Sample MRR |
|---|---|---|
| OGBert-110M-Sentence | 110M | 0.959 |
| BERT-base | 110M | 0.994 |
| RoBERTa-base | 125M | 0.989 |
| MiniLM-L6-v2 | 22M | 0.958 |
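One way to compute this metric: treat each document as a query, rank all other documents by cosine similarity, and take the reciprocal rank of the first same-category document. A minimal sketch, reusing `embeddings` and `labels` from the clustering example above (an illustration, not necessarily the exact benchmark protocol):

```python
import numpy as np

def mean_reciprocal_rank(embeddings, labels):
    """MRR of the first same-category document when ranking by cosine similarity."""
    embeddings = np.asarray(embeddings)
    labels = np.asarray(labels)
    # Embeddings are L2 normalized, so dot products are cosine similarities
    similarities = embeddings @ embeddings.T
    np.fill_diagonal(similarities, -np.inf)  # never retrieve the query itself
    reciprocal_ranks = []
    for i in range(len(labels)):
        ranking = np.argsort(-similarities[i])        # most similar first
        hits = np.where(labels[ranking] == labels[i])[0]
        reciprocal_ranks.append(1.0 / (hits[0] + 1))  # ranks are 1-based
    return float(np.mean(reciprocal_ranks))

print('MRR:', mean_reciprocal_rank(embeddings, labels))
```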
Spearman correlation between model cosine similarities and human word-similarity judgments on SimLex-999.
| Model | Params | SimLex-999 (ρ) |
|---|---|---|
| OGBert-110M-Sentence | 110M | 0.345 |
| BERT-base | 110M | 0.070 |
| RoBERTa-base | 125M | -0.061 |
OGBert-110M-Sentence achieves roughly 5x higher SimLex-999 correlation than BERT-base (0.345 vs. 0.070).
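Word similarity is scored by encoding the two words of each pair, taking their cosine similarity, and correlating against the human ratings with Spearman's rho. A minimal sketch; the pairs and ratings below are approximate illustrations, and the real evaluation uses all 999 SimLex-999 pairs:

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

# A few illustrative pairs with approximate SimLex-999 ratings (0-10 scale)
pairs = [('old', 'new'), ('smart', 'intelligent'), ('hard', 'difficult')]
human_scores = [1.6, 9.2, 8.8]

model = SentenceTransformer('mjbommar/ogbert-110m-sentence')
left = model.encode([w1 for w1, _ in pairs])
right = model.encode([w2 for _, w2 in pairs])

# Embeddings are L2 normalized, so the row-wise dot product is cosine similarity
cosine = (left * right).sum(axis=1)
rho, _ = spearmanr(cosine, human_scores)
print('Spearman rho:', rho)
```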
At the same size as BERT-base, OGBert-110M-Sentence matches strong baselines on clustering and retrieval while substantially improving word-level similarity.

Usage with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-110m-sentence')
embeddings = model.encode(['your text here'])  # L2 normalized by default
```
Example - Domain Similarity:
```python
sentences = [
    'The financial audit revealed discrepancies in the quarterly report.',
    'An accounting review found errors in the fiscal statement.',
    'The patient was diagnosed with acute respiratory infection.',
]
embeddings = model.encode(sentences)

# Embeddings are L2 normalized, so dot products are cosine similarities
similarities = embeddings @ embeddings.T
print(similarities[0, 1])  # finance vs. finance (same domain)
print(similarities[0, 2])  # finance vs. medical (different domain)
```

The model correctly assigns higher similarity to the two same-domain (finance) sentences than to the cross-domain pair.
Alternatively, with transformers directly (mean pooling and L2 normalization must be applied manually):

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-110m-sentence')
model = AutoModel.from_pretrained('mjbommar/ogbert-110m-sentence')

inputs = tokenizer('your text here', return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over non-padding tokens + L2 normalize (critical for performance)
mask = inputs['attention_mask'].unsqueeze(-1).float()
pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
embeddings = F.normalize(pooled, p=2, dim=1)
```
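As a sanity check (a hypothetical comparison, assuming the same input text as above), the manual path should match SentenceTransformer.encode up to floating-point tolerance:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('mjbommar/ogbert-110m-sentence')
st_embeddings = st_model.encode(['your text here'])

# `embeddings` is the manually pooled and normalized tensor from the snippet above
print(np.allclose(st_embeddings, embeddings.numpy(), atol=1e-4))  # expected: True
```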
For masked language modeling or as a general-purpose encoder for further fine-tuning, use mjbommar/ogbert-110m-base instead.
If you use this model, please cite the OpenGloss dataset:
```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```
License: Apache 2.0