---
language: en
license: mit
tags:
  - security
  - compliance
  - cre
  - opencre
  - sentence-transformers
  - bi-encoder
  - cybersecurity
  - framework-mapping
  - nist
  - owasp
  - mitre-atlas
library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
  - custom
model-index:
  - name: TRACT CRE Assignment
    results:
      - task:
          type: sentence-similarity
          name: CRE Hub Assignment
        metrics:
          - name: hit@1 (micro-averaged, LOFO)
            type: accuracy
            value: 0.537
          - name: ECE (calibration error)
            type: calibration
            value: 0.079
---

# TRACT: Transitive Reconciliation and Assignment of CRE Taxonomies

## What Is This?

**In plain English:** Security frameworks like NIST 800-53, OWASP ASVS, and MITRE ATLAS each describe security controls in their own language. For example, NIST might say *"The system enforces password complexity requirements"* while OWASP says *"Verify that passwords have a minimum length of 12 characters."* These two controls are about the same thing, but they use different words and numbering systems.

[OpenCRE](https://opencre.org) is a public taxonomy that acts as a **Rosetta Stone for security frameworks** -- it organizes security concepts into ~522 "hubs" (topics like *"Password policy"*, *"Input validation"*, *"Access control"*) and maps controls from different frameworks to these hubs.

**TRACT is an AI model that automates this mapping.** Give it any security control text, and it tells you which CRE hub(s) that control belongs to. This saves hundreds of hours of manual expert work when onboarding a new security framework.

**Who is this for?**
- **Security professionals** mapping frameworks for compliance crosswalks
- **GRC (Governance, Risk, Compliance) teams** harmonizing multiple standards
- **Researchers** studying relationships across security taxonomies
- **Tool builders** who need automated framework-to-framework translation

## Quick Start

### Installation

```bash
pip install sentence-transformers numpy
```

### Basic Usage (5 lines)

```python
from sentence_transformers import SentenceTransformer
import numpy as np, json

# Load the model and its bundled data
model = SentenceTransformer("rockCO78/tract-cre-assignment")
hub_ids = json.load(open("hub_ids.json"))
hub_emb = np.load("hub_embeddings.npy")

# Predict: what CRE hub does this control belong to?
query = model.encode(["Enforce password complexity requirements"], normalize_embeddings=True)
sims = (query @ hub_emb.T)[0]
for idx in np.argsort(sims)[-5:][::-1]:
    print(f"  {hub_ids[idx]}: {sims[idx]:.3f}")
```

### Full Inference with Calibration (Recommended)

The bundled `predict.py` script handles text sanitization, temperature-scaled confidence scores, and out-of-distribution detection:

```bash
# Single control
python predict.py "Ensure AI models are tested for adversarial robustness"

# Batch from file (one control per line)
python predict.py --file controls.txt --top-k 10

# JSON output for programmatic use
python predict.py --file controls.txt --top-k 5 --json
```

Example output:
```
  555-083 (0.342) Testing against backdoor poisoning
  011-322 (0.218) Testing against evasion
  663-550 (0.147) Testing against model theft by inference
  130-171 (0.089) Runtime model io integrity controls
  234-123 (0.064) Weakening training set backdoors
```

Each line shows: `hub_id (calibrated_confidence) hub_name`. Higher confidence = stronger match. An `[OOD]` flag appears when the input is too dissimilar to anything the model has seen (see [Out-of-Distribution Detection](#out-of-distribution-detection) below).

---

## How It Works

### The Assignment Paradigm

TRACT uses an **assignment** approach, not a pairwise comparison:

```
g(control_text) --> CRE_hub_position
```

Each control is independently mapped to the CRE ontology. The model never compares two controls directly ("is control A similar to control B?"). Instead, it asks: "where in the CRE taxonomy does this control belong?"

This matters because:
1. **Scalability:** Adding a new framework requires encoding its controls once, not comparing them against every existing control
2. **Consistency:** The CRE hub assignment is independent of what other frameworks exist
3. **Transitivity:** If NIST control X maps to hub H, and OWASP control Y also maps to hub H, then X and Y are implicitly related -- without ever comparing them directly

### Architecture (Technical)

```
Input text --> [Tokenizer] --> [BGE-large-en-v1.5 + LoRA] --> 1024-dim embedding
                                                                    |
                                                               (dot product)
                                                                    |
Hub embeddings (522 x 1024, pre-computed) --------------------------+
                                                                    |
                                                          cosine similarity scores
                                                                    |
                                                    [temperature scaling (T=0.0738)]
                                                                    |
                                                          calibrated confidence
                                                                    |
                                                    [OOD check (threshold=0.568)]
                                                                    |
                                                          ranked predictions
```

- **Base model:** [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) (335M parameters, 1024-dimensional embeddings)
- **Fine-tuning method:** LoRA (Low-Rank Adaptation) -- rank=16, alpha=32, dropout=0.1, applied to query/key/value attention matrices
- **This release contains fully merged weights** -- no adapter files needed, loads like any SentenceTransformer model
- **Training objective:** MNRL (Multiple Negatives Ranking Loss) with contrastive learning -- the model learns to place controls close to their correct hub and far from incorrect hubs in embedding space
- **Text-aware batch sampling:** Training batches group semantically similar controls together, creating harder negatives that force the model to make finer distinctions
- **Training data:** 4,237 framework-to-hub links from 22 OpenCRE-linked frameworks, producing 4,061 training pairs after deduplication

---

## Evaluation

### What Is LOFO Cross-Validation?

Standard train/test splits would leak information: if OWASP ASVS controls appear in both training and test sets, the model could memorize ASVS-specific language rather than learning general security concepts.

**Leave-One-Framework-Out (LOFO)** is stricter. For each evaluation fold:
1. One entire framework is held out (e.g., all MITRE ATLAS controls)
2. The model is trained on the remaining frameworks
3. **Hub firewall:** Hub representations are rebuilt WITHOUT the held-out framework's data -- this prevents the model from "remembering" the held-out framework's contributions to hub embeddings
4. The model predicts hub assignments for the held-out framework's controls

This simulates the real use case: mapping a **brand-new framework** the model has never seen.

### Results

| Fold | hit@1 | Zero-shot | Delta | hit@any | n |
|---|---|---|---|---|---|
| MITRE ATLAS | 0.279 | 0.273 | +0.006 | 0.27906976744186046 | 43 |
| NIST AI 100-2 | 0.429 | 0.107 | +0.322 | 0.42857142857142855 | 28 |
| OWASP AI Exchange | 0.762 | 0.619 | +0.143 | 0.7619047619047619 | 63 |
| OWASP Top10 for LLM | 0.333 | 0.333 | +0.000 | 0.3333333333333333 | 6 |
| OWASP Top10 for ML | 0.714 | 0.429 | +0.285 | 0.7142857142857143 | 7 |
| **Micro average** | **0.537** | **0.400** | **+0.138** | **0.537** | **147** |


**Reading this table:**
- **hit@1:** The model's top prediction matches the correct hub (strict accuracy)
- **Zero-shot:** Accuracy of the base model before fine-tuning (the improvement from training)
- **Delta:** How much fine-tuning helped (positive = improvement)
- **hit@any:** Accuracy when the control correctly maps to multiple hubs (since ~35% of controls belong to more than one hub, this is a fairer measure)
- **n:** Number of controls in that framework's test set

**What the numbers mean:**
- **OWASP AI Exchange (76.2%):** Strong performance -- the model correctly assigns 3 out of 4 AI security controls to their right hub on the first try
- **MITRE ATLAS (27.9%):** Weakest fold. ATLAS techniques are highly specific ("Adversarial Perturbation" vs. "Data Poisoning") and map to closely related hubs that are hard to disambiguate. The model often picks a neighboring hub rather than the exact one
- **Micro average (53.7%):** Overall, the model's top prediction is correct about half the time -- and when accounting for multi-hub controls, accuracy is higher

### Confidence Intervals

All metrics include bootstrap confidence intervals (10,000 resamples, 95% CI). The aggregate hit@1 CI is [0.462, 0.612], reflecting the relatively small evaluation set (147 controls across 5 AI frameworks).

---

## Calibration: Understanding Confidence Scores

### What Is Calibration?

Raw model outputs are cosine similarities (how close two vectors are). These are useful for **ranking** (higher = better match) but are NOT probabilities. A score of 0.85 does not mean "85% chance this is correct."

TRACT applies **temperature scaling** to convert rankings into better-calibrated confidence scores:

```
confidence = softmax(similarity / T)
```

where T=0.0738 (learned from a held-out calibration set of 420 traditional framework controls).

### Calibration Metrics

| Metric | Value | What It Means |
|---|---|---|
| **Temperature (T)** | 0.0738 | Sharpens the similarity distribution -- small T means the model is very "peaky" (strongly favors top matches) |
| **ECE** | 0.079 (95% CI [0.049, 0.111]) | Expected Calibration Error -- how far confidence scores deviate from true accuracy. 0.0 = perfectly calibrated. 0.079 means scores are off by ~8 percentage points on average |
| **OOD threshold** | 0.568 | If the maximum similarity is below this, the input is likely outside the model's knowledge (see below) |
| **Conformal quantile** | 0.9971 | 99.7% of correct predictions fall above this similarity threshold |

### Out-of-Distribution Detection

When you give the model text that is completely unrelated to security (e.g., a recipe or a news article), it will still produce predictions -- but they will all have low similarity scores. The model flags inputs as **out-of-distribution (OOD)** when:

```
max(similarity_to_any_hub) < 0.568
```

OOD predictions are marked with `[OOD]` in the output. **Treat OOD predictions with extra skepticism** -- they indicate the model is guessing rather than making an informed assignment.

---

## Bridge Analysis: Connecting AI and Traditional Security

### Background

The CRE ontology contains 522 hubs. Some hubs are linked only by AI security frameworks (like MITRE ATLAS), some only by traditional frameworks (like NIST 800-53), and some by both:

| Category | Count | Example |
|---|---|---|
| AI-only | 21 | "Testing against evasion," "GenAI model alignment" |
| Traditional-only | 382 | "Input validation," "Password policy" |
| Naturally bridged (both) | 60 | "Data poisoning" (linked by both ATLAS and CWE) |
| Unlinked (structural) | 59 | Internal grouping nodes without framework links |

### What Bridge Analysis Does

For the 21 AI-only hubs, the model identifies which traditional hubs are conceptually closest using embedding similarity. For example:

> "Human AI oversight" (AI-only) ←→ "Security governance regarding people" (traditional)
> Cosine similarity: 0.774

Both hubs are about the same core concept: **humans must remain accountable for security decisions**, whether in AI systems or traditional security programs.

### Method and Review Process

1. **Compute similarity matrix:** 21 AI-only hubs x 382 traditional-only hubs (8,022 pairs)
2. **Extract top-3:** For each AI-only hub, take the 3 most similar traditional hubs (63 candidates total)
3. **Expert review:** A human security expert reviewed all 63 candidates and accepted or rejected each based on domain knowledge -- the similarity score is a ranking signal, not an automatic classifier
4. **Acceptance threshold:** Candidates above the 99th percentile of the full similarity matrix (cosine >= 0.45) were considered; 4 additional candidates were rejected for specious LLM-rationalized connections

### Results

- **Candidates evaluated:** 63
- **Accepted bridges:** 46 (recorded as bidirectional `related_hub_ids` in the hierarchy)
- **Rejected:** 17

Accepted bridges are stored in `cre_hierarchy.json` as `related_hub_ids`. They represent **lateral conceptual connections** between AI and traditional security -- they do not change the hierarchical structure, model weights, or calibration.

Full bridge evidence, similarity scores, and review decisions are in `bridge_report.json`.

---

## Bundled Files

This repository contains the model plus all data needed for standalone inference:

| File | Size | Purpose |
|---|---|---|
| `0_Transformer/model.safetensors` | ~1.3 GB | Fully merged model weights (BGE-large + LoRA, no adapter needed) |
| `predict.py` | ~5 KB | Standalone inference script -- run without installing TRACT |
| `train.py` | ~3 KB | Reproduction guide with exact hyperparameters |
| `hub_ids.json` | ~12 KB | Ordered list of 522 hub IDs matching model output dimensions |
| `hub_embeddings.npy` | ~2 MB | Pre-computed 522 x 1024 hub embedding matrix |
| `cre_hierarchy.json` | ~800 KB | Full CRE taxonomy tree with bridge links |
| `hub_descriptions.json` | ~200 KB | Human-readable descriptions for each hub |
| `calibration.json` | ~1 KB | Temperature, OOD threshold, conformal quantile |
| `bridge_report.json` | ~15 KB | Bridge analysis evidence and review decisions |

### Reproducing the Model

See `train.py` for the exact configuration. Full reproduction requires cloning the [TRACT repository](https://github.com/rockcyber/TRACT) which contains custom training procedures (text-aware batch sampling, LOFO cross-validation with hub firewall, temperature-scaled contrastive loss).

---

## Detailed Usage Examples

### Example 1: Map a Single Control

```python
from sentence_transformers import SentenceTransformer
import numpy as np
import json

# Load everything
model = SentenceTransformer("rockCO78/tract-cre-assignment")
hub_ids = json.load(open("hub_ids.json"))
hub_emb = np.load("hub_embeddings.npy")        # shape: (522, 1024)
hierarchy = json.load(open("cre_hierarchy.json"))
cal = json.load(open("calibration.json"))

# Encode your control text (normalize_embeddings=True is required)
text = "The application must validate all user input before processing"
query = model.encode([text], normalize_embeddings=True)  # shape: (1, 1024)

# Compute similarities (dot product = cosine for unit vectors)
sims = (query @ hub_emb.T)[0]                  # shape: (522,)

# Apply temperature scaling for calibrated confidence
def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

confidence = softmax(sims / cal["t_deploy"])

# Get top-5 predictions
top5 = np.argsort(confidence)[-5:][::-1]
for idx in top5:
    hid = hub_ids[idx]
    hub = hierarchy["hubs"][hid]
    ood = " [OOD]" if float(np.max(sims)) < cal["ood_threshold"] else ""
    print(f"  {hid} ({confidence[idx]:.3f}){ood} {hub['name']}")
    print(f"    Path: {hub['hierarchy_path']}")
```

### Example 2: Batch-Map an Entire Framework

```python
import json
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("rockCO78/tract-cre-assignment")
hub_ids = json.load(open("hub_ids.json"))
hub_emb = np.load("hub_embeddings.npy")

# Your framework controls (e.g., parsed from a CSV or JSON)
controls = [
    {"id": "AC-1", "text": "Access control policy and procedures"},
    {"id": "AC-2", "text": "Account management and provisioning"},
    {"id": "IA-5", "text": "Authenticator management including password rules"},
]

# Encode all controls at once (much faster than one at a time)
texts = [c["text"] for c in controls]
embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)

# Compute all similarities in one matrix multiply
all_sims = embeddings @ hub_emb.T  # shape: (n_controls, 522)

# Build crosswalk
crosswalk = []
for i, ctrl in enumerate(controls):
    top_idx = int(np.argmax(all_sims[i]))
    crosswalk.append({
        "control_id": ctrl["id"],
        "control_text": ctrl["text"],
        "predicted_hub": hub_ids[top_idx],
        "similarity": round(float(all_sims[i, top_idx]), 4),
    })

# Save as JSON
with open("crosswalk.json", "w") as f:
    json.dump(crosswalk, f, indent=2)
```

### Example 3: Find Related Hubs via Bridges

```python
import json

hierarchy = json.load(open("cre_hierarchy.json"))

# Find all AI/traditional bridge connections
for hub_id, hub in hierarchy["hubs"].items():
    related = hub.get("related_hub_ids", [])
    if related:
        print(f"{hub['name']} ({hub_id})")
        for rid in related:
            rhub = hierarchy["hubs"][rid]
            print(f"  <-> {rhub['name']} ({rid})")
        print()
```

---

## Limitations and Known Issues

1. **ATLAS fold performance (27.9% hit@1):** MITRE ATLAS techniques map to closely related hubs (e.g., "Data Poisoning" vs. "Adversarial Perturbation") that are hard to disambiguate. The model often predicts a neighboring hub rather than the exact one. hit@5 is 27.9%, showing the correct hub is usually in the top 5.

2. **Multi-hub controls (35%):** About 1 in 3 controls legitimately maps to more than one hub. hit@1 alone understates performance -- the hit@any column in the evaluation table is a fairer measure.

3. **Calibration is approximate:** ECE=0.079 means confidence scores are off by ~8 percentage points on average. Treat them as ordinal rankings (higher = better), not as exact probabilities.

4. **Training data scope:** Calibrated on 420 traditional framework holdout items. Accuracy on AI-specific text may differ from the reported metrics, especially for concepts not well-represented in the 5 AI frameworks.

5. **Not a replacement for expert judgment:** Model predictions are a **starting point** for compliance crosswalks. A security professional should review all assignments, especially for high-stakes compliance work.

6. **Language:** English only. The base model (BGE-large-en-v1.5) and all training data are English.

7. **What does NOT work for this task:** DeBERTa-v3-NLI achieves hit@1=0.000 -- Natural Language Inference (textual entailment) is fundamentally different from semantic similarity for taxonomy assignment. Do not substitute NLI models.

---

## Ethical Considerations

- This model is a **decision-support tool**, not an autonomous compliance engine. All predictions require human review before use in security assessments or regulatory filings.
- The model was trained on publicly available security framework data. No proprietary or confidential data was used.
- Active learning rounds during development used expert-reviewed predictions, not autonomous deployment.
- Bridge analysis connections were individually reviewed by a human security expert; automated connections were not added without review.

## Environmental Impact

- **Training compute:** NVIDIA H100 GPU via RunPod, 4.2 GPU-hours total (including LOFO cross-validation, ablation studies, and final deployment model)
- **Inference deployment:** Runs on an NVIDIA Jetson Orin AGX edge device (~30W TDP). A single control prediction takes <100ms on consumer hardware.
- **Carbon context:** Estimated 1.3 kWh training energy (US average grid: ~0.5 kg CO2e)

## Glossary

| Term | Definition |
|---|---|
| **CRE** | Common Requirements Enumeration -- a universal taxonomy of security topics maintained by [OpenCRE.org](https://opencre.org) |
| **Hub** | A node in the CRE taxonomy tree representing a security concept (e.g., "Input validation," "Access control") |
| **LOFO** | Leave-One-Framework-Out -- cross-validation method where an entire framework is held out for testing |
| **Hub firewall** | During LOFO evaluation, hub embeddings are rebuilt WITHOUT the held-out framework to prevent information leakage |
| **hit@1** | The model's single best prediction matches the correct hub |
| **hit@any** | The model's top prediction matches ANY of the control's correct hubs (relevant for multi-hub controls) |
| **ECE** | Expected Calibration Error -- measures how well confidence scores match actual accuracy |
| **OOD** | Out-of-Distribution -- input text is too different from training data for reliable prediction |
| **LoRA** | Low-Rank Adaptation -- an efficient fine-tuning method that trains small adapter matrices instead of modifying all model weights |
| **Bridge** | A discovered conceptual connection between an AI-specific and a traditional CRE hub |
| **Temperature scaling** | A post-hoc calibration technique that sharpens or smooths the model's output distribution |

## Citation

```bibtex
@software{tract2026,
  title = {TRACT: Transitive Reconciliation and Assignment of CRE Taxonomies},
  author = {Rock},
  year = {2026},
  url = {https://github.com/rockcyber/TRACT}
}
```

## License

MIT License for model weights and code. The base model ([BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5)) is also MIT licensed.

Bundled data files (CRE hierarchy, hub descriptions, bridge report) are sourced from publicly available security frameworks and [OpenCRE.org](https://opencre.org), provided under CC0 1.0 Universal.