---
license: apache-2.0
language:
- en
tags:
- text-classification
- fact-checking
- hallucination-detection
- modernbert
- lora
- llm-routing
- llm-gateway
datasets:
- squad
- trivia_qa
- hotpot_qa
- truthful_qa
- databricks/databricks-dolly-15k
- tatsu-lab/alpaca
- pminervini/HaluEval
- neural-bridge/rag-dataset-12000
pipeline_tag: text-classification
base_model: answerdotai/ModernBERT-base
metrics:
- accuracy
- f1
model-index:
- name: HaluGate-Sentinel
  results:
  - task:
      type: text-classification
      name: Fact-Check Need Classification
    metrics:
    - type: accuracy
      value: 0.964
      name: Validation Accuracy
    - type: f1
      value: 0.965
      name: F1 Score
---

# HaluGate Sentinel — Prompt Fact-Check Switch for Hallucination Gatekeeper

**HaluGate Sentinel** is a ModernBERT + LoRA classifier that decides whether an incoming user prompt **requires factual verification**. It *does not* check facts itself. Instead, it acts as a **frontline switch** in an LLM routing / gateway system, deciding whether a request should enter a **fact-checking / RAG / hallucination-mitigation pipeline**.

The model classifies prompts into:

- **`FACT_CHECK_NEEDED`**: Information-seeking queries that depend on external/world knowledge
  - e.g., “When was the Eiffel Tower built?”
  - e.g., “What is the GDP of Japan in 2023?”
- **`NO_FACT_CHECK_NEEDED`**: Creative, coding, opinion, or pure reasoning/math tasks
  - e.g., “Write a poem about spring”
  - e.g., “Implement quicksort in Python”
  - e.g., “What is the meaning of life?”

This model is part of the **Hallucination Gatekeeper** stack for `llm-semantic-router`.

---

## Model Details

- **Model name**: `HaluGate Sentinel`
- **Repository**: `llm-semantic-router/halugate-sentinel`
- **Task**: Binary text classification (prompt-level)
- **Labels**:
  - `0` → `NO_FACT_CHECK_NEEDED`
  - `1` → `FACT_CHECK_NEEDED`
- **Base model**: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
- **Fine-tuning method**: LoRA (rank = 16, alpha = 32)
- **Validation Accuracy**: 96.4%
- **Validation F1 Score**: 0.965
- **Edge-case accuracy**: 100% on a 27-sample curated test set of borderline prompt types

---

## Position in a Hallucination Mitigation Pipeline

HaluGate Sentinel is designed as **Stage 0** in a multi-stage hallucination mitigation architecture:

1. **Stage 0 — HaluGate Sentinel (this model)**
   Classifies user prompts and decides whether **fact-checking is needed**:
   - `NO_FACT_CHECK_NEEDED` → Route directly to LLM generation.
   - `FACT_CHECK_NEEDED` → Route into the **Hallucination Gatekeeper** path (RAG, tools, verifiers).
2. **Stage 1+ — Answer-level hallucination models (e.g., “HaluGate Verifier”)**
   Operate on *(query, answer, evidence)* to detect hallucinations and enforce trust policies.

HaluGate Sentinel focuses solely on **prompt intent classification** to minimize unnecessary compute while preserving safety for factual queries.
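As an illustration only, the sketch below shows how the Stage 0 decision could gate a downstream answer-level verifier. It reuses the `classify_prompt` helper defined in the Usage section below; every other helper (`generate_directly`, `retrieve_evidence`, `generate_with_evidence`, `verify_answer`, `fallback_response`) is a hypothetical placeholder for your own pipeline components, not part of this repository.

```python
def handle_request(user_prompt: str) -> str:
    # Stage 0: HaluGate Sentinel decides whether the fact-sensitive path is needed.
    label, confidence = classify_prompt(user_prompt)

    if label == "NO_FACT_CHECK_NEEDED":
        return generate_directly(user_prompt)  # cheap path: no retrieval, no verification

    # Fact-sensitive path: retrieve evidence, generate, then verify the answer.
    evidence = retrieve_evidence(user_prompt)               # e.g. RAG / tool calls
    answer = generate_with_evidence(user_prompt, evidence)

    # Stage 1+: answer-level verifier operating on (query, answer, evidence).
    if verify_answer(user_prompt, answer, evidence):
        return answer
    return fallback_response(user_prompt)                   # e.g. abstain or re-generate
```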
---

## Usage

### Basic Inference

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "llm-semantic-router/halugate-sentinel"

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

id2label = model.config.id2label  # {0: 'NO_FACT_CHECK_NEEDED', 1: 'FACT_CHECK_NEEDED'}


def classify_prompt(text: str):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_id = int(torch.argmax(probs).item())
    label = id2label.get(pred_id, str(pred_id))
    confidence = float(probs[pred_id].item())
    return label, confidence


# Examples
print(classify_prompt("When was the Eiffel Tower built?"))     # → ('FACT_CHECK_NEEDED', 0.99...)
print(classify_prompt("Write a poem about spring"))            # → ('NO_FACT_CHECK_NEEDED', 0.98...)
print(classify_prompt("Implement a binary search in Python"))  # → ('NO_FACT_CHECK_NEEDED', 0.97...)
```

### Example: Integrating with a Router / Gateway

Pseudocode for a routing decision:

```python
label, prob = classify_prompt(user_prompt)

FACT_CHECK_THRESHOLD = 0.6  # configurable based on your risk appetite

if label == "FACT_CHECK_NEEDED" and prob >= FACT_CHECK_THRESHOLD:
    route = "hallucination_gatekeeper"  # RAG / tools / verifiers
else:
    route = "direct_generation"

# Use `route` to select downstream pipelines in your LLM gateway.
```

---

## Training Data

Balanced dataset of **50,000** prompts:

### FACT_CHECK_NEEDED (25,000 samples)

Information-seeking and knowledge-intensive questions drawn from:

* **NISQ-ISQ**: Gold-standard information-seeking questions
* **HaluEval**: Hallucination-focused QA benchmark
* **FaithDial**: Information-seeking dialogue questions
* **FactCHD**: Fact-conflicting / hallucination-prone queries
* **SQuAD, TriviaQA, HotpotQA**: Standard factual QA datasets
* **TruthfulQA**: High-risk factual queries
* **CoQA**: Conversational factual questions

### NO_FACT_CHECK_NEEDED (25,000 samples)

Tasks that typically do **not** require external factual verification:

* **NISQ-NonISQ**: Non-information-seeking questions
* **Databricks Dolly**: Creative writing, summarization, brainstorming
* **WritingPrompts**: Creative writing prompts
* **Alpaca**: Coding, math, opinion, and general instructions

The objective is to approximate “does this prompt require world knowledge / external facts?” rather than “is the answer true?”.

---

## Intended Use

### Primary Use Cases

* **LLM Gateway / Router**
  * Decide if a prompt must be routed into a **fact-aware pipeline** (RAG, tools, knowledge base, verifiers).
  * Avoid unnecessary compute for creative / coding / opinion tasks.
* **Hallucination Gatekeeper Frontline**
  * Only enable expensive hallucination detection for prompts labeled `FACT_CHECK_NEEDED`.
  * Implement different safety and latency policies for the two classes.
* **Traffic Analytics & Risk Scoring** (see the sketch at the end of this section)
  * Monitor the proportion of factual vs non-factual traffic.
  * Adjust infrastructure sizing for retrieval / tool-heavy pipelines accordingly.

### Non-Goals

* It does *not* verify the correctness of any answer.
* It should not be used as a generic toxicity / safety classifier.
* It does not handle non-English prompts reliably (trained on English only).
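The snippet below sketches the traffic-analytics use case mentioned above. It is illustrative only: `classify_prompt` is the helper from the Usage section, and the sample prompts are arbitrary.

```python
from collections import Counter

# Hypothetical batch of recent gateway traffic.
recent_prompts = [
    "When was the Eiffel Tower built?",
    "Write a haiku about autumn",
    "What is the capital of Australia?",
    "Refactor this function to use list comprehensions",
]

counts = Counter()
for prompt in recent_prompts:
    label, _ = classify_prompt(prompt)  # helper from the Usage section above
    counts[label] += 1

factual_share = counts["FACT_CHECK_NEEDED"] / len(recent_prompts)
print(f"Factual traffic share: {factual_share:.0%}")
# A rising share suggests provisioning more retrieval / verifier capacity.
```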
---

## How It Works

* **Architecture**:
  * ModernBERT-base encoder
  * Classification head on top of the `[CLS]` / pooled representation
* **Fine-tuning**:
  * LoRA on the base encoder
  * Cross-entropy loss over the two labels
  * Balanced sampling between FACT_CHECK_NEEDED and NO_FACT_CHECK_NEEDED
* **Decision Boundary**:
  * Borderline / philosophical / highly abstract questions may be assigned lower confidence.
  * Downstream systems are encouraged to use the **confidence score** as a soft signal, not a hard oracle.

---

## Limitations

* **Language**:
  * Trained on English data only.
  * Performance on other languages is not guaranteed.
* **Borderline Queries**:
  * Philosophical or hybrid prompts (e.g. “Is time travel possible?”) may be ambiguous.
  * In such cases, consider inspecting the model confidence and implementing a “default-to-safe” policy.
* **Domain Coverage**:
  * General-purpose factual tasks are well covered; highly specialized verticals (e.g. niche scientific domains) were not explicitly targeted during fine-tuning.
* **Not a Verifier**:
  * This model only decides if a prompt **needs factual support**.
  * Actual hallucination detection and answer verification must be handled by separate models (e.g., answer-level verifiers).

---

## Ethical Considerations

* **Risk Trade-off**:
  * Over-classifying prompts as `NO_FACT_CHECK_NEEDED` may reduce safety for borderline factual tasks.
  * Over-classifying as `FACT_CHECK_NEEDED` increases compute cost but is safer in high-risk environments.
* **Deployment Recommendation**:
  * For safety-critical domains (finance, healthcare, legal, etc.), configure conservative thresholds and fallbacks that favor routing more traffic through the fact-checking path.

---

## Citation

If you use HaluGate Sentinel in academic work or production systems, please cite:

```bibtex
@software{halugate_sentinel_2024,
  title  = {HaluGate Sentinel: Prompt-Level Fact-Check Switch for Hallucination Gatekeepers},
  author = {vLLM Project},
  year   = {2024},
  url    = {https://github.com/vllm-project/semantic-router}
}
```

---

## Acknowledgements

* Base encoder: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
* Training datasets: SQuAD, TriviaQA, HotpotQA, TruthfulQA, CoQA, Dolly, Alpaca, WritingPrompts, HaluEval, and others listed above.
* Designed for integration with the **vLLM Semantic Router** and broader **Hallucination Gatekeeper** ecosystem.
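---

## Fine-Tuning Sketch (Illustrative)

For readers who want to train a similar classifier, the snippet below is a minimal sketch of a ModernBERT + LoRA setup using the rank/alpha values reported in Model Details (16/32). It assumes the `transformers` and `peft` libraries; the dropout value, target-module choice, and training loop are assumptions, not the exact recipe used for this model.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

BASE_MODEL = "answerdotai/ModernBERT-base"

id2label = {0: "NO_FACT_CHECK_NEEDED", 1: "FACT_CHECK_NEEDED"}
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL, num_labels=2, id2label=id2label, label2id=label2id
)

# Rank and alpha match the card; dropout and target modules are illustrative defaults.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules="all-linear",  # generic choice; explicit layer names also work
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# From here, train with the standard Hugging Face Trainer (or a custom loop)
# on a balanced FACT_CHECK_NEEDED / NO_FACT_CHECK_NEEDED dataset.
```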