Binary Transformers: Learning Language from Raw Binary
Zero-tokenization transformers that learn directly from network bytes, bits, and beyond.
This repository contains four novel transformer architectures exploring the limits of minimal vocabulary learning:
| Model | Vocab | Input | Weights | Description |
|---|---|---|---|---|
| Byte-level | 256 | bytes (0x00-0xFF) | real | One token per byte value |
| Bit-level | 2 | bits (0, 1) | real | Pure binary, 8 tokens per byte |
| Dibit | 4 | dibits (00,01,10,11) | real | 2-bit tokens, 4 per byte |
| Pure Binary | 2 | bits (0, 1) | binary (-1/+1) | BITS ALL THE WAY DOWN |
Why?
Traditional LLMs use tokenizers (BPE, SentencePiece) with 32k-256k vocabulary. This creates:
- Tokenizer overhead and complexity
- Language/domain bias baked into vocabulary
- Preprocessing bottleneck
What if we eliminated tokenization entirely?
These models learn directly from raw binary data - no tokenizer, no preprocessing, just bytes flowing into neural networks. The ultimate goal: wire-speed learning where models absorb network traffic in real-time.
Results (Live Experiments - 16 Jan 2026)
Byte-Level (vocab=256)
Data: 350KB web crawl
BPB: 4.68 (vs 8.0 random = 41% compression)
Speed: 8.7 KB/s learning rate
Params: 0.6M
Learns HTML structure, XML tags, timestamps from raw bytes.
Bit-Level (vocab=2)
Data: 550KB
Entropy: 1.008 bit/bit (vs 1.0 random = 0.8% compression)
Speed: 0.7 KB/s
Params: 85M
Pure binary learning - discovers byte boundaries and ASCII from 0s and 1s.
Dibit (vocab=4: 00,01,10,11)
Data: 437KB
BPB: 7.55 (vs 8.0 random = 5.7% compression)
Speed: 0.25 KB/s
Params: 37.8M
2-bit tokens provide 2x context efficiency vs bit-level. Best compression so far!
Pure Binary (vocab=2, binary weights)
Data: 806KB
Entropy: 0.995 bit/bit (0.5% compression)
Binary params: 99.8%
Params: 4.7M
BITS ALL THE WAY DOWN - input bits, binary weights (-1/+1), output bits. On specialized hardware, this enables XNOR+popcount operations instead of multiply-accumulate.
Architecture
All models use standard transformer architecture with:
- Causal self-attention
- GELU activation
- LayerNorm
- AdamW optimizer
- Straight-Through Estimator (STE) for binary weight gradients
Key Innovation: Online Learning
Unlike traditional batch training, these models learn from streaming data:
- Micro-batches (32-512 tokens)
- Single-pass, no data curation
- Real-time network stream compatible
Usage
Byte-Level
# Pipe any data source
cat data.bin | python byte_trainer.py
curl -s http://example.com | python byte_trainer.py
zcat crawl.jsonl.gz | python byte_trainer.py
Bit-Level
cat data.bin | python bit_trainer.py
Dibit (2-bit tokens)
cat data.bin | python dibit_trainer.py
Pure Binary (binary weights)
cat data.bin | python purebit_trainer.py
Configuration
Edit the CONFIG dict in each trainer:
CONFIG = {
"d": 256, # embedding dimension
"layers": 6, # transformer layers
"heads": 8, # attention heads
"vocab": 2, # vocabulary size
"ctx": 2048, # context length
}
Files
byte_trainer.py # Vocab=256, one token per byte
bit_trainer.py # Vocab=2, pure bits
dibit_trainer.py # Vocab=4, 2-bit tokens (00,01,10,11)
purebit_trainer.py # Vocab=2 + binary weights (-1/+1)
Insights
Byte-level is sweet spot - 256 vocab captures ASCII structure efficiently while eliminating tokenizer overhead
Bit-level works but slow - 8x longer sequences mean 8x less context per forward pass
Dibit balances - 2-bit tokens give 2x context vs bit-level while staying "pure binary"
Binary weights viable - 99.8% binary params learn almost as well as real weights, enabling massive hardware speedups
HTML is natural SFT - Web data contains instruction-following patterns:
<h3>Question</h3><p>Answer,<dt>Term</dt><dd>Definition</dd>, JSON Q&A
Future Work
- Scale to billions of parameters
- Custom CUDA kernels for binary ops (XNOR + popcount)
- FPGA/ASIC implementation for true wire-speed learning
- Hierarchical binary models (bit โ byte โ word emergence)
Citation
@misc{opentransformer2026binary,
title={Binary Transformers: Learning Language from Raw Binary},
author={OpenTransformer},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/OpenTransformer/binary-transformers}
}
License
MIT
Acknowledgments
Built with PyTorch. Trained on vast.ai GPU instances. Part of the AGILLM research project.