Binary Transformers: Learning Language from Raw Binary

Zero-tokenization transformers that learn directly from network bytes, bits, and beyond.

This repository contains four novel transformer architectures exploring the limits of minimal vocabulary learning:

Model	Vocab	Input	Weights	Description
Byte-level	256	bytes (0x00-0xFF)	real	One token per byte value
Bit-level	2	bits (0, 1)	real	Pure binary, 8 tokens per byte
Dibit	4	dibits (00,01,10,11)	real	2-bit tokens, 4 per byte
Pure Binary	2	bits (0, 1)	binary (-1/+1)	BITS ALL THE WAY DOWN

Why?

Traditional LLMs use tokenizers (BPE, SentencePiece) with 32k-256k vocabulary. This creates:

Tokenizer overhead and complexity
Language/domain bias baked into vocabulary
Preprocessing bottleneck

What if we eliminated tokenization entirely?

These models learn directly from raw binary data - no tokenizer, no preprocessing, just bytes flowing into neural networks. The ultimate goal: wire-speed learning where models absorb network traffic in real-time.

Results (Live Experiments - 16 Jan 2026)

Byte-Level (vocab=256)

Data: 350KB web crawl
BPB: 4.68 (vs 8.0 random = 41% compression)
Speed: 8.7 KB/s learning rate
Params: 0.6M

Learns HTML structure, XML tags, timestamps from raw bytes.

Bit-Level (vocab=2)

Data: 550KB
Entropy: 1.008 bit/bit (vs 1.0 random = 0.8% compression)
Speed: 0.7 KB/s
Params: 85M

Pure binary learning - discovers byte boundaries and ASCII from 0s and 1s.

Dibit (vocab=4: 00,01,10,11)

Data: 437KB
BPB: 7.55 (vs 8.0 random = 5.7% compression)
Speed: 0.25 KB/s
Params: 37.8M

2-bit tokens provide 2x context efficiency vs bit-level. Best compression so far!

Pure Binary (vocab=2, binary weights)

Data: 806KB
Entropy: 0.995 bit/bit (0.5% compression)
Binary params: 99.8%
Params: 4.7M

BITS ALL THE WAY DOWN - input bits, binary weights (-1/+1), output bits. On specialized hardware, this enables XNOR+popcount operations instead of multiply-accumulate.

Architecture

All models use standard transformer architecture with:

Causal self-attention
GELU activation
LayerNorm
AdamW optimizer
Straight-Through Estimator (STE) for binary weight gradients

Key Innovation: Online Learning

Unlike traditional batch training, these models learn from streaming data:

Micro-batches (32-512 tokens)
Single-pass, no data curation
Real-time network stream compatible

Usage

Byte-Level

# Pipe any data source
cat data.bin | python byte_trainer.py
curl -s http://example.com | python byte_trainer.py
zcat crawl.jsonl.gz | python byte_trainer.py

Bit-Level

cat data.bin | python bit_trainer.py

Dibit (2-bit tokens)

cat data.bin | python dibit_trainer.py

Pure Binary (binary weights)

cat data.bin | python purebit_trainer.py

Configuration

Edit the CONFIG dict in each trainer:

CONFIG = {
    "d": 256,      # embedding dimension
    "layers": 6,   # transformer layers
    "heads": 8,    # attention heads
    "vocab": 2,    # vocabulary size
    "ctx": 2048,   # context length
}

Files

byte_trainer.py    # Vocab=256, one token per byte
bit_trainer.py     # Vocab=2, pure bits
dibit_trainer.py   # Vocab=4, 2-bit tokens (00,01,10,11)
purebit_trainer.py # Vocab=2 + binary weights (-1/+1)

Insights

Byte-level is sweet spot - 256 vocab captures ASCII structure efficiently while eliminating tokenizer overhead
Bit-level works but slow - 8x longer sequences mean 8x less context per forward pass
Dibit balances - 2-bit tokens give 2x context vs bit-level while staying "pure binary"
Binary weights viable - 99.8% binary params learn almost as well as real weights, enabling massive hardware speedups
HTML is natural SFT - Web data contains instruction-following patterns: <h3>Question</h3><p>Answer, <dt>Term</dt><dd>Definition</dd>, JSON Q&A

Future Work

Scale to billions of parameters
Custom CUDA kernels for binary ops (XNOR + popcount)
FPGA/ASIC implementation for true wire-speed learning
Hierarchical binary models (bit → byte → word emergence)

Citation

@misc{opentransformer2026binary,
  title={Binary Transformers: Learning Language from Raw Binary},
  author={OpenTransformer},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/OpenTransformer/binary-transformers}
}

License

MIT

Acknowledgments

Built with PyTorch. Trained on vast.ai GPU instances. Part of the AGILLM research project.

Downloads last month: -; Downloads are not tracked for this model. How to track