ArcadaLabs/Ouro-2.6B-mlx-bf16
Unquantized (bf16) MLX conversion of ByteDance/Ouro-2.6B, a Looped Language Model that applies the same transformer blocks recurrently to achieve performance well above its parameter count.
This is a full-precision conversion with no quantization. For a smaller 4-bit quantized version, see mlx-community/Ouro-2.6B-4bit.
Use with mlx-lm
pip install mlx-lm
Important: mlx-lm does not ship with built-in support for the Ouro architecture. You need to add a custom ouro.py model file to your mlx-lm models directory. See the "Setup Notes" section below.
from mlx_lm import load, generate
model, tokenizer = load("ArcadaLabs/Ouro-2.6B-mlx-bf16")
messages = [{"role": "user", "content": "hello"}]
prompt = tokenizer.apply_chat_template(
messages, add_generation_prompt=True, tokenize=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
Setup Notes
As of mlx-lm 0.29.1, the "ouro" model type is not recognized out of the box. To use this model, you need to place a custom ouro.py file in your mlx-lm models directory (typically at <your-venv>/lib/python3.x/site-packages/mlx_lm/models/ouro.py).
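To find where the custom file belongs on a given machine, the installed package location can be looked up instead of guessing the site-packages path. The snippet below is a minimal sketch: the helper names `models_dir` and `has_model_file` are hypothetical (not part of mlx-lm), and it assumes the models directory is the `models` subfolder of the installed `mlx_lm` package, as the path above suggests.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def models_dir(package: str = "mlx_lm") -> Optional[Path]:
    """Return <package>/models if the package is installed, else None.

    Hypothetical helper: only inspects the import system and does not
    import the package itself.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    package_root = Path(list(spec.submodule_search_locations)[0])
    return package_root / "models"


def has_model_file(package: str = "mlx_lm", name: str = "ouro.py") -> bool:
    """Check whether the named model file is already in the models directory."""
    directory = models_dir(package)
    return directory is not None and (directory / name).is_file()


if __name__ == "__main__":
    print("models directory:", models_dir())
    print("ouro.py present:", has_model_file())
```

If the first line prints `None`, mlx-lm is not installed in the active environment; if the second prints `False`, copy `ouro.py` into the printed directory.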
The implementation is adapted from the mlx-community/Ouro-2.6B-Thinking-4bit model card's architecture and the original ByteDance/Ouro-2.6B PyTorch implementation. Key adaptations for MLX:
- Sandwich RMSNorm (pre and post normalization for both attention and MLP)
- Recurrent looping over transformer blocks (total_ut_steps passes)
- KV cache sized to num_hidden_layers * total_ut_steps entries
- Early exit gate (linear projection, used during training)
A reference ouro.py implementation is included in this repository.
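The looping and cache sizing listed above can be illustrated without MLX. The following is a framework-free sketch, not the reference implementation: `cache_slot` and `looped_forward` are hypothetical names, the toy "layers" are plain Python callables, and the step-major slot ordering (`step * num_hidden_layers + layer`) is an assumption that the actual ouro.py may or may not share.

```python
from typing import Any, Callable, List, Optional, Tuple

# A toy layer takes (hidden_state, cache_entry) and returns the new pair.
Layer = Callable[[Any, Any], Tuple[Any, Any]]


def cache_slot(step: int, layer: int, num_hidden_layers: int) -> int:
    """Map (loop step, layer index) to a flat cache index (assumed ordering)."""
    return step * num_hidden_layers + layer


def looped_forward(
    x: Any,
    layers: List[Layer],
    total_ut_steps: int,
    cache: Optional[List[Any]] = None,
) -> Tuple[Any, List[Any]]:
    """Apply the same stack of layers total_ut_steps times.

    Each (step, layer) pair gets its own cache entry, so the cache holds
    num_hidden_layers * total_ut_steps entries in total.
    """
    num_hidden_layers = len(layers)
    if cache is None:
        cache = [None] * (num_hidden_layers * total_ut_steps)
    for step in range(total_ut_steps):
        for index, layer in enumerate(layers):
            slot = cache_slot(step, index, num_hidden_layers)
            x, cache[slot] = layer(x, cache[slot])
    return x, cache


if __name__ == "__main__":
    # Toy layer: adds 1 to the state and counts how often its slot was used.
    toy_layers = [lambda h, c: (h + 1, (c or 0) + 1)] * 3
    out, kv = looped_forward(0, toy_layers, total_ut_steps=2)
    print(out, len(kv))
```

The weights are shared across passes, but the keys and values differ on each pass because the hidden state differs, which is why the cache needs one entry per (step, layer) pair rather than one per layer.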
Benchmark (Single Prompt, Informal)
Tested on Apple M5 Pro (64GB). This is a single-prompt sanity check to verify the model runs correctly, give a rough sense of throughput, and show that the conversion produces coherent output. It is not a rigorous benchmark.
Prompt:
A medieval kingdom discovers electricity 300 years before the industrial revolution. Write a short analysis of: (1) what societal structures would change first, (2) what would remain surprisingly unchanged, and (3) one second-order consequence that most people wouldn't immediately consider. Be specific and cite historical parallels where relevant.
Performance:
| Variant | Tokens | Time | Throughput | Peak Memory |
|---|---|---|---|---|
| MLX bf16 (this model) | 1024 | 85.3s | 12.0 tok/s | 5.76 GB |
| PyTorch fp16 on MPS | 1024 | 148.9s | 6.9 tok/s | 4.97 GB |
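The throughput column follows directly from the token count and wall-clock time in the table. A short check of that arithmetic, using only the figures reported above (the `throughput` helper is a hypothetical name used for illustration):

```python
def throughput(tokens: int, seconds: float) -> float:
    """Tokens generated per second of wall-clock time."""
    return tokens / seconds


mlx_tps = throughput(1024, 85.3)       # MLX bf16 row
pytorch_tps = throughput(1024, 148.9)  # PyTorch fp16 on MPS row
speedup = mlx_tps / pytorch_tps        # ratio for this single run only

print(f"MLX: {mlx_tps:.1f} tok/s, PyTorch: {pytorch_tps:.1f} tok/s, "
      f"ratio: {speedup:.2f}x")
```

The ratio describes this one sampled run and should not be read as a general speedup figure.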
MLX bf16 output (this model):
Full response (1024 tokens):
<think>
Okay, let's tackle this question. The user wants an analysis of how a medieval kingdom
discovering electricity 300 years early would affect societal structures, what remains
unchanged, and a second-order consequence. They also want historical parallels.
First, I need to break down each part. For part (1), societal structures changing first.
Electricity in a medieval context -- so no industrial base, feudal systems, agrarian economy.
The first changes would likely be in technology and maybe some social mobility. Maybe the
church or nobility controlling electricity? Like how the Catholic Church controlled knowledge
in the Middle Ages. So perhaps the church or ruling elites would monopolize electricity,
leading to new power structures. Maybe the rise of a new class of engineers or technicians.
Historical parallel could be the Renaissance's spread of knowledge, but here it's electricity
instead.
For part (2), what remains unchanged. Medieval society was heavily agrarian, feudal. Even with
electricity, the basic economic structure might stay. Land ownership, serfdom, maybe. The
social hierarchy could persist because the feudal system is deeply ingrained. Maybe the
church's influence remains strong, as they might control the new technology. Historical
parallel: the Roman Empire's persistence despite technological changes.
Part (3), a second-order consequence. Maybe something like environmental impact. Early
electricity use could lead to pollution, but in a medieval setting, maybe resource depletion.
For example, mining for materials needed for electricity (like coal, metals) could strain
resources, leading to conflicts or economic shifts. Or maybe the spread of ideas through
communication devices (like telegraphs) could lead to social unrest, similar to how the
printing press spread ideas in the Renaissance. But the user wants something not immediately
obvious. Maybe the development of new industries that disrupt traditional crafts, leading to
labor issues. Or the creation of a new class of workers dependent on electricity, leading to
urbanization.
</think>
(Generation hit the 1024 token limit during the thinking phase)
Note: The base model (non-Thinking) produced <think> tags in this run, which is unexpected since its chat template does not inject them. This may be an artifact of sampling (temperature=1.0) or a mixup during the benchmark run. We cannot verify the cause.
PyTorch fp16 output (same model, same prompt):
Response (1024 tokens generated):
The model produced an extended internal reasoning trace followed by a structured response. Below is an excerpt from the structured portion (the full 1024 tokens were mostly reasoning):
(1) Societal Structures That Would Change First
The earliest societal shifts would center on labor productivity and urban infrastructure.
Electric lighting would revolutionize nighttime activities, reducing reliance on candles and
oil lamps. Cities like Florence or Paris (analogous to medieval hubs) could see safer streets,
extended working hours in workshops, and a surge in nighttime commerce. Factories powered by
electric motors would outpace manual labor, undermining the feudal system's agrarian focus.
Wealth would shift from land ownership to industrial capital, accelerating the decline of
feudal hierarchies -- mirroring how the Industrial Revolution eroded medieval guilds and
manorialism.
(2) What Would Remain Surprisingly Unchanged
Despite technological leaps, hierarchical social structures and religious authority would
persist...
(Excerpt from 1024-token output; generation hit the token limit)
Both backends produce coherent, structured reasoning. Outputs differ between runs; we used temperature=1.0 with top_p=0.7, so we cannot draw conclusions about output equivalence from a single sampled run. A proper comparison would require greedy decoding and token-level comparison.
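The token-level comparison mentioned above could look roughly like the sketch below. It assumes both backends have already been run with greedy decoding on the same prompt and that their outputs are available as lists of token IDs; `first_divergence` is a hypothetical helper, and the token lists in the example are made up for illustration.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical.

    If one sequence is a strict prefix of the other, the divergence index
    is the length of the shorter sequence.
    """
    for index, (token_a, token_b) in enumerate(zip(a, b)):
        if token_a != token_b:
            return index
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


if __name__ == "__main__":
    # Made-up token IDs standing in for greedy outputs from each backend.
    mlx_tokens = [101, 7, 42, 9, 13]
    torch_tokens = [101, 7, 42, 8, 13]
    print(first_divergence(mlx_tokens, torch_tokens))
```

Even under greedy decoding, bf16 and fp16 arithmetic can produce small numerical differences that flip a near-tied token, so a late divergence would not by itself indicate a conversion error.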
Model Details
- Architecture: Looped Language Model (LoopLM) / Universal Transformer
- Parameters: 2.6B
- Layers: 24 (applied recurrently 4 times = 96 effective layers)
- Hidden size: 2048
- Attention: Multi-Head (16 heads, 128 head dim)
- FFN: SwiGLU (intermediate size 5632)
- Position encoding: RoPE
- Vocabulary: 49,152 tokens
- Training tokens: 7.7T
- Context length: 4K (training), extendable to 64K
- Precision in this conversion: bfloat16
- Disk size: ~5.2 GB
Credits
- Original model by ByteDance Seed
- MLX architecture reference from mlx-community/Ouro-2.6B-Thinking-4bit
- Conversion and MLX model implementation by Arcada Labs
- Converted using mlx-lm 0.29.1
Citation
@article{ouro2025,
title={Scaling Latent Reasoning via Looped Language Models},
author={ByteDance Seed},
year={2025},
url={https://arxiv.org/abs/2510.25741}
}