This mixes Mochi with a development version of diffusers to achieve high-quality, fast inference with the full 161 frames on a single 24 GB card. This repo contains only the transformer. After installing the diffusers main branch with `pip install git+https://github.com/huggingface/diffusers`, it can be loaded normally and used in a pipeline like so:

```python
import torch
from diffusers import MochiPipeline, MochiTransformer3DModel
from diffusers.utils import export_to_video

# Load the mixed nf4/bf16 transformer from this repo
transformer = MochiTransformer3DModel.from_pretrained("imnotednamode/mochi-1-preview-mix-nf4", torch_dtype=torch.bfloat16)
# Load the rest of the pipeline from the base repo, swapping in the transformer above
pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", revision="refs/pr/18", torch_dtype=torch.bfloat16, transformer=transformer)
pipe.enable_model_cpu_offload()  # move each component to the GPU only while it is in use
pipe.enable_vae_tiling()  # decode the video in tiles to lower peak VRAM
frames = pipe("A camera follows a squirrel running around on a tree branch", num_inference_steps=100, guidance_scale=4.5, height=480, width=848, num_frames=161).frames[0]
export_to_video(frames, "mochi.mp4", fps=15)
```

I've noticed that raising guidance_scale lets the model produce a coherent output in fewer steps, but it also seems to reduce motion, since the model is trying to align mostly with the text prompt. To an extent, this can also offset the quality degradation from using full nf4 weights.
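
If you want to check this tradeoff yourself, one way is to render the same prompt over a small grid of settings with a fixed seed. Here's a minimal sketch that only builds the grid; `build_sweep`, the seed, the output file names, and the values 6.0 and 50 are all made up for illustration and aren't settings I've tuned.

```python
from itertools import product


def build_sweep(guidance_scales, step_counts, seed=0):
    """Return one settings dict per (guidance_scale, num_inference_steps) pair.

    Hypothetical helper: it only prepares settings, it does not run the model.
    """
    runs = []
    for scale, steps in product(guidance_scales, step_counts):
        runs.append({
            "guidance_scale": scale,
            "num_inference_steps": steps,
            "seed": seed,  # same seed for every run so only the two settings vary
            "output_path": f"mochi_gs{scale}_steps{steps}.mp4",
        })
    return runs


# Illustrative grid: the values from the example above plus one assumed alternative each
runs = build_sweep(guidance_scales=[4.5, 6.0], step_counts=[50, 100])
for run in runs:
    print(run["output_path"])
```

Each entry can then be fed to the pipeline above by passing `guidance_scale` and `num_inference_steps` through and seeding with `generator=torch.Generator("cpu").manual_seed(run["seed"])`, so differences between the videos come from the settings rather than the noise.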

This version works by mixing nf4 weights and bf16 weights together. I've noticed that pure nf4 weights degrade the model's quality significantly, but with bf16 or LLM.int8 weights the full 161 frames can't fit into VRAM. This version strikes a balance (everything except a few blocks is in bf16).
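
To get a feel for why the mix matters, here's a rough back-of-the-envelope estimate of weight memory for different nf4/bf16 splits. This is only a sketch under assumptions: the ~10B parameter count is approximate, the 4.5 bits per nf4 weight assumes one fp32 scale per block of 64 weights, and the 25% split is a made-up example rather than the actual split used in this repo. It also ignores activations, which are a big part of the VRAM cost at 161 frames.

```python
def weight_memory_gib(n_params, nf4_fraction, nf4_bits=4.5, bf16_bits=16.0):
    """Estimate weight storage in GiB when a fraction of parameters is nf4.

    nf4_bits defaults to 4.5: 4 bits per weight plus an assumed fp32 scale
    shared by every 64 weights (32 / 64 = 0.5 extra bits per weight).
    """
    if not 0.0 <= nf4_fraction <= 1.0:
        raise ValueError("nf4_fraction must be between 0 and 1")
    total_bits = n_params * (nf4_fraction * nf4_bits + (1.0 - nf4_fraction) * bf16_bits)
    return total_bits / 8 / 1024**3


N_PARAMS = 10e9  # assumption: roughly 10B parameters in the transformer

for fraction in (0.0, 0.25, 1.0):  # 0.25 is a hypothetical split, not this repo's
    print(f"{fraction:.0%} nf4: ~{weight_memory_gib(N_PARAMS, fraction):.1f} GiB of weights")
```

The estimate scales linearly with the nf4 fraction, so every block moved from bf16 to nf4 frees a fixed amount of VRAM at some cost in quality.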

Here's a comparison:

bf16: