Two months ago, we benchmarked @google’s Veo2 model. It fell short, struggling with style consistency and temporal coherence, trailing behind Runway, Pika, @tencent, and even @alibaba-pai.
That’s changed.
We just wrapped up benchmarking Veo3, and the improvements are substantial. It outperformed every other model by a wide margin across all key metrics. It is not just better: it dominates on style, coherence, and prompt adherence. It's rare to see such a clear lead in today’s hyper-competitive T2V landscape.
Dataset coming soon. Stay tuned.
We just added Hidream I1 to our T2I leaderboard (https://www.rapidata.ai/leaderboard/image-models), benchmarked using 195k+ human responses from 38k+ annotators, all collected in under 24 hours.
🚀 Building Better Evaluations: 32K Image Annotations Now Available
A few months ago, we published one of our most-liked datasets: 13K images based on @data-is-better-together's dataset, following Google's research on "Rich Human Feedback for Text-to-Image Generation" (https://arxiv.org/abs/2312.10240). It collected over 1.5M responses from 150K+ participants.
Today, we're releasing an expanded version: 32K images annotated with 3.7M responses from over 300K individuals, completed in under two weeks using the Rapidata Python API.
In the examples below, users highlighted words from prompts that were not correctly depicted in the generated images. Higher word scores indicate more frequent issues. If an image captured the prompt accurately, users could select [No_mistakes].
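To make the word scores concrete, here is a minimal sketch of how such scores could be computed. It assumes a word's score is simply the number of annotators who highlighted it, and that a [No_mistakes] response is stored as an empty selection. The function name `word_scores` and the data layout are hypothetical, not the Rapidata API or the dataset's actual schema.

```python
from collections import Counter

def word_scores(prompt, responses):
    """Count how many annotators flagged each word of the prompt.

    prompt: the text prompt, split on whitespace into words.
    responses: one set of word indices per annotator; an empty set
        stands for a [No_mistakes] answer (an assumption of this sketch).
    Returns a list of (word, score) pairs in prompt order.
    """
    words = prompt.split()
    counts = Counter()
    for selected in responses:
        # Ignore indices that do not point at a word in the prompt.
        counts.update(i for i in selected if 0 <= i < len(words))
    return [(word, counts[i]) for i, word in enumerate(words)]

# Three annotators: two flag "red", one also flags "cat", one sees no mistakes.
scores = word_scores("a red cat", [{1}, {1, 2}, set()])
```

Under these assumptions, a word that many annotators flagged ends up with a higher score, matching the reading that higher scores indicate more frequent issues.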
We're continuing to work on large-scale human feedback and model evaluation. If you're working on related research and need large, high-quality annotations, feel free to get in touch: [email protected].