Build awesome datasets for video generation


Sayak Paul · 2025-02-12 · via Hugging Face - Blog


(This post was authored by hlky and Sayak)

Tooling for image generation datasets is well established: img2dataset is a fundamental tool for large-scale dataset preparation, complemented by community guides, scripts, and UIs that cover smaller-scale initiatives.

Our ambition is to make tooling for video generation datasets equally established by creating open video dataset scripts suited to small-scale work and leveraging video2dataset for large-scale use cases.

“If I have seen further it is by standing on the shoulders of giants”

In this post, we provide an overview of the tooling we are developing to make it easy for the community to build their own datasets for fine-tuning video generation models. If you cannot wait to get started, check out the codebase here.

Table of contents

Tooling
Filtering examples
Putting this tooling to use 👨‍🍳
Your Turn

Tooling

Typically, video generation is conditioned on natural language text prompts such as: "A cat walks on the grass, realistic style". Beyond the prompt, a video has several qualitative aspects that matter for controllability and filtering:

Motion
Aesthetics
Presence of watermarks
Presence of NSFW content

Video generation models are only as good as the data they are trained on. Therefore, these aspects become crucial when curating the datasets for training/fine-tuning.

Our 3-stage pipeline draws inspiration from works like Stable Video Diffusion and LTX-Video, and from their data pipelines.

Stage 1 (Acquisition)

Like video2dataset, we use yt-dlp to download videos.

We created a script, Video to Scenes, to split long videos into short clips.

Stage 2 (Pre-processing/filtering)

Extracted frames

detect watermarks with LAION-5B-WatermarkDetection
predict an aesthetic score with improved-aesthetic-predictor
detect presence of NSFW content with Falconsai/nsfw_image_detection

Entire video

predict a motion score with OpenCV
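To make the idea of a motion score concrete, here is a minimal sketch of one simple proxy: the mean absolute difference between consecutive grayscale frames. This is a stand-in chosen for brevity, not necessarily how the repository computes its score (an optical-flow based measure is another common choice). The function name `motion_score` is hypothetical, and the frames are assumed to be already decoded into NumPy arrays, for example with OpenCV's `cv2.VideoCapture` and `cv2.cvtColor`.

```python
import numpy as np

def motion_score(frames):
    """Mean absolute difference between consecutive grayscale frames.

    Hypothetical helper: higher values mean more pixel change between frames.
    """
    # Cast to float so uint8 subtraction cannot wrap around.
    frames = [np.asarray(f, dtype=np.float32) for f in frames]
    if len(frames) < 2:
        return 0.0
    diffs = [np.abs(b - a).mean() for a, b in zip(frames, frames[1:])]
    return float(np.mean(diffs))

# Synthetic frames for illustration only.
static = [np.zeros((4, 4), dtype=np.uint8)] * 3
moving = [np.full((4, 4), v, dtype=np.uint8) for v in (0, 10, 20)]

print(motion_score(static))  # 0.0
print(motion_score(moving))  # 10.0
```

A score like this can then be thresholded to drop near-static clips; the right threshold depends on resolution and frame sampling, so it is best tuned on a few known examples.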

Stage 3 (Processing)

We use Florence-2 (microsoft/Florence-2-large) to run the Florence-2 tasks <CAPTION>, <DETAILED_CAPTION>, <DENSE_REGION_CAPTION> and <OCR_WITH_REGION> on extracted frames. This provides different captions, object recognition, and OCR that can be used for filtering in various ways.

Any other captioner can be swapped in here. We can also caption the entire video (e.g., with a model like Qwen2.5) as opposed to captioning individual frames.

Filtering examples

In the dataset for the model finetrainers/crush-smol-v0, we opted for captions from Qwen2VL and filtered on pwatermark < 0.1 and aesthetic > 5.5. This highly restrictive filtering kept 47 videos out of 1493, roughly 3% of the total.

Let's review the example frames for pwatermark:

Two frames with text score 0.69 and 0.61.

The "toy car with a bunch of mice in it" scores 0.60 then 0.17 as the toy car is crushed.

All example frames were filtered out by pwatermark < 0.1. pwatermark is effective at detecting text and watermarks, but the score gives no indication of whether the text is an overlay or a toy car's license plate. Our filtering required every frame's score to be below the threshold. Averaging the score across frames, with a threshold of around 0.2 to 0.3, would be a more effective strategy for pwatermark.

Let's review the example frames for aesthetic scores:
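The difference between the two aggregation strategies can be sketched as follows. This is a minimal illustration rather than the repository's code: the function names are hypothetical, 0.3 is taken from the upper end of the range suggested above, the toy car scores are treated as if they were the only two frames scored, and the `brief_spike` clip is invented for contrast.

```python
from statistics import mean

def passes_all_frames(scores, threshold=0.1):
    """Strict rule: every frame's pwatermark must be below the threshold."""
    return all(s < threshold for s in scores)

def passes_mean(scores, threshold=0.3):
    """Softer rule: the average pwatermark across frames must be below the threshold."""
    return mean(scores) < threshold

# Toy car scores quoted above, assumed here to be the clip's only frames.
toy_car = [0.60, 0.17]
# Hypothetical clip where a single frame briefly triggers the detector.
brief_spike = [0.05, 0.45, 0.05]

print(passes_all_frames(toy_car), passes_mean(toy_car))
print(passes_all_frames(brief_spike), passes_mean(brief_spike))
```

Under these assumptions the strict rule rejects both clips, while the averaged rule keeps the clip with a single spurious detection and still rejects the one where text is visible for a large share of the frames.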

The pink castle initially scores 5.5, then 4.44 as it is crushed.

The action figure scores lower at 4.99, dropping to 4.84 as it is crushed.

The shard of glass scores low at 4.04.

Aesthetic	Image
4.04

In our filtering we required every frame's score to be above the threshold. In this case, using the aesthetic score from the first frame only would be a more effective strategy, since the score drops as the object is crushed.

If we review finetrainers/crush-smol, we notice that many of the objects being crushed are round or rectangular and colorful, which matches our findings in the example frames. Aesthetic scores can be useful, yet they carry a bias that can filter out good data when used with extreme thresholds like > 5.5. The score may work better as a filter that removes bad content than as one that selects good content, using a minimum threshold of around 4.25 to 4.5.
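As a worked example of the two aesthetic strategies, the sketch below applies them to the frame scores quoted above. The function names are hypothetical, the scores are treated as the complete set of frames for each clip, and the 4.25 minimum is simply the lower end of the suggested range.

```python
def keep_by_all_frames(frame_scores, threshold=5.5):
    """Strict rule: every frame must score above the threshold."""
    return all(s > threshold for s in frame_scores)

def keep_by_first_frame(frame_scores, min_score=4.25):
    """Softer rule: only the first frame is checked against a lower minimum."""
    return frame_scores[0] >= min_score

# Per-frame aesthetic scores from the examples above (first frame first).
clips = {
    "pink castle": [5.5, 4.44],
    "action figure": [4.99, 4.84],
    "shard of glass": [4.04],
}

for name, scores in clips.items():
    print(name, keep_by_all_frames(scores), keep_by_first_frame(scores))
```

With these inputs the strict rule rejects all three clips, while the first-frame rule keeps the pink castle and the action figure and still drops the shard of glass.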

OCR/Caption

Here we provide some visual examples for each filter as well as the captions from Florence-2.

Image	Caption	Detailed Caption
	A toy car with a bunch of mice in it.	The image shows a blue toy car with three white mice sitting in the back of it, driving down a road with a green wall in the background.

With OCR labels	With OCR and region labels

Putting this tooling to use 👨‍🍳

We have created various datasets with the tooling in an attempt to generate cool video effects, similar to the Pika Effects:

We then used these datasets to fine-tune the CogVideoX-5B model using finetrainers. Below is an example output from finetrainers/crush-smol-v0:

Prompt: DIFF_crush A red candle is placed on a metal platform, and a large metal cylinder descends from above, flattening the candle as if it were under a hydraulic press. The candle is crushed into a flat, round shape, leaving a pile of debris around it.

Your Turn

We hope this tooling gives you a head start in creating small, high-quality video datasets for your own custom applications. We will continue to add more useful filters to the repository, so please keep an eye out. Your contributions are also more than welcome 🤗

Thanks to Pedro Cuenca for his extensive reviews of the post.
