Generating Long, High-Quality AI Videos with LongCat-Video on ComfyUI (Google Colab Setup)

 

A 27-minute 15fps video generated with ComfyUI on Google Colab. 

LongCat-Video is an AI project that pushes the limits of video generation. Unlike most AI video models that degrade in color or consistency after a few seconds, LongCat-Video enables generating videos longer than 30 seconds, all while maintaining sharp image quality and stable colors.

The models and code were optimized for ComfyUI by Kijai, one of the major contributors to the ComfyUI ecosystem. This optimization makes it much easier to experiment with LongCat-Video and generate long, coherent video sequences right from your browser.


Running LongCat-Video on Google Colab

I’ve created a Google Colab notebook that automatically sets up ComfyUI with the LongCat-Video models. So far, I’ve tested it on the L4 GPU, but it should also work on the free T4 if you reduce the number of frames generated per sequence, and set blocks_to_swap to 0 if the runtime crashes because it runs out of system RAM with the default settings.
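If you are not sure which GPU Colab has assigned to your runtime, a quick check like the one below (my own hypothetical snippet, not part of the notebook) can tell you whether you need to lower the frame count or adjust blocks_to_swap:

```python
# Hypothetical helper (not part of the notebook): check which GPU Colab
# assigned so you know whether to lower the frame count or blocks_to_swap.
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {name}, VRAM: {vram_gb:.1f} GB")
    if "T4" in name:
        print("Free T4: reduce the frames per sequence and try blocks_to_swap = 0.")
else:
    print("No GPU detected - switch the Colab runtime type to a GPU.")
```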

You can open it directly here: notebook


ComfyUI Workflow and Environment Setup

The workflow is based on Kijai’s ComfyUI workflow and it is available for download here: workflow

To improve stability and image consistency when running on Colab GPUs:

  • Replaced sageattention 1.0.6 (which caused unwanted scene changes during video generation) with SDPA for smoother, more stable, but slower results (see the short SDPA illustration after this list).

  • Tried sageattention 2++, but it resulted in runtime errors on Colab.

  • FlashAttention took too long while preparing the environment, so I left it out.
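For context, SDPA here refers to PyTorch's built-in scaled dot-product attention. A minimal, self-contained illustration of the call (with arbitrary tensor shapes, not the model's actual video latents) looks like this:

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64)   # (batch, heads, tokens, head_dim)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# PyTorch dispatches to the best available fused kernel (flash, memory-efficient, or math).
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```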


Setting Up & Running the Workflow


Run the 'Prepare Environment' cell above to get a link (e.g. https://localhost:8188/ if you use the default settings, which rely on Google Colab's native port forwarding). Click the link to open a new tab with the ComfyUI interface, then drag the downloaded workflow into the interface to view the nodes.
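The notebook cell handles the port forwarding for you, but for reference, a Colab cell can expose a local server roughly like this (a sketch; the actual 'Prepare Environment' cell may differ):

```python
# Sketch of exposing a local ComfyUI server running on port 8188 through Colab.
from google.colab import output

# Proxies http://localhost:8188 through Colab and prints a clickable link.
output.serve_kernel_port_as_window(8188)
```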


The nodes in the purple group load and configure the LongCat diffusion model and LoRA, including the attention mechanism and the number of blocks to swap depending on your VRAM.
The nodes below the model nodes are for uploading your starting image and setting the generation parameters. To the right are five blue Sampler groups, which generate parts of the video sequentially from left to right. Let's zoom into each section for a clearer picture.



The nodes you need to pay attention to in the Models group are the WanVideo Model Loader and the WanVideo Block Swap nodes. 

The attention_mode in the WanVideo Model Loader node is set to sdpa. SDPA results in longer generation times than other attention mechanisms like sageattention and flashattention, but this workflow uses it for the stability reasons stated earlier in this post. If you are running the workflow locally, or in another environment where you managed to install sageattention 2++ or flashattention, it's better to change the attention_mode to one of those options for faster inference.

The blocks_to_swap input in the WanVideo Block Swap node is set to 10. This means that if the model has a total of 40 blocks, only 30 are loaded directly into VRAM, while 10 remain in system RAM and are swapped into VRAM when it is their turn to process the input. This was done to prevent the L4 GPU from running out of memory when all 40 blocks are loaded into VRAM at once.

If you are using an A100, you can set this value to 0 to fully utilize the GPU for faster inference. If you are using the free T4, you might also need to set it to 0 because of the low system RAM, and reduce the number of frames generated in each section to 61 or fewer. You might also need to reduce the resolution of the target video to avoid out-of-memory errors.

The default number of frames is 93 and the target video resolution is 640 x 608. Increasing either of these values might require increasing blocks_to_swap on the L4 and A100 to avoid running out of memory.
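To make the idea concrete, here is a rough conceptual sketch of block swapping in plain PyTorch. This is not Kijai's actual implementation, just an illustration of the principle: swapped blocks live in system RAM and are moved to the GPU only for their own forward pass.

```python
import torch
import torch.nn as nn

# Stand-in for the model's 40 transformer blocks.
blocks = nn.ModuleList([nn.Linear(64, 64) for _ in range(40)])
blocks_to_swap = 10
device = "cuda" if torch.cuda.is_available() else "cpu"

# Resident blocks stay in VRAM; the last `blocks_to_swap` blocks wait in system RAM.
for i, block in enumerate(blocks):
    block.to(device if i < len(blocks) - blocks_to_swap else "cpu")

def forward(x):
    x = x.to(device)
    for i, block in enumerate(blocks):
        swapped = i >= len(blocks) - blocks_to_swap
        if swapped:
            block.to(device)   # pull the block into VRAM just in time
        x = block(x)
        if swapped:
            block.to("cpu")    # free the VRAM again for the next swapped block
    return x

print(forward(torch.randn(1, 64)).shape)  # torch.Size([1, 64])
```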




You can use the Load Image node to upload an image if you are aiming for image-to-video generation. If your goal is text-to-video, select the Load Image and Resize Image V2 nodes and disable them by pressing Ctrl+B on your keyboard.

You can change the values of the width & height nodes based on the resolution values in the Default bucket aspect ratios node to determine the resolution of your video. Note that larger values use more VRAM.
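If you want to experiment with sizes outside the listed buckets, keep in mind that video diffusion models typically expect dimensions that are multiples of 16. A small hypothetical helper (not part of the workflow) illustrates the idea and gives a rough sense of the VRAM cost relative to the default resolution:

```python
# Hypothetical helper (not part of the workflow): snap a target size to a
# multiple of 16 and compare the pixel count against the 640 x 608 default.
def snap(value, multiple=16):
    return max(multiple, round(value / multiple) * multiple)

width, height = snap(720), snap(540)
print(width, height)                                          # 720 544
print("pixels vs default:", (width * height) / (640 * 608))   # ~1.01x
```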

The value of the Sampleframes node determines the length of each of the five videos that will be generated and later combined. You will need to reduce this value to 61 or less if using the T4. The inputs of the other nodes are set to the recommended values, but you are free to experiment with other values.
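As a rough sanity check on the final video length (my own arithmetic, ignoring any frames the five sections might share at their boundaries when they are combined):

```python
# Back-of-the-envelope length check for the combined video.
frames_per_section = 93   # default; use 61 or less on the free T4
sections = 5
fps = 15

total_frames = frames_per_section * sections
print(total_frames, "frames =", total_frames / fps, "seconds")  # 465 frames = 31.0 seconds
```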







Upcoming Additions

This post is still a work in progress. More additions coming...
