How to use Wan2.2 Google Colab Notebooks for Image to Video Generation

 



The above video shows possible results and introduces the basic settings.

This is a guide on how to use two recently released Wan2.2 Google Colab notebooks for highly impressive image-to-video generation that compares favorably with similar platforms like Seedance 1.0, Hailuo 02, and Kling 2.0. Wan2.2, an improvement on Wan2.1, is the top open-source video generation model at the time of writing, and LoRAs built on Wan2.1 can still work with Wan2.2, which significantly expands its video generation capabilities. You can learn more about Wan2.2 from this github repository: https://github.com/Wan-Video/Wan2.2

 

Notebook 1 (Free!)

The first notebook is titled WAN2.2 IMAGE TO VIDEO WITH LIGHTX2V LoRA and it can be found in this github repository: https://github.com/Isi-dev/Google-Colab_Notebooks . This notebook enables you to generate at least 5 seconds of standard-definition video from a single image for free in about 13 minutes using the T4 GPU offered by Google Colab. It also enables you to generate a 4-second high-definition video for free in less than 35 minutes.

Now let’s proceed with using this notebook.

 

Before running the code to prepare the environment on the free T4 GPU, check the following settings.

Model_quant: You can select any of four pairs of models, from the smallest to the largest in terms of memory requirements: Q4_K_M, Q5_K_M, Q6_K, or Q8_0. Each pair is made of a high noise model and a low noise model. The high noise model runs for a set number of steps, usually half of the total number of denoising steps, and the low noise model completes the denoising process with the remaining steps. You can generate a 5-second (frames = 81) 480p video with model quants Q4_K_M to Q5_K_M, and sometimes Q6_K depending on the content of the image. If you want to use the Q8_0 models, you will have to reduce the number of frames or else your T4 session will crash. For a 720p video, you can use Q4_K_M to generate a 4.5-second (frames = 73) video. You will have to reduce the number of frames to use the larger quants.

download_loRA_1 & lora_1_download_url: You can put the link to a LoRA from huggingface or civitai in the lora_1_download_url textbox and select download_loRA_1 to download it. This also applies to download_loRA_2 & lora_2_download_url, and download_loRA_3 & lora_3_download_url, enabling you to download up to 3 LoRAs.

token_if_civitai_url: If any of your LoRA links are from civitai, then you will need to put your civitai token in this textbox. You can learn how to create your civitai token from this video: https://youtu.be/49NkIV_QpBM
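If you prefer to test a LoRA link and token manually before running the cell, a minimal Python sketch along these lines can be run in a separate Colab cell. The URL and output filename below are placeholders, and passing the token as a ?token= query parameter follows civitai’s public download API rather than anything specific to this notebook:

    import requests

    lora_url = "https://civitai.com/api/download/models/000000"  # placeholder model id
    civitai_token = "YOUR_TOKEN_HERE"

    # civitai's download endpoint accepts the API token as a query parameter
    response = requests.get(f"{lora_url}?token={civitai_token}", allow_redirects=True, timeout=300)
    response.raise_for_status()

    with open("my_lora.safetensors", "wb") as f:  # placeholder output filename
        f.write(response.content)
    print("Downloaded", len(response.content), "bytes")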

You can now click run to prepare the environment.

 

After the environment setup completes, you can run the above part of the notebook to upload your image. If you want the image to be displayed, then enable the display_upload checkbox. Note that only jpg, jpeg, or png images will be displayed. Other image formats may or may not work for video generation.

 

There are lots of settings under Generate video. Let’s first look at the above video settings.

positive_prompt: You can put a prompt in this textbox to describe the video you want to generate from the image. Describing camera views and motions is quite tricky with the Wan models. You can use examples like: “The camera pushes in to focus on her face” for zooming in, “The camera pulls out to reveal more of the scene” for zooming out, “The camera moves to a low angle shot to follow the woman’s legs”, “The camera rotates to the left/right”, “The camera moves to the right/left hand side of the frame”, etc. The example prompt is, “The beautiful woman walks forward and smiles as the camera pulls out to reveal more of the scene. She is walking to viewers.” If the LoRA you want to use has trigger words, remember to include them in your prompt.

prompt_assist: This setting gives you four options: “none”, “walking to camera”, “walking from camera”, and “swaying”. If you select “none”, then only your prompt and the LoRAs you select will be used for video generation. If you select any of the other three options, then some built-in LoRAs will modify the model, and the appropriate trigger word will be added to your prompt to generate a video that adheres more to the prompt. You may not need this setting for wan2.2 as it follows your prompts much better than wan2.1.

negative_prompt: The default negative prompt is in Chinese, and you can leave it as it is. If the cfg_scale is 1, the negative prompt will be ignored.

width & height: For a standard definition or 480p video, set the width and height to 480 and 854 respectively for a portrait video, and to 854 and 480 respectively for a landscape video. For a high definition or 720p video, set the width and height to 720 and 1280 respectively for a portrait video, and to 1280 and 720 respectively for a landscape video.
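As a quick reference, the four presets described above can be written as a small Python dictionary (the labels are just for illustration, not notebook fields):

    # (width, height) presets described above
    resolutions = {
        "480p portrait":  (480, 854),
        "480p landscape": (854, 480),
        "720p portrait":  (720, 1280),
        "720p landscape": (1280, 720),
    }
    width, height = resolutions["480p portrait"]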

seed: The default seed is 0. This results in a random seed being generated, so a different video is produced every time you generate. If you want to see how changes in other settings affect the generated video, change the seed to a value that is not 0. This keeps the seed constant, and the same video will be generated every time as long as all other settings remain the same.
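A minimal sketch of the seed behaviour described above (the variable names are illustrative, not the notebook’s internals):

    import random

    seed = 0  # value from the seed textbox

    # 0 means "pick a new random seed on every run"; any other value is used as-is
    effective_seed = random.randint(1, 2**32 - 1) if seed == 0 else seed
    print("Using seed:", effective_seed)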

High_noise_steps: The default value is 3. This means that the high noise model will be used for the first 3 denoising steps while the low noise model will be used for the remaining steps. This value is usually set to half of the total steps.

steps: The default steps value is 6. If the lightx2v LoRAs are disabled, steps is usually set to 20 and High_noise_steps to 10. The greater the value, the higher the video quality, but the longer it will take to generate a video. The video quality stops improving beyond a certain value, and more tests will be needed to pin that value down.
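To make the split between the two models concrete, here is a purely illustrative Python sketch of how the denoising steps are divided between them; it is not the notebook’s actual sampling code:

    total_steps = 6       # the "steps" setting
    high_noise_steps = 3  # the "High_noise_steps" setting

    for step in range(total_steps):
        # the high noise model handles the first, noisier steps,
        # then the low noise model finishes the denoising
        model = "high noise model" if step < high_noise_steps else "low noise model"
        print(f"step {step + 1}/{total_steps}: {model}")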

cfg_scale: The default cfg_scale for this notebook is 1. This assumes that the lightx2v LoRAs will be enabled, and the recommended cfg for these LoRAs is 1. If neither of these LoRAs is enabled, you can increase the cfg_scale for normal video generation with the base Wan2.2 models, in which case the negative prompt is also taken into account.

sampler_name: There are many samplers to choose from. Different samplers result in different video quality and details. You can try all of them to see which gives the best result.

scheduler: The scheduler plays a key role in maintaining temporal consistency (making sure things like objects, lighting, and movement stay smooth and steady from one video frame to the next). It sets the plan for how noise is removed over time, while the sampler follows that plan to generate each frame from noise using the model. The default scheduler is ‘simple,’ but you are free to experiment with others.

frames: 65 frames is equivalent to a 4-second 16fps video. This is the recommended duration for a 720p video on the T4 GPU; greater values will likely crash the T4. You can set it to a value like 81 if you want to generate a 5-second 480p video, and you can try higher frame values to test the limits for a 480p video.
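If you want to estimate the clip length for other frame counts, the relationship implied by the examples in this guide (65 frames ≈ 4 s, 73 ≈ 4.5 s, 81 ≈ 5 s at 16fps) is roughly (frames − 1) / fps. A small helper, offered only as a convenience:

    def clip_seconds(frames: int, fps: int = 16) -> float:
        """Approximate duration of the generated clip in seconds."""
        return (frames - 1) / fps

    print(clip_seconds(65))  # 4.0
    print(clip_seconds(73))  # 4.5
    print(clip_seconds(81))  # 5.0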

overwrite_previous_video: You can find your generated videos in the comfyui/output/ directory, visible in the file browser on the left side of the Colab notebook. If you don’t select the overwrite_previous_video checkbox, all the videos you generate will be saved in the output folder. If you select it, each generation will replace the previous one, keeping only one video in the output folder.

 

It’s best to use the default model configuration settings, but let’s see what each does.

use_sage_attention: selecting this enables the use of sage attention which helps to speed up video generation.

flow_shifts: These act as control knobs for the video generation process. By adjusting them, you can fine-tune the energy of the motion in generated videos and potentially enhance visual quality, especially for faster generation with fewer steps. Lower values tend to lead to less movement in the video, while higher values cause more energetic movement. 8.0 is recommended for Wan2.2 based on test results. flow_shift is for the high noise model, while flow_shift2 is for the low noise model.

The above image shows the activation, strength, and steps for the lightx2v LoRAs which enable the generation of 5 seconds of a 480p video in about 13 minutes, and 4 seconds of a 720p video in around 28 minutes on the T4. Without the LoRAs, the steps value would have been set to 20 for good results. The lightx2v and lightx2v2 LoRAs with strengths of 3 and 1.5 are for the high noise and low noise models respectively. This may not necessarily be the best setting overall, but it’s the best setting many have tried.

The above image shows the LoRA configuration settings for activating and setting the strengths of the LoRAs you may have downloaded in the Prepare Environment section of the notebook. It’s advisable to enable only one LoRA per generation unless, of course, you understand the LoRAs you downloaded and know that combining them won’t cause issues.

The teacache settings enable you to determine whether to use TeaCache and how aggressively it is applied. Setting rel_l1_thresh to zero disables TeaCache. Setting it greater than zero can skip one or more of the steps in the video generation process, depending on the magnitude. This can speed up generation, but it can significantly reduce the quality of videos with realistic styles, which might not be a problem if you have a very good video upscale workflow. It is recommended that you disable TeaCache when using the lightx2v LoRAs, which can produce good results in just 6 steps. TeaCache was a reasonable way to speed up the diffusion process by skipping closely similar noise-removal steps back when we couldn’t get good results below 20 steps.
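For intuition only, here is a rough, self-contained sketch of the idea behind TeaCache-style step skipping: the change between consecutive denoising inputs is measured with a relative L1 distance, and while the accumulated change stays under the threshold, the cached result from the previous step is reused instead of running the model again. This is a conceptual illustration with made-up numbers, not the notebook’s implementation:

    import numpy as np

    rel_l1_thresh = 0.15            # 0 would disable skipping entirely
    rng = np.random.default_rng(0)
    prev_input = rng.normal(size=1000)
    accumulated_change = 0.0

    for step in range(1, 7):
        current_input = prev_input + rng.normal(scale=0.05, size=1000)  # stand-in for the next denoising input
        rel_l1 = np.abs(current_input - prev_input).mean() / np.abs(prev_input).mean()
        accumulated_change += rel_l1

        if rel_l1_thresh > 0 and accumulated_change < rel_l1_thresh:
            print(f"step {step}: change is small, reuse cached output (model call skipped)")
        else:
            print(f"step {step}: run the model")
            accumulated_change = 0.0
        prev_input = current_input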

You can run this last section to increase the fps of your generated videos and create slow- or fast-motion effects. The default settings will change your video to 30fps, with the video slowed down a little. Increasing the FRAME_MULITIPLIER while keeping vid_fps the same will result in a slower, longer video. Increasing vid_fps while keeping the FRAME_MULITIPLIER the same will result in a faster, shorter video. The crf_value determines the quality of the resulting video; lower values create higher-quality videos at the cost of larger file sizes.
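The interplay between these two settings is simple arithmetic: interpolation multiplies the frame count, and vid_fps decides how fast those frames are played back. A small illustrative calculation, assuming a 16fps, 81-frame source clip (roughly what a default 5-second 480p generation produces):

    source_frames = 81     # frames in the generated video
    source_fps = 16
    frame_multiplier = 2   # FRAME_MULITIPLIER (as named in the notebook); illustrative value
    vid_fps = 30

    output_frames = source_frames * frame_multiplier
    output_seconds = output_frames / vid_fps
    print(f"{output_frames} frames at {vid_fps} fps -> {output_seconds:.1f} s "
          f"(original was {source_frames / source_fps:.1f} s)")
    # 162 frames at 30 fps -> 5.4 s (original was 5.1 s): slightly slower and longer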

Now that you know how to use the notebook, you can proceed with your free video generation.

 

Notebook 2

Click this link to access this notebook

The second notebook is titled WAN2.2 IMAGE TO VIDEO WITH LIGHTX2V LoRA (+ Video upscale/enhancement and face restoration). It is an extension of notebook 1 that includes the extras mentioned in the title. It enables the generation of up to 5 seconds of a 720p video from a 480p video on the T4 GPU. You can also use it to upscale any of your images or videos, perform face swaps on them, and increase the fps of any video.

This notebook includes all the parts of the previous notebook, so you can study the details under notebook 1 before learning about the additional parts of this notebook.

 

The above optional section follows the Generate video section described under notebook 1. You can run it to upload any image or video you want to upscale and/or enhance. Note that on the T4 GPU you can only upscale videos of up to 5 seconds to 720p (high definition), and only 5-second, 16fps videos have been tested. You can enable the display_upload checkbox to see your uploaded image or video. Make sure you run the Prepare Environment section before running this section to upload your image or video.

 

Most of the settings in the Enhance & Upscale Image/Video section shown above are like those shown under the Generate video section. I will only describe those not previously described.

upscale_optional_image_or_video: If you uploaded an image or a video with the optional ‘Upload Image/Video to be upscaled’ code block, then enable this checkbox to upscale or enhance your uploaded image or video. Remember to disable it if you want to upscale or enhance your generated videos.

 

upscale_by: This is the factor used to upscale your images and videos. The default value of 1.5 upscales a 480p video to 720p. Going up to a value of 2 for a 480p video will likely crash the T4.
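A quick sanity check of the arithmetic, using the 854×480 landscape resolution from earlier in this guide:

    upscale_by = 1.5
    width, height = 854, 480  # 480p landscape
    print(int(width * upscale_by), int(height * upscale_by))  # 1281 720, effectively a 720p video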

denoise: This value determines how much the resulting image or video changes from your uploaded image or video. The default value of 0.1 can be used together with the upscale_by value of 1.5 to get an upscaled video without much change from the original. You can reduce this value to about 0.06 to focus on upscaling the video with little change from the original. But if the original video is of very poor quality or contains lots of warping, artifacts or deformities, then you can enhance the video by setting this denoise value up to 0.4 or more. You might need to set it up to 0.7 for videos of very low quality. If you only want to enhance a video or image, then set the upscale_by value to 1 and set this value to 0.3 or more depending on the quality of the initial video or image. Describing your image or video in the positive_prompt textbox might make the generated video closer to the input video in content even with high denoise values.

upscale_model: You can choose from 3 upscale models. The first two (4x-UltraSharp.pth and 4x_foolhardy_Remacri.pth) are for upscaling images and videos with realistic style. The 4x-AnimeSharp.pth is for upscaling anime or cartoon images.

select_every_nth_frame: This setting applies only to videos. If you have a video made of 200 frames, a value of 1 will output an upscaled and/or enhanced video with all 200 frames. A value of 2 will output a video with 100 frames, made up of the even-numbered frames from the initial video. A value of 3 will output a video with 66 frames, and so on. Higher values are useful for speeding up the upscaling and/or enhancement of a long video without much motion, and for avoiding session crashes. The resulting video can then be returned to its original number of frames through frame interpolation.
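In plain Python terms, the behaviour described above amounts to keeping every nth frame, counting from the nth one (a simple illustration, not the notebook’s code):

    frames = list(range(1, 201))     # pretend these are 200 video frames, numbered 1..200

    def select_every_nth(frames, n):
        return frames[n - 1::n]      # keep frames n, 2n, 3n, ...

    print(len(select_every_nth(frames, 1)))  # 200
    print(len(select_every_nth(frames, 2)))  # 100 (the even-numbered frames)
    print(len(select_every_nth(frames, 3)))  # 66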

skip_first_frames: This setting, like the name implies, skips the first frames from a video and the number of frames skipped depends on the value used. A value of zero means no frame will be skipped. A value of 15 means the first 15 frames will not be included in the output video.

max_output_frames: This setting determines the maximum number of frames to upscale and/or enhance. The resulting number of frames will be less than or equal to the set value.

 

You can run the optional section shown above to upload any image that contains a face you want to copy and transfer to a person in another image or video. You can enable the display_upload checkbox to see your uploaded image. Ensure you run the Prepare Environment section before running this code block to upload your image.

You can run the optional section shown above to upload any image or video that contains a face you want to change. You can enable the display_upload checkbox to see your uploaded image or video. Ensure you run the Prepare Environment section before running this section to upload your image or video.

 

For the Restore/Change Face settings shown above, you don’t have to change anything if you just want to restore or change the face of a single person in an image. If you are working with a video, you can refer to the description of the video settings in the Enhance & Upscale Image/Video section. If you want to change or restore the faces of two or more people, the faces are numbered from left to right starting at 0, in both the source and the target. Suppose there are three people in the target (the video you generated, upscaled, or uploaded under the last optional section) and two faces in the source (the image uploaded in the ‘Upload a Reference Image to use for face restoration/change’ section). In source_face_index, you can write 0,1 to represent the two reference faces you want to copy, and in target_face_index, you can write 0,0,1. If you then run this section, the leftmost face in the source image will be used to replace the first two faces in the video, while the rightmost face in the source image will replace the remaining face.
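Reading the two lists as a pairing, the example above maps each target face to a source face as follows. This is only a toy illustration of that example, not the notebook’s face-swap code:

    source_face_index = [0, 1]      # faces in the reference image, numbered left to right
    target_face_index = [0, 0, 1]   # for each target face, which entry of source_face_index to use

    for target_pos, source_entry in enumerate(target_face_index):
        print(f"target face {target_pos} <- source face {source_face_index[source_entry]}")
    # target face 0 <- source face 0
    # target face 1 <- source face 0
    # target face 2 <- source face 1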

If you set detect_gender_source to female, then only a female in the source image will be detected and used. If you set detect_gender_target to female, then the faces of all the females in the target image or video will be replaced by the face in the source image.

 

If you want to increase the frames per second or fps of any video, you can run the optional section shown above to upload the video. You can enable the display_upload checkbox before running to see your uploaded video. Ensure you run the Prepare Environment section before running this section.

 

 

You can run this last section to increase the fps of your generated or uploaded videos and create slow- or fast-motion effects. If you uploaded a video with the optional ‘Upload another Video for frame interpolation’ section, enable the interpolate_optional_video checkbox to interpolate your uploaded video, and remember to disable it if you want to interpolate your generated videos. If your video is 16fps, the default settings will change it to 30fps with the video slowed down a little. Increasing the FRAME_MULITIPLIER while keeping vid_fps the same will result in a slower, longer video. Increasing vid_fps while keeping the FRAME_MULITIPLIER the same will result in a faster, shorter video. The crf_value determines the quality of the resulting video; lower values create higher-quality videos at the cost of larger file sizes.

Please ensure you read this guide properly before seeking support. If you need clarification on any part of this guide or the notebooks, then email me at isijomo@yahoo.com  
