I dumped Kling AI for these Colab notebooks
This guide shows you how to use two Google Colab notebooks to turn a single image into a 5-second video, with both free and one-time-payment options. With models like WAN 2.1 and LoRAs like causvid, lightx2v, and fusionx, you can generate realistic or stylized videos in under 10 minutes using the free T4 GPU. You’ll also learn how to enhance videos, upscale them to 720p, fix faces, and interpolate frames.
Notebook 1 (Free!)
Click this link to access this notebook
The first notebook is titled Faster WAN 2.1 IMAGE TO VIDEO WITH CAUSVID, LIGHTX2V, & FUSION-X LoRAs, and it is a significant improvement over the widely loved WAN IMAGE TO VIDEO WITH Q4 & Q6 GGUF MODELS, LORA, & TEACACHE notebook, which can be found in this GitHub repository: https://github.com/Isi-dev/Google-Colab_Notebooks. This notebook enables you to generate a standard-definition video of at least 5 seconds at 16fps from a single image, for free, in less than 10 minutes using the T4 GPU offered by Google Colab. It also enables you to generate a 4-second 16fps high-definition video for free in about 20 minutes.
Wan 2.1 is the top open-source video generation model at the time of writing this guide, and the models and LoRAs built on it, like causvid, lightx2v, and fusionx, bring it to a level on par with commercial image-to-video generation models like those offered by Kling AI.
Section 1: Understanding the notebook (You can skip
to section 2 if you just want to see how to use the notebook)
The notebook begins with an introduction providing general information on what you can do with it and the recommended settings. Here’s some extra explanation for each part of the introduction:
- If you
deselect the use480p checkbox, the 720p Wan 2.1 base model will
be downloaded and used rather than the 480p Wan 2.1 base model. When used
with the walking to viewers LoRA, the 720p model is about 3
minutes faster than the 480p model for high definition (720p) videos.
This part indicates that you can choose to download either the Wan 2.1 480p or the 720p quantized models. If you select the use480p checkbox, a 480p model will be downloaded and used; if you don’t, a 720p model will be used instead. You can generate 480p (standard-definition) or 720p (high-definition) videos with all the models, but the 720p models tend to be faster at generating 720p videos than the 480p models.
- You can use the free T4 GPU to generate a 5-second 480p video (frames=81) at 16fps with the Q5_K_M GGUF 480p model and the default settings in less than 10 minutes. A 4-second 720p video (frames=65) can be generated in roughly 22 minutes with the Q4_K_M 480p model, and in 19 minutes with the Q4_K_M 720p model. I recommend using more powerful GPUs for bigger models, longer videos, and faster generations.
As you will see later, there are different quantized models that enable the Wan 2.1 model to run on Google Colab, and your generations might succeed or fail due to high memory requirements, depending on the model selected and the GPU used. The free T4 GPU has less RAM and VRAM than the L4 and can therefore only manage a little over 5 seconds of 480p video, and it is also much slower. Generations on the L4 are on average about 2.4 times faster than on the T4.
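If you are not sure which GPU your session was assigned or how much VRAM you have to work with, you can run a quick check in a scratch Colab cell before committing to a long generation. This snippet is not part of the notebook; it only reads what PyTorch reports about the runtime:

# Quick sanity check of the Colab runtime before a long generation.
# Not part of the notebook; run it in its own cell if you are unsure
# which GPU (and how much VRAM) your session was assigned.
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {name}, VRAM: {vram_gb:.1f} GB")  # T4 is roughly 15 GB, L4 roughly 22 GB usable
else:
    print("No GPU detected - set the runtime type to a GPU before running the notebook.")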
- To
use a lora, put its huggingface or civitai download link in the lora_download_url textbox,
select the download_lora checkbox, and if using civitai, input
your civitai token before running the code to Prepare Environment.
Remember to describe the main subject of the image and include the trigger
words for the LoRA in the prompt. For the default walking forward lora
link in lora_2_download_url, the trigger word is 'walking to viewers.' You
can get LoRAs from this huggingface repository: https://huggingface.co/collections/Remade-AI/wan21-14b-480p-i2v-loras-67d0e26f08092436b585919b and
from civitai: https://civitai.com/models. In civitai, set the Wan
Video and LoRA filters to see the Wan LoRAs.
Low-Rank Adaptations, or LoRAs, are what elevate Wan 2.1 above many commercial models, and they significantly improve results. To fully understand how to get and use LoRAs, you can watch the following video, which introduces a similar but far less capable notebook: https://youtu.be/49NkIV_QpBM
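If you are curious what the Prepare Environment cell roughly does with a LoRA link and a civitai token, the sketch below shows the general idea. It is an illustration only, with an assumed save path; the notebook’s own code may differ:

# Rough illustration of downloading a LoRA file from a huggingface or civitai
# link; the notebook's actual code may differ. civitai accepts the API token
# as a query parameter on the download URL.
import requests

def download_lora(url: str, out_path: str, civitai_token: str = "") -> None:
    if "civitai.com" in url and civitai_token:
        sep = "&" if "?" in url else "?"
        url = f"{url}{sep}token={civitai_token}"
    resp = requests.get(url, stream=True, timeout=60)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)

# Example (assumed path; check the notebook's own folder layout):
# download_lora(lora_2_download_url, "ComfyUI/models/loras/walking_to_viewers.safetensors")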
- Generating
a video from this flux image (https://comfyanonymous.github.io/ComfyUI_examples/flux/)
with the settings (480x480, 20 steps, 65 frames) using the Q4 GGUF model
and the free T4 GPU took about 33 minutes with no Teacache i.e. rel_l1_threshless set
to zero in the Teacache settings, and less than 18 minutes with rel_l1_threshless set
to 0.275 with little loss in quality. Increase the value of rel_l1_threshless for
faster generation with a tradeoff in quality. To get much faster
generations, use the causvid, lightx2v or fusionx model LoRAs. It is
recommended that you set rel_l1_threshless to zero if using
these LoRAs.
Earlier notebooks used Teacache to reduce video generation time. While this was okay for cartoons or anime, the quality of realistic-style videos suffered greatly, and the speedup wasn’t worth it unless the video was subsequently upscaled. Upscaling sometimes changes the face and features of characters, so one would also have to do a face swap to maintain the original features of realistic characters. But with the causvid, lightx2v, or fusionx LoRA options in this notebook, there’s a significant increase in both generation speed and quality. It is recommended that Teacache be disabled while using any of these LoRAs by setting the rel_l1_threshless slider to zero.
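For intuition, Teacache works by reusing a cached result whenever the model’s input has barely changed since the last fully computed step. The sketch below is only a conceptual illustration of that decision, not the notebook’s actual implementation; the threshold argument corresponds to the rel_l1_threshless slider:

# Conceptual sketch of the Teacache decision (not the notebook's code): the
# relative L1 change of the model input is accumulated, and a denoising step
# is skipped (cached output reused) while the accumulated change stays below
# the threshold. A threshold of 0 disables skipping entirely.
def teacache_should_skip(current, previous, accumulated, threshold):
    if threshold <= 0:
        return False, 0.0                       # Teacache disabled: compute every step
    rel_l1 = (current - previous).abs().mean() / previous.abs().mean()
    accumulated += float(rel_l1)
    if accumulated < threshold:
        return True, accumulated                # barely changed: reuse the cached result
    return False, 0.0                           # changed enough: recompute and reset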
- causvid recommended settings: cfg_scale=1, steps=4, sampler_name=uni_pc, scheduler=simple, flow_shift=5, strength=0.8
- lightx2v recommended settings: cfg_scale=1, steps=4, sampler_name=lcm, scheduler=simple, flow_shift=8, strength=1
- fusionx recommended settings: cfg_scale=1, steps=6, sampler_name=uni_pc, scheduler=simple, flow_shift=5, strength=1
- You
can enable both lightx2v & fusionx and adjust their strengths until
you get a desirable result. fusionx already contains the causvid LoRA, but
you can experiment with different combinations.
Lastly, the above part shows the recommended settings for each LoRA when used alone. These settings are based on many tests by the creator of the notebook. You can see that the strengths of most of the LoRAs are 1 or close to 1. But when LoRAs are combined, you can adjust the strengths as you like. The default setting enables the lightx2v and fusionx LoRAs with strengths of 0.5.
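For quick reference, those per-LoRA recommendations can be written out as a small lookup table. The values are copied from the notebook’s introduction as quoted above; treat it as a cheat sheet rather than notebook code:

# Recommended settings when a single speed-up LoRA is used on its own, as
# listed in the notebook's introduction. When combining LoRAs (e.g. the
# default lightx2v + fusionx at strength 0.5 each), adjust strengths to taste.
RECOMMENDED_LORA_SETTINGS = {
    "causvid":  {"cfg_scale": 1, "steps": 4, "sampler_name": "uni_pc", "scheduler": "simple", "flow_shift": 5, "strength": 0.8},
    "lightx2v": {"cfg_scale": 1, "steps": 4, "sampler_name": "lcm",    "scheduler": "simple", "flow_shift": 8, "strength": 1.0},
    "fusionx":  {"cfg_scale": 1, "steps": 6, "sampler_name": "uni_pc", "scheduler": "simple", "flow_shift": 5, "strength": 1.0},
}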
Section 2: Using the notebook
Before running the code to prepare the environment on the free T4 GPU, check the following settings:
Use480p checkbox: Select the use480p checkbox to download a 480p model, or leave it unchecked to download a 720p model.
Model_quant: You can select any of five models, from the smallest to the largest in terms of memory requirements: Q4_0, Q4_K_M, Q5_K_M, Q6_K, or Q8_0. You can generate a 5-second (frames = 81) 480p video with model quants Q4_0 to Q5_K_M, and sometimes Q6_K, depending on the content of the image. If you want to use Q8_0, you will have to reduce the number of frames or else your T4 session will crash. For a 720p video, you can use Q4_K_M to generate a 4-second (frames = 65) video. You will have to reduce the number of frames to use larger models.
download_loRA_1 & lora_1_download_url: You
can put the link to a LoRA from huggingface or civitai in the lora_1_download_url
textbox and select download_loRA_1 to download it. This also applies to download_loRA_2
& lora_2_download_url, and download_loRA_3 & lora_3_download_url,
enabling you to download up to 3 LoRAs.
token_if_civitai_url: If any of your LoRA links are
from civitai, then you will need to put your civitai token in this textbox. You
can learn how to create your civitai token from this video: https://youtu.be/49NkIV_QpBM
You can now click run to prepare the environment.
After the environment setup completes, you can run the above part of the notebook to upload your image. If you want the image to be displayed, enable the display_upload checkbox. Note that only jpg, jpeg, or png images will be displayed; other image formats may or may not work for video generation.
There are lots of settings under Generate video. Let’s first
look at the above video settings.
positive_prompt: You can put a prompt in this textbox to describe the video you want to generate from the image. Describing camera views and motions is quite tricky with the Wan models. You can use examples like: “The camera pushes in to focus on her face” for zooming in, “The camera pulls out to reveal more of the scene” for zooming out, “The camera moves to a low angle shot to follow the woman’s legs”, “The camera rotates to the left/right”, “The camera moves to the right/left hand side of the frame”, etc. The example prompt is, “The beautiful woman walks forward and smiles as the camera pulls out to reveal more of the scene. She is walking to viewers.” The last part of the prompt, “She is walking to viewers”, is the trigger phrase for the default LoRA in the lora_2_download_url textbox. If the LoRA you want to use has trigger words, remember to use them in your prompt.
negative_prompt: The default negative prompt is in Chinese,
and you can leave it as it is. If the cfg_scale is 1, the negative prompt will
be ignored.
width & height: For a standard definition
or 480p video, set the width and height to 480 and 854 respectively for a
portrait video, and to 854 and 480 respectively for a landscape video. For a high
definition or 720p video, set the width and height to 720 and 1280 respectively
for a portrait video, and to 1280 and 720 respectively for a landscape video.
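If it helps to think in terms of definition and orientation, those four combinations boil down to the following lookup (a convenience sketch, not part of the notebook; copy the numbers into the width and height fields):

# Width/height presets implied by the guide, as (width, height).
RESOLUTIONS = {
    ("480p", "portrait"):  (480, 854),
    ("480p", "landscape"): (854, 480),
    ("720p", "portrait"):  (720, 1280),
    ("720p", "landscape"): (1280, 720),
}
print(RESOLUTIONS[("480p", "landscape")])  # (854, 480)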
seed: The default seed is 0. This results in a random seed being generated, so a different video is produced every time you generate. If you want to see how changes in other settings affect the generated video, change the seed to a non-zero value. This keeps the seed constant, so the same video will be generated every time if all other settings remain the same.
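In other words, 0 is treated as “pick a random seed for me” and any other value is reproducible. A minimal sketch of that assumed behaviour:

# Assumed behaviour of the seed field: 0 means a fresh random seed every run,
# any non-zero value is used as-is so the result can be reproduced.
import random

def resolve_seed(seed: int) -> int:
    return random.randint(1, 2**32 - 1) if seed == 0 else seed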
steps: The default number of steps is 20. The greater the value, the higher the video quality, but the longer it takes to generate a video. The video quality doesn’t improve beyond a certain value, which might be around 50 for the Wan 2.1 model. This value will be ignored if the causvid, lightx2v, or fusionx LoRA is enabled.
cfg_scale: Unlike its predecessor, which uses a cfg_scale of 3, the default cfg_scale for this notebook is 1. This assumes that the causvid, lightx2v, or fusionx LoRA will be enabled, and the recommended cfg for these LoRAs is 1. If none of these LoRAs are enabled, you can increase the cfg_scale for a normal video generation with the base Wan 2.1 model, which then also makes use of the negative prompt.
sampler_name: There are many samplers to choose from.
Samplers like the default lcm require fewer steps to produce decent results. Different
samplers result in different video quality and details. You can try all of them
to see which gives the best result or follow the recommended ones listed in the
introduction of the notebook.
scheduler: The scheduler plays a key role in
maintaining temporal consistency (making sure things like objects, lighting,
and movement stay smooth and steady from one video frame to the next). It sets
the plan for how noise is removed over time, while the sampler follows that
plan to generate each frame from noise using the model. The default scheduler
is ‘simple,’ but you are free to experiment with others.
frames: 65 frames is equivalent to a 4-second 16fps video. This is the recommended duration for a 720p video on the T4 GPU; greater values will likely crash the T4. You can set it to a value like 81 if you want to generate a 5-second 480p video, and you can try higher frame values to test the limits for a 480p video.
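The frame counts quoted throughout this guide appear to follow the pattern seconds × 16 + 1 (Wan works with frame counts of the form 4n + 1), which is why 65 frames is 4 seconds and 81 frames is 5 seconds at 16fps:

# Converting between frame count and duration at 16 fps. Wan-style frame
# counts are of the form 4*n + 1, e.g. 65 (4 s) and 81 (5 s).
FPS = 16

def frames_for_seconds(seconds: int, fps: int = FPS) -> int:
    return seconds * fps + 1

def seconds_for_frames(frames: int, fps: int = FPS) -> float:
    return (frames - 1) / fps

print(frames_for_seconds(4), frames_for_seconds(5))    # 65 81
print(seconds_for_frames(65), seconds_for_frames(81))  # 4.0 5.0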
overwrite_previous_video: you can find your generated
videos in the directory at the left side of the colab notebook – comfyui/output/.
If you don’t select the overwrite_previous_video checkbox, all the videos you
generate will be saved in the output folder. If you select it, then each
generation will replace the previous one, keeping only one video in the output
folder.
It’s better to use the default model configuration settings, but let’s see what each one does.
use_sage_attention: selecting this enables the use of
sage attention which helps to speed up video generation.
flow_shift: this acts as a control knob for the video
generation process. By adjusting it, you can fine-tune the energy of the motion
in generated videos, and potentially enhance visual quality, especially for
faster generation with fewer steps. Lower values tend to lead to less movements
in the video, while higher values cause more energetic movements. 5.0 is recommended
for 720p videos, while 3.0 is recommended for 480p.
The above image shows the activation, strength, and steps for the causvid, lightx2v, and fusionx LoRAs, which enable the generation of 480p videos in less than 10 minutes and 720p videos in around 20 minutes on the T4. In the default setting, the causvid LoRA is disabled while the others are enabled. This may not necessarily be the best setting overall, but it’s the best I have tried. The number of steps used defaults to that of the enabled LoRA that comes last from the top of the notebook; e.g. in the above setting, the step count is determined by the fusionx LoRA steps, which is 4. You can experiment with different steps and LoRA combinations, and you might find a combination that yields better results than the default.
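The “last enabled LoRA wins” rule for the step count can be written out as a small sketch (assumed from the behaviour described above, not copied from the notebook):

# Assumed illustration of how the effective step count is chosen: the steps
# value of the last enabled speed-up LoRA, in notebook order from the top,
# overrides the main steps slider.
def effective_steps(base_steps, lora_settings):
    steps = base_steps
    for _name, enabled, lora_steps in lora_settings:   # top-to-bottom notebook order
        if enabled:
            steps = lora_steps                          # the last enabled LoRA wins
    return steps

# Default-like configuration: causvid off, lightx2v and fusionx on (steps 4 each).
print(effective_steps(20, [("causvid", False, 4), ("lightx2v", True, 4), ("fusionx", True, 4)]))  # 4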
The above image shows the LoRA configuration settings for activating and setting the strengths of the LoRAs you may have downloaded in the Prepare Environment section of the notebook. It’s advisable to enable only one LoRA per generation, unless you understand the LoRAs you downloaded and know that combining them won’t cause issues. To use the walking-forward LoRA downloaded through the link in lora_2_download_url, activate it by enabling use_lora2 before running the cell to generate a video.
The Teacache settings enable you to determine whether to use Teacache, and how much and when to use it. Setting rel_l1_threshless to zero disables Teacache. Setting it greater than zero can skip one or more of the steps in the video generation process, depending on the magnitude. This can speed up generation, but it can significantly reduce the quality of videos with realistic styles. This might not be a problem if you have a very good video upscale workflow. It is recommended that you disable Teacache if using the causvid, lightx2v, or fusionx LoRAs, which can produce good results in just 4 steps. Using Teacache was a reasonable way to speed up the process by skipping closely similar noise-removal steps back when we couldn’t get good results below 20 steps.
Now that you understand and know how to use the notebook,
you can proceed with your free video generation.
Notebook 2
The second notebook is titled WAN2.1 IMAGE TO VIDEO WITH CAUSVID, LIGHTX2V, & FUSIONX LoRAs (with Video upscale/enhancement and face restoration). It is an extension of notebook 1 that includes the extras mentioned in the title. It enables the generation of up to 5 seconds of 720p video from a 480p video on the T4 GPU. You can also use it to upscale and face-swap any of your images or videos, and to increase the fps of any video.
This notebook includes all the parts of the previous notebook, so you can study the details under the previous notebook before learning about the additional parts of this one. There is, however, a small difference between the notebooks in the Generate Video section.
As you can see in the image above, there’s a prompt_assist
setting which was not available in the previous notebook. It gives you four
options: “none”, “walking to camera”, “walking from camera”, and “swaying”. If
you select “none”, then only your prompt and the LoRAs you select will be used
for video generation. If you select any of the other three options, then some
inbuilt LoRAs will modify the model, and the appropriate trigger word will be
added to your prompt to generate a video that adheres more to the prompt.
The above optional section follows the Generate video section covered under notebook 1. You can run this part to upload any image or video you want to upscale and/or enhance. Note that you can only upscale 5-second videos to 720p (high definition) on the T4 GPU, and only 5-second videos at 16fps have been tested. You can enable the display_upload checkbox to see your uploaded image or video. Make sure you run the Prepare Environment section before running this section to upload your image or video.
Most of the settings in the Enhance & Upscale
Image/Video section shown above are like those shown under the Generate video
section. I will only describe those not previously described.
upscale_optional_image_or_video: If you uploaded an
image or a video with the optional ‘Upload Image/Video to be upscaled’ code
block, then enable this checkbox to upscale or enhance your uploaded image or
video. Remember to disable it if you want to upscale or enhance your generated
videos.
upscale_by: This is the factor used to upscale your images and videos. The default value of 1.5 upscales a 480p video to 720p. Going up to a value of 2 for a 480p video will likely crash the T4.
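The 1.5 default is exactly the factor that maps 480p to 720p, as a quick check of the arithmetic shows:

# Why upscale_by = 1.5 turns a 480p video into (effectively) a 720p one.
width, height = 854, 480
factor = 1.5
print(int(width * factor), int(height * factor))  # 1281 720, i.e. effectively 1280x720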
denoise: This value determines how much the resulting
image or video changes from your uploaded image or video. The default value of
0.1 can be used together with the upscale_by value of 1.5 to get an upscaled
video without much change from the original. You can reduce this value to about
0.06 to focus on upscaling the video with little change from the original. But
if the original video is of very poor quality or contains lots of warping,
artifacts or deformities, then you can enhance the video by setting this
denoise value up to 0.4 or more. You might need to set it up to 0.7 for videos of
very low quality. If you only want to enhance a video or image, then set the
upscale_by value to 1 and set this value to 0.3 or more depending on the
quality of the initial video or image. Describing your image or video in the
positive_prompt textbox might make the generated video closer to the input video
in content even with high denoise values.
upscale_model: You can choose from 3 upscale
models. The first two (4x-UltraSharp.pth and 4x_foolhardy_Remacri.pth) are for
upscaling images and videos with realistic style. The 4x-AnimeSharp.pth is for
upscaling anime or cartoon images.
select_every_nth_frame: This setting applies only to videos. If you have a video made of 200 frames, a value of 1 will output an upscaled and/or enhanced video with 200 frames. A value of 2 will output a video with 100 frames, made up of the even-numbered frames from the initial video. A value of 3 will output a video with 66 frames, and so on. Setting higher values is useful to speed up the upscaling and/or enhancement of a long video without much motion, and to avoid session crashes. The resulting video can then be returned to its original number of frames through frame interpolation.
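In list terms, this setting just keeps every nth frame, which is where the 200 → 100 → 66 numbers come from. A sketch, counting frames from 1 the way the guide does:

# What select_every_nth_frame does to the frame list (sketch). With 1-based
# counting as in the guide, n = 2 keeps frames 2, 4, 6, ... (the even ones).
frames = list(range(1, 201))      # a 200-frame video
for n in (1, 2, 3):
    kept = frames[n - 1::n]
    print(n, len(kept))           # 1 -> 200, 2 -> 100, 3 -> 66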
skip_first_frames: This setting, like the name implies,
skips the first frames from a video and the number of frames skipped depends on
the value used. A value of zero means no frame will be skipped. A value of 15
means the first 15 frames will not be included in the output video.
max_output_frames: This setting determines the maximum number of frames to upscale and/or enhance. The resulting number of frames will be less than or equal to the set value.
You can run the optional section shown above to upload any
image that contains a face you want to copy and transfer to a person in another
image or video. You can enable the display_upload checkbox to see your uploaded
image. Ensure you run the Prepare Environment section before running this code
block to upload your image.
You can run the optional section shown above to upload any
image or video that contains a face you want to change. You can enable the
display_upload checkbox to see your uploaded image or video. Ensure you run the
Prepare Environment section before running this section to upload your image or
video.
For the Restore/Change Face settings shown above, you don’t have to change anything if you just want to restore or change the face of a single person in an image. If you are working with a video, you can refer to the description of the video settings in the Enhance & Upscale Image/Video section. If you want to change or restore the faces of two or more people, the faces are numbered from left to right starting at 0. Suppose there are three people in the target (the video you generated, upscaled, or uploaded under the last optional section) and two faces in the source (the image uploaded in the ‘Upload a Reference Image to use for face restoration/change’ section). In source_face_index you can write 0,1 to represent the two reference faces you want to copy, and in target_face_index you can write 0,0,1. If you then run this section, the leftmost face in the source image will be used to replace the first two faces in the video, while the rightmost face in the source image will replace the remaining face in the video.
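Written out, the example above is just a pair of comma-separated index lists, with faces numbered left to right starting at 0 in both the source image and the target video:

# The three-person example from the text, as it would be typed into the
# settings. Faces are indexed left to right starting at 0.
source_face_index = "0,1"    # the two faces in the reference (source) image
target_face_index = "0,0,1"  # one entry per face in the target video
# Effect described in the guide: source face 0 replaces the first two faces
# in the video, and source face 1 replaces the remaining third face.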
If you set detect_gender_source to female, then only a
female in the source image will be detected and used. If you set
detect_gender_target to female, then the faces of all the females in the target
image or video will be replaced by the face in the source image.
If you want to increase the frames per second or fps of any
video, you can run the optional section shown above to upload the video. You can
enable the display_upload checkbox before running to see your uploaded video. Ensure
you run the Prepare Environment section before running this section.
You can run this last section to increase the fps of your generated or uploaded videos and create slow- or fast-motion effects. If you uploaded a video with the optional ‘Upload another Video for frame interpolation’ section, enable the interpolate_optional_video checkbox to interpolate your uploaded video. Remember to disable it if you want to interpolate your generated videos. If your video is 16 fps, the default settings will change it to 30fps with the video slowed down a little. Increasing the FRAME_MULITIPLIER while keeping vid_fps the same will result in a slower and longer video. Increasing vid_fps while keeping the FRAME_MULITIPLIER the same will result in a faster and shorter video. The crf_value determines the quality of the resulting video; lower values create higher-quality videos at the cost of larger file sizes.
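The slow-down and speed-up behaviour follows directly from the arithmetic: the multiplier scales how many frames exist, while vid_fps sets how fast they are played back. A worked example with default-like values (assuming a multiplier of 2 and vid_fps of 30 on a 16fps, 81-frame input):

# Output duration after interpolation: the multiplier scales the frame count,
# vid_fps sets the playback rate. Values below are assumed, default-like settings.
input_frames, input_fps = 81, 16
frame_multiplier, vid_fps = 2, 30

output_frames = input_frames * frame_multiplier
print(round(input_frames / input_fps, 3))   # 5.062 s original duration
print(round(output_frames / vid_fps, 3))    # 5.4 s after interpolation: slightly slowed, as noted above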
Please ensure you read this guide properly before seeking support. If you need clarification on any part of this guide or the notebooks, then email me at isijomo@yahoo.com