Wan 2.2 Google Colab Notebook for Text to Image/Video, Image to Video, and First-Last Frame to Video
A 1920x1080 image generated with wan2.2
This is a guide on how to use a Wan 2.2 Google Colab notebook for text to image, text to video, image to video, and two-image (first & last frame) to video generation, with results that can outshine those from commercial platforms. Wan 2.2, an improvement on Wan 2.1, is the top open-source video generation model at the time of writing, and LoRAs built on Wan 2.1 still work with Wan 2.2, which significantly expands its image and video generation capabilities. You can learn more about Wan 2.2 from this GitHub repository: https://github.com/Wan-Video/Wan2.2
This notebook enables you to generate at least 5 seconds of standard-definition (480p) video from a single image for free in about 13 minutes using the T4 GPU offered by Google Colab. It also enables you to generate a 4-second high-definition (720p) video for free in less than 35 minutes, and 1920x1080 images in less than 4 minutes.
Now let’s proceed with using this notebook. 
Before running the code to prepare the environment on the
free T4 GPU, check the following settings.
Model_quant: You can select any of four pairs of models, from the smallest to the largest in terms of memory requirements: Q4_K_M, Q5_K_M, Q6_K, or Q8_0. Each pair is made of a high noise model and a low noise model. The high noise model runs for a set number of steps, usually half of the total number of denoising steps, and the low noise model completes the denoising process with the remaining steps. You can generate a 5-second (frames = 81) 480p video with model quants Q4_K_M to Q5_K_M, and sometimes Q6_K depending on the content of the image. If you want to use the Q8_0 model, you will have to reduce the number of frames or your T4 session will crash. For a 720p video, you can use Q4_K_M to generate a 4.5-second (frames = 73) video; you will have to reduce the number of frames to use the higher quants.
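As a rough rule of thumb, the quant, resolution, and frame count together determine whether the free T4 will cope. The helper below only encodes the limits quoted in this paragraph; it is my own sketch, not something the notebook actually runs.

```python
def likely_fits_t4(quant: str, width: int, height: int, frames: int) -> bool:
    """Rough rule of thumb for the free T4, based only on the limits quoted in this guide."""
    if max(width, height) >= 1280:                     # 720p-class video
        return quant == "Q4_K_M" and frames <= 73      # ~4.5 s; higher quants need fewer frames
    if quant in ("Q4_K_M", "Q5_K_M", "Q6_K"):          # 480p-class video
        return frames <= 81                            # ~5 s; Q6_K sometimes crashes, depending on the image
    return frames < 81                                 # Q8_0: reduce the frame count to avoid a crash
```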
download_loRA_1 & lora_1_download_url: You can put the link to a LoRA from Hugging Face or Civitai in the lora_1_download_url textbox and select download_loRA_1 to download it. The same applies to download_loRA_2 & lora_2_download_url and download_loRA_3 & lora_3_download_url, letting you download up to 3 LoRAs.
token_if_civitai_url: If any of your LoRA links are from Civitai, you will need to put your Civitai token in this textbox. You can learn how to create your Civitai token from this video: https://youtu.be/49NkIV_QpBM
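For context, this is roughly what the LoRA download looks like behind the scenes. The snippet is only a hedged sketch of the general pattern (Civitai download links accept a token query parameter); the URL, path, and token shown are placeholders, and the notebook's own code may differ.

```python
import requests

def download_lora(url: str, out_path: str, civitai_token: str = "") -> None:
    """Download a LoRA file, appending the token for Civitai links (illustrative sketch)."""
    if "civitai.com" in url and civitai_token:
        sep = "&" if "?" in url else "?"
        url = f"{url}{sep}token={civitai_token}"
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

# Hypothetical example (URL, path, and token are placeholders):
# download_lora("https://civitai.com/api/download/models/12345",
#               "loras/my_lora.safetensors", civitai_token="YOUR_TOKEN")
```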
You can now click run to prepare the environment. 
After the environment setup is completed, you can run the above sections of the notebook to upload image 1 and image 2. If you only want to do text to image or text to video generation, you can skip these sections; they are useful if you want to do ‘image to video’ or ‘first-last frame to video’ generation. If you want an uploaded image to be displayed, enable the display_upload checkbox. Note that only jpg, jpeg, or png images will be displayed; other image formats may or may not work for video generation.
There are lots of settings under Generate Image/Video. Let’s
first look at the above settings.
generate_image: Enabling this setting will result in
the generation of an image. The highest image resolution I have tried with
great results is 1920 x 1080.
text_to_video: Enabling this setting will result in
videos generated only from your prompt. This will be the case even if you
uploaded an image in the previous section.
use_image1_as_first_last: Enabling this setting will
make image 1, which may have been uploaded in the previous section, the first
and last frames of the generated video. The resulting video can be used as a
looping video. If no image was uploaded, then only your prompt will be used for
video generation.
use_image2_as_first_last: Enabling this setting will
make image 2, which may have been uploaded in the previous section, the first
and last frames of the generated video. The resulting video can be used as a
looping video. If no image was uploaded, then only your prompt will be used for
video generation.
disable_image1: Enabling this setting will ensure
that image 1 is not used in video generation. If image 2 was uploaded, then it
will serve as the last frame of the generated video.
disable_image2: Enabling this setting will ensure
that image 2 is not used in video generation. If image 1 was uploaded, then it
will serve as the first frame of the generated video. This is like a normal ‘image
to video’ generation.
If none of the settings above are enabled:
- If no image was uploaded, then only your prompt will be used for video generation.
- If only image 1 was uploaded, then it will be used as the first frame of the generated video. This is how this notebook is used for ‘image to video’ generation.
- If only image 2 was uploaded, then it will be used as the last frame of the generated video.
- If both image 1 and image 2 are uploaded, then image 1 will be used as the first frame of the generated video, while image 2 will be used as the last frame.
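If the interaction of these checkboxes is hard to keep straight, the sketch below restates the rules above as plain Python. It is only an illustration of the logic described in this guide, not the notebook's actual code.

```python
def pick_frames(image1, image2, text_to_video=False,
                use_image1_as_first_last=False, use_image2_as_first_last=False,
                disable_image1=False, disable_image2=False):
    """Return (first_frame, last_frame) according to the rules described above."""
    if disable_image1:
        image1 = None
    if disable_image2:
        image2 = None
    if text_to_video:
        return None, None              # prompt-only generation, even if images were uploaded
    if use_image1_as_first_last and image1 is not None:
        return image1, image1          # looping video built from image 1
    if use_image2_as_first_last and image2 is not None:
        return image2, image2          # looping video built from image 2
    return image1, image2              # image 1 = first frame, image 2 = last frame (either may be None)
```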
positive_prompt: You can put a prompt in this textbox to describe the image or video you want to generate. If the LoRA you want to use has trigger words, remember to include them in your prompt.
prompt_assist: This setting gives you five options:
“none”, “walking to camera”, “walking from camera”, “swaying”, and “improve
realism”. If you select “none”, then only your prompt and the LoRAs you select
will be used for video generation. If you select any of the other four options,
then some built-in LoRAs will modify the model, and the appropriate trigger
word will be added to your prompt to generate a video that adheres more to the
prompt. You may not need this setting for wan2.2 as it follows your prompts
much better than wan2.1.
negative_prompt: The default negative prompt is in Chinese,
and you can leave it as it is. If the cfg_scale is 1, the negative prompt will
be ignored.
width & height: For a high quality image,
you can set the width and height to 1920 and 1080 respectively. For a standard
definition or 480p video, set the width and height to 480 and 854 respectively
for a portrait video, and to 854 and 480 respectively for a landscape video.
For a high definition or 720p video, set the width and height to 720 and 1280
respectively for a portrait video, and to 1280 and 720 respectively for a
landscape video.
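For quick reference, here are the same width/height combinations collected in one place (the preset names are just labels of mine, not settings in the notebook):

```python
# (width, height) presets from this guide; the names are only labels for convenience.
RESOLUTIONS = {
    "image_1080p":          (1920, 1080),
    "video_480p_portrait":  (480, 854),
    "video_480p_landscape": (854, 480),
    "video_720p_portrait":  (720, 1280),
    "video_720p_landscape": (1280, 720),
}
```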
seed: The default seed is 0, which causes a random seed to be generated on each run, so you get different images or videos. If you want to see how changes to other settings affect the generated video, change the seed to a non-zero value. This keeps the seed constant, and the same video will be generated every time as long as all other settings remain the same.
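In other words, the seed is likely handled along these lines (an assumption about the notebook's behaviour, based on how seeds usually work):

```python
import random

def resolve_seed(seed: int) -> int:
    """0 means 'pick a new random seed each run'; any other value is reproducible."""
    if seed == 0:
        return random.randint(1, 2**32 - 1)   # different output every run
    return seed                               # same output every run if all other settings are unchanged
```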
High_noise_steps: The default value is 3. This means
that the high noise model will be used for the first 3 denoising steps while
the low noise model will be used for the remaining steps. This value is usually
set to half of the total steps.
steps: The default steps value is 6. If the lightx2v LoRAs are disabled, this is usually set to 20, with high_noise_steps then set to 10. The greater the value, the higher the video quality, but the longer it takes to generate a video. The quality stops improving beyond a certain value; more tests are needed to determine what that value is.
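Conceptually, the two models share one denoising schedule: the high noise model handles the first high_noise_steps steps and the low noise model finishes the rest. The snippet below is just an illustration of that split, not the notebook's actual sampling code.

```python
def split_steps(total_steps: int, high_noise_steps: int):
    """Illustrative split of the denoising schedule between the two models."""
    high = list(range(0, high_noise_steps))           # defaults (6, 3): steps 0-2 on the high noise model
    low = list(range(high_noise_steps, total_steps))  # steps 3-5 on the low noise model
    return high, low

# With the lightx2v LoRAs disabled, the guide suggests steps=20 and high_noise_steps=10:
# split_steps(20, 10) -> first 10 steps on the high noise model, last 10 on the low noise model.
```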
cfg_scale: The default cfg_scale for this notebook is 1. This assumes that the lightx2v LoRAs will be enabled, and the recommended cfg for these LoRAs is 1. If none of these LoRAs are enabled, then you can increase the cfg_scale for normal video generation with the base Wan 2.2 models, which also makes the negative prompt take effect.
sampler_name: There are many samplers to choose from.
Different samplers result in different video quality and details. You can try
all of them to see which gives the best result.
scheduler: The scheduler plays a key role in
maintaining temporal consistency (making sure things like objects, lighting,
and movement stay smooth and steady from one video frame to the next). It sets
the plan for how noise is removed over time, while the sampler follows that
plan to generate each frame from noise using the model. The default scheduler
is ‘simple,’ but you are free to experiment with others.
frames: 65 frames are equivalent to a 4 second 16fps
video. This is the recommended duration for a 720p video on the T4 GPU. Greater
values will likely crash the T4. You can set it to a value like 81 if you want
to generate a 5 second 480p video. You can try higher frame values to test the
limits for a 480p video.
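The arithmetic behind those frame counts is simple: the video plays at 16 fps, and the counts in this guide follow a 16 × seconds + 1 pattern. A quick helper (my own, not part of the notebook):

```python
def frames_for(seconds: float, fps: int = 16) -> int:
    """Frame count for a clip of the given length, including the initial frame."""
    return int(seconds * fps) + 1

print(frames_for(4))  # 65 -> the recommended 720p length on the T4
print(frames_for(5))  # 81 -> the recommended 480p length
```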
match_colors: Sometimes the lighting and color of the
video might be a bit different from the input image. You can enable this
setting to make input and output similar in color and lighting.
overwrite_previous_video: You can find your generated videos in the comfyui/output/ directory in the file browser on the left side of the Colab notebook. If you don’t select the overwrite_previous_video checkbox, all the videos you generate will be saved in the output folder. If you select it, each generation will replace the previous one, keeping only one video in the output folder.
It’s better to use the default model configuration settings, but let’s look at what each one does.
use_sage_attention: Selecting this enables SageAttention, which helps speed up video generation.
flow_shifts: This acts as a control knob for the video generation process. By adjusting it, you can fine-tune the energy of the motion in generated videos and potentially enhance visual quality, especially for faster generation with fewer steps. Lower values tend to lead to less movement in the video, while higher values cause more energetic movement. 8.0 is recommended for Wan 2.2 based on test results. flow_shift is for the high noise model, while flow_shift2 is for the low noise model.
The above image shows the activation, strength, and steps settings for the lightx2v LoRAs, which enable the generation of 5 seconds of 480p video in about 13 minutes and 4 seconds of 720p video in around 28 minutes on the T4. Without these LoRAs, the steps value would have to be set to 20 for good results. The lightx2v and lightx2v2 LoRAs, with strengths of 3 and 1.5, are for the high noise and low noise models respectively. This may not necessarily be the best setting overall, but it’s the best setting many have tried.
The above image shows the LoRA configuration settings for activating and setting the strengths of the LoRAs you may have downloaded in the Prepare Environment section of the notebook. It’s advisable to enable only one LoRA per generation, unless you understand the LoRAs you downloaded and know that combining them won’t cause issues.
The TeaCache settings let you determine whether to use TeaCache, and how much and when to apply it. Setting rel_l1_thresh to zero disables TeaCache. Setting it greater than zero can skip one or more of the steps in the video generation process, depending on the magnitude. This can speed up generation, but it can significantly reduce the quality of videos with realistic styles. This might not be a problem if you have a very good video upscale workflow. It is recommended that you disable TeaCache when using the lightx2v LoRAs, which can produce good results in just 6 steps. TeaCache made sense for speeding up the diffusion process by skipping closely similar noise-removal steps back when we couldn’t get good results below 20 steps.
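Conceptually, TeaCache compares how much the model's input changed between consecutive steps and reuses the previous result when the relative change is below the threshold. The sketch below illustrates that idea in the abstract; it is a simplified conceptual example, not the TeaCache implementation used by the notebook.

```python
import torch

def should_reuse_cache(curr: torch.Tensor, prev, rel_l1_thresh: float) -> bool:
    """Conceptual sketch: skip recomputation when the relative L1 change between steps is small."""
    if rel_l1_thresh == 0 or prev is None:
        return False                              # a threshold of 0 disables caching entirely
    rel_change = (curr - prev).abs().mean() / (prev.abs().mean() + 1e-8)
    return rel_change.item() < rel_l1_thresh      # small change -> reuse the cached result, skip the step
```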
You can run this last section to increase the fps of your generated videos and create slow or fast motion effects. The default settings will change your video to 30fps, slowing it down a little. Increasing the FRAME_MULITIPLIER while keeping vid_fps the same will result in a slower and longer video. Increasing vid_fps while keeping the FRAME_MULITIPLIER the same will result in a faster and shorter video. The crf_value determines the quality of the resulting video: lower values create higher quality videos, at the cost of larger file sizes.
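To reason about how these two settings trade off, think of the output length as roughly (original frames × frame multiplier) ÷ vid_fps. A small worked example follows; the variable names are generic, and the 2× multiplier is an assumed illustration rather than necessarily the notebook's default.

```python
def output_duration(frames: int, frame_multiplier: int, vid_fps: int) -> float:
    """Approximate output length after frame interpolation and re-timing."""
    return (frames * frame_multiplier) / vid_fps

# An 81-frame clip (~5.1 s at Wan's native 16 fps), interpolated 2x and played at 30 fps,
# comes out slightly longer and slower: ~5.4 s.
print(output_duration(81, 2, 30))   # 5.4
```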
Now that you know how to use the notebook, you can proceed with your free video generation.
The following are example prompts you can try for ‘text to image’ generation.
A spacious, ultra-modern mansion living room with towering
transparent glass walls revealing a lush garden and infinity pool outside,
sunlight streaming in and casting soft reflections on polished marble floors,
furnished with luxurious leather sectional sofas, a massive wall-mounted 8K
television, a glass coffee table with a sculptural metal base, elegant
chandeliers, built-in bookshelves, indoor plants in artistic vases, and subtle
ambient lighting, hyperrealistic, 16mm wide-angle lens, ultra-detailed, photorealism.
A spacious, ultra-modern mansion living room at night, with
towering transparent glass walls revealing a moonlit garden and glowing
infinity pool outside, interior illuminated by warm recessed lighting and the
soft flicker of a modern fireplace, furnished with plush leather sectional
sofas, a massive wall-mounted 8K television displaying a paused movie scene, a
glass coffee table with a sculptural metal base, elegant crystal chandeliers
casting subtle reflections on polished marble floors, potted indoor palms, and
discreet high-end audio speakers, hyperrealistic, cinematic lighting,
ultra-detailed, photorealism.
A spacious, ultra-modern mansion living room at night, with
towering transparent glass walls revealing a moonlit garden and glowing
infinity pool outside, interior illuminated by warm recessed lighting and the
soft flicker of a modern fireplace, furnished with plush leather sectional
sofas, a massive wall-mounted 8K television displaying a paused movie scene, a
glass coffee table with a sculptural metal base, elegant crystal chandeliers
casting subtle reflections on polished marble floors, potted indoor palms, and
discreet high-end audio speakers, A man, his wife and two children sit on the
sofa watching the TV, hyperrealistic, cinematic lighting, ultra-detailed,
photorealism.
A massive steampunk city built on floating islands above a
raging storm, with intricate brass gears powering suspended bridges between
islands, glowing lanterns illuminating airships drifting past, and distant
lightning reflecting on the polished copper rooftops, cinematic wide-angle
view, ultra-detailed, 8K, by Greg Rutkowski and Simon Stålenhag.
A grand library carved inside the trunk of an ancient,
colossal tree, shelves spiraling up the hollow interior, glowing bioluminescent
plants intertwined with carved wooden ladders, scholars in flowing robes
studying magical scrolls, warm candlelight mixed with emerald-green ambient
light, fantasy realism, ultra-detailed, 16mm lens.
A cyberpunk samurai standing in a neon-soaked alley during a
heavy rainstorm, katana glowing with holographic runes, holographic koi fish
swimming through the air around him, steam rising from sewer grates,
reflections of skyscraper billboards on wet pavement, dramatic low-angle shot,
photorealistic.
A candid street photograph of an elderly woman selling
flowers at a busy European market, wrinkles and skin texture sharply detailed,
sunlight filtering through striped awnings casting soft shadows on her face,
vibrant bouquets of roses and tulips in the foreground, Leica M10 style,
shallow depth of field.
A close-up portrait of a firefighter emerging from a burning
building, soot and sweat on their face, oxygen mask pulled aside revealing
determined eyes, flickering orange light reflecting in the visor, background
blurred with bokeh sparks and smoke particles, cinematic realism.
A family gathered around a wooden dining table during a
rainy evening, warm light from a single hanging lamp illuminating faces in soft
gradients, realistic eye contact and subtle smiles, reflections of raindrops on
the window behind them, food half-eaten, Canon EOS R5, ultra-realistic skin
tones.
A marathon runner collapsing at the finish line, mud and
sweat streaked across their legs, exhausted expression with trembling hands
gripping knees, crowd cheering in the blurred background, overcast daylight,
authentic human proportions and posture, Sony Alpha 1, sports photography
realism.
A man gently holding his lover in his arms as they stand on
the edge of a skyscraper rooftop at dusk, both gazing down at the sprawling
city below, golden-orange evening light fading into the cool blues of
approaching night, skyscraper windows glowing with warm lights, subtle breeze
ruffling their hair and clothes, realistic human proportions and facial
expressions, cinematic composition with depth of field, ultra-detailed,
photorealistic.
A man and his wife sitting and watching TV in a spacious,
ultra-modern mansion living room with towering transparent glass walls
revealing a lush garden and infinity pool outside, sunlight streaming in and
casting soft reflections on polished marble floors, furnished with luxurious
white leather sectional sofas, a massive wall-mounted 8K television showing a
movie, a glass coffee table with a sculptural metal base, elegant chandeliers,
built-in bookshelves, indoor plants in artistic vases, and subtle ambient
lighting, hyperrealistic, 16mm wide-angle lens, ultra-detailed, photorealism.
Right-back view of a 2023 Ferrari SF90 Stradale in glossy
Rosso Corsa Red with black carbon-fiber accents, 20-inch forged alloy wheels
with diamond-cut finish and yellow brake calipers, aggressive aerodynamic
design with sculpted curves, sleek LED headlights, and side air intakes, moving
on a tarred road at sunrise, hyperrealistic, ultra-detailed, 8K resolution.
Please ensure you read this guide properly before seeking support. If you need clarification on any part of this guide or the notebooks, email me at isijomo@yahoo.com.