Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity. Once optimized on the motion reference video, this embedding can be applied to various target images to generate videos with semantically similar motions.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks such as full-body and face reenactment, as well as controlling the motion of inanimate objects and the camera. We empirically demonstrate the effectiveness of our method in the semantic video motion transfer task, significantly outperforming existing methods in this context.
In image-to-video models, the image input primarily dictates the appearance of the generated videos.
| Input image | "A white horse walking" | "A pink horse walking" |
| --- | --- | --- |
For example, I2VGen-XL generates a video of a predominantly white horse from a white horse image, even when the input text specifies the horse’s color as "pink."
In image-to-video models, text/image embeddings significantly influence the generated motions.
|  | CLIP embedding: Real horse | CLIP embedding: Toy horse |
| --- | --- | --- |
| Image latent: Real horse | | |
| Image latent: Toy horse | | |
Swapping the CLIP image embeddings of a real horse and a toy horse in Stable Video Diffusion results in a swap of the motions in the output videos. This suggests that the real horse’s embedding encodes a walking motion, while the toy horse’s embedding encodes camera motion without object movement.
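As a rough sketch of how such a swap can be probed (assuming the diffusers checkpoint layout of `stabilityai/stable-video-diffusion-img2vid-xt` with `image_encoder` and `feature_extractor` subfolders), the snippet below only extracts the CLIP image embedding that SVD injects via cross-attention; the swap itself, i.e., pairing each first-frame latent with the other image's embedding during sampling, is described in the final comment rather than implemented. File names are hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Load SVD's CLIP vision tower on its own (normally it is part of the full pipeline).
REPO = "stabilityai/stable-video-diffusion-img2vid-xt"
processor = CLIPImageProcessor.from_pretrained(REPO, subfolder="feature_extractor")
encoder = CLIPVisionModelWithProjection.from_pretrained(REPO, subfolder="image_encoder")

@torch.no_grad()
def clip_embed(path: str) -> torch.Tensor:
    """Projected CLIP image embedding, shape (1, 1, d): a single token of cross-attention context."""
    pixels = processor(images=Image.open(path), return_tensors="pt").pixel_values
    return encoder(pixels).image_embeds.unsqueeze(1)

# Hypothetical file names for the two inputs of the experiment.
e_real = clip_embed("real_horse.png")
e_toy = clip_embed("toy_horse.png")

# The swap: keep each input's first-frame latent, but pass the *other* image's
# embedding as the UNet's cross-attention context during sampling. The generated
# motion then follows the embedding, not the latent.
```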
The baseline image-to-video diffusion model, Stable Video Diffusion in our case, receives the first frame in two places: as an image latent concatenated with the noisy video latent and as a CLIP image embedding injected via cross-attention. We propose to replace the image embedding \(\mathbf{e}\) with a learned motion-text embedding \(\mathbf{m}^*\). This motion-text embedding is optimized directly with a regular diffusion model loss on a given motion reference video \(\mathbf{x}_0\) while keeping the diffusion model frozen.
Note that the diffusion process operates in latent space in practice, and other conditionings and model parameterizations are omitted for clarity.
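The following is a minimal sketch of this optimization, assuming a simplified noise-prediction formulation and toy shapes; `FrozenDenoiser` is a tiny stand-in for the frozen Stable Video Diffusion UNet, and the random initialization, learning rate, and iteration count are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

# Toy shapes (assumptions): a 14-frame latent video and a d = 1024 embedding,
# matching SVD's CLIP ViT-H projection dimension.
F_frames, C, H, W, d = 14, 4, 40, 72, 1024

class FrozenDenoiser(torch.nn.Module):
    """Stand-in for the frozen image-to-video UNet epsilon_theta; the real model is
    conditioned on the first-frame latent (channel concatenation) and on the
    embedding via cross-attention."""
    def __init__(self):
        super().__init__()
        self.to_cond = torch.nn.Linear(d, C)
        self.conv = torch.nn.Conv2d(2 * C, C, kernel_size=3, padding=1)

    def forward(self, x_t, t, m, first_frame_latent):
        cond = self.to_cond(m.mean(dim=(0, 1)))[None, :, None, None]   # pooled embedding
        inp = torch.cat([x_t, first_frame_latent.expand_as(x_t)], dim=1)
        return self.conv(inp) + cond  # predicted noise (timestep ignored in this toy model)

denoiser = FrozenDenoiser().requires_grad_(False)   # the diffusion model stays frozen
x0 = torch.randn(F_frames, C, H, W)                 # latents of the motion reference video
first_frame_latent = x0[:1]                         # latent of the reference's first frame

# The learned motion-text embedding m* replaces the image embedding e.
# It starts here as a single token of dimension d and is inflated below.
m_star = torch.nn.Parameter(torch.randn(1, 1, d))
optimizer = torch.optim.Adam([m_star], lr=1e-3)

for step in range(1000):
    noise = torch.randn_like(x0)
    alpha_bar = torch.rand(())                      # stand-in for a real noise schedule
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    pred = denoiser(x_t, alpha_bar, m_star, first_frame_latent)
    loss = F.mse_loss(pred, noise)                  # regular diffusion (noise-prediction) loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```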
For best results, the motion-text embedding is inflated prior to optimization to \((F+1) \times N\) tokens, where \(F\) is the number of frames and \(N\) is a hyperparameter, while keeping the embedding dimension \(d\) the same to stay compatible with the pre-trained diffusion model. For more details regarding the motion-text embedding and cross-attention inflation, please refer to Sections 3.4 and B.4 of the paper.
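As a shape-level sketch of the inflation (assuming \(F = 14\) frames and \(d = 1024\) as above), the embedding simply becomes a larger learned tensor of \((F+1) \times N\) tokens; how each frame's cross-attention selects its tokens is the inflation scheme of Sections 3.4 and B.4 and is only indicated schematically, with the routing in `tokens_for_frame` being an assumption for illustration.

```python
import torch

# Inflation to (F + 1) x N tokens of dimension d (default ablation setting: F' = 15, N = 5).
F_frames, N, d = 14, 5, 1024

# Instead of a single token, learn one set of N tokens per frame index (plus one extra set).
m_star = torch.nn.Parameter(torch.randn(F_frames + 1, N, d))
print(m_star.shape)  # torch.Size([15, 5, 1024]) == (F + 1, N, d)

def tokens_for_frame(m: torch.Tensor, f: int) -> torch.Tensor:
    """Cross-attention context for frame f, shape (N, d).

    Assumption for illustration only: token set 0 is the extra '+1' set and
    set f + 1 belongs to frame f; the actual routing is described in Sec. B.4."""
    return m[f + 1]

frame_3_tokens = tokens_for_frame(m_star, 3)   # shape (5, 1024)
```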
We compare our method against Stable Video Diffusion (SVD; the baseline without motion input), VideoComposer (VC), and MotionDirector (MD) on three different motions and target images: full-body reenactment, face reenactment, and camera motion.
Our method uniquely preserves the input image’s appearance and layout while successfully transferring the semantic motion of the video.
Our learned motion-text embeddings store not only the rough motion category but also the style of the motion.
| Motion reference videos | Generated videos |
| --- | --- |
Here, we apply two different gaits to the same target image: a horse trot (smooth) and a canter (rocking). The resulting videos of the cartoon dog not only show the dog moving; their motions also closely match the gait style of the motion reference video. Furthermore, the extreme cross-domain examples with the boat, car, and cereal box show that the essence of the motion style is preserved even across completely different objects.
Our learned motion-text embeddings store the semantic motion (the animal moves in the direction it is facing and lowers its head) rather than the spatial motion (the animal moves from right to left and the left part moves down).
| Motion reference video | Generated video for regular horse image input | Generated video for flipped horse image input |
| --- | --- | --- |
This can be seen in the example above: applying the same learned motion-text embedding to a flipped input image still produces semantically similar results.
We generate videos using our optimized motion-text embedding for a "jumping jacks" motion (same reference as in the ablation below), both with the image input (conditional) and without it (unconditional), after different numbers of optimization iterations.
| Iteration | 0 | 200 | 500 | 1000 | 2000 |
| --- | --- | --- | --- | --- | --- |
| Conditional | | | | | |
| Unconditional (motion visualization), seed 0 | | | | | |
| Unconditional (motion visualization), seed 1 | | | | | |
Note how the appearance of the unconditional generations differs from the motion reference video and varies with different seeds. Further observe that our method effectively generates similar semantic motions without needing or enforcing spatial alignment.
Our proposed motion-text embedding inflation is crucial for successful motion transfer.
Reference | \(F'=1, N=1\) | \(F'=1, N=15\) | \(F'=15, N=1\) | \(F'=15, N=5\) (Default) | \(F'=15, N=15\) |
---|---|---|---|---|---|
While adding more tokens (increasing \(N\)) already improves the results, the biggest gain comes from having different tokens for each frame (i.e., \(F' = F+1 = 15\)).
Our method is limited by the priors and quality of the pre-trained image-to-video model, which may lead to artifacts (e.g., identity changes as the head moves in the first example). Furthermore, there may be some structure leakage in some cases, causing certain characteristics of the motion reference video to remain visible (e.g., human-like legs on a kangaroo in the second example). Lastly, our method at times struggles to transfer spatially fine-grained motion (e.g., the typing motion is not transferred to the dinosaurs in the third example).
@article{kansy2024reenact,
title={Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion},
author={Kansy, Manuel and Naruniec, Jacek and Schroers, Christopher and Gross, Markus and Weber, Romann M},
journal={arXiv preprint arXiv:2408.00458},
year={2024}
}