Muybridge Derby: Bringing Animal Locomotion Photographs to Life with AI
Background
I'm sure you've seen the series of images of a galloping horse by 19th-century English photographer Eadweard Muybridge. As a refresher, here is a GIF animation showing one of his more famous photo series.
And here's a portrait of Muybridge with an illustration of the apparatus he built to capture the photo series.
Eadweard Muybridge
Muybridge was a nature photographer commissioned by the Governor of California, Leland Stanford, to document his mansion and possessions. Stanford posed an exciting challenge to Muybridge: could he take clear pictures of a galloping horse?
1872 was the year that Muybridge began his zealous involvement with motion photography. He was commissioned by Governor Leland Stanford to photograph the moving gait of his racehorse, Occident. Until this time the gait of a moving horse had been a mystery. When did the feet touch the ground? Did all four feet ever leave the ground at the same time? Painting the feet of the galloping horse had been an unsolved problem for artists. … [He used] 12 cameras, each hooked to an electrical apparatus that would trip the shutters as the horse galloped past. … Muybridge invented the zoopraxiscope in 1879, a machine that allowed him to project up to two hundred single images on a screen. In 1880 he gave his first presentation of projected moving pictures on a screen to a group at the California School of Fine Arts, thus becoming the father of motion pictures. – Vi Whitmire [1]
And Muybridge didn't just take pictures of moving horses. He created similar sequences of moving cats, dogs, buffaloes, ostriches, people, etc.
Muybridge Derby
For this project, I wanted to see if I could use AI systems to transform Muybridge's animal locomotion photographs into high-resolution, full-color videos. After experimenting with various techniques, I settled on a combination of Midjourney, to create reference frames from text prompts, and RunwayML's Gen-1 video-to-video generator, to restyle the original sequences into something more realistic. For fun, I made a short animation, "Muybridge Derby," showcasing the work. Here it is.
In the following sections, I will describe how I transformed the locomotion sequences, generated the background scroll, and combined the elements to create the animation.
Using Midjourney to Generate Reference Frames
As a prerequisite for transforming a Muybridge photo series into a high-definition video, I generated a high-resolution reference frame in Midjourney using one of the photos from the original series and a text prompt.
For example, here is the prompt I used for generating the reference frame of the horse and jockey: "a man wearing a blue cap, blue jacket, white pants, and black boots riding a brown horse with a white background --ar 4:3." Note that the --ar 4:3 parameter sets the aspect ratio to 4:3.
I pasted a link to the image of Muybridge's frame number 2 into Midjourney along with the prompt, and it produced four thumbnail images. All four generated images were pretty good. I liked the details and texture of the images, including the look of the jockey's clothes and the shininess of the horse's coat. None of them matched the original pose of the horse exactly, but I found that it doesn't matter when stylizing a video; the video stylizer in RunwayML only picks up on the general look of the image. I chose the thumbnail image at the lower right (outlined in green) and made some edits in Photoshop: I flipped the image horizontally, changed the hue of the horse to brown, and changed the style of the jockey's cap.
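Put together, the submission looks roughly like this in Midjourney's Discord interface. The /imagine command is standard Midjourney usage, but the image URL below is just a placeholder for the link to the Muybridge frame:

```
/imagine prompt: https://example.com/muybridge-horse-frame-02.jpg a man wearing a blue cap, blue jacket, white pants, and black boots riding a brown horse with a white background --ar 4:3
```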
I repeated this process for the other four animals in the animation: a cat, a buffalo, an elephant, and an ostrich. Here are the results. You can see an image from one of Muybridge's photo series in the left column below. The middle column shows the results from Midjourney using the Muybridge image and a text prompt like "a full-color photo of a cat running on a dirt track, side view --ar 4:3." The selected thumbnail is outlined in green. The right column shows the selected image, cleaned up a bit in Photoshop and flipped horizontally if needed.
The Midjourney system did a great job generating the reference images. The details of the animals are amazing. You can click on any of the images to zoom in and see them. Again, the generated images didn't precisely match the poses in Muybridge's photos, but the overall quality of the renderings was excellent. For more information on Midjourney, you can check out my earlier article here.
Next, I will show you how I used reference images to transform the photo series with RunwayML.
RunwayML
Runway is a start-up company in New York City that researches and provides media creation and editing services that use Machine Learning (ML). They are known as RunwayML because of the URL of their website, runwayml.com. They offer a range of subscription tiers for their services at various price points: free, $12 per month, $28 per month, etc.
Here is a list of some of the services they provide:
- Super-Slow Motion – Transform video to have super smooth motion
- Video-to-Video Editing – Change the style of a video with text or images
- Remove Background – Remove, blur, or replace the video background
- Text-to-Video Generation – Generate videos with text prompts
- Image-to-Image Editing – Transform images with text prompts
I used the first three to stylize the Muybridge sequences in my video.
RunwayML's Video-to-video Editing Model
RunwayML's video-to-video editing service allows users to upload an input video and provide either text or a reference image as a prompt. The ML model will then "edit" the footage by imposing the style specified in the prompt while keeping the primary elements of the input video intact. The process is written up in their paper, Structure and Content-Guided Video Synthesis with Diffusion Models [2].
In this work, we present a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output. Conflicts between user-provided content edits and structure representations occur due to insufficient disentanglement between the two aspects. As a solution, we show that training on monocular depth estimates with varying levels of detail provides control over structure and content fidelity. … We find that depth estimates extracted from input video frames provide the desired properties as they encode significantly less content information compared to simpler structure representations. – P. Esser et al., RunwayML
Note that "monocular depth estimates" refers to a depth map, where the values of the pixels indicate the distance from the camera to the surface of the objects depicted in the scene. To get the depth estimates, they used another ML model from a group of European researchers [3]. The model is called MiDaS (which, I guess, is a backronym for Monocular Depth eStimator?) The MiDaS system was trained on a dataset of 3D movie scenes, like the shots below.
You can see how the light yellow colors in the depth map show the closer points in the scene, and the dark blue colors show the more distant points in the background. The trained MiDaS model can estimate depth maps from any input image. Here are some results from the paper.
You can see how the MiDaS model does an excellent job of estimating depths. For example, you can see how the dog's tail is well-defined as it stands out from the stream behind it.
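If you want to try depth estimation yourself, the MiDaS models are available through PyTorch Hub. Here is a minimal sketch, roughly following the usage example in the MiDaS repository, that estimates a depth map for a single frame (the filenames are placeholders, and the small model is used to keep it light):

```python
import cv2
import torch

# Load the small MiDaS model and its matching input transform from the official repo.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.small_transform

# Read one frame and convert BGR (OpenCV's default) to RGB, as the model expects.
img = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
input_batch = transform(img)

with torch.no_grad():
    prediction = midas(input_batch)
    # Resize the low-resolution prediction back to the original frame size.
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

# Normalize to 0-255 and save a grayscale depth map.
depth = prediction.cpu().numpy()
depth = (255 * (depth - depth.min()) / (depth.max() - depth.min())).astype("uint8")
cv2.imwrite("depth.png", depth)
```

MiDaS predicts relative (inverse) depth, so brighter pixels in the saved map correspond to closer surfaces, much like the light yellow regions in the depth maps shown above.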
RunwayML's video-to-video model uses the predicted depth maps of the input video to condition a diffusion video generation model directed by a text prompt or reference image.
Our latent video diffusion model synthesizes new videos given structure and content information. We ensure structural consistency by conditioning on depth estimates while content is controlled with images or natural language. Temporally stable results are achieved with additional temporal connections in the model and joint image and video training. – P. Esser et al., RunwayML
You can see some video editing results with text prompts from the paper below.
The various styles (pencil sketch, anime, and low-poly render) transform the input video to create the output, and you can see how consistently the style is applied from frame to frame. Below are some examples from the paper that use image prompts to stylize video.
Again, you can see how the color palette and the look of the prompt image transform the video to the indicated style. And details from the generated frames are rendered consistently for the final video.
Using RunwayML's Video-to-video Editing Service
To use the service, I created an account and logged in. As mentioned above, you can use the free version, which has limitations, like only generating videos up to four seconds long. I opted to pay US$12 per month, which lets me create videos up to 15 seconds long, among other benefits.
I shot a brief shadow puppet clip of a rabbit to test the system, cleaned it up in a video editor, and uploaded it to RunwayML. Then I chose the Gen-1: Video to Video tool. I loaded the clip, typed in the prompt, "photorealistic rabbit with floppy ears in a field," and hit the Preview Styles button. The system thought about it a bit and rendered four thumbnails. You can see them at the bottom of the screenshot below.
All four thumbnails looked good. They all followed the form of the shadow puppet but with a realistic rabbit entering the frame. I chose the third one and hit Generate Video. It took about 20 minutes to render the video. I also created one with the prompt, "2D animated rabbit with floppy ears in a field." You can see the results below, along with the original shadow puppet video and my cleaned-up version for reference.
The generated videos came out nice! The photorealistic one in the lower left looks the best, with lovely details in the rabbit's eyes, ears, and nose. The 2D animation render is a bit off. The system seems confused about which ear is which, and the background is less interesting. Next, I tried the same experiment with two reference images generated in Midjourney.
These came out nice, too. They both picked up the style from the reference image while following the shapes and motion in the original shadow puppet video. The one on the right has a strange effect coming in from the right, however; it almost looks like a sun flare. Notice how both generated animations show details from the background in the reference frames, like the nice clouds on the right. But the foreground forms from the reference frames are missing, like the wheat grains on the left and the tree on the right. This is probably because RunwayML trained its model on depth maps, which capture the shapes in my shadow puppet video but not the flat forms in the reference frames. The system transformed my hand movements into bunnies for the foreground imagery but kept elements of the reference image, like the field and sky, for the background.
Bringing Muybridge's Photos to Life
I used the RunwayML technique described above, with a minor variation, to transform Muybridge's original image sequences into high-res versions.
Super Slomo
Because the animals ran quickly in Muybridge's experiments, there is a lot of motion between frames. For example, here are three frames of the horse sequence.
Notice how much motion there is in the horse's legs between frames. When I experimented with video-to-video stylization on such fast-moving animation, the results were not very good. My solution was first to slow down the motion by a factor of two using RunwayML's Super-Slow Motion feature, then apply the transformation, and finally speed the resulting video up by a factor of two.
Here's what the slowed-down video looks like.
You can see less motion between frames, especially with the horse's legs. Here's what the original horse sequence looks like compared to the 50% slow-motion version.
The system did an excellent job with motion interpolation. In general, the motion is smoother with RunwayML's super-slow motion. There is a little hiccup in the action when the sequence resets, but it gets masked when I speed the transformed videos up by a factor of two.
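The final two-times speed-up is a simple operation that any video editor can handle; for the curious, here is a minimal OpenCV sketch that does the same thing by dropping every other frame (the filenames are placeholders, not my actual files):

```python
import cv2

# Double the playback speed of the stylized clip by keeping every other
# frame while writing at the same frame rate.
cap = cv2.VideoCapture("horse_stylized_slow.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("horse_stylized_2x.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % 2 == 0:  # keep even-numbered frames only -> 2x speed
        out.write(frame)
    index += 1

cap.release()
out.release()
```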
Video-to-Video Transformations
To create the transformed video, I first uploaded the slowed-down horse animation to RunwayML and then chose the Gen-1: Video to Video tool. I selected the Image Style reference and uploaded the reference frame for the horse that I created with Midjourney. Several settings are available for the transformation, including the following.
- Style: Structural consistency – Higher values make the output more structurally different from the input video.
- Style: Weight – Higher values emphasize matching the style rather than the input video.
- Frame Consistency – Values below 1 give decreased consistency across time; values above 1 increase how closely frames relate to prior frames.
You can see some examples of varying these settings on RunwayML's help page. I experimented with these settings but ultimately used the defaults: a Structural Consistency of 2, a Weight of 8.5, and a Frame Consistency of 1.
I then clicked Preview styles, and it displayed four options at the bottom.
I chose the third preview and hit the Generate video button. Here are the reference image, the original horse sequence, and the stylized animation sped up by a factor of two to match the initial speed.
This came out well! You can see how the style from the reference image got imposed on the original Muybridge animation while keeping the motion of the horse and jockey intact. The system also performed an ML-based video resize to bring the final video up to 640×480, which brought in some nice details. Note that the system has an Upscale setting, which would double the resolution horizontally and vertically.
I performed the same operations on the other four image sequences. You can see the results below, with the reference frames from Midjourney, Muybridge's original animal photo sequences, and the stylized videos by RunwayML.
These look great, too! Like the horse animation, the RunwayML model picked up the textures and coloring from the reference image and applied them to the original animations while keeping the motion intact. The backgrounds in the new animations didn't scroll from right to left, however. But this was not a problem. You can see in the next section how I created an "alpha matte" to keep just the foreground imagery and composite the animals over a new background.
Removing Background Imagery
I used RunwayML's Remove Background feature to replace the backgrounds of the running animal clips. I loaded the original video clip from Muybridge's photos and used the cursor to select two points: the horse and the jockey's leg. The system thought a bit, then showed the chosen area in green, as you can see in the screenshot below.
The system showed how it selected the foreground for all of the frames in the video, and I could play it as a preview. It did an excellent job that didn't require much work on my part. I then saved the alpha matte as a video for my compositing app.
I created a still image of a derby stadium in Midjourney and used it as a scrolling background for my animation. Wherever the matte is black, it will show the background (the stadium); wherever it is white, it will show the foreground (the horse and jockey). Here is the stylized clip of the horse, the alpha matte, and the final result.
In my compositing program, I had to clean up the alpha matte a bit. For example, I blurred the tail to make it look more like hair and not a solid object. You can see how the scrolling background helps sell the effect that the horse is running forward, which you don't see in the original stylized animation.
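For readers who want to see what the matte actually does, here is a minimal single-frame compositing sketch in Python using the standard formula, out = alpha * foreground + (1 - alpha) * background. The filenames are placeholders; my actual compositing was done clip by clip in a compositing app:

```python
import cv2
import numpy as np

# Single-frame alpha compositing sketch; all three images must be the same size.
fg = cv2.imread("horse_stylized_frame.png").astype(np.float32)      # stylized horse and jockey
bg = cv2.imread("stadium_background_frame.png").astype(np.float32)  # scrolling derby stadium
matte = cv2.imread("alpha_matte_frame.png", cv2.IMREAD_GRAYSCALE)
alpha = (matte.astype(np.float32) / 255.0)[..., None]  # white = foreground, black = background

composite = alpha * fg + (1.0 - alpha) * bg
cv2.imwrite("composite_frame.png", composite.astype(np.uint8))
```

Blurring the matte slightly before compositing, as I did with the tail, softens the edges so the foreground blends more naturally into the background.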
Here is the final animation again, this time a bit bigger, so you can check out the details.
If you want to see the animation on a large screen, it will be shown at The Next Big Thing, August 5 to September 30, 2023, at the Studio Channel Islands Art Center in Camarillo, CA.
Final Thoughts
I enjoyed working with the Muybridge images and using Midjourney and the tools at RunwayML to generate and modify media for this project. If you are familiar with my writing on Medium, you know I like to try new production methods, but I don't always create a finished piece. So it was satisfying for me to bring multiple elements together. As a "deep cut," I used a song I generated with AI for a previous article as the music played over the credits. It's called "I'll Get There When I Get There," which is kinda appropriate for a derby race.