Balancing bats 🦇
Author: Jonas Bengt Carina Håkansson, PhD | University of Colorado, Colorado Springs (UCCS) | Date: 2022-08-31 | Written as part of the DeepLabCut AI Residency Program, 2022 🎉
Introduction
I work as a postdoctoral associate in Dr. Aaron Corcoran’s lab, the Sensory & Aerial Ecology lab. My research, in short, is about tracking wing movements of bats as they fly. For this, we use DeepLabCut.
This post documents my experience using data augmentation to improve the accuracy of DeepLabCut in tracking bodyparts of bats, gathered during my time as a DeepLabCut AI Resident in the summer of 2022.
The dataset
This dataset consists of videos of bats flying freely at the Austin Bat Refuge. The videos were shot by Dr. Aaron Corcoran in 2020. The dataset is comprised of 26 video triplets. Each triplet consists of three videos depicting the same bat flight from different angles.
The locations of the labels/bodyparts on the bats.
Structuring the project for testing
Since bats are such challenging animals to automatically digitize, in my lab, we’ve been relying on what I call “refining” to improve DLC accuracy. This just means that for each video that we want to analyze, we first digitize a few frames from it and include those frames as training data for the DeepLabCut network.
We typically digitize approximately one frame per wingbeat. For a wingbeat frequency of 15 Hz and a framerate of 800 frames per second, this means approximately every 50th frame. For a lower framerate, say 400 frames per second, we instead need to digitize approximately every 25th frame. This is obviously less labour than digitizing every frame, but the workflow still scales poorly with increased acquisition. Therefore, I want to reduce the required amount of manual digitization. The challenge I’ve set myself is therefore this: to use augmentation to match or beat the accuracy of refining.
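As a quick sanity check on that arithmetic, here is a minimal sketch; the function name and numbers are illustrative, not part of our actual pipeline:

```python
def refine_interval(framerate_fps, wingbeat_hz):
    """Approximate frame step for digitizing about one frame per wingbeat."""
    return round(framerate_fps / wingbeat_hz)

print(refine_interval(800, 15))  # ~53, so roughly every 50th frame
print(refine_interval(400, 15))  # ~27, so roughly every 25th frame
```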
Test data
I randomly chose seven digitized triplets of videos for testing. Every 50th frame of these will be used for testing.
Treatments
I will use two “treatments” for testing. Firstly, I simply test how increasing the amount of training data affects accuracy. Secondly, I test how adding refining data affects accuracy. The result is four permutations of treatments in total, sorted into four shuffles as shown in table 1.
Refining
The seven testing video triplets have had every 25th frame digitized. Frames 50, 100, 150, … are used for testing. In this treatment, frames 25, 75, 125, … will be used as training data, i.e. for refining. I call the networks not trained on refining data OOD, for out of domain, and the ones trained on refining data Ref, for refined.
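To make the split concrete, here is a minimal sketch of how the digitized frame indices separate into test and refining sets (the frame range is a placeholder; the actual selection was done through the normal DeepLabCut labeling workflow):

```python
# Every 25th frame of each test video is digitized (placeholder upper bound).
digitized = list(range(25, 2000, 25))

# Every 50th frame is held out for testing ...
test_frames = [f for f in digitized if f % 50 == 0]    # 50, 100, 150, ...

# ... and the remaining digitized frames become refining (Ref) training data.
refine_frames = [f for f in digitized if f % 50 != 0]  # 25, 75, 125, ...
```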
Amount of base training data
Subtracting the seven testing video triplets, I’m left with 19 triplets (57 videos in total). I then randomly split these into two groups of 9 and 10 triplets. I will either train on only the group of 10 video triplets, or on all 19. I call these treatments “half” and “full”.
This treatment might not seem relevant for figuring out how augmentation affects performance, and in truth, I chose to perform this test to answer another question, but you will see that its results are very relevant to the challenge stated above.
I will refer to training data from the 19 videos not tested on as “base training data”.
Table 1: The two treatments result in four permutations, organized in four shuffles:
Performance with default augmentation
Accuracy of the four shuffles; refining seems crucial for driving down the pixel error. Left: performance without filtering out low-confidence predictions. Right: predictions with a confidence below 60% have been filtered out.
Looking at the plot, a few things seem clear; let’s start with the obvious and least surprising. By training a network on a subset of frames of the videos to be analyzed, we really improve accuracy. When doing this, i.e. when refining, increasing the amount of training data, i.e. frames not belonging to the videos to be analyzed, has little effect on accuracy.
But surprisingly, when not refining, it appears that increasing the amount of training data can worsen performance. What’s going on?
Moving forward, I will only consider the last snapshot, i.e. the performance after training for 150,000 iterations. Furthermore, I will only consider performance after low-confidence predictions have been filtered out.
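For readers unfamiliar with what that filtering looks like in practice, here is a minimal sketch using the standard DeepLabCut prediction output, an HDF5 table with x, y, and likelihood per bodypart (the file name is a placeholder, and this assumes a single-animal project):

```python
import pandas as pd

# DeepLabCut stores predictions with a (scorer, bodypart, coord) column MultiIndex,
# where coord is one of x, y, likelihood.
preds = pd.read_hdf("batflightDLC_predictions.h5")  # placeholder file name

scorer = preds.columns.get_level_values(0)[0]
for bodypart in preds[scorer].columns.get_level_values(0).unique():
    low_conf = preds[(scorer, bodypart, "likelihood")] < 0.6
    # Discard low-confidence predictions before computing pixel errors.
    preds.loc[low_conf, [(scorer, bodypart, "x"), (scorer, bodypart, "y")]] = float("nan")
```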
To get a better idea of why performance was so bad without refining, I plotted the accuracy per video.
Per-video performance for the four shuffles. Note that there are some videos on which all shuffles perform well regardless of refining.
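For clarity, the pixel error reported here is the usual DeepLabCut-style measure: the Euclidean distance, in pixels, between a predicted point and the corresponding manually digitized point, averaged over bodyparts and frames. A minimal sketch, with placeholder array names:

```python
import numpy as np

def mean_pixel_error(pred, truth):
    """pred, truth: (n_frames, n_bodyparts, 2) arrays of x, y pixel coordinates."""
    dists = np.linalg.norm(pred - truth, axis=-1)  # per-frame, per-bodypart distance
    return np.nanmean(dists)                       # NaNs are filtered/unlabeled points

# Grouping the test frames by source video gives the per-video accuracies plotted above.
```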
Compare “easy” and “difficult” videos
The un-refined shuffles perform rather poorly on videos 1, 2, 4, and 5, but they do well on 3, 6, and 7. Why is this? To get an idea of whether the videos on which the network does well differ from the ones on which it does poorly, I inspected the performance on individual frames.
Inspecting and comparing frames reveals an interesting relationship. It appears the network does well when the bat is flying left to right, and poorly when the bat is flying right to left.
The two following example images reveal what I mean.
The bat is flying from the right to the left, and the network has mistaken the left side of the bat for the right side.
The bat is flying from the left to the right, and the network is accurately predicting all points.
The inspection of the predicted bodypart locations reveals that the low accuracy for some videos is caused by the network mixing up left and right. This could also explain why the performance goes down when we add more data. In this case, it appears that the added data results in the network becoming biased towards one flight direction.
Augment to reduce left-right bias
Our evaluation showed that our network has a bias towards bats flying in one direction. In an attempt to reduce that bias and make the network better at telling left from right independent of flight direction, I used a left-right flipping augmentation called fliplr.
This randomly flips some training frames left and right during training. Now, this alone would lead to worse performance, since we would be training the network to think that the left wing is the right wing, and vice versa. But this augmentation method takes that into account, and in addition to mirroring the image, it also swaps the labels from left to right and vice versa. To accomplish this, we need to supply the pose_cfg.yaml file with a list of which points should be swapped.
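To make that concrete, here is a minimal sketch of what a label-aware horizontal flip does to a single training frame. This is not DeepLabCut’s internal code, just the concept; DeepLabCut applies the equivalent logic when fliplr is enabled and given the swap list:

```python
import numpy as np

def fliplr_with_swap(image, keypoints, swap_pairs):
    """Mirror an image and its keypoints, then swap left/right label identities.

    image:      (H, W, 3) array
    keypoints:  (n_bodyparts, 2) array of x, y pixel coordinates
    swap_pairs: list of (i, j) index pairs, e.g. left wingtip <-> right wingtip
    """
    flipped = image[:, ::-1]                    # mirror the image horizontally
    kps = keypoints.copy()
    kps[:, 0] = image.shape[1] - 1 - kps[:, 0]  # mirror the x coordinates
    for i, j in swap_pairs:                     # keep "left" labels on the left wing
        kps[[i, j]] = kps[[j, i]]
    return flipped, kps
```

Without that final swap step, the flipped frames would teach the network that the left wingtip is the right wingtip, which is exactly the failure mode we are trying to fix.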
Since the challenge I set is to use augmentation to match or beat the accuracy of refining, moving forward, I’ll stick to comparing the baseline refined network (shuffle 4) to the augmented un-refined network (shuffle 3 with augmentations applied).
Interpret performance of network after fliplr augmentation
Accuracy of the refined baseline (default augmentation, no flipping) and the un-refined fliplr-augmented networks. The augmentation has significantly improved the performance, but we are not beating the refined baseline quite yet.
That is a significant improvement in accuracy. The fliplr augmentation seems to prevent the network from becoming biased towards one flight direction. It is, however, not quite beating the performance of the baseline refined case (shuffle 4: full, Ref).
There is another augmentation that, in our case, is related to the flight direction, namely the rotation augmentation. The idea here is that in order to prevent the network from associating absolute in-image spatial orientation with positions for certain labels, we want to randomly rotate some of the training data. The degree to which this makes sense is context dependent. If you are analyzing a horse walking in a horizontal direction from the camera’s point of view, large degrees of rotation make little sense as the network will never encounter a horse walking upside down. But if you are filming an animal from above, or below if the animal is flying or swimming, such that the animal is free to move in any direction, then this type of augmentation makes sense.
By default, DeepLabCut applies a 25° rotation augmentation. I tried increasing that, first to 90°, then to 180°. I will only cover the 180° case here, as the 90° case behaves exactly the same, only with slightly worse performance.
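As a conceptual sketch of what the rotation augmentation does to a training sample, here it is expressed directly with the imgaug library, which DeepLabCut’s imgaug augmentation pipeline builds on; in the actual experiment the rotation range is simply set in the pose_cfg.yaml file, and the image and keypoints below are placeholders:

```python
import numpy as np
import imgaug.augmenters as iaa
from imgaug.augmentables.kps import Keypoint, KeypointsOnImage

# Rotate each sampled frame by a random angle in [-180, 180] degrees,
# transforming the keypoints along with the image.
rotate = iaa.Affine(rotate=(-180, 180))

image = np.zeros((480, 640, 3), dtype=np.uint8)             # placeholder frame
kps = KeypointsOnImage([Keypoint(x=320, y=240)], shape=image.shape)

image_aug, kps_aug = rotate(image=image, keypoints=kps)
```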
Performance with fliplr and 180° rotation augmentation
Accuracy of the refined baseline (default augmentation, no flipping) and the un-refined fliplr and 180° rotation augmented networks. With this augmentation, we are beating the baseline refined performance.
With this, it appears that we have reached our goal, which was to use augmentation to match or beat the accuracy of refining. But let’s also take a look at the accuracy per video. I will also plot the baseline, un-refined full network accuracy here to highlight how much the augmentation has improved the performance of the network.
Per-video accuracy of the refined baseline (default augmentation, no flipping) and the un-refined fliplr and 180° rotation augmented networks. I have also included the baseline un-refined network to highlight how much the augmentation has improved the performance.
The results look very promising. Video 1 keeps being difficult. This is likely due to this video depicting a bat flying in a manner that is atypical for the dataset used here. Most videos in the dataset depict rather straight-forward flybys, but video 1 shows a bat slowly ascending while turning. Even so, the augmentation has lowered the pixel error for video 1 compared to the baseline, and for all other videos, the augmentation has led to performance close to, or better than, the refined case.
Conclusion
What follows are some of my conclusions from using augmentation to improve performance when tracking wing movement of bats using DeepLabCut.
Where scalability is not important, consider refining
We saw from our baseline accuracy tests that, when refining, there was no gain from increasing the amount of base training data. In practical terms, this means that if you know which videos you want to analyze, and you do not plan on using the network to analyze other videos in the future, then it makes sense to prioritize digitizing frames from the videos to be analyzed.
Inspect bodypart predictions and augment with purpose
At first, I tried a couple of permutations of different image augmentations and saw little or no performance improvement. I tried augmentations related to scale, brightness, blurriness, and probably some others that slip my mind at the moment. I more or less concluded that augmentations have little effect and that, for a given set of training and test data, DeepLabCut’s default training parameters are close to optimal and little can be gained by augmenting the training data. But then I looked closer at the performance on individual frames, realized that the high pixel errors were mainly caused by the network getting left and right mixed up, and augmented accordingly. Doing that, I saw an average accuracy improvement of more than 10 pixels. For this dataset, 10 pixels is quite big; for comparison, on the largest image in the test dataset, the forearm of the bat is less than 70 pixels long.