Creating Infinite Bad Guy

Machine learning for cover song analysis

The Task

  1. Estimate the per-beat alignment between the cover and the original.
  2. Visually classify the video.
  3. Identify the most visually similar videos.

Audio Alignment

  1. Copies of the original. This includes dance covers, workout routines, lyric videos and other visual covers.
  2. Covers with digital instrumental backing tracks. There are only a handful of popular instrumental and karaoke covers of Bad Guy, and these are often used as a backing track for a family of string musicians, violinists, or vocal covers.
  3. Full-band covers, which are typically not recorded against a click track and can vary significantly in song structure, like this Irish Reggae Trio or this metal cover (though metal covers usually have impeccable timing).
  4. Remixes and parodies that reference a very short portion of the original track, or drastically transform the original track. This includes multi-million-view parodies like Gabe The Dog and Sampling with Thanos, but also the Tiësto remix.
  5. Purely acoustic solo covers, which have similar challenges to full-band covers but fewer musical indicators useful for automated analysis. For example these covers on clarinet, on guitar in a bathtub in a monkey onesie, or any of the many ukulele covers.
The spectrogram of the original song and the spectrogram of a cover, slightly offset from each other.
Finding the ideal offset by testing multiple values and looking for a peak. The small local peaks correspond to local alignment but not global alignment.
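The offset search pictured above can be sketched as a lag scan: slide the cover's features against the original's and score the overlap at each lag, keeping the peak. This is an illustrative sketch over chroma-like feature matrices, not our production code; the function name and parameters are hypothetical.

```python
import numpy as np

def best_offset(ref_chroma, cover_chroma):
    """Estimate a single global offset (in frames) between two
    feature matrices of shape (dims, frames) by scoring frame-wise
    cosine similarity at each candidate lag and taking the peak.
    A hypothetical sketch: real covers also need tempo handling."""
    n = min(ref_chroma.shape[1], cover_chroma.shape[1])
    max_lag = n // 2
    scores = []
    for lag in range(-max_lag, max_lag):
        if lag >= 0:
            a = ref_chroma[:, lag:n]
            b = cover_chroma[:, :n - lag]
        else:
            a = ref_chroma[:, :n + lag]
            b = cover_chroma[:, -lag:n]
        # cosine similarity averaged over the overlapping frames
        num = (a * b).sum(axis=0)
        den = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + 1e-9
        scores.append((num / den).mean())
    return int(np.argmax(scores)) - max_lag
```

The small local peaks in the figure correspond to partially matching sections; the global peak is the offset we keep.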

Dynamic Time Warping and Chord Recognition

Dynamic time warping solution as a white line superimposed on a modulo’d distance matrix.
Automatic chord recognition of a cover song. Most Am chords correspond to a new section of the song.
Automatic chord recognition for “Bad Guy” by Billie Eilish.
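The warping path shown in white above is the output of the standard DTW recurrence over a pairwise distance matrix. The core algorithm is small enough to sketch directly; this is a minimal illustration of the technique, not the implementation we shipped.

```python
import numpy as np

def dtw_path(cost):
    """Minimal dynamic time warping over a precomputed cost matrix
    of shape (n, m). Returns the optimal warping path as (i, j)
    pairs; a bare-bones sketch of the recurrence, not a tuned
    library implementation."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    # accumulate: each cell extends its cheapest predecessor
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # backtrack from the end to recover the path
    path = [(n - 1, m - 1)]
    i, j = n, m
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return path[::-1]
```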

A New Approach

  1. Determining the accuracy of the alignment algorithm efficiently. Heuristics give fast results, and can be tuned easily, but without ground truth labels each of those results must be hand-checked.
  2. Handling the more than 1 in 7 covers that repeat a section, or the many other covers where DTW or offset alignment would fail.

Dataset Building

Interface for validating cover song annotations.
Screenshot of spectrogram view in Audacity with labels along the bottom.
Alternating section change annotations in red/black across 100 tracks, sorted by the start time of the first beat.

Recurrent Neural Network Alignment

Song structure of the musical portion of “Bad Guy”.
CQT (top) and chroma features (bottom) for a cover song.
RNN prediction with cover song beats on x-axis and original song beats on y-axis.
Grid of predictions and ground truth for 100 covers in the validation set.
Graph of section changes where darker lines indicate more likely changes.
  1. Newer unsupervised algorithms (including newer versions of CPC).
  2. Replacing beat tracking with fixed-sized 100ms chunks.
  3. Label smoothing to afford more continuous timing predictions.
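The label-smoothing idea in the last item can be pictured as replacing a hard one-hot beat index with a soft distribution, so predictions that land on a neighboring beat still get partial credit. A hypothetical sketch of the idea; the width parameter and function name are illustrative.

```python
import numpy as np

def smooth_beat_target(index, num_beats, sigma=1.0):
    """Turn a hard beat-index label into a soft Gaussian target
    distribution over all beats. Neighboring beats receive partial
    credit, encouraging more continuous timing predictions."""
    positions = np.arange(num_beats)
    weights = np.exp(-0.5 * ((positions - index) / sigma) ** 2)
    return weights / weights.sum()
```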

Visual Analysis

Spreadsheet for tracking manual labels across a small set of 200 cover songs.
  • There are a few shot-for-shot remakes and aesthetically inspired remakes. This metal cover, this Indonesian cover, frat guy, the Otomatone cover, and many composited versions. These are some of the most carefully crafted cover videos, but also the hardest to automatically distinguish from re-uploads of the original video. We wanted to make sure we caught as many of these as possible.
  • Dance videos. We initially trained a network to identify these, and discussed using pose estimation as an input to the classifier, but ultimately found that Google had a better internal dance detection algorithm.
  • Videos with still graphics or lyrics combined with a karaoke track, or screen recordings of digital audio workstations for music creation tutorials. We identified these to help us prioritize covers over tools and tutorials.


Screenshot of video labeling interface with videos blurred out.

Visual Similarity

  1. Select k evenly spaced feature vectors from the original n×m matrix, for some small k. This captures a “summary” of the shots in a video, and is most helpful for identifying remakes and copies of the original.
  2. Take the mean of the absolute value of the inter-frame differences across all frames. We call this “activity”, and it captures the presence of short-timescale changes like camera moves and cuts.
  3. Take the max of the absolute value of the inter-frame differences for only the k selected frames. We call this “diversity”, and it describes roughly how many kinds of shots there are throughout the video, and what those shots capture.
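The three fingerprint components above can be computed in a few lines from a per-frame feature matrix. The shapes and names here are illustrative, not our exact pipeline.

```python
import numpy as np

def video_fingerprint(features, k=8):
    """Summarize an (n_frames, dims) per-frame feature matrix with
    the three components described above: a k-frame "summary",
    an "activity" scalar, and a "diversity" scalar. Illustrative
    sketch; the real features came from a pretrained network."""
    n = features.shape[0]
    # 1. "summary": k evenly spaced feature vectors
    idx = np.linspace(0, n - 1, k).astype(int)
    summary = features[idx]
    # 2. "activity": mean absolute frame-to-frame change, all frames
    activity = np.abs(np.diff(features, axis=0)).mean()
    # 3. "diversity": max absolute change among the selected frames
    diversity = np.abs(np.diff(summary, axis=0)).max()
    return summary, activity, diversity
```

A perfectly static video (a still image with a karaoke track, say) scores zero on both activity and diversity, which is exactly what makes those categories easy to separate.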
UMAP embedding of video fingerprints with some ground truth labels overlaid.
CinemaNet labels for a cover video with seconds on the x axis.


  • librosa has a lovely example of Laplacian segmentation, a kind of structural analysis designed to automatically identify song sections based on local similarity and recurrence. It cannot provide a dense alignment, but had we found it sooner we could have used it to make section-change proposals for manual validation, and possibly as a “confidence check” on dense alignment. Some covers that are not sonically similar to any other covers are still self-similar and structurally similar.
  • Taking a similar approach as “Pattern Radio”, we tried using UMAP with HDBSCAN to cluster beats. This was one of the most promising experiments, and might be explored as a preprocessing step for other low-dimensional analysis.
  • When images are resized without antialiasing, a convolutional network trained on antialiased images may perform poorly. It is preferable to rectify this during training if possible, because it can be difficult to identify during integration for production.
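The recurrence structure that Laplacian segmentation builds on is easy to sketch: a binary self-similarity matrix over beat-level features, where entry (i, j) marks beats i and j as similar. This is a bare-bones illustration of that first step, not librosa's implementation, and the threshold is an arbitrary placeholder.

```python
import numpy as np

def recurrence_matrix(features, threshold=0.9):
    """Build a simple binary self-similarity (recurrence) matrix
    from an (n_beats, dims) feature matrix using cosine similarity.
    Structural analysis like Laplacian segmentation starts from
    matrices like this; a minimal sketch only."""
    norms = np.linalg.norm(features, axis=1, keepdims=True) + 1e-9
    unit = features / norms
    sim = unit @ unit.T            # cosine similarity between beats
    return (sim >= threshold).astype(int)
```

Repeated sections show up as off-diagonal stripes in this matrix, which is what makes section changes recoverable even when a cover matches no other recording.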





  1. There is a sixth category, which includes covers that do not map cleanly to this beat-for-beat paradigm, of which we found only a couple of examples, including this slow jazzy cover in 3/4.
  2. Yes, we still used the same labels even for covers without lyrics, and for covers based on Justin Bieber’s version where he sings “gold teeth” instead of “bruises”.
  3. The best open-source solution for beat tracking is madmom, but due to its non-commercial use restriction we integrated the beat tracking from librosa instead.
  4. In an early iteration of this work we considered using similarity alone as a guide for the experience, but eventually decided that a user-interpretable approach was more compelling.
  5. For image similarity it is often more useful to use the output from the second-to-last layer as an “embedding” rather than the final classification probabilities, because the final classification probabilities lose some of the expressivity of the network’s analysis.
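Footnote 5 can be illustrated with a toy two-layer network: the penultimate activation keeps far more of the network's "analysis" than the handful of class probabilities the final layer collapses it into. The weights and shapes here are placeholders, not a real trained model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def embed_and_classify(x, w_hidden, w_out):
    """Toy two-layer network. `hidden` is the second-to-last layer
    activation, suitable as a similarity embedding; `probs` is the
    final classification, which discards most of that information.
    Purely illustrative weights and shapes."""
    hidden = np.maximum(0, w_hidden @ x)   # penultimate layer (ReLU)
    probs = softmax(w_out @ hidden)        # final classification
    return hidden, probs                   # use `hidden` for similarity
```

Comparing videos by cosine distance between `hidden` vectors preserves distinctions (lighting, framing, instrument) that identical top-1 labels would erase.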


