New Frontiers in Audio Machine Learning
Not that long ago, any workflow that involved processing audio files—even a fairly simple task like transcribing a podcast episode—came with a set of tough choices. You could go manual (and waste hours, if not days, in the process), rely on a few clunky and ultimately underwhelming apps, or patch together a Frankenstein's monster of tools and code.
Those days are behind us. The rise of powerful models and accessible AI interfaces has made working with audio and music exponentially more streamlined, and new horizons continue to open up every day. To help you catch up with some of the recent advances in audio-focused machine learning, we've collected a few standout articles from the past few weeks, covering a wide range of approaches and use cases. Tune out the noise and dive in!
- A look inside the black box of music tagging. With thousands of songs added to platforms like Spotify and Apple Music every day, have you ever wondered how these services know which musical genre to assign to each one? Max Hilsdorf's fascinating project leverages Shapley values to determine how the presence of specific instruments shapes the way AI systems tag new tracks.
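Shapley values come from cooperative game theory: a feature's value is its average marginal contribution to the model's output over all possible coalitions of the other features. To give a flavor of the idea (this toy "rock" tagger and its instrument features are purely illustrative, not the model from Max's project), here is an exact computation in plain Python:

```python
from itertools import combinations
from math import factorial

# Hypothetical tagging model: scores how strongly a track reads as "rock"
# based on which instruments are present. Illustrative only.
def rock_score(instruments):
    score = 0.0
    if "electric_guitar" in instruments:
        score += 0.5
    if "drums" in instruments:
        score += 0.3
    if "electric_guitar" in instruments and "drums" in instruments:
        score += 0.1  # interaction effect: guitar and drums together
    if "violin" in instruments:
        score -= 0.2
    return score

def shapley_values(features, model):
    """Exact Shapley values: for each feature, the weighted average of its
    marginal contribution across all subsets of the remaining features."""
    n = len(features)
    values = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (model(set(subset) | {f}) - model(set(subset)))
        values[f] = total
    return values

phi = shapley_values(["electric_guitar", "drums", "violin"], rock_score)
```

A handy sanity check: the Shapley values always sum to the difference between the model's output with all features present and with none, which is what makes them attractive for explaining individual tagging decisions. (Real projects typically use the `shap` library rather than this exponential-time exact loop.)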
- Explore a deep learning approach to identifying bird calls. Leonie Monigatti's recent contribution covers last year's BirdCLEF 2022 Kaggle competition, where participants were tasked with creating a classifier for bird-song recordings. Leonie walks us through a neat approach that converts audio waveforms into mel spectrograms so a deep learning model can process them the same way it does images.
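If you're curious what the "mel" in mel spectrogram means: it's a frequency scale matched to human pitch perception, roughly linear below 1 kHz and logarithmic above. Libraries like librosa handle the full waveform-to-spectrogram conversion; the sketch below (helper names are our own) only illustrates the mel mapping itself, using a widely used conversion formula:

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (a common formula, also
    librosa's HTK-style variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Edge frequencies (Hz) for a mel filterbank: evenly spaced in mel,
    hence increasingly wide in Hz as frequency grows."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

edges = mel_band_edges(0.0, 8000.0, n_bands=10)
```

Printing `edges` shows the characteristic spacing: narrow bands at low frequencies, wide bands up high, which is why a mel spectrogram devotes more of its "image rows" to the frequencies where bird calls (and human hearing) carry the most detail.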
- Get the gist of recorded conversations, lectures, and interviews. If you're a consummate optimizer, you'll appreciate Bildea Ana's streamlined process for transcribing audio with OpenAI's Whisper model on Hugging Face, and then summarizing it using the open-source BART encoder-decoder model. You could apply this method to your own recordings and voice memos, or to any other audio file (as long as its owners allow it, of course—always double-check the copyright and license status of any data you'd like to use).
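One practical wrinkle in any transcribe-then-summarize pipeline: summarization models like BART accept a limited input length (on the order of 1,024 tokens), while the transcript of a long recording is far longer, so the transcript is usually summarized chunk by chunk. Here's a minimal stdlib sketch of that chunking step; `max_words` is a crude stand-in for a real tokenizer's token count, and the function name is our own:

```python
def chunk_transcript(text, max_words=400):
    """Split a transcript into chunks of roughly max_words words,
    preferring to break at sentence boundaries."""
    words = text.split()
    chunks, current = [], []
    for word in words:
        current.append(word)
        # Close the chunk once it's long enough AND we just hit a
        # sentence-ending word, so summaries get coherent input.
        if len(current) >= max_words and word.endswith((".", "!", "?")):
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

# Each chunk would then go to a summarizer, e.g. with Hugging Face's
# transformers library (not run here):
#   summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
#   summary = " ".join(s["summary_text"] for s in summarizer(chunks))
```

This is one simple strategy among several; Ana's article walks through the full pipeline end to end.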
- Taking transcription to the next level. Luís Roque's latest project follows a parallel path to Ana's, up to a point. It also relies on Whisper to transcribe audio files, but then explores a different direction altogether by deploying pyannote for speaker diarization, "the process of identifying and segmenting speech by different speakers."
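Combining the two tools leaves you with two timelines to reconcile: timestamped transcript segments from Whisper and speaker turns from the diarization model. A common merge assigns each transcript segment to the speaker whose turn overlaps it the most. The sketch below is illustrative only; the tuple shapes are our simplification, not the exact output format of either library:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """segments: [(start, end, text)] from transcription;
    turns: [(start, end, speaker)] from diarization.
    Returns [(speaker, text)], labeling each segment with the speaker
    whose turn covers most of it."""
    labeled = []
    for s_start, s_end, text in segments:
        best = max(turns, key=lambda t: overlap(s_start, s_end, t[0], t[1]),
                   default=None)
        if best and overlap(s_start, s_end, best[0], best[1]) > 0:
            labeled.append((best[2], text))
        else:
            labeled.append(("unknown", text))
    return labeled
```

Segment-level majority overlap is a deliberately simple design choice; finer-grained approaches align at the word level instead, at the cost of more bookkeeping.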
Please don't stop the music, you say? We're happy to oblige – here are some of our favorite recent articles on non-audio-related topics. Enjoy!
- _"Learning neural networks should not be an exercise in decoding misleading diagrams,"_ say Aaron Master and Doron Bergman, who propose a constructive, novel approach to creating better and more accurate ones.
- From promotion design to inventory analysis, Idil Ismiguzel demonstrates the power of association rule mining: a technique that empowers data professionals to find frequent patterns in a dataset.
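The core of association rule mining boils down to two metrics: support (how often an itemset appears across transactions) and confidence (how often a rule's consequent holds given its antecedent). A minimal sketch on a made-up basket dataset, assuming nothing from Idil's article beyond the general technique (real workflows typically use an Apriori or FP-Growth implementation, e.g. from mlxtend):

```python
# Toy transaction data: each set is one customer's basket.
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "bread"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the rule antecedent -> consequent holds when the
    antecedent is present: support(A and B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)
```

For example, `confidence({"bread"}, {"butter"})` here is 0.5: half the baskets containing bread also contain butter, which is exactly the kind of signal that drives promotion design.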
- For a hands-on approach to unsupervised learning and K-means clustering, don't miss Nabanita Roy's new tutorial, which focuses on the use case of grouping image pixels by color.
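The idea translates directly into code: treat each pixel as an (R, G, B) point, cluster the points, and the cluster centers become the image's dominant colors. Below is a minimal pure-Python K-means sketch on a handful of hand-picked "pixels" (real work would use scikit-learn's `KMeans` on an image loaded with, say, Pillow; this is not the tutorial's actual code):

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Plain K-means on tuples of numbers (here, RGB pixels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centers, clusters

# Two reddish and two bluish pixels: k=2 should separate them by color.
pixels = [(250, 10, 10), (240, 20, 15), (10, 10, 240), (20, 5, 250)]
centers, clusters = kmeans(pixels, k=2)
```

With well-separated colors like these, the two clusters recover the red and blue groups and the centers land near each group's average color.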
- If you find the intersection of AI, government regulations, and the intricacies of Canadian bureaucracy fascinating (who wouldn't?), Mathieu Lemay's deep dive is the one article you absolutely shouldn't miss this week.
- As the role of synthetic data continues to evolve (and grow) in numerous sectors, Miriam Santos' practical guide to generating it with CTGAN is as timely and useful as ever.
- We couldn't possibly go an entire week without a GPT-themed pick; if you haven't read it already, we highly recommend Henry Lai's overview of the data-centric AI concepts behind these ever-popular models.
Thank you for tuning in to The Variable this week! If you enjoy the articles you read on TDS, consider becoming a Medium member—and if you're a student in an eligible country, don't miss a chance to enjoy a substantial discount on a membership.
Until the next Variable,
TDS Editors