
Roll your own Shazam with the new ShazamKit framework

Published at 17:00 GMT, 09 Jun 2021
Written by: Gui Rambo

You’re probably familiar with Shazam. The app was such a huge success that “shazam” became a verb and Apple acquired the company behind it, integrating most of its functionality into their operating systems. This year, Apple is introducing ShazamKit, a new framework that enables apps to recognize songs that are playing near the user’s devices. Besides recognizing songs from Shazam’s huge catalog, we can now create our own custom catalogs, which can then enable incredible audio experiences within our apps.

As you can see in the video that accompanies this article, I was able to create a simple app that can detect episodes of different podcasts, something that Shazam can’t do on its own. This is all thanks to the ShazamKit framework, which lets us create a custom Shazam Catalog containing audio signatures for the different types of content that we’d like to detect in our apps.

ShazamKit vs. MLSoundClassifier

You might be wondering what the difference is between ShazamKit’s ability to recognize custom audio signatures and what you can already do using certain machine learning tools, such as MLSoundClassifier. The way I see it, ShazamKit was designed to recognize very specific audio signatures, such as a specific episode of a podcast, a given song, or even a specific clip within a long episode of a TV show or a podcast.

That means that, in order for a given podcast to be recognized in our example, the catalog must contain a signature for the specific episode that we’ll be playing, or for something that’s repeated in every episode, such as the intro music or the podcast host saying “Hello, welcome back to such and such show” (as long as they say it in a similar way every time).

If we wanted to do a fuzzier match, such as recognizing which host is speaking during a given portion of an episode, then we’d be better off using MLSoundClassifier with a custom machine learning model that we’ve trained to recognize the different voices.

A really cool application for ShazamKit would be as a companion app for video content that the user watches on another device. Let’s say that you have a video-based course that people usually watch on their Mac or their Apple TV, and that you’d like to offer a second screen experience that the user can access from their iPhone, synced with the video that’s playing on the TV. You could create a custom catalog that’s able to recognize different parts of the video, displaying additional information on screen based on what the iPhone’s microphone is picking up.

Let’s see how we can use all of this in practice with the podcast example that I gave earlier.

Creating a custom Shazam catalog

It all starts with the creation of a custom catalog that can be used to recognize the audio signatures of different podcast episodes. To create this catalog, I wrote a very simple app in SwiftUI that just lets me start capturing audio from the microphone, and then allows me to export the captured signature using the standard system share sheet.

If you’re going to do the same, then don’t forget to add the Privacy - Microphone Usage Description key (NSMicrophoneUsageDescription) to your app’s Info.plist, so that the system can ask the user for permission the first time the microphone is used.

// Used to get audio from the microphone.
private lazy var audioEngine = AVAudioEngine()

// Used to generate an audio signature from the audio input.
private lazy var generator = SHSignatureGenerator()

func startRecording() {
    // Create an audio format for our buffers based on the format of the input, with a single channel (mono).
    let audioFormat = AVAudioFormat(
        standardFormatWithSampleRate: audioEngine.inputNode.outputFormat(forBus: 0).sampleRate,
        channels: 1
    )
    
    // Install a "tap" in the audio engine's input so that we can send buffers from the microphone to the signature generator.
    audioEngine.inputNode.installTap(onBus: 0, bufferSize: 2048, format: audioFormat) { [weak generator] buffer, audioTime in
        // Whenever a new buffer comes in, we send it over to the signature generator.
        try? generator?.append(buffer, at: audioTime)
    }
    
    // Tell the system that we're about to start recording.
    try? AVAudioSession.sharedInstance().setCategory(.record)
    
    // Ensure that we have permission to record, then start running the audio engine.
    AVAudioSession.sharedInstance().requestRecordPermission { [weak self] success in
        guard success, let self = self else { return }
        
        try? self.audioEngine.start()
    }
}

Notice how I’m using try? above in order to completely ignore any errors that might be thrown. This is done for brevity, but you should use the do/catch pattern in order to handle errors appropriately in a production scenario.

In my test app, I added a simple button that calls this startRecording method to initiate a recording. While recording, I then played the first 15 seconds of an episode of the podcast that I wanted to include in my catalog. This process must be repeated for each audio signature that you’d like to capture, stopping in between to export the signature that’s been collected.

To stop recording, all that we have to do is tell the audio engine to stop:

func finishRecording() {
    audioEngine.stop()
}

After doing that, the SHSignatureGenerator that we fed our audio buffers into will be ready to give us an audio signature through its signature() method. The signature is of type SHSignature, and we can use SHCustomCatalog to collect multiple signatures and associate them with their respective metadata.

Ideally, you’ll have your “training” app provide an interface where multiple signatures can be captured during a single app session, using a custom model object to collect the audio signatures alongside the metadata that you’d like to associate with them. That model could be similar to this:

struct ReferenceSignature: Hashable {
    let id: UUID
    let podcastName: String
    let artist: String
    let signature: SHSignature
}
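
As a rough sketch of how a captured signature could be turned into one of these values (the storeCurrentSignature method and its parameters are hypothetical; generator.signature() is the ShazamKit call that produces the finished SHSignature, and capturedSignatures is the collection that we’ll iterate over when building the catalog below):

func storeCurrentSignature(podcastName: String, artist: String) {
    // Ask the generator for the signature computed from all of the
    // buffers that we appended while recording.
    let signature = generator.signature()

    capturedSignatures.append(ReferenceSignature(
        id: UUID(),
        podcastName: podcastName,
        artist: artist,
        signature: signature
    ))
}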

You’d store the signatures in a collection, then offer an “export” option that would then generate the catalog and allow the user to export it:

func createCatalog() -> SHCustomCatalog? {
    let catalog = SHCustomCatalog()
    
    do {
        // capturedSignatures is an array of [ReferenceSignature], our custom model type.
        try capturedSignatures.forEach { reference in
            try catalog.addReferenceSignature(reference.signature, representing: [reference.mediaItem])
        }
    } catch {
        print("Something went wrong: \(error)")
        return nil
    }
    
    return catalog
}

Notice the mediaItem property that I’m accessing from my custom model type. That is just a convenience property that creates an object of type SHMediaItem, which is how ShazamKit represents media such as music. This object has several standard properties for common things such as title and artist name, but you can also create your own. Here’s how I implemented mediaItem in my custom ReferenceSignature model:

extension ReferenceSignature {
    var mediaItem: SHMediaItem {
        SHMediaItem(properties: [
            .artist: artist,
            .title: podcastName
        ])
    }
}
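
To illustrate the custom property part, here’s a hedged sketch of how an app-defined key could be used alongside the built-in ones (the episodeNumber key and the makeMediaItem helper are purely hypothetical; SHMediaItemProperty is an extensible string-backed type, and values can later be read back from a media item using its subscript):

import ShazamKit

extension SHMediaItemProperty {
    // An app-defined key; the string is the raw value that gets stored in the catalog.
    static let episodeNumber = SHMediaItemProperty("episodeNumber")
}

func makeMediaItem(podcastName: String, artist: String, episode: Int) -> SHMediaItem {
    SHMediaItem(properties: [
        .title: podcastName,
        .artist: artist,
        .episodeNumber: episode
    ])
}

// Later, when a match comes in, the custom value can be read back with a subscript:
// let episode = matchedItem[.episodeNumber] as? Int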

Now that we’re in possession of an SHCustomCatalog, we can export it to a file:

func export(_ catalog: SHCustomCatalog) -> URL? {
    let tempURL = URL(fileURLWithPath: NSTemporaryDirectory())
        .appendingPathComponent(UUID().uuidString)
        .appendingPathExtension("shazamcatalog")
    
    do {
        try catalog.write(to: tempURL)

        return tempURL
    } catch {
        print("Export error: \(error)")
        return nil
    }
}

That file can then be embedded into the app that we ship to our users, or we might even update it remotely by storing it on CloudKit or on our own servers.
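
As a rough sketch of the remote option (the downloadCatalog function and the idea of fetching the file over HTTPS are assumptions on my part; the only ShazamKit API involved is SHCustomCatalog’s add(from:) method, which we’ll also use in the next section), loading a catalog that lives on a server could look something like this:

import Foundation
import ShazamKit

func downloadCatalog(from remoteURL: URL) async throws -> SHCustomCatalog {
    // Download the .shazamcatalog file to a temporary location on disk.
    let (fileURL, _) = try await URLSession.shared.download(from: remoteURL)

    // Load the downloaded signatures into a catalog that can be handed to an SHSession.
    let catalog = SHCustomCatalog()
    try catalog.add(from: fileURL)
    return catalog
}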

All of the sample code that I’ve shown so far would run in the context of an internal app that your team uses in order to generate the catalog, which is then used in the app that ships to customers. There are no rules against allowing your users to create their own catalogs, though, so if you have an app idea that would involve users training their own Shazam catalog right within the app, then feel free to do it.

Using the catalog to identify media

Now that we have a trained Shazam catalog for the podcast that we would like our app to be able to recognize, we can finally use it. The first step is very similar to capturing audio signatures, because we once again need to set up an audio engine and feed buffers to an object provided by ShazamKit:

private var audioEngine = AVAudioEngine()

private lazy var session = SHSession()

func start() {
    let catalog = SHCustomCatalog()
    try? catalog.add(from: Bundle.main.url(forResource: "Podcasts", withExtension: "shazamcatalog")!)
    
    // The session is what we use to recognize what's playing.
    session = SHSession(catalog: catalog)
    // The delegate will receive callbacks when the media is recognized.
    session.delegate = self
    
    audioEngine = AVAudioEngine()
    
    // Create an audio format for our buffers based on the format of the input, with a single channel (mono).
    let audioFormat = AVAudioFormat(
        standardFormatWithSampleRate: audioEngine.inputNode.outputFormat(forBus: 0).sampleRate,
        channels: 1
    )
    
    // Install a "tap" in the audio engine's input so that we can send buffers from the microphone to the session.
    audioEngine.inputNode.installTap(onBus: 0, bufferSize: 2048, format: audioFormat) { [weak session] buffer, audioTime in
        // Whenever a new buffer comes in, we send it over to the session for recognition.
        session?.matchStreamingBuffer(buffer, at: audioTime)
    }
    
    // Tell the system that we're about to start recording.
    try? AVAudioSession.sharedInstance().setCategory(.record)
    
    // Ensure that we have permission to record, then start running the audio engine.
    AVAudioSession.sharedInstance().requestRecordPermission { [weak self] success in
        guard success, let self = self else { return }
        
        try? self.audioEngine.start()
    }
}

Most of this code is setting up the audio engine, just like we did in the first example. The difference is that we’re now using an SHSession, which is the object provided by ShazamKit that enables us to recognize media based on our custom catalog. The session is initialized with a catalog, which we instantiate and then populate using the add(from:) method, loading the audio signatures from the catalog file that we embedded in the app.

All that’s missing now is the implementation of the SHSessionDelegate protocol, which we’ll use in order to receive information about the media that’s been recognized:

extension PodcastRecognizer: SHSessionDelegate {
    func session(_ session: SHSession, didFind match: SHMatch) {
        DispatchQueue.main.async { self.state = .matchFound(match) }
    }
}

If you’re curious about the state property that’s referenced above, that’s a custom enum that I’ve created for my app, and it looks like this:

enum State {
    case idle
    case matching
    case matchFound(SHMatch)
    case failed
}

Having an @Published property with that enum type in an ObservableObject is a really simple way to hook this up to a SwiftUI view. However, creating the user interface for this app is left as an exercise to the reader 😉.
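
To give a rough idea of how those pieces could fit together, here’s a hedged sketch (the class layout is just one way to organize it; the only new ShazamKit API is the second SHSessionDelegate callback, which fires when a signature couldn’t be matched and maps nicely onto the .failed case):

import Combine
import Foundation
import ShazamKit

final class PodcastRecognizer: NSObject, ObservableObject {
    // Publishing the state lets a SwiftUI view react whenever a match comes in.
    @Published var state = State.idle

    // ...the audio engine, session and start() method shown earlier would live here...
}

extension PodcastRecognizer {
    // Called when a streamed signature couldn't be matched against the catalog.
    func session(_ session: SHSession,
                 didNotFindMatchFor signature: SHSignature,
                 error: Error?) {
        DispatchQueue.main.async { self.state = .failed }
    }
}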

So now, whenever we start our audio engine and play something that our catalog has been trained to recognize, we’ll get a call to session(_:didFind:). The match exposes an array of mediaItems, each of type SHMatchedMediaItem. That type inherits from SHMediaItem, which means that we can access properties like title and artist, or even custom properties that we’ve created, using subscripts.

SHMatchedMediaItem adds some extra information about the match itself, such as the frequency skew and the predicted offset within the reference signature that corresponds to the audio that’s currently playing.
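
As a small illustration of reading that information back (the handle(_:) function is hypothetical, but the properties it accesses are all part of SHMatch and SHMatchedMediaItem), handling a match could look like this:

func handle(_ match: SHMatch) {
    // A match may contain more than one media item; here we just look at the first.
    guard let item = match.mediaItems.first else { return }

    print("Matched:", item.title ?? "Unknown title", "by", item.artist ?? "Unknown artist")

    // How far into the reference signature the currently playing audio is predicted to be.
    print("Predicted offset:", item.predictedCurrentMatchOffset)

    // How much the matched audio's frequency deviates from the reference signature.
    print("Frequency skew:", item.frequencySkew)
}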

Conclusion

I hope that you enjoyed this exploration of ShazamKit. It’s a really fun API, and I can see many different types of experiences that could take advantage of the new framework. If you do anything cool with ShazamKit, be sure to let me know; I’d love to see it.
