The Siren Song – what happens when machines start singing?

A change is coming

Published by:

Eran Soroka

What do sirens have to do with it?

AnnA released its first single today, and fittingly for a bot, it’s not a man-made creation: The lyrics were completely generated by an NLG algorithm, and the singing is performed in AnnA’s “regular” voice.

The name of this meetup, where the song was introduced, comes from Greek mythology. In the tales, sirens were half-woman, half-bird creatures who lured nearby sailors with their enchanting music and singing voices, causing them to shipwreck on the rocky coast of their island.

AnnA’s song is half machine-made. It won’t sink any ships, but it does “drown” the idea that passionate, fun, and catchy songs can only be written by humans.

A Change Is Coming – AnnA’s song

The creators behind the creation – interview with Chen Buskila and Assaf Talmudi

The humans behind the first NLG protest song are Chen Buskila, CTO and co-founder of CoCoHub, and Assaf Talmudi, a multiple Gold Record-winning music producer, musician, and machine learning enthusiast.

A bit about this unique project?

Assaf: “Chen and I both started playing with the idea of using AI to generate songs and song lyrics probably two years ago. We presented an interesting project at a conference, and then Jason asked us to do something like this with AnnA. Considering the zeitgeist and the interesting times we’re living in, Jason and Chen chose to focus on protest songs. Chen did his magic: he scanned databases of lyrics of such songs and then let an artificial neural net that was trained on these lyrics generate a new song. It generated a couple of nice lyrics. Then we chose one and used AnnA’s voice, the same way we use her voice to speak with users. We gave it a catchy ’80s-style melody, which we did not do with AI. It’s so much better to do by hand”.

How was the song generated?

Chen: “The first step was to collect relevant examples. You actually specify to the model what you want to get at the end. At first, we wanted to go with the Billboard 100 as the list of songs, but Jason had the better idea of going with protest songs, so he gave us a Spotify playlist with hundreds of protest songs.

We crawled the internet to get the lyrics for each one of them, plus some related songs, because we need huge amounts of data for training. In the end, we had thousands of song lyrics to train on. After that, we used OpenAI’s transformer model, called GPT-2, to generate the actual lyrics. The model is based on a huge and very varied amount of data; we just took it and fine-tuned it by running it through all of the lyrics we had crawled”.

Assaf: “So you start with a general language model that you get as open source, something that went through a lot of text, and then you give special importance to a small dataset”.

“Even 500 songs would be a small dataset”

Chen: “And after it was trained on huge amounts of text from the internet, we let it go over the lyrics dataset and learn about it. It kind of forgets what it learned before”.

Assaf: “Chen means that you can’t learn much from a small dataset, and even 500 song lyrics would be small. So basically you start by training a general language model, not specifically on the kinds of text you’re interested in. That way you get the structure of the language into the model, in a way. Then you train it on the specific dataset you want”.
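To make the "general model first, small dataset second" idea concrete, here is a toy sketch. It is not GPT-2 and not the code used for AnnA's song: it swaps in a simple word-pair counter, and the sample texts, the `weight` parameter, and all function names are assumptions made up for illustration. It only shows how giving "special importance" to a small dataset can shift what a model predicts while the general knowledge stays available.

```python
from collections import Counter, defaultdict


def bigram_counts(text):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def fine_tune(general, specific, weight=5):
    """Copy the general counts, then add the specific counts with extra weight.

    The weight is a hypothetical knob standing in for the "special
    importance" given to the small dataset.
    """
    tuned = defaultdict(Counter)
    for prev, followers in general.items():
        tuned[prev].update(followers)
    for prev, followers in specific.items():
        for nxt, n in followers.items():
            tuned[prev][nxt] += weight * n
    return tuned


def most_likely_next(model, word):
    """Return the most frequent follower of a word, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up stand-ins for "a lot of text" and "a small lyrics dataset".
general_text = "the change in the weather is coming and the weather is cold"
lyrics_text = "a change is coming and the change is ours"

general_model = bigram_counts(general_text)
tuned_model = fine_tune(general_model, bigram_counts(lyrics_text))

print(most_likely_next(general_model, "the"))    # weather
print(most_likely_next(tuned_model, "the"))      # change
print(most_likely_next(tuned_model, "weather"))  # is
```

After the second pass the model prefers the lyric-style continuation, yet it still knows word pairs that appear only in the general text.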

Chen: “All this wasn’t enough, so another trick we used to get it to generate the song in the exact format we wanted is called conditioning. You take a prompt, a sentence or a few words, and give it to the model as a starting point. Then it generates from that point forward. The lyrics data are all structured in the same way: song name, artist, and the lyrics themselves. So we prompted the model with this data and the song was generated”.
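The conditioning trick Chen describes can be sketched roughly as follows. The interview only says that every record holds a song name, an artist, and the lyrics; the `Song:` / `Artist:` / `Lyrics:` labels, the function names, and the sample values below are assumptions for illustration, not the team's actual format. The point is that the prompt reuses the header layout of the training records, so a fine-tuned model would most naturally continue with lyrics.

```python
def format_song(title, artist, lyrics):
    """Render one training record: song name, artist, then the lyrics."""
    return f"Song: {title}\nArtist: {artist}\nLyrics:\n{lyrics.strip()}\n"


def build_prompt(title, artist):
    """Render only the header, leaving the lyrics for the model to write."""
    return f"Song: {title}\nArtist: {artist}\nLyrics:\n"


# Hypothetical record, laid out the way the training data might look.
training_example = format_song(
    "Example Title", "Example Artist", "first line\nsecond line\n"
)

# Hypothetical prompt: same layout, but it stops where the lyrics begin.
prompt = build_prompt("A Change Is Coming", "AnnA")

print(training_example)
print(prompt)
```

In a real setup, the `prompt` string would be passed to the fine-tuned model as the starting text, and generation would continue from the end of the `Lyrics:` line.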

“Sound may be the special sauce your voice experience needs” – Benjamin McCulloch’s ‘sound advice’

Benjamin McCulloch, Audio Specialist and Conversation Designer, gave a talk about the importance of sound in the overall experience. “First, what is immersion? Have you ever been so absorbed in a film that you forgot where you were? You became so involved in the characters and the story that you stopped thinking about your real surroundings? That was immersion. It is often called ‘suspension of disbelief’ by filmmakers. The audience stops questioning the fictional story that they are experiencing and accepts it as some kind of reality. If the audience doesn’t believe what they are experiencing, they are likely to lose interest and just go and find something else to do”, explained Benjamin.

“Audio plays a huge role in creating an immersive experience. Sound surrounds you. In modern movie theaters, the speakers are mounted all around the viewers, but there is only one screen. If you were to watch a film with the sound muted, what would you experience? Well, you’ll discover that a lot of the details, the mood, and the context of the events that occur have been lost. Usually, sound departments go to great lengths to make the soundtrack as real as possible, even if the story occurs in an imaginary world. Think of characters such as Gollum, Yoda, or C-3PO. They looked other-worldly, but the filmmakers used real voices to make those characters believable”, he added.

“So the million-dollar question is how does this relate to voice assistants? Well, voice assistants often use a completely different approach. Characters in voice experiences commonly have no visuals at all, unless the experience is multimodal. The voice is usually synthetic, such as text-to-speech. Does that allow for immersion? I’m not so sure. I’m not aware of a single user who believed, on hearing text-to-speech, that they were listening to a real person”, said Benjamin. “The role of sound in creating immersive experiences goes beyond voices. Your user should believe that they are in the world that you want them to believe they are in. For example, if your experience is set in Antarctica, you should have the sound of a brutally cold, icy wind that makes them feel like they are there”.

So, is this relevant to you?

“I think that if you’re making an experience that’s based around a narrative, the answer is definitely yes”, said Benjamin. “You’re competing against all the other story-based media that use a sound team. There are conventions for working with sound that have been refined over decades, and they work. If you create something that competes against people who use those conventions, you have to make sure that your work stands up, so you should listen to them for inspiration”, he added. “But is this only relevant to experiences with a fictional narrative? Maybe not. I think it needs to be tested. Maybe your users do need to be properly immersed in the experience, and believe it enough to properly engage with it. So to conclude: sound really helps with immersion. It might just be the special sauce that your voice experience needs”.