Shimon1, a battle-rapping robot, interacts with human rappers in real time: taking the human’s raps as input, it responds with AI-generated raps, voice, and gestures.
The system analyses the human’s audio and lyrics, generates lyrics in response, censors them, filters them by rhyme, converts the text to rhythm and voice, and synchronises the robot’s gestures.
Audio and Text Analysis
MaxMSP, a visual programming environment for audio, breaks incoming human audio into chunks and streams them in real time to a Python program that calls the Google Cloud Speech API to convert the speech to text.
The TextRank algorithm2 is then used to identify important keywords in the transcribed text, and WordNet is used to gather synonyms and antonyms for those keywords.
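As a rough illustration (not the authors’ actual code), this keyword-expansion step could look like the following sketch, assuming the summa library for TextRank and NLTK’s WordNet interface:

```python
# Hedged sketch: extract TextRank keywords with `summa` and expand them
# via WordNet synonyms/antonyms with NLTK. Library choices are assumptions.
# Requires: pip install summa nltk, plus nltk.download("wordnet") once.
from summa import keywords as textrank
from nltk.corpus import wordnet as wn

def expand_keywords(transcript: str, top_n: int = 5) -> dict:
    """Map each extracted keyword to its WordNet synonyms and antonyms."""
    top = [w for w in textrank.keywords(transcript).split("\n") if w][:top_n]
    expansions = {}
    for word in top:
        syns, ants = set(), set()
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                syns.add(lemma.name().replace("_", " "))
                for antonym in lemma.antonyms():
                    ants.add(antonym.name().replace("_", " "))
        expansions[word] = {"synonyms": sorted(syns), "antonyms": sorted(ants)}
    return expansions
```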
Dataset
A web scraper called Verse Scraper is used to consolidate lyrics from individual artists into two corpora: a hip-hop dataset of 25,000 songs and a metal dataset of 15,000 songs, both used for generating lyrics.
Phoneme Embedding and Deep Learning
A phoneme represents the smallest unit of vocalisation; groups of phonemes create syllables, and syllables create words. Phoneme encoding is important for hip-hop lyric generation because the genre relies on different pronunciations of the same words to achieve flow.
A pronouncing dictionary is used as a base, mapping each syllable to its varying pronunciations. As new combinations of text and phonemes are encountered, they are added to this dictionary.
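A minimal sketch of such a lookup, assuming the CMU Pronouncing Dictionary (via the pronouncing library) as the base dictionary, might be:

```python
# Minimal sketch, assuming the CMU Pronouncing Dictionary (via the
# `pronouncing` library) as the base; new text/phoneme combinations are
# cached as they are encountered. Not the authors' actual code.
import pronouncing

phoneme_dict: dict[str, list[str]] = {}

def pronunciations(word: str) -> list[str]:
    """Return all known phoneme strings for a word, caching lookups."""
    word = word.lower()
    if word not in phoneme_dict:
        # e.g. "flow" -> ["F L OW1"]; multiple entries capture the
        # variant pronunciations rappers use to keep the flow.
        phoneme_dict[word] = pronouncing.phones_for_word(word)
    return phoneme_dict[word]

def add_pronunciation(word: str, phones: str) -> None:
    """Record a newly encountered text/phoneme combination."""
    phoneme_dict.setdefault(word.lower(), []).append(phones)
```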
An RNN-LSTM deep learning model with a phoneme-embedding layer is used to generate text. From its output, lines that contain synonyms or antonyms of the keywords are selected automatically.
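As an illustrative sketch (the vocabulary size and layer widths are assumptions, not the authors’ values), such a phoneme-level LSTM language model could look like this in Keras:

```python
# Illustrative Keras sketch of a phoneme-level LSTM language model with a
# phoneme-embedding layer; sizes are assumptions, not the paper's values.
import tensorflow as tf

PHONEME_VOCAB = 84  # assumed: CMU phonemes with stress markers + specials
EMBED_DIM = 128
HIDDEN = 256

model = tf.keras.Sequential([
    # Phoneme-embedding layer: each phoneme token becomes a dense vector.
    tf.keras.layers.Embedding(PHONEME_VOCAB, EMBED_DIM),
    tf.keras.layers.LSTM(HIDDEN, return_sequences=True),
    tf.keras.layers.LSTM(HIDDEN),
    # Predict a distribution over the next phoneme in the sequence.
    tf.keras.layers.Dense(PHONEME_VOCAB, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```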
Censorship
Lyrics are censored during post-processing. A list of 28 inappropriate words is filtered out of the generated text, and a new generation is substituted as a replacement.
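In code, the filter-and-regenerate loop might look like this sketch (the blocklist entries and the regenerate() helper are hypothetical stand-ins):

```python
# Hedged sketch of the post-processing censor. BLOCKLIST stands in for
# the paper's 28 filtered words, and regenerate() is a hypothetical
# helper that draws a fresh line from the language model.
import re

BLOCKLIST = {"badword1", "badword2"}  # placeholder entries

def censor(line: str, regenerate) -> str:
    """Replace any line containing a blocked word with a new generation."""
    while set(re.findall(r"[a-z']+", line.lower())) & BLOCKLIST:
        line = regenerate()
    return line
```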
Rhyme Detection and Lyric Selection
The generated lyrics contain internal rhymes due to the use of phonemes in the machine learning model. Two types of rhyme are scored: perfect rhymes, where both the vowel and the following consonant sounds match, and slant rhymes, where the sounds are similar but not identical.
Longer words that rhyme are given more weight when filtering the lyrics.
[Figure: a set of generated lines with rhyme scores shown below the rhyming words.]
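A rough scoring sketch under these rules, using the pronouncing library’s “rhyming part” (the phonemes from the last stressed vowel onward), might be (the score values are assumptions):

```python
# Rough sketch of rhyme scoring: perfect rhymes (identical rhyming part)
# outrank slant rhymes (shared vowel, different consonants), and longer
# rhymes weigh more. Score values are illustrative, not the paper's.
import pronouncing

def rhyme_score(a: str, b: str) -> float:
    phones_a = pronouncing.phones_for_word(a)
    phones_b = pronouncing.phones_for_word(b)
    if not phones_a or not phones_b:
        return 0.0
    # Rhyming part = phonemes from the last stressed vowel to the end.
    part_a = pronouncing.rhyming_part(phones_a[0]).split()
    part_b = pronouncing.rhyming_part(phones_b[0]).split()
    if not part_a or not part_b:
        return 0.0
    if part_a == part_b:          # perfect rhyme: vowel + consonants match
        base = 1.0
    elif part_a[0] == part_b[0]:  # slant rhyme: shared vowel sound only
        base = 0.5
    else:
        return 0.0
    return base * len(part_a)     # longer rhymes get more weight
```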
Text to Rhythm
The system can map the generated lyrics to any tempo from 80 to 160 BPM. This involves emphasising rhyming words, placing silences of varying length after them, and filling the remaining gaps with non-rhyming words.
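As a simplified sketch of such a mapping (the timing rules here are assumptions, not the paper’s exact scheme):

```python
# Illustrative sketch of mapping syllables onto a beat grid at a given
# tempo: rhyming syllables are stressed (held longer) and followed by a
# rest, with non-rhyming syllables filling the sixteenth-note slots.
# These timing rules are assumptions, not the paper's exact scheme.

def schedule_line(syllables, rhyme_flags, bpm=120):
    """Return (syllable, start_sec, duration_sec) triples for one line."""
    sixteenth = 60.0 / bpm / 4          # seconds per sixteenth note
    events, t = [], 0.0
    for syl, is_rhyme in zip(syllables, rhyme_flags):
        dur = sixteenth * (2 if is_rhyme else 1)  # stress rhyming words
        events.append((syl, t, dur))
        t += dur
        if is_rhyme:
            t += sixteenth              # silence after a rhyming word
    return events

print(schedule_line(["ro", "bot", "flow"], [False, False, True], bpm=96))
```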
Rhythm to Voice
Google’s text-to-speech system is used in conjunction with the Speech Synthesis Markup Language (SSML) to change the intonation of rhyming words. Audio plugins are then used to compress and filter the sound.
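A hedged sketch of this step with the Google Cloud Text-to-Speech client (the voice choice and prosody values are assumptions; the compression and filtering happen downstream in audio plugins):

```python
# Hedged sketch: wrap rhyming words in SSML <prosody> tags to change
# their intonation, then synthesise with Google Cloud Text-to-Speech.
# Prosody values and voice selection are assumptions, not the paper's.
from google.cloud import texttospeech

def synthesise(words, rhyme_flags) -> bytes:
    parts = [
        f'<prosody pitch="+15%" rate="90%">{w}</prosody>' if r else w
        for w, r in zip(words, rhyme_flags)
    ]
    ssml = "<speak>" + " ".join(parts) + "</speak>"
    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(ssml=ssml),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16
        ),
    )
    return response.audio_content  # audio bytes for the plugin chain
```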
Gesture Synchronisation
The robot’s mouth, head, and neck movements are synchronised to the raps. Mouth movements are driven by each syllable’s start and end times, and the eyebrows are programmed to rise when Shimon enunciates rhyming words.
Shimon nods its head to the beat while listening to the human rapper. While performing, it moves side to side and up and down, keeping its head steady so that its mouth stays visible.
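A simplified sketch of this timing logic (send_command() is a hypothetical stand-in for the robot’s motor-control API):

```python
# Simplified sketch of the gesture layer: open the mouth over each
# syllable's [start, end] window and raise the eyebrows on rhymes.
# send_command() is a hypothetical stand-in for the robot's control API.

def animate(events, send_command):
    """events: (syllable, start_sec, end_sec, is_rhyme) tuples."""
    for syllable, start, end, is_rhyme in events:
        send_command("mouth_open", at=start)
        send_command("mouth_close", at=end)
        if is_rhyme:
            send_command("eyebrows_up", at=start)
            send_command("eyebrows_down", at=end)
```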
Watch Shimon rap here.
References:
1. Savery, R., Zahray, L., & Weinberg, G. (2020). Shimon the Rapper: A Real-Time System for Human-Robot Interactive Rap Battles. Proceedings of the International Conference on Computational Creativity (ICCC).
2. Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Text. Proceedings of EMNLP 2004.