Top music generation datasets in 2022 – Analytics India Magazine

Artificial intelligence has been tapped for synthetic music generation for some time now. However, the watershed moment came when music informatics met AI. Now, music AI researchers are using the latest advances in AI, ML, and analytics to build models that create music on par with human composers.
The strength of any AI/ML model is predicated on the data it’s fed. Below, we look at the major music generation datasets doing the rounds in 2022.
NSynth
NSynth is probably the largest dataset of annotated instrumental notes, with 305,979 musical notes, each with a unique pitch, timbre and envelope. The notes were collected from 1,006 instruments in commercial sample libraries and are annotated by source (acoustic, electronic or synthetic), instrument family and sonic qualities. The instrument families covered are bass, brass, flute, guitar, keyboard, mallet, organ, reed, string, synth lead and vocal.
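As a rough illustration of how such annotations might be used, the sketch below filters note records by source and instrument family. The two records are made up, and the field names (`instrument_source_str`, `instrument_family_str`, `qualities_str`) are assumptions modelled on the JSON metadata commonly distributed with the dataset; check the actual files before relying on them.

```python
from typing import Dict, List

# Hypothetical note records for illustration only.
# Field names are assumed; verify them against the real metadata file.
NOTES: Dict[str, dict] = {
    "guitar_acoustic_001-060-100": {
        "pitch": 60,
        "velocity": 100,
        "instrument_source_str": "acoustic",
        "instrument_family_str": "guitar",
        "qualities_str": ["bright"],
    },
    "bass_synthetic_002-036-075": {
        "pitch": 36,
        "velocity": 75,
        "instrument_source_str": "synthetic",
        "instrument_family_str": "bass",
        "qualities_str": ["dark", "distortion"],
    },
}


def select_notes(notes: Dict[str, dict], source: str, family: str) -> List[str]:
    """Return the sorted IDs of notes matching a source and an instrument family."""
    return sorted(
        note_id
        for note_id, meta in notes.items()
        if meta["instrument_source_str"] == source
        and meta["instrument_family_str"] == family
    )


acoustic_guitar_ids = select_notes(NOTES, "acoustic", "guitar")
```

The same pattern extends to filtering on pitch range or on individual sonic qualities.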
MAESTRO
The MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset has over 200 hours of paired audio and MIDI recordings from ten years of the International Piano-e-Competition. The MIDI data includes key strike velocities and sustain/sostenuto/una corda pedal positions.
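To make the velocity and pedal data concrete, here is a minimal sketch that separates the two from a stream of already-decoded MIDI events. It relies on the standard MIDI controller numbers for the three pedals (64 for sustain, 66 for sostenuto, 67 for the soft or una corda pedal). The `Event` class and the sample events are hypothetical; a real pipeline would decode them from the dataset's MIDI files with a MIDI library.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Standard MIDI continuous-controller numbers for the piano pedals.
PEDAL_CONTROLLERS = {64: "sustain", 66: "sostenuto", 67: "una corda"}


@dataclass
class Event:
    """Hypothetical, already-decoded MIDI event (not a real library type)."""
    time: float   # seconds from the start of the performance
    kind: str     # "note_on" or "control_change"
    number: int   # pitch for note_on, controller number for control_change
    value: int    # velocity for note_on, controller value (0-127) otherwise


def summarise(events: List[Event]) -> Tuple[List[int], Dict[str, List[Tuple[float, int]]]]:
    """Collect key strike velocities and per-pedal (time, position) pairs."""
    # A note_on with velocity 0 conventionally acts as a note-off, so skip it.
    velocities = [e.value for e in events if e.kind == "note_on" and e.value > 0]
    pedals: Dict[str, List[Tuple[float, int]]] = {
        name: [] for name in PEDAL_CONTROLLERS.values()
    }
    for e in events:
        if e.kind == "control_change" and e.number in PEDAL_CONTROLLERS:
            pedals[PEDAL_CONTROLLERS[e.number]].append((e.time, e.value))
    return velocities, pedals


# Made-up events for illustration.
EVENTS = [
    Event(0.0, "control_change", 64, 127),
    Event(0.1, "note_on", 60, 72),
    Event(0.5, "note_on", 60, 0),
    Event(0.9, "control_change", 64, 0),
    Event(1.0, "control_change", 67, 90),
]
velocities, pedals = summarise(EVENTS)
```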
URMP (University of Rochester Multi-Modal Musical Performance) 
URMP is a dataset for facilitating audio-visual analysis of musical performances. It contains several simple multi-instrument musical pieces assembled from separately recorded performances of individual tracks. For each piece, a musical score is provided in MIDI format.
Bach Doodle Dataset
The dataset consists of 21.6 million harmonisations submitted through the Bach Doodle, along with metadata about each composition, such as country of origin and user feedback. It also includes the MIDI of the user-entered melody and the MIDI of the generated harmonisation. An exploration of the melodies in the dataset highlights the most repeated melodies from each country, that is, the regional hits.
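The "regional hits" idea amounts to counting identical melodies per country. A minimal sketch follows, assuming each submission has been reduced to a country code and a tuple of MIDI pitches; that record layout and the sample data are assumptions for illustration, not the dataset's actual schema.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Melody = Tuple[int, ...]

# Made-up submissions: (country code, melody as MIDI pitch numbers).
SUBMISSIONS: List[Tuple[str, Melody]] = [
    ("IN", (60, 62, 64, 65)),
    ("IN", (60, 62, 64, 65)),
    ("IN", (67, 65, 64, 62)),
    ("US", (60, 60, 67, 67)),
    ("US", (60, 60, 67, 67)),
    ("US", (60, 62, 64, 65)),
]


def top_melody_by_country(
    submissions: List[Tuple[str, Melody]]
) -> Dict[str, Tuple[Melody, int]]:
    """Return the most frequently submitted melody and its count for each country."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for country, melody in submissions:
        counts[country][melody] += 1
    return {country: ctr.most_common(1)[0] for country, ctr in counts.items()}


regional_hits = top_melody_by_country(SUBMISSIONS)
```

At the scale of millions of submissions, the same counting would more likely be done in a database or a dataframe library, but the logic is unchanged.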
The Lakh MIDI Dataset v0.1
This dataset contains 176,581 unique MIDI files, of which 45,129 have been matched and aligned to entries in the Million Song Dataset. It is mainly intended to facilitate large-scale music information retrieval, both symbolic and audio content-based.
Music21
Music21 is a Python-based toolkit for computer-aided musicology rather than a dataset in the strict sense, although it ships with a corpus of scores. It helps scholars quickly find answers to questions like “I wonder how often Bach does that” or “Which band used these chords for the first time?”, and to learn more about Renaissance counterpoint, Indian ragas, post-tonal pitch structures or the form of minuets.
Datasets for Indian Music
CompMusic catalogues datasets for Indian art music. The project aims to advance the automatic description of music by emphasising cultural specificity, carrying out research in music information processing with a domain-knowledge approach. It focuses mainly on five music traditions of the world: Hindustani (North India), Carnatic (South India), Turkish-makam (Turkey), Arab-Andalusian (Maghreb), and Beijing Opera (China).
Indian Music Tonic Dataset
The dataset contains 597 commercially available audio music recordings of Indian art music, both Hindustani and Carnatic music, where each recording is manually annotated with the tonic of the lead artist.
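A tonic annotation is useful because pitch in Indian art music is interpreted relative to the tonic rather than to an absolute reference. As a small worked example, a frequency can be expressed in cents above the tonic with cents = 1200 * log2(f / tonic). The function names below are illustrative only.

```python
import math


def cents_above_tonic(freq_hz: float, tonic_hz: float) -> float:
    """Interval from the tonic to freq_hz in cents (1200 cents = one octave)."""
    if freq_hz <= 0 or tonic_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 1200.0 * math.log2(freq_hz / tonic_hz)


def fold_to_octave(cents: float) -> float:
    """Map any interval into the range [0, 1200) so octaves coincide."""
    return cents % 1200.0


# With a tonic of 220 Hz, 330 Hz lies a just perfect fifth above it
# (a little under 702 cents), and 440 Hz lies exactly one octave above.
fifth = cents_above_tonic(330.0, 220.0)
octave = cents_above_tonic(440.0, 220.0)
```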
Carnatic Varnam Dataset
The dataset has 28 solo vocal recordings, made for research on the intonation analysis of Carnatic ragas. It contains audio recordings, time-aligned tala cycle annotations and swara notations in a machine-readable format.
Carnatic Music Rhythm Dataset
This dataset is a sub-collection of 176 excerpts in four talas of Carnatic music, with audio, tala-related metadata and time-aligned markers indicating the progress through the tala cycles.
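Time-aligned cycle markers make it straightforward to measure how long each tala cycle lasts. The sketch below assumes the markers have been loaded as (time in seconds, beat number) pairs in which beat 1 marks the start of a cycle. This layout and the sample values are assumptions for illustration, not the dataset's actual file format.

```python
from typing import List, Tuple

# Made-up markers: (time in seconds, beat number within the cycle).
MARKERS: List[Tuple[float, int]] = [
    (0.0, 1), (0.5, 2), (1.0, 3),
    (1.5, 1), (2.0, 2), (2.5, 3),
    (3.25, 1),
]


def cycle_durations(markers: List[Tuple[float, int]]) -> List[float]:
    """Return the duration in seconds of each complete cycle."""
    # A cycle starts at every marker whose beat number is 1.
    starts = [time for time, beat in markers if beat == 1]
    # Each complete cycle spans two consecutive cycle starts.
    return [round(later - earlier, 3) for earlier, later in zip(starts, starts[1:])]


durations = cycle_durations(MARKERS)
```

Comparing successive durations gives a simple view of how the tempo drifts over an excerpt.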
Hindustani Music Rhythm Dataset
The dataset is a sub-collection of 151 excerpts in four taals of Hindustani music, with audio, taal-related metadata and time-aligned markers indicating the progress through the taal cycles.
Mridangam Stroke Dataset
The dataset contains 7,162 audio examples of individual Mridangam strokes, covering ten different strokes played on Mridangams with six different tonic values.
Mridangam Tani-avarthanam Dataset
The dataset is a transcribed collection of two tani-avarthanams played by Mridangam maestro Padmavibhushan Umayalpuram K. Sivaraman. The audio was recorded at IIT Madras and annotated by Carnatic percussionists. The dataset contains 24 minutes of audio and 8,800 strokes.
Saraga: Research datasets of Indian Art Music
The repository contains time-aligned melody, rhythm and structural annotations for two large open datasets of Indian Art Music (Carnatic and Hindustani music).
Tabla Solo Dataset
The dataset is a transcribed collection of Tabla solo audio recordings spanning compositions from six different Gharanas of Tabla, played by Pt. Arvind Mulgaonkar. It consists of audio and time-aligned bol transcriptions.
© Analytics India Magazine Pvt Ltd 2022
About Kalika Ayuna
