My current hobby is building a Shazam-like service for bird songs. I started by looking for a dataset of bird songs. Fortunately, there is Xeno Canto, a crowd-sourcing platform for building up a database of bird songs and calls.
I decided to start with classifying the 30 most commonly recorded European birds.
The Xeno Canto data
Through the Xeno Canto API I downloaded all available top-quality recordings for these 30 birds (9875 separate mp3 files, 20 GB in total).
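A minimal sketch of how such a download could be driven, not the exact script used here. It assumes the version 2 endpoint of the public API, that "top quality" corresponds to the `q:A` search tag, and that each JSON page carries a `recordings` list whose items have a `file` download URL. All of these are assumptions to check against the current API documentation; newer API versions may also require a key. The function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the public Xeno Canto API (version 2).
API_URL = "https://xeno-canto.org/api/2/recordings"


def build_query_url(species: str, quality: str = "A", page: int = 1) -> str:
    """Build a search URL for one species, restricted to one quality rating."""
    query = f"{species} q:{quality}"  # 'q:A' is assumed to mean top quality
    return f"{API_URL}?{urlencode({'query': query, 'page': page})}"


def fetch_page(url: str) -> dict:
    """Download and decode one page of search results (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


def extract_file_urls(page: dict) -> list[str]:
    """Collect the download URLs from one decoded page of results."""
    # 'recordings' and 'file' are assumed field names of the JSON response.
    return [rec["file"] for rec in page.get("recordings", []) if rec.get("file")]
```

Looping `build_query_url` over the 30 species and over every result page, then saving each URL returned by `extract_file_urls`, would reproduce the download step.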
Some peculiarities
The most common species among the top-quality recordings is the Great Tit, with 903 files and a total duration of more than 13 hours.
The Common Blackbird holds the longest total duration, with more than 27 hours of recordings.
The longest single recording (1 hour 18 minutes) is of a Marsh Warbler, recorded by Volker Arnold in Heide, Germany.
The shortest single recording is 0.764 seconds long (Red Crossbill).
The average recording duration is 1 min 35 seconds.
95% of the files are shorter than 4:10.
And 1% of the files are longer than 9:45.
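Figures like the average and the 95% / 99% cut-offs above can be computed from a plain list of durations in seconds. The sketch below uses only the standard library; the function names are made up for illustration, and the "inclusive" percentile method is one choice among several, so the exact cut-offs may differ slightly from other tools.

```python
from statistics import mean, quantiles


def format_duration(seconds: float) -> str:
    """Render a duration in seconds as 'm:ss'."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}:{secs:02d}"


def summarize(durations: list[float]) -> dict[str, float]:
    """Return the mean plus the 95th and 99th percentiles of the durations."""
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = quantiles(durations, n=100, method="inclusive")
    return {"mean": mean(durations), "p95": cuts[94], "p99": cuts[98]}
```

For example, `format_duration(summarize(durations)["p95"])` gives the length below which 95% of the recordings fall.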
Some technical details
All of the files are mp3.
They vary in sampling rate. The most common is 44.1 kHz.
Most of the files are stereo while some are not.
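One way to gather these details is to read the header of every mp3 and tally the sampling rates and channel counts. The sketch below assumes the third-party `mutagen` package for reading the headers, which is my choice for illustration and not necessarily the tool used for the numbers above. The reader is passed in as a parameter, so any other library could be swapped in.

```python
from collections import Counter
from typing import Callable, Iterable


def read_mp3_info(path: str) -> tuple[int, int]:
    """Return (sample_rate_hz, channels) for one mp3 file."""
    from mutagen.mp3 import MP3  # third-party, imported only when needed

    info = MP3(path).info
    return info.sample_rate, info.channels


def tally_formats(
    paths: Iterable[str],
    reader: Callable[[str], tuple[int, int]] = read_mp3_info,
) -> tuple[Counter, Counter]:
    """Count how many files use each sampling rate and each channel count."""
    rates: Counter = Counter()
    channels: Counter = Counter()
    for path in paths:
        rate, n_channels = reader(path)
        rates[rate] += 1
        channels[n_channels] += 1
    return rates, channels
```

Calling `rates.most_common(1)` on the first result then shows the most abundant sampling rate.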
The findings
The API does not provide a way to list the other species that may be present in each recording, although the site itself does. It’s a pity, because those background species could seriously impede the machine learning process. I hope I’ll find a way to cope with it.