Techniques used
Mel-frequency spectrogram
The mel-frequency spectrogram is a representation of a signal's frequency content as it varies over time. It is derived from the traditional spectrogram, but instead of linearly spaced frequency bins it uses bins spaced according to the mel scale, a perceptual scale of pitches based on human hearing; one common formulation maps a frequency f in Hz to m = 2595 log10(1 + f/700) mels. This scaling is designed to better represent how humans perceive differences in pitch.
Steps to get the mel spectrogram (a minimal code sketch follows this list):
1. Compute the short-time Fourier transform (STFT) of the signal to obtain a conventional spectrogram.
2. Convert frequencies to the mel scale: choose the number of mel bands, construct the corresponding mel filter banks, and apply them to the spectrogram.
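A minimal sketch of these two steps using librosa, the library named in the conclusion. The file name, sampling rate, FFT size, and number of mel bands are illustrative assumptions, not the project's exact settings.

```python
import numpy as np
import librosa

# Step 1: load audio and compute an STFT power spectrogram.
# "bird_call.wav" and the parameter values below are illustrative.
y, sr = librosa.load("bird_call.wav", sr=22050)
stft = np.abs(librosa.stft(y, n_fft=2048, hop_length=512)) ** 2

# Step 2: build mel filter banks and apply them to the spectrogram.
mel_basis = librosa.filters.mel(sr=sr, n_fft=2048, n_mels=128)
mel_spec = mel_basis @ stft  # shape: (n_mels, time_frames)

# librosa also wraps both steps in a single call:
# librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
```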
Convolutional Neural Network
CNNs, or Convolutional Neural Networks, are deep learning architectures particularly effective for image processing tasks. They consist of layers that apply convolution operations to capture features like edges and textures, pooling layers to reduce spatial dimensions, activation functions for non-linearity, and fully connected layers for classification or regression. CNNs excel at automatically learning hierarchical representations from raw data, making them invaluable for tasks such as image classification, object detection, and segmentation, where they have achieved state-of-the-art performance.
Dataset
The dataset we are working with consists of Indian bird calls. It is taken from a larger dataset containing many more classes; each class here is a separate genus, so the bird calls are more easily differentiable. We work with 5 classes (Liocichla phoenicea, Dicrurus andamanensis, Cyornis poliogenys, Arborophila torqueola, Alcippe cinerea), each having between 20 and 30 .wav audio files. Overall, the dataset is balanced and evenly distributed, but there were some problems with the data, including background noise and the uneven length of samples, which made it difficult to extract mel spectrograms of the same shape. Each sample was truncated to 15 seconds to solve the issue of uneven length, but the noise was not addressed.
Data preprocessing
Audio files are loaded at a specified sampling rate with a duration of 15 seconds. The loaded audio is then split into smaller chunks of 5 seconds each, ensuring a consistent signal length for further processing. Next, each 5-second chunk is converted into a mel spectrogram, which can be considered a visual representation of the audio's frequency content over time. Finally, the spectrograms are converted to the decibel scale and normalized.
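A sketch of this pipeline under stated assumptions: the sampling rate, the number of mel bands, and min-max normalization are guesses, since the report does not specify them.

```python
import numpy as np
import librosa

SR = 22050           # assumed sampling rate; the report does not state the exact value
CLIP_SECONDS = 15
CHUNK_SECONDS = 5

def preprocess(path):
    """Load a clip, split it into 5 s chunks, and return normalized
    log mel spectrograms -- a sketch of the pipeline described above."""
    y, _ = librosa.load(path, sr=SR, duration=CLIP_SECONDS)
    # Zero-pad clips shorter than 15 s so every chunk has the same length.
    y = librosa.util.fix_length(y, size=SR * CLIP_SECONDS)

    specs = []
    chunk_len = SR * CHUNK_SECONDS
    for start in range(0, len(y), chunk_len):
        chunk = y[start:start + chunk_len]
        mel = librosa.feature.melspectrogram(y=chunk, sr=SR, n_mels=128)
        mel_db = librosa.power_to_db(mel, ref=np.max)  # decibel scale
        # Min-max normalization to [0, 1]; an assumed choice of normalization.
        mel_norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())
        specs.append(mel_norm)
    return np.stack(specs)  # shape: (3, n_mels, time_frames)
```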
Architecture
The noisy nature of the data in audio classification tasks is often a cause of overfitting. As a result, a simple but efficient feature extractor, in this case a CNN, is employed. Its structure consists of 3 convolutional layers, each followed by a max-pooling layer. Finally, a dense layer with 50% dropout is used to help deal with overfitting, followed by a softmax output layer.
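A sketch of this architecture, assuming a Keras implementation. The filter counts, kernel sizes, dense width, and input shape are illustrative assumptions; the report specifies only the layer types, the 50% dropout, and the softmax output.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(128, 216, 1)),             # (n_mels, time_frames, channels); assumed
    layers.Conv2D(32, (3, 3), activation="relu"),  # 3 conv layers, each followed
    layers.MaxPooling2D((2, 2)),                   # by a max-pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                           # 50% dropout, as described
    layers.Dense(5, activation="softmax"),         # one output per bird class
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```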
Results
The detailed classification report for the task with random initialization gives a test accuracy of 83.33% and a test loss of 0.4696. The accuracy and loss curves across epochs, together with a training accuracy of 80.87% against the 83.33% test accuracy, let us rule out overfitting, which is, as previously mentioned, prevalent in audio classification.
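For reference, a per-class report like this can be generated with scikit-learn; `model`, `x_test`, and one-hot `y_test` are assumed to come from the training pipeline sketched above.

```python
import numpy as np
from sklearn.metrics import classification_report

# Predict class probabilities on the held-out test set, then reduce both
# predictions and one-hot labels to class indices for the report.
y_prob = model.predict(x_test)
y_pred = np.argmax(y_prob, axis=1)
y_true = np.argmax(y_test, axis=1)  # assumes one-hot encoded labels
print(classification_report(y_true, y_pred))
```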
Conclusion
The deep learning model developed in this project successfully classified different species of birds based on their vocalizations. Using a dataset from Kaggle containing audio recordings of five bird species, we processed the audio data with the Python library librosa, converting the recordings into log mel-spectrogram images to capture the time-frequency characteristics of the bird calls. We achieved a test accuracy of approximately 83.33%, indicating the effectiveness of deep learning with audio spectrograms for bird species classification.
Mentors:
Vaibhav Santhosh
Aryan Herur
Mentees:
Rudra Gandhi
Guhan Balaji
Yash Kedia
Prakhyath Sai V
Meet link: https://meet.google.com/otz-monc-prq
Report prepared on May 9, 2024, 4:16 p.m. by:
Report reviewed and approved by Aditya Pandia [CompSoc].