Techniques used
Mel-frequency spectrogram
The mel-frequency spectrogram is a representation of a signal's frequency content as it varies over time. It is derived from the traditional spectrogram, but instead of linearly spaced frequency bins it uses bins spaced according to the mel scale, a perceptual scale of pitches based on human hearing; one common formulation maps a frequency f in Hz to m = 2595 log10(1 + f/700) mels. This scaling is designed to better represent how humans perceive differences in pitch.
Steps to get the mel spectrogram (a minimal code sketch follows this list):
1. Compute the short-time Fourier transform (STFT) of the signal to obtain a conventional spectrogram.
2. Convert frequencies to the mel scale: choose the number of mel bands, construct the corresponding mel filter banks, and apply them to the spectrogram.
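A minimal sketch of these two steps using librosa, the library named in the conclusion. The file name, sampling rate, FFT size, and number of mel bands are illustrative assumptions, not the project's exact settings.

```python
import numpy as np
import librosa

# Step 1: load audio and compute an STFT power spectrogram.
# "bird_call.wav" and the parameter values below are illustrative.
y, sr = librosa.load("bird_call.wav", sr=22050)
stft = np.abs(librosa.stft(y, n_fft=2048, hop_length=512)) ** 2

# Step 2: build mel filter banks and apply them to the spectrogram.
mel_basis = librosa.filters.mel(sr=sr, n_fft=2048, n_mels=128)
mel_spec = mel_basis @ stft  # shape: (n_mels, time_frames)

# librosa also wraps both steps in a single call:
# librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
```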
Convolutional Neural Network
CNNs, or Convolutional Neural Networks, are deep learning architectures particularly effective for image processing tasks. They consist of layers that apply convolution operations to capture features like edges and textures, pooling layers to reduce spatial dimensions, activation functions for non-linearity, and fully connected layers for classification or regression. CNNs excel at automatically learning hierarchical representations from raw data, making them invaluable for tasks such as image classification, object detection, and segmentation, where they have achieved state-of-the-art performance.
Dataset
The dataset we are working with consists of Indian bird calls. It is taken from a larger dataset containing many more classes; each class here is a separate genus, so the bird calls are more easily differentiable. We work with 5 classes (Liocichla phoenicea, Dicrurus andamanensis, Cyornis poliogenys, Arborophila torqueola, Alcippe cinerea), each having between 20 and 30 .wav audio files. Overall, the dataset is balanced and evenly distributed, but there were some problems with the data, including background noise and the uneven length of samples, which made it difficult to extract mel spectrograms of the same shape. Each sample was truncated to 15 seconds to solve the issue of uneven length, but the noise was not addressed.
Data preprocessing
Audio files are loaded at a specified sampling rate with a duration of 15 seconds. The loaded audio is then split into smaller chunks of 5 seconds each, ensuring a consistent signal length for further processing. Next, each 5-second chunk is converted into a mel spectrogram, which can be considered a visual representation of the audio's frequency content over time. Finally, the spectrograms are converted to the decibel scale and normalized.
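A sketch of this pipeline under stated assumptions: the sampling rate, the number of mel bands, and min-max normalization are guesses, since the report does not specify them.

```python
import numpy as np
import librosa

SR = 22050           # assumed sampling rate; the report does not state the exact value
CLIP_SECONDS = 15
CHUNK_SECONDS = 5

def preprocess(path):
    """Load a clip, split it into 5 s chunks, and return normalized
    log mel spectrograms -- a sketch of the pipeline described above."""
    y, _ = librosa.load(path, sr=SR, duration=CLIP_SECONDS)
    # Zero-pad clips shorter than 15 s so every chunk has the same length.
    y = librosa.util.fix_length(y, size=SR * CLIP_SECONDS)

    specs = []
    chunk_len = SR * CHUNK_SECONDS
    for start in range(0, len(y), chunk_len):
        chunk = y[start:start + chunk_len]
        mel = librosa.feature.melspectrogram(y=chunk, sr=SR, n_mels=128)
        mel_db = librosa.power_to_db(mel, ref=np.max)  # decibel scale
        # Min-max normalization to [0, 1]; an assumed choice of normalization.
        mel_norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())
        specs.append(mel_norm)
    return np.stack(specs)  # shape: (3, n_mels, time_frames)
```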
Architecture
The noisy nature of the data in audio classification tasks is often a cause of overfitting. As a result, a simple but efficient feature extractor, in this case a CNN, is employed. Its structure consists of 3 convolutional layers, each followed by a max-pooling layer. Finally, a dense layer with 50% dropout is used to help deal with overfitting, followed by a softmax output layer.
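A sketch of this architecture, assuming a Keras implementation. The filter counts, kernel sizes, dense width, and input shape are illustrative assumptions; the report specifies only the layer types, the 50% dropout, and the softmax output.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(128, 216, 1)),             # (n_mels, time_frames, channels); assumed
    layers.Conv2D(32, (3, 3), activation="relu"),  # 3 conv layers, each followed
    layers.MaxPooling2D((2, 2)),                   # by a max-pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                           # 50% dropout, as described
    layers.Dense(5, activation="softmax"),         # one output per bird class
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```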
Results
The detailed classification report for the task with random initialization gives a test accuracy of 83.33% and a test loss of 0.4696. The accuracy and loss curves across epochs, together with a training accuracy of 80.87% against the 83.33% test accuracy, let us rule out overfitting, which is, as previously mentioned, prevalent in audio classification.
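For reference, a per-class report like this can be generated with scikit-learn; `model`, `x_test`, and one-hot `y_test` are assumed to come from the training pipeline sketched above.

```python
import numpy as np
from sklearn.metrics import classification_report

# Predict class probabilities on the held-out test set, then reduce both
# predictions and one-hot labels to class indices for the report.
y_prob = model.predict(x_test)
y_pred = np.argmax(y_prob, axis=1)
y_true = np.argmax(y_test, axis=1)  # assumes one-hot encoded labels
print(classification_report(y_true, y_pred))
```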
Conclusion
The deep learning model developed in this project successfully classified different species of birds based on their vocalizations. Using a dataset from Kaggle containing audio recordings of five bird species, we processed the audio data with the Python library librosa, converting the recordings into log mel-spectrogram images to capture the time-frequency characteristics of the bird calls. We achieved a test accuracy of approximately 83.33%, indicating the effectiveness of deep learning with audio spectrograms for bird species classification.
Mentors:
Vaibhav Santhosh
Aryan Herur
Mentees:
Rudra Gandhi
Guhan Balaji
Yash Kedia
Prakhyath Sai V
Meet link: https://meet.google.com/otz-monc-prq
Report prepared on May 9, 2024, 4:16 p.m. by:
Report reviewed and approved by Aditya Pandia [CompSoc].