Preface

This project uses several Machine Learning (ML) models together with Python libraries such as NumPy, Pandas, and Matplotlib to forecast the Air Quality Index (AQI) of a selected location.

[Google Meet Link for Virtual Expo 2024:

P0- Forecasting of Air Quality using Machine Learning
Friday, May 10 · 6:00 – 8:00pm
Time zone: Asia/Kolkata
Google Meet joining info
Video call link: https://meet.google.com/bcm-zhdv-jaq]

Contributors

Subhodeep Dey
G. B. Naren
Dhanush Shetty

Acknowledgements

We would like to thank the IEEE Student Branch for conducting Envision 2024.

Aim

To predict the subindices of the components that influence air quality using ML models, and from them to accurately predict the Air Quality Index (AQI).

Introduction

The data used in this project has been sourced from the Central Pollution Control Board (CPCB) website.

Selenium, a web automation tool, was used to extract hourly data containing concentrations of various gases, humidity levels, and other atmospheric parameters at a specific place. The collected data was then analyzed and used to train models that forecast the desired variable.

Approach

Data Acquisition

The data was extracted from the Central Pollution Control Board (CPCB) website using Selenium.

Data Preprocessing

The data was loaded from a .csv file into a Pandas dataframe. The dataframe was then formatted, and redundant attributes and entries were accounted for. Null values were handled by replacing them with the arithmetic mean of their respective attributes.
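
The mean-imputation step can be sketched as follows. This is a minimal illustration rather than the project's actual code: the helper name fill_nulls_with_mean, the sample column names, and the choice to drop duplicate rows as the "redundant entries" are all assumptions.

```python
import numpy as np
import pandas as pd


def fill_nulls_with_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with duplicate rows dropped and numeric nulls mean-filled."""
    cleaned = df.drop_duplicates().copy()
    numeric_cols = cleaned.select_dtypes(include="number").columns
    # mean() skips NaN by default, so each column is filled with the mean
    # of its observed values only.
    column_means = cleaned[numeric_cols].mean()
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(column_means)
    return cleaned


# Hypothetical sample standing in for the dataframe read from the .csv file.
sample = pd.DataFrame(
    {
        "From Date": ["2023-01-01 00:00", "2023-01-01 01:00", "2023-01-01 02:00"],
        "PM2.5": [40.0, np.nan, 60.0],
        "NO2": [10.0, 20.0, np.nan],
    }
)
cleaned = fill_nulls_with_mean(sample)
print(cleaned)
```

One design point worth keeping in mind: if the means are computed on the full dataset before the Training/Testing split, the Testing rows influence the fill values, so computing the means on the Training portion alone may be preferable.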

Model Building and Forecasting

Below we provide a brief description of every ML model we have used for forecasting.

FB Prophet Model

FB Prophet, usually just called Prophet, is an open-source software tool designed for forecasting time series data. It is particularly well-suited for situations where there's a trend and seasonal patterns like weekly or yearly cycles.

This model performed significantly better than its counterparts.

Prophet follows the sklearn model API. We create an instance of the Prophet class and then call its fit and predict methods.

We convert the data type of the 'From Date' column from object to datetime.

The entire dataset is split into Training and Testing datasets in an 85%-15% ratio (Training%-Testing%).

We then train the model.

After this, forecasts are made.

Then the accuracy is checked with the Testing data.
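
The Prophet steps above can be sketched as follows. Prophet expects a dataframe with a 'ds' (timestamp) column and a 'y' (value) column; everything else here is an assumption for illustration: the helper names, the sample data, the default Prophet settings, and the choice to split chronologically rather than at random.

```python
import pandas as pd


def to_prophet_frame(df, date_col="From Date", value_col="PM2.5"):
    """Build the two-column 'ds'/'y' frame that Prophet expects."""
    out = pd.DataFrame(
        {"ds": pd.to_datetime(df[date_col]), "y": df[value_col].astype(float)}
    )
    return out.sort_values("ds").reset_index(drop=True)


def chronological_split(frame, train_fraction=0.85):
    """Earliest rows go to training, the most recent rows to testing."""
    cut = int(round(len(frame) * train_fraction))
    return frame.iloc[:cut], frame.iloc[cut:]


def fit_and_forecast(train, test):
    """Fit Prophet on the training rows and predict at the test timestamps."""
    from prophet import Prophet  # imported lazily; needs the prophet package

    model = Prophet()
    model.fit(train)
    forecast = model.predict(test[["ds"]])
    return model, forecast  # forecast["yhat"] holds the point predictions


# Hypothetical hourly data standing in for the CPCB export.
timestamps = pd.Timestamp("2023-01-01") + pd.to_timedelta(list(range(100)), unit="h")
raw = pd.DataFrame(
    {
        "From Date": timestamps.strftime("%Y-%m-%d %H:%M:%S"),
        "PM2.5": [float(30 + i % 24) for i in range(100)],
    }
)
frame = to_prophet_frame(raw)
train, test = chronological_split(frame)
```

Calling fit_and_forecast(train, test) then returns the fitted model and a forecast dataframe whose 'yhat' column can be compared against test['y'].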

Plotting

The PM 2.5 forecast is plotted using the plot function.

Plots below show the forecasts in different scales.

The same thing is repeated for other parameters to forecast their future values.

After this, the subindex of each parameter is calculated from the obtained forecasts using the standard formula (with benchmark values).

Finally, out of all the subindices, the worst (highest) value gives us the Air Quality Index (AQI).
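
The subindex calculation is commonly a linear interpolation between benchmark (breakpoint) values: for a concentration C falling in a band [B_lo, B_hi] mapped to index range [I_lo, I_hi], the subindex is I_lo + (I_hi - I_lo) / (B_hi - B_lo) * (C - B_lo). A minimal sketch, assuming that formula is the one meant here; the function names and the band values below are illustrative placeholders, and the real benchmark values should be taken from the CPCB tables.

```python
def sub_index(concentration, bands):
    """Linearly interpolate a concentration onto the index scale.

    bands: ascending list of (c_low, c_high, i_low, i_high) tuples.
    """
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    for c_low, c_high, i_low, i_high in bands:
        if concentration <= c_high:
            slope = (i_high - i_low) / (c_high - c_low)
            return i_low + slope * (concentration - c_low)
    # Above the last band: cap at the top of the listed scale.
    return float(bands[-1][3])


def aqi(sub_indices):
    """The AQI is the worst (highest) of the subindices."""
    return max(sub_indices.values())


# Illustrative bands only, not authoritative benchmark values.
BANDS = {
    "PM2.5": [(0, 30, 0, 50), (30, 60, 50, 100), (60, 90, 100, 200)],
    "NO2": [(0, 40, 0, 50), (40, 80, 50, 100), (80, 180, 100, 200)],
}

# Hypothetical forecast concentrations for one time step.
forecast = {"PM2.5": 45.0, "NO2": 20.0}
sub_indices = {name: sub_index(value, BANDS[name]) for name, value in forecast.items()}
overall = aqi(sub_indices)
print(sub_indices, overall)
```

Passing the bands in as data keeps the interpolation independent of any one pollutant, so the same function serves every forecast parameter.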

SARIMA Model

SARIMA, which stands for Seasonal AutoRegressive Integrated Moving Average, is an extension of the ARIMA model specifically designed for time series data with seasonal patterns. SARIMA explicitly considers seasonal effects in the data, allowing for more accurate forecasts when there are predictable patterns across specific periods. SARIMA provides a more comprehensive approach to time series forecasting by combining the strengths of ARIMA for general trends with the ability to capture predictable seasonal variations in the data.

Training and fitting the model with the same Training set used in FB Prophet.

Predicting and plotting the result.

Then the accuracy is checked with the Testing data.

The same thing is repeated for other constituents to evaluate other parameters.
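
A minimal sketch of the SARIMA fit and the accuracy check, using the SARIMAX class from statsmodels. The order and seasonal_order values are placeholders (a seasonal period of 24 assumes a daily cycle in hourly data), and RMSE and MAE are assumed as error measures since the metric used is not stated here.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


def sarima_forecast(train_values, steps, order=(1, 1, 1), seasonal_order=(1, 1, 1, 24)):
    """Fit SARIMA on the training values and forecast `steps` points ahead."""
    # Imported lazily; needs the statsmodels package.
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    model = SARIMAX(train_values, order=order, seasonal_order=seasonal_order)
    fitted = model.fit(disp=False)
    return np.asarray(fitted.forecast(steps=steps))


# Hypothetical test values and predictions, to show the accuracy check alone.
actual = [10.0, 12.0, 14.0]
predicted = [11.0, 12.0, 12.0]
errors = {"rmse": rmse(actual, predicted), "mae": mae(actual, predicted)}
print(errors)
```

In practice, predicted would come from sarima_forecast(train_values, steps=len(test_values)) so that the forecast horizon matches the Testing set.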

Support Vector Regression (SVR) Model

SVR is a type of machine learning algorithm used for regression tasks. Unlike traditional regression models that aim to minimize the overall error between predicted and actual values, SVR focuses on finding a hyperplane that fits the data points within a specific margin. This hyperplane acts as the decision boundary for predicting continuous output values.

First, we convert the data type of the 'From Date' column from object to string.

Then we split the datetime in 'From Date' into two columns, 'Date' and 'Time'.

Then we break 'Date' and 'Time' into further columns, i.e. 'Date' into 'Year', 'Month', 'Date' and 'Time' into 'Hour', 'Minute', 'Second'.

We then split the entire dataset into Training and Testing datasets.

This is followed by training the model and making predictions of the constituent concentration.

Then the accuracy is checked with the Testing data.

The same thing is repeated for other constituents to evaluate other parameters.
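
The date-splitting and training steps can be sketched as below. The timestamp format 'YYYY-MM-DD HH:MM:SS', the helper names, and the SVR hyperparameters are assumptions; the day-of-month column is named 'Day' here only to avoid clashing with the intermediate 'Date' column.

```python
import pandas as pd


def expand_datetime(df, date_col="From Date"):
    """Split a 'YYYY-MM-DD HH:MM:SS' string column into numeric parts."""
    out = df.copy()
    # First split into the date half and the time half.
    halves = out[date_col].astype(str).str.split(" ", expand=True)
    date_parts = halves[0].str.split("-", expand=True).astype(int)
    time_parts = halves[1].str.split(":", expand=True).astype(int)
    for position, name in enumerate(["Year", "Month", "Day"]):
        out[name] = date_parts[position]
    for position, name in enumerate(["Hour", "Minute", "Second"]):
        out[name] = time_parts[position]
    return out


def fit_svr(features, target):
    """Scale the features, then fit an RBF-kernel SVR (placeholder settings)."""
    # Imported lazily; needs the scikit-learn package.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
    return model.fit(features, target)


# Hypothetical rows standing in for the CPCB data.
sample = pd.DataFrame(
    {
        "From Date": ["2023-05-10 18:00:00", "2023-05-10 19:30:00"],
        "PM2.5": [42.0, 47.5],
    }
)
expanded = expand_datetime(sample)
print(expanded)
```

Scaling is included because SVR is sensitive to feature magnitude, and a 'Year' column in the thousands would otherwise dominate an 'Hour' column in the tens.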

Long Short-Term Memory (LSTM) Model

LSTMs are a type of recurrent neural network (RNN) architecture specifically designed to overcome limitations in traditional RNNs. RNNs struggle with capturing long-term dependencies in sequential data due to the vanishing gradient problem. LSTMs address this by introducing a memory cell that can store information for longer periods and selectively update it over time.

Converting the data type of the 'From Date' column from object to datetime, and then into an ordinal representation

Defining the independent and dependent variables

Normalizing the dependent variable

Splitting the entire dataset into Training and Testing dataset

Preparing, Building, Compiling and Training the model

Evaluating the model and Plotting the result

The same thing is repeated for other constituents to evaluate other parameters.
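
A minimal sketch of the preparation steps and a small Keras model. The helper names, the min-max form of normalization, the layer sizes, and the single-timestep input shape are all assumptions and not the project's actual configuration.

```python
import numpy as np
import pandas as pd


def to_ordinal(dates):
    """Map each timestamp to its proleptic Gregorian day number."""
    return np.array([stamp.toordinal() for stamp in pd.to_datetime(dates)])


def min_max_scale(values):
    """Scale values into [0, 1]; also return the bounds needed to undo it."""
    values = np.asarray(values, dtype=float)
    low, high = values.min(), values.max()
    return (values - low) / (high - low), (low, high)


def inverse_scale(scaled, bounds):
    """Map scaled predictions back to the original units."""
    low, high = bounds
    return np.asarray(scaled, dtype=float) * (high - low) + low


def build_lstm(timesteps=1, features=1):
    """Build and compile a small LSTM regressor (placeholder sizes)."""
    from tensorflow import keras  # imported lazily; needs tensorflow

    model = keras.Sequential(
        [
            keras.Input(shape=(timesteps, features)),
            keras.layers.LSTM(32),
            keras.layers.Dense(1),
        ]
    )
    model.compile(optimizer="adam", loss="mse")
    return model


# Hypothetical daily values standing in for one constituent.
dates = ["2023-01-01", "2023-01-02", "2023-01-03"]
pm25 = [20.0, 60.0, 40.0]
x = to_ordinal(dates).reshape(-1, 1, 1)  # (samples, timesteps, features)
y, bounds = min_max_scale(pm25)
```

Note that an ordinal is a day number, so hourly rows from the same date share one ordinal value; if the hour matters, it may need to be kept as a separate feature.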

Random Forest Model

Random Forests are a supervised learning technique that falls under the category of ensemble learning. They are widely used for both classification and regression tasks. They work by combining the predictions of multiple decision trees, resulting in a more robust and accurate overall model.

First, converting the data type of the 'From Date' column from object to datetime

Splitting the entire dataset into Training and Testing datasets on the basis of time-based evaluation

Defining the various features and the target variable

Building and Evaluating the model, Predicting the concentration of PM 2.5

Then the accuracy is checked with the Testing data

Plotting the Result

The same thing is repeated for other constituents to evaluate other parameters.
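
A time-based split followed by a Random Forest fit can be sketched as below. The cutoff date, the feature column, the helper names, and the model settings are hypothetical choices for illustration.

```python
import pandas as pd


def time_based_split(df, cutoff, date_col="From Date"):
    """Rows before the cutoff train the model; rows at or after it test it."""
    dates = pd.to_datetime(df[date_col])
    cutoff = pd.Timestamp(cutoff)
    return df[dates < cutoff], df[dates >= cutoff]


def fit_random_forest(train, feature_cols, target_col="PM2.5"):
    """Fit a Random Forest regressor on the chosen feature columns."""
    # Imported lazily; needs the scikit-learn package.
    from sklearn.ensemble import RandomForestRegressor

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    return model.fit(train[feature_cols], train[target_col])


# Hypothetical rows standing in for the CPCB data.
sample = pd.DataFrame(
    {
        "From Date": [
            "2023-01-01 00:00",
            "2023-01-01 12:00",
            "2023-01-02 00:00",
            "2023-01-02 12:00",
        ],
        "NO2": [18.0, 22.0, 25.0, 21.0],
        "PM2.5": [40.0, 55.0, 61.0, 48.0],
    }
)
train, test = time_based_split(sample, cutoff="2023-01-02")
print(len(train), len(test))
```

Splitting on a date rather than at random keeps every Testing row later than every Training row, which mirrors how a forecast is used in practice.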

Conclusion

This project successfully forecasted the AQI of a specific region using different ML models. Evaluation measures showed that FB Prophet provided the best results after its hyperparameters were tuned several times. The forecast graphs give a visual representation of how the AQI varies over a prolonged period. The findings contribute to a better understanding of pollution and air quality, and support informed decision-making in pollution control and air quality management. They also make it possible to study the effects of different concentrations of atmospheric substances and pollutants on buildings, materials, and the environment as a whole.

Recommendations

Based on the project's outcomes, the following recommendations are proposed:

Instead of predicting only AQI values, predicting AQI categories (Good, Moderate, Unhealthy) would be more effective, as categories give a better understanding of the pollution problem.
Developing a real-time forecasting system which, in tandem with sensors, can raise a real-time alert when the AQI is at its worst.
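
As a starting point for the first recommendation, a forecast AQI value can be mapped to a category with a simple band lookup. The bounds and labels below assume the six-band CPCB scheme and are an illustration only; they would need to be swapped for whichever category scheme is chosen.

```python
from bisect import bisect_left

# Upper bound of each band and its label (assumed six-band scheme).
BAND_UPPER_BOUNDS = [50, 100, 200, 300, 400, 500]
BAND_LABELS = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]


def aqi_category(aqi_value):
    """Return the label of the band that contains the given AQI value."""
    if aqi_value < 0:
        raise ValueError("AQI must be non-negative")
    position = bisect_left(BAND_UPPER_BOUNDS, aqi_value)
    # Values above the last bound fall into the final band.
    return BAND_LABELS[min(position, len(BAND_LABELS) - 1)]


# Hypothetical forecast AQI values.
forecast_aqi = [42.0, 75.5, 180.0, 460.0]
categories = [aqi_category(value) for value in forecast_aqi]
print(categories)
```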