This project uses several Machine Learning (ML) models together with Python libraries such as NumPy, Pandas, and Matplotlib to forecast the Air Quality Index (AQI) of a selected location.
[Google Meet Link for Virtual Expo 2024:
P0- Forecasting of Air Quality using Machine Learning
Friday, May 10 · 6:00 – 8:00pm
Time zone: Asia/Kolkata
Google Meet joining info
Video call link: https://meet.google.com/bcm-zhdv-jaq]
Subhodeep Dey
G. B. Naren
Dhanush Shetty
We would like to thank the IEEE Student Branch for conducting Envision 2024.
To predict the subindices of the components that influence air quality using ML models, and from these subindices to accurately predict the Air Quality Index (AQI).
The data used in this project has been sourced from the Central Pollution Control Board (CPCB) website.
Selenium, a web automation tool, was used to extract hourly data containing concentrations of various gases, humidity levels, and other atmospheric parameters at a specific place. The collected data was then analyzed and used to train models, allowing us to forecast our desired variable.
The data was extracted from the Central Pollution Control Board (CPCB) website using Selenium.
The data was loaded from a .csv file into a Pandas dataframe. The dataframe was then formatted to account for redundant attributes and entries. Null values were handled by replacing them with the arithmetic mean of their respective attributes.
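The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample values, the column names other than 'From Date', and the helper name `clean` are assumptions, and "redundant entries" is interpreted here as duplicate rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the CPCB export (in practice this
# would come from pd.read_csv). Column names and values are assumptions.
raw = pd.DataFrame({
    "From Date": ["2023-01-01 00:00", "2023-01-01 01:00",
                  "2023-01-01 02:00", "2023-01-01 02:00"],
    "PM2.5": [40.0, np.nan, 50.0, 50.0],
    "RH": [60.0, 62.0, np.nan, np.nan],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, then fill nulls with each column's mean."""
    df = df.drop_duplicates().reset_index(drop=True)
    numeric_cols = df.select_dtypes(include="number").columns
    # Replace every null with the arithmetic mean of its own attribute.
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
    return df

cleaned = clean(raw)
print(cleaned)
```

Mean imputation is simple, but it flattens short-term variation in hourly data, so the choice is worth revisiting if long gaps are present.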
Below we provide a brief description of every ML model we have used for forecasting.
FB Prophet, usually just called Prophet, is an open-source software tool designed for forecasting time series data. It is particularly well-suited for situations where there's a trend and seasonal patterns like weekly or yearly cycles.
This model performed significantly better than all of the other models.
Prophet follows the sklearn model API. We create an instance of the Prophet class and then call its fit and predict methods.
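As a rough sketch of this fit/predict pattern: Prophet expects a dataframe with a `ds` (timestamp) column and a `y` (value) column. The helper names `to_prophet_frame` and `fit_and_forecast`, the 'PM2.5' column name, and the sample values below are assumptions for illustration; the project's actual code and hyperparameters are not shown in this report.

```python
import pandas as pd

def to_prophet_frame(df, date_col="From Date", value_col="PM2.5"):
    """Return a two-column frame in the ds/y layout that Prophet expects."""
    out = pd.DataFrame({
        "ds": pd.to_datetime(df[date_col]),
        "y": df[value_col].astype(float),
    })
    return out.sort_values("ds").reset_index(drop=True)

def fit_and_forecast(train, test):
    """Fit Prophet on the training frame and predict at the test timestamps."""
    # Imported lazily so the data preparation runs without the prophet package.
    from prophet import Prophet

    model = Prophet()  # default settings; the tuned values are not given here
    model.fit(train)
    # The returned frame contains 'yhat' together with uncertainty bounds.
    forecast = model.predict(test[["ds"]])
    return model, forecast

# Hypothetical, deliberately unordered sample.
sample = pd.DataFrame({
    "From Date": ["2023-01-01 01:00", "2023-01-01 00:00"],
    "PM2.5": [42, 40],
})
frame = to_prophet_frame(sample)
print(frame)
```

With a fitted model, `model.plot(forecast)` draws the forecast against the observed points.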
We convert the datatype of the 'From Date' column from object to datetime.
The entire dataset is split into Test and Training datasets [we have used the ratio 85%-15% (Training%-Testing%)].
We then train the model.
After this, forecasts are made.
Then the accuracy is checked with the Testing data.
Forecast of PM 2.5 is plotted using the plot function.
Plots below show the forecasts in different scales.
The same thing is repeated for other parameters to forecast their future values.
After this, the subindex of each parameter is calculated using the obtained forecasts and the standard formula (with benchmark values).
Finally, out of all the subindices, the worst (highest) value gives us the Air Quality Index (AQI).
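The last two steps can be sketched as below. The sub-index for a concentration C that falls in a band [BP_lo, BP_hi] mapped to the index range [I_lo, I_hi] is obtained by linear interpolation: I = I_lo + (I_hi - I_lo) * (C - BP_lo) / (BP_hi - BP_lo). The breakpoint tables in this sketch are illustrative, simplified to contiguous bands with an assumed upper bound on the top band; they should be checked against the official CPCB tables before use. The function names are hypothetical.

```python
# Breakpoints as (conc_low, conc_high, index_low, index_high).
# ILLUSTRATIVE values only: contiguous bands and an assumed upper bound
# for the top band. Verify against the official CPCB breakpoint tables.
PM25_BREAKPOINTS = [
    (0, 30, 0, 50),
    (30, 60, 50, 100),
    (60, 90, 100, 200),
    (90, 120, 200, 300),
    (120, 250, 300, 400),
    (250, 380, 400, 500),
]
PM10_BREAKPOINTS = [
    (0, 50, 0, 50),
    (50, 100, 50, 100),
    (100, 250, 100, 200),
    (250, 350, 200, 300),
    (350, 430, 300, 400),
    (430, 510, 400, 500),
]

def sub_index(conc, breakpoints):
    """Linearly interpolate a concentration into its sub-index."""
    conc = max(float(conc), 0.0)  # forecasts can dip below zero; clamp them
    for bp_lo, bp_hi, i_lo, i_hi in breakpoints:
        if bp_lo <= conc <= bp_hi:
            return i_lo + (i_hi - i_lo) * (conc - bp_lo) / (bp_hi - bp_lo)
    return float(breakpoints[-1][3])  # above the top band: cap at the maximum

def aqi(sub_indices):
    """Return the dominant pollutant and the AQI (the worst sub-index)."""
    pollutant = max(sub_indices, key=sub_indices.get)
    return pollutant, sub_indices[pollutant]

# Hypothetical forecast concentrations in micrograms per cubic metre.
indices = {
    "PM2.5": sub_index(45, PM25_BREAKPOINTS),
    "PM10": sub_index(120, PM10_BREAKPOINTS),
}
dominant, value = aqi(indices)
print(dominant, round(value, 1))
```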
SARIMA, which stands for Seasonal AutoRegressive Integrated Moving Average, is an extension of the ARIMA model specifically designed for time series data with seasonal patterns. SARIMA explicitly considers seasonal effects in the data, allowing for more accurate forecasts when there are predictable patterns across specific periods. SARIMA provides a more comprehensive approach to time series forecasting by combining the strengths of ARIMA for general trends with the ability to capture predictable seasonal variations in the data.
Training and fitting the model with the same Training set used in FB Prophet.
Predicting and plotting the result.
Then the accuracy is checked with the Testing data.
The same thing is repeated for other constituents to evaluate other parameters.
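The report does not state which accuracy measure was used. As an assumption, the sketch below uses two common choices for forecast evaluation, root mean squared error (RMSE) and mean absolute error (MAE), computed between the held-out Testing values and the model's predictions. The sample numbers are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical Testing values and model predictions.
actual = [40.0, 42.0, 45.0, 43.0]
predicted = [41.0, 40.0, 45.0, 47.0]
print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

Both measures are in the units of the forecast variable, which makes them easy to compare across the models in this report.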
Support Vector Regression (SVR) is a machine learning algorithm used for regression tasks. Unlike traditional regression models that aim to minimize the overall error between predicted and actual values, SVR fits a function (a hyperplane in the feature space) such that as many data points as possible lie within a specified margin of it; errors smaller than the margin are ignored. The fitted function is then used to predict continuous output values.
First, we convert the data type of the 'From Date' column from object to string.
Then we split the datetime in 'From Date' into two columns, 'Date' and 'Time'.
Then we break 'Date' and 'Time' into further columns, i.e. 'Date' into 'Year', 'Month', 'Date' and 'Time' into 'Hour', 'Minute', 'Second'.
We then split the entire dataset into Training and Testing datasets.
This is followed by training the model and making predictions of the constituent concentration.
Then the accuracy is checked with the Testing data.
The same thing is repeated for other constituents to evaluate other parameters.
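The date-splitting steps above can be sketched as follows, assuming timestamps in the form "YYYY-MM-DD HH:MM:SS" (the actual format of the CPCB export may differ). The helper name `add_datetime_parts` is hypothetical, and the day-of-month column is named 'Day' here only to avoid clashing with the intermediate 'Date' column.

```python
import pandas as pd

def add_datetime_parts(df, date_col="From Date"):
    """Split a timestamp column into numeric Year..Second feature columns."""
    out = df.copy()
    # Cast to string, then split "YYYY-MM-DD HH:MM:SS" at the space.
    stamp = out[date_col].astype(str)
    out[["Date", "Time"]] = stamp.str.split(" ", n=1, expand=True)
    # Break the date and the time into their integer components.
    out[["Year", "Month", "Day"]] = (
        out["Date"].str.split("-", expand=True).astype(int)
    )
    out[["Hour", "Minute", "Second"]] = (
        out["Time"].str.split(":", expand=True).astype(int)
    )
    # Keep only the numeric features, which a regressor can consume.
    return out.drop(columns=[date_col, "Date", "Time"])

# Hypothetical sample rows.
sample = pd.DataFrame({
    "From Date": ["2023-01-05 14:30:00", "2023-11-20 00:00:00"],
    "PM2.5": [41.0, 88.0],
})
parts = add_datetime_parts(sample)
print(parts)
```

The resulting numeric columns can serve as the features for the regressor, with the constituent concentration as the target.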
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) architecture specifically designed to overcome limitations in traditional RNNs. RNNs struggle with capturing long-term dependencies in sequential data due to the vanishing gradient problem. LSTMs address this by introducing a memory cell that can store information for longer periods and selectively update it over time.
Converting the data type of the 'From Date' column from object to datetime and then into its ordinal representation
Defining the independent and dependent variables
Normalizing the dependent variable
Splitting the entire dataset into Training and Testing dataset
Preparing, Building, Compiling and Training the model
Evaluating the model and Plotting the result
The same thing is repeated for other constituents to evaluate other parameters.
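The data-preparation steps above might look roughly like the sketch below. The ordinal conversion and min-max normalization follow the listed steps; the sliding-window helper is an assumption about what "preparing" involves (LSTM layers typically expect input shaped as samples, time steps, features) and is not taken from the report. All function names are hypothetical.

```python
import numpy as np
import pandas as pd

def to_ordinal(dates):
    """Convert datetimes to proleptic Gregorian ordinals (whole days)."""
    # Note: an ordinal drops the time of day, so hourly rows on the same
    # date share one value.
    return pd.to_datetime(dates).map(pd.Timestamp.toordinal)

def min_max_scale(values):
    """Scale values to [0, 1]; also return the bounds to undo the scaling."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return (values - lo) / (hi - lo), lo, hi

def make_windows(series, lookback):
    """Pair each run of `lookback` values with the value that follows it.

    ASSUMPTION: windowing is a common LSTM preparation step; the report
    does not say how the model inputs were built.
    """
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + lookback]
                  for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X.reshape(-1, lookback, 1), y

# Hypothetical sample values.
dates = pd.Series(["2023-01-01 00:00", "2023-01-01 05:00", "2023-01-02 00:00"])
ordinals = to_ordinal(dates)
scaled, lo, hi = min_max_scale([10.0, 20.0, 30.0])
X, y = make_windows(np.arange(5.0), lookback=2)
print(ordinals.tolist(), scaled, X.shape, y)
```

Predictions made on the scaled variable can be mapped back to concentrations with `pred * (hi - lo) + lo`.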
Random Forests are a supervised learning technique that falls under the category of ensemble learning. They are widely used for both classification and regression tasks. They work by combining the predictions of multiple decision trees, resulting in a more robust and accurate overall model.
Converting the data type of the 'From Date' column from object to datetime
Splitting the entire dataset into Training and Testing datasets on the basis of time (time-based evaluation)
Defining the various features and the target variable
Building and Evaluating the model, Predicting the concentration of PM 2.5
Then the accuracy is checked with the Testing data
Plotting the Result
The same thing is repeated for other constituents to evaluate other parameters.
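A time-based split, as used in the steps above, keeps the earliest rows for training and the latest for testing so that the model is never evaluated on data older than what it was trained on. The sketch below assumes the 85%-15% ratio reported for FB Prophet also applies here; the helper name `time_based_split` and the sample data are hypothetical.

```python
import pandas as pd

def time_based_split(df, date_col="From Date", train_frac=0.85):
    """Sort by time, then cut once: earlier rows train, later rows test."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Hypothetical hourly sample, deliberately stored in reverse order.
hours = pd.date_range("2023-01-01", periods=20, freq=pd.Timedelta(hours=1))
data = pd.DataFrame({"From Date": hours[::-1], "PM2.5": range(20)})

train, test = time_based_split(data)
print(len(train), len(test))
print(train["From Date"].max(), "<", test["From Date"].min())
```

Unlike a random shuffle, this split avoids leaking future observations into the Training set.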
This project successfully forecasted the AQI of a specific region using different ML models. Evaluation measures showed that FB Prophet provided the best results after its hyperparameters were tuned several times. The forecast graphs provide a visual representation of how the AQI varies over a prolonged period. The findings of this project contribute to a better understanding of pollution and air quality, and support informed decision-making in pollution control and air quality management. They also enable the study of the effects of different concentrations of atmospheric substances and pollutants on buildings, materials, and the environment as a whole.
Based on the project's outcomes, the following recommendations are proposed:
Literature:
Prediction of Air Quality Index Using Machine Learning Techniques: A Comparative Analysis
Air Quality Index prediction using machine learning for Ahmedabad city
Optimized machine learning model for air quality index prediction in major cities in India
Other sites:
World's Air Pollution: Real-time Air Quality Index
Information on Air Quality Index and method to calculate it
Documentation on implementation of FB Prophet model in Python
Report prepared on May 6, 2024, 10:36 p.m. by:
Report reviewed and approved by Nikesh Shetty [Piston] on May 10, 2024, 7:02 a.m.