We aim to transform Tetris gameplay by allowing players to control the game with hand gestures detected through computer vision, integrated with a web application built using React.js.
Object detection in deep learning refers to the task of identifying and localizing objects within an image or video frame. It involves using neural networks to analyze the visual content and outputting bounding boxes around objects along with their corresponding class labels. The process begins with an input image or video frame, which is then passed through a convolutional neural network (CNN) for feature extraction.
The CNN extracts features from the input image at different levels of abstraction, capturing essential information such as edges, textures, and shapes. These features are then used to predict the presence of objects within the image. Object detection involves not only identifying objects but also precisely localizing them by predicting bounding boxes. These bounding boxes represent the predicted locations and sizes of the detected objects. Along with localization, the neural network also performs object classification. It assigns a class label to each detected object, indicating the type of object it belongs to (e.g., car, person, cat). This is typically achieved by predicting the probability distribution over a predefined set of classes.
Since multiple bounding boxes may overlap or cover the same object, a technique called non-maximum suppression is applied to eliminate redundant detections. This process ensures that only the most confident bounding box for each object is retained while discarding the rest. The final output of the object detection process consists of a set of bounding boxes, each associated with a class label and a confidence score. Object detection in deep learning has seen significant advancements thanks to algorithms and architectures such as YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), and Faster R-CNN (Region-based Convolutional Neural Network). These advancements have made object detection more accurate and efficient, enabling a wide range of real-world applications across various industries.
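To make the pipeline concrete, here is a minimal sketch (not part of our project code) that runs a pretrained Faster R-CNN from torchvision, one of the architectures mentioned above, on a single image and prints the confident detections; the image path is a placeholder.

```python
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# Load a detector pretrained on COCO; weights="DEFAULT" selects the released weights.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# "example.jpg" is a placeholder path for any test image.
image = convert_image_dtype(read_image("example.jpg"), torch.float)

with torch.no_grad():
    # The model returns one dict per input image: boxes, class labels, scores.
    output = model([image])[0]

# Keep only confident detections; torchvision applies non-maximum suppression internally.
for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
    if score > 0.5:
        print(f"class {label.item()}  score {score:.2f}  box {box.tolist()}")
```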
For our project, we will be using the YOLO algorithm for hand gesture recognition to play Tetris hands-free.
You Only Look Once (YOLO) is a state-of-the-art, real-time object detection system. It is so fast that it has become a standard way of detecting objects in the field of computer vision. The algorithm was introduced in 2015 by Joseph Redmon. Since it came out, it has surpassed approaches such as sliding-window object detection, R-CNN, Fast R-CNN, Faster R-CNN, etc. Prior detection systems repurpose classifiers or localizers to perform detection: they apply the model to an image at multiple locations and scales, and high-scoring regions of the image are considered detections. YOLO takes a different approach. It applies a single neural network to the full image; this network divides the image into regions and predicts bounding boxes and probabilities for each region, and the bounding boxes are weighted by the predicted probabilities.
The algorithm is based on the following four concepts:
Residual Blocks:
This first step divides the original image into NxN grid cells of equal size. Each cell in the grid is responsible for localizing and predicting the class of the object that it covers, along with a probability/confidence value.
Bounding Box Regression:
Bounding box regression is a technique used in object detection tasks to predict the coordinates of a bounding box that tightly encloses an object of interest within an image. YOLO determines the attributes of these bounding boxes using a single regression module in the following format, where Y is the final vector representation for each bounding box.
Y = [pc, bx, by, bh, bw, c1, c2]
Here pc is the probability/confidence that the cell contains an object, bx and by are the coordinates of the box center, bh and bw are its height and width, and c1 and c2 are the class scores. This representation is especially important during the training phase of the model.
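To make the layout concrete, here is a small sketch that builds this target vector for a toy two-class case; the grid size, class id, and box values are made up for illustration, and different YOLO versions normalize the box coordinates slightly differently.

```python
import numpy as np

def make_target(box, class_id, grid_size=7, num_classes=2):
    """Build the Y = [pc, bx, by, bh, bw, c1, c2] target for the grid cell
    responsible for a box given as normalized (x_center, y_center, h, w)."""
    x, y, h, w = box

    # The cell containing the box center is responsible for this object.
    col = int(x * grid_size)
    row = int(y * grid_size)

    y_vec = np.zeros(5 + num_classes)
    y_vec[0] = 1.0             # pc: an object is present in this cell
    y_vec[1:5] = [x, y, h, w]  # bx, by, bh, bw
    y_vec[5 + class_id] = 1.0  # one-hot class scores c1, c2

    return (row, col), y_vec

# A box of class 1 centered at (0.62, 0.40) in a 7x7 grid.
cell, y = make_target((0.62, 0.40, 0.20, 0.15), class_id=1)
print(cell, y)  # (2, 4) [1.  0.62 0.4  0.2  0.15 0.  1. ]
```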
Intersection over Union (IOU):
Intersection over Union, commonly referred to as IOU, is a metric used to evaluate the overlap between two bounding boxes or regions of interest. It quantifies the agreement between the predicted bounding box and the ground-truth bounding box. IOU is calculated as the ratio of the intersection area to the union area of the two bounding boxes. It is often used as a criterion for evaluating the performance of object detection algorithms, where a higher IOU indicates better detection accuracy.
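A minimal IOU computation for two axis-aligned boxes in corner (x1, y1, x2, y2) format looks like this:

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0

# Two heavily overlapping boxes -> IOU close to 1.
print(iou((10, 10, 50, 50), (12, 12, 52, 52)))  # ~0.82
```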
Non-Max Suppression (NMS):
Setting a threshold for the IOU is not always enough, because an object can have multiple boxes with IOU beyond the threshold, and keeping all of those boxes would introduce noise. This is where NMS is used: it keeps only the box with the highest detection probability and suppresses the overlapping ones.
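A simple sketch of greedy NMS over scored boxes (self-contained, with its own small IOU helper):

```python
def iou(a, b):
    """IOU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, iou_threshold=0.5):
    """detections: list of (box, score). Keep the highest-scoring box and
    suppress any remaining box that overlaps it beyond the threshold."""
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while detections:
        best = detections.pop(0)
        kept.append(best)
        detections = [d for d in detections if iou(best[0], d[0]) < iou_threshold]
    return kept

# Two overlapping boxes on the same hand plus one elsewhere:
# the lower-scoring overlapping box is suppressed.
dets = [((10, 10, 50, 50), 0.9), ((12, 12, 52, 52), 0.75), ((200, 200, 240, 240), 0.8)]
print(non_max_suppression(dets))
```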
Architecture:
This architecture takes an image as input and resizes it to 448x448 while keeping the aspect ratio the same and applying padding. The image is then passed through the CNN. The model has 24 convolutional layers and 4 max-pooling layers, followed by 2 fully connected layers. To reduce the number of channels, 1x1 convolutions are used, each followed by a 3x3 convolution. Notice that the last layer of YOLOv1 predicts a cuboidal output: the final fully connected layer produces a (1, 1470) vector, which is reshaped to (7, 7, 30). The architecture uses Leaky ReLU as the activation function throughout, except in the last layer, which uses a linear activation. Batch normalization helps regularize the model, and dropout is used to prevent overfitting.
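The 1470 figure follows from the original YOLOv1 settings for PASCAL VOC (S = 7 grid cells per side, B = 2 boxes per cell, C = 20 classes); a quick check:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (YOLOv1 on PASCAL VOC)

# Each cell predicts B boxes (5 values each: x, y, w, h, confidence) plus C class scores.
per_cell = B * 5 + C       # 30
flat = S * S * per_cell    # 1470 values from the final fully connected layer

output = np.zeros((1, flat)).reshape(S, S, per_cell)
print(flat, output.shape)  # 1470 (7, 7, 30)
```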
HaGRID Dataset:
HaGRID (HAnd Gesture Recognition Image Dataset) is a large image dataset for hand gesture recognition (HGR) systems. It can be used for image classification or object detection tasks. The dataset makes it possible to build HGR systems for video conferencing services (Zoom, Skype, Discord, Jazz, etc.), home automation systems, the automotive sector, and more. It contains 37,583 unique persons and at least as many unique scenes. The subjects are people from 18 to 65 years old. The dataset was collected mainly indoors, with considerable variation in lighting, including artificial and natural light. It also includes images taken in extreme conditions, such as subjects facing toward or away from a window. The subjects showed gestures at a distance of 0.5 to 4 meters from the camera.
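To train a detector, the HaGRID annotations have to be converted to the label format the detector expects. The sketch below assumes the normalized COCO-style [top-left x, top-left y, width, height] box layout described in the HaGRID repository and writes YOLO-style label lines; the class list, file names, and JSON layout here are assumptions for illustration.

```python
import json

# Placeholder subset of gesture class names; paths below are placeholders too.
CLASSES = ["palm", "fist", "like", "stop"]

def to_yolo_line(class_name, bbox):
    """Convert a normalized [top-left x, top-left y, width, height] box
    to a YOLO label line: class x_center y_center width height."""
    x, y, w, h = bbox
    x_center, y_center = x + w / 2, y + h / 2
    class_id = CLASSES.index(class_name)
    return f"{class_id} {x_center:.6f} {y_center:.6f} {w:.6f} {h:.6f}"

# "palm.json" stands in for a per-gesture annotation file, assumed to map
# image ids to {"bboxes": [...], "labels": [...]}.
with open("palm.json") as f:
    annotations = json.load(f)

for image_id, ann in annotations.items():
    lines = [to_yolo_line(lbl, box) for lbl, box in zip(ann["labels"], ann["bboxes"])]
    with open(f"{image_id}.txt", "w") as out:
        out.write("\n".join(lines))
```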
Implementation:
We trained different YOLO versions with various parameter sizes. Using Ultralytics, we trained a YOLOv5 model on the HaGRID dataset, which gave us 98.7% precision, 97.2% recall, and a mAP@50 (mean average precision at an IoU threshold of 0.5, a standard metric for object detection) of 98.6. We resized the images in the HaGRID dataset to 416 pixels, with around 1,700 images per class. With a batch size of 16, we trained a small model (7.2M parameters) for 25 epochs and a medium model (21.2M parameters) for 50 epochs. The models were pre-trained for around 300 epochs on COCO.
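For inference, the trained weights can be loaded through PyTorch Hub and run on webcam frames; the sketch below is a minimal version of such a loop (the "best.pt" weights path is a placeholder), not our exact inference code.

```python
import cv2
import torch

# Load the custom-trained YOLOv5 weights via PyTorch Hub ("best.pt" is a placeholder path).
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
model.conf = 0.5  # confidence threshold for detections

cap = cv2.VideoCapture(0)  # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    # YOLOv5 expects RGB images; OpenCV captures BGR.
    results = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    detections = results.pandas().xyxy[0]  # xmin, ymin, xmax, ymax, confidence, class, name

    if len(detections):
        best = detections.loc[detections["confidence"].idxmax()]
        print(best["name"], round(float(best["confidence"]), 2))

    annotated = results.render()[0]  # RGB frame with boxes drawn
    cv2.imshow("gestures", cv2.cvtColor(annotated, cv2.COLOR_RGB2BGR))
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```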
We deployed a web app built with React.js, in which we implemented the Tetris game, including the Tetris board, a webcam feed to capture gestures, and the tetrominoes, using hooks, state, and React functional components.
These are the Game KeyBinds:
Enter: Start game (only on the home screen)
ArrowUp: Rotate tetromino clockwise
ArrowDown: Slow-drop tetromino by 1
ArrowLeft: Move tetromino horizontally left by 1
ArrowRight: Move tetromino horizontally right by 1
KeyQ: Quit game
KeyP: Pause game
Space: Slow-drop tetromino to 'ghost' position
We integrated the trained YOLOv5 model with the React game, replacing the game keys with hand gestures to make the game hands-free. After the integration with the YOLOv5 model, we decided to change the game binds to the following:
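As a rough, hypothetical illustration of how detected gestures can be translated into the key actions listed above: the gesture names and the pyautogui-based keypress mechanism below are assumptions for the sketch, not necessarily the exact bindings or integration path we used.

```python
import pyautogui

# Hypothetical mapping from detected gesture class names to the game's keybinds;
# the actual gestures and bindings used in our game differ.
GESTURE_TO_KEY = {
    "palm": "up",      # rotate tetromino
    "fist": "down",    # slow-drop by 1
    "one": "left",     # move left
    "peace": "right",  # move right
    "stop": "p",       # pause
}

def send_gesture(gesture_name, confidence, threshold=0.7):
    """Forward a confidently detected gesture to the game as a keypress."""
    if confidence >= threshold and gesture_name in GESTURE_TO_KEY:
        pyautogui.press(GESTURE_TO_KEY[gesture_name])

# Example: a "peace" gesture detected with 0.91 confidence moves the piece right.
send_gesture("peace", 0.91)
```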
These are some images of our work:
Through a series of experiments, we found that YOLOv5 performed very well at capturing hand gestures in real time. The integration of React.js with YOLOv5 in a Tetris game marks a significant advancement in interactive gaming technology. This approach enables players to manipulate Tetris blocks using hand gestures, fostering a more immersive and intuitive gameplay experience. Through seamless gesture detection and real-time responsiveness, the game achieves a harmonious blend of classic gameplay mechanics and modern computer vision.
Google Meet:
To join the video meeting, click this link: https://meet.google.com/wwp-huvw-dbb
All the files and source code can be found here.
As executive members of IEEE NITK, we are incredibly grateful for the opportunity to learn and work on this project under the prestigious name of the IEEE NITK Student Chapter. We want to extend our heartfelt thanks to IEEE for providing us with the support and guidance we needed to complete this project successfully.
Report prepared on May 9, 2024, 12:59 a.m. by:
Report reviewed and approved by Aditya Pandia [CompSoc] on May 9, 2024, 2:12 a.m.