Steganography is the art of hiding messages so that no person can find out about the presence of a message. Compared to cryptography, the practice of protecting the contents of a message alone, steganography is concerned with concealing the fact that a secret message is being sent and its contents. The project aimed to make a command line steganography tool for Linux-based operating systems that is capable of encrypting and hiding messages/files within other files (images, audio files) using C++.
The main components of the project include LZW Compression and Decompression, Steganography in WAV files and PNG files, and the AES Encryption Scheme, which have been discussed in this section.
LZW compression is widely used in formats like GIF, PDF, and TIFF for its ability to save space by identifying and encoding repeated patterns. It works by grouping symbols into strings and converting them into codes, which take up less space than the original strings. A code table is used, typically with 4096 entries. Codes 0-255 represent single bytes, while codes 256-4095 represent sequences of bytes. As encoding progresses, LZW adds repeated sequences to the code table. Decoding involves translating each code to find the corresponding character or characters. This method extends the library beyond the typical 8-bit ASCII characters, allowing for up to 12 bits per character. While it may not compress short, diverse strings efficiently, it excels at compressing redundant data. Additionally, LZW can compress and decompress data without saving the dictionary separately.
During LZW decompression, the decompressor reconstructs the same string table that was used during compression. It initialises the first 256 table entries to single characters. As it processes each character in the input stream, except the first one, it updates the string table accordingly. Decoding occurs by reading codes from the compressed data and translating them through the evolving code table. This process ensures that the original data is faithfully reconstructed during decompression, allowing for lossless recovery of the compressed information.
AES-128, also known as the Rijndael Cipher, was introduced as a replacement for the earlier-used Data Encryption Standard (DES), which became vulnerable to brute force attacks due to advancements in computational power. Recognising that 56-bit keys were increasingly insecure, NIST organised a competition in which the Rijndael cipher was selected as the winner. To date, AES remains one of the most secure encryption algorithms available, even in the face of potential quantum computer threats. This is due to its symmetric architecture. According to NIST, given certain mitigating factors, it is highly probable that Grover's algorithm will offer minimal or no benefit in compromising AES. Consequently, AES-128 is expected to maintain its security for many years ahead.
WAV files consist of multiple segments of data, each with a specific function. The primary segment for steganography within WAV files is the audio data segment, which contains the actual sound waveform. Modifying this segment enables the concealment of information within the audio data without perceptibly altering the audible characteristics of the sound.
Hiding information in WAV files involves altering the least significant bits of the audio samples to encode the hidden message.
PNG files comprise multiple chunks of data, each serving a specific purpose. The critical chunk for steganography within PNG files is the IDAT (Image Data) chunk, which contains the actual pixel information that forms the image. Manipulating the IDAT chunk allows for the concealment of information within the image data without noticeably altering the visual appearance of the image.
The tool was implemented as an integration of 8 main subsections, i.e., Compression and Decompression of the message, Encryption and Decryption of the messages, Hiding and Extraction of the message from both wav files and png files, each of which are explained below.
The data from messageBuffer is obtained by reading fixed 12 bits. The idea of the compression algorithm is as follows: as the input data is being processed, a dictionary keeps a correspondence between the longest words encountered and a list of code values. The words are replaced by their corresponding codes and so the input file is compressed. Therefore, the algorithm's efficiency increases as the number of long, repetitive words in the input data increases.
The code takes encoded data from messageBuffer and turns it back into the original message using a dictionary named dictionary. It starts by reading numbers that represent parts of the message and storing them in a vector named codes_of_dictionary. These codes are read from messageBuffer by reading 12 fixed bits. These numbers are like codes that stand for different parts of the message. Then, it looks up each code in the dictionary to find the corresponding part of the message. As it goes through the data, it keeps adding new parts to the dictionary to understand the message better. After decoding, it puts the original message back into messageBuffer, giving us the uncompressed message.
The AES encryption segment was implemented by writing functions for each step involved, which were essentially helper functions to the main AES_encrypt function. Within the main AES_encrypt function, each of these functions was called sequentially as per the algorithm.
The plaintext was first obtained from the message buffer and then padded using the PKCS#7 padding scheme. The resulting padded plaintext was broken into blocks of 16 bytes each and stored in a 2D array. Each member of this array is a message block of 16 bytes, and each of these blocks is made to undergo encryption.
The key expansion step is also an elaborate process. First, a 44 * 4 array was created, and the key given as input was filled into this array, i.e. the first four rows were filled with the given key.
This key is made to undergo operations that create the remaining 10 keys. Finally, once all 44 rows are filled, the array is flattened to a 1D array of 176 elements and an 11 * 16 array is created and filled with the elements from the 1D array, thereby giving an array of 11 keys.
After this step, the remaining operations are carried out as per the encryption algorithm, and the resulting ciphertext is pushed back into the message buffer.
The AES-128 Decryption algorithm has been implemented with a Decryptor class containing all the functions and instance variables meant for storing data. Each instance of this Decryptor encapsulates all the information and functions required for decryption. Every function implemented is a utility function called in the specific order dictated by the AES-128 Decryption algorithm.
Initially, the message is read through the message_buffer and divided into blocks of 16 bytes. These blocks are then decrypted individually using the Cipher Block Chaining method. The final block is depadded of the PKCS#7 padding applied during encryption. Subsequently, the decrypted text is fed back into the message_buffer for further use by other programs.
Many steps of the decryption process closely resemble those of encryption. The Key expansion step remains identical to the encryption process. The remaining functions effectively achieve the exact inverse of their encryption counterparts. During key expansion, an additional 10 keys (each of 4 words where 1 word is equivalent to 4 bytes) are generated. These words are stored in a 2D array and recalled using a utility function. These keys are crucial during decryption and are fed into the algorithm in reverse order (Key 10 to Key 0).
The remaining functions are called sequentially according to the decryption algorithm, with each decrypted block added to the message buffer as well.
The algorithm for reading a message from a WAV file was implemented by iterating through the input buffer until the data subchunk was reached. Initially, the 12 bytes representing the WAV header were skipped. Then, 8 bytes were read to identify the type and size of each subchunk. If the type was not "data," the algorithm skipped ahead by the size of the subchunk. Upon reaching the data subchunk, the first data byte was read, and encryption and compression statuses were determined based on the least significant bits. The size of the message was then extracted from the subsequent 16 bytes. For each set of 4 input bytes, the algorithm isolated the last two bits from each byte and constructed the message byte accordingly, shifting it left by 2 bits after each extraction. This process was repeated sz times to obtain the complete message. Finally, each message byte was stored in the message buffer.
The code for hiding a message in WAV files is implemented by initializing a pointer variable to keep track of the current position within the input data. It appends the first 12 characters of the input buffer to the output buffer. It then enters a while loop that continues until the pointer reaches the end of the input buffer. Within each iteration, it reads the subchunk_id and subchunk_size from the input buffer. If the subchunk_id is not "data", it copies it along with its size and content directly to the output buffer. If the subchunk_id is "data" (indicating the data subchunk), it proceeds with embedding the message into this subchunk. It first writes "data" to the output buffer followed by the modified subchunk_size. It then sets the first_byte based on encryption and compression flags. Next, it calculates the size of the message to be embedded and writes it to the output buffer. Then it iterates over each byte of the message, extracting two bits at a time and embedding them into the least significant bits of each byte of the audio file. After processing each subchunk, the pointer is updated to move to the next subchunk.
A separate section of the code is designed to extract a hidden message from a PNG file. The first encryption bit and compression bit are initialised to 0 to store information about the encryption and compression of the extracted message. Once that is done, the function iterates through the PNG file, which is represented as a string (input_str) starting from index 8, which typically indicates the beginning of the PNG data chunks. Then, each chunk's size is extracted from the PNG file. If the chunk is of type "IDAT" (image data), the function proceeds to extract the hidden message. Once that is done, the function extracts the size of the hidden message encoded within the PNG file. This size is stored in messageSize. Then, using nested loops, the function decodes each byte of the hidden message from the image data. Within the inner loop, it extracts 2 bits from each byte of the image data and assembles them into a message byte. Each decoded message byte is stored in a buffer (message_buffer) for later retrieval. Once the entire hidden message is extracted, the loop breaks and the function exits.
The code for hiding a message in a given PNG file is implemented by maintaining a pointer in the input buffer to keep track of the data read till that point. The first 8 bytes (PNG header) are written directly to the output buffer. The following 8 bytes, which signify the size and type of the chunk, are read. If the type is not of the "IDAT" type, the entire chunk is written to the output buffer without any modifications. If the chunk is of type "IDAT", then the data is hidden in the data portion of the chunk. The 2 least significant bits of the first byte are modified to indicate whether the message has been encrypted or compressed respectively. The following 16 bytes of the data section are hidden with the size of the message which is being hidden. Then, using a loop to iterate over the entire message or the data size, 1 byte of the message buffer is hidden in 4 input bytes. Once this is done, the CRC of the file is updated to match the new contents of the file, and the function exits once this is completed.
The source codes for the Command-Line Interface tool can be found here.
This project provided a great opportunity for the mentees to learn about steganography, AES (a popular encryption scheme), compression and decompression of files and hiding and extracting contents from files. The CLI tool made during the project encrypts and hides the user-given message in the audio/image file provided by the user. The tool can also extract and decrypt the message hidden in any given audio/image file.
Possible future work in this area can be applying the same techniques to file types other than .wav and .png files, and trying different encryption schemes for protecting the data.
Report prepared on May 9, 2024, 11:24 p.m. by:
Report reviewed and approved by Aditya Pandia [CompSoc] on None.