Welcome to week 7! This week we will be implementing a configurable compression scheme that adapts to the characteristics of its input. This project builds upon the hard work you did last week constructing and operating on binary search trees. The scheme assigns a unique code to each word in a text file according to the frequency with which that word appears: more frequent words receive shorter codes, and less frequent words receive longer codes. The goal of such a scheme is twofold: to produce an encoding that allows encoded files to be decoded unambiguously, and to minimize the total size of the compressed file.
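As a rough illustration, with made-up numbers: suppose a file contains the word "the" 500 times and the word "aardvark" twice. Giving "the" a 2-bit code and "aardvark" a 10-bit code costs 500 × 2 + 2 × 10 = 1,020 bits for those occurrences, whereas a fixed 10-bit code for every word would cost 502 × 10 = 5,020 bits for the same occurrences.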
The homework will span the next two weeks and consists of two main parts. The first part processes the text files into a binary search tree sorted by words (the focus of last week) and converts that tree into another tree sorted by frequency. The second part implements an encoding scheme that reflects the goals above, uses that scheme to encode text files, and builds a decoder so that encoded files can be decoded accurately. The process is described at a high level below.
In order to create a configurable encoding scheme, we need some information about the frequency with which words appear in our text files. Luckily, we built a mechanism last week that enables us to do just that! First, building on the binary search tree we constructed last time, we will write an iterator that can pull items out of the tree. Each item pulled from the tree will then be inserted into a new binary search tree sorted by frequency (count). Once that is done, we will use an iterator to print out every word and its associated frequency from our text file, sorted in non-ascending order. At that point we have a binary search tree sorted appropriately, and we can begin encoding!
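As a rough sketch of this step, the fragment below builds an in-order iterator over a word-sorted tree using an explicit stack, then re-inserts each (word, count) pair into a second tree keyed by count. The struct layout, function names, and fixed stack depth here are all assumptions made for illustration; the tree you built last week will have its own interface.

    #include <stdlib.h>

    /* Hypothetical node layout; the assignment's real struct will differ. */
    typedef struct node {
        char *word;
        int count;
        struct node *left, *right;
    } node;

    /* A minimal in-order iterator built on an explicit stack of pending nodes.
     * Assumes the tree depth never exceeds 64. */
    typedef struct {
        node *stack[64];
        int top;
    } bst_iter;

    static void iter_push_left(bst_iter *it, node *n) {
        for (; n != NULL; n = n->left)
            it->stack[it->top++] = n;
    }

    static void iter_init(bst_iter *it, node *root) {
        it->top = 0;
        iter_push_left(it, root);
    }

    /* Return the next node in word order, or NULL when the tree is exhausted. */
    static node *iter_next(bst_iter *it) {
        if (it->top == 0)
            return NULL;
        node *n = it->stack[--it->top];
        iter_push_left(it, n->right);
        return n;
    }

    /* Insert into a second tree keyed by count instead of by word.
     * The word pointer is shared with the original tree rather than copied. */
    static node *insert_by_count(node *root, char *word, int count) {
        if (root == NULL) {
            node *n = malloc(sizeof *n);
            n->word = word;
            n->count = count;
            n->left = n->right = NULL;
            return n;
        }
        if (count < root->count)
            root->left = insert_by_count(root->left, word, count);
        else
            root->right = insert_by_count(root->right, word, count);
        return root;
    }

    /* Walk the word-sorted tree and build a second tree sorted by frequency. */
    node *rebuild_by_count(node *word_tree) {
        bst_iter it;
        node *count_tree = NULL;
        iter_init(&it, word_tree);
        for (node *n = iter_next(&it); n != NULL; n = iter_next(&it))
            count_tree = insert_by_count(count_tree, n->word, n->count);
        return count_tree;
    }

An in-order walk of the resulting count-keyed tree (or a reverse in-order walk, for non-ascending output) then yields the frequency-sorted listing described above.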
The next phase is the implementation of an encoding scheme that uses this word-frequency information. To do this, we will rely on a priority queue implemented as a min-heap, so the first two tasks are writing the appropriate heap functions. Next, we will build a min-heap from the frequency information and, operating on that heap, derive a unique code for each word based on how frequently it occurs. We will then replace each word in our text file with its corresponding code to create the compressed file. Building an appropriate decoder is the last step, and with that we have a complete, dynamic compression technique!
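To make the heap operations concrete, below is a minimal, self-contained min-heap keyed on word count, with the two operations the later steps depend on: insert (sift-up) and extract-min (sift-down). The entry fields, array capacity, and function names are placeholders rather than the assignment's required interface.

    #include <stdio.h>

    /* Hypothetical heap entry; field names are placeholders. */
    typedef struct {
        const char *word;
        int count;
    } entry;

    /* Fixed-capacity array heap; assumes at most 256 distinct words. */
    typedef struct {
        entry items[256];
        int size;
    } min_heap;

    static void swap_entries(entry *a, entry *b) {
        entry t = *a;
        *a = *b;
        *b = t;
    }

    /* Insert an entry, then sift it up until its parent is no larger. */
    void heap_push(min_heap *h, entry e) {
        int i = h->size++;
        h->items[i] = e;
        while (i > 0 && h->items[(i - 1) / 2].count > h->items[i].count) {
            swap_entries(&h->items[(i - 1) / 2], &h->items[i]);
            i = (i - 1) / 2;
        }
    }

    /* Remove and return the lowest-count entry; caller ensures size > 0. */
    entry heap_pop(min_heap *h) {
        entry min = h->items[0];
        h->items[0] = h->items[--h->size];
        int i = 0;
        for (;;) {
            int l = 2 * i + 1, r = 2 * i + 2, smallest = i;
            if (l < h->size && h->items[l].count < h->items[smallest].count)
                smallest = l;
            if (r < h->size && h->items[r].count < h->items[smallest].count)
                smallest = r;
            if (smallest == i)
                break;
            swap_entries(&h->items[i], &h->items[smallest]);
            i = smallest;
        }
        return min;
    }

    int main(void) {
        min_heap h = { .size = 0 };
        heap_push(&h, (entry){ "the", 42 });
        heap_push(&h, (entry){ "aardvark", 2 });
        heap_push(&h, (entry){ "cat", 7 });
        /* Entries come back in increasing order of count: aardvark, cat, the. */
        while (h.size > 0) {
            entry e = heap_pop(&h);
            printf("%s (%d)\n", e.word, e.count);
        }
        return 0;
    }

Because the heap always hands back the least-frequent remaining word first, it is a natural building block for assigning the longest codes to the rarest words and the shortest codes to the most common ones.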