Start the Lab | Math Alive | Welcome Page |
In this problem set we will illustrate two compression schemes. The first one is (a very simple version of) the Lempel-Ziv algorithm. This is a scheme that can be used to compress any type of file (text or other), and it can be implemented in many ways, and using many different alphabets. We shall work with normal text and learn about a naive LZ algorithm. We start with some exercises to understand how it works.
Problem 1. Lempel-Ziv Parsing.
The first step in the algorithm is the parsing process.
The parsing is explained on the Lempel-Ziv
parsing webpage, with a worked example, followed
by an example for you to work through where you
get an explanation if you have any problems.
Can you parse the following sentence?
can_you_can_a_can_as_a_canner_can_can_a_can?
We have replaced the spaces in the sentence with the "_" character to make it easier to keep track of the spaces and we copied the sentence below, in the Answer so that all you have to do is add the parsing commas.
Answer:
can_you_can_a_can_as_a_canner_can_can_a_can?
Problem 2. Lempel-Ziv Compression.
The next step in the Lempel-Ziv algorithm is the coding of individual strings which is explained on the Lempel-Ziv coding webpage.
The example for which parsing is worked out is:
If_one_doctor_doctors_another_doctor, _does_the_doctor_who_doctors_the_doctor_ doctor_the_doctor_the_way_the_doctor_he_is_doctoring_doctors? _Or_does_he_doctor_the_doctor_the_way_the_doctor_who_doctors_doctors?which parses as follows:
,I,f,_,o,n,e,_d,oc,t,or,_do,c,to,r,s,_a,no,th,er,_doc,tor,,, _doe,s_,the,_doct,or_,w,h,o_, d,oct,ors,_t,he,_docto,r_, do,ct,or_t,he_,doc,tor_,the_,wa,y,_th,e_,doct,or_h,e_i,s_d, octo,ri,ng,_doctor,s?, _O,r_d,oe,s_h,e_d,octor,_the,_doctor_,the_w,a,y_, the_d,octor_, wh,o_d,octors,_doctors,?
To LZ compress this sentence, we number these strings. The first string, before the first comma, is the empty string (we always start with this), and we give it label 0. We now number all the following strings consecutively. The following table shows the compression process for the first few strings. This last column is the one that is important for us as it contains the LZ compressed sentence. On the webpage, you can practice completing the compression process for this example sentence.
String | Position Number of this string | Prefix | Position Number of Prefix | Coded String |
---|---|---|---|---|
I | 1 | empty | 0 | 0I |
f | 2 | empty | 0 | 0f |
_ | 3 | empty | 0 | 0_ |
o | 4 | empty | 0 | 0o |
n | 5 | empty | 0 | 0n |
e | 6 | empty | 0 | 0e |
_d | 7 | _ | 3 | 3d |
oc | 8 | o | 4 | 4c |
Complete now the following compression table for the sentence you parsed in Problem 1 (we have filled in the first two rows to get you started).
Answer:
String | Position Number of this string | Prefix | Position Number of Prefix | Coded String |
---|---|---|---|---|
c | 1 | empty | 0 | 0c |
a | 2 | empty | 0 | 0a |
  | 3 |   |   |   |
  | 4 |   |   |   |
  | 5 |   |   |   |
  | 6 |   |   |   |
  | 7 |   |   |   |
  | 8 |   |   |   |
  | 9 |   |   |   |
  | 10 |   |   |   |
  | 11 |   |   |   |
  | 12 |   |   |   |
  | 13 |   |   |   |
  | 14 |   |   |   |
  | 15 |   |   |   |
  | 16 |   |   |   |
  | 17 |   |   |   |
  | 18 |   |   |   |
  | 19 |   |   |   |
  | 20 |   |   |   |
  | 21 |   |   |   |
  | 22 |   |   |   |
  | 23 |   |   |   |
Write here the entire compressed sequence.
Answer:
Problem 3. Huffman Coding.
At Generique U. students get the following grades with the following frequencies:
Grade Frequency A 15% of the time B 30% Of the time C 15% of the time D 5% of the time P(pass) 20% of the time F(fail) 5% of the time I(incomplete) 5% of the time NA(not assigned) 5% of the time
Historically, the grades are stored into the system using the following code:
Grade Codeword A 000 B 001 C 010 D 100 P(pass) 011 F(fail) 101 I(incomplete) 110 NA(not assigned) 111
(a) How many bits per grade do you expect the historical code to use (on average)?
(b) Now you decide to use your knowledge of grade frequency for data compression. Design a Huffman code for { A, B, C, D, P, F, I, NA } using the given frequencies. You may draw a labeled tree or write out the code.
(c) How many bits per grade do you expect your new code to use (on average)? What was the savings?