Quantization in DSP - Truncation and Rounding

Designing a digital filter means determining its coefficients, a fact we first saw when designing IIR filters. We store the values of these coefficients in binary registers, which are simply digital memory locations in the DSP system.

In theory, we describe filter coefficients with infinite-precision arithmetic in the interest of accuracy. In practice, a register cannot hold an arbitrarily long string of bits, so we need a way to fit the filter coefficients into a register of fixed word size.
So we give up infinite precision and use the fixed-point representation of binary numbers. In the fixed-point representation, the number of digits before and after the binary point is fixed, and the MSB (Most Significant Bit) represents the sign of the number.

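Within fixed-point representation, we have three different ways of representing signed numbers. All three store positive numbers identically and differ only in how they store negative numbers. The three formats are:
- Sign magnitude: The MSB represents the sign and the remaining bits hold the magnitude. (If MSB = 1, the number is negative. If MSB = 0, the number is positive.)
- 1s complement: A negative number is formed by complementing every bit of the corresponding positive number.
- 2s complement: A negative number is formed by adding 1 to the LSB of its 1s complement.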
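
To make the three formats concrete, here is a minimal Python sketch that encodes a small signed integer in each of them. The function names and the 4-bit word size are assumptions chosen for illustration, and the sketch assumes the value fits in the word. The same bit patterns apply to fixed-point fractions; only the position of the binary point changes.

```python
def sign_magnitude(value: int, bits: int) -> str:
    """Sign bit followed by the magnitude in the remaining (bits - 1) bits."""
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), f"0{bits - 1}b")


def ones_complement(value: int, bits: int) -> str:
    """Non-negative values are stored as-is; negatives flip every bit of |value|."""
    mask = (1 << bits) - 1
    if value >= 0:
        return format(value, f"0{bits}b")
    return format(~abs(value) & mask, f"0{bits}b")


def twos_complement(value: int, bits: int) -> str:
    """Negatives are the 1s complement pattern plus one (value modulo 2**bits)."""
    mask = (1 << bits) - 1
    return format(value & mask, f"0{bits}b")


for value in (3, -3):
    print(value,
          sign_magnitude(value, 4),
          ones_complement(value, 4),
          twos_complement(value, 4))
```

For +3 all three functions return the same pattern, while for -3 they return three different patterns, each with the MSB set to 1.
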
Now that we have a fixed number of bits, we need to make sure that these bits match the word size of the register we use to store the coefficient values. If a coefficient has more bits than the register can hold, we quantize it.
Thus, quantization is the process of reducing the number of bits so that the filter coefficients fit in the registers of the Digital Signal Processing system.

In this post, we will study two quantization methods:
- Truncation
- Rounding

What is Truncation?

Truncation is a type of quantization in which the extra bits are simply cut off.
In the truncation process, all bits less significant than the desired LSB (Least Significant Bit) are discarded.
For example, suppose we wish to truncate the following number from 8 fractional bits to 4.
- X = 0.01101011 truncates to X = 0.0110
- Converting both to decimal shows a noticeable change in value: 0.01101011 equals 0.418 and 0.0110 equals 0.375, an error of about -0.043.
- Thus, truncation is the cruder method of quantization, since the error can be almost as large as one full LSB of the truncated word.
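
The example above can be reproduced with a short Python sketch. The helper names `bits_to_decimal` and `truncate` are hypothetical, and the number is assumed to be a non-negative fraction written as the string of bits after the binary point.

```python
def bits_to_decimal(fraction_bits: str) -> float:
    """Value of the bits after the binary point, e.g. '0110' -> 0.375."""
    return sum(int(bit) * 2.0 ** -(i + 1) for i, bit in enumerate(fraction_bits))


def truncate(fraction_bits: str, b: int) -> str:
    """Keep the b most significant fractional bits and discard the rest."""
    return fraction_bits[:b]


x = "01101011"           # X = 0.01101011 in binary
x_trunc = truncate(x, 4)  # keep 4 fractional bits

original = bits_to_decimal(x)
quantized = bits_to_decimal(x_trunc)
print(x_trunc, original, quantized, quantized - original)
```

Here the error is computed as the quantized value minus the original value, so it comes out negative for this positive number.
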

The error $e$ from quantization using truncation (the truncated value minus the original value, with $b$ the number of fractional bits kept) is bounded as follows:
- $-2^{-b} \le e \le 0$ (for a positive number in any format, and for any number in 2s complement)
- $0 \le e \le 2^{-b}$ (for a negative number in sign magnitude or 1s complement)
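
The sign of the truncation error can be checked numerically. The sketch below is an illustration under two assumptions: dropping bits from a 2s complement word is modelled as rounding toward minus infinity (`math.floor`), and dropping bits from a sign-magnitude or 1s complement word is modelled as rounding toward zero (`math.trunc`), because in those formats only the magnitude shrinks. The function names are hypothetical.

```python
import math


def truncate_twos_complement(x: float, b: int) -> float:
    """Dropping low bits of a 2s complement word rounds toward minus infinity."""
    return math.floor(x * 2**b) / 2**b


def truncate_sign_magnitude(x: float, b: int) -> float:
    """Dropping low bits of the magnitude rounds toward zero."""
    return math.trunc(x * 2**b) / 2**b


b = 4
for x in (0.41796875, -0.41796875):
    e_twos = truncate_twos_complement(x, b) - x
    e_sm = truncate_sign_magnitude(x, b) - x
    print(f"x = {x:+.8f}  2s complement e = {e_twos:+.8f}  sign magnitude e = {e_sm:+.8f}")
    assert -2.0**-b <= e_twos <= 0
    if x < 0:
        assert 0 <= e_sm <= 2.0**-b
```

For the positive input both models give the same negative error. For the negative input the 2s complement error stays negative, while the sign-magnitude error becomes positive.
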

What is Rounding?

Rounding is a quantization method where we round a number to the nearest value that can be represented with the desired number of bits.
In other words, rounding reduces a binary number to some desired finite size in such a way that the rounded number is as close as possible to the original unquantized number.
Interestingly, the rounding process is a combination of truncation and addition.

To round a number to, say, b bits, the number is first truncated to the desired number of bits. Then, depending on the bit that sat immediately after the LSB before truncation, an addition to the LSB is performed.
If that bit (previously next to the LSB) was 0, then 0 is added to the LSB. If that bit was 1, then 1 is added to the LSB.
Consider the same example as above: suppose we wish to round the following number from 8 fractional bits to 4.
- X = 0.01101011 truncates to X = 0.0110
- Since the bit next to the current LSB was 1, we add 1 to the current LSB.
- Thus X is now 0.0111
- Converting both the unquantized and rounded numbers to decimal, we notice that the magnitude of the error is smaller than with truncation: 0.01101011 equals 0.418 and 0.0111 equals 0.438, an error of about +0.020 instead of -0.043.
Thus rounding is preferable to truncation.
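
The truncate-then-add procedure described above can be sketched in Python as follows. The helper names are hypothetical, the input is assumed to be a non-negative fraction given as the bits after the binary point, and a carry out of the fractional field is reported as an error instead of being handled.

```python
def bits_to_decimal(fraction_bits: str) -> float:
    """Value of the bits after the binary point, e.g. '0111' -> 0.4375."""
    return sum(int(bit) * 2.0 ** -(i + 1) for i, bit in enumerate(fraction_bits))


def round_bits(fraction_bits: str, b: int) -> str:
    """Truncate to b bits, then add the first discarded bit to the LSB."""
    kept = int(fraction_bits[:b], 2)
    next_bit = int(fraction_bits[b]) if len(fraction_bits) > b else 0
    rounded = kept + next_bit
    if rounded >= 2**b:
        # The addition carried past the binary point; not handled in this sketch.
        raise OverflowError("rounding carried out of the fractional field")
    return format(rounded, f"0{b}b")


x = "01101011"            # X = 0.01101011 in binary
x_round = round_bits(x, 4)

error = bits_to_decimal(x_round) - bits_to_decimal(x)
print(x_round, bits_to_decimal(x_round), error)
```

The error printed here is positive and less than half the size of the truncation error for the same number.
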
The error in rounding is bounded by:
- $-\frac{2^{-b}}{2} \le e \le \frac{2^{-b}}{2}$
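
As a sanity check on this bound, the sketch below rounds every non-negative fraction with 8 fractional bits to 4 bits and confirms that the error never exceeds half an LSB. It assumes rounding is implemented as "add half an LSB, then truncate", which for non-negative values matches the procedure of adding the first discarded bit. The function name `round_to_bits` is hypothetical.

```python
import math


def round_to_bits(x: float, b: int) -> float:
    """Round x to b fractional bits: add half an LSB, then truncate."""
    return math.floor(x * 2**b + 0.5) / 2**b


b = 4
half_lsb = 2.0**-b / 2

# Every non-negative fraction that fits in 8 fractional bits.
errors = [round_to_bits(k / 256, b) - k / 256 for k in range(256)]

print(min(errors), max(errors), half_lsb)
assert all(-half_lsb <= e <= half_lsb for e in errors)
```

The largest positive error equals exactly half an LSB, which occurs when the discarded bits are 1000 and the number is rounded up.
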