
Quantization in DSP – Truncation and Rounding

  • Designing a digital filter means determining its coefficients, as we first saw when designing IIR filters. We store the values of these coefficients in binary registers, which are simply digital memories in the DSP system.
  • In theory, we describe filter coefficients with infinite precision arithmetic in the interest of accuracy. In practice, a register cannot hold an arbitrarily long chain of bits, so we need a way to fit these filter coefficients into a register of fixed word size.
  • So we give up infinite precision arithmetic and use the fixed-point representation of binary numbers. In the fixed-point representation, the number of digits before and after the binary point is fixed, and the MSB represents the sign of the number.
  • Within the fixed-point representation, there are three ways of representing numbers. They differ only in how negative numbers are encoded; positive numbers look the same in all three. This is just for your information. The three formats are:
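    • Sign magnitude: The leading binary digit represents the sign, and the remaining bits hold the magnitude. (If MSB = 1, the number is negative. If MSB = 0, the number is positive.)
    • 1s complement: A negative number is formed by complementing all the bits of the corresponding positive number.
    • 2s complement: A negative number is formed by adding 1 to the LSB of its 1s complement.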
  • Now that we have a fixed number of bits, we need to make sure that it matches the word size of the register that stores the coefficient values. If a coefficient has more bits than the register can hold, we quantize it.
  • Thus, quantization is the process of reducing the number of bits to ensure the storage of the filter coefficients in the Digital Signal Processing system’s register.
  • In this post, we will study two types of Quantization methods:
    • Truncation
    • Rounding
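The three fixed-point formats listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`to_bits`, `sign_magnitude`, `ones_complement`, `twos_complement`) are made up for this example, the functions work on small integers rather than fractions (the bit patterns are the same wherever the binary point is placed), and no overflow checking is done.

```python
def to_bits(value: int, width: int) -> str:
    """Format a non-negative integer as a zero-padded binary string."""
    return format(value, f"0{width}b")


def sign_magnitude(n: int, width: int) -> str:
    """MSB holds the sign; the remaining width-1 bits hold |n|."""
    sign = "1" if n < 0 else "0"
    return sign + to_bits(abs(n), width - 1)


def ones_complement(n: int, width: int) -> str:
    """A negative value is the bitwise complement of |n|."""
    mask = (1 << width) - 1
    return to_bits(abs(n) ^ mask if n < 0 else n, width)


def twos_complement(n: int, width: int) -> str:
    """A negative value is its 1s complement plus one (n modulo 2**width)."""
    mask = (1 << width) - 1
    return to_bits(n & mask, width)


# Hypothetical example values: +3 and -3 in a 4-bit word (no overflow checks).
for n in (3, -3):
    print(n, sign_magnitude(n, 4), ones_complement(n, 4), twos_complement(n, 4))
```

For +3 all three functions give the same pattern, while for -3 they differ, which is the point of the distinction between the formats.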

What is Truncation?

  • Truncation is a type of quantization where extra bits get ‘truncated.’
  • Basically, in the truncation process, all bits less significant than the desired LSB (Least Significant Bit) are discarded.
  • For example, suppose we wish to truncate the following number from 8 fractional bits to 4.
    • X = 0.01101011 truncates to X = 0.0110
    • Converting both to decimal shows a noticeable change in value: 0.01101011 equals about 0.418 and 0.0110 equals 0.375, an error of about -0.043.
    • Thus, truncation is the cruder method of quantization: the discarded bits are simply lost, so the error can approach one full LSB of the quantized word.
  • With b fractional bits retained, and the error defined as e = (quantized value) - (original value), the error from truncation is bounded by:
    • $-2^{-b} \le e \le 0$ (for positive numbers in any format, and for all numbers in 2s complement)
    • $0 \le e \le 2^{-b}$ (for negative numbers in sign magnitude or 1s complement)
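The truncation example above can be reproduced with a minimal Python sketch. The helper names `to_float` and `truncate` are hypothetical, and the code assumes an unsigned binary fraction written as a string with a single binary point.

```python
def to_float(bits: str) -> float:
    """Convert an unsigned binary string such as '0.0110' to its decimal value."""
    whole, frac = bits.split(".")
    return int(whole + frac, 2) / 2 ** len(frac)


def truncate(bits: str, b: int) -> str:
    """Keep the first b fractional bits and discard everything after them."""
    whole, frac = bits.split(".")
    return f"{whole}.{frac[:b]}"


x = "0.01101011"
b = 4
x_trunc = truncate(x, b)

# Error is defined as (quantized value) - (original value).
error = to_float(x_trunc) - to_float(x)

print(x_trunc)             # truncated bit pattern
print(to_float(x))         # original value in decimal
print(to_float(x_trunc))   # truncated value in decimal
print(error)               # truncation error

# For a positive number the error should lie in [-2**-b, 0].
print(-2 ** -b <= error <= 0)
```

Because every value involved is a sum of powers of two, the floating-point arithmetic here is exact, so the printed error can be compared directly against the bound.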
Figure: Truncation

What is Rounding?

  • Rounding is a quantization method where we round a number to the nearest value that can be represented with the desired number of bits.
  • Basically, rounding is the process of reducing the size of a binary number to some desirable finite size. This is done in such a way that the rounded off number is as close to the original unquantized number as possible.
  • Interestingly, the rounding process is a combination of truncation and addition.
  • In rounding a number to, say, b bits, the number is first truncated to the desired number of bits. Then, depending on the bit that sat immediately after the new LSB before truncation, an addition to the LSB is performed.
  • If that bit (the first discarded bit) was 0, nothing is added to the LSB. If it was 1, then 1 is added to the LSB.
  • Consider the same example as above: suppose we wish to round the following number from 8 fractional bits to 4.
    • X = 0.01101011 first truncates to X = 0.0110
    • Since the bit next to the current LSB was 1, we add 1 to the current LSB.
    • Thus X is now 0.0111
    • Converting both the unquantized and rounded-off numbers to decimal, we notice that the magnitude of the error is smaller than with truncation: 0.01101011 equals about 0.418 and 0.0111 equals about 0.438, an error of about +0.020 compared with about -0.043 for truncation.
  • Thus rounding is preferable to truncation.
  • With b fractional bits retained, the error in rounding is bounded by:
    • $-\frac{2^{-b}}{2} \le e \le \frac{2^{-b}}{2}$
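The truncate-then-add procedure described above can be sketched as follows. The names `to_float` and `round_fraction` are hypothetical, and the code assumes an unsigned binary fraction given as a string and b greater than zero. A carry out of the fractional part is propagated into the whole part.

```python
def to_float(bits: str) -> float:
    """Convert an unsigned binary string such as '0.0111' to its decimal value."""
    whole, frac = bits.split(".")
    return int(whole + frac, 2) / 2 ** len(frac)


def round_fraction(bits: str, b: int) -> str:
    """Truncate to b fractional bits, then add the first discarded bit to the LSB."""
    whole, frac = bits.split(".")
    frac = frac.ljust(b + 1, "0")        # make sure a (b+1)-th bit exists
    scaled = int(whole + frac[:b], 2)    # truncated value in units of 2**-b
    scaled += int(frac[b])               # add 0 or 1 to the LSB
    digits = format(scaled, f"0{len(whole) + b}b")
    return f"{digits[:-b]}.{digits[-b:]}"


x = "0.01101011"
b = 4
x_round = round_fraction(x, b)

# Error is defined as (quantized value) - (original value).
error = to_float(x_round) - to_float(x)

print(x_round)            # rounded bit pattern
print(to_float(x_round))  # rounded value in decimal
print(error)              # rounding error

# The error should lie within half an LSB on either side.
print(abs(error) <= 2 ** -b / 2)
```

For this input the rounding error is smaller in magnitude than the truncation error found earlier for the same number, which matches the comparison in the text.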
Figure: Rounding
