JANUARY 2016 VOLUME -4 ISSUE-5 Page:7880-88

# Analysis of Efficient CRC Implementation Configurations

KrishnaveniVajrala
Department of ECE,
Sasi Institute of Technology and
Engineering, Tadepalligudem.
kkrishnaveni55@gmail.com

V.V.N Sujit
Assistant Professor
Department of ECE,
Sasi Institute of Technology and
Engineering, Tadepalligudem.
vvnsujit@sasi.ac.in

J.Lakshmi Narayana
Assistant Professor
Department of ECE,
Sasi Institute of Technology and
Engineering, Tadepalligudem.
JL@sasi.ac.in

Abstract-A cyclic redundancy check (CRC) is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to raw data. Blocks of data entering these systems get a short check value attached, based on the remainder of a polynomial division of their contents; on retrieval the calculation is repeated, and corrective action can be taken against presumed data corruption if the check values do not match. Error correction codes provides a mean to detect and correct errors introduced by the transmission channel. A high-speed parallel cvclic redundancy check implementation based on unfolding, pipelining, and retiming algorithms. CRC architectures are first pipelined to reduce the iteration bound by using novel look-ahead pipelining methods and then unfolded and retimed to design high-speed parallel circuits. The study and implementation using Verilog HDL. Modalism Xilinx Edition will be used for simulation and functional verification. Xilinx ISE will be used for synthesis and bit file generation. The Xilinx Chip scope will be used to test the results on Spartan 3E.

**Keywords**—Cyclic redundancy checksum (CRC), digital logic, error detection, parallel, programmable.

#### **I-INTRODUCTION**

Cyclic redundancy check is commonly used in data communication and other fields such as data storage, data compression, as a vital method for dealing with data errors. Usually, the hardware implementation of CRC computations is based on the linear feedback shift registers (LFSRs), which handle the data in a serial way. Though, the serial calculation of the CRC codes cannot achieve a high throughput. In contrast, parallel CRC calculation can significantly increase the throughput of CRC computations. For example, the throughput of the 32-bit parallel calculation of CRC-32 can achieve several gigabits per second however that is

still not enough for high speed application such as Ethernet networks. A possible solution is to process more bits in parallel; Variants of CRCs are used in applications like CRC-16 BISYNC protocols, CRC32 in Ethernet frame for error detection, CRC8 in ATM, CRC-CCITT in X-25 protocol, disc storage, SDLC, and XMODEM.

## II-CRC'S IN HIGH SPEED WIRELESS LAN

networking environments, the redundancycheck (CRC) is widely utilized to determine whether errors have been introduced during transmissions over physical links. In this paper, we focus on the CRC calculation in WLAN where the packet size is huge and hence slow CRC calculation may become bottleneck for communication process. Based on this concept, paper present a novel implementation of the CRC implementation through multiple execution units that calculate CRC on different part of packet and then combine the result to get the final CRC. Consequently, the number of cycles utilized to recalculate the CRC codes is dramatically reduced. Furthermore, estimation on the maximum throughput is made based on synthesis results of our implementation and under that assumption, calculate actual CRC operation has been done. A performance study of the project is done by calculation of CRC with 2, 4, 8 execution units on same data block. There are several techniques for generating check bits that can be added to a message. Perhaps the simplest is to append a single bit, called the -parity bit. Which makes the total number of 1-bits in the code vector.(message with parity bit appended) even (or odd). If a single bit gets altered in transmission, this will change the parity from even to odd (or the reverse). The sender generates the parity bit by simply summing the message bits modulo 2—that is, by exclusive or'ing them together. It then appends the parity bit (or its complement) to the message. The receiver can check the message by summing all the message bits modulo 2 and checking that the sum agrees with the parity bit. Equivalently, the receiver can sum all the bits (message and parity) and check that the result is 0 (if even parity is being used). This simple parity technique is often said to detect 1-bit errors. Actually it detects errors in any odd number of bits (including the parity bit), but it is a small comfort to know you are detecting 3-bit errors if you are missing 2-bit errors.



Fig.1. XOR Tree

Other techniques for computing a checksum are to form the exclusive or of all the bytes in the message, or to compute a sum with end-around carry of all the bytes. In the latter method the carry from each 8-bit sum is added into the least significant bit of the accumulator. It is believed that this is more likely to detect errors than the simple exclusive or, or the sum of the bytes with carry discarded. A technique that is believed to be quite good in terms of error detection, and which is easy to implement in hardware, is the cyclic redundancy check. This is another way to compute a checksum, usually eight, 16, or 32 bits in length that is appended to the message. We will briefly review the theory and then give some algorithms for computing in software a commonly used 32-bit CRC checksum. The CRC is based on polynomial arithmetic, in particular, on computing the remainder of dividing one polynomial in GF (2) (Galois field with two elements) by another. It is a little like treating the message as a very large binary number, and computing

the remainder on dividing it by a fairly large prime such as intuitively, one would expect this to give a reliable checksum. A polynomial in GF (2) is a polynomial in a single variable x whose coefficients are 0 or 1. Addition and subtraction are done modulo 2—that is, they are both the same as the exclusive and operator. For example, the sum of the polynomials x3 + x + 1 and x4 + x3 + x2 + x is as is their difference. These polynomials are not usually written with minus signs, but they could be, because a coefficient of -1 is equivalent to a coefficient of 1. Multiplication of such polynomials is straightforward. The product of one coefficient by another is the same as their combination by the logical and operator, and the partial products are summed using exclusive or. Multiplication isnot needed to compute the CRC checksum. Division of polynomials over GF(2) can be done in much the same way as long division of polynomials over the integers. Below is an example.

$$\begin{array}{r}
x^{4} + x^{3} + 1 \\
x^{3} + x + 1 \overline{\smash) \, \begin{array}{rrrr} x^{7} + x^{6} + x^{5} + & x^{2} + x \\
\underline{x^{7} + & x^{5} + x^{4}} \\
x^{6} + & x^{4} + x^{3} \\
\underline{x^{6} + & x^{4} + x^{3}} \\
x^{3} + x^{2} + x \\
\underline{x^{3} + & x + 1} \\
x^{2} + & 1
\end{array}}$$

Fig.2. Division operation

The reader might like to verify that the quotient of multiplied by the divisor of x3+x+1 plus the remainder of equals the dividend. The CRC method treats the message as a polynomial in GF (2). For example, the message 11001001, where the order of transmission is from left to right (110...) is treated as a representation of the polynomial  $x^7 + x^6 + x^3 + 1$ . The sender and receiver agree on a certain fixed polynomial called the generator polynomial. For example, for a 16bitCRCtheCCITT(ComitéConsultatifInternationaleTélé graphique ET Téléphonique) 1 has chosen the polynomial x16 + x12 + x5 + 1 which is now widely used for a 16-bit CRC checksum. To compute an r-bit CRC checksum, the generator polynomial must be of degree r. The sender appends r 0-bits to the m-bit message and divides the resulting polynomial of degree m + r - 1 by the generator polynomial. This produces a remainder polynomial of degree r - 1 (or less). The remainder polynomial has r coefficients, which are the checksum. The quotient polynomial is discarded. The

JANUARY 2016 VOLUME -4 ISSUE-5 Page:7880-88

data transmitted (the code vector) is the original m-bit message followed by the rbit checksum.

## III-HARDWARE FEEDBACK SHIFT REGISTER



Fig.3. Feedback Shift Register

Initialize the CRC register to all 0-bits. Get first/next message bit m. If the high-order bit of CRC is 1, Shift CRC and m together left 1 position, and XOR the result with the low-order r bits of G. Otherwise, Just shift CRC and m left 1 position. If there are more message bits, go back to get the next one. CRC is playing a main role in the networking environment to detect the errors. With challenging the speed of transmitting data, to synchronize with speed, it's necessary to increase speed of CRC generation. Most electrical and computer engineers are familiar with the cyclic redundancy check (CRC). Many know that it's used in communication protocols to detect biterrors, and that it's essentially a remainder of the modulo-2long division operation. Some have had closer encounters with the CRC and know that it's implemented as a linear feedback shift register (LFSR) using flip-flops and XOR gates. They likely used an online tool or an existing example to generate parallel CRC code for a

In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.

Retiming is the technique of moving the structural location of latches or registers in a digital circuit to improve its performance, area, and/or power characteristics in such a way that preserves its functional behavior at its outputs.

Unfolding is a transformation technique of duplicating the functional blocks to increase the throughput of the DSP program in such a way that preserves its functional behavior at its Outputs. Unfolding in general program is as known as Loop unrolling. Unfolding has applications in designing high-speed and low-power ASIC architectures. One application is to unfold the program to reveal hidden concurrency so that the program can be scheduled to a smaller iteration period, thus increasing the throughput of the implementation. Another Application is parallel processing in word level or bit level. Therefore these transformed circuit could increase the throughput and decrease the power consumption.

## (a) Fast CRC:



Fig.4.Fast CRC Update Architecture

Our fast CRC update method is extended from the parallel CRC calculation and can adapt to a number of bits processed in parallel. The method can also reduce the data traffic and power consumption of the CRC calculation unit.

#### (b) LFSR in CRC:

In traditional hardware implementations, a simple circuit based on shift registers performs the CRC calculation by handling the message one bit at a time. A typical serial CRC using LFSRs is it illustrates one possible structure for CRC32. There are a total of 32 registers the middle ones are left out for brevity. The combinational logic operation in the figure is the XOR operation. One bit is shifted in at each clock pulse. This circuit operates in a fashion similar to manual long division. The XOR gates hold the coefficients of the divisor corresponding to the indicated powers of x. Although the shift register approach to computing CRCs is usually implemented in hardware, this algorithm can also be used in software when bit-by-bit

processing is adequate.



Fig.5.Serial CRC circuit using LFSRs

The proposed design starts from LFSR, which is generally used for serial CRC. An unfolding algorithm is used to realize parallel processing. However, direct application of unfolding may lead to a parallel CRC circuit with long iteration bound, which is the lowest achievable CP.



Fig.6.F-matrix algorithm

Two novel look-ahead pipelining methods are developed to reduce the iteration bound of the original serial LFSR CRC structures; then, the unfolding algorithm is applied to obtain a parallel CRC structure with low iteration bound. The retiming algorithm is then applied to obtain the achievable lowest CP.

Parallelly data input is processed; it is ANDed with the F-matrix generation from the generated polynomial. Result of that will XORed with present state CRC checksum. The final result will obtained after (k+m)/w cycles. Generation of F-matrix: F-matrix generated from the generated polynomial, matrix form can be represented as:

$$F = \begin{bmatrix} P_{m-1} & 1 & 0 & 0 & 0 \\ P_{m-2} & 0 & 1 & 0 & 0 \\ P_{m-3} & 0 & 0 & 1 & 0 \\ P_{m-4} & 0 & 0 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ P_0 & 0 & 0 & 0 & 0 \end{bmatrix}$$

Where {p0.....pm-1} is generator polynomial. For example, the generator polynomial for CRC4 is {1, 0, 0, 1, and 1} and w-bits are parallel processed.

$$\mathbf{F} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix} \quad \mathbf{F}^4 = \begin{bmatrix} 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix}$$

Where {p0.....pm-1} is generator polynomial. For example, the generator polynomial for CRC4 is {1, 0, 0, 1, and 1} and w-bits are parallel processed.

Here w=m=4, for that matrix calculated as follow. Parallel architecture: Parallel architecture based on F-matrix"d" is data that is parallel processed (i.e. 32bit), 'X is next state, X is current state(generated CRC), F (i) (j) is the ith row and jth column of FW matrix. If  $X = [xm1 \dots x1 \ x0]T$  is utilized to denote the state of the shift registers, in linear system theory, the state equation for LFSRs can be expressed in modular 2 arithmetic as follow.  $Xi' = (P0 \otimes Xm-1) \oplus d$  Where, X(i) represents the state of the registers, X(i + 1) denotes the state of the registers, d denotes the one bit shift-in serial input. F is an m x m matrix and G is a 1 x m matrix.  $G = [0 \ 0 \ -----0 \ 1]$  T. the final representation  $X' = Fw \otimes X \otimes d$ .

If W-bits are parallel processed, the result of the CRC will generated after (k+m)/w cycles.

## IV- EXPERIMENT RESULTS

Each architecture is coded in Verilog and simulated. The simulation results and the net List simulation are verified for each architecture. For the message bits: 11010011101100 And for the generator polynomial 1+Y+Y2+Y3 i.e., 1011.

JANUARY 2016 VOLUME -4 ISSUE-5 Page:7880-88

## (a) Black Box View





Fig.7.Black Box View

## (b) Simulation Results:













Fig.8.Simulation Results

## (c) Table: Schematic, timing report, hardware resources

| Device utilization<br>summary                       | TIMING REPORT                                                           | HDL Synthesis Report            |              |
|-----------------------------------------------------|-------------------------------------------------------------------------|---------------------------------|--------------|
| Selected Device :<br>3s50pq208-5                    | clk<br>BUFGP   43                                                       | # Registers<br>: 43             |              |
| Number of Slices:<br>76 out of 768<br>9%            | Speed Grade: -5                                                         | Flip-Flops<br>: 43              |              |
| Number of Slice Flip<br>Flops: 43 out<br>of 1536 2% | Delay: 5.047ns<br>(Levels of Logic = 3)                                 | Design Statistics<br># IOs : 53 | 3            |
| Number of 4 input<br>LUTs: 142<br>out of 1536 9%    | Clock period: 5.047ns<br>(frequency:<br>198.151MHz)                     | #BELS :1                        | 56           |
| Number of IOs:<br>53                                | Total number of paths /<br>destination ports: 274 /<br>43               | #FlipFlops/Latches 43           | 3            |
| Number of bonded<br>IOBs: 53<br>out of 124 42%      | Total 5.047ns (2.239ns logic, 2.808ns route) (44.4% logic, 55.6% route) | FDPE : 43                       |              |
| Number of GCLKs:<br>1 out of 8 12%                  | Offset: 6.633ns                                                         | # BUFGP : H IO Buffers : 1      | 1<br>52<br>2 |

## **V-CONCLUSION**

Generally when high-speed data transmission is required serial implementation is not preferred because of slow throughput. So parallel implementation is preferred which takes less time. CRC-32 requires 17 clock cycles to transmit 64bytes of data. But CRC-64 needs 9 clock cycles to transmit the same data. So, it drastically reduces computation time to 50% and same time increases the throughput.

#### References:

- [1] Peterson, W. W. and Brown, D. T. (January 1961).
  "Cyclic Codes for Error Detection". Proceedings of the IRE 49 (1):228–235. doi: 10.1109/JRPROC. 1961. 287814.
- [2] Ritter, Terry (February 1986). "The Great CRC Mystery". Dr. Dobb's Journal 11 (2): 26–34, 76–83. Retrieved 21 May 2009.
- [3] Stigge, Martin; Plötz, Henryk; Müller, Wolf; Redlich, Jens-Peter (May 2006). Reversing CRC Theory and Practice. Berlin: Humboldt University Berlin. p. 17. Retrieved 4 February 2011."The presented methods offer a very easy and efficient way to modify your data so that it will compute to a CRC you want or at least know in compute to a CRC you want or at least know in advance."
- [4] Cam-Winget, Nancy; Housley, Russ; Wagner, David; Walker, Jesse (May 2003).

  "Security Flaws in 802.11 Data Link Protocols".

  Communications of the ACM 46 (5): 35–39. doi: 10.1145/769800. 769823.
- [5] Williams, Ross N. (24 September 1996). "A Painless Guide to CRC Error Detection Algorithms V3.00". Retrieved 5 June 2010.
- [6] WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007). "Section 22.4 Cyclic Redundancy and Other Checksums". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press.ISBN 978-0-521-88068-8.
- [7] Koopman, Philip; Chakravarty, Tridib (June 2004). "Cyclic Redundancy Code (CRC) Polynomial Selection for Embedded Networks". The International Conference on Dependable Systems and Networks: 145 154. doi: 10.1109/DSN. 2004.1311885. ISBN 0-7695-2052-9. Retrieved 14 January 2011.
- [8] Cook, Greg (6 July 2012). "Catalogue of

JANUARY 2016 **VOLUME-4** ISSUE-5 Page:7880-88

parametrised CRC algorithms". Retrieved 7 July 2012.

Krishnavenivajrala pursuing her M.Tech in Very Large Scale Integration and Embedded Systems From Sasi Institute of Technology and Engineering, West Godavari District, Tadepalligudem, Andhra Pradesh 534101. Kkrishnaveni55@gmail.com

V.V.N Sujit Working as Assistant Professor, from Electronics and communication Engineering in Sasi Institute of Technology and Engineering, West Godavari District, Tadepalligudem, Andhra Pradesh 534101. vvnsujit@sasi.ac.in

J.Lakshmi narayana Working as Assistant Professor, from Electronics and communication Engineering in Sasi Institute of Technology and Engineering, West Godavari District, Tadepalligudem, Andhra Pradesh 534101.JL@sasi.ac.in



International Journal of Engineering In Advanced Research Science and Technology ISSN: 2278-256

JANUARY 2016 VOLUME -4 ISSUE-5 Page:7880-88

