A. Durga Bhavani, P. Soundarya Mala / International Journal of Engineering Research
and Applications (IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 2, Issue 5, September-October 2012, pp. 412-419

                      Optimized Elliptic Curve Cryptography

                         A. Durga Bhavani*, P. Soundarya Mala**
              Dept. of Electronics and Communication Engg., GIET College, Rajahmundry


Abstract—
     This paper details the design of a new high-speed pipelined application-specific instruction set processor (ASIP) for elliptic curve cryptography (ECC). Different levels of pipelining were applied to the data path to explore the resulting performance and to find an optimal pipeline depth. Three complex instructions were used to reduce the latency by reducing the overall number of instructions, and a new combined algorithm was developed to perform point doubling and point addition using the application-specific instructions. In the present work, pipelining techniques are used to optimize the point multiplication speed; the design is implemented on a modern Xilinx Virtex-4 device and compared with an implementation on the Xilinx Virtex-E.

Key Terms—Complex instruction set, efficient hardware implementation, elliptic curve cryptography (ECC), pipelining.

I. INTRODUCTION
     First introduced in the 1980s, elliptic curve cryptography (ECC) has become popular due to its superior strength-per-bit compared to existing public-key algorithms. This superiority translates to equivalent security levels with smaller keys, bandwidth savings, and faster implementations, making ECC very appealing. The IEEE proposed standard P1363-2000 recognizes ECC-based key agreement and digital signature algorithms, and a list of secure curves is given by the U.S. National Institute of Standards and Technology (NIST).
     Intuitively, there are numerous advantages to using field-programmable gate-array (FPGA) technology to implement in hardware the computationally intensive operations needed for ECC. Indeed, these advantages are comprehensively studied and listed by Wollinger et al. In particular, performance, cost efficiency, and the ability to easily update the cryptographic algorithm in fielded devices make hardware implementations very attractive for ECC.
     Numerous ECC hardware accelerators and cryptographic processors have been presented in the literature. More recently, these have included a number of FPGA architectures, which present acceleration techniques to improve the performance of the ECC operations.
     The optimization goal is usually to reduce the latency of a point multiplication in terms of the number of required cycles; in particular, many works duplicate arithmetic blocks to exploit the parallelism in the underlying operations. Yet for most of these implementations, efforts are concentrated on algorithm optimization or improved arithmetic architectures, and rarely on a processor architecture particularly suited for ECC point multiplication. In this work, some of the design techniques used in modern high-performance processors were incorporated into the design of an application-specific instruction set processor (ASIP) for ECC. Pipelining was applied to the design, giving improved clock frequencies. Data forwarding and instruction reordering were incorporated to exploit the inherent parallelism of the Lopez and Dahab point multiplication algorithm, reducing pipeline stalls.
     In this paper, a thorough treatment is given to the design of an ASIP for ECC, yielding a new combined algorithm to perform point doubling and point addition based on the instruction set developed, further reducing the required number of instructions per iteration. The same data path is used, but the performance is explored for different levels of pipelining, and a superior choice of pipeline depth is found. The resulting processor has high clock frequencies and low latency, and it has only a single instance of each of the arithmetic units. An FPGA implementation over GF(2^163) is presented, which is by far the fastest implementation reported in the literature to date.
     In Section II, a background is given in terms of elliptic curve operations, Galois field arithmetic, and state-of-the-art hardware implementations. Section III develops the ASIP architecture, beginning with the application-specific instructions and a new combined algorithm to perform point doubling and point addition (the crux ECC operations) based on the new instructions. In Section IV, the FPGA implementation of the processor is described, and the performance is analyzed and compared to the state of the art in Section V. The paper is concluded in Section VI.

II. BACKGROUND
A. ECC
     ECC is performed over one of two underlying Galois fields: prime-order fields or characteristic-two fields. Both fields are considered to provide the same level of security, but arithmetic in GF(2^m) will be the focus of this paper because it can be implemented in hardware more efficiently using modulo-2 arithmetic. An elliptic curve over
the field is the set of solutions to the equation

          y^2 + xy = x^3 + a*x^2 + b                  (1)

where a, b ∈ GF(2^m), b ≠ 0.

     The sum of two points P1 and P2 on the curve is again a point on the curve. Hence, when P1 = P2 we have the point-doubling operation (DBL), and when P1 ≠ P2 we have the point-adding operation (ADD). These operations in turn constitute the crux of any ECC-based algorithm, known as point multiplication or scalar multiplication. Due to the computational expense of inversion compared to multiplication, several projective-coordinate methods have been proposed, which use fractional field arithmetic to defer the inversion operation until the end of the point multiplication. In this paper, the high-performance, generic projective-coordinate algorithm proposed by Lopez and Dahab is used, which is an efficient implementation of Montgomery's method for computing kP. No precomputations or special field/curve properties are required.
     In this method, procedures to perform DBL and ADD are derived from efficient formulas which use only the x-coordinate of the points. In the projective-coordinate version of the formulas, the x-coordinate of P_i is represented by X_i/Z_i, for i ∈ {1, 2, 3}. The corresponding DBL and ADD computations are used in the projective-coordinate point multiplication algorithm shown in Algorithm 1.

Algorithm 1: Lopez-Dahab Point Multiplication
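The figure for Algorithm 1 is not reproduced above, so as an illustration the following Python sketch models the x-coordinate-only Montgomery ladder that the Lopez-Dahab algorithm performs. The field helpers gf_mul, gf_sqr and gf_inv are assumed here; minimal software versions are sketched in Section II-B below. This is a behavioural model for clarity, not the processor's instruction sequence.

```python
def lopez_dahab_point_mul(k, x, b):
    """x-coordinate-only Montgomery ladder for k*P on y^2 + xy = x^3 + a*x^2 + b
    over GF(2^m), following the Lopez-Dahab projective (X, Z) formulas.
    Assumes GF(2^m) helpers gf_mul, gf_sqr, gf_inv (see Section II-B);
    field addition is a plain XOR."""
    X1, Z1 = x, 1                                   # (X1, Z1) represents P
    X2 = gf_sqr(gf_sqr(x)) ^ b                      # (X2, Z2) represents 2P:
    Z2 = gf_sqr(x)                                  #   X2 = x^4 + b, Z2 = x^2
    for i in reversed(range(k.bit_length() - 1)):   # scan key bits below the MSB
        bit = (k >> i) & 1
        if bit:                                     # keep the point to be doubled
            X1, Z1, X2, Z2 = X2, Z2, X1, Z1         # in (X1, Z1)
        # ADD: (X2, Z2) <- (X1, Z1) + (X2, Z2); the difference is P, x-coord x
        T1, T2 = gf_mul(X1, Z2), gf_mul(X2, Z1)
        Z2 = gf_sqr(T1 ^ T2)
        X2 = gf_mul(x, Z2) ^ gf_mul(T1, T2)
        # DBL: (X1, Z1) <- 2*(X1, Z1)
        T3, T4 = gf_sqr(X1), gf_sqr(Z1)
        Z1 = gf_mul(T3, T4)                         # Z1 = X1^2 * Z1^2
        X1 = gf_sqr(T3) ^ gf_mul(b, gf_sqr(T4))     # X1 = X1^4 + b * Z1^4
        if bit:
            X1, Z1, X2, Z2 = X2, Z2, X1, Z1         # undo the swap
    return gf_mul(X1, gf_inv(Z1))                   # affine x(kP) = X1 / Z1
```

Only one field inversion is needed, at the very end, which is what makes the projective formulation attractive; the full Lopez-Dahab algorithm also recovers the y-coordinate, which is omitted in this sketch.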



B. Arithmetic Over GF(2^163) for ECC
     Addition and subtraction in the Galois field GF(2^m) are equivalent, performed by modulo-2 addition, i.e., a bit-wise XOR operation. As a result, arithmetic in the field is implemented more efficiently because it is carry free.
     Inversion is the most computationally expensive operation in the field, based either on the extended Euclidean algorithm (EEA) or Fermat's little theorem. Many efficient EEA-based inversion and division architectures exist in the literature, but they are usually expensive in terms of area. Inversion using Fermat's little theorem, as in the Itoh-Tsujii algorithm, consists of multiplication and squaring only, so it can often be implemented without additional hardware resources. It has been shown that similar performance to an EEA-based inverter can be achieved using the Itoh-Tsujii algorithm if the multiplier latency is sufficiently low and repeated squaring can be performed efficiently. Hence, multiplication is considered to be the most resource-sensitive operation in the field.
     Multipliers are usually implemented using one of three approaches: bit-serial, bit-parallel, or digit-serial. Bit-serial multipliers are small, but they require m steps to perform a multiplication. Bit-parallel multipliers are large but perform a multiplication in a single step. Digit-serial multipliers are most commonly used in cryptographic hardware, as performance can be traded against area. Recently, in a drive for increased speed, some methods have been proposed, and implementations presented, for ECC hardware based on bit-parallel multipliers.
     Squaring is a special case of multiplication, and it is only a little more complex than reduction modulo the irreducible polynomial. Efficient architectures to perform polynomial basis squaring, which require only combinational logic, exist in the finite field multiplier literature.
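To make the field operations above concrete, here is a minimal software sketch of GF(2^163) arithmetic using the NIST reduction polynomial G(x) = x^163 + x^7 + x^6 + x^3 + 1 that appears later in Section IV. It is a plain bit-serial model written for clarity; the processor itself uses the pipelined bit-parallel multiplier of Section III-C, and gf_inv below uses straightforward Fermat exponentiation rather than the addition-chain Itoh-Tsujii variant discussed in Section III-A.

```python
M = 163
G = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1   # x^163 + x^7 + x^6 + x^3 + 1

def gf_add(a, b):
    # Addition (and subtraction) in GF(2^m) is carry-free: a bit-wise XOR.
    return a ^ b

def gf_mul(a, b):
    # MSB-first shift-and-add polynomial multiplication with interleaved
    # reduction modulo G(x) (a software stand-in for the hardware multiplier).
    r = 0
    for i in reversed(range(M)):
        r <<= 1
        if (r >> M) & 1:
            r ^= G                      # degree reached m: subtract (XOR) G(x)
        if (b >> i) & 1:
            r ^= a
    return r

def gf_sqr(a):
    # Squaring is a special, cheaper case of multiplication (linear over GF(2)).
    return gf_mul(a, a)

def gf_inv(a):
    # Inversion via Fermat's little theorem: a^(2^m - 2) = a^(-1), computed by
    # square-and-multiply, i.e. multiplications and squarings only.
    r = a
    for _ in range(M - 2):
        r = gf_mul(gf_sqr(r), a)        # exponent grows to 2^(m-1) - 1
    return gf_sqr(r)                    # final squaring gives a^(2^m - 2)
```

The ladder sketch in Section II-A runs directly on top of these helpers.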



C. Previous Work
     Several recent FPGA-based hardware implementations of ECC have achieved high-performance throughput. Various acceleration techniques have been used, usually based on parallelism or precomputation.
     The work introduced by Orlando and Paar [2] is based on the Montgomery method for computing kP developed by Lopez and Dahab and operates over a single field. A point multiplication over GF(2^167) is performed in 210 microseconds, using a Galois field multiplier with an eleven-cycle latency. This provides an excellent benchmark for all high-performance architectures.
     Acceleration techniques based on precomputation can be particularly effective for a class of curves known as anomalous binary curves (ABCs) or Koblitz curves. The implementation from Lutz and Hasan implements some of these techniques, achieving a point-multiplication time of 75 µs, using a Galois field multiplier with a four-cycle latency. However, efforts here will be concentrated on high-performance architectures for generic curves, where such acceleration techniques cannot be used. An important result of that work, relevant regardless of the point-multiplication algorithm used, is that efficiently performing repeated squaring operations greatly reduces the cost of multiplicative inversion, which is required at the end of the point multiplication.
     An ECC processor capable of operating over multiple Galois fields was presented by Gura, et al., which performs a point multiplication over GF(2^163) in 143 µs, using a Galois field multiplier with a three-cycle latency. Gura, et al. stated two important conclusions: the efficient implementation of inversion has a significant impact on overall performance, and, as the latency of multiplication is improved, system tasks such as reading and writing have a significant impact on performance. The work introduced by Jarvinen, et al. uses two bit-parallel multipliers to perform multiplications concurrently. The multipliers have several registers in the critical path in order to operate at high clock frequencies, but the operations are not pipelined, resulting in long latencies (between 8 and 20 cycles). Hence, while this implementation offers high performance, it is not the fastest reported in the literature, but it is one of the largest. Rodriguez, et al. introduced an FPGA implementation that performs DBL and ADD in parallel, containing multiple instances of circuits to perform the arithmetic functions. It would appear that the inversion required for coordinate conversion at the end of the point multiplication is not performed, and that the quoted point multiplication time does not include the coordinate conversion, but the stated point multiplication time is one of the fastest in the literature. However, the complex structure of the multiplier has a long critical path, and as a result the overall performance is let down by quite a low clock frequency (46.5 MHz).
     More recently, Cheung, et al. presented a hardware design that uses a normal basis representation. The customizable hardware offers a trade between cost and performance by varying the level of parallelism through the number of multipliers and the level of pipelining. To the authors' knowledge, this is the fastest implementation in the literature, performing a point multiplication in approximately 55 µs, although once again the low clock frequency (43 MHz) limits the potential performance.

III. PIPELINED ASIP FOR ECC
A. Application-Specific Instruction Set and Algorithms
     Complex instruction set computers (CISCs) reduce the number of instructions, and consequently the overall latency, by performing multiple tasks in a single instruction. It was shown that complex instructions could be used to reduce latency in the design of an ASIP for ECC. Three new instructions were introduced, which will now be described and their use justified.
     Considering the projective-coordinate formula for point addition, an obvious instruction to combine operations is a multiply-and-accumulate instruction (MULAD), which will save two instruction executions. Also, the projective-coordinate formula for point doubling can be rewritten so that Z3 = (X1*Z1)^2, with X3 = X1^4 + b*Z1^4, so a multiply-and-square operation (MULSQ) would be beneficial, saving three instruction executions. These two extra instructions reduce the number of instruction executions required to perform the point-adding and point-doubling algorithms from 14 to 9, a 35.7% reduction.
     The Itoh-Tsujii inversion algorithm is based on Fermat's little theorem and is used to compute the inversion at the end of the point multiplication. The algorithm uses addition chains to reduce the required number of multiplications, but it contains exponentiations that must be performed through repeated squaring operations, as many as 64 repeated squaring operations for the field GF(2^163). It was shown that relatively low latencies can be achieved using this algorithm if repeated squaring can be performed efficiently. Hence, a third application-specific instruction will be used to perform repeated squaring (RESQR) in order to accelerate the Itoh-Tsujii inversion algorithm.
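To illustrate why a repeated-squaring instruction helps, the sketch below implements Itoh-Tsujii inversion on top of the gf_mul/gf_sqr helpers from Section II-B. The resqr helper stands in for the RESQR instruction (n back-to-back squarings); the addition chain is derived from the binary expansion of m-1 and is a standard choice, not necessarily the exact chain used by the processor.

```python
def resqr(a, n):
    # models RESQR: n chained squarings of a in GF(2^m)
    for _ in range(n):
        a = gf_sqr(a)
    return a

def itoh_tsujii_inv(a, m=163):
    """Itoh-Tsujii inversion: a^(-1) = (beta_{m-1})^2 with beta_k = a^(2^k - 1).
    beta_{m-1} is built with an addition chain on k, so only multiplications
    and (repeated) squarings are needed."""
    beta, k = a, 1                             # beta_1 = a
    for bit in bin(m - 1)[3:]:                 # bits of m-1 below the leading one
        beta = gf_mul(resqr(beta, k), beta)    # beta_{2k} = beta_k^(2^k) * beta_k
        k *= 2
        if bit == '1':
            beta = gf_mul(gf_sqr(beta), a)     # beta_{k+1} = beta_k^2 * a
            k += 1
    return gf_sqr(beta)                        # a^(-1) = (a^(2^(m-1) - 1))^2
```

For m = 163 this chain needs only 9 field multiplications but 162 squarings in total, several of them in long runs, which is exactly the pattern a RESQR instruction accelerates.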
     Using these instructions, a combined algorithm to perform point doubling and point addition can be developed, which has only nine arithmetic instructions, as shown in Algorithm 2.
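Algorithm 2 itself appears only as a figure in the original and is not reproduced here. To make the complex instructions concrete, the sketch below models one plausible reading of their semantics (MULAD as a*b + c, MULSQ as (a*b)^2), built on the GF(2^163) helpers from Section II-B, and notes where such instructions absorb plain MUL/SQR/ADD sequences in the ladder step of Section II-A; the exact operand grouping used by Algorithm 2 may differ.

```python
def mulad(a, b, c):
    # multiply-and-accumulate: one issue slot instead of a MUL followed by an ADD
    return gf_mul(a, b) ^ c

def mulsq(a, b):
    # multiply-and-square: one issue slot instead of a MUL followed by a SQR
    return gf_sqr(gf_mul(a, b))

# Examples of where the combined instructions fit in one ladder iteration
# (using the variable names of the Section II-A sketch):
#   Z1 = mulsq(X1, Z1)        # point doubling: Z1 <- (X1*Z1)^2 = X1^2 * Z1^2
#   X2 = mulad(x, Z2, T3)     # point addition: X2 <- x*Z2 + T1*T2, with T3 = T1*T2
```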
B. Pipelined Data Path
     A data path capable of performing the new application-specific instructions is required. Any of the approaches to implementing multiplication mentioned in the background section could be used, but because this work develops high-throughput hardware, a bit-parallel multiplier is used. The data path must be constructed such that the result of a multiplication can be squared or added to the previous result. Furthermore, feedback is required so that the result can be repeatedly squared. This functionality is implemented using the SQR/ADD block, described in Section III-D.
     To maintain high-throughput performance, high clock frequencies are required and a throughput of one
instruction per cycle or more is desirable. Therefore, the data path must be pipelined. The pipelined data path is shown in Fig. 1. The performance-critical component is the multiplier, so to achieve high clock frequencies we must subpipeline the multiplier.

Fig. 1. Pipelined ASIP data path.

C. Subpipelined Bit-Parallel Multiplier
     Some recent hardware implementations of elliptic-curve accelerators, in addition to some proposed architectures, have used bit-parallel multipliers. Numerous architectures for bit-parallel multiplication exist in the literature. However, very few actual implementations of bit-parallel multiplication have been presented for the large fields used in public-key cryptography. Where they have been implemented, poor performance has often been reported, either in terms of clock frequency or latency. This is not surprising given the deep sub-micrometer fabrics used in modern FPGA devices. The major component of delay is due to routing, and the number of interconnections in a bit-parallel multiplier implemented for ECC will be in the region of many tens of thousands or more, making routing very complex. Therefore, to improve routing complexity and to reduce the amount of logic in the critical path, a bit-parallel multiplier that is amenable to pipelining is desirable.
     The well-known Mastrovito multiplier is such an architecture; the reader is referred to the original work for full details. The multiplication C(x) = A(x)B(x) mod G(x) is expressed in matrix notation as c = Z b, where c and b are the coefficient vectors of C(x) and B(x) and Z = [f_i,j] is an m x m binary matrix. The Z matrix is referred to as the product matrix, and the functions f_i,j are linear functions of the components of A(x). The columns of Z represent the consecutive states of a Galois-type LFSR with the initial state given by the multiplicand A(x), i.e., the jth column is given by x^j * A(x) mod G(x). Therefore, the product matrix can be realized by cascading instances of the logic to perform a left shift modulo G(x). The circuit to perform this modulo shift is referred to as an alpha cell. The same notation will be used here.
     The product matrix can be divided into a number of smaller submatrices that calculate partial products, which give the final product when summed together. More formally, the product matrix can be divided into D submatrices, each of which contains at most d = ceil((m-1)/D) columns. Hence, the multiplier can be easily pipelined into D stages, each computing the sum of at most d+1 terms, which are the outputs of alpha cells and the input from the previous multiplication pipeline stage. As a result, the logic in the critical path is reduced and the routing is simplified. Note that the number of logic gates in the critical path of each stage will be similar to that of a digit-serial multiplier with an equivalent digit size, but the routing is simpler as no feedback is required.

Fig. 2. Pipeline stage of subpipelined bit-parallel multiplier.
     The resulting architecture of the ith pipeline stage, 0 ≤ i ≤ D-1, is shown in Fig. 2. When i = 0, the alpha input is A(x) and the sum input is A(x)*b_0; when i = D-1, the sum output is the multiplication result and the alpha output is left unconnected.
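The following sketch is a behavioural Python model of this organisation, assuming D stages and the alpha-cell/partial-sum structure described above: each stage weights a block of the bits of B(x) by consecutive alpha-cell outputs and adds the partial sum handed over by the previous stage. It models the data flow only, not the gate-level product matrix or the pipeline registers.

```python
def alpha_cell(a):
    # one left shift of A(x) modulo G(x): produces the next column x^j * A(x) mod G(x)
    a <<= 1
    if (a >> M) & 1:
        a ^= G
    return a

def subpipelined_mul(a, b, D=7):
    """Behavioural model of the D-stage subpipelined bit-parallel multiplier.
    Stage i sums the partial products for its block of operand bits and adds
    the sum arriving from stage i-1; stage D-1 outputs the final product."""
    d = (M + D - 1) // D                  # at most d columns handled per stage
    alpha, acc = a, 0                     # stage-0 alpha input is A(x)
    for i in range(D):                    # one loop iteration = one pipeline stage
        for j in range(i * d, min((i + 1) * d, M)):
            if (b >> j) & 1:
                acc ^= alpha              # add the partial product b_j * x^j * A(x)
            alpha = alpha_cell(alpha)     # advance to the next column
        # in hardware, (alpha, acc) would be registered here and passed on
    return acc                            # identical result to gf_mul(a, b)
```

Because reduction is folded into every alpha cell, each stage only XORs together at most d+1 m-bit words, which is what keeps the per-stage critical path close to that of a digit-serial multiplier while avoiding any feedback path.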
                                                          used for implementation: the older Virtex-E de- vice
D. SQR/ADD Block
     The extra functionality required for the new instructions MULAD, MULSQ, and RESQR is provided by the SQR/ADD block, which is appended to the multiplier output as shown in Fig. 1. A block diagram showing the functionality of the SQR/ADD block is shown in Fig. 3; note that the flip-flops shown in Fig. 3 are the pipeline flip-flops placed after the SQR/ADD block in the data path diagram shown in Fig. 1.
     Selection 0 on the MUX shown in Fig. 3 sets the data path to perform standard multiplication over GF(2^m). Selection 1 sets the data path to perform MULSQ. Selection 2 sets the data path to perform MULAD. Selection 3 sets the data path to perform RESQR; note that RESQR can be performed after any other operation, e.g., MULAD-RESQR is a valid instruction sequence.
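A small behavioural model of these selections is given below, using the gf_sqr helper from Section II-B; mult_out is the output of the bit-parallel multiplier and prev is the previously registered result. The mapping of select codes to operations follows the description above, but the exact gating in the real block is an assumption.

```python
def sqr_add_block(sel, mult_out, prev):
    """Behavioural model of the SQR/ADD block appended to the multiplier output.
    sel = 0: pass the product through              (plain multiplication)
    sel = 1: square the product                    (MULSQ)
    sel = 2: add the previously registered result  (MULAD)
    sel = 3: square the previously registered,     (RESQR; repeat to chain
             fed-back result                        squarings)"""
    if sel == 0:
        return mult_out
    if sel == 1:
        return gf_sqr(mult_out)
    if sel == 2:
        return mult_out ^ prev
    if sel == 3:
        return gf_sqr(prev)
    raise ValueError("sel must be 0..3")
```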
Fig. 3. SQR/ADD block.

E. Data Forwarding
     The optimal depth of a pipeline is a trade between clock frequency and instruction throughput. Ideally, one instruction per cycle (or more) will be performed, while sufficient pipelining to achieve the desired clock frequency is implemented. However, as the depth of the pipeline is increased, data dependencies can lead to pipeline stalls and even pipeline flushes, resulting in less than the ideal one instruction per cycle being performed.
     Data forwarding is a common technique in processor design, used to avoid pipeline stalls that occur due to data dependencies. Typically, data is forwarded to either input of the arithmetic logic unit (ALU) from all subsequent pipeline stages. The scheme used here differs slightly because such generality is not necessary: the order of the instructions and the instances when forwarding is required are known and do not change. Therefore, the control signals may be explicitly defined for each instruction, and, as can be seen in Fig. 3, the data path can be simplified because data forwarding to both ALU inputs from all subsequent stages is not required.

IV. FPGA IMPLEMENTATION
     The processor was implemented over the smallest field recommended by NIST [8], GF(2^163), using the irreducible polynomial G(x) = x^163 + x^7 + x^6 + x^3 + 1. Two FPGA devices were used for implementation: the older Virtex-E device (XCV2600E-FG1156-8), for fair comparison with the other architectures presented in the literature, and the more modern Virtex-4 (XC4VLX200-FF1513-11), to demonstrate the performance of the proposed ASIP with modern technology. Each component was implemented and optimized individually, particularly the multiplier, to examine the optimal pipeline depth before the complete processor was implemented. Aggressive optimization for speed was performed, including timing-driven packing and placement. No detailed floor planning was performed; only a few simple constraints were applied.

TABLE I: DECOMPOSITION OF POINT MULTIPLICATION TIME IN CYCLES

A. Register File
     Xilinx FPGA devices support three main types of storage element: flip-flops, block RAM, and distributed RAM. Block RAM is the dedicated memory resource of FPGA devices, which has different sizes and locations depending on the device being used. Distributed RAM uses the basic logic elements of the FPGA device, lookup tables (LUTs), to form RAMs of the desired size, function, and location, making it far more flexible. On Virtex-E devices, block RAMs are 4096 bits each, which increases to 18 kbits on the Virtex-4. Hence, the use of block RAM to store relatively small amounts of data is inefficient. The processor presented in this paper requires only 13 storage locations including all temporary storage, so distributed RAM was used. On Xilinx FPGA devices, 16 bits of storage can be obtained from a single LUT. However, as reading from two ports and writing to a third port is required independently and concurrently, a single dual-port distributed RAM is not sufficient. Therefore, the register file utilized in this design is formed by cross-connecting two dual-port distributed RAMs. This approach is still far more efficient than using block RAMs or flip-flops.
 required are known and do not change. Therefore, the
 control signals may be explicitly defined for each       B. Subpipelined Multiplier
 instruction, and, as can be seen in Fig. 3, the data                 The multiplier dominates the processor,
 path can be simplified because data forwarding to        both in terms of area (accounting for over 95% of
 both ALU inputs from all sub- sequent stages is not      the total) and clock frequency, as the critical path of
 required.


                                                                                                  416 | P a g e
A.Durga Bhavani, P.Soundarya Mala / International Journal of Engineering Research
                 and Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                    Vol. 2, Issue 5, September- October 2012, pp.412-419
the other components is considerably shorter than that of the multiplier. The area of a bit-parallel multiplier is O(m^2) and the critical path is O(log(m)). However, when implemented on an FPGA, these measures have less significance. Implementing a bit-parallel multiplier without a pipelining scheme will lead to low clock frequencies, due to the amount of logic in the critical path and the complex routing.
     The Mastrovito multiplier architecture was chosen not only because it is suitable for pipelining, but also because its regularity simplifies placement and, consequently, routing. Given the deep sub-micrometer fabrics used by FPGAs, the routing is the largest component of delay, and, as shown in Section III-C, the Mastrovito multiplier can be divided into very regular blocks to improve routing. When pipelined, the implemented architecture has a critical path similar to a digit-serial multiplier, but the routing is simplified because no feedback is required.
     The multiplier was implemented with several different depths of pipeline; the results are shown in Table II. When timing-driven FPGA placement is performed, the placer may be forced to heavily replicate certain areas of the circuit to meet timing goals, particularly if certain nets have high fan-outs. This can be seen in the implementation results of the multiplier, where the multiplier with three pipeline stages requires more resources than its four-stage equivalent, because heavy replication was performed to meet timing goals. In terms of area-time performance, it is clear that the four-stage multiplier offers superior performance, which is an important result when the overall performance of the processor is considered.
C. SQR/ADD Block
     By implementing the functionality of the SQR/ADD block as a single Boolean function, improved logic optimization can be achieved, because the hierarchy will be flattened, leading to resource sharing and improved logic minimization.

D. Control
     Typically, a general-purpose processor with a pipelined data path would use an instruction ROM for control. However, for applications such as this one, where the mode of operation is fixed and the control is relatively simple, a state machine is far more efficient, particularly in terms of area, and thus is the choice for this design.
     A Moore state machine is most suitable for the proposed architecture, as any extra area required is negligible compared to the overall area of the processor, and the improved clock frequency compared to a Mealy state machine is desirable. The key is stored in a special-purpose register, independent of the data path, and loaded through a separate channel. Thus, the state machine has no dependency on the data path, and the control signals can be registered to achieve the desired delay performance. The total resource usage for control is only 93 slices. However, if extra flexibility is required, an instruction ROM can be used with no performance penalty in terms of speed, though more area resources would be required.

V. PERFORMANCE ANALYSIS
     Let L be the depth of the pipeline; then the number of cycles required to perform a point multiplication (excluding data I/O and coordinate conversion) is given by

          Latency (cycles) = 2L + 3

     The latency in cycles of the point multiplication is decomposed in Table I. To ascertain the optimal pipeline depth for the processor, the bit-parallel multiplier was implemented with different pipeline depths. The results of these implementations (see Table II) and the corresponding number of cycles to perform a point multiplication (see Table I) can be used to estimate the point multiplication times for each pipeline depth, as the critical path of the processor is in the multiplier.
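As a small illustration of how the table entries combine, the helpers below convert a cycle count (as decomposed in Table I for a given pipeline depth) and a post-place-and-route clock frequency (Table II) into the point multiplication time and the throughput plotted in Fig. 4. The tables themselves are not reproduced here, so no concrete numbers are assumed.

```python
def point_mul_time_us(cycles, f_clk_mhz):
    # time for one point multiplication, in microseconds
    return cycles / f_clk_mhz

def point_mul_throughput(cycles, f_clk_mhz):
    # point multiplications per second, as plotted in Fig. 4
    return f_clk_mhz * 1e6 / cycles
```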
     Fig. 4 plots throughput (point multiplications per second) against area (FPGA slices) for each of the multiplier variations implemented. Note that the area figure shown is the area of the multiplier, not that of the complete processor, but the multiplier accounts for over 95% of the total area.

     It is clear that the relationship between pipeline depth and area-throughput performance is approximately linear (as we would expect from the previous equation), but the seven-stage pipeline offers superior performance. Therefore, this was the architecture implemented on both the Virtex-E and Virtex-4 devices, the results of which are detailed in Table II and summarized in Table III for comparison with the state of the art. It appears that none of the architectures in the comparison were floor planned, so for a fairer comparison the implementation of the proposed architecture was not floor planned either. It is important to note that, FPGA implementations being target specific, a detailed comparison of resource usage is not always straightforward, in particular when it is not clear whether the quoted figures are actual post-place-and-route implementation results, as in our case, or merely synthesis estimates.
     The proposed architecture compares very favorably with the state of the art, being smaller and considerably faster than similar works. The implementation of the proposed architecture is several times faster than the first three, lower-resource table entries, which are based on digit-serial multiplication. It is interesting to note that the proposed architecture achieved a higher clock frequency than all three, demonstrating that the simplified routing resulting from the pipelined architecture does indeed improve the critical path.
     Comparing against alternative high-speed architectures, the implementation of the proposed architecture performs a point multiplication in only 60.01% of the estimated performance time of the fastest alternative. The area resource of that work is not stated for this field, only the performance time, but the resource usage (slices) of the proposed architecture is significantly less than that of the other alternative high-speed architectures. The improvements over the implementation from Jarvinen, et al. are due once more to the reduced area requirements, but also to the reduced latency resulting from pipelining and the complex instructions. The deeper seven-stage pipelining has increased the clock frequency of the architecture without prohibitively increasing the latency, which leads to the fastest point multiplication time reported in the literature. Hence, the use of an application-specific instruction set in conjunction with pipelining has better exploited operation parallelism compared with duplicating arithmetic circuits, as proposed in other works.
     Comparing against our recently reported pipelined ASIP architecture, a 10.12% improvement in point multiplication time was attained, while the resource usage was increased by only around 2%. The improvement comes from the new combined algorithm to perform DBL and ADD, which requires fewer instructions and so reduces the latency, and from the increased pipeline depth, which increases the clock frequency.
VI. CONCLUSION
     A high-performance ECC processor has been implemented using FPGA technology. A combined algorithm to perform DBL and ADD was developed, based on the Lopez-Dahab point multiplication algorithm. The data path was pipelined, allowing operation parallelism to be exploited and the point multiplication time to be reduced. Consequently, an implementation with a four-stage pipeline achieved a fast point multiplication on a Xilinx Virtex-4 device, making it the fastest reported implementation. This work has confirmed the suitability of a pipelined data path together with an efficient Galois field multiplier over GF(2^163); the resulting ECC implementation over GF(2^163) provides better security for a smaller key size.

REFERENCES
 [1]  J. Lutz and A. Hasan, "High performance FPGA based elliptic curve cryptographic co-processor," in Proc. Int. Conf. Inf. Technol.: Coding Comput. (ITCC), 2004, p. 486.
 [2]  F. Rodriguez-Henriquez, N. A. Saqib, and A. Diaz-Perez, "A fast parallel implementation of elliptic curve point multiplication over GF(2^m)," Microprocessors Microsyst., vol. 28, pp. 329-339, 2004.
 [3]  N. Mentens, S. B. Ors, and B. Preneel, "An FPGA implementation of an elliptic curve processor over GF(2^m)," presented at the 14th ACM Great Lakes Symp. VLSI, Boston, MA, 2004.
 [4]  J. Lopez and R. Dahab, "Fast multiplication on elliptic curves over GF(2^m) without precomputation," presented at the Workshop on Cryptographic Hardware and Embedded Syst. (CHES), Worcester, MA, 1999.
 [5]  G. Seroussi and N. P. Smart, Elliptic Curves in Cryptography. Cambridge, U.K.: Cambridge Univ. Press, 1999.
 [6]  C. H. Kim and C. P. Hong, "High-speed division architecture for GF(2^m)," Electron. Lett., vol. 38, pp. 835-836, 2002.
 [7]  W. Chelton and M. Benaissa, "High-speed pipelined ECC processor over GF(2^m)," presented at the IEEE Workshop Signal Process. Syst. (SiPS), Banff, Canada, 2006.
 [8]  G. Seroussi and N. P. Smart, Elliptic Curves in Cryptography. Cambridge, U.K.: Cambridge Univ. Press, 1999.
 [9]  P. L. Montgomery, "Speeding the Pollard and elliptic curve methods of factorization," Math. Comput., vol. 48, pp. 243-264, 1987.
 [10] Z. Yan and D. V. Sarwate, "New systolic architectures for inversion and division in GF(2^m)," IEEE Trans. Comput., vol. 52, no. 11, pp. 1514-1519, Nov. 2003.
 [11] C. H. Kim and C. P. Hong, "High-speed division architecture for GF(2^m)," Electron. Lett., vol. 38, pp. 835-836, 2002.
 [12] T. Itoh and S. Tsujii, "A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases," Inf. Computation, vol. 78, pp. 171-177, 1988.
 [13] P. A. Scott, S. E. Tavares, and L. E. Peppard, "A fast VLSI multiplier for GF(2^m)," IEEE J. Sel. Areas Commun., vol. 4, no. 1.
 [14] H. Wu, "Bit-parallel finite field multiplier and squarer using polynomial basis," IEEE Trans. Comput., vol. 51, no. 7, pp. 750-758, Jul. 2002.
 [15] A. Reyhani-Masoleh and M. A. Hasan, "Low complexity bit parallel architectures for polynomial basis multiplication over GF(2^m)," IEEE Trans. Comput., vol. 53, no. 8, pp. 945-959, Aug. 2004.
 [16] L. Song and K. K. Parhi, "Low-energy digit-serial/parallel finite field multipliers," J. VLSI Signal Process. Syst., vol. 19, pp. 149-166, 1998.
 [17] M. C. Mekhallalati, A. S. Ashur, and M. K. Ibrahim, "Novel radix finite field multiplier for GF(2^m)," J. VLSI Signal Process., vol. 15, pp. 233-245, 1997.
 [18] P. K. Mishra, "Pipelined computation of scalar multiplication in elliptic curve cryptosystems," presented at the CHES, Cambridge, MA, 2004.

 
Implementation of Radix-4 Booth Multiplier by VHDL
Implementation of Radix-4 Booth Multiplier by VHDLImplementation of Radix-4 Booth Multiplier by VHDL
Implementation of Radix-4 Booth Multiplier by VHDL
 
Implementation of Rotation and Vectoring-Mode Reconfigurable CORDIC
Implementation of Rotation and Vectoring-Mode Reconfigurable CORDICImplementation of Rotation and Vectoring-Mode Reconfigurable CORDIC
Implementation of Rotation and Vectoring-Mode Reconfigurable CORDIC
 
MF-RALU: design of an efficient multi-functional reversible arithmetic and l...
MF-RALU: design of an efficient multi-functional reversible  arithmetic and l...MF-RALU: design of an efficient multi-functional reversible  arithmetic and l...
MF-RALU: design of an efficient multi-functional reversible arithmetic and l...
 
DUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORM
DUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORMDUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORM
DUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORM
 
186 devlin p-poster(2)
186 devlin p-poster(2)186 devlin p-poster(2)
186 devlin p-poster(2)
 
A LIGHT WEIGHT VLSI FRAME WORK FOR HIGHT CIPHER ON FPGA
A LIGHT WEIGHT VLSI FRAME WORK FOR HIGHT CIPHER ON FPGAA LIGHT WEIGHT VLSI FRAME WORK FOR HIGHT CIPHER ON FPGA
A LIGHT WEIGHT VLSI FRAME WORK FOR HIGHT CIPHER ON FPGA
 
FPGA based Efficient Interpolator design using DALUT Algorithm
FPGA based Efficient Interpolator design using DALUT AlgorithmFPGA based Efficient Interpolator design using DALUT Algorithm
FPGA based Efficient Interpolator design using DALUT Algorithm
 
FPGA based Efficient Interpolator design using DALUT Algorithm
FPGA based Efficient Interpolator design using DALUT AlgorithmFPGA based Efficient Interpolator design using DALUT Algorithm
FPGA based Efficient Interpolator design using DALUT Algorithm
 
IJET-V3I1P14
IJET-V3I1P14IJET-V3I1P14
IJET-V3I1P14
 
LOGIC OPTIMIZATION USING TECHNOLOGY INDEPENDENT MUX BASED ADDERS IN FPGA
LOGIC OPTIMIZATION USING TECHNOLOGY INDEPENDENT MUX BASED ADDERS IN FPGALOGIC OPTIMIZATION USING TECHNOLOGY INDEPENDENT MUX BASED ADDERS IN FPGA
LOGIC OPTIMIZATION USING TECHNOLOGY INDEPENDENT MUX BASED ADDERS IN FPGA
 

Bs25412419

agreement and digital signature algorithms, and a list of secure curves is given by the U.S. Government National Institute of Standards and Technology (NIST).

Intuitively, there are numerous advantages of using field-programmable gate-array (FPGA) technology to implement in hardware the computationally intensive operations needed for ECC. Indeed, these advantages are comprehensively studied and listed by Wollinger et al. In particular, performance, cost efficiency, and the ability to easily update the cryptographic algorithm in fielded devices are very attractive for hardware implementations of ECC. Numerous ECC hardware accelerators and cryptographic processors have been presented in the literature. More recently, these have included a number of FPGA architectures, which present acceleration techniques to improve the performance of the ECC operations. The optimization goal is usually to reduce the latency of a point multiplication in terms of the number of required cycles.

In Section II, background is given on elliptic curve operations, Galois field arithmetic, and state-of-the-art hardware implementations. Section III develops the ASIP architecture, beginning with the application-specific instructions and a new combined algorithm to perform point doubling and point addition (the crux ECC operations). In Section IV, the FPGA implementation of the processor is described, and the performance is analyzed and compared with the state of the art in Section V. The paper is concluded in Section VI.

II. BACKGROUND

A. ECC
ECC is performed over one of two underlying Galois fields: prime order fields or characteristic two fields. Both fields are considered to provide the same level of security, but arithmetic in GF(2^m) will be the focus of this paper because it can be implemented in hardware more efficiently using modulo-2 arithmetic.
An elliptic curve over the field is the set of solutions to the equation

    y^2 + xy = x^3 + ax^2 + b    (1)

where a, b ∈ GF(2^m) and b ≠ 0. Hence, when the two points being combined are equal (P1 = P2) we have the point-doubling operation (DBL), and when they differ (P1 ≠ P2) we have the point-adding operation (ADD). These operations in turn constitute the crux of any ECC-based algorithm, known as point multiplication or scalar multiplication. Due to the computational expense of inversion compared to multiplication, several projective coordinate methods have been proposed, which use fractional field arithmetic to defer the inversion operation until the end of the point multiplication. In this paper, the high-performance, generic projective-coordinate algorithm proposed by Lopez and Dahab is used, which is an efficient implementation of Montgomery's method for computing kP. No precomputations or special field/curve properties are required.

In this, procedures to perform DBL and ADD are derived from efficient formulas which use only the x-coordinate of the points. In the projective-coordinate version of the formulas, the x-coordinate of Pi is represented by Xi/Zi, for i ∈ {1, 2, 3}. The corresponding DBL and ADD computations are used in the projective-coordinate point multiplication algorithm shown in Algorithm 1.

B. Arithmetic Over GF(2^163)
Addition and subtraction in the Galois field GF(2^m) are equivalent, performed by modulo-2 addition, i.e., a bit-XOR operation. As a result, arithmetic in the field is implemented more efficiently because it is carry free.

Inversion is the most computationally expensive operation in the field, based either on the extended Euclidean algorithm (EEA) or Fermat's little theorem. Many efficient EEA-based inversion and division architectures exist in the literature, but they are usually expensive in terms of area. Inversion using Fermat's little theorem, as in the Itoh-Tsujii algorithm, consists of multiplication and squaring only, so it can often be implemented without additional hardware resources. It has been shown that similar performance to an EEA-based inverter can be achieved using the Itoh-Tsujii algorithm if the multiplier latency is sufficiently low and repeated squaring can be performed efficiently. Hence, multiplication is considered to be the most resource-sensitive operation in the field.

Algorithm 1: Lopez-Dahab Point Multiplication

Multipliers are usually implemented using one of three approaches: bit-serial, bit-parallel, or digit-serial. Bit-serial multipliers are small, but they require m steps to perform a multiplication. Bit-parallel multipliers are large but perform a multiplication in a single step. Digit-serial multipliers are most commonly used in cryptographic hardware, as performance can be traded against area. Recently, in a drive for increased speed, some methods have been proposed, and implementations presented, for ECC hardware based on bit-parallel multipliers. Squaring is a special case of multiplication, and it is only a little more complex than reduction modulo the irreducible polynomial; efficient polynomial-basis squaring architectures exist that require only combinational logic.

C. Previous Work
Several recent FPGA-based hardware implementations of ECC have achieved high-performance throughput. Various acceleration techniques have been used, usually based on parallelism or precomputation. The work introduced by Orlando and Paar [2] is for ECC based on the Montgomery method for computing kP developed by Lopez and Dahab, and it operates over a single field. A point multiplication over GF(2^167) is performed in 210 microseconds, using a Galois field multiplier with an eleven-cycle latency. This provides an excellent benchmark for all high-performance architectures.
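The binary-field arithmetic just described maps directly onto simple bit operations. The following minimal Python sketch (software only, not the paper's hardware, and using the NIST irreducible polynomial G(x) = x^163 + x^7 + x^6 + x^3 + 1 that Section IV later specifies) illustrates carry-free addition, multiplication with reduction, squaring, and Itoh-Tsujii inversion built from multiplications and repeated squarings; all function names here are illustrative.

    # GF(2^163) arithmetic sketch: elements are Python ints whose bits are
    # polynomial coefficients.  G encodes x^163 + x^7 + x^6 + x^3 + 1.
    M = 163
    G = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

    def gf_add(a, b):
        return a ^ b                      # addition/subtraction is a carry-free XOR

    def gf_reduce(r):
        for i in range(2 * M - 2, M - 1, -1):   # clear bits above degree m-1
            if r >> i & 1:
                r ^= G << (i - M)
        return r

    def gf_mul(a, b):
        r = 0
        while b:                          # carry-less schoolbook multiplication
            if b & 1:
                r ^= a
            a, b = a << 1, b >> 1
        return gf_reduce(r)

    def gf_sqr(a):
        return gf_mul(a, a)               # squaring as a special case of multiplication

    def gf_inv(a):
        # Itoh-Tsujii inversion: a^(-1) = (a^(2^(m-1) - 1))^2 by Fermat's little
        # theorem; beta(k) = a^(2^k - 1) is built with a few multiplications plus
        # runs of repeated squaring (the operation a dedicated repeated-squaring
        # instruction can accelerate, as Section III discusses later).
        def beta(k):
            if k == 1:
                return a
            t = b_half = beta(k // 2)
            for _ in range(k // 2):
                t = gf_sqr(t)             # repeated squaring
            t = gf_mul(t, b_half)
            if k % 2:
                t = gf_mul(gf_sqr(t), a)
            return t
        return gf_sqr(beta(M - 1))

    if __name__ == "__main__":
        x = 0x1B2C3D
        assert gf_mul(x, gf_inv(x)) == 1  # sanity check of the field arithmetic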
Acceleration techniques based on precomputation can be particularly effective for a class of curves known as anomalous binary curves (ABCs) or Koblitz curves. The implementation from Lutz and Hasan implements some of these techniques, achieving a point-multiplication time of 75 µs using a Galois field multiplier with a four-cycle latency. However, efforts here will be concentrated on high-performance architectures for generic curves, where such acceleration techniques cannot be used. An important result of that work, relevant regardless of the point-multiplication algorithm used, is that efficiently performing repeated squaring operations greatly reduces the cost of the multiplicative inversion which is required at the end of the point multiplication.

An ECC processor capable of operating over multiple Galois fields was presented by Gura et al., which performs a point multiplication over GF(2^163) in 143 µs, using a Galois field multiplier with a three-cycle latency. Gura et al. stated two important conclusions: the efficient implementation of inversion has a significant impact on overall performance and, as the latency of multiplication is improved, system tasks such as reading and writing have a significant impact on performance. The work introduced by Jarvinen et al. uses two bit-parallel multipliers to perform multiplications concurrently. The multipliers have several registers in the critical path in order to operate at high clock frequencies, but the operations are not pipelined, resulting in long latencies (between 8 and 20 cycles). Hence, while this implementation offers high performance, it is not the fastest reported in the literature, but it is one of the largest. Rodriguez et al. introduced an FPGA implementation that performs DBL and ADD in parallel, containing multiple instances of circuits to perform the arithmetic functions. It would appear that the inversion required for coordinate conversion at the end of the point multiplication is not performed, and that the quoted point multiplication time does not include the coordinate conversion, but the stated point multiplication time is one of the fastest in the literature. However, the complex structure of the multiplier has a long critical path, and as a result the overall performance is let down by quite a low clock frequency (46.5 MHz).

More recently, Cheung et al. presented a hardware design that uses a normal basis representation. The customizable hardware offers a trade between cost and performance by varying the level of parallelism through the number of multipliers and the level of pipelining. To the authors' knowledge, this is the fastest implementation in the literature, performing a point multiplication in approximately 55 µs, although once again the low clock frequency (43 MHz) limits the potential performance.

III. PIPELINED ASIP FOR ECC

A. Application-Specific Instruction Set and Algorithms
Complex instruction set computers (CISCs) reduce the number of instructions, and consequently the overall latency, by performing multiple tasks in a single instruction. It was shown that complex instructions could be used to reduce latency in the design of an ASIP for ECC. Three new instructions were introduced, which will now be described and their use justified.

Considering the projective-coordinate formula for point addition, an obvious instruction to combine operations is a multiply-and-accumulate instruction (MULAD), which will save two instruction executions. Also, rewriting the projective-coordinate formula for point doubling shows that a multiply-and-square operation (MULSQ) would be beneficial, saving three instruction executions. These two extra instructions reduce the number of instruction executions required to perform the point-adding and point-doubling algorithms from 14 to 9, a 35.7% reduction.

The Itoh-Tsujii inversion algorithm is based on Fermat's little theorem and is used to compute the inversion at the end of the point multiplication. The algorithm uses addition chains to reduce the required number of multiplications, but it contains exponentiations that must be performed through repeated squaring operations, as many as 64 repeated squarings for the field GF(2^163). It was shown that relatively low latencies can be achieved using this algorithm if repeated squaring can be performed efficiently. Hence, a third application-specific instruction will be used to perform repeated squaring (RESQR) in order to accelerate the Itoh-Tsujii inversion algorithm.

Using these instructions, a combined algorithm to perform point doubling and point addition can be developed, which has only nine arithmetic instructions, shown in Algorithm 2.

B. Pipelined Data Path
A data path capable of performing the new application-specific instructions is required. Any of the approaches to implementing multiplication mentioned in the background section could be used, but because this work develops high-throughput hardware, a bit-parallel multiplier is used. The data path must be constructed such that the result of a multiplication can be squared or added to the previous result. Furthermore, feedback is required so that the result can be repeatedly squared. This functionality is implemented using the SQR/ADD block, described in Section III-D.
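To make the data flow of the projective DBL and ADD steps concrete, the sketch below shows one iteration of a Montgomery ladder using the standard Lopez-Dahab x-coordinate-only formulas. It is a software illustration only, not the paper's Algorithm 2: the nine-instruction combined algorithm fuses exactly these multiply, square, and accumulate patterns into the MULAD, MULSQ, and RESQR operations. The gf_mul/gf_sqr helpers repeat the earlier field-arithmetic sketch, and xP (base-point x-coordinate), b (curve parameter), and the omitted key-bit handling are assumptions of this sketch.

    # One Montgomery-ladder iteration with Lopez-Dahab projective coordinates.
    # The key-bit-dependent swap of the two running point pairs is omitted here.
    M, G = 163, (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

    def gf_mul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a, b = a << 1, b >> 1
        for i in range(2 * M - 2, M - 1, -1):     # reduction modulo G(x)
            if r >> i & 1:
                r ^= G << (i - M)
        return r

    def gf_sqr(a):
        return gf_mul(a, a)

    def ladder_step(X1, Z1, X2, Z2, xP, b):
        # ADD: (X1, Z1) + (X2, Z2), using only x-coordinates.
        T, U = gf_mul(X1, Z2), gf_mul(X2, Z1)
        Zadd = gf_sqr(T ^ U)                      # Z <- (X1*Z2 + X2*Z1)^2
        Xadd = gf_mul(xP, Zadd) ^ gf_mul(T, U)    # X <- xP*Z + (X1*Z2)*(X2*Z1)
        # DBL: 2 * (X2, Z2).
        X2s, Z2s = gf_sqr(X2), gf_sqr(Z2)
        Zdbl = gf_mul(X2s, Z2s)                   # Z <- X2^2 * Z2^2
        Xdbl = gf_sqr(X2s) ^ gf_mul(b, gf_sqr(Z2s))   # X <- X2^4 + b*Z2^4
        return (Xadd, Zadd), (Xdbl, Zdbl)

Counting the field operations in this sketch gives six multiplications, five squarings, and three additions, fourteen in all, which is consistent with the instruction count quoted above before the MULAD and MULSQ fusion reduces it to nine.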
To maintain high-throughput performance, high clock frequencies are required and one instruction per cycle or more is desirable. Therefore, the data path must be pipelined. The pipelined data path is shown in Fig. 1. The performance-critical component is the multiplier, so to achieve high clock frequencies, we must subpipeline the multiplier.

Fig. 1. Pipelined ASIP data path.

C. Subpipelined Bit-Parallel Multiplier
Some recent hardware implementations of elliptic-curve accelerators, in addition to some proposed architectures, have used bit-parallel multipliers. Numerous architectures for bit-parallel multiplication exist in the literature. However, very few actual implementations of bit-parallel multiplication have been presented for the large fields used in public-key cryptography. Where they have been implemented, poor performance has often been reported either in terms of clock frequency or latency. This is not surprising given the deep submicrometer fabrics used in modern FPGA devices. The major component of delay is due to routing, and the number of interconnections in a bit-parallel multiplier implemented for ECC will be in the region of many tens of thousands or more, making routing very complex. Therefore, to improve routing complexity and to reduce the amount of logic in the critical path, a bit-parallel multiplier that is amenable to pipelining is desirable.

The well-known Mastrovito multiplier is such an architecture; the reader is referred to the original work for full details. The multiplication C(x) = A(x)B(x) mod G(x) is expressed in matrix notation, where the Z matrix is referred to as the product matrix and the functions f_i,j are linear functions of the components of A(x). The columns of Z represent the consecutive states of a Galois-type LFSR with the initial state given by the multiplicand A(x), i.e., the jth column is given by x^j A(x) mod G(x). Therefore, the product matrix can be realized by cascading instances of the logic to perform a left shift modulo G(x). The circuit performing this modulo shift is referred to as an alpha cell, and the same notation is used here. The product matrix can be divided into a number of smaller submatrices that calculate the partial products, which give the final product when summed together.

Fig. 2. Pipeline stage of the subpipelined bit-parallel multiplier.

More formally, the product matrix can be divided into D submatrices, each containing at most d = ⌈(m-1)/D⌉ columns. Hence, the multiplier can easily be pipelined into D stages, each computing the sum of at most d+1 terms, which are the outputs of alpha cells and the input from the previous multiplication pipeline stage. As a result, the logic in the critical path is reduced and the routing is simplified. Note that the number of logic gates in the critical path of each stage will be similar to that of a digit-serial multiplier with an equivalent digit size, but the routing is simpler as no feedback is required.
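The product-matrix view of the Mastrovito multiplier can be mimicked in software: an alpha cell is one left shift modulo G(x), and grouping the columns into D batches mirrors how each pipeline stage sums at most d+1 terms. The function below is a behavioural sketch under those assumptions; the names alpha and mastrovito_mul, and the default D, are illustrative and are not taken from the paper's RTL.

    M, G = 163, (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
    MASK = (1 << M) - 1

    def alpha(a):
        # One alpha cell: multiply by x and reduce modulo G(x) (a Galois-LFSR step).
        a <<= 1
        if a >> M & 1:
            a ^= G
        return a & MASK

    def mastrovito_mul(a, b, D=4):
        # Column j of the product matrix is x^j * A(x) mod G(x); XOR-ing the
        # columns selected by the bits of B(x) yields C(x) = A(x)B(x) mod G(x).
        # The columns are processed in D groups, imitating D pipeline stages
        # that each add their partial sum to the result of the previous stage.
        d = -(-M // D)                     # roughly ceil(m/D) columns per group
        col, acc = a, 0
        for stage in range(D):
            for j in range(stage * d, min((stage + 1) * d, M)):
                if b >> j & 1:
                    acc ^= col
                col = alpha(col)
        return acc

    # The behavioural model agrees with plain modular multiplication, e.g.:
    assert mastrovito_mul(0x5, 0x3) == 0xF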
The resulting architecture of the ith pipeline stage, 0 ≤ i ≤ D-1, is shown in Fig. 2. When i = 0, the alpha input is A(x) and the sum input is A(x)b0; when i = D-1, the sum output is the multiplication result and the alpha output is left unconnected.

D. SQR/ADD Block
The extra functionality required for the new instructions MULAD, MULSQ, and RESQ is provided by the SQR/ADD block, which is appended to the multiplier output as shown in Fig. 1. A block diagram showing the functionality of the SQR/ADD block is given in Fig. 3; note that the flip-flops shown in Fig. 3 are the pipeline flip-flops placed after the SQR/ADD block in the data path diagram of Fig. 1.

Selection 0 on the MUX shown in Fig. 3 sets the data path to perform standard multiplication over GF(2^m). Selection 1 sets the data path to perform MULSQ. Selection 2 sets the data path to perform MULAD. Selection 3 sets the data path to perform RESQ; note that RESQ can be performed after any other operation, e.g., MULAD-RESQ is a valid instruction sequence.

Fig. 3. SQR/ADD block.

E. Data Forwarding
The optimal depth of a pipeline is a trade between clock frequency and instruction throughput. Ideally, one instruction per cycle (or more) will be performed, while sufficient pipelining to achieve the desired clock frequency is implemented. However, as the depth of the pipeline is increased, data dependencies can lead to pipeline stalls and even pipeline flushes, resulting in less than the ideal one instruction per cycle being performed.

Data forwarding is a common technique in processor design, used to avoid the pipeline stalls that occur due to data dependencies. Typically, data is forwarded to either input of the arithmetic logic unit (ALU) from all subsequent pipeline stages. The scheme used here differs slightly because such generality is not necessary: the order of the instructions and the instances when forwarding is required are known and do not change. Therefore, the control signals may be explicitly defined for each instruction, and, as can be seen in Fig. 3, the data path can be simplified because data forwarding to both ALU inputs from all subsequent stages is not required.
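Functionally, the four MUX selections of the SQR/ADD block described in Section III-D above can be modelled as a small post-processing step on the multiplier output. The sketch below is an interpretation of that description (Fig. 3 itself is not reproduced in this text), so the exact feedback wiring is an assumption; gf_sqr is re-sketched from the earlier field-arithmetic example.

    M, G = 163, (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

    def gf_sqr(a):
        r = 0
        for i in range(M):                      # square by spreading the bits out
            if a >> i & 1:
                r ^= 1 << (2 * i)
        for i in range(2 * M - 2, M - 1, -1):   # then reduce modulo G(x)
            if r >> i & 1:
                r ^= G << (i - M)
        return r

    def sqr_add_block(mult_out, prev_result, sel):
        # sel 0: pass the GF(2^m) product through unchanged (plain multiply).
        # sel 1: MULSQ - square the product.
        # sel 2: MULAD - add (XOR) the previous result to the product.
        # sel 3: RESQ  - square the fed-back previous result (repeated squaring).
        if sel == 0:
            return mult_out
        if sel == 1:
            return gf_sqr(mult_out)
        if sel == 2:
            return mult_out ^ prev_result
        if sel == 3:
            return gf_sqr(prev_result)
        raise ValueError("MUX selection must be 0-3")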
IV. FPGA IMPLEMENTATION
The processor was implemented over the smallest field recommended by NIST [8], GF(2^163), using the given irreducible polynomial G(x) = x^163 + x^7 + x^6 + x^3 + 1. Two FPGA devices were used for implementation: the older Virtex-E device (XCV2600E-FG1156-8), for fair comparison with the other architectures presented in the literature, and the more modern Virtex-4 (XC4VLX200-FF1513-11), to demonstrate the performance of the proposed ASIP with modern technology. Each component was implemented and optimized individually, particularly the multiplier, to examine the optimal pipeline depth before the complete processor was implemented. Aggressive optimization for speed was performed, including timing-driven packing and placement. No detailed floor planning was performed; only a few simple constraints were applied.

A. Register File
Xilinx FPGA devices support three main types of storage element: flip-flops, block RAM, and distributed RAM. Block RAM is the dedicated memory resource of FPGA devices, which has different sizes and locations depending on the device being used.

TABLE I. DECOMPOSITION OF POINT MULTIPLICATION TIME IN CYCLES

Distributed RAM uses the basic logic elements of the FPGA device, lookup tables (LUTs), to form RAMs of the desired size, function, and location, making it far more flexible. On Virtex-E devices, block RAMs are 4096 bits each, which increases to 18 kbits on the Virtex-4, so the use of block RAM to store relatively small amounts of data is inefficient. The processor presented in this paper requires only 13 storage locations including all temporary storage, so distributed RAM was used. On Xilinx FPGA devices, 16 bits of storage can be gained from a single LUT. However, as reading from two ports and writing to a third port is required independently and concurrently, a single dual-port distributed RAM is not sufficient. Therefore, the register file utilized in this design is formed by cross-connecting two dual-port distributed RAMs. This approach is still far more efficient than using block RAMs or flip-flops.

B. Subpipelined Multiplier
The multiplier dominates the processor, both in terms of area (accounting for over 95% of the total) and clock frequency, as the critical path of the other components is considerably shorter than that of the multiplier. The area of a bit-parallel multiplier is O(m^2) and the critical path is O(log m). However, when implemented on an FPGA, these measures have less significance. Implementing a bit-parallel multiplier without a pipelining scheme will lead to low clock frequencies, due to the amount of logic in the critical path and the complex routing.

The Mastrovito multiplier architecture was chosen not only because it is suitable for pipelining, but also because its regularity simplifies placement and, consequently, routing. Given the deep submicrometer fabrics used by FPGAs, routing is the largest component of delay, and, as shown in Section III-C, the Mastrovito multiplier can be divided into very regular blocks to improve routing. When pipelined, the implemented architecture has a critical path similar to that of a digit-serial multiplier, but the routing is simplified because no feedback is required.

The multiplier was implemented with several different depths of pipeline; the results are shown in Table II. When timing-driven FPGA placement is performed, the placer may be forced to heavily replicate certain areas of the circuit to meet timing goals, particularly if certain nets have high fan-outs. This can be seen in the implementation results of the multiplier, where the multiplier with three pipeline stages requires more resources than its four-stage equivalent because heavy replication was performed to meet timing goals. In terms of area-time performance, it is clear that the four-stage multiplier offers superior performance, which is an important result when the overall performance of the processor is considered.

C. SQR/ADD Block
By implementing the functionality as a single Boolean function, improved logic optimization can be achieved, because the hierarchy will be flattened, leading to resource sharing and improved logic minimization.

D. Control
Typically, a general-purpose processor with a pipelined data path would use an instruction ROM for control. However, for applications such as this one, where the mode of operation is fixed and the control is relatively simple, a state machine is far more efficient, particularly in terms of area, and thus is the choice for this design. A Moore state machine is most suitable for the proposed architecture, as any extra area required is negligible compared to the overall area of the processor, and the improved clock frequency compared to a Mealy state machine is desirable. The key is stored in a special-purpose register, independent of the data path, and loaded through a separate channel. Thus, the state machine has no dependency on the data path, and the control signals can be registered to achieve the desired delay performance. The total resource usage for control is only 93 slices. However, if extra flexibility is required, an instruction ROM can be used with no performance penalty in terms of speed, though more area resources would be required.

V. PERFORMANCE ANALYSIS
Let L be the depth of the pipeline; then the number of cycles required to perform a point multiplication (excluding data I/O and coordinate conversion) is given by

    latency (cycles) = 2L + 3

The latency in cycles of the point multiplication is decomposed in Table I. To ascertain the optimal pipeline depth for the processor, the bit-parallel multiplier was implemented with different pipeline depths. The results of these implementations (see Table II) and the corresponding number of cycles to perform a point multiplication (see Table I) can be used to estimate the point multiplication times for each pipeline depth, as the critical path of the processor is in the multiplier. Fig. 4 plots throughput (point multiplications per second) against area (FPGA slices) for each of the multiplier variations implemented. Note that the area figure shown is the area of the multiplier, not that of the complete processor, but the multiplier accounts for over 95% of the total area.
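The performance estimate described here is simply the cycle count divided by the achieved clock frequency. As a worked illustration with placeholder numbers only (the real cycle counts and clock frequencies are those of Tables I and II, which are not reproduced in this text), the calculation looks as follows.

    def point_mult_performance(f_clk_mhz, cycles):
        # Time per point multiplication (microseconds) and throughput
        # (point multiplications per second) for a given clock and cycle count.
        time_us = cycles / f_clk_mhz
        return time_us, 1e6 / time_us

    # Placeholder figures, chosen only to show the shape of the trade-off:
    for depth, f_mhz, cycles in [(4, 120.0, 3500), (7, 180.0, 3800)]:
        t_us, per_sec = point_mult_performance(f_mhz, cycles)
        print(f"pipeline depth {depth}: about {t_us:.1f} us per point multiplication "
              f"({per_sec:.0f} per second)")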
It is clear that the relationship between pipeline depth and area-throughput performance is approximately linear (as we would expect from the previous equation), but the seven-stage pipeline offers superior performance. Therefore, this was the architecture implemented on both the Virtex-E and Virtex-4 devices, the results of which are detailed in Table II and summarized in Table III for comparison with the state of the art. It appears that none of the architectures in the comparison were floor planned, so for a fairer comparison the implementation of the proposed architecture was not floor planned either. It is important to note that, FPGA implementations being target specific, a detailed comparison of resource usage is not always straightforward, in particular when it is not clear whether the quoted figures are actual post-place-and-route implementation results, as in our case, or merely synthesis estimates.

The proposed architecture compares very favorably with the state of the art, being smaller and considerably faster than similar works. The implementation of the proposed architecture is several times faster than the first three, lower-resource table entries, which are based on digit-serial multiplication. It is interesting to note that the proposed architecture achieved a higher clock frequency than all three, demonstrating that the simplified routing resulting from the pipelined architecture does indeed improve the critical path.

Comparing against alternative high-speed architectures, the implementation of the proposed architecture performs a point multiplication in only 60.01% of the estimated performance time of the fastest alternative. The area resource of that design is not stated for this field, only the performance time, but the resource usage (slices) of the proposed architecture is significantly less than that of the other alternative high-speed architectures. The improvements over the implementation from Jarvinen et al. are due once more to the reduced area requirements, but also to the reduced latency resulting from pipelining and the complex instructions. The further seventh-stage pipelining has increased the clock frequency of the architecture without prohibitively increasing the latency, which leads to the fastest point multiplication time reported in the literature. Hence, the use of an application-specific instruction set in conjunction with pipelining has better exploited operation parallelism compared with duplicating arithmetic circuits, as proposed in other works.

Comparing against our recently reported pipelined ASIP architecture, a 10.12% improvement in point multiplication time was attained, while the resource usage was only increased by around 2%. The improvement comes as a result of the new combined algorithm to perform DBL and ADD, in which fewer instructions are required, reducing the latency, and of the increased pipeline depth, which increases the clock frequency.
VI. CONCLUSION
A high-performance ECC processor has been implemented using FPGA technology. A combined algorithm to perform DBL and ADD was developed based on the Lopez-Dahab point multiplication algorithm. The data path was pipelined, allowing operation parallelism to be exploited and the point multiplication to be performed in less time. Consequently, an implementation with a four-stage pipeline achieved a point multiplication on a Xilinx Virtex-4 device, making it the fastest implementation. This work has confirmed the suitability of a pipelined data path; an efficient Galois field multiplier over GF(2^163) was developed, and the resulting implementation of ECC over GF(2^163) provides better security with a smaller key size.

REFERENCES
[1] J. Lutz and A. Hasan, "High performance FPGA based elliptic curve cryptographic co-processor," in Proc. Int. Conf. Inf. Technol.: Coding Comput. (ITCC), 2004, p. 486.
[2] F. Rodriguez-Henriquez, N. A. Saqib, and A. Diaz-Perez, "A fast parallel implementation of elliptic curve point multiplication over GF(2m)," Microprocessors Microsyst., vol. 28, pp. 329–339, 2004.
[3] N. Mentens, S. B. Ors, and B. Preneel, "An FPGA implementation of an elliptic curve processor over GF(2m)," presented at the 14th ACM Great Lakes Symp. VLSI, Boston, MA, 2004.
[4] J. Lopez and R. Dahab, "Fast multiplication on elliptic curves over GF(2m) without precomputation," presented at the Workshop on Cryptographic Hardware and Embedded Systems (CHES), Worcester, MA, 1999.
[5] G. Seroussi and N. P. Smart, Elliptic Curves in Cryptography. Cambridge, U.K.: Cambridge Univ. Press, 1999.
[6] C. H. Kim and C. P. Hong, "High-speed division architecture for GF(2m)," Electron. Lett., vol. 38, pp. 835–836, 2002.
[7] W. Chelton and M. Benaissa, "High-speed pipelined ECC processor over GF(2m)," presented at the IEEE Workshop Signal Process. Syst. (SiPS), Banff, Canada, 2006.
[8] G. Seroussi and N. P. Smart, Elliptic Curves in Cryptography. Cambridge, U.K.: Cambridge Univ. Press, 1999.
[9] P. L. Montgomery, "Speeding the Pollard and elliptic curve methods of factorization," 1987.
[10] Z. Yan and D. V. Sarwate, "New systolic architectures for inversion and division in GF(2m)," IEEE Trans. Comput., vol. 52, no. 11, pp. 1514–1519, Nov. 2003.
[11] C. H. Kim and C. P. Hong, "High-speed division architecture for GF(2m)," Electron. Lett., vol. 38, pp. 835–836, 2002.
[12] T. Itoh and S. Tsujii, "A fast algorithm for computing multiplicative inverses in GF(2m) using normal bases," Inf. Computation, vol. 78, pp. 171–177, 1988.
[13] P. A. Scott, S. E. Tavares, and L. E. Peppard, "A fast VLSI multiplier for GF(2m)," IEEE J. Sel. Areas Commun., vol. 4, no. 1.
[14] H. Wu, "Bit-parallel finite field multiplier and squarer using polynomial basis," IEEE Trans. Comput., vol. 51, no. 7, pp. 750–758, Jul. 2002.
[15] A. Reyhani-Masoleh and M. A. Hasan, "Low complexity bit parallel architectures for polynomial basis multiplication over GF(2m)," IEEE Trans. Comput., vol. 53, no. 8, pp. 945–959, Aug. 2004.
[16] L. Song and K. K. Parhi, "Low-energy digit-serial/parallel finite field multipliers," J. VLSI Signal Process. Syst., vol. 19, pp. 149–166, 1998.
[17] M. C. Mekhallalati, A. S. Ashur, and M. K. Ibrahim, "Novel radix finite field multiplier for GF(2m)," J. VLSI Signal Process., vol. 15, pp. 233–245, 1997.
[18] P. K. Mishra, "Pipelined computation of scalar multiplication in elliptic curve cryptosystems," presented at CHES, Cambridge, MA, 2004.