Re: P64 Math
>>On Thu, 31 Aug 1995, Steve Atkins wrote:
>>
>>> >>Readers,
>>
>>> [ which adder architecture ]
>>
>>> When you hit 32 bits I think it's worthwhile to use some sort of carry
>>> propagate acceleration. Something like a carry skip adder doesn't cost too
>>> much silicon, and will be *much* faster than a ripple adder for wide words.
>>
>>Can you name numbers? At least a qualitative factor. It still scales
>>linearly to bit length, I presume?
When I find the book with the decent analysis in it I'll post the
numbers. (Someone `borrowed' it).
>>
>>> I've been looking at 64 bit arithmetic recently, and (for the silicon we use)
>>> the tradeoffs do seem to change between 32 & 64 bit operations. For a fast 64
>>> bit processor, where addition may be a bottleneck, a good optimised adder would
>>> be essential - either a carry skip adder for medium speeds, a hybrid carry select
>>> adder (such as the one the Alpha uses) for fast speeds. If the word width grew
>>
>>Can you outline their principles, at least roughly? (My numerics
>>hardware book is a bit outdated :(
Certainly....
-*- text -*- ~atkinss/add.txt
Notation:
x_i is bit i of the number x, in standard twos complement form.
. is logical and
+ is logical or
# is logical xor
Basics
======
We're adding two n-bit numbers a & b, and a carry-in c_0 to give an
n-bit result z, and a carry-out c_n
z = a plus b plus c_0
The operands and result are of the form
a = { a_n-1, a_n-2, ... a_1, a_0 }
and so on.
Logically addition is this:
For all 0 <= i < n
z_i = a_i # b_i # c_i
c_i+1 = x_i . y_i + c_i . (x_i + y_i) [1]
<This would be so much clearer in TeX or HTML3......>
This is often reformulated in terms of propagate (p) and generate (g) signals:
g_i = a_i . b_i [2]
p_i = a_i # b_i [3]
[ The alternative p_i = a_i + b_i is sometimes used ]
_________
[ Often a kill signal (k) is used too: k_i = a_i + b_i ]
Then the carry signals can be given as:
c_i+1 = g_i + c_i . p_i substitute [2] & [3] in [1] [4]
and
z_i = p_i # c_i
The expensive bit about addition is generating c_i. Once all the c_i
are generated the z_i can be generated in constant time (not a negligable
time, but at least it's constant with operand length).
Different adder architectures
=============================
Ripple Carry (RCA)
~~~~~~~~~~~~
A ripple carry adder (RCA) implements [1] directly. You all understand
ripple carry adders, so I'll just say that they are cheap and slow.
The critical path in an RCA is from c_0, a_0 or b_0 through all the
full adders to z_n-1 or c_n.
Carry lookahead (CLA)
~~~~~~~~~~~~~~~
For a four bit adder we can expand [4] to give
c_1 = g_0 + c_0.p_0 [5-8]
c_2 = g_1 + g_0.p_1 + c_0.p_0.p_1
c_3 = g_2 + g_1.p_2 + g_0.p_1_p_2 + c_0.p_0.p_1.p_2
c_4 = g_3 + g_2.p_3 + g_1.p_2.p_3 + g_0.p_1.p_2.p_3 + c_0.p_0.p_1.p_2.p_3
This is reasonable for 4 bits, but will get grotesquely large in
terms of are and fan-out if taken much further.
One option is to divide the operands into groups of four bits, use
carry-lookahead within each group and ripple the carries between groups.
* Divide and Conquer
If we can improve on speed of rippling carries within each group, then
surely we can improve on the speed of rippling carries between groups
by a similar approach.
Consider Group_0, consisting of (g_0 - g_3, p_0 - p_3)
Defining a `group generate' g* and a `group propagate' p*:
For Group_0:
g*_0 = g_3 + g_2.p_3 + g_1.p_2.p_3 + g_0.p_1.p_2.p_3
p*_0 = p_0.p_1.p_2.p_3
and identically for all other groups.
Now
c_4 = g8_0 + c_0.p*_0
c_8 = g*_1 + g*_0.p*_1 + c_0.p*_0.p*_1
c_12 = g*_2 + g*_1.p*_2 + g*_0.p*_1.p*_2 + c_0.p*_0.p*_1.p*_2
c_16 = [deleted - it's > 80 characters wide....]
These are identical to equations [5-8] used to generate carries within
groups.
There's no need to stop there.
The four-bit groups can be combined into sixteen-bit groups of groups.
Each can produce generate (g**) and propagate (p**) signals, combined as
above to give c_16, c_32, c_48 and c_64.
This divide and conquer approach will eventually generate c_i for the
entire word width.
<This would be a lot clearer with diagrams>
For board level adders CLA is pretty good, 'cos it can use standard MSI
lookahead generator ICs for nearly everything.
It's not *bad* in CMOS, but it's not particularly good either. I'd use
one of the following architectures in preference. The only exception might
be in a cell-based design where an optimised lookahead cell is provided.
Carry Skip adder
~~~~~~~~~~~~~~~~
"In VLSI technology the carry-skip adder is comparable in speed to the carry
look-ahead technique (for commonly used word lengths) while it requires
less chip area and consumes less power."
-- Computer Arithmetic Algorithms, Israel Koren
...and that's why it's the adder I'd use for a 32 bit system, and probably
for a 64 bit system too.
Carry propagation can skip any stage for which p_i = 1 (ie a_i != b_i).
Several consecutive stages can be skipped if p_i =1 for each stage.
A carry-skip adder is divided into groups of consecutive stages, with a
simple ripple carry scheme in each group.
Each group generates a group propagate signal, p*_i.
For Group_i, consisting of k stages j, j+1, ... j+k-1
p*_i = p_j . p_j+1 . ... . p_j+k-1
This is used to allow an incoming carry into the group to skip over all the
stages in the group and generate a group carry out.
Group_i-Carry_out = c_j+k + p*_i . Group_i-Carry_in
(c_j+k is the normal ripple carry out from the most significat stage
in the group)
The critical path through a carry-skip adder is via ripple carry through,
one of the groups, and via the skip carry chain through the remainder of
the groups. (Think about it. A carry coming out of a group via the ripple
chain *must* have been generated within the group - if it was generated
before the group, p*_i would have been 1 and the stage would have been
skipped. So the critical path will travel through only one group).
In a carry-skip adder the groups will not all be the same size. The optimal
division of an n-bit adder into carry-skip groups depends on the
characteristics of the target technology.
A 32 bit adder might have 10 groups of sizes {1, 2, 3, 4, 5, 6, 5, 3, 2, 1}
for a typical technology.
VLSI implementations of carry-skip adders tend to be quite small - using the
32 bit adder given above the extra cost over an RCA is about 20 extra gates.
While a single level of carry-skip speeds things up a lot a second level of
skip, skipping over more than one group can speed things up a little further
for very little extra cost.
Carry Select adder
~~~~~~~~~~~~~~~~~~
The reason carry propagation is slow in a ripple adder is because each stage
needs to have a_i, b_i and c_i available before it can calculate c_i+i.
One way of removing this dependency on c_i is to calculate both
a_i + b_i + 0, and a_i + b_i + 1, then choose the appropriate result when
c_i becomes available.
This is the basic trick of the carry select adder.
For 32bit operands the 32 stage adder could consist of four 8bit
groups.
Group_0 is purely an 8bit ripple carry adder, with c_0 as it's carry input.
The sum output from this RCA goes directly to the adder output.
Group_1 has two 8bit ripple cary adders, one with a Cin of 0, the other with
a Cin of 1. The sum outputs from these two RCAs go to a 2:1 multiplexor
controlled by the Cout of Group_0. The output of the mux goes to the adder
output.
Group_2 has two 8bit RCAs, the same as group_1. The sum outputs of this go
to two 2:1 muxes. One is controlled by the Cout of the Group_1 chain with a
Cin of 0, so the mux output is what the sum would be if the Cout of Group_0
were 0. The other is controlled by the Cout of the Group_1 chain with a Cin
of 1, giving the sum if the Cout of Group_1 were 1.
These two mux outputs are fed to another 2:1 mux, controlled by the Cout of
Group_0, to select the correct sum to send to the adder output.
Group_3 is similar. The sums from the two RCAs go through two 2:1 muxes
controled by the two Couts of the two RCAs in Group_2, giving two values, one
if the Cout of Group_1 is 1, one if it is 0.
These two values are passed to another pair of 2:1 muxes, controlled by the
two Couts of Group_1, giving two values, one if the Cout of Group_0 is 1, the
other if it is 0.
The correct one of these two values is selected by a final 2:1 controlled by
the Cout of Group_0.
<This is far too complicated for ASCII art at this time of night. If there's
any interest I might HTMLify it>
Carry select adders tend to be slightly faster than skip adders, particularly
for wide operands. They will typically consume a little over twice the area
of a ripple carry adder. The layout for a select adder tends to be very
regular, which can be a big advantage for datapath compilers/tilers.
One of the DEC Alphas uses a slight variant on this approach to do a 64bit
add at 200MHz with one cycle of latency in a 0.75um technology. There's a
paper in Vol 27, No 11, November 1992 of the IEEE Journal of Solid-State
Circuits giving a few paragraphs about the adder, and a nice diagram.
Prefix adders
~~~~~~~~~~~~~
These are bizarre to think about, but very powerful, particularly for longer
word lengths.
It's a little mathematical in places....
Define an operator 'o':
(g,p) o (g',p') = (g + (p . g'), p . p')
Define G_i & P_i:
(G_1,P_1) = (g_1,p_1)
(G_i,P_i) = (g_i,p_i) o (G_i-1,P_i-1) 2 <= i <= n
Then
c_i = G_i for 1 <= i <= n [1]
There's a fairly easy, but not too interesting, inductive proof of [1].
Next, 'o' is associative, ie
(g_1,p_1) o ( (g_2,p_2) o (g_3,p_3) )
= ( (g_1,p_1) o (g_2,p_2) ) o (g_3,p_3) for all (g_i,p_i) [2]
Again I'll miss out the proof - it's just an 'expand both sides and notice
they are identical'.
So to find c_i it suffices to calculate
(G_i,P_i) = (g_i,p_1) o (g_i-1,p_i-1) o ... o (g_1,p_1)
and by [2] this can be calculated in any order.
The Brent-Kung architecture
===========================
(The original reference is Brent & Kung, IEEE Transactions on Computers,
Vol C-31,No 3, March 1982)
First consider the simple problem of calculating just c_n, for n=16.
Since 'o' is associative (G_16,P_16) can be generated by a binary tree:
Each wire in the diagram carries a pair of signals (g,p). (g_i,p_1) are
fed in at the base.
Each node 'O' performs this operation:
(g_in,p_in) o (g'_in,p'_in)
|
|
O
|\
| \
| \
| \
(g_in,p_in) (g'_in,p'_in)
(G_16,P_16)
|
|
O
|\____
| \_______
| \________
| \
O O
|\ |\
| \__ | \__
| \__ | \__
| \__ | \__
| \ | \
O O O O
|\_ |\_ |\_ |\_
| \_ | \_ | \_ | \_
| \ | \ | \ | \
O O O O O O O O
|\ |\ |\ |\ |\ |\ |\ |\
| \ | \ | \ | \ | \ | \ | \ | \
| \ | \ | \ | \ | \ | \ | \ | \
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | |
(g_i,p_i) 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
This will give (G_16,p_16) and hence c_16. To generate c_15 downto c_1
another similar tree is required:
c_i 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
| | | | | | | | | | | | | | | |
| O | O | O | O | O | O | O | |
| |\ | |\ | |\ | |\ | |\ | |\ | |\ | |
| | \| | \| | \| | \| | \| | \| | \| |
| | O | | | O | | | O | | | | |
| | |\_| | | |\_| | | |\_| | | | |
| | | \_ | | | \_ | | | \_ | | | |
| | | | \| | | | \| | | | \| | | |
| | | | O | | | | | | | | | | |
| | | | |\_| | | | | | | | | | |
| | | | | \__| | | | | | | | | |
| | | | | | \__| | | | | | | | |
| | | | | | | \ | | | | | | | |
| | | | | | | |\ | | | | | | | |
| | | | | | | | \| | | | | | | |
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | |
O | | | | | | | | | | | | | | |
|\____| | | | | | | | | | | | | |
| | \_______ | | | | | | | | | | |
| | | | | \________ | | | | | | | |
| | | | | | | | \| | | | | | | |
O | | | | | | | O | | | | | | |
|\ | | | | | | | |\ | | | | | | |
| \__ | | | | | | | \__ | | | | | |
| | \__ | | | | | | | \__ | | | | |
| | | \__ | | | | | | | \__ | | | |
| | | | \| | | | | | | | \| | | |
O | | | O | | | O | | | O | | |
|\_| | | |\_| | | |\_| | | |\_| | |
| \_ | | | \_ | | | \_ | | | \_ | |
| | \| | | | \| | | | \| | | | \| |
O | O | O | O | O | O | O | O |
|\ | |\ | |\ | |\ | |\ | |\ | |\ | |\ |
| \| | \| | \| | \| | \| | \| | \| | \|
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | |
(g_i,p_i) 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
There are many varying architectures for prefix adders, driven by
speed/area/layout complexity tradeoffs.
--
-- Steve Atkins -- <atkinss@inmos.co.uk> -- +44 454 611439 --
<Squeak!> <Squeak!> <Squeak!>....... <Splash!>
[ It's a rat leaving a sinking ship. I'm off to AMD......]