Dr. David Mundie
Andrew Babian
May 6, 1993
ELEC

Report on Amniac Implementation.

Abstract.

The Amniac implementation is based on a model of a general

neural network.  It uses two major block types.  The IO-cell

(roughly, a node or neuron) calculates the sum of the

products of the weights and activations of the other nodes.

The S-cell routes the activations from one node to the others

and contributes a weight appropriate for its link.  The chip

interfaces with the front end computer through a RAM buffer,

which also contains the activation functions for the nodes.

IO-cells are grouped six to a chip, each communicating

with others in a neighborhood limited only by the

bandwidth, that is, how many S-cells are in a column with

the IO-cell.


Introduction.

The Amniac project was described in [1] as a theoretical

approach to a general purpose neural network, with a

generality like that of a universal Turing machine-- any

neural network architecture can be represented on this

design.  The successful simulations of the design on

massively parallel machines indicated that the design would

be useful to implement in VLSI, because of its

programmability, its geometric regularity, and its digital

nature.


        The implementation design consists of two major

blocks of circuitry, the S-Cell and the IO-cell, which act

like the synapse and "cell body" of an artificial neuron. In

this implementation, several IO-cells are in a device, and

each IO-cell has a column of S-cells above it, with one row

of S-cells active at a time. The active S-cell transmits

both an activation value from its IO-cell and a weight

(stored in the S-cell), diagonally to another IO-cell; each

IO-cell therefore communicates with neighboring IO-cells, up

to a certain distance away, called the bandwidth.


        The theoretical model in [1] has more

fundamental blocks than those here described,  and in it

the multiplication is done in the S-cell and then added

together at the IO-cell. Multiplication in the S-cell was

seen to be unnecessary since even though all the

multiplications are done in parallel, the additions must

still be done sequentially into an accumulator.


The Major Blocks.

The major blocks are the S-cell and the IO-cell. The S-cell

corresponds to a synapse and the IO-cell corresponds to

the nucleus which sums the products of weights and

activations from the S-cell.


S-Cell design.

This design is quite simple.  The S-cell stores two 8-bit

weights, for forward and backward activations.  It also

contains two token bits which determine its behavior and

are passed upwards along a column such that only one S-cell

in the column has the token.  If it has the token, the

S-cell will transmit its weights along its diagonal buses,

and will pass the activation from the IO-cell below it in

its column.  If it does not have the token, the S-cell just

transmits whatever signals come through it: it acts like

wires passing on the signals when it has no token.


        It has three inputs, from ABUS (the activation

below), BBUS (the diagonal backward input from above), and

FBUS (the diagonal forward input from above).  It also has

three corresponding outputs, up to ABUS and down to BBUS

and FBUS.  These corresponding buses are connected together

when the S-cell does not have the token.
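
The routing rule above can be sketched behaviorally.  This is
only a minimal model under stated assumptions: the class name
SCell, the method route, and the "phase" argument are
hypothetical, and the split into an activation phase and a
weight phase is inferred from steps 10 and 11 of Appendix A.
What the token-holding cell drives upward on ABUS is not
specified in the text, so the sketch simply passes it on.

```python
from dataclasses import dataclass


@dataclass
class SCell:
    """Behavioral model of one S-cell (names are illustrative only)."""
    w_forward: int           # 8-bit weight sent down FBUS
    w_backward: int          # 8-bit weight sent down BBUS
    has_token: bool = False  # only one cell per column holds the token

    def route(self, abus_in, fbus_in, bbus_in, phase):
        """Return (abus_out, fbus_out, bbus_out) for one transfer."""
        if not self.has_token:
            # No token: the cell behaves like wires.
            return abus_in, fbus_in, bbus_in
        if phase == "activation":
            # Token: the column's activation goes out on both diagonals.
            return abus_in, abus_in, abus_in
        if phase == "weight":
            # Token: the stored weights go out on the diagonals.
            return abus_in, self.w_forward, self.w_backward
        raise ValueError("phase must be 'activation' or 'weight'")
```

A cell without the token returns its three inputs unchanged,
which is the "acts like wires" case described above.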

        A separate token is used to tell the S-cell to load

the weights.  The weights are loaded from the activation

register at the base of the column to the registers in the

S-cell with the token.


IO-Cell Design.

The IO-cell is more complicated than the S-cell because

all the work in the design has been shifted away from the

S-cell (which originally did the multiply) which now is only

a router.  The IO-cell does a multiply accumulate and must

store all the operands and partial products.  The two

separate multiplies (one from each direction forward and

back) are both done by a single multiplier, which multiplies

one pair of weight and activation and then the second pair,

which has been stored until needed.

        While it may be possible to eliminate some of the

storage buffers, the IO-cell uses nine eight-bit shift

registers, compared to two in the S-cell.  One register is

for the feedback weight (the forward weight is shifted into

the addend).  Two registers hold the forward and backward

activations while they are being used in the multiply.

All of the shift registers have the ability to circle around

to maintain their state in a dynamic register.


         The activation is stored in two registers  so that

the new activation can be compared to the older one to check

convergence. The activation at the beginning of the cycle is

loaded quickly into the Activation buffer from the outside.

It is then shifted into the Activation output register, while

checking bit by bit for equality.
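
The bit-serial convergence check can be illustrated with a
short sketch.  It assumes the bits are shifted least
significant bit first, which the text does not state; the
function name shift_and_compare is hypothetical.

```python
def shift_and_compare(buffer_value, output_value, width=8):
    """Shift the Activation buffer into the output register one bit at a
    time and report whether any bit differed from the old value.

    Returns (new_output_value, changed).
    """
    changed = False
    new_output = 0
    for bit in range(width):
        new_bit = (buffer_value >> bit) & 1
        old_bit = (output_value >> bit) & 1
        if new_bit != old_bit:
            changed = True  # this node has not converged yet
        new_output |= new_bit << bit
    return new_output, changed
```

The network as a whole would be considered converged only
when no IO-cell reports a change.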


        The multiply-accumulator contains two sixteen bit

registers (equivalent to four eight-bit registers--all the

other registers are eight bits).  The Addend (16-bit)

register holds a signed weight value which will be added to

the accumulator or not depending on the current bit in the

activation.  These additions automatically add the

product of the weight and activation into the accumulator.

A product is never actually generated by itself; instead,

the accumulator keeps growing over a full big cycle for all

levels of S-cells, and the most significant eight bits are

used as the argument for the activation function from RAM.
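
A minimal software sketch of this shift-and-add
multiply-accumulate follows.  It assumes a two's complement
8-bit weight, an unsigned 8-bit activation consumed least
significant bit first, and a 16-bit accumulator that wraps;
none of these details is fixed by the text, and the function
names are illustrative.  Carry-save behavior is not modeled.

```python
MASK16 = 0xFFFF


def sign_extend_8_to_16(weight):
    """Widen a two's complement 8-bit weight to 16 bits."""
    weight &= 0xFF
    return weight | 0xFF00 if weight & 0x80 else weight


def multiply_accumulate(accumulator, weight, activation):
    """Add weight * activation into a 16-bit accumulator by shift-and-add."""
    addend = sign_extend_8_to_16(weight)
    for _ in range(8):
        if activation & 1:                 # current activation bit selects the add
            accumulator = (accumulator + addend) & MASK16
        addend = (addend << 1) & MASK16    # next partial product weight
        activation >>= 1
    return accumulator


def to_signed16(value):
    """Interpret a 16-bit pattern as a signed integer."""
    return value - 0x10000 if value & 0x8000 else value


def activation_argument(accumulator):
    """Most significant eight bits, used to index the activation function."""
    return (accumulator >> 8) & 0xFF
```

Note that the product never exists as a separate value: each
selected partial product goes straight into the accumulator.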


        The parallel-adder, which adds sixteen bits in one

clock cycle, significantly increases the speed of the

IO-cell, bringing the multiplication time (17 cycles) to

about the same time as other communication functions of the

IO-cell (18).  A fully serial multiplier would take perhaps

64 clock cycles for a multiply, but the parallel adder only

added the equivalent of about 6 registers, and there is

probably no more reasonable speedup possible in the multiply.

 
        Each IO-cell is connected to an internal address bus

which selects it for reading and writing off chip. Many

control lines control the operation of the IO-cell.  Each

such signal activates a single operation in the IO-cell,

in parallel for all the IO-cells on the chip.


Timing and Control.

The pseudocode description of the control system is given

in appendix A. The function of the chip occurs in two major

phases.  After loading in the activations, the chip performs

a multiply accumulate for each level of S-cells. "Small

cycle" here refers to a single multiply accumulate servicing

one row of S-cells. The weights from each direction are

added into the accumulator.


     A big cycle refers to the bandwidth counter going from

beginning to end of the cycle, going up a full column and

servicing all S-cells.  This big cycle ends with a valid

output from each node.  There is some handshaking to tell

the front end that the chip will access RAM and is done with

the cycle.
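
At the level of arithmetic only, one big cycle can be
sketched as below.  This is a rough model, not the control
logic: it assumes that the S-cell at level k of a column
reaches the IO-cells k+1 columns away in each direction, that
weights and activations are non-negative, and that each
activation function is a 256-entry table.  The names
big_cycle, forward_w, backward_w and tables are hypothetical.

```python
def big_cycle(activations, forward_w, backward_w, tables):
    """Run one big cycle: one small cycle per S-cell row, then table lookup.

    forward_w[level][src] and backward_w[level][src] are the weights
    stored in the S-cell at that level of column src.
    """
    n = len(activations)
    acc = [0] * n
    for level in range(len(forward_w)):      # one small cycle per level
        reach = level + 1                    # assumed diagonal distance
        for src in range(n):
            fwd, bwd = src + reach, src - reach
            if fwd < n:
                acc[fwd] += forward_w[level][src] * activations[src]
            if bwd >= 0:
                acc[bwd] += backward_w[level][src] * activations[src]
    # The top eight bits of each 16-bit accumulator index the table.
    return [tables[i][(acc[i] & 0xFFFF) >> 8] for i in range(n)]
```

With identity tables, a single forward weight of 64 applied
to an activation of 8 accumulates 512 in the neighboring
cell, whose output is then 2.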


Interfacing.

        For the programmer of the communications on the

front end, the interface is here specified.  The input data

and output data for the Amniac chip are all eight bit (one

byte) binary values. Each chip has its own RAM chip, and

each IO-cell has its own 256-byte block of storage for its

activation function.  The address in RAM takes eleven bits

(for a 2kx8 RAM chip).  The bottom eight bits are for the

argument of the function, the top three bits describe which

page, or which IO-cell, the function is for.  For example,

to access the activation value for IO cell 3 with an

argument 7B(hex), we access address 37B(hex).
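
The address layout can be written out as a short sketch.  The
function name activation_address is hypothetical, and the
range check assumes the six IO-cells are numbered 0 to 5.

```python
def activation_address(io_cell, argument):
    """Eleven-bit RAM address of an activation function entry.

    The top three bits select the IO-cell's page; the bottom eight
    bits are the function argument.
    """
    if not 0 <= io_cell <= 5:
        raise ValueError("IO-cell number must be 0..5")
    if not 0 <= argument <= 0xFF:
        raise ValueError("argument must fit in eight bits")
    return (io_cell << 8) | argument
```

For IO-cell 3 and argument 7B(hex) this gives 37B(hex), as in
the example above.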


        Page 7 or the top page in RAM is reserved for the

input and output buffers needed for the front end to

communicate with the neural network chip.  Each bit of the

direction flag, the highest byte, at address 7FF(hex),

determines where the input back into the neural network will

come from. If the corresponding bit is 0, the IO-cell will

calculate an activation function, but if the bit is 1, the

IO-cell will read in from its input buffer in RAM.
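
A small sketch of how the front end might manipulate the
direction flag follows.  It assumes that bit n of the byte
corresponds to IO-cell n, which the text implies but does not
state; the function names are hypothetical.

```python
DIRECTION_ADDRESS = 0x7FF  # highest byte of the 2kx8 RAM


def reads_input_buffer(direction_byte, io_cell):
    """True if the IO-cell takes its next activation from its input buffer."""
    return bool((direction_byte >> io_cell) & 1)


def set_direction(direction_byte, io_cell, from_input_buffer):
    """Return the direction byte with one IO-cell's bit set or cleared."""
    mask = 1 << io_cell
    if from_input_buffer:
        return direction_byte | mask
    return direction_byte & ~mask & 0xFF
```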


        The input buffer address has its lower three bits

equal to the IO-cell number, and its upper three bits set to

1 (for page 7), and the other bits set to 0.  The output

buffer has a similar address, except that the fourth lowest

bit is 1. For example, IO-cell three has input buffer at

703(hex), and output buffer at 70B(hex).
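
The two buffer addresses follow directly from this bit
layout, as the sketch below shows.  The function names are
hypothetical.

```python
PAGE_7 = 0x700      # upper three address bits set to 1
OUTPUT_BIT = 0x008  # fourth lowest bit marks the output buffer


def input_buffer_address(io_cell):
    """RAM address of an IO-cell's input buffer on page 7."""
    return PAGE_7 | io_cell


def output_buffer_address(io_cell):
    """RAM address of an IO-cell's output buffer on page 7."""
    return PAGE_7 | OUTPUT_BIT | io_cell
```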


        Several control lines are used to determine when it

is time for the front end computer to write into RAM.  The

WAIT line, when asserted, will hold the neural net chip at

the beginning of its big cycle or after it has output its

activations.  The front end can then examine the outputs in

the output buffers and write appropriate inputs into the

input buffers (remembering to set the direction bit).  When

the WAIT line is released, the neural network will continue

processing.  The DEBUG line will cause the accumulators to

be output every small cycle into the output buffer, and the

WAIT signal will hold the neural net chip at the end of the

cycle until they are read. Since the WAIT line is only

checked when the chip gets to a stopping place, it could be

set and latched any place in the cycle, to stop it at the

end.


        After the WAIT signal is released, there is a short

period in which the neural net chip must use the RAM, and

the front end should keep off (usually 14 clock cycles,

except for debug mode).  After this period there is a long

time during the big cycle that the RAM is not used and the

front end can read the outputs and write the inputs.  The

neural net will need the RAM again when the bandwidth

counter gets to zero and it outputs END_OF_SMALL_CYCLE.
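
What the front end does while it owns the RAM can be sketched
as follows.  The RAM is modeled as a 2048-byte bytearray, and
asserting or releasing WAIT is hardware specific and is not
modeled.  The function name front_end_exchange is
hypothetical, and the sketch assumes bit n of the direction
byte belongs to IO-cell n.

```python
def front_end_exchange(ram, new_inputs):
    """Read all six output buffers, then write the requested inputs.

    new_inputs maps an IO-cell number to the byte to force into it.
    Cells not listed are left to compute their activation function.
    Returns the list of output bytes, indexed by IO-cell.
    """
    outputs = [ram[0x708 + cell] for cell in range(6)]
    direction = 0
    for cell, value in new_inputs.items():
        ram[0x700 + cell] = value & 0xFF  # input buffer on page 7
        direction |= 1 << cell            # remember to set the direction bit
    ram[0x7FF] = direction
    return outputs
```

Calling it with an empty dictionary clears the direction byte,
so every IO-cell uses its activation function on the next
big cycle.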


The "front end" refers to the conventional computer

controlling the neural network system.


Debug mode.

There is a line external to the chip which when set puts

the chip in debug mode.  In debug mode, the chip will

output all the activations into the buffers in RAM on each

small cycle.


Miscellaneous Issues.

Several assorted issues arose in the design.

Some types of neural network architectures use clamped input

cells, while others set them initially and allow them to

vary.  The clamped input can be handled by either making the

activation function a fixed value or by having the self

activation, with a weight of unity, be the only contributing

value, with an identity activation function. A clamped input

can also be maintained by always setting the direction for

activations to come from the RAM input buffer. The input can

then be released by resetting the direction bit to read from the

activation function.


The self-activation is needed in some networks which allow

a node to add activations from itself.  It is equivalent to

half of a standard S-cell, with a single weight stored.

The current activation in a cell is multiplied by this

weight and added to its accumulator.


The internal bus on the chip is a shared resource, which

simplifies the communication to the RAM and allows an eight

bit parallel data flow, much faster and simpler than serial.

Each IO-cell will be addressed in sequence and told to put

on or take off data from the bus.  The sequence, a quick

burst of data to the RAM, leaves the RAM  free for the front

end to examine.


     The multiplier may be improved.  It now takes a large

amount of chip space, as a shift and add with a full

parallel adder with carry save.  Many other types of

multipliers use a recoding technique to increase speed,

probably too complicated an approach for here.  The many

speedup techniques which have varying times for computation

(such as shift over zeroes), cause problems for parallel

timing.  The current multiplier design makes a reasonable

compromise for speed, such that the multiplication time is

now comparable to the communication time.


     There is a test for the convergence of the neural

network in which all the activations are compared with their

previous values.  There was some concern about whether this

worked with clamped and unclamped input.


     It is important that it be possible to add more chips

to the network at the top of the columns (adding more S-cells

and bandwidth), along with adding them adjacent (adding more

IO-cells and nodes).  The VLSI-CMOS design required a

single chip (only one chip design was to be made) to

handle vertical expansion, which would waste the IO-cells in

the stacked chips.  Since the programmable gate array

design has two separate chips for IO-cells and S-cells, this

expansion is not as wasteful. The most important simplifying

design decision made has been to use no pin multiplexing

between chips.  Although many communication

channels are needed (for each S-cell on the top row of a chip,

three lines are needed; for each on the side, two lines),

the chips used generally have adequate numbers of pins to

handle them using one pin for one line.


Media.

        Some time was spent before deciding to use a

field reprogrammable gate array for the initial design

instead of a CMOS VLSI design (the initial assumption).  The

implementation at the gate level is different for different

media, but most of the design is the same.  CMOS VLSI allows

for a transmission gate design that does not correspond to a

gate implementation.  The transmission gate can make a

simpler adder circuit and simpler multiplexer. The CMOS

design may be cheaper and faster and have a higher density

(more elements on a chip), but it has no room for

error, since mistakes cannot be fixed.  A gate array will

not allow for the same number of cells on one chip, but it

allows for better testing, shorter design cycle time, and if

erasable, very small risk from design error.  The decision

to initially use programmable gate arrays seemed clear.  It

is likely that a later version will call for a custom VLSI

chip.


Conclusion.

The design is a very straightforward and logical general

design for a neural network. The chip has a high probability

of working as a general neural network.


References.

[1]  Max H. Garzon, Stanley P. Franklin, William Baggett,

     William S. Boyd, Jr., and Dinah Dickerson. "Design and

     Testing of a General-Purpose Neurocomputer." Journal of

     Parallel and Distributed Computing 14, 203-220 (1992).


Appendix A.

The timing control description for the Amniac design.

Big Cycle, for each Activation:
1.  The front end processor asserts a WAIT signal.
    The neural net does nothing.
    while the front end
       loads the RAM with the activations (initially)
       writes to RAM input buffers
       reads the RAM output buffers
       writes the direction buffer.
    While WAIT is asserted, stay in step 1.

 Read in Direction byte.
    Each bit tells whether this will be a new value for
       activation or a lookup into RAM, see text.
    2. RAM Address = DirectionByte ( =7FF(hex))
    3. Read in DirectionByte

  Read in Activation functions and inputs.
        Load IOCounter, IOC=5.
    4. If Direction = 0,
          RAM Address = IOC*256+Accumulator
          (page IOC, offset Accum: Activation function)

       If Direction = 1,
          RAM Address= 7*256+IOC
          (page 7,offset IOC: Input buffer)

    5.  Read Back in the new activation value.
        While --IOCounter>=0 go to step 4.
           (note: predecrement notation)

    6.  Reset the Accumulator.
        Initialize the token.
        Clear Addend.

Shift in the new Activations from the Activation buffers,
    and check if the activation has changed.
    7.   For Bitcount=0 to 7
         Shift one bit to Activation from the Activation Buffer
         (simultaneous for all IO Cells)
         next

Small Cycle, for each level of bandwidth.

8,9  Advance tokens along activation bus. (It's 2 steps)

10.  For Bitcount=0 to 7
     Shift in Activation for 8 bits  (As in 6,7)
         Forward and backward Activation.
           Fbus from S-cell shifts into Forward Activation Buffer.
           Bbus from S-cell shifts into Backward Activation Buffer.
     Shift up Activation
     next

11.  For Bitcount=0 to 7
     Shift in Weights for 8 bits  (As in 6,7)
        Forward and backward weights.
           FBus shifts straight into the Addend.
           Bbus shifts into the Backward Weight Buffer (BWB).
     next

Do the Multiplication.
   12. For Bitcount=0 to 7
       Add the Addend to the Store (Accumulator),
         with saved carries.
         According to Multiplier (last bit of FAB)
       shift Addend with sign extend.
       shift Multiplier.

   on 8th cycle, load BWB into addend.

   13. Repeat as in step 12

   14. Clear addend.
       Propagate Carry.

   15. Tell the Bandwidth counter, we're done with this layer.
       If DEBUG, goto 16.
       If (not END_OF_BANDWIDTH)
          Bandwidth Counter says not end of activation
          goto step 8.

END of Small Cycle

Output Activations to output buffers.
    16. For IOCounter=5 to 0
        RAM Address= 7*256+8+IOCounter
            (Page 7,offset 8+IOCounter: Output buffer)

    17. Write out accumulator indicated by IOCounter.
        next
        If (DEBUG) goto 19

    18. Send DONE_WITH_BIG_CYCLE to front end
        goto step 1

    19. (Waiting for Debug)
        While WAIT is set, stay in step 19.
        else go to step 1.

END of BIG Cycle