Dr. David Mundie
Andrew Babian
May 6, 1993
ELEC

Report on Amniac Implementation.

Abstract.

The Amniac implementation is based on a model of a general neural network. It uses two major block types. The IO-cell (roughly, a node or neuron) calculates the sum of the products of the weights and activations of the other nodes. The S-cell routes the activation from one node to the others and contributes a weight appropriate for its link. The chip interfaces with the front end computer through a RAM buffer, which also contains the activation functions for the nodes. IO-cells are grouped six to a chip, each communicating with others in a neighborhood limited only by the bandwidth, that is, by how many S-cells are in a column with the IO-cell.

Introduction.

The Amniac project was described in [1] as a theoretical approach to a general purpose neural network, with a generality like that of a universal Turing machine: any neural network architecture can be represented on this design. The successful simulations of the design on massively parallel machines indicated that the design would be useful to implement in VLSI, because of its programmability, its geometric regularity, and its digital nature.

The implementation design consists of two major blocks of circuitry, the S-cell and the IO-cell, which act like the synapse and "cell body" of an artificial neuron. In this implementation, several IO-cells are in a device, and each IO-cell has a column of S-cells above it, with one row of S-cells active at a time. The active S-cell transmits both an activation value from its IO-cell and a weight (stored in the S-cell) diagonally to another IO-cell; each IO-cell therefore communicates with neighboring IO-cells up to a certain distance away, called the bandwidth.

The theoretical model in [1] has more fundamental blocks than those described here, and in it the multiplication is done in the S-cell and the products are then added together at the IO-cell.
Multiplication in the S-cell was seen to be unnecessary: even though all the multiplications are done in parallel, the additions must still be done sequentially into an accumulator.

The Major Blocks.

The major blocks are the S-cell and the IO-cell. The S-cell corresponds to a synapse, and the IO-cell corresponds to the nucleus, which sums the products of weights and activations from the S-cells.

S-Cell Design.

This design is quite simple. The S-cell stores two 8-bit weights, for forward and backward activations. It also contains two token bits which determine its behavior. A token is passed upward along a column such that only one S-cell in the column has the token at a time. If it has the token, the S-cell will transmit its weights along its diagonal buses, and will pass the activation from the IO-cell below it in its column. If it does not have the token, the S-cell just transmits whatever signals come through it: it acts like wires passing on the signals.

The S-cell has three inputs, from ABUS (the activation below), BBUS (the diagonal backward input from above), and FBUS (the diagonal forward input from above). It also has three corresponding outputs, up to ABUS and down to BBUS and FBUS. Each input is connected to its corresponding output when the S-cell does not have the token.

A separate token is used to tell the S-cell to load the weights. The weights are loaded from the activation register at the base of the column into the registers of the S-cell with the token.

IO-Cell Design.

The IO-cell is more complicated than the S-cell because all the work in the design has been shifted away from the S-cell, which originally did the multiply and is now only a router. The IO-cell does a multiply-accumulate and must store all the operands and partial products.
The two separate multiplies (one from each direction, forward and backward) are both done by a single multiplier, which multiplies one weight and activation pair and then the second pair, which has been stored until needed. While it may be possible to eliminate some of the storage buffers, the IO-cell uses nine eight-bit shift registers, compared to two in the S-cell. One register is for the feedback weight (the forward weight is shifted into the addend). Two registers hold the forward and backward activations while they are being used in the multiply. All of the shift registers can recirculate to maintain their state in a dynamic register.

The activation is stored in two registers so that the new activation can be compared to the older one to check convergence. The activation at the beginning of the cycle is loaded quickly into the Activation buffer from the outside. It is then shifted into the Activation output register, while being checked bit by bit for equality.

The multiply-accumulator contains two sixteen-bit registers (equivalent to four eight-bit registers; all the other registers are eight bits). The Addend (16-bit) register holds a signed weight value which is added to the accumulator or not, depending on the current bit of the activation. These additions automatically add the product of the weight and activation into the accumulator. A product is never actually generated by itself; instead, its partial products are added directly into the accumulator. The accumulator keeps growing over a full big cycle for all levels of S-cells, and its most significant eight bits are used as the argument for the activation function from RAM.

The parallel adder, which adds sixteen bits in one clock cycle, significantly increases the speed of the IO-cell, bringing the multiplication time (17 cycles) to about the same as that of the other communication functions of the IO-cell (18 cycles).
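The shift-and-add multiply-accumulate described above can be illustrated with a small behavioral sketch. It is not the gate-level design: the carry-save adder and the clock-by-clock timing are not modeled, and the function names are hypothetical. It assumes the weight is an 8-bit two's-complement value, the activation is an unsigned 8-bit value consumed least significant bit first, and the accumulator wraps at sixteen bits.

```python
MASK16 = 0xFFFF  # the accumulator and addend are sixteen-bit registers


def sign_extend8(byte):
    """Interpret an 8-bit value as a signed two's-complement integer (assumption)."""
    return byte - 0x100 if byte & 0x80 else byte


def multiply_accumulate(accumulator, weight, activation):
    """Add weight * activation into the accumulator by conditional adds.

    The addend starts as the sign-extended weight. For each activation bit
    (assumed LSB first) the addend is added if the bit is 1, then shifted.
    No standalone product is ever formed.
    """
    addend = sign_extend8(weight) & MASK16
    for bit in range(8):
        if (activation >> bit) & 1:
            accumulator = (accumulator + addend) & MASK16
        addend = (addend << 1) & MASK16
    return accumulator


def small_cycle(accumulator, fwd_weight, fwd_activation, bwd_weight, bwd_activation):
    """One small cycle: the forward pair first, then the stored backward pair."""
    accumulator = multiply_accumulate(accumulator, fwd_weight, fwd_activation)
    return multiply_accumulate(accumulator, bwd_weight, bwd_activation)


def activation_argument(accumulator):
    """The most significant eight bits select the activation function entry."""
    return (accumulator >> 8) & 0xFF
```

For example, under these assumptions `multiply_accumulate(0, 100, 200)` leaves 20000 in the accumulator, and `activation_argument(20000)` is 78, the upper byte of 4E20(hex).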
A fully serial multiplier would take perhaps 64 clock cycles for a multiply, while the parallel adder adds only the equivalent of about six registers, and there is probably no further reasonable speedup possible in the multiply.

Each IO-cell is connected to an internal address bus which selects it for reading and writing off chip. Many control lines control the operation of the IO-cell. Each such signal activates a single operation in the IO-cell, in parallel for all the IO-cells on the chip.

Timing and Control.

The pseudocode description of the control system is given in Appendix A. The function of the chip occurs in two major phases. After loading in the activations, the chip performs a multiply-accumulate for each level of S-cells. A "small cycle" here refers to a single multiply-accumulate servicing one row of S-cells. The weights from each direction are added into the accumulator. A "big cycle" refers to the bandwidth counter going from the beginning to the end of its count, going up a full column and servicing all S-cells. The big cycle ends with a valid output from each node. There is some handshaking to tell the front end that the chip will access RAM and that it is done with the cycle.

Interfacing.

The interface is specified here for the programmer of the communications on the front end. The input data and output data for the Amniac chip are all eight-bit (one byte) binary values. Each chip has its own RAM chip, and each IO-cell has its own 256-byte block of storage for its activation function. The address in RAM takes eleven bits (for a 2Kx8 RAM chip). The bottom eight bits are the argument of the function; the top three bits select the page, that is, which IO-cell the function is for. For example, to access the activation value for IO-cell 3 with an argument 7B(hex), we access address 37B(hex). Page 7, the top page in RAM, is reserved for the input and output buffers needed for the front end to communicate with the neural network chip.
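As an illustration of the RAM layout just described, the following sketch builds a 2Kx8 RAM image on the front end and fills one IO-cell's page with an activation table. The helper names are hypothetical, and the identity table is only an example function; the address arithmetic (page in the top three bits, argument in the bottom eight) follows the text.

```python
RAM_SIZE = 2048      # 2Kx8 RAM: eleven address bits
PAGE_SIZE = 256      # one 256-byte activation table per IO-cell
RESERVED_PAGE = 7    # top page holds the input and output buffers


def function_address(cell, argument):
    """Address of the activation value for `cell` at `argument`."""
    if not 0 <= cell < RESERVED_PAGE:
        raise ValueError("page 7 is reserved for the buffers")
    if not 0 <= argument < PAGE_SIZE:
        raise ValueError("argument must fit in eight bits")
    return (cell << 8) | argument


def load_activation_table(ram, cell, func):
    """Write func(argument), truncated to one byte, for every argument."""
    for argument in range(PAGE_SIZE):
        ram[function_address(cell, argument)] = func(argument) & 0xFF


ram = bytearray(RAM_SIZE)
# Example only: an identity activation function for IO-cell 3.
load_activation_table(ram, 3, lambda argument: argument)
```

With this layout, `function_address(3, 0x7B)` is 37B(hex), matching the example in the text.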
Each bit of the direction flag, the highest byte, at address 7FF(hex), determines where the input back into the neural network will come from. If the corresponding bit is 0, the IO-cell will calculate an activation function, but if the bit is 1, the IO-cell will read in from its input buffer in RAM. The input buffer address has its lower three bits equal to the IO-cell number, its upper three bits set to 1 (for page 7), and the other bits set to 0. The output buffer has a similar address, except that the fourth lowest bit is 1. For example, IO-cell 3 has its input buffer at 703(hex) and its output buffer at 70B(hex).

Several control lines are used to determine when it is time for the front end computer to write into RAM. The WAIT line, when asserted, will hold the neural net chip at the beginning of its big cycle or after it has output its activations. The front end can then examine the outputs in the output buffers and write appropriate inputs into the input buffers (remembering to set the direction bit). When the WAIT line is released, the neural network will continue processing. The DEBUG line will cause the accumulators to be output every small cycle into the output buffer, and the WAIT signal will hold the neural net chip at the end of the cycle until they are read. Since the WAIT line is only checked when the chip gets to a stopping place, it can be set and latched at any point in the cycle to stop the chip at the end.

After the WAIT signal is released, there is a short period in which the neural net chip must use the RAM and the front end should keep off (usually 14 clock cycles, except in debug mode). After this period there is a long time during the big cycle in which the RAM is not used and the front end can read the outputs and write the inputs. The neural net will need the RAM again when the bandwidth counter gets to zero and it outputs END_OF_SMALL_CYCLE. The "front end" refers to the conventional computer controlling the neural network system.

Debug mode.
There is a line external to the chip which, when set, puts the chip in debug mode. In debug mode, the chip will output all the activations into the buffers in RAM on each small cycle.

Miscellaneous Issues.

Several assorted issues arose in the design.

Some types of neural network architectures use clamped input cells, while others set them initially and allow them to vary. The clamped input can be handled either by making the activation function a fixed value or by having the self-activation, with a weight of unity, be the only contributing value, with an identity activation function. A clamped input can also be maintained by always setting the direction bit so that the activation comes from the RAM input buffer. The input can then be released by resetting the direction bit to read from the activation function.

The self-activation is needed in some networks which allow a node to add activations from itself. It is equivalent to half of a standard S-cell, with a single weight stored. The current activation in a cell is multiplied by this weight and added to its accumulator.

The internal bus on the chip is a shared resource, which simplifies the communication to the RAM and allows an eight-bit parallel data flow, much faster and simpler than serial. Each IO-cell is addressed in sequence and told to put data on or take data off the bus. The sequence, a quick burst of data to the RAM, leaves the RAM free for the front end to examine.

The multiplier may be improved. It now takes a large amount of chip space, as a shift-and-add design with a full parallel adder with carry save. Many other types of multipliers use a recoding technique to increase speed, probably too complicated an approach here. The many speedup techniques which have varying computation times (such as shifting over zeroes) cause problems for parallel timing. The current multiplier design makes a reasonable compromise for speed, such that the multiplication time is now comparable to the communication time.
There is a test for the convergence of the neural network in which all the activations are compared with their previous values. There was some concern about whether this worked with both clamped and unclamped inputs.

It is important that it be possible to add more chips to the network at the top of the columns (adding more S-cells and bandwidth), as well as adjacent to one another (adding more IO-cells and nodes). The CMOS VLSI design required a single chip (only one chip design was to be made) to handle vertical expansion as well, which would waste the IO-cells in the stacked chips. Since the programmable gate array design has two separate chips for IO-cells and S-cells, this expansion is not as wasteful.

The most important simplifying design decision has been to use no pin multiplexing between chips. Although many communication channels are needed (three lines for each S-cell on the top row of a chip, and two lines for each on the side), the chips used generally have enough pins to handle them using one pin per line.

Media.

Some time was spent before deciding to use a field reprogrammable gate array for the initial design instead of a CMOS VLSI design (the initial assumption). The implementation at the gate level is different for different media, but most of the design is the same. CMOS VLSI allows for a transmission gate design that does not correspond to a gate implementation. The transmission gate can make a simpler adder circuit and a simpler multiplexer. The CMOS design may be cheaper and faster and have a higher density (more elements on a chip), but it has no room for error, since mistakes cannot be fixed. A gate array will not allow for the same number of cells on one chip, but it allows for better testing, a shorter design cycle time, and, if erasable, very small risk from design error. The decision to initially use programmable gate arrays seemed clear. It is likely that a later version will call for a custom VLSI chip.

Conclusion.

The design is a very straightforward and logical general design for a neural network. The chip has a high probability of working as a general neural network.

References.

[1] Max H. Garzon, Stanley P. Franklin, William Baggett, William S. Boyd, Jr., and Dinah Dickerson. "Design and Testing of a General-Purpose Neurocomputer." Journal of Parallel and Distributed Computing 14, 203-220 (1992).

Appendix A.

The timing control description for the Amniac design.

Big Cycle, for each Activation:

1.  The front end processor asserts a WAIT signal.
    The neural net does nothing while the front end
        loads the RAM with the activations (initially),
        writes to the RAM input buffers,
        reads the RAM output buffers,
        writes the direction buffer.
    While WAIT is asserted, stay in step 1.

    Read in Direction byte. Each bit tells whether this will be a new
    value for activation or a lookup into RAM, see text.
2.  RAM Address = DirectionByte (= 7FF(hex))
3.  Read in DirectionByte.

    Read in Activation functions and inputs.
    Load IOCounter, IOC=5.
4.  If Direction = 0, RAM Address = IOC*256 + Accumulator
        (page IOC, offset Accum: Activation function)
    If Direction = 1, RAM Address = 7*256 + IOC
        (page 7, offset IOC: Input buffer)
5.  Read back in the new activation value.
    While --IOCounter > 0 go to step 4. (note: predecrement notation)
6.  Reset the Accumulator. Initialize the token. Clear Addend.

    Shift in the new Activations from the Activation buffers, and check
    if the activation has changed.
7.  For Bitcount=0 to 7
        Shift one bit to Activation from the Activation Buffer
        (simultaneous for all IO-cells)
    next

Small Cycle, for each level of bandwidth:

8,9. Advance tokens along activation bus. (It's 2 steps)
10. For Bitcount=0 to 7
        Shift in Activation for 8 bits (as in 6, 7),
        forward and backward Activation.
        FBus from S-cell shifts into Forward Activation Buffer.
        BBus from S-cell shifts into Backward Activation Buffer.
        Shift up Activation.
    next
11. For Bitcount=0 to 7
        Shift in Weights for 8 bits (as in 6, 7),
        forward and backward weights.
        FBus shifts straight into the Addend.
        BBus shifts into the Backward Weight Buffer (BWB).
    next

    Do the Multiplication.
12. For Bitcount=0 to 7
        Add the Addend to the Store (Accumulator), with saved carries,
        according to Multiplier (last bit of FAB).
        Shift Addend with sign extend.
        Shift Multiplier.
        On 8th cycle, load BWB into Addend.
13. Repeat as in step 12.
14. Clear Addend. Propagate Carry.
15. Tell the Bandwidth Counter we're done with this layer.
    If DEBUG, go to step 16.
    If (not END_OF_BANDWIDTH)
        Bandwidth Counter says not end of activation; go to step 8.

END of Small Cycle

    Output Activations to output buffers.
16. For IOCounter=5 to 0
        RAM Address = 7*256 + 8 + IOCounter
        (page 7, offset 8+IOCounter: Output buffer)
17.     Write out accumulator indicated by IOCounter.
    next
    If (DEBUG) go to step 19.
18. Send DONE_WITH_BIG_CYCLE to front end.
    Go to step 1.
19. (Waiting for Debug)
    While WAIT is set, stay in step 19;
    else go to step 1.

END of BIG Cycle
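
The RAM address selection in steps 4 and 16 can be sketched in a few lines. This is a behavioral illustration only, with hypothetical function names. It assumes that bit k of the direction byte belongs to IO-cell k; the text says only that each cell has a corresponding bit.

```python
PAGE_SIZE = 256
BUFFER_PAGE = 7              # page 7 holds the input and output buffers
DIRECTION_ADDRESS = 0x7FF    # the direction byte is the highest byte in RAM


def readback_address(ioc, direction_byte, accumulator_top):
    """Step 4: choose where the new activation for IO-cell `ioc` is read from.

    Assumption: bit `ioc` of the direction byte belongs to IO-cell `ioc`.
    """
    if (direction_byte >> ioc) & 1:
        # Direction = 1: read the front end's value from the input buffer.
        return BUFFER_PAGE * PAGE_SIZE + ioc
    # Direction = 0: look up the activation function on the cell's own page.
    return ioc * PAGE_SIZE + accumulator_top


def output_address(ioc):
    """Step 16: output buffer address (page 7, offset 8 + IOCounter)."""
    return BUFFER_PAGE * PAGE_SIZE + 8 + ioc
```

For IO-cell 3 these give 703(hex) for the input buffer and 70B(hex) for the output buffer, in agreement with the Interfacing section.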