Harp: A Low-Cost 25 MIPS Digital Processor

Abstract: Harp is an ultra-high performance processor motivated by the low-level processing requirements of machine perception tasks. It is designed for extreme speed, low cost, easy producibility, simplicity of structure and programming, and complete diagnosability. This paper discusses the design philosophy, architecture, instruction set, and performance of Harp.


I. Introduction
The real-time operation of algorithms that arise in speech and vision research is often limited by the speed of the low-level signal handling algorithms, which in turn are typically limited by the computation time of a critical inner loop. To speed up these algorithms, dedicated special-purpose hardware or very fast auxiliary processors have been used to implement the critical code.
In signal processing problems, and especially with the FFT, good results have been obtained with a special machine architecture relying heavily on instruction-cycle overlap (pipelining) and on multiple parallel arithmetic units capable of performing certain complex operations, notably the FFT butterfly-multiply, very efficiently [FDP] [AP-120B]. These processors are almost invariably both expensive to build and difficult to program, due to their complex structure and parallelism. Their computational power on suitable tasks has nevertheless made them useful.
At Carnegie-Mellon University we are investigating artificial intelligence algorithms which, although they deal directly with input or generated signals, do not fall entirely within the realm of conventional "signal processing". Floating point and complex arithmetic are not usually required, and while the FFT is used in some problems, most algorithms cannot consistently utilize such special features as butterfly-multiply hardware efficiently. Low precision integers may be used for most computations; often only small bit fields are manipulated, as in vision tasks where pixels may be as small as 4 bits. When signal processing is done, multiple precision is as satisfactory as floating point. Considerable logical manipulation and decision-making capability are also called for. We have designed a new processor, Harp, which suits the needs of our research and is considerably more general-purpose in its orientation than the signal processors.

II. Architecture
Harp is an auxiliary processor to a host PDP-11. It is a "pure" 16-bit machine: instructions, data paths, and ALU are all 16 bits wide; there are no "byte" instructions or packed data representations.
The Harp processor operates from two small, very high speed memories: the data and instruction working stores. There are no data registers or accumulators in Harp.
Harp has the following design goals, some of which are in keeping with the signal-processor approach, some not:
A. Extremely high speed (under 40 ns cycle).
B. Low cost ($5000-$10,000 parts).
The working store contents can be transferred to or from a large buffer memory via a block transfer mechanism. The transfer rate of 0.8 Gbps (20 ns per word) avoids bottlenecking the fast processing of Harp. Buffer memory is expandable from 4K to 64K and is double-ported, one port permitting high speed transfers to the Harp working stores, the other compatible with a PDP-11 UNIBUS.
No single-word direct access to buffer memory is provided for several reasons: 1) this memory is interleaved, and the delay for initiating access is five times greater than the average transfer time; 2) allowing access on a cycle-to-cycle basis would have slowed Harp's execution rate considerably; 3) two words of working store are taken up by addresses for each reference to buffer memory.
The block transfer mechanism is used to overlay programs too large to fit in the available instruction memory. Block transfers are fast enough to make frequent overlays acceptable in many situations: a 64-word transfer takes less than 1.5 µs.
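The transfer figures quoted above can be checked with a short calculation. The Python sketch below is only an illustration: it assumes the start-up delay equals exactly five word times (one reading of "five times greater than the average transfer time") and that words then stream at 20 ns each. The constant and function names are hypothetical and do not come from the Harp design.

```python
WORD_BITS = 16        # Harp word width, from the text
WORD_TIME_NS = 20     # average transfer time per word, from the text
STARTUP_WORDS = 5     # ASSUMPTION: start-up delay of five word times


def transfer_rate_gbps():
    """Sustained block transfer rate; bits per nanosecond equals Gbit/s."""
    return WORD_BITS / WORD_TIME_NS


def block_transfer_ns(words):
    """Estimated time for one block transfer, including the assumed start-up delay."""
    return STARTUP_WORDS * WORD_TIME_NS + words * WORD_TIME_NS


if __name__ == "__main__":
    print("rate (Gbit/s):", transfer_rate_gbps())
    print("64-word transfer (ns):", block_transfer_ns(64))
```

Under these assumptions a 64-word transfer comes to 1380 ns, which is consistent with the stated bound of less than 1.5 microseconds.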
Since Harp is intended to operate in conjunction with a host computer, no input or output capability is provided other than the buffer memory connection to the UNIBUS. The UNIBUS bandwidth is sufficient for most I/O to real-time devices and mass storage. Since the relatively slow PDP-11 coordinates these transfers to the buffer memory, no hardware interrupt capability is included in Harp.

III. Instruction Set
Two-address instructions are advantageous in high speed processing, since one such instruction can accomplish as much as two or three single-address instructions. A special set of instructions disposes of the output of the separate multiplication processor, which performs 16x16 bit multiplications and retains the 32-bit product.
Instructions are provided to store the two halves of this result in data memory and to add them to data memory (to accumulate a double-precision inner product). The multiplication processor operates in parallel with the central processor but receives instructions and operands from it. The 16x16 multiply takes 80 ns to complete, so a program must execute at least one instruction after the MUL before accessing the product.
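The double-precision accumulation described above can be modeled in a few lines. The Python sketch below is a behavioral illustration only, not Harp code: it assumes two's-complement signed operands and assumes that the carry out of the low-half addition is propagated into the high half. The text does not state either detail, and all function names are hypothetical.

```python
MASK16 = 0xFFFF


def to_signed16(word):
    """Interpret a 16-bit word as a two's-complement integer (assumed format)."""
    return word - 0x10000 if word & 0x8000 else word


def mul16(a, b):
    """16x16 multiply; returns the (high, low) 16-bit halves of the 32-bit product."""
    product = (to_signed16(a) * to_signed16(b)) & 0xFFFFFFFF
    return (product >> 16) & MASK16, product & MASK16


def accumulate(mem, hi_addr, lo_addr, hi, lo):
    """Add both product halves into two data words, carrying low into high."""
    low_sum = mem[lo_addr] + lo
    mem[lo_addr] = low_sum & MASK16
    carry = low_sum >> 16  # ASSUMPTION: carry links the two half-word adds
    mem[hi_addr] = (mem[hi_addr] + hi + carry) & MASK16


def inner_product(xs, ys):
    """Accumulate sum(x*y) as a 32-bit value held in two 16-bit words."""
    mem = [0, 0]  # mem[0] = high half, mem[1] = low half
    for x, y in zip(xs, ys):
        hi, lo = mul16(x & MASK16, y & MASK16)
        accumulate(mem, 0, 1, hi, lo)
    total = (mem[0] << 16) | mem[1]
    return total - (1 << 32) if total & (1 << 31) else total
```

For example, `inner_product([1000, -2000, 300], [30000, 123, -7])` agrees with the ordinary integer sum of products as long as the result fits in 32 bits. On the real machine, the 80 ns multiply latency means at least one other instruction would sit between the MUL and the first use of the product, which this model does not attempt to represent.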
Though Harp is a pipelined machine, the pipe is invisible to all instructions except explicit state-register references. Branch lookahead hardware avoids delay while allowing natural instruction sequencing. Programmers need not be concerned with such complications as "clearing the pipe on a branch" which plague most signal processors.
simultaneously without the need for extender boards. Although the cross consumes considerable packaging volume, it is rackmountable at 17 x 17 x 12 inches.
Double-sided PC construction was chosen over wire-wrap because of cost, signal fidelity, and reproducibility. Multilayer PC, with its longer prototype turnaround time and greater expense, is not justified by the present circuit density.

VI. General-Purpose Capabilities
The designers of Harp began with a view of creating a "functionally specialized architecture" for the problems encountered in speech and vision research. As we progressed it became clear that the only relevant common characteristics of the tasks are 1) a fairly large amount of processing on each datum, and 2) small items, usually 4 to 12 bits of precision. The first observation is expressed in the size of the data working store: 64 words is large if considered as "registers", but small for a "main memory"; as "working store", it is well matched to Harp's tasks. The second led to the choice of small (16 bit) integers as the data type. The resulting design looks not at all like a "specialized" processor, but more like a clean vertically microcoded minicomputer with residual control. Notably, at least one recent signal processor design is billed by its builders as basically a "high performance minicomputer" [LDVT].
We expect Harp to be so cost-effective that it will be attractive whenever more processing power is needed on a PDP-11 system or similar minicomputer.

VII. Performance
The simulated performance of Harp on 8 representative tasks is shown in Figure 5, compared to the performance of two conventional computers on the same tasks.
The effectiveness of the Harp architecture and instruction set design can be seen in Figure 4, which shows frequencies of use of various machine features.
Each field selects a pair of addressing modes that a source or destination operand can choose from. The high bit of the instruction address field selects one of the pair. The addressing modes are:
The low 6 bits directly address the operand (D).
The low 6 bits are added to index register 1 (X1) to obtain the address.
The low 6 bits are indexed with index register 2 (X2).
The low 6 bits are indexed with X1, which is then incremented.
The low 6 bits are indexed with X2, which is then incremented.
The low 6 bits are indexed with X1, which is then decremented.
The low 6 bits are indexed with X2, which is then decremented.
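The addressing modes listed above can be summarized as an effective-address calculation. The Python sketch below is a hedged illustration: the mode names ("direct", "x1", "x1+", "x2-", and so on) are hypothetical labels chosen here, and wrapping addresses and index registers modulo 64 (the size of the data working store) is an assumption that the text does not confirm.

```python
ADDR_MASK = 0x3F  # 6-bit addresses into a 64-word working store


def effective_address(mode, low6, index_regs):
    """Return the operand address and apply any index register side effect.

    mode       -- hypothetical label: "direct", "x1", "x2", "x1+", "x2+", "x1-", "x2-"
    low6       -- low 6 bits of the instruction address field
    index_regs -- dict {1: value, 2: value}, updated in place
    """
    if mode == "direct":
        return low6 & ADDR_MASK
    reg = int(mode[1])  # which index register: 1 or 2
    # ASSUMPTION: the sum wraps within the 64-word store
    address = (low6 + index_regs[reg]) & ADDR_MASK
    if mode.endswith("+"):    # address is formed first, then the register is incremented
        index_regs[reg] = (index_regs[reg] + 1) & ADDR_MASK
    elif mode.endswith("-"):  # address is formed first, then the register is decremented
        index_regs[reg] = (index_regs[reg] - 1) & ADDR_MASK
    return address


def walk(base, count, index_regs):
    """Visit `count` consecutive words starting at base + X1 using post-increment."""
    return [effective_address("x1+", base, index_regs) for _ in range(count)]
```

With X1 initially 0, `walk(8, 4, regs)` yields addresses 8 through 11 and leaves X1 at 4, showing how a post-increment mode steps through an array without a separate index update instruction.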