FOUR CAPACITY MODELS FOR COARSE-CODED SYMBOL MEMORIES

Abstract

Coarse-coded symbol memories have appeared in several neural network symbol processing models. In order to determine how these models would scale, one must first have some understanding of the mathematics of coarse-coded representations. We define the general structure of coarse-coded symbol memories, and discuss their strengths and weaknesses. Memory schemes can be characterized by their memory size, symbol-set size and capacity. We derive mathematical relationships between these parameters for various memory schemes, using both analysis and numerical methods. Finally, we compare the predicted capacity of one of the schemes with actual measurements of the coarse-coded working memory of DCPS, Touretzky and Hinton's distributed connectionist production system.


Introduction
A distributed representation is a memory scheme in which each entity (concept, symbol) is represented by a pattern of activity over many units [3]. If each unit participates in the representation of many entities, it is said to be coarsely tuned, and the memory itself is called a coarse-coded memory.
Coarse-coded memories have been used for storing symbols in several neural network symbol processing models, such as Touretzky and Hinton's distributed connectionist production system DCPS [7,8], Touretzky's distributed implementation of Lisp S-expressions on a Boltzmann machine, BoltzCONS [9,10], and St. John and McClelland's PDP model of case role defaults [5]. In all of these models, memory capacity was measured empirically and parameters were adjusted by trial and error to obtain the desired behavior. We are now able to give a mathematical foundation to these experiments by analyzing the relationships among the fundamental memory parameters.
There are several paradigms for coarse-coded memories. In a feature-based representation, each unit stands for some semantic feature. Binary units can code features with binary values, whereas more complicated units or groups of units are required to code more complicated features, such as multi-valued properties or numerical values from a continuous scale. The units that form the representation of a concept define an intersection of features that constitutes that concept. Similarity between concepts composed of binary features can be measured by the Hamming distance between their representations. In a neural network implementation, relationships between concepts are implemented via connections among the units forming their representations. Certain types of generalization phenomena thereby emerge automatically.
A different paradigm is used when representing points in a multidimensional continuous space [2,3]. Each unit encodes values in some subset of the space. Typically the subsets are hypercubes or hyperspheres, but they may be more coarsely tuned along some dimensions than others [1]. The point to be represented is in the subspace formed by the intersection of all active units. As more units are turned on, the accuracy of the representation improves. The density and degree of overlap of the units' receptive fields determine the system's resolution [6].
Yet another paradigm for coarse-coded memories, and the one we will deal with exclusively, does not involve features. Each concept, or symbol, is represented by an arbitrary subset of the units, called its pattern. Unlike in feature-based representations, the units in the pattern bear no relationship to the meaning of the symbol represented. A symbol is stored in memory by turning on all the units in its pattern. A symbol is deemed present if all the units in its pattern are active.¹ The receptive field of each unit is defined as the set of all symbols in whose pattern it participates. We call such memories coarse-coded symbol memories (CCSMs). We use the term "symbol" instead of "concept" to emphasize that the internal structure of the entity to be represented is not involved in its representation. In CCSMs, a short Hamming distance between two symbols does not imply semantic similarity, and is in general an undesirable phenomenon.
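The storage and presence operations just described are easy to state in code. The sketch below is a minimal illustration with hypothetical names (`CCSM`, `store`, `present`), not an implementation taken from any of the cited models:

```python
class CCSM:
    """Minimal coarse-coded symbol memory: N binary units, plus a
    memory scheme mapping each symbol to an arbitrary subset of them
    (its pattern)."""

    def __init__(self, n_units, patterns):
        self.n_units = n_units
        self.patterns = patterns      # symbol -> set of unit indices
        self.active = set()           # units currently turned on

    def store(self, symbol):
        # storing a symbol turns on all the units in its pattern
        self.active |= self.patterns[symbol]

    def present(self, symbol):
        # a symbol is deemed present iff every unit in its pattern is on
        return self.patterns[symbol] <= self.active

    def receptive_field(self, unit):
        # the set of all symbols in whose pattern this unit participates
        return {s for s, p in self.patterns.items() if unit in p}
```

Note that `present` can return True for a symbol that was never stored; that is exactly the ghost phenomenon discussed below.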
Coarse-coded symbol memories can be further classified by the degree to which they are structured. In a completely unstructured CCSM, any subset of the units is a legitimate candidate for representing a symbol. A structured CCSM, on the other hand, imposes restrictions on the class of patterns that may be used. These restrictions can be articulated in terms of the patterns themselves or in terms of constraints on the receptive fields of the units. Some constraints are very simple, e.g., that all patterns be of the same size. Touretzky and Hinton's DCPS [7] is an example of a CCSM with more complex constraints; BoltzCONS [9] uses the same memory scheme. In these systems' working memory, the receptive field of each unit is a three-dimensional cartesian space, randomly chosen from a common superspace. Imposing structure (i.e., constraints) on the receptive fields might be expected to reduce the capacity of the memory. When we measured this effect for DCPS by comparing its memory capacity to that of similar non-structured CCSMs, we found the actual penalty to be slight.
CCSMs can be very efficient for implementing large, sparse memories. By "large" we mean memories that are capable of representing many distinct symbols, and by "sparse" we mean that only a small fraction of these symbols will be simultaneously present in the memory. An extreme localist representation, in which each symbol is encoded by one unit and each unit is dedicated to encoding a single symbol, is very inefficient in such cases. For a given number of symbols α, a localist representation requires exactly α units, whereas a CCSM can make do with far fewer than that. Alternatively, the advantage can be recast in terms of representational power: given N units, a localist representation can represent exactly N symbols, whereas a CCSM can potentially handle many more. The efficiency with which CCSMs handle sparse memories is the major reason they have been used in many connectionist systems, and hence the major reason for studying them here.
The unit-sharing strategy that gives rise to efficient encoding in CCSMs is also the source of their major weakness. Symbols share units with other symbols. As more symbols are stored, more and more of the units are turned on. At some point, some symbol may be deemed present in memory because all of its units are turned on, even though it was not explicitly stored: a "ghost" is born. Ghosts are an unwanted phenomenon arising out of the overlap among the representations of the various symbols. The emergence of ghosts marks the limits of the system's capacity: the number of symbols it can store simultaneously and reliably.
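A tiny worked example makes the failure mode concrete. The three patterns below are made up for illustration:

```python
# With patterns A={0,1}, B={2,3}, C={0,2}, storing A and B activates
# units {0,1,2,3}; C's pattern is then fully covered although C was
# never stored, so C emerges as a ghost.
patterns = {'A': {0, 1}, 'B': {2, 3}, 'C': {0, 2}}
stored = ('A', 'B')

active = set()
for sym in stored:
    active |= patterns[sym]

ghosts = [s for s, p in patterns.items()
          if p <= active and s not in stored]
print(ghosts)      # -> ['C']
```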

Definitions and Fundamental Parameters
A coarse-coded symbol memory in its most general form consists of:

• A set of N binary state units.
• An alphabet of α symbols to be represented. Symbols in this context are atomic entities: they have no constituent structure.
• A memory scheme, which is a function that maps each symbol to a subset of the units, called its pattern. The receptive field of a unit is defined as the set of all symbols to whose pattern it belongs (see Figure 1). The exact nature of the memory scheme mapping determines the properties of the memory, and is the central target of our investigation.
As symbols are stored, the memory fills up and ghosts eventually appear. It is not possible to detect a ghost simply by inspecting the contents of memory, since there is no general way of distinguishing a symbol that was stored from one that emerged out of overlaps with other symbols. (It is sometimes possible, however, to conclude that there are no ghosts. This is true when every symbol that is visible in memory has at least one unit that is not shared with any other visible symbol.) Furthermore, a symbol that emerged as a ghost at one time may not be a ghost at a later time if it was subsequently stored into memory. Thus the definition of a ghost depends not only on the state of the memory but also on its history.
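The parenthetical sufficient condition above can be checked mechanically: if every visible symbol owns at least one active unit shared with no other visible symbol, that unit can only have been turned on by storing the symbol itself, so no visible symbol is a ghost. A sketch, with hypothetical names:

```python
def certainly_no_ghosts(patterns, active):
    # Sufficient (not necessary) test: returns True only when every
    # visible symbol has a "private" unit shared with no other visible
    # symbol. A False result is inconclusive.
    visible = [s for s, p in patterns.items() if p <= active]
    for s in visible:
        others = set().union(*(patterns[t] for t in visible if t != s))
        if not (patterns[s] - others):
            return False       # s has no private unit; a ghost is possible
    return True
```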
Some memory schemes guarantee that no ghost will emerge as long as the number of symbols stored does not exceed some specified limit. In other schemes, the emergence of ghosts is an ever-present possibility, but its probability can be kept arbitrarily low by adjusting other parameters. We analyze systems of both types. First, two more bits of notation need to be introduced:

P_ghost: Probability of a ghost. The probability that at least one ghost will appear after some number of symbols have been stored.

k: Capacity. The maximum number of symbols that can be stored simultaneously before the probability of a ghost exceeds a specified threshold. If the threshold is 0, we say that the capacity is guaranteed.
A localist representation, where every symbol is represented by a single unit and every unit is dedicated to the representation of a single symbol, can now be viewed as a special case of coarse-coded memory, with k = N = α and P_ghost = 0. Localist representations are well suited for memories that are not sparse; in these cases, coarse-coded memories are at a disadvantage. In designing coarse-coded symbol memories we are interested in cases where k ≪ N ≪ α. The permissible probability of a ghost in these systems should be low enough that its impact can be ignored, i.e., P_ghost ≪ 1. Our task is to find memory schemes that maximize the number of symbols α and the capacity k while minimizing N, the number of units required. We are also interested in the tradeoff between α and k for a fixed N. We present four memory schemes, and analyze each of them in terms of the mathematical relationship among N, α, k and P_ghost.

Analysis of Four Memory Schemes

Bounded Overlap (guaranteed capacity)
If we want to construct the memory scheme with the largest possible α (given N and k) while guaranteeing P_ghost = 0, the problem can be stated formally as: Given a set of size N, find the largest collection of subsets of it such that no union of k such subsets subsumes any other subset in the collection. This is a well-known problem in Coding Theory, in slight disguise. Unfortunately, no complete analytical solution is known. We therefore simplify our task and consider only systems in which all symbols are represented by the same number of units (i.e., all patterns are of the same size). In mathematical terms, we restrict ourselves to constant-weight codes. The problem then becomes: Given a set of size N, find the largest collection of subsets of size exactly L such that no union of k such subsets subsumes any other subset in the collection.
We wish to provide two arguments in support of this simplification. First, we believe it does not significantly reduce the size of the collection. This is because the solution to the original problem is likely to be composed of subsets of similar size. This can be seen by considering the effect too small or too large a subset would have on the capacity of the system. An unusually small subset will have a very high tendency to become a ghost, whereas an unusually large subset will have a high tendency to create one.
The second argument is a pragmatic one. In order for coarse-coded memories to be useful, they need to be accessed by some external mechanism. One such mechanism is the clause space of DCPS. Clause spaces use lateral inhibition to extract a single stored symbol from a coarse-coded memory. This competitive mechanism works best when patterns are of uniform size.
There are no known complete analytical solutions for the size of the largest collection of patterns even when the patterns are of a fixed size. Nor is any efficient procedure for constructing such a collection known. We therefore simplify the problem further. We now restrict our consideration to patterns whose pairwise overlap is bounded by a given number. For a given pattern size L and desired capacity k, we require that no two patterns overlap in more than m units, where

	m = ⌊(L - 1)/k⌋,

so that the k stored patterns can cover at most k·m ≤ L - 1 units of any unstored pattern. Since no set of m + 1 units may then appear in more than one pattern, the number of symbols α_bo representable under bounded overlap satisfies

	α_bo ≤ C(N, m+1) / C(L, m+1)

(recall that m is a function of L and k), and maximizing over L gives

	α_bo ≤ e^{0.367·(N/k)}.	(3)

Thus the upper bound we derived depicts a simple exponential relationship between α and N/k. Next, we try to construct memory schemes of this type. A Common Lisp program using a modified depth-first search constructed memory schemes for various parameter values, whose α's came within 80% to 90% of the upper bound. These results are far from conclusive, however, since only a small portion of the parameter space was tested.
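The quantities in this construction are straightforward to compute. The sketch below assumes the overlap bound m = ⌊(L-1)/k⌋ and the counting bound α ≤ C(N, m+1)/C(L, m+1), and substitutes a simple greedy filter of our own for the paper's Common Lisp depth-first search:

```python
import random
from math import comb

def overlap_bound(L, k):
    # largest pairwise overlap m that still guarantees P_ghost = 0:
    # k stored patterns cover at most k*m units of an unstored pattern,
    # so we need k*m < L
    return (L - 1) // k

def alpha_upper_bound(N, L, k):
    # no (m+1)-subset of units may appear in two patterns, hence
    # alpha <= C(N, m+1) / C(L, m+1)
    m = overlap_bound(L, k)
    return comb(N, m + 1) // comb(L, m + 1)

def greedy_bounded_overlap(N, L, k, tries=5000, seed=0):
    # keep a random size-L pattern only if it overlaps every kept
    # pattern in at most m units
    rng = random.Random(seed)
    m = overlap_bound(L, k)
    kept = []
    for _ in range(tries):
        cand = frozenset(rng.sample(range(N), L))
        if all(len(cand & p) <= m for p in kept):
            kept.append(cand)
    return kept
```

The greedy filter will generally stop well short of the upper bound, which is consistent with the gap the depth-first search also leaves.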
In evaluating the viability of this approach, its apparent optimality should be contrasted with two major weaknesses. First, this type of memory scheme is hard to construct computationally. It took our program several minutes of CPU time on a Symbolics 3600 to produce reasonable solutions for cases like N = 200, k = 5, m = 1, with an exponential increase in computing time for larger values of m. Second, if CCSMs are used as models of memory in naturally evolving systems (such as the brain), this approach places too great a burden on developmental mechanisms.
The importance of the bounded overlap approach lies mainly in its role as an upper bound for all possible memory schemes, subject to the simplifications made earlier. All schemes with guaranteed capacities (P_ghost = 0) can be measured relative to Equation 3.

Random Fixed Size Patterns (a stochastic approach)
Randomly produced memory schemes are easy to implement and are attractive because of their naturalness. However, if the patterns of two symbols coincide, the guaranteed capacity will be zero (storing one of these symbols will render the other a ghost). We therefore abandon the goal of guaranteeing a certain capacity, and instead establish a tolerance level for ghosts, P_ghost. For large enough memories, where stochastic behavior is more robust, we may expect reasonable capacity even with very small P_ghost.

In the first stochastic approach we analyze, patterns are randomly selected subsets of a fixed size L. Unlike in the previous approach, choosing k does not bound α. We may define as many symbols as we wish, although at the cost of an increased probability of a ghost (or, alternatively, a decreased capacity). The probability of a ghost appearing after k symbols have been stored is given by Equation 4:

	P_ghost = 1 - Σ_c T_{N,L}(k, c) · [1 - C(c, L)/C(N, L)]^{α-k},	(4)

where T_{N,L}(k, c) is the probability that exactly c units will be active after k symbols have been stored. It is defined recursively by Equation 5:

	T_{N,L}(0, c) = 1 if c = 0, else 0;
	T_{N,L}(k, c) = Σ_j T_{N,L}(k-1, c-j) · C(N-(c-j), j) · C(c-j, L-j) / C(N, L),	(5)

where j ranges over the number of previously inactive units turned on by the k-th pattern.
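Equations 4 and 5 can be evaluated with a small dynamic program over the table T_{N,L}(k, c); the sketch below (our own code, with hypothetical names) keeps the distribution of active-unit counts as a dictionary:

```python
from math import comb

def p_ghost_exact(N, L, k, alpha):
    # T[c]: probability that exactly c units are active after storing
    # the symbols so far (the table T_{N,L}(k, c), Equation 5)
    T = {0: 1.0}
    total = comb(N, L)
    for _ in range(k):
        nxt = {}
        for c, p in T.items():
            # the next stored pattern turns on j fresh units and
            # re-uses L-j already-active ones
            for j in range(max(0, L - c), min(L, N - c) + 1):
                w = comb(N - c, j) * comb(c, L - j) / total
                nxt[c + j] = nxt.get(c + j, 0.0) + p * w
        T = nxt
    # Equation 4: a fixed unstored symbol is a ghost iff its random
    # pattern lies inside the c active units; the alpha-k unstored
    # patterns are drawn independently
    no_ghost = 0.0
    for c, p in T.items():
        p_cover = comb(c, L) / total if c >= L else 0.0
        no_ghost += p * (1.0 - p_cover) ** (alpha - k)
    return 1.0 - no_ghost
```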

The optimal pattern size L for fixed values of N, k, and α can be determined by binary search on Equation 4, since P_ghost(L) has exactly one minimum in the interval [1, N]. However, this may be expensive for large N. A computational shortcut can be achieved by estimating the optimal L and searching in a small interval around it. A good initial estimate is derived by replacing the summation in Equation 4 with a single term involving E[c], the expected value of the number of active units after k symbols have been stored.
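The shortcut can be sketched as a direct scan over L, replacing the summation of Equation 4 by a single term evaluated at E[c] (our own code; a real implementation would search only a small interval around the estimate):

```python
from math import comb

def estimate_optimal_L(N, k, alpha):
    # approximate P_ghost by a single term of Equation 4 taken at
    # E[c] = N * (1 - (1 - L/N)**k), then pick the best L
    best_L, best_p = 1, 2.0
    for L in range(1, N + 1):
        ec = round(N * (1.0 - (1.0 - L / N) ** k))
        p_cover = comb(ec, L) / comb(N, L) if ec >= L else 0.0
        p_ghost = 1.0 - (1.0 - p_cover) ** (alpha - k)
        if p_ghost < best_p:
            best_L, best_p = L, p_ghost
    return best_L
```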
The latter can be expressed as:

	E[c] = N · [1 - (1 - L/N)^k].	(6)

An alternative formula, developed by Joseph Tebelskis, produces very good approximations to Equation 4 and is much more efficient to compute. After storing k symbols in memory, the probability P_x that a single arbitrary symbol x has become a ghost is given by:

	P_x = Σ_{i=0}^{L} (-1)^i · C(L, i) · [C(N-i, L)/C(N, L)]^k.

If we now assume that each symbol's P_x is independent of that of any other symbol, we obtain:

	P_ghost ≈ 1 - (1 - P_x)^{α-k}.	(7)
This assumption of independence is not strictly true, but the relative error was less than 0.1% for the parameter ranges we considered, when P_ghost was no greater than 0.01. We computed the two-dimensional table T_{N,L}(k, c) for a wide range of (N, L) values (70 < N < 1000, 7 < L < 43), and produced graphs of the relationships between N, k, α, and P_ghost for optimum pattern sizes, as determined by the search described above. For P_ghost = 0.01, these results are well fit by

	α_rfp ≈ 0.0086 · e^{0.468·(N/k)}.	(8)
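The excerpt does not preserve Tebelskis's formula itself; the inclusion-exclusion expression below is one exact derivation of P_x for this model (our reconstruction), with Equation 7 supplying the independence approximation:

```python
from math import comb

def tebelskis_p_ghost(N, L, k, alpha):
    total = comb(N, L)
    # P_x: probability that one fixed unstored symbol's random size-L
    # pattern is entirely covered by the k stored patterns, computed by
    # inclusion-exclusion over the i units that stay uncovered
    p_x = sum((-1) ** i * comb(L, i) * (comb(N - i, L) / total) ** k
              for i in range(L + 1))
    # Equation 7: treat the alpha-k unstored symbols as independent
    return 1.0 - (1.0 - p_x) ** (alpha - k)
```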

Random Receptors (a stochastic approach)
A second stochastic approach is to have each unit assigned to each symbol with an independent fixed probability s. This method lends itself to easy mathematical analysis, resulting in a closed-form analytical solution.
After storing k symbols, the probability that a given unit is active is 1 - (1-s)^k, independent of any other unit. For a given symbol to be a ghost, every unit must either be active or else not belong to that symbol's pattern. That will happen with probability [1 - s·(1-s)^k]^N, and thus the probability of a ghost is:

	P_ghost = 1 - (1 - [1 - s·(1-s)^k]^N)^{α-k}.

Assuming P_ghost ≪ 1 and k ≪ α (both hold in our case), the expression can be simplified to:

	P_ghost ≈ α · [1 - s·(1-s)^k]^N,

from which α can be extracted:

	α_rr(N, k, s, P_ghost) = P_ghost · [1 - s·(1-s)^k]^{-N}.

We can now optimize by finding the value of s that maximizes α, given any desired upper bound on the expected value of P_ghost. This is done straightforwardly by solving dα/ds = 0. Note that s·N corresponds to L in the previous approach. The solution is s = 1/(k+1), which yields:

	α_rr = P_ghost · [1 - k^k/(k+1)^{k+1}]^{-N} = P_ghost · e^{-N·ln[1 - k^k/(k+1)^{k+1}]}.	(11)
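Equation 11, together with the optimizing choice s = 1/(k+1), can be packaged as a short function (our own sketch; the names are hypothetical):

```python
def alpha_rr(N, k, p_ghost):
    # number of representable symbols under the random-receptors scheme
    # at the optimal membership probability s = 1/(k+1)  (Equation 11)
    s = 1.0 / (k + 1)
    blocked = s * (1.0 - s) ** k      # P(a given unit vetoes a ghost)
    return p_ghost * (1.0 - blocked) ** (-N)
```

A quick numerical check confirms that s = 1/(k+1) does maximize s·(1-s)^k, which is what makes this choice of s optimal.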
A comparison of the results of the two stochastic approaches reveals an interesting similarity. For large k, with P_ghost = 0.01, the term 0.468/k of Equation 8 can be seen as a numerical approximation to the log term in Equation 11, and the multiplicative factor of 0.0086 in Equation 8 approximates P_ghost in Equation 11. This is hardly surprising, since the Law of Large Numbers implies that in the limit (N, k → ∞, with s fixed) the two methods are equivalent.
Finally, it should be noted that the stochastic approaches we analyzed generate a family of memory schemes with non-identical ghost probabilities. P_ghost in our formulas is therefore better understood as an expected value, averaged over the entire family.
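This family-averaged P_ghost can also be estimated empirically. The Monte Carlo sketch below (our own code) draws a fresh random fixed-size-pattern scheme on every trial, so the estimate averages over the family as described:

```python
import random

def p_ghost_family_mc(N, L, alpha, k, trials=1000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # a fresh scheme from the family: alpha random size-L patterns
        pats = [frozenset(rng.sample(range(N), L)) for _ in range(alpha)]
        active = set().union(*pats[:k])   # store the first k symbols
        # a trial is a hit if any unstored symbol emerged as a ghost
        if any(p <= active for p in pats[k:]):
            hits += 1
    return hits / trials
```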

Partitioned Binary Coding (a reference point)
The last memory scheme we analyze is not strictly distributed. Rather, it is somewhere in between a distributed and a localist representation, and is presented for comparison with the previous results. For a given number of units N and desired capacity k, the units are partitioned into k equal-size "slots," each consisting of N/k units (for simplicity we assume that k divides N). Each slot is capable of storing exactly one symbol.
The most efficient representation for all possible symbols that may be stored into a slot is to assign them binary codes, using the N/k units of each slot as bits. This would allow 2^{N/k} symbols to be represented. Using binary coding, however, will not give us the required capacity of 1 symbol, since binary patterns subsume one another. For example, storing the code '10110' into one of the slots will cause the codes '10010', '10100' and '00010' (as well as several other codes) to become ghosts.
A possible solution is to use only half of the bits in each slot for a binary code, and to set the other half to the binary complement of that code (we assume that N/k is even). This way, the codes are guaranteed not to subsume one another. Let α_pbc denote the number of symbols representable using a partitioned binary coding scheme. Then:

	α_pbc = 2^{N/2k} = e^{0.347·(N/k)}.	(12)

Once again, α is exponential in N/k. The form of the result closely resembles the estimated upper bound on the Bounded Overlap method given in Equation 3. There is also a strong resemblance to Equations 8 and 11, except that the fractional multiplier in front of the exponential, corresponding to P_ghost, is missing. P_ghost is 0 for the Partitioned Binary Coding method, but this is enforced by dividing the memory into disjoint sets of units rather than by adjusting the patterns to reduce overlap among symbols.
As mentioned previously, this memory scheme is not really distributed in the sense used in this paper, since there is no one pattern associated with a symbol. Instead, a symbol is represented by any one of a set of k patterns, each N/k bits long, corresponding to its appearance in one of the k slots. To check whether a symbol is present, all k slots must be examined. To store a new symbol in memory, one must scan the k slots until an empty one is found. Equation 12 should therefore be used only as a point of reference.
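Complement coding is easy to demonstrate. The sketch below (hypothetical names) encodes a slot value in b code bits plus b complement bits, so each slot uses N/k = 2b units and can represent 2^b = 2^{N/2k} possible symbols; since every code word then has exactly b active bits, no code can subsume another:

```python
def encode_slot(value, b):
    # b code bits followed by their bitwise complement (2b units total)
    assert 0 <= value < 2 ** b
    code = [(value >> i) & 1 for i in range(b)]
    return code + [1 - bit for bit in code]

def subsumes(p, q):
    # pattern p subsumes q if every active bit of q is also active in p
    return all(pb >= qb for pb, qb in zip(p, q))
```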

Measurement of DCPS
The three distributed schemes we have studied all use unstructured patterns (as discussed in the introduction), the only constraint being that patterns are at least roughly the same size. Imposing more complex structure on any of these schemes is likely to reduce the capacity somewhat. In order to quantify this effect, we measured the memory capacity of DCPS (BoltzCONS uses the same memory scheme) and compared the results with the theoretical models analyzed above.

DCPS's working memory uses N = 2000 units to represent α = 25³ = 15625 symbols. The receptive field of each unit is the cartesian product of three sets of 6 letters each, drawn from a 25-letter alphabet, so the expected pattern size is (6/25)³ · 2000 ≈ 28. Touretzky and Hinton manipulated the receptive fields as described in [8] to artificially reduce the variance from this mean. In the current implementation of DCPS, pattern sizes vary from 23 to 33, but most symbols have patterns containing 26 to 29 units; the standard deviation is only 1.5.

Figure 3 shows P_ghost as a function of k for DCPS (based on 10,000 trials) and for the Random Receptors method as estimated by Equation 11, with N = 2000 and α = 15625. The two curves are quite close. Note that when P_ghost is 0.01, we observe an actual capacity of 48 symbols for DCPS² and an expected capacity of 51 symbols for the random receptors scheme. We thus conclude that for the parameter ranges discussed here, the structure in DCPS's fixed-size receptive fields (which have been manipulated to assure nearly fixed-size patterns) results in only a slight penalty relative to the random receptors approach.

Table 1 summarizes the results obtained for the four methods analyzed. Some differences must be emphasized:

• α_bo and α_pbc deal with guaranteed capacity, whereas α_rfp and α_rr are meaningful only when P_ghost > 0.
• α_bo is only an upper bound.
• α_rfp is based on numerical estimates.

² This measurement is based on a 100% visibility criterion, which we use throughout this paper. It therefore differs from previously-reported values, where lower visibility criteria were used [8].

Summary and Discussion

The similar functional form of all the results, although not surprising, is aesthetically pleasing. Some of the functional dependencies among the various parameters can be derived informally using qualitative arguments. Only a rigorous analysis, however, can provide the definite answers that are needed for a better understanding of these systems and their scaling properties.