A Visualization Tool for Atlas Collection Assessment

A collection assessment tool is described that allows the visualization of an atlas collection in order to determine the depth, scope, and needs for the collection. Using an online tool, Many Eyes, the degree of complexity and the relationships within a collection can be explored and assessed according to the specifics of the collecting institution's own criteria. The tool provides images of the data, and it is sufficiently flexible to permit multiple analyses as needed by the user.


information about that place such as its population, economic status, and geology. It is difficult to gain a good understanding of a collection of spatial materials when they are presented in tabular form or in a list format partly because this format is essentially uni-dimensional. A list or table is organized either by objects (geographic places) or by topical information. Thus the materials are organized from top to bottom either in an order related to the space (e.g., by continents, then countries within the continent, then divisions within the country, etc.) or by the type of information (e.g., economy-related maps, then geology, then population, etc.). The full nature of a collection of atlases is obscured because one does not have a good feel for all the relationships among the materials in the collection.
Whether librarians seeking to understand their collections are new to the job or have been in their position for a number of years, it is important that they understand the collection for which they are stewards. This paper describes an approach that allows zooming in on particular areas of an atlas collection for closer inspection. A Web site called Many Eyes contains a variety of visualization tools that enable the user to assess a collection of information. With such a tool, a close inspection allows for an evaluation of the materials included in the collection. Because the visualization approach used is essentially multidimensional, it allows a more complete understanding of the relationships in the collection, both geographic and topical. One can see, for instance, that a collection has linguistic atlases from Germany or Great Britain, but does not have them from other countries.

LITERATURE REVIEW
How can one determine if a collection of atlases is both within the guidelines of the collection development policy of an institution and at the same time is recent enough that it is a useful resource? Perry and Weber (2001) have observed that "it has become near impossible . . . to evaluate the effectiveness and the adequacy of library collections." Their comment is in response to the many changes in the nature of collections in the early twenty-first century, changes such as electronic databases and journals, increased Internet use, and agreements within consortia.
Change is also a characteristic that can make the situation more complicated in geographic information collections. For example, maps and atlases from twenty years ago are insufficient to reflect changes in eastern Europe and the breakup of the Soviet Union. These are not the kinds of changes that user statistics, such as circulation numbers, can readily address in an evaluation of a collection. Borin and Yi (2008) have said that "the literature on evaluation can be grouped into two camps: traditional (criteria-based) and new (usage based)." Various approaches have been suggested for collection comparison and assessment within the criteria-based methods. For instance, a criteria-based assessment would compare a collection with a standard bibliography, which is a list of materials considered to be the basic, essential items for a collection. Larsgaard (1998) recommends some standard items for general-reference world atlases; she also notes that there are bibliographies of state atlases, though the ones mentioned are no later than 1988. As to standard lists or bibliographies that could be used for topical areas or for specific disciplines such as economics, a general search of WorldCat has not yielded any current standard listing.
To determine what is the appropriate standard list with which a particular collection should be compared is thus difficult. As we have seen, there is a paucity of up-to-date standard bibliographies that cover the range of geographies from world to local atlases. Collections of maps and atlases in a particular library are often strongly influenced by the geographic location of the holding institution, the collection development policies of the institution, and, in the case of colleges and universities, the specifics of the academic curriculum. Finding a comparable library or set of libraries having the same general characteristics (size, type of locale, public vs. private, discipline strengths) to make a peer comparison can then be difficult as well.
WorldCat Collection Analysis (WCA), a tool to evaluate collections offered by the Online Computer Library Center (OCLC), can provide a comparison listing of holdings based on the Library of Congress (LC) classification number. The WCA is an online program that allows the user to create reports of frequencies with which certain LC ranges of items appear in WorldCat. The reports can be across the whole of the WorldCat database or include only sets of institutions that are of interest to the user. In addition, the frequencies can be subdivided by such categories as format, language, and publication date. If one were interested in an atlas collection, for example, a list of the LC "G" series classification numbers from G1000 to G3171 (which includes all the materials classified as atlases of various locations) could be produced and compared with the entire holdings in the atlas collection within the WorldCat database or with peer institution groups. This procedure can generate a set of reports about holdings in general geographic regions (Asia, North America, etc.) that can be broken down further by language, format, or audience. Additionally, once a level of specific geography, format, and language have been reached to suit the purpose of the analysis, a list of titles can be seen. Circulation statistics can also be generated to add further information to the report.
Another usage-based assessment method is one that examines borrowing and lending patterns based on the institution's library records. User-centered measures such as circulation data are a primary approach recommended for collection evaluation (Lockett 1989). As Agee points out (2005), "Most online management systems collect circulation data that may be organized in report form to provide frequency of individual title or classification-area loan information." However, geospatial materials, such as atlases, are likely to be used as reference material in a library and are often not borrowed from the library (see Note 1). These use records therefore do not reflect actual usage and are not especially helpful for determining collection needs.
Both the traditional criteria-based and the newer usage-based assessments are essentially ordered lists. With the standard lists or bibliographies, there is a listing of suggested or recommended items; in the usage measures there are lists of items used and not used. The lists from either of these do not account for the multidimensional ways the atlases are related. This research proposes the use of a visualization tool that charts the atlases using both their geography and their information topic.
It is necessary to understand what a visualization tool is because the word "visualization" can be understood on many levels. Charts such as pie charts and bar charts that show percentages and frequencies are forms of visualization that turn data into an image. Libraries are using forms of visualization in information retrieval; the AquaBrowser is a visualization of a catalog search. Likewise, Internet search engines such as Quintura (http://www.quintura.com) use visualization in the form of a word cloud to help narrow or redirect results. Other instances of visualization range from the mundane such as floor plans of a library for assisting patrons, to "node and link" network diagrams that represent collaboration among groups of researchers based on co-authorship of books and articles. The Harvard Catalyst Profile (http://catalyst.harvard.edu/people.html) provides these latter types of visualizations for researchers.

METHOD
The purpose of this research was to explore the use of a particular visualization tool, Phrase Net from Many Eyes, in providing information about an atlas collection. The result should allow users to better understand the materials in their institution's collection and suggest areas in which the collection can be improved. To start the collection assessment, we decided that this initial test of a procedure would begin with a limited collection that could be handled readily. The atlas collection at the University of Illinois at Chicago (UIC) consists of 2,013 titles that have been entered into the catalog. A complete list of all atlas materials cataloged for the UIC Library was obtained from the catalog records in its online catalog, Voyager. All 2,013 records were downloaded to an Excel file to make manipulation and formatting easier. Each record consisted of the local identification number; LC call number; title; and MARC 500 field (Notes), 650 field (Topical term), and 651 field (Geographic name). The specific titles in the collection were used to provide a more complete description of the latter MARC fields. Although the 650 and 651 fields should be sufficient data to perform the analysis, it was found that missing information was of such an extent that neither of these fields alone could be used for the test data. It would have been optimal to use the 650 and 651 fields, as they represent a controlled vocabulary, and they would have made for the cleanest analysis of the data. In cases where there was not sufficient information in one of these two fields, geographic or topical information was used from the title to provide more complete information for analysis.
After checking that records were atlases based on both a call number and a title check, 1,918 records remained after the removal of eighty-six items that were irrelevant or incorrectly classified. These 1,918 atlases were used to create atlas-related terms from the title and MARC headings.
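The call number half of this check can be sketched in a few lines of Python. The sketch below assumes the LC range G1000 to G3171 cited earlier for atlases; the article does not state the exact rule that was applied, so the function name `is_atlas_call_number` and the parsing pattern are illustrative assumptions, and the title check would still be done by hand.

```python
import re

# Assumed range, taken from the LC "G" series span cited earlier in the article.
ATLAS_MIN, ATLAS_MAX = 1000, 3171


def is_atlas_call_number(call_number: str) -> bool:
    """Return True if an LC call number falls in the G1000-G3171 atlas range.

    Only the single-letter class "G" followed by a number is accepted, so
    subclasses such as GA or GB are rejected.
    """
    match = re.match(r"^\s*G\s*(\d+)", call_number, flags=re.IGNORECASE)
    if not match:
        return False
    return ATLAS_MIN <= int(match.group(1)) <= ATLAS_MAX
```

Records that fail this test would be candidates for the manual title review rather than automatic removal.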
The following procedures were used to prepare the data for use in the analysis.
• The data were cleaned of stop terms including "the," "of," "and," "in," and other such grammatical terms that were not essential to organizing the information.
• The words "atlas" and "map" and related terms such as "mapping" and "atlases" were removed because it was known that the materials were atlases and had maps, so these terms provided no new information.
• Certain terms deemed to represent a single concept in multiple words (e.g., Great Britain) were edited to include an underline between the separate words (Great_Britain) to keep them as a single term in the data. Initial tests had shown that the individual terms ("Great" and "Britain") occur close together in the analysis. These joined terms were limited to geographic names. Other terms such as "city" and "transportation" were retained as individual terms so that they might be seen in relation to topical terms and geographic entities, which were related to them by frequency.
The resulting cleaned data file was saved as a text file and used in the next step of the analysis. It took approximately three hours to create the data set from the time it was received as a report to the time it was ready to enter the data file into the online program.
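The three preparation steps could also be scripted rather than done by hand. The following Python sketch is one possible way to do so; the stop-term list, the atlas-term list, and the list of multiword place names are short illustrative assumptions, since the article does not give the full lists that were used.

```python
import re

# Illustrative lists only; the full lists used in the study are not given.
STOP_TERMS = {"the", "of", "and", "in", "a", "an", "for", "to", "by", "on"}
ATLAS_TERMS = {"atlas", "atlases", "map", "maps", "mapping"}
MULTIWORD_PLACES = ["Great Britain", "United States", "Soviet Union"]


def clean_terms(text: str) -> str:
    """Apply the three cleaning steps to one record's title and subject text."""
    # Join multiword geographic names with an underscore first, so that they
    # survive as single terms when the text is split into words.
    for place in MULTIWORD_PLACES:
        pattern = re.compile(
            r"\b" + r"\s+".join(re.escape(w) for w in place.split()) + r"\b",
            re.IGNORECASE,
        )
        text = pattern.sub(place.replace(" ", "_"), text)
    # Split into word tokens (punctuation is dropped; underscores are kept).
    tokens = re.findall(r"\w+", text)
    # Drop stop terms and the uninformative atlas/map terms.
    dropped = STOP_TERMS | ATLAS_TERMS
    return " ".join(t for t in tokens if t.lower() not in dropped)
```

For example, `clean_terms("The atlas of Great Britain and the economy")` returns `"Great_Britain economy"`. Writing one cleaned record per line produces a text file of the kind described above.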
The data were analyzed using the online service Many Eyes (http://manyeyes.alphaworks.ibm.com/manyeyes/), which provides several types of visualization tools including Phrase Net, a tool for looking at relationships within text. The Web site has options to upload a data set or to use one that has been uploaded by someone else. With the data set you can create a number of types of visualizations, including traditional data charts such as pie, bar, and line graphs. It also contains U.S. county and state maps as well as world maps in which data can be displayed, such as for income levels.
Finally, there is a set of text analysis diagrams, including word clouds and tags, word trees, and Phrase Net.
The Web site describes Phrase Net: "A phrase net diagrams the relationships between different words used in a text. It uses a simple form of pattern matching to provide multiple views of the concepts contained in a book, speech, or poem." The tool is considered to be somewhere between a tag cloud and a word tree. The relationship between terms that are to be diagramed can be defined in a number of ways, such as words connected by "and," "is," "at," or other customized relationships. The analysis for our data required a space between the words. Data from the text data file are pasted into a data box in Many Eyes and saved online. These data can then be visualized using any of the appropriate tools.
Another option in the Phrase Net is to show how many individual terms or words are in the diagram. In this analysis several levels of words were chosen, including the 10 most frequently repeated words, then the 25 most frequent, then the 50, and finally the 75 most frequent words. These different numbers of terms provide increasing depth to the analysis. The user is not limited to specific numbers of terms, so if he or she desired, the analysis could be on the 43 most frequent terms. Overall, Phrase Net provides a means to analyze the text from the atlas geographic and topical terms based on the proximity of the geographic name and the topic so that the relationships are maintained in the analysis. Because this is an open Web site, the readers can access the data discussed in the article and perform visualizations of whatever type they wish. The data can be found by searching in data sets for the word "atlas" and selecting the data set named "revised atlas title and subject terms from UIC." By clicking on the Visualize button on the right side of the list, the selection of a tool type can be made.
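The counting behind such a diagram can be approximated offline. The Python sketch below is not the Many Eyes implementation; it assumes that each line of the data file is one record, that terms are lowercased, and that two terms are related only when they are separated by a single space within a record. The function name `phrase_net` is hypothetical.

```python
from collections import Counter


def phrase_net(lines, top_n):
    """Count terms and adjacent term pairs, keeping pairs among the top_n terms.

    Returns (word_counts, edges), where edges maps (first, second) to the
    number of times the two terms appeared side by side in that order.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for line in lines:
        words = line.lower().split()
        word_counts.update(words)
        # First-order connections only: each word paired with its right neighbor.
        pair_counts.update(zip(words, words[1:]))
    top_words = {word for word, _ in word_counts.most_common(top_n)}
    edges = {
        pair: count
        for pair, count in pair_counts.items()
        if pair[0] in top_words and pair[1] in top_words
    }
    return word_counts, edges
```

Calling the function with `top_n` set to 10, 25, 50, and 75 on the cleaned file would give the term and pair counts for the four levels of detail described above.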

RESULTS
After cleaning and entering the data into Many Eyes, the Phrase Net tool was then used to create four initial sets of images of relationships. These images are shown in Figures 1-4. In Phrase Net diagrams there are a few conventions that make interpretation of the diagram easier. First, the words are connected by arrows if they are related by a space between them in the data file. So, only first-order connections (side-by-side terms) are shown with arrows. The size of a word in the diagram (i.e., its font size) is proportional to the number of times it occurs. Larger words represent more frequent use of those words in the data. The arrow between words has varying thickness depending on how many times those two words were related. The shading of a word varies depending on whether it was more likely to be found in the first or second position. Position is simply determined by which word came first in the pair of words. Given that the words are simply terms in no syntactical order, position does not play as important a part as it would in sentence-based diagrams. The darker the word, the more often it appeared in the first position. Table 1 gives the frequency of the top twenty-five terms used in the analysis. Figure 1 illustrates the top ten terms and provides a high-level view of the atlas collection. It is clear from this diagram that the atlas collection at UIC is highly oriented toward materials for Illinois and specifically Cook County, where the university is located. Also, the words "county," "real," and "property" suggest that the materials are weighted toward items that relate to local property and real estate matters. These indications are in line with the collection development goals for the library in general. The terms "historical," "history," and "geography," especially in relation to "united states," indicate that there are more than just local or current records in the collection. 
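The size and shading conventions can be made concrete with a small calculation. The Python sketch below is an illustration under stated assumptions, not the rendering code of Many Eyes: font size is taken to be strictly proportional to the term count, shading is represented by the share of a term's pair appearances that fall in the first position, and the function names and the 48-point maximum are hypothetical.

```python
from collections import Counter


def first_position_share(pairs):
    """Return, for each term, the fraction of its pair appearances in first position.

    pairs is a list of (first, second) tuples, one per adjacent occurrence.
    A value near 1.0 corresponds to a darker word in the diagram.
    """
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return {
        word: first[word] / (first[word] + second[word])
        for word in set(first) | set(second)
    }


def font_size(count, max_count, max_pt=48.0):
    """Scale a term's font size in proportion to its frequency."""
    return max_pt * count / max_count
```

With the pairs `[("cook", "county"), ("cook", "county"), ("county", "real")]`, "cook" always comes first (share 1.0), "county" comes first in one of its three appearances, and "real" never does (share 0.0).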
Even at this high-level view, it is clear that it is possible to see relationships between geography and topic and that there is a diversity of levels in geography, from local to national.
Expanding the Phrase Net diagram to include the top twenty-five commonly used words shows that even the local aspects of the collection are not restricted to Cook County, but also include other nearby Illinois counties (e.g., DuPage) and other aspects of local atlases especially related to outdoors, natural, resources, and recreation (Figure 2). The "history" and "historical" terms with "geography" are now expanded to include both the United States and Europe, particularly British and German materials. There is a subbranch appearing on the right side indicating that language and dialects related to the German materials exist. There are also "economic" and "conditions" that connect to "geography." Clearly local geography is separated from larger areas, and the topics for the local versus the larger areas seem quite different at this point.
Increasing the Phrase Net diagram to fifty terms (Figure 3), one can see further expansion of local area terms such as additional Illinois counties near Chicago (McHenry, Lake, and Kane counties) as well as the city of Chicago. There are also additional countries in Europe (such as The Netherlands, Finland, and the Soviet Union) as well as other parts of the world (e.g., Africa, the Pacific, Canada, and China) that now appear in the chart. It should be noted that places that one might logically think would be closely related geographically, such as England, British, and Great Britain, are not directly connected to each other in the diagram. This suggests one of the limitations to the relationship rule used, that is, the immediate adjacency connection by one space between the terms in the data. Another interesting type of item to come to the fore is dates, such as 1945 and 1800, suggesting something about the recency as well as interests reflected in the collection.
The expansion to 75 terms (Figure 4) becomes almost too complex to elicit helpful data. However, it can be more closely examined using some of the Phrase Net tools. Although the drawing may seem complex and a list of an equivalent number of individual terms may be relatively easy to grasp, it should be remembered that this chart represents not just 75 terms. These are the top 75 most frequent terms from a list of thousands of words and phrases, and they are representative of many more specific atlases. Also, a chart includes not only frequencies but also relationships among these terms. This chart will be examined more closely. In general, the same types of additions hold here as have been seen in the previous expansion of terms: more countries and regions, along with increased terms that define the types of atlases such as "administrative," "streets," and "population."
Moving the mouse over a term in the screen display with seventy-five terms will result in a box appearing that indicates the number of occurrences of that term. On the other hand, moving the mouse over the arrows provides the number of occurrences of the two terms connected by the arrow as well as the first ten instances of the word pairs (Figure 5). Given the emphasis on the Chicago area in the UIC collection, it is somewhat surprising that the first ten occurrences show a variety of metropolitan areas for which there are atlases in the collection. However, given that the university has a College of Urban Planning and Public Affairs, it is more understandable that there would be collections from a number of metropolitan areas. Knowing about this relationship also allows one to begin making use of the chart's results. Although there are a number of metropolitan areas, it would be good to approach the College of Urban Planning faculty to determine if there are additional metropolitan area atlases that might be useful to them.
In examining the arrow between "housing" and "Netherlands" as shown in Figure 6, the first ten occurrences are identical, warranting further examination. Going to the library catalog, one finds a 20-volume atlas of the Netherlands that includes both housing and population data. This finding points to an issue that must be considered when examining the data and the diagrams. The data from the catalog had shown each of the volumes as a separate entity (record) rather than a single title for all volumes, so there are many instances in the diagrams. Although twenty volumes do occupy considerable space in the collection, one may not want to count the work that many times in the data. Multivolume atlases generally represent a major work on the geography and the topic involved, so a decision has to be made by the user of this approach about the extent to which he or she wishes multiple volumes to be included as separate items. If the decision is to have it represented only once, then the data will need to be edited to remove the additional entries.
Figure 7 indicates that there is only one occurrence of a pairing between England and Great Britain, a seemingly small number. There are limitations to the relationship definition in this type of visualization. Titles that refer to Great Britain will often not include the specifics of the countries within the nation, such as England, that are included in the broader category of Great Britain. From that point it may be useful to determine if there are differences in the nature or types of atlases that fall under each of the terms. Specific topics may be of more interest in England, Scotland, or Wales than they would be in Great Britain, such as an atlas of surnames where there is reason to believe that the surnames are more likely to be found in those specific countries.
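If the decision is to count a multivolume work only once, the extra records can be removed before the data are uploaded. The Python sketch below is one hedged way to do this; it assumes that volumes of the same work share a title and differ only in a trailing volume designation such as "v. 3" or "vol. 3", which may not hold for every catalog. The pattern and the function name `collapse_volumes` are illustrative.

```python
import re

# Assumed form of a trailing volume designation, e.g. ", v. 3" or " Volume 12".
VOLUME_PATTERN = re.compile(
    r"[\s,.;:]*\b(?:v|vol|volume)\.?\s*\d+\b.*$", re.IGNORECASE
)


def collapse_volumes(titles):
    """Keep the first record for each title once volume designations are stripped."""
    seen = set()
    unique = []
    for title in titles:
        key = VOLUME_PATTERN.sub("", title).strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(title)
    return unique
```

Given the made-up titles "Atlas of the Netherlands, v. 1", "Atlas of the Netherlands, v. 2", and "Atlas of Illinois", only the first and third records are kept. The removed records are worth reviewing by hand, since volumes of one work sometimes cover different topics.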
In order to maintain the depth of relationships that occur in the seventy-five-item diagram without being overwhelmed by the complexity of the diagram, the Phrase Net service allows zooming to and panning of portions of the diagram. Figure 8 shows an example of a diagram where it has been enlarged in order to look more closely at a portion of the screen. In addition to being much easier to read, one can examine some specific relationships. For example, there is a connection from "1945" to "world" and from "world" to "war," suggesting that a portion of the collection is related to the Second World War. Also "Poland" has arrows indicating relationships out of this area, but when panning across, one finds they are relationships to "geology" and "historical." Panning is done by right-clicking and using the mouse to move in the desired direction. In addition to using the zoom tool on the screen, the user can select an area of the diagram by clicking and dragging with the left mouse button and then zooming in on that area.

DISCUSSION AND CONCLUSIONS
The use of freely available tools to assist in collection assessment can be a fruitful procedure. In this case all that is required is a small set of computer tools that are available at virtually all libraries: a spreadsheet program (such as Excel), a text editor like Notepad, and Internet access. Additionally, this approach is relatively simple and straightforward. Librarians work with materials they are familiar with and that they have some prior knowledge about. Titles of atlases and the subject headings associated with them are the basic materials used. Finally, the procedure provides information not readily available in other assessment procedures; it shows relationships between geographic entities and the topics addressed in the atlases. Where there are gaps in a collection, there will be "holes" in the chart. The librarian knows that the library needs to support a program in a specific discipline, and by looking for the types of information or the geographic entity involved, he or she can determine the presence or absence of a type of atlas.
Because this tool is freely available via the Internet, the user can easily manipulate and alter the data set. Multiple runs with corrections and changes to the data set (such as combining terms to more correctly represent a single entity) do not cost anything besides a little time. Instead, the user learns from these iterations. For example, the writer, in working with a different set of materials from a collection finding aid, found after the first run that the dates (in years) were overwhelming that analysis. When the charts were examined, very few terms besides years were showing up. But one is free to change the data files and make new ways to better represent or analyze the information.
Some of the results were surprising at first glance, and they needed analyzing, such as the non-Chicago "metropolitan areas" in the analysis. But this effort will serve to make the librarian more aware, not only of the collection, but also of the institution that it serves. There are strengths in the collection that should be followed up: What is the history of the linguistic atlases from various languages? Is this a strength in a foreign language or linguistic department that is still a significant asset to them or has it fallen into disuse and perhaps should be brought back to the attention of faculty and students? Further contacts with faculty and students are clearly in order.
When one examines the visualizations, the strengths of the collection become obvious; they literally appear in large print. The shortcomings of the collection will require more thoughtful examination. The materials that are not in the chart but should be, or that should be more prominent, mark the areas where the collection needs to be enhanced. This, of course, requires the librarian to know not just the collection, but also the collection plan of the library and the needs of the predominant disciplines of the institution. For example, in the seventy-five-word diagram of the UIC collection many Western European countries have shown up individually as well as with the United States and Canada, along with North America. However, very little is seen concerning specific African and Latin American countries. Given both the diverse nature of the UIC student body and some of the academic specialties of the university (such as African-American Studies, Latin American Studies, and Slavic and Baltic Languages and Literature) one would expect to see more than Mexico and Poland in the phrase diagram of the atlases. Clearly this points to collection priorities that need to be addressed.
The questions raised are these: Can one create a hypothetical, idealized chart? Can one create a chart that represents what one wants the collection to look like and that includes all geographic entities in proportion to their importance as well as atlas topics in the areas of importance to the institution's disciplines? Such a chart is possible, but it would require a specific knowledge of not only the geographic entities and the topics to be ideally included but also the proportions of each. In a sense this leads back to the standard bibliographies and the issues associated with them. Perhaps this is an area to be considered in the future about how to best assess a collection. To understand a specific collection, map librarians need to understand the full scope of the field, the idealized, master collection. One of the consequences of the research here has been an increased desire to understand more about how to establish a standard that is adjustable, that can be calibrated to the parameters of a certain school's needs and strengths.
This approach offers a new tool toward visualizing collection assessment. Several of its shortcomings have already been noted. First among them is the nature of the relationships diagrammed. Whenever relationships require side-by-side (first order) connections, the information is limited to the "neighborhood" in which the title words and subject headings occur. More statistically advanced procedures such as multivariate clustering can provide groupings of materials that are related by using many variables not limited to adjacent data items. In the long run, such techniques may be more amenable to this type of project, but they require more complex procedures to create visualizations of the results as easily understandable as our diagrams of the atlas terms. Kim, Lee, Yun, and Park (2009), for instance, have used pathfinder scaling, which has nodes and links based on the subject-usage data. However, these data are circulation based and are therefore not as satisfactory for materials such as atlases, which are generally referenced in the library.
Additionally, the testing of this procedure was conducted on a limited sample of related materials. How well this can be used in different materials and with a greater number of items remains to be seen. Given these constraints, this approach seems to provide a basic first step in the assessment of an atlas collection, and it provides the user with a means of grasping the breadth and depth of their collection. This approach can allow both a broad vision and a more in-depth assessment. As a collection assessment tool, it is a streamlined method for analyzing the strengths and needs of the collection wherein the scope of a collection can be delimited to a specific type of item.
Using the visualization procedure in a limited situation does not preclude its use in a number of other types of analyses. For other parts of the library's collection, obviously the same type of approach can be used. The nature of its multidimensional analysis and presentation makes it amenable to other types of materials as well. As indicated earlier, it is being tested using materials from finding aids for a manuscript collection in which there is no real limitation on the vocabulary as there is within a limited range of the LC classification system. The intent of that assessment is to determine whether the procedure has potential as a visual form of finding aid. Using the charts of the terms from the finding aid, the expectation is that relationships between different parts of the manuscript collection can be seen that might have been overlooked, or at least were hard to find in a list-based format. A user could follow connections from the initial vocabulary item(s) of interest via the arrows to other portions of the manuscript collection.
No one procedure, whether for listing or visualizing data, is the only correct approach. Having several means of looking at data, especially means that are suited to the multiple dimensions of the data, is important to advance an understanding of the data. Many Eyes provides a toolbox of approaches that not only address the question at hand but are also flexible enough to be used in a variety of situations, bringing additional insights to other data as well. The use of the Phrase Net tool and its diagrams has provided insights and increased the researcher's knowledge of the atlas collection. The results of this study have shown a more varied collection of atlases and identified gaps that certainly need to be filled. Further use and manipulation of the information will lead to further insights and appreciation of this collection. The tools in Many Eyes can open up data to fuller and deeper understanding of the information, and they have potential across a range of other topics, making this a Web site that should strongly be considered for addition to the librarian's toolbox.

NOTE
1. By way of example, there were only 48 atlases checked out in the calendar year 2010 at the University of Illinois at Chicago, less than one per week.