Validation of Secondary Commercial Data Sources for Physical Activity Facilities in Urban and Nonurban Settings

Background: Secondary data are often necessary to assess the availability of commercial physical activity (PA) facilities and examine its association with individual behaviors and outcomes, yet the validity of such sources has been explored only in a limited number of studies. Methods: Field data were collected on the presence and attributes of commercial PA facilities in a random sample of 30 urban, 15 suburban, and 15 rural Census tracts in the Chicago metropolitan statistical area and surrounding area. Results: Approximately 40% of PA establishments in the field data were listed for both urban and nonurban tracts in both lists except for nonurban tracts in D&B (35%), which was significantly improved in the combined list of D&B and InfoUSA. Approximately one-quarter of the PA facilities listed in D&B were found on the ground, whereas 40% to 50% of PA facilities listed in InfoUSA were found on the ground. PA establishments that offered instruction programs or lessons or that had a court or pool were less likely to be listed, particularly in the nonurban tracts. Conclusions: Secondary commercial business lists on PA facilities should be used with caution in assessing the built environment.

As the prevalence of obesity increased dramatically in the United States over the past few decades, researchers have explored neighborhood contextual environments as key modifiable factors to combat this epidemic. 1,2 A number of recent studies explored the extent to which access to physical activity (PA) facilities was associated with PA-related behavior and individual body weight outcomes, particularly among youths. Previous literature reported a positive association of public and/or commercial local area PA facility availability with the level of PA. [3][4][5][6][7] In addition, perceived availability of PA facilities was associated with higher levels of PA among middle- and high-school aged children. 8,9][12]
Previous studies measured variations in the built environment, particularly for commercial PA facilities, with a variety of sources including direct field observations and secondary data sources, such as telephone directories and commercial business databases. Although data on PA facilities collected in the field by trained staff can be considered the "gold standard," such data collection is time intensive and costly, often focuses on a specific geographic area, and cannot be easily linked to existing national longitudinal or historical cross-sectional, individual-level data. Therefore, researchers often rely on secondary data for studies examining the relationship of environmental factors for PA and individual behavioral or weight outcomes. Despite such dependence on secondary data sources, limited validation of commercial business lists of PA establishments against direct on-the-ground observations has been undertaken in the previous literature.
4][15] These studies overall reported poor to moderate levels of validity for the secondary sources of PA facilities. Our study built on the previous studies and validated PA facility data from the 2 most widely available U.S. commercial business lists, Dun & Bradstreet (D&B) and InfoUSA, in the Chicago-Joliet-Naperville, IL-IN-WI Metropolitan Statistical Area (Chicago MSA) and surrounding nonurban areas. Our study was the first to assess the validity of PA facilities for both D&B and InfoUSA using direct field observations. We examined the level of agreement between direct field observations and information in each of the 2 secondary commercial business lists. We also used the combination of those 2 sources to assess the advantage of using combined sources. Further building on the previous literature, we collected characteristics from the interior of each PA facility and reported whether any statistically significant differences were found in those characteristics by validation status.

Data
As part of a larger food outlet validation study, 16 secondary commercial business list data on commercial PA facilities were drawn from D&B and InfoUSA and were validated with direct field observations based on a random sample of 30 urban census tracts from the Chicago MSA and a random sample of 15 suburban and 15 rural census tracts from a 50-mile buffer surrounding the Chicago MSA.
The 60 census tracts represented 1472 square miles (53 and 1419 square miles in the urban and nonurban tracts, respectively) and covered approximately 5076 total road miles (578 and 4498 miles in the urban and nonurban tracts, respectively).
Field observations were conducted from May 2009 through July 2009. Observers received 24 hours of training on the content of the data collection instruments and procedures for conducting observations including the use of maps. Observers were instructed to observe all streets within each tract as shown on the maps. Two trained field staff surveyed each census tract by walking or driving all streets contained within the tract to 1) identify any commercial PA facilities and 2) record the presence of specific attributes inside the premises of such facilities. For each facility, data were collected on the following specific attributes: whether they charged a fee, required membership, offered instruction programs or lessons, had multiple cardiovascular and strength training equipment, had a multipurpose room, had a gymnasium, had a playing court, or had a pool. A total of 54 and 52 PA facilities were identified on the ground in the urban and nonurban tracts, respectively.
We obtained commercial data on PA facilities from D&B and InfoUSA with a reference date of May 2009. We used data on establishments with primary or secondary PA facility-related Standard Industrial Classification (SIC) codes for both D&B (8-digit) and InfoUSA (6-digit; see Table 1 for a full list of SIC codes).
All establishments in the lists were geocoded at the rooftop level (the polygon centroid of a known building footprint) or the street address level, using the ArcGIS Online Geocoding Task in ArcGIS 9.1. Because our geographic sampling unit was the census tract, outlets that could be geocoded only to the centroid of the zip code, rather than to the rooftop or street address, were not eligible for inclusion in our study. The geocoded locations were then matched to the census tracts using ArcGIS. We also generated a combined list of D&B and InfoUSA by manually dropping facilities duplicated between the 2 business lists, as identified by name, address, and phone number. A total of 155, 94, and 192 PA facilities were listed in D&B, InfoUSA, and the combined list, respectively, for our sample census tracts.
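The deduplication step was performed manually in this study. As an illustration only, the sketch below shows how a first automated pass might flag duplicates before manual review. It assumes hypothetical record fields (`name`, `address`, `phone`) and treats 2 records as duplicates only when all 3 normalized fields agree exactly; the example records are invented, and near-matches (abbreviations, misspellings) would still need to be resolved by hand.

```python
import re


def normalize(text):
    """Lowercase and keep only letters and digits (assumed normalization)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def record_key(rec):
    """Build a comparison key from name, address, and the last 10 phone digits."""
    phone_digits = re.sub(r"\D", "", rec["phone"])[-10:]
    return (normalize(rec["name"]), normalize(rec["address"]), phone_digits)


def combine_lists(dnb, infousa):
    """Merge two business lists, keeping the first copy of each duplicate."""
    combined, seen = [], set()
    for source, records in (("D&B", dnb), ("InfoUSA", infousa)):
        for rec in records:
            key = record_key(rec)
            if key in seen:
                continue  # already present from the other list
            seen.add(key)
            combined.append({**rec, "source": source})
    return combined


# Hypothetical example records, not taken from either business list.
dnb = [
    {"name": "Lakeside Fitness Club", "address": "12 Main St.", "phone": "(312) 555-0101"},
]
infousa = [
    {"name": "LAKESIDE FITNESS CLUB", "address": "12 Main St", "phone": "312-555-0101"},
    {"name": "Prairie Golf", "address": "400 County Rd", "phone": "815-555-0199"},
]
combined = combine_lists(dnb, infousa)
```

Exact matching on all 3 fields is deliberately conservative: it avoids merging distinct facilities that share an address, at the cost of leaving some true duplicates for manual review.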

Analyses
The validity of the 2 secondary commercial business lists and the combined list was assessed by 3 validity indices: 1) Sensitivity: the proportion of establishments observed on the ground that was listed in the commercial databases; 2) Positive Predictive Value (PPV): the proportion of establishments listed in the commercial databases that was observed on the ground; and 3) Concordance: the proportion of PA facilities that were both listed and verified on the ground among all those that were either found on the ground or listed in the business lists. Standard errors were calculated for all indices. We used Fisher's Exact test (2-sided p) to test for significant differences in the validity indices by urbanicity and by commercial business list within each urbanicity.
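The 3 indices can be illustrated with a minimal sketch. The function and variable names are hypothetical, and the standard error uses the simple binomial formula sqrt(p(1 - p)/n), which is an assumption because the text does not state how standard errors were computed. The example uses the urban D&B counts reported in the Results (54 facilities on the ground, 77 listed), with the matched count of 22 inferred as the 77 listed facilities minus the 55 that were listed but not found.

```python
from math import sqrt


def binomial_se(p, n):
    """Standard error of a proportion (assumed binomial formula)."""
    return sqrt(p * (1 - p) / n)


def validity_indices(matched, on_ground, listed):
    """Return (estimate, standard error) for each of the 3 validity indices.

    matched   -- facilities both listed and found on the ground
    on_ground -- all facilities found on the ground
    listed    -- all facilities in the business list
    """
    either = on_ground + listed - matched  # found on the ground OR listed
    sensitivity = matched / on_ground
    ppv = matched / listed
    concordance = matched / either
    return {
        "sensitivity": (sensitivity, binomial_se(sensitivity, on_ground)),
        "ppv": (ppv, binomial_se(ppv, listed)),
        "concordance": (concordance, binomial_se(concordance, either)),
    }


# Urban D&B counts; matched = 77 listed - 55 listed but not found (inferred).
dnb_urban = validity_indices(matched=22, on_ground=54, listed=77)
for name, (estimate, se) in dnb_urban.items():
    print(f"{name}: {estimate:.2f} (SE {se:.2f})")
```

Note that the 3 indices share a numerator and differ only in the denominator, which is why a list can gain sensitivity while losing PPV.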
For PA facilities found on the ground, we compared attributes of each facility by whether it was found in a business list. A 2-sided t test was performed to determine whether any statistically significant differences existed in the characteristics of those facilities listed in a business list versus those not listed. We also examined the distribution of primary SIC codes for PA facilities that were listed but not found on the ground, and we reviewed the primary SIC codes to identify whether there were particular patterns of mismatch.

Results
Table 2 presents the validity statistics by business list and urbanicity. We found a total of 54 and 52 PA facilities on the ground in urban and nonurban tracts, respectively, whereas 77 and 78 PA facilities for D&B and 42 and 52 PA facilities for InfoUSA were listed in urban and nonurban tracts, respectively.
Sensitivity was 41% in urban tracts and 35% in nonurban tracts in D&B. In InfoUSA, sensitivity was similar across settings, at 39% in urban tracts and 40% in nonurban tracts. Combining the 2 commercial business lists improved the sensitivity across all tracts, which climbed to 52% in both the urban and nonurban tracts. The increase in the sensitivity for the urban tracts in the combined list was statistically significant compared with both D&B and InfoUSA. The differences in sensitivity between the urban and nonurban tracts were not statistically significant for D&B, InfoUSA, or the combined list.
Positive predictive value (PPV), the proportion of outlets listed that were observed in the field, was fair in both lists. PPV in urban compared with nonurban tracts was 29% versus 23% in D&B and 50% versus 40% in InfoUSA. PPV in the combined list in the urban tracts was 30%, whereas it was 27% in nonurban tracts. For each of the business lists and the combined list, the differences by urbanicity were not statistically significant. However, PPV in InfoUSA was statistically significantly higher than PPV in D&B (for both urban and nonurban tracts) and the combined list (urban tracts only; Table 2). Finally, we calculated concordance, the proportion of PA facilities that were both listed and verified on the ground among those that were either found on the ground or in the business lists, as a measure of the overall level of agreement. For urban tracts, InfoUSA (28%) showed higher concordance compared with D&B (20%). The agreement levels also were less than 30% in nonurban tracts in both databases (16% for D&B and 25% for InfoUSA). Combining the 2 business lists generated overall agreement levels of 24% and 22% for urban and nonurban tracts, respectively. However, none of the differences between the urban and nonurban tracts were statistically significant, nor were there any statistically significant differences in concordance across list types (Table 2).
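To make the comparison of indices between lists concrete, the following is a minimal pure-Python sketch of a 2-sided Fisher's exact test on a 2 x 2 table. It uses one common convention for the 2-sided p value (summing the probabilities of all tables no more likely than the observed one); the software and convention actually used in the study are not stated, so this is an assumption. The example compares urban PPV in D&B (22 of 77 listed facilities found) with InfoUSA (21 of 42), where the matched counts are inferred from the reported totals and the counts of listed facilities not found on the ground.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum every table at least as extreme (no more probable) as the observed one.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)  # tolerance for floating-point ties
    )
    return min(1.0, p_value)


# Rows: D&B, InfoUSA (urban). Columns: found on the ground, not found.
p_ppv_urban = fisher_exact_two_sided(22, 55, 21, 21)
print(f"2-sided p for urban PPV, D&B vs InfoUSA: {p_ppv_urban:.3f}")
```

Fisher's exact test is a reasonable choice here because several cells in these tables are small, which makes large-sample approximations less reliable.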
Note. Sensitivity represents the proportion of physical activity facilities found on the ground that were listed in a business list. Positive Predictive Value is the proportion of PA facilities listed in a business list that were found on the ground survey. Concordance is the proportion of listed PA facilities also verified on the ground among those that were either found on the ground or in the business lists. Statistical significance denoted at P < .05 within each characteristic as follows: a, significant difference between D&B and InfoUSA; b, significant difference between individual list (D&B and InfoUSA) and the combined list.
In Table 3, we show whether detailed characteristics of PA facilities on the ground were different by whether those facilities were found in the business lists. In D&B, in both urban and nonurban tracts, a higher proportion of PA facilities in the nonmatched group had a pool (28% for urban and 26% for nonurban tracts) compared with the matched group (6% for urban and 0% for nonurban tracts).
In InfoUSA, no statistically significant differences were found in the characteristics of PA facilities on the ground in urban tracts compared with those listed in the commercial database, except that 32% of the nonmatched PA facilities had a pool whereas none of the matched PA facilities had a pool. PA facilities in nonurban tracts that had a pool were less likely to be listed in InfoUSA: none of the listed PA establishments had a pool, whereas pools were present in 27% of the PA establishments that were on the ground but not listed in InfoUSA (Table 3).
Significant differences in the characteristics of the PA facilities between the matched and the nonmatched groups persisted in the combined list. PA facilities in the matched group had a lower rate of having a pool (4%) compared with PA establishments in the nonmatched group (35%) in urban tracts. In addition, none of the matched PA establishments in the combined list had a court or a pool, whereas 18% and 35% of the nonmatched PA facilities had a court and a pool, respectively, in nonurban tracts (see Table 3).
Note. Reported numbers are proportions. PA facilities found on the ground that were listed in a business list were classified as "match," and the remaining PA facilities that were not listed were classified as "nonmatch." * significant at the 10% level; ** significant at the 5% level; *** significant at the 1% level.
Further, we reviewed the primary SIC codes for PA facilities that were not found on the ground but listed as such in each business list. For D&B, there were 55 and 60 such PA facilities in urban and nonurban tracts, respectively, of which the 2 most frequent types were spas (9%) and membership sports & recreation clubs (9%) in urban tracts, and public golf courses (13%) and membership sports & recreation clubs (15%) in nonurban tracts. For InfoUSA, 21 and 31 PA facilities were not found on the ground but listed in the database in urban and nonurban tracts, respectively. The most frequent facility type listed in InfoUSA but not found on the ground in urban tracts was martial arts instruction facilities (24%) followed by stables (14%). Slightly less than one-quarter (23%) of those facilities in nonurban tracts were health club studios and gymnasiums, and 16% were public golf courses (results not shown in Tables).
We also examined the type of PA facilities that were identified on the ground but not in the list, which were 21 and 21 facilities for InfoUSA and 22 and 18 for D&B for urban and nonurban tracts, respectively. Of those, approximately 25% and 30% for nonurban and urban tracts, respectively, across the 2 business lists were local community-run recreation centers or pools, YMCAs, or Jewish Community Centers (results not shown in Tables).

Discussion
We first showed that less than one-half (41% for D&B and 39% for InfoUSA) of the PA establishments found on the ground were listed for the urban tracts, and between one-third and one-half were listed in either business list for the nonurban tracts (35% for D&B and 40% for InfoUSA). Sensitivity was improved by combining the 2 business lists; the sensitivity increased to 52% for both urban and nonurban tracts in the combined list. Second, we found that only approximately one-quarter (23% for nonurban and 29% for urban) of PA establishments listed in D&B were actually found on the ground, whereas 40% (nonurban tracts) and 50% (urban tracts) of PA facilities listed in InfoUSA were found on the ground. Further, we found that PA establishments found on the ground that had a court or pool were less likely to be listed in either D&B or InfoUSA as well as the combined list, particularly in the nonurban tracts. The overall concordance was poor in both D&B (20% and 16% for urban and nonurban tracts) and InfoUSA (28% and 25%), and no statistically significant differences were found between the 2 lists. In sum, although InfoUSA overall showed significantly higher PPV than either D&B or the combined list, we remain cautious in suggesting which dataset is more useful given that the differences in sensitivity and concordance between D&B and InfoUSA were not statistically significant, and the sensitivity improved in the combined list.
2][13] Paquet and colleagues validated 1 commercial database and a business list based on internet searches in 2005 for food stores and PA facilities for 12 census tract boundaries in Montreal, Canada. They reported 56% for sensitivity, 63% for PPV, and concordance of 25%. 11 Boone and colleagues assessed 1 commercial database with commercial PA facilities in 80 census block groups in 1 urban and 1 nonurban community in the United States in 2005. They reported concordance of 39% and 46% for urban and nonurban communities, respectively, for all PA facility types combined. 12 Lastly, Hoehner and Schootman 15 compared D&B and InfoUSA to examine agreement between the 2 sources in terms of the presence, locations, and characteristics of businesses for PA facilities as well as food stores and restaurants. Based on 4-digit SIC codes provided in each list, they matched businesses in each list in the St. Louis, Missouri area in 2007 by business name within specified distances. They reported that the level of agreement between the 2 secondary sources on commercial PA establishments was 63% and varied by population density in a given census tract. However, this latter study did not validate each list based on actual field survey data. 13
The quality of secondary data sources in evaluating PA environments is important to reach credible conclusions when using such databases. Accurately identifying PA facilities in a neighborhood using business lists may require more effort compared with food stores and restaurants given the large variety of PA facility types with both public and private sources. 11 A previous study that reported agreement between D&B and InfoUSA of 63% for PA facilities reported a higher level of agreement (75% to 92%) for food stores and restaurants.
15 Similarly, an international study also reported that all validation statistics, including the percentage agreement, sensitivity, and PPV, were superior for food stores compared with PA-related businesses (65% to 77% for food stores versus 25% to 42% for PA facilities for the percentage agreement; 66% to 84% for food stores and 33% to 56% for PA facilities for sensitivity; and 58% to 98% for food stores and 50% to 63% for PA facilities for PPV). 13
The results of this study are subject to a few limitations. First, our ground-truth survey data are based on 1 metropolitan urban area and surrounding nonurban area in the United States at a single point in time, and thus, whether the results of this study can be generalized across the United States is not known. Second, our field staff could only identify stand-alone facilities with clear PA indications, although we performed a census for the sampled census tracts to identify PA facilities and collected their attributes inside each establishment. Therefore, if there were any individuals who were offering PA instruction in their home, for example, and listed themselves as a PA-related business (not as an individual trainer) in a business list, such places were not included in our ground survey. Third, for this study we purchased the commercial business lists by SIC codes and did not provide any text strings to further obtain the list of PA facilities in those lists. Our findings indicated that governmental and nonprofit PA facilities that were found on the ground, particularly in nonurban tracts, were unlikely to be present in the business lists. This suggests that the business lists are less accurate at capturing information on governmental and nonprofit PA facilities in nonurban areas, or that these outlets are not recorded with PA-related SIC codes. Given that the prevalence of obesity is found to be significantly higher [17][18][19] and PA is found to be significantly lower 17,19,20 in nonurban populations, it is particularly important to accurately identify whether area-level disparities exist in PA resources including specific sports or activity features. However, we acknowledge that using text strings in addition to SIC codes could improve sensitivity. In addition, relying on SIC codes in the commercial business lists to identify PA facilities restricted our understanding of the characteristics of listed facilities that were not discovered on the ground, which may include closed businesses, misclassified listings, and corporate offices. Finally, our sample size is small; however, similar studies also had small sample sizes and our findings are similar to those found in another study in the United States. 12 Despite these limitations, the current study built on the previous literature by validating the spatial patterning of PA establishments in 2 commercial secondary data sources in both urban and nonurban areas, including information on the characteristics of the PA facilities surveyed on the ground.

Conclusions
Our study results indicate that the 2 secondary commercial business lists should be used with caution in assessing the built environment regarding PA facility availability. Combining the 2 business lists has an advantage in that researchers more accurately include what is actually on the ground when they use such lists to measure the PA environment. However, such improvement in the sensitivity is obtained at the expense of the deterioration of PPV and concordance. Therefore, the improved sensitivity in the combined lists is only useful when researchers can perform initial screening via phone to eliminate any closed PA facilities that remain in those lists. Such initial screening via phone would be a good alternative to the expense of ground truthing in a medium-sized study, but this may not be feasible for a large study.
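The trade-off between sensitivity and PPV when lists are combined can be illustrated with a toy example. The facility identifiers and both lists below are entirely hypothetical and are not drawn from the study data; the sketch only shows why taking the union of 2 lists can raise sensitivity while lowering PPV, because the union accumulates the unverified listings of both sources.

```python
def sensitivity_and_ppv(ground, listed):
    """Return (sensitivity, PPV) for sets of facility identifiers."""
    matched = ground & listed
    return len(matched) / len(ground), len(matched) / len(listed)


# Hypothetical facility IDs: 1-10 exist on the ground, 11-20 do not.
ground = set(range(1, 11))
list_a = {1, 2, 3, 4} | set(range(11, 17))  # 4 verified + 6 unverified listings
list_b = {3, 4, 5, 6} | set(range(17, 21))  # 4 verified + 4 unverified listings
combined = list_a | list_b                  # union, duplicates counted once

sens_a, ppv_a = sensitivity_and_ppv(ground, list_a)
sens_b, ppv_b = sensitivity_and_ppv(ground, list_b)
sens_c, ppv_c = sensitivity_and_ppv(ground, combined)

print(f"List A:   sensitivity {sens_a:.2f}, PPV {ppv_a:.3f}")
print(f"List B:   sensitivity {sens_b:.2f}, PPV {ppv_b:.3f}")
print(f"Combined: sensitivity {sens_c:.2f}, PPV {ppv_c:.3f}")
```

In this toy case the combined list finds more of the facilities on the ground than either list alone, but a smaller share of its entries is real, which is the situation in which phone screening of listed facilities becomes worthwhile.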