35 files

Cleaned NHANES 1988-2018

Version 7 2024-05-16, 22:23
Version 6 2023-07-27, 20:01
Version 5 2023-01-15, 20:59
Version 4 2023-01-09, 19:53
Version 3 2023-01-09, 19:47
Version 2 2023-01-09, 19:46
Version 1 2023-01-09, 19:34
Posted on 2024-05-16, 22:23. Authored by Vy Nguyen, Lauren Y. M. Middleton, Neil Zhao, Lei Huang, Eliseu Verly, Jacob Kvasnicka, Luke Sagers, Chirag Patel, Justin Colacino, Olivier Jolliet

The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposures of the non-institutionalized US population. However, because NHANES data are plagued with multiple inconsistencies, they must be processed before new insights can be derived through large-scale analyses. We therefore developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:

  1. demographics (281 variables),
  2. dietary consumption (324 variables),
  3. physiological functions (1,040 variables),
  4. occupation (61 variables),
  5. questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),
  6. medications (29 variables),
  7. mortality information linked from the National Death Index (15 variables),
  8. survey weights (857 variables),
  9. environmental exposure biomarker measurements (598 variables), and
  10. chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

CSV Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file.

  • The curated NHANES datasets comprise 20 .csv files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
  • “dictionary_nhanes.csv” is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
  • “dictionary_harmonized_categories.csv” contains the harmonized categories for the categorical variables.
  • “dictionary_drug_codes.csv” contains the descriptors for the drug codes.
  • “nhanes_inconsistencies_documentation.xlsx” is an Excel file that contains the cleaning documentation, which records the inconsistencies for all affected variables and was used to help curate each of the NHANES modules.
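
As a minimal sketch of how the data dictionary might be used to find the variables in one module, the Python snippet below filters dictionary rows by module label. The column names (`variable_codename`, `variable_description`, `module`, `units`) and the three sample rows are assumptions made for illustration; check the header of “dictionary_nhanes.csv” and adjust the names before running this against the real file.

```python
import csv
import io

# Hypothetical excerpt standing in for "dictionary_nhanes.csv".
# The column names and rows are assumptions for illustration only.
SAMPLE_DICTIONARY = """variable_codename,variable_description,module,units
RIDAGEYR,Age in years at screening,demographics,years
LBXBPB,Blood lead,chemicals,ug/dL
LBXBCD,Blood cadmium,chemicals,ug/L
"""


def variables_in_module(dictionary_file, module,
                        name_col="variable_codename", module_col="module"):
    """Return the variable names whose module column equals `module`."""
    reader = csv.DictReader(dictionary_file)
    return [row[name_col] for row in reader if row[module_col] == module]


# With the real file, replace io.StringIO(...) with
# open("dictionary_nhanes.csv", newline="").
chemical_vars = variables_in_module(io.StringIO(SAMPLE_DICTIONARY), "chemicals")
print(chemical_vars)
```

The same lookup can be pointed at any other module label listed above, for example "weights" or "comments", provided the labels in the file match.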

R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries are provided. They can be downloaded as a .zip file, which includes an .RData file and an .R file.

  • “w - nhanes_1988_2018.RData” contains all the aforementioned datasets as R data objects. We also make available all R scripts containing the customized functions that were written to curate the data.
  • “m - nhanes_1988_2018.R” shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

Example starter code: The starter code to help users conduct exposome analyses consists of four R Markdown files (.Rmd). We recommend going through the tutorials in order.

  • “example_0 - merge_datasets_together.Rmd” demonstrates how to merge the curated NHANES datasets together.
  • “example_1 - account_for_nhanes_design.Rmd” demonstrates how to fit a linear regression model, a survey-weighted regression model, a Cox proportional hazards model, and a survey-weighted Cox proportional hazards model.
  • “example_2 - calculate_summary_statistics.Rmd” demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design.
  • “example_3 - run_multiple_regressions.Rmd” demonstrates how to run multiple regression models with and without adjusting for the sampling design.
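
For readers working from the .csv files outside R, the two ideas behind the first tutorials (merging modules by participant and accounting for survey weights) can be sketched in a few lines of Python. This is not the authors' code: the identifier column `SEQN`, the weight column `WTMEC2YR`, and all values below are assumptions for illustration, and the curated files may use different column names. The weighted mean shown is only a point estimate; valid standard errors also require the strata and sampling-unit information that the survey-weighted models in the tutorials account for.

```python
import csv
import io

# Hypothetical excerpts of two cleaned modules. Column names and values
# are invented for illustration and may differ from the curated files.
DEMOGRAPHICS = """SEQN,RIDAGEYR
1,30
2,50
3,70
4,40
"""
WEIGHTS = """SEQN,WTMEC2YR
1,1000
2,1000
3,2000
"""


def read_by_id(text, id_col="SEQN"):
    """Index the rows of a CSV string by participant identifier."""
    return {row[id_col]: row for row in csv.DictReader(io.StringIO(text))}


def merge_on_id(left, right):
    """Inner join: keep only participants present in both modules."""
    return [{**left[key], **right[key]} for key in left if key in right]


def weighted_mean(rows, value_col, weight_col):
    """Weighted point estimate of the mean of `value_col`."""
    total_weight = sum(float(r[weight_col]) for r in rows)
    weighted_sum = sum(float(r[value_col]) * float(r[weight_col]) for r in rows)
    return weighted_sum / total_weight


merged = merge_on_id(read_by_id(DEMOGRAPHICS), read_by_id(WEIGHTS))
unweighted = sum(float(r["RIDAGEYR"]) for r in merged) / len(merged)
weighted = weighted_mean(merged, "RIDAGEYR", "WTMEC2YR")
print(len(merged), unweighted, weighted)
```

In this toy data, participant 4 has no weight record and is dropped by the inner join, and the participant with the largest weight pulls the weighted mean above the unweighted one.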


Funding

Soremartec Italia S.R.L.

Ferrero SpA

Ravitz Family Foundation

University of Michigan Forbes Institute for Cancer Discovery

Harvard Data Science Initiative

National Institutes of Health R01 AG072396

National Institutes of Health R01 ES028802

National Institutes of Health P30 ES017885

National Institutes of Health P30 CA046592

National Institutes of Health UG3 CA267907
