Statistical Variable Selection: An Alternative Prioritization Strategy during the Nontarget Analysis of LC-HR-MS Data

Liquid chromatography coupled to high-resolution mass spectrometry (LC-HR-MS) has been one of the main analytical tools for the analysis of small polar organic pollutants in the environment. LC-HR-MS typically produces a large amount of data for a single chromatogram, so the analyst must perform prioritization prior to nontarget structural elucidation. In the present study, we combined the F-ratio statistical variable selection and apex detection algorithms to perform prioritization in data sets produced via LC-HR-MS. The approach was validated using semisynthetic data, a combination of real environmental data and the artificially added signal of 31 alkanes in that sample. We evaluated the performance of this method as a function of four false detection probabilities, namely 0.01, 0.02, 0.05, and 0.1%. We generated 100 different semisynthetic data sets for each F-ratio and evaluated each data set using this method. This experimental design created a population of 30 000 true positives and 32 000 true negatives for each F-ratio, which was considered sufficiently large to fully validate this method for the analysis of LC-HR-MS data. The effects of both the F-ratio and the signal-to-noise ratio (S/N) on the performance of the suggested approach were evaluated through normalized statistical tests. We also compared this method to the pixel-by-pixel and peak list approaches. More than 92% of the features present in the final feature list produced by the F-ratio method were also present in the conventional peak list generated by MZmine. However, of the evaluated approaches, this method was the only one that successfully classified the samples and thus enabled prioritization. The application potential and limitations of the suggested method are discussed.
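The idea behind F-ratio variable selection can be illustrated with a minimal sketch. It assumes the textbook one-way ANOVA definition of the F-ratio (between-class variance divided by within-class variance), computed separately for each variable, and is not the implementation used in this study. The function names, the data layout (variable name mapped to lists of replicate intensities per sample class), and the threshold value are hypothetical; the apex detection step is not shown.

```python
from statistics import fmean


def f_ratio(groups):
    """One-way ANOVA F-ratio for a single variable.

    `groups` is a list of lists: replicate intensities per sample class
    (an assumed layout, chosen for illustration only).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    class_means = [fmean(g) for g in groups]

    # Between-class variance: spread of class means around the grand mean.
    between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, class_means)
    ) / (k - 1)
    # Within-class variance: spread of replicates around their class mean.
    within = sum(
        (x - m) ** 2 for g, m in zip(groups, class_means) for x in g
    ) / (n_total - k)

    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def select_variables(data, f_critical):
    """Keep the variables whose F-ratio exceeds a user-supplied threshold."""
    return [name for name, groups in data.items() if f_ratio(groups) > f_critical]


# Toy data: two sample classes with three replicates each (invented values).
toy = {
    "var_a": [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]],  # differs between classes
    "var_b": [[1.0, 2.0, 3.0], [2.0, 3.0, 1.0]],  # same mean in both classes
}

# 7.71 is the tabulated critical F value for alpha = 0.05 with (1, 4) degrees
# of freedom, used here purely as an example threshold.
selected = select_variables(toy, f_critical=7.71)
```

In this toy case only `var_a` passes the threshold. A lower false detection probability corresponds to a larger critical F value and therefore a stricter selection.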