Analyzing Large-Scale Proteomics Projects with Latent Semantic Indexing
journal contribution
posted on 2008-01-04, 00:00 authored by Sebastian Klie, Lennart Martens, Juan Antonio Vizcaíno, Richard Côté, Phil Jones, Rolf Apweiler, Alexander Hinneburg, Henning Hermjakob

Since the advent of public data repositories for proteomics data,
readily accessible results from high-throughput experiments have been
accumulating steadily. Several large-scale projects in particular
have contributed substantially to the amount of identifications available
to the community. Despite the considerable body of information amassed,
very few successful analyses of these data have been performed and
published, leaving the ultimate value of these projects far below
their potential. A prominent reason that published proteomics data are seldom
reanalyzed lies in the heterogeneous nature of the original sample
collection and the subsequent data recording and processing. To illustrate
that at least part of this heterogeneity can be compensated for, we
here apply latent semantic analysis to the data contributed by the
Human Proteome Organization’s Plasma Proteome Project (HUPO
PPP). Interestingly, despite the broad spectrum of instruments and
methodologies applied in the HUPO PPP, our analysis reveals several
obvious patterns that can be used to formulate concrete recommendations
for optimizing proteomics project planning as well as the choice of
technologies used in future experiments. It is clear from these results
that the analysis of large bodies of publicly available proteomics
data by noise-tolerant algorithms such as latent semantic analysis
holds great promise and is currently underexploited.
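The core idea of latent semantic analysis can be sketched briefly: identification results are arranged as a protein-by-experiment matrix, a truncated singular value decomposition projects the experiments into a low-dimensional latent space, and similarity in that space groups experiments with related identification patterns despite noise. The sketch below uses a made-up toy matrix, not HUPO PPP data, and is only an illustration of the general technique.

```python
import numpy as np

# Toy protein-by-experiment matrix (hypothetical data, not from the
# HUPO PPP): rows are proteins, columns are experiments, and an entry
# is 1 if the protein was identified in that experiment.
identifications = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
], dtype=float)

# LSA step: truncated singular value decomposition, keeping the k
# largest singular values as latent dimensions.
U, s, Vt = np.linalg.svd(identifications, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Project each experiment into the k-dimensional latent space. Cosine
# similarity there relates experiments with similar identification
# patterns, even when they share few proteins directly.
experiment_coords = (np.diag(s_k) @ Vt_k).T  # shape: (n_experiments, k)

def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Experiments 0 and 1 overlap strongly in identified proteins;
# experiments 0 and 2 share none, so their similarity is lower.
sim_01 = cosine(experiment_coords[0], experiment_coords[1])
sim_02 = cosine(experiment_coords[0], experiment_coords[2])
```

In practice the matrix for a large-scale project is far larger and sparser, and the choice of k trades noise suppression against loss of genuine signal; the SVD machinery itself is unchanged.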