Background: Copy number variants (CNVs) are events that differentiate individual genomes and play an important role in cancer and in Mendelian disorders such as neurodegenerative diseases. Current CNV detection methodologies based on next-generation sequencing (NGS) rely mainly on measuring the relative losses or gains of read coverage over a genomic region. However, read counts typically contain several types of random noise and bias, which must be mitigated in order to analyze copy numbers accurately. There is currently no consensus on the methodologies or thresholds for accurately calling CNVs, so many different tools call too many CNV candidates with little concordance between their results.
Description: Here we evaluate different signal processing techniques, including novel machine learning methods, for data normalization and denoising. This work applies wavelet shrinkage denoising and total variation denoising to smooth read-count data while preserving high-frequency information such as breakpoints, and uses a technique based on principal component analysis (PCA) and expectation-maximization (EM) to reduce undesired biases between samples. The log2 ratio is then segmented via circular binary segmentation (CBS) and fused-lasso techniques.
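To illustrate the denoising step, the following is a minimal NumPy sketch of 1-D total variation denoising using the iterative-clipping MM algorithm; it is not the implementation used in this work, and the signal, noise level, and regularization parameter `lam` are illustrative assumptions. The sketch shows the property that motivates the choice of technique: noise within a segment is flattened while the breakpoint (copy-number jump) is preserved.

```python
import numpy as np

def tv_denoise(y, lam, n_iter=300):
    """1-D total variation denoising via the iterative-clipping MM algorithm.

    Approximately solves  min_x 0.5*||y - x||^2 + lam * sum_i |x[i+1] - x[i]|,
    which smooths within-segment noise while keeping sharp jumps (breakpoints).
    """
    y = np.asarray(y, dtype=float)
    z = np.zeros(len(y) - 1)   # dual variable attached to the differences
    alpha = 4.0                # upper bound on the max eigenvalue of D D^T
    for _ in range(n_iter):
        # x = y - D^T z, where D is the first-difference operator
        x = y - np.concatenate(([-z[0]], z[:-1] - z[1:], [z[-1]]))
        # clipping keeps |z| <= lam/2, which induces the TV shrinkage
        z = np.clip(z + np.diff(x) / alpha, -lam / 2, lam / 2)
    return x

# Toy log2-ratio profile: flat at 0, then a copy-number gain at +1
rng = np.random.default_rng(0)
signal = np.concatenate([np.zeros(50), np.ones(50)])
noisy = signal + 0.3 * rng.normal(size=100)
denoised = tv_denoise(noisy, lam=1.0)
```

In this toy example the within-segment scatter of `denoised` drops well below that of `noisy`, while the step between the two halves remains close to its true magnitude of 1.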
Conclusions:
This work features a machine learning layer that clusters a given sample with similar individuals and detects CNVs as anomalies that deviate from the assigned population, following a statistical approach that lets the user select a desired sensitivity level for detection.
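The population-based anomaly detection can be sketched as below. This is a hypothetical illustration, not the actual statistical model of this work: the function name, the per-region z-score formulation, and the `alpha` parameter (standing in for the user-selected sensitivity level) are all assumptions made for the example.

```python
import numpy as np
from statistics import NormalDist

def call_cnv_anomalies(sample, population, alpha=0.01):
    """Flag regions of `sample` that deviate from a reference population.

    population : (n_samples, n_regions) normalized coverage of the cluster
    sample     : (n_regions,) normalized coverage of the query individual
    alpha      : two-sided significance level; smaller alpha = fewer calls
    """
    mu = population.mean(axis=0)
    sd = population.std(axis=0, ddof=1)
    z = (sample - mu) / sd                     # per-region z-scores
    threshold = NormalDist().inv_cdf(1 - alpha / 2)
    return np.flatnonzero(np.abs(z) > threshold)

# Synthetic cluster of 20 similar individuals over 30 regions (coverage ~1.0)
rng = np.random.default_rng(1)
population = rng.normal(loc=1.0, scale=0.05, size=(20, 30))
sample = np.ones(30)
sample[7] = 1.5                                # simulated copy-number gain
calls = call_cnv_anomalies(sample, population, alpha=0.01)
```

Lowering `alpha` raises the z-score threshold and yields fewer, higher-confidence calls, which is how a single parameter can expose a sensitivity/specificity trade-off to the user.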
We have created a workflow for CNV assessment and implemented a web application for exploring and annotating CNV calls, exposing the complexity of CNV detection.