In recent years, next-generation sequencing (NGS) datasets generated for biomedical research have grown exponentially. As the field of clinical genomics has expanded, so has the number of datasets generated from patient samples. One major challenge the field has encountered is donor mislabelling of datasets, which can lead to incorrect conclusions in biomedical research studies and affect diagnosis and treatment choices in a clinical setting. To address this issue, Javed et al.¹ present CrosscheckFingerprints (Crosscheck), a method to detect donor mislabelling across different types of NGS samples. Crosscheck exploits linkage disequilibrium, the non-random association of alleles at different genomic loci. Using NGS-derived genotyping information, Crosscheck compares pairs of datasets and calculates the likelihood that they come from the same donor. The authors benchmark Crosscheck against existing methods and find that it detects instances of donor mislabelling with fewer errors. They further demonstrate its usefulness by successfully applying it to datasets with low sequencing depth and to datasets of different data types. Finally, the authors apply Crosscheck to 8,851 ENCODE datasets and identify different types of mislabelling events. This demonstrates the scalability of Crosscheck and its potential for quality control in large sequencing platforms and consortia projects. The code and documentation required to use Crosscheck are fully available online and will be a useful resource for the community.
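The same-donor comparison described above can be illustrated with a toy log-odds (LOD) calculation. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each dataset has already been summarised as genotype likelihoods at a set of independent loci (for example, one per linkage-disequilibrium block), that genotype priors follow Hardy-Weinberg proportions, and that the function names `hardy_weinberg_prior`, `site_lod` and `total_lod` are hypothetical helpers introduced only for illustration.

```python
import math


def hardy_weinberg_prior(alt_freq):
    """Prior probabilities of (hom-ref, het, hom-alt) genotypes.

    Assumption: Hardy-Weinberg proportions for a biallelic locus.
    """
    ref_freq = 1.0 - alt_freq
    return (ref_freq * ref_freq, 2.0 * ref_freq * alt_freq, alt_freq * alt_freq)


def site_lod(lik_a, lik_b, alt_freq):
    """log10 odds that two datasets share a donor, at one locus.

    lik_a, lik_b: P(observed reads | genotype) for the three genotypes,
    one tuple per dataset (illustrative inputs, not a real file format).
    """
    prior = hardy_weinberg_prior(alt_freq)
    # Same donor: both datasets were drawn from one shared genotype.
    same = sum(p * a * b for p, a, b in zip(prior, lik_a, lik_b))
    # Different donors: each dataset has its own independent genotype.
    marginal_a = sum(p * a for p, a in zip(prior, lik_a))
    marginal_b = sum(p * b for p, b in zip(prior, lik_b))
    return math.log10(same / (marginal_a * marginal_b))


def total_lod(sites):
    """Sum per-locus LOD scores, treating loci as independent."""
    return sum(site_lod(a, b, freq) for a, b, freq in sites)


# Two datasets that both look heterozygous at a common variant.
matching = [((0.01, 0.98, 0.01), (0.01, 0.98, 0.01), 0.5)]
# Two datasets that look homozygous for opposite alleles.
discordant = [((0.98, 0.01, 0.01), (0.01, 0.01, 0.98), 0.5)]

print(total_lod(matching) > 0)    # evidence for the same donor
print(total_lod(discordant) < 0)  # evidence for different donors
```

In this sketch a positive total favours a shared donor and a negative total favours different donors; a locus where one dataset has no informative reads contributes a score of zero. Grouping variants into linkage-disequilibrium blocks, as the method summarised here does, is what would allow datasets that cover different variants within a block to be compared at all.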
RESEARCH HIGHLIGHT
Crosscheck for labels of sequencing dataset
doi: https://doi.org/10.1038/d42859-020-00062-z
References
1. Javed, N. et al. Detecting sample swaps in diverse NGS data types using linkage disequilibrium. Nat. Commun. https://doi.org/10.1038/s41467-020-17453-5 (2020).