The All of Us (AoU) initiative aims to enhance personalized medical care by sequencing the genomes of over one million Americans of diverse ethnic backgrounds. To improve sequencing accuracy, we recently conducted a technical pilot comparing traditional short-read sequencing with long-read sequencing in a small cohort of samples from the HapMap project and two AoU control samples, representing eight datasets. Our analysis revealed significant differences between these technologies in the accuracy of sequencing complex, medically relevant genes, particularly in gene coverage and the identification of pathogenic variants. We also evaluated the advantages and challenges of using low-coverage sequencing to increase the number of samples in large cohort analyses. Our findings show that HiFi reads provide better results for both single nucleotide variants (SNVs) and structural variants (SVs). We developed a cloud-based pipeline optimized for long-read analysis, which includes tools for SNV, indel, and SV calling at scale. These results have significant implications for improving the accuracy and efficiency of sequencing in the AoU initiative and beyond. Furthermore, we extended our findings to produce an analysis panel of 389 complex, medically relevant genes, including genes related to cardiovascular disease, neurodegenerative disease, and cancer (e.g., LPA, SMN1/2). By sequencing 12 samples per SMRT cell, we achieved highly accurate variant calling, reaching up to 98.95% for substitutions, 94.44% for indels, and 94% for SVs, while reducing the sequencing cost per sample. We also developed a tailored, publicly available pipeline and gene-specific tools that exploit mapping-based and assembly-based methods to achieve optimal results.
In summary, our study demonstrates the advantages of using HiFi sequencing technology and provides a pipeline and gene-specific tools that can be used to improve sequencing accuracy and reduce costs in large-scale genomic studies.
Learning objectives:
1. Demonstrate understanding of the purposes and goals of the All of Us program and learn how to utilize the program's data for research purposes.
2. Differentiate between short reads and long reads and identify the limitations of short-read sequencing compared to long-read sequencing.
3. Discover best practices for calling single nucleotide variants (SNVs) and structural variants (SVs) from long-read sequencing data, and explore ways to reduce costs while still benefiting from long-read sequencing technology.