This curriculum spans the full lifecycle of NGS data analysis, from experimental design through cloud-scale deployment and governance, with a scope comparable to a multi-phase bioinformatics capability program implemented across research and clinical sequencing teams.
Module 1: NGS Platform Selection and Experimental Design
- Selecting between Illumina, PacBio, and Oxford Nanopore based on required read length, error profile, and throughput for targeted versus whole-genome applications.
- Determining optimal sequencing depth for variant calling in tumor-normal paired samples, balancing sensitivity and cost.
- Designing multiplexing strategies using dual indexing to minimize cross-contamination and index hopping in high-throughput runs.
- Choosing between whole-exome and whole-genome sequencing based on clinical validity, coverage uniformity, and downstream analytical burden.
- Integrating spike-in controls and technical replicates to assess batch effects and library preparation variability.
- Aligning experimental goals with institutional IRB requirements, particularly when handling germline versus somatic variants.
- Planning for data storage and transfer bottlenecks during high-volume sequencing runs, especially with long-read platforms.
- Specifying RNA-seq library preparation protocols (e.g., poly-A selection vs. rRNA depletion) based on sample integrity and transcriptome complexity.
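The depth-planning bullet above can be made concrete with a simple binomial power model: treat variant-supporting reads at a site as Binomial(depth, VAF) draws and ask how deep you must sequence to see at least a few of them. This is a minimal sketch; the function names, the three-alt-read threshold, and the 95% target are illustrative assumptions, not a prescribed standard.

```python
from math import comb

def detection_power(depth: int, vaf: float, min_alt_reads: int = 3) -> float:
    """Probability of observing at least `min_alt_reads` variant-supporting
    reads at a site, modeling read sampling as Binomial(depth, vaf)."""
    p_below = sum(comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
                  for k in range(min_alt_reads))
    return 1.0 - p_below

def min_depth_for_power(vaf: float, target_power: float = 0.95,
                        min_alt_reads: int = 3, max_depth: int = 10_000) -> int:
    """Smallest depth at which detection power meets the target.
    Power is monotone in depth, so a linear scan suffices for a sketch."""
    for depth in range(min_alt_reads, max_depth + 1):
        if detection_power(depth, vaf, min_alt_reads) >= target_power:
            return depth
    raise ValueError("target power not reachable within max_depth")
```

For a 5% VAF subclonal variant, this model already pushes required depth above 100x, which is why tumor-normal designs budget far more coverage than germline calling.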
Module 2: Raw Data Quality Control and Preprocessing
- Interpreting FastQC reports to detect adapter contamination, overrepresented sequences, and per-base quality degradation.
- Implementing Trimmomatic or Cutadapt with platform-specific parameters to remove adapters while preserving informative reads.
- Deciding whether to apply quality-based trimming or hard clipping based on downstream variant sensitivity requirements.
- Filtering low-complexity reads in AT/GC-rich genomes to prevent alignment artifacts in repetitive regions.
- Assessing and correcting for PCR duplication rates using UMI-aware deduplication in amplicon-based panels.
- Validating FASTQ integrity using checksums and metadata logging before initiating large-scale processing pipelines.
- Handling mixed read lengths from degraded clinical samples in RNA-seq without introducing bias during trimming.
- Configuring parallelized QC workflows using Snakemake or Nextflow to manage compute load across clusters.
Module 3: Read Alignment and Reference Genome Management
- Selecting aligners (BWA-MEM, STAR, minimap2) based on data type (short-read, long-read, spliced RNA-seq).
- Choosing between linear and graph-based references (e.g., linear GRCh38 vs. a pangenome graph built with PGGB) to improve alignment in structurally variable regions.
- Managing reference genome versioning across teams to prevent reproducibility issues in longitudinal studies.
- Indexing custom reference genomes with decoy sequences to reduce false alignments in HLA or KIR regions.
- Optimizing alignment parameters for high-identity paralogous genes to minimize mis-mapping in disease-associated loci.
- Validating alignment performance using spike-in controls or synthetic reads with known variants.
- Handling multimapping reads in repetitive regions during ChIP-seq or methylation analysis with probabilistic assignment.
- Precomputing alignment indices on shared storage to reduce redundant compute in multi-user environments.
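Reference versioning across teams is easiest to enforce with sequence checksums rather than file names. The sketch below mirrors the SAM specification's @SQ M5 convention (MD5 over the uppercased sequence with whitespace removed), so two FASTA files that format the same chromosome differently still yield the same digest; `sq_md5` is an illustrative helper name.

```python
import hashlib

def sq_md5(seq: str) -> str:
    """MD5 digest of a reference sequence, computed as in the SAM spec's
    @SQ M5 tag: whitespace removed, then uppercased."""
    clean = "".join(seq.split()).upper()
    return hashlib.md5(clean.encode("ascii")).hexdigest()
```

Comparing these digests against the M5 tags in BAM/CRAM headers catches silent reference drift in longitudinal studies before it corrupts joint analyses.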
Module 4: Variant Calling and Genotype Refinement
- Configuring GATK HaplotypeCaller for germline SNVs/indels with cohort-based recalibration in population studies.
- Selecting between Mutect2, VarScan2, and Strelka2 for somatic variant calling based on tumor purity and ploidy assumptions.
- Applying joint calling across cohorts to improve genotype consistency while managing computational scaling.
- Filtering false positives in low-coverage regions using depth, strand bias, and mapping quality thresholds.
- Integrating local reassembly to resolve complex indels in homopolymer regions from long-read data.
- Validating CNV calls from exome data using off-target read depth and B-allele frequency from SNP arrays.
- Adjusting VQSR (Variant Quality Score Recalibration) training sets when working with non-European populations.
- Implementing ensemble calling strategies with cross-tool consensus to increase precision in clinical reporting.
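The cross-tool consensus idea in the last bullet reduces to counting how many callers report each variant. A minimal sketch, representing each call as a (chrom, pos, ref, alt) tuple; a real clinical ensemble would also normalize representations (left-alignment, multiallelic splitting) before intersecting.

```python
from collections import Counter

def consensus_calls(callsets: dict, min_support: int = 2) -> set:
    """Return variants reported by at least `min_support` callers.
    `callsets` maps caller name -> set of (chrom, pos, ref, alt) tuples."""
    counts = Counter(v for calls in callsets.values() for v in calls)
    return {v for v, n in counts.items() if n >= min_support}
```

Raising `min_support` trades sensitivity for precision, which is usually the right direction for clinical reporting and the wrong one for discovery cohorts.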
Module 5: Functional Annotation and Interpretation
- Selecting annotation databases (ClinVar, gnomAD, COSMIC, dbSNP) based on clinical actionability and population relevance.
- Resolving conflicting interpretations of VUS (Variants of Uncertain Significance) using ACMG/AMP guidelines in diagnostic settings.
- Integrating tissue-specific expression data from GTEx to prioritize non-coding variants in regulatory regions.
- Using CADD, REVEL, and SIFT scores to rank missense variants when functional assays are unavailable.
- Mapping splicing variants using SpliceAI or MaxEntScan to predict impact on canonical and cryptic splice sites.
- Filtering population-specific polymorphisms using local allele frequency databases to reduce false positives.
- Automating annotation pipelines with VEP or ANNOVAR while maintaining audit trails for clinical reporting.
- Handling novel gene-disease associations in research contexts without overinterpreting preliminary evidence.
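The frequency-filter-then-rank step from the bullets above can be sketched as below; the 1% AF cutoff and the score dictionary (standing in for REVEL or CADD values) are illustrative assumptions, and production pipelines would pull both from VEP or ANNOVAR annotations.

```python
def rank_and_filter(variants, af_db, scores, max_af=0.01):
    """Drop variants whose local allele frequency exceeds the cutoff,
    then rank the remainder by descending pathogenicity score.
    Variants absent from `af_db` are treated as AF 0 (novel)."""
    kept = [v for v in variants if af_db.get(v, 0.0) <= max_af]
    return sorted(kept, key=lambda v: scores.get(v, 0.0), reverse=True)
```

Treating database-absent variants as novel (AF 0) is deliberate: a population-specific polymorphism missing from a Eurocentric database should be caught by the *local* AF resource, which is exactly the point of the bullet on local databases.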
Module 6: Transcriptomic and Epigenomic Analysis
- Normalizing RNA-seq count data using TPM or DESeq2 methods depending on comparison scope (within vs. across samples).
- Correcting for batch effects in large expression datasets using ComBat or RUV without removing biological signal.
- Selecting differential expression tools (edgeR, limma-voom) based on count distribution and experimental design.
- Validating alternative splicing events with junction-spanning reads in STAR or rMATS outputs.
- Integrating ATAC-seq and ChIP-seq peaks with promoter/enhancer annotations to infer regulatory networks.
- Calling methylation levels from bisulfite sequencing with Bismark or BSMAP, correcting for incomplete conversion.
- Clustering single-cell RNA-seq data using Seurat or Scanpy while preserving batch-corrected biological variation.
- Defining pseudotime trajectories in developmental datasets with Monocle3 or PAGA, validating with known markers.
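TPM normalization, the within-sample option in the first bullet of this module, can be sketched in a few lines: divide counts by gene length in kilobases, then rescale the rates so each sample sums to one million. This is a simplified list-based version; lengths should really be effective lengths in base pairs.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for one sample: length-normalize counts
    to per-kilobase rates, then scale so the sample sums to 1e6."""
    rates = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]
```

Because every sample is forced to the same total, TPM values are comparable across genes within a sample but not across samples with different library compositions, which is why DESeq2-style normalization is preferred for between-sample contrasts.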
Module 7: Data Integration and Multi-Omics Workflows
- Mapping genomic variants to expression QTLs using GTEx or eQTL Catalogue to identify regulatory mechanisms.
- Performing pathway enrichment analysis with GSEA or Enrichr while correcting for gene set size and overlap.
- Integrating copy number, mutation, and expression data in cancer samples to identify driver events.
- Using WGCNA to construct co-expression networks and link modules to clinical phenotypes.
- Aligning proteomic abundance data with transcript levels to assess post-transcriptional regulation.
- Implementing MOFA+ for unsupervised integration of heterogeneous omics layers in cohort studies.
- Resolving discordance between DNA methylation and gene expression in imprinted regions.
- Managing data harmonization across platforms (e.g., microarray vs. RNA-seq) in meta-analyses.
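Pathway enrichment against a background, as in the GSEA/Enrichr bullet, reduces at its simplest to a one-sided hypergeometric test (the statistic behind the Fisher's exact test many over-representation tools use). Parameter names below are illustrative; real analyses must also correct for multiple testing across gene sets.

```python
from math import comb

def hypergeom_pval(k: int, n_hits: int, n_set: int, N: int) -> float:
    """P(overlap >= k) when drawing a gene set of size `n_set` from a
    background of `N` genes containing `n_hits` genes of interest
    (e.g., differentially expressed genes)."""
    upper = min(n_hits, n_set)
    return sum(comb(n_hits, i) * comb(N - n_hits, n_set - i)
               for i in range(k, upper + 1)) / comb(N, n_set)
```

With 50 DE genes out of 20,000 and a 100-gene pathway, the expected overlap is only 0.25 genes, so even a handful of shared genes yields a very small p-value; gene-set size and overlap corrections exist precisely because this sensitivity scales with set size.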
Module 8: Data Governance, Reproducibility, and Compliance
- Implementing audit trails for variant calling pipelines using version-controlled CWL or WDL workflows.
- Enforcing data access controls in multi-institutional collaborations using DUOS or dbGaP compliance checks.
- Applying de-identification protocols for genomic data under HIPAA and GDPR, including removal of quasi-identifiers.
- Archiving raw and processed data in compliant repositories (e.g., SRA, EGA) with metadata in MINSEQE format.
- Documenting pipeline parameters and software versions using RO-Crate or similar standards for reproducibility.
- Establishing change control procedures for updating reference genomes or annotation databases in production.
- Conducting periodic data integrity checks using checksum validation across storage tiers.
- Designing disaster recovery plans for high-value genomic datasets with geographically distributed backups.
Module 9: Scalable Infrastructure and Cloud Deployment
- Provisioning compute clusters with appropriate CPU, memory, and I/O profiles for alignment versus variant calling stages.
- Migrating legacy pipelines to containerized environments (Docker, Singularity) for cross-platform consistency.
- Configuring cloud bursting strategies using AWS Batch or Google Life Sciences API during peak loads.
- Optimizing storage costs by tiering raw FASTQs to cold storage and retaining CRAMs for active analysis.
- Implementing role-based access and encryption for data in transit and at rest on cloud platforms.
- Monitoring pipeline performance with Prometheus and Grafana to detect bottlenecks in distributed systems.
- Selecting between managed services (e.g., Terra, DNAnexus) and self-hosted solutions based on customization needs.
- Estimating egress costs and transfer times when sharing large datasets across international collaborators.
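A back-of-envelope version of the egress estimate in the last bullet is sketched below. The GiB-to-megabit conversion is approximate (treating 1 GiB as 8 x 1024 megabits and ignoring protocol overhead), and the price argument is a hypothetical per-GiB rate, since actual cloud egress pricing varies by provider, region, and volume tier.

```python
def egress_estimate(size_gib: float, price_per_gib_usd: float,
                    bandwidth_mbps: float):
    """Rough egress cost (USD) and transfer time (hours) for a dataset.
    Assumes sustained bandwidth and approximates 1 GiB = 8 * 1024 megabits."""
    cost = size_gib * price_per_gib_usd
    hours = size_gib * 8 * 1024 / bandwidth_mbps / 3600
    return cost, hours
```

Even with generous assumptions, a 1 TiB dataset over a 1 Gbps link ties up the connection for hours and incurs a non-trivial bill, which is why international collaborations often ship analyses to the data rather than the reverse.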