This curriculum spans the full lifecycle of NGS data analysis, from experimental design through cloud-scale deployment and governance, with a scope comparable to a multi-phase bioinformatics capability program implemented across research and clinical sequencing teams.
Module 1: NGS Platform Selection and Experimental Design
- Selecting between Illumina, PacBio, and Oxford Nanopore based on required read length, error profile, and throughput for targeted versus whole-genome applications.
- Determining optimal sequencing depth for variant calling in tumor-normal paired samples, balancing sensitivity and cost.
- Designing multiplexing strategies using dual indexing to minimize cross-contamination and index hopping in high-throughput runs.
- Choosing between whole-exome and whole-genome sequencing based on clinical validity, coverage uniformity, and downstream analytical burden.
- Integrating spike-in controls and technical replicates to assess batch effects and library preparation variability.
- Aligning experimental goals with institutional IRB requirements, particularly when handling germline versus somatic variants.
- Planning for data storage and transfer bottlenecks during high-volume sequencing runs, especially with long-read platforms.
- Specifying RNA-seq library preparation protocols (e.g., poly-A selection vs. rRNA depletion) based on sample integrity and transcriptome complexity.
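The depth-planning bullet above can be made concrete with a simple binomial power model: treat variant-supporting reads at a site as Binomial(depth, VAF) draws and ask how deep you must sequence to see at least a few of them. This is a minimal sketch; the function names, the three-alt-read threshold, and the 95% target are illustrative assumptions, not a prescribed standard.

```python
from math import comb

def detection_power(depth: int, vaf: float, min_alt_reads: int = 3) -> float:
    """Probability of observing at least `min_alt_reads` variant-supporting
    reads at a site, modeling read sampling as Binomial(depth, vaf)."""
    p_below = sum(comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
                  for k in range(min_alt_reads))
    return 1.0 - p_below

def min_depth_for_power(vaf: float, target_power: float = 0.95,
                        min_alt_reads: int = 3, max_depth: int = 10_000) -> int:
    """Smallest depth at which detection power meets the target.
    Power is monotone in depth, so a linear scan suffices for a sketch."""
    for depth in range(min_alt_reads, max_depth + 1):
        if detection_power(depth, vaf, min_alt_reads) >= target_power:
            return depth
    raise ValueError("target power not reachable within max_depth")
```

For a 5% VAF subclonal variant, this model already pushes required depth above 100x, which is why tumor-normal designs budget far more coverage than germline calling.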
Module 2: Raw Data Quality Control and Preprocessing
- Interpreting FastQC reports to detect adapter contamination, overrepresented sequences, and per-base quality degradation.
- Implementing Trimmomatic or Cutadapt with platform-specific parameters to remove adapters while preserving informative reads.
- Deciding whether to apply quality-based trimming or hard clipping based on downstream variant sensitivity requirements.
- Filtering low-complexity reads in AT/GC-rich genomes to prevent alignment artifacts in repetitive regions.
- Assessing and correcting for PCR duplication rates using UMI-aware deduplication in amplicon-based panels.
- Validating FASTQ integrity using checksums and metadata logging before initiating large-scale processing pipelines.
- Handling mixed read lengths from degraded clinical samples in RNA-seq without introducing bias during trimming.
- Configuring parallelized QC workflows using Snakemake or Nextflow to manage compute load across clusters.
Module 3: Read Alignment and Reference Genome Management
- Selecting aligners (BWA-MEM, STAR, minimap2) based on data type (short-read, long-read, spliced RNA-seq).
- Choosing between linear and graph-based references (e.g., linear GRCh38 vs. a pangenome graph built with PGGB) to improve alignment in structurally variable regions.
- Managing reference genome versioning across teams to prevent reproducibility issues in longitudinal studies.
- Indexing custom reference genomes with decoy sequences to reduce false alignments in HLA or KIR regions.
- Optimizing alignment parameters for high-identity paralogous genes to minimize mis-mapping in disease-associated loci.
- Validating alignment performance using spike-in controls or synthetic reads with known variants.
- Handling multimapping reads in repetitive regions during ChIP-seq or methylation analysis with probabilistic assignment.
- Precomputing alignment indices on shared storage to reduce redundant compute in multi-user environments.
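Reference versioning across teams is easiest to enforce with sequence checksums rather than file names. The sketch below mirrors the SAM specification's @SQ M5 convention (MD5 over the uppercased sequence with whitespace removed), so two FASTA files that format the same chromosome differently still yield the same digest; `sq_md5` is an illustrative helper name.

```python
import hashlib

def sq_md5(seq: str) -> str:
    """MD5 digest of a reference sequence, computed as in the SAM spec's
    @SQ M5 tag: whitespace removed, then uppercased."""
    clean = "".join(seq.split()).upper()
    return hashlib.md5(clean.encode("ascii")).hexdigest()
```

Comparing these digests against the M5 tags in BAM/CRAM headers catches silent reference drift in longitudinal studies before it corrupts joint analyses.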
Module 4: Variant Calling and Genotype Refinement
- Configuring GATK HaplotypeCaller for germline SNVs/indels with cohort-based recalibration in population studies.
- Selecting between Mutect2, VarScan2, and Strelka2 for somatic variant calling based on tumor purity and ploidy assumptions.
- Applying joint calling across cohorts to improve genotype consistency while managing computational scaling.
- Filtering false positives in low-coverage regions using depth, strand bias, and mapping quality thresholds.
- Integrating local reassembly to resolve complex indels in homopolymer regions from long-read data.
- Validating CNV calls from exome data using off-target read depth and B-allele frequency from SNP arrays.
- Adjusting VQSR (Variant Quality Score Recalibration) training sets when working with non-European populations.
- Implementing ensemble calling strategies with cross-tool consensus to increase precision in clinical reporting.
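The cross-tool consensus idea in the last bullet reduces to counting how many callers report each variant. A minimal sketch, representing each call as a (chrom, pos, ref, alt) tuple; a real clinical ensemble would also normalize representations (left-alignment, multiallelic splitting) before intersecting.

```python
from collections import Counter

def consensus_calls(callsets: dict, min_support: int = 2) -> set:
    """Return variants reported by at least `min_support` callers.
    `callsets` maps caller name -> set of (chrom, pos, ref, alt) tuples."""
    counts = Counter(v for calls in callsets.values() for v in calls)
    return {v for v, n in counts.items() if n >= min_support}
```

Raising `min_support` trades sensitivity for precision, which is usually the right direction for clinical reporting and the wrong one for discovery cohorts.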
Module 5: Functional Annotation and Interpretation
- Selecting annotation databases (ClinVar, gnomAD, COSMIC, dbSNP) based on clinical actionability and population relevance.
- Resolving conflicting interpretations of VUS (Variants of Uncertain Significance) using ACMG/AMP guidelines in diagnostic settings.
- Integrating tissue-specific expression data from GTEx to prioritize non-coding variants in regulatory regions.
- Using CADD, REVEL, and SIFT scores to rank missense variants when functional assays are unavailable.
- Mapping splicing variants using SpliceAI or MaxEntScan to predict impact on canonical and cryptic splice sites.
- Filtering population-specific polymorphisms using local allele frequency databases to reduce false positives.
- Automating annotation pipelines with VEP or ANNOVAR while maintaining audit trails for clinical reporting.
- Handling novel gene-disease associations in research contexts without overinterpreting preliminary evidence.
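The frequency-filter-then-rank step from the bullets above can be sketched as below; the 1% AF cutoff and the score dictionary (standing in for REVEL or CADD values) are illustrative assumptions, and production pipelines would pull both from VEP or ANNOVAR annotations.

```python
def rank_and_filter(variants, af_db, scores, max_af=0.01):
    """Drop variants whose local allele frequency exceeds the cutoff,
    then rank the remainder by descending pathogenicity score.
    Variants absent from `af_db` are treated as AF 0 (novel)."""
    kept = [v for v in variants if af_db.get(v, 0.0) <= max_af]
    return sorted(kept, key=lambda v: scores.get(v, 0.0), reverse=True)
```

Treating database-absent variants as novel (AF 0) is deliberate: a population-specific polymorphism missing from a Eurocentric database should be caught by the *local* AF resource, which is exactly the point of the bullet on local databases.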
Module 6: Transcriptomic and Epigenomic Analysis
- Normalizing RNA-seq count data using TPM or DESeq2 methods depending on comparison scope (within vs. across samples).
- Correcting for batch effects in large expression datasets using ComBat or RUV without removing biological signal.
- Selecting differential expression tools (edgeR, limma-voom) based on count distribution and experimental design.
- Validating alternative splicing events with junction-spanning reads in STAR or rMATS outputs.
- Integrating ATAC-seq and ChIP-seq peaks with promoter/enhancer annotations to infer regulatory networks.
- Calling methylation levels from bisulfite sequencing with Bismark or BSMAP, correcting for incomplete conversion.
- Clustering single-cell RNA-seq data using Seurat or Scanpy while preserving batch-corrected biological variation.
- Defining pseudotime trajectories in developmental datasets with Monocle3 or PAGA, validating with known markers.
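TPM normalization, the within-sample option in the first bullet of this module, can be sketched in a few lines: divide counts by gene length in kilobases, then rescale the rates so each sample sums to one million. This is a simplified list-based version; lengths should really be effective lengths in base pairs.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for one sample: length-normalize counts
    to per-kilobase rates, then scale so the sample sums to 1e6."""
    rates = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]
```

Because every sample is forced to the same total, TPM values are comparable across genes within a sample but not across samples with different library compositions, which is why DESeq2-style normalization is preferred for between-sample contrasts.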
Module 7: Data Integration and Multi-Omics Workflows
- Mapping genomic variants to expression QTLs using GTEx or eQTL Catalogue to identify regulatory mechanisms.
- Performing pathway enrichment analysis with GSEA or Enrichr while correcting for gene set size and overlap.
- Integrating copy number, mutation, and expression data in cancer samples to identify driver events.
- Using WGCNA to construct co-expression networks and link modules to clinical phenotypes.
- Aligning proteomic abundance data with transcript levels to assess post-transcriptional regulation.
- Implementing MOFA+ for unsupervised integration of heterogeneous omics layers in cohort studies.
- Resolving discordance between DNA methylation and gene expression in imprinted regions.
- Managing data harmonization across platforms (e.g., microarray vs. RNA-seq) in meta-analyses.
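Pathway enrichment against a background, as in the GSEA/Enrichr bullet, reduces at its simplest to a one-sided hypergeometric test (the statistic behind the Fisher's exact test many over-representation tools use). Parameter names below are illustrative; real analyses must also correct for multiple testing across gene sets.

```python
from math import comb

def hypergeom_pval(k: int, n_hits: int, n_set: int, N: int) -> float:
    """P(overlap >= k) when drawing a gene set of size `n_set` from a
    background of `N` genes containing `n_hits` genes of interest
    (e.g., differentially expressed genes)."""
    upper = min(n_hits, n_set)
    return sum(comb(n_hits, i) * comb(N - n_hits, n_set - i)
               for i in range(k, upper + 1)) / comb(N, n_set)
```

With 50 DE genes out of 20,000 and a 100-gene pathway, the expected overlap is only 0.25 genes, so even a handful of shared genes yields a very small p-value; gene-set size and overlap corrections exist precisely because this sensitivity scales with set size.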
Module 8: Data Governance, Reproducibility, and Compliance
- Implementing audit trails for variant calling pipelines using version-controlled CWL or WDL workflows.
- Enforcing data access controls in multi-institutional collaborations using DUOS or dbGaP compliance checks.
- Applying de-identification protocols for genomic data under HIPAA and GDPR, including removal of quasi-identifiers.
- Archiving raw and processed data in compliant repositories (e.g., SRA, EGA) with metadata in MINSEQE format.
- Documenting pipeline parameters and software versions using RO-Crate or similar standards for reproducibility.
- Establishing change control procedures for updating reference genomes or annotation databases in production.
- Conducting periodic data integrity checks using checksum validation across storage tiers.
- Designing disaster recovery plans for high-value genomic datasets with geographically distributed backups.
Module 9: Scalable Infrastructure and Cloud Deployment
- Provisioning compute clusters with appropriate CPU, memory, and I/O profiles for alignment versus variant calling stages.
- Migrating legacy pipelines to containerized environments (Docker, Singularity) for cross-platform consistency.
- Configuring cloud bursting strategies using AWS Batch or Google Life Sciences API during peak loads.
- Optimizing storage costs by tiering raw FASTQs to cold storage and retaining CRAMs for active analysis.
- Implementing role-based access and encryption for data in transit and at rest on cloud platforms.
- Monitoring pipeline performance with Prometheus and Grafana to detect bottlenecks in distributed systems.
- Selecting between managed services (e.g., Terra, DNAnexus) and self-hosted solutions based on customization needs.
- Estimating egress costs and transfer times when sharing large datasets across international collaborators.
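A back-of-envelope version of the egress estimate in the last bullet is sketched below. The GiB-to-megabit conversion is approximate (treating 1 GiB as 8 x 1024 megabits and ignoring protocol overhead), and the price argument is a hypothetical per-GiB rate, since actual cloud egress pricing varies by provider, region, and volume tier.

```python
def egress_estimate(size_gib: float, price_per_gib_usd: float,
                    bandwidth_mbps: float):
    """Rough egress cost (USD) and transfer time (hours) for a dataset.
    Assumes sustained bandwidth and approximates 1 GiB = 8 * 1024 megabits."""
    cost = size_gib * price_per_gib_usd
    hours = size_gib * 8 * 1024 / bandwidth_mbps / 3600
    return cost, hours
```

Even with generous assumptions, a 1 TiB dataset over a 1 Gbps link ties up the connection for hours and incurs a non-trivial bill, which is why international collaborations often ship analyses to the data rather than the reverse.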