This curriculum spans the technical and operational complexity of a multi-year bioinformatics capability program, covering the full lifecycle of variant analysis from raw data ingestion to clinical reporting and scalable infrastructure management.
Module 1: Foundations of Genomic Data Standards and File Formats
- Select appropriate compression and indexing strategies for BAM, CRAM, and VCF files based on access patterns and storage constraints.
- Implement consistent metadata tagging across sequencing runs using MIxS or GA4GH Phenopackets standards.
- Validate VCF file integrity using bcftools and Ensembl’s validator to detect format deviations and reference genome mismatches.
- Design a file naming convention that encodes sample ID, assay type, processing version, and data modality for auditability.
- Choose between hg19 and hg38 reference builds based on cohort ancestry, annotation availability, and legacy data compatibility.
- Establish checksum policies (SHA-256) for raw FASTQ files to ensure data provenance across transfer points.
- Integrate sequence read archive (SRA) metadata parsing into ingestion pipelines for public dataset reuse.
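Two of the bullets above (structured file naming and SHA-256 checksums) can be sketched concretely. The naming pattern below is a hypothetical convention, not a standard; the regex fields and allowed assay/modality values are illustrative and would be tailored per program.

```python
import hashlib
import re

# Hypothetical convention: <sampleID>_<assay>_v<pipelineVersion>_<modality>.<ext>
NAME_RE = re.compile(
    r"(?P<sample>[A-Z0-9-]+)_(?P<assay>wgs|wes|panel)_"
    r"v(?P<version>\d+\.\d+)_(?P<modality>dna|rna)"
    r"\.(?P<ext>fastq\.gz|bam|cram|vcf\.gz)"
)

def build_name(sample, assay, version, modality, ext):
    """Compose an auditable file name from its required metadata components."""
    return f"{sample}_{assay}_v{version}_{modality}.{ext}"

def parse_name(name):
    """Return the metadata dict encoded in a file name, or None if malformed."""
    m = NAME_RE.fullmatch(name)
    return m.groupdict() if m else None

def sha256_of(path, chunk=1 << 20):
    """Stream a file in 1 MiB chunks so large FASTQs never load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()
```

Because the name is machine-parseable, any file whose name fails `parse_name` can be quarantined at ingestion before it pollutes downstream provenance records.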
Module 2: High-Throughput Variant Calling Workflows
- Compare GATK HaplotypeCaller, FreeBayes, and DeepVariant for sensitivity in low-coverage regions and structural variant detection.
- Configure joint calling pipelines to minimize batch effects across multi-cohort studies.
- Optimize gVCF merging strategies to balance computational load and cohort scalability in population-level analyses.
- Adjust base quality score recalibration (BQSR) parameters when working with non-standard sequencing chemistries or damaged DNA.
- Implement hard filtering thresholds for SNPs and indels when variant quality score recalibration (VQSR) is infeasible due to small cohort size.
- Validate germline variant calls using orthogonal technologies (e.g., genotyping arrays or Sanger sequencing).
- Manage memory and I/O bottlenecks in variant calling on high-coverage WGS data using containerized pipeline scaling.
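The hard-filtering bullet above can be made concrete. The thresholds below follow GATK's published hard-filtering starting points for germline short variants (e.g. QD < 2.0, FS > 60 for SNPs); treat them as defaults to tune per cohort, not fixed rules.

```python
# GATK's documented hard-filter starting points; tune per cohort.
SNP_RULES = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}
INDEL_RULES = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 200.0,
    "ReadPosRankSum": lambda v: v < -20.0,
}

def hard_filter(variant_type, info):
    """Return the names of the filters a variant fails, given its parsed
    INFO annotations. Missing annotations are skipped rather than failed,
    matching the usual behavior when an annotation cannot be computed."""
    rules = SNP_RULES if variant_type == "SNP" else INDEL_RULES
    return [name for name, fails in rules.items()
            if name in info and fails(info[name])]
```

An empty return list means PASS; a non-empty list can be written directly into the VCF FILTER column as a semicolon-joined string.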
Module 3: Structural Variant and Copy Number Analysis
- Integrate multiple callers (Manta, Delly, CNVnator) to improve detection sensitivity and reduce platform-specific biases.
- Resolve discordant SV calls across callers using breakpoint clustering and local assembly validation.
- Adjust read-depth normalization methods for tumor-normal pairs with variable ploidy and tumor purity.
- Apply GC-content and mappability corrections to CNV segmentation in low-complexity genomic regions.
- Classify complex rearrangements (chromothripsis, breakage-fusion-bridge) using pattern-based heuristics and cytogenetic correlation.
- Validate large deletions or duplications with qPCR or MLPA in clinical reporting contexts.
- Handle low-pass whole-genome sequencing data in population-scale CNV studies with imputation-aware segmentation.
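The breakpoint-clustering step for reconciling discordant SV calls can be sketched as a greedy single-linkage pass. The 500 bp window and the tuple layout are illustrative assumptions; production merging would also require reciprocal-overlap checks and local assembly validation as noted above.

```python
def cluster_sv_calls(calls, window=500):
    """Greedy single-linkage clustering of SV calls that share chromosome
    and SV type and whose start and end breakpoints each lie within
    `window` bp. Each call is a tuple (chrom, start, end, svtype, caller).
    Calls are sorted by position so comparing against a cluster's last
    member is sufficient for this sketch."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c[0], c[3], c[1])):
        chrom, start, end, svtype, _ = call
        for cluster in clusters:
            last = cluster[-1]
            if (last[0] == chrom and last[3] == svtype
                    and abs(last[1] - start) <= window
                    and abs(last[2] - end) <= window):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return clusters
```

Clusters supported by two or more callers are the natural candidates to keep; singletons can be routed to orthogonal validation (qPCR, MLPA) before clinical reporting.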
Module 4: Functional Annotation and Pathogenicity Assessment
- Select annotation sources (VEP, ANNOVAR, SnpEff) based on species support, plugin ecosystem, and regulatory region coverage.
- Customize transcript selection rules (MANE, canonical, tissue-specific) to align with clinical or research use cases.
- Integrate CADD, REVEL, and MetaLR scores into prioritization workflows with calibrated thresholds per variant class.
- Flag variants in non-coding regions with regulatory potential using ENCODE, Roadmap Epigenomics, and promoter capture Hi-C data.
- Resolve conflicting pathogenicity assertions from ClinVar submitters using evidence weighting and submission date filtering.
- Implement HGVS nomenclature compliance for variant reporting to meet ACMG and LOINC standards.
- Cache and version annotation databases locally to ensure reproducibility across analysis batches.
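The evidence-weighting approach to conflicting ClinVar assertions can be sketched as follows. The review-status weights and the 2016 date cutoff are illustrative choices, not ClinVar policy; real pipelines would calibrate both against their own validation set.

```python
from datetime import date

# Illustrative weights keyed by ClinVar review status (higher = more trusted).
REVIEW_WEIGHT = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, single submitter": 1,
    "no assertion criteria provided": 0,
}

def resolve_assertions(assertions, min_date=date(2016, 1, 1)):
    """Pick a consensus classification from conflicting submissions.
    Each assertion is (classification, review_status, submission_date).
    Submissions older than the (illustrative) cutoff are dropped, then
    classifications are scored by summed review-status weight."""
    scores = {}
    for cls, status, sub_date in assertions:
        if sub_date < min_date:
            continue
        scores[cls] = scores.get(cls, 0) + REVIEW_WEIGHT.get(status, 0)
    if not scores:
        return "Unresolved"
    return max(scores, key=scores.get)
```

Ties and "Unresolved" outcomes should fall through to manual curation rather than an automated call.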
Module 5: Population Genetics and Allele Frequency Filtering
- Choose population-matched controls from gnomAD, TOPMed, or HGDP to avoid spurious filtering in underrepresented groups.
- Adjust allele frequency thresholds for recessive vs. dominant inheritance models in rare disease analysis.
- Account for relatedness and cryptic population structure in cohort-level frequency estimation using KING or PC-Relate.
- Implement stratified filtering to preserve variants enriched in specific subpopulations without discarding true positives.
- Quantify batch effects in allele frequencies across sequencing centers using principal component analysis on common variants.
- Use linkage disequilibrium-aware pruning in GWAS preprocessing to reduce multicollinearity in regression models.
- Update internal frequency databases with project-specific data to refine filtering in longitudinal studies.
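Inheritance-aware, stratification-aware frequency filtering can be combined in one small check. The cutoffs below (1e-4 dominant, 5e-3 recessive) and the population keys are illustrative assumptions; real projects calibrate thresholds against disease prevalence and penetrance.

```python
# Illustrative cutoffs; calibrate per gene, prevalence, and penetrance.
AF_CUTOFF = {"dominant": 1e-4, "recessive": 5e-3}

def passes_frequency_filter(pop_afs, inheritance):
    """Keep a variant only if its maximum allele frequency across population
    strata (a popmax-style statistic) stays under the inheritance-model
    cutoff. pop_afs is a dict like {"nfe": 1e-5, "afr": 3e-4}. Using the
    per-population maximum rather than a pooled global AF avoids both
    false rescues from imbalanced cohorts and spurious filtering of
    variants that are rare in every stratum."""
    popmax = max(pop_afs.values(), default=0.0)
    return popmax <= AF_CUTOFF[inheritance]
```

A variant that is common in any single well-sampled population is an unlikely cause of a rare dominant disorder, which is exactly what the popmax comparison encodes.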
Module 6: Clinical Interpretation and ACMG Guidelines Implementation
- Map variant evidence codes (PS1, PM2, PP3, etc.) to automated rules while preserving manual override paths for complex cases.
- Integrate RNA-seq or splicing assay data into PVS1 strength assessment for predicted null variants.
- Configure automated review of de novo variants using trio phasing and Mendelian inconsistency checks.
- Document rationale for downgrading strong evidence (e.g., PM1 in low-specificity domains) in clinical reports.
- Implement audit trails for classification changes across reanalysis cycles in diagnostic pipelines.
- Align classification workflows with CAP/ACMG reporting requirements for somatic and germline variants.
- Manage reclassification policies for variants of uncertain significance (VUS) in longitudinal patient care.
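Mapping evidence codes to automated rules can be sketched with a simplified subset of the ACMG/AMP combining rules (Richards et al. 2015). Only the pathogenic-side combinations are shown; a production engine also needs the benign criteria, strength modifiers, and the manual-override paths called out above.

```python
def classify(codes):
    """Simplified pathogenic-side subset of the ACMG/AMP combining rules.
    `codes` is a list like ["PVS1", "PM2", "PP3"]. Benign criteria and
    strength overrides are deliberately omitted from this sketch."""
    pvs = sum(c.startswith("PVS") for c in codes)
    ps = sum(c.startswith("PS") for c in codes)
    pm = sum(c.startswith("PM") for c in codes)
    pp = sum(c.startswith("PP") for c in codes)

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm == 1 and pp == 1) or pp >= 2))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    if pathogenic:
        return "Pathogenic"

    likely = (
        (pvs == 1 and pm >= 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    if likely:
        return "Likely pathogenic"
    return "Uncertain significance"
```

Every automated call should still carry the contributing codes into the audit trail, so downgrades (e.g. PM1 in a low-specificity domain) remain traceable across reanalysis cycles.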
Module 7: Data Integration and Multi-Omics Correlation
- Align genomic variant coordinates with methylation array probes (e.g., EPIC) to assess cis-regulatory effects.
- Integrate eQTL databases (GTEx, eQTLGen) to prioritize non-coding variants affecting gene expression.
- Perform allele-specific expression analysis using RNA-seq data from matched tumor-normal samples.
- Map structural variants to 3D chromatin interaction domains (Hi-C) to identify disrupted enhancer-promoter loops.
- Correlate mutational signatures from WGS with transcriptomic subtypes in cancer cohorts.
- Resolve tissue-specific effects by filtering multi-omics associations using cell-type deconvolution results.
- Manage batch effects across omics layers using ComBat or surrogate variable analysis (SVA).
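The allele-specific expression bullet reduces to a statistical test on ref/alt read counts at heterozygous sites. A minimal sketch using an exact two-sided binomial test against a 50/50 null (ignoring reference-mapping bias, which real ASE pipelines must correct for):

```python
from math import comb

def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value: sum the probabilities of all
    outcomes at most as likely as the observed count."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    obs = pmf[k]
    return sum(x for x in pmf if x <= obs + 1e-12)

def ase_pvalue(ref_reads, alt_reads):
    """Test a heterozygous site's RNA-seq counts for allelic imbalance.
    Null hypothesis: both alleles are expressed equally (p = 0.5)."""
    n = ref_reads + alt_reads
    return binom_two_sided(alt_reads, n)
```

For genome-wide ASE scans the resulting p-values would be corrected for multiple testing, and sites with very low total depth excluded up front since the test has no power there.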
Module 8: Regulatory Compliance and Data Governance
- Classify genomic data under GDPR, HIPAA, or CCPA based on identifiability and re-identification risk assessments.
- Implement dynamic consent tracking for data reuse in biobank-scale research infrastructures.
- Configure access controls for tiered data (raw reads, variants, phenotypes) using attribute-based access control (ABAC).
- Design audit logs that capture data access, variant classification changes, and pipeline execution provenance.
- Apply data minimization principles when sharing variant data via Beacon or federated networks.
- Establish data retention and destruction policies aligned with IRB protocols and funding requirements.
- Navigate export control regulations (e.g., USML, Wassenaar) when transferring pathogen or dual-use genomic data.
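The tiered ABAC bullet can be sketched as a small policy table. The tier names, roles, and consent levels below are toy illustrations, not any standard vocabulary; a real deployment would externalize the policy (e.g. into a policy engine) rather than hard-code it.

```python
# Toy ABAC policy: each data tier lists the attributes a request must carry.
# Tier names, roles, and consent levels are illustrative only.
POLICY = {
    "raw_reads":  {"roles": {"bioinformatician"},              "min_consent": 2},
    "variants":   {"roles": {"bioinformatician", "clinician"}, "min_consent": 1},
    "phenotypes": {"roles": {"clinician"},                     "min_consent": 2},
}

def authorize(subject, tier):
    """Grant access when the subject's role is permitted for the tier and
    the participant's consent level (0 = none, 1 = restricted, 2 = broad)
    meets the tier's minimum. Unknown tiers are denied by default."""
    rule = POLICY.get(tier)
    if rule is None:
        return False
    return (subject.get("role") in rule["roles"]
            and subject.get("consent_level", 0) >= rule["min_consent"])
```

Deny-by-default for unknown tiers keeps the policy fail-safe, and every decision should also be written to the audit log described above.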
Module 9: Scalable Infrastructure and Pipeline Orchestration
- Select workflow languages (WDL, Nextflow, Snakemake) based on team expertise, cloud portability, and debugging tooling.
- Optimize autoscaling policies for bursty variant calling workloads on Kubernetes or AWS Batch.
- Implement checkpointing and resume functionality for long-running annotation jobs on interrupted nodes.
- Version control pipeline configurations using Git with semantic tagging and dependency pinning.
- Containerize tools with Singularity or Docker to ensure reproducibility across HPC and cloud environments.
- Monitor pipeline performance using metrics (CPU, memory, runtime) to identify bottlenecks in joint calling stages.
- Design disaster recovery strategies for pipeline metadata and intermediate files in distributed storage systems.
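The checkpoint-and-resume bullet can be illustrated with a marker-file sketch. This is a minimal pattern, not how Nextflow or Cromwell implement caching; those engines also hash step inputs so stale markers are invalidated when upstream data changes.

```python
import json
import time
from pathlib import Path

def run_step(name, func, workdir):
    """Run `func` at most once per work directory: if a done-marker already
    exists, return the recorded result instead of recomputing, so an
    interrupted pipeline resumes where it left off."""
    marker = Path(workdir) / f".{name}.done"
    if marker.exists():
        return json.loads(marker.read_text())["result"]
    result = func()  # the actual work: annotation, calling, etc.
    marker.write_text(json.dumps({"result": result, "finished_at": time.time()}))
    return result
```

Writing the marker only after `func` returns means a node killed mid-step leaves no marker, so the step reruns cleanly on resume; for full safety the marker write itself should be atomic (write to a temp file, then rename).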