This curriculum spans the technical and operational complexity of a multi-year bioinformatics capability program, covering the full lifecycle of variant analysis from raw data ingestion to clinical reporting and scalable infrastructure management.
Module 1: Foundations of Genomic Data Standards and File Formats
- Select appropriate compression and indexing strategies for BAM, CRAM, and VCF files based on access patterns and storage constraints.
- Implement consistent metadata tagging across sequencing runs using MIxS or GA4GH Phenopackets standards.
- Validate VCF file integrity using bcftools and Ensembl’s validator to detect format deviations and reference genome mismatches.
- Design a file naming convention that encodes sample ID, assay type, processing version, and data modality for auditability.
- Choose between hg19 and hg38 reference builds based on cohort ancestry, annotation availability, and legacy data compatibility.
- Establish checksum policies (SHA-256) for raw FASTQ files to ensure data provenance across transfer points.
- Integrate sequence read archive (SRA) metadata parsing into ingestion pipelines for public dataset reuse.
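Two of the bullets above (structured file naming and SHA-256 checksums) can be sketched concretely. The naming pattern below is a hypothetical convention, not a standard; the regex fields and allowed assay/modality values are illustrative and would be tailored per program.

```python
import hashlib
import re

# Hypothetical convention: <sampleID>_<assay>_v<pipelineVersion>_<modality>.<ext>
NAME_RE = re.compile(
    r"(?P<sample>[A-Z0-9-]+)_(?P<assay>wgs|wes|panel)_"
    r"v(?P<version>\d+\.\d+)_(?P<modality>dna|rna)"
    r"\.(?P<ext>fastq\.gz|bam|cram|vcf\.gz)"
)

def build_name(sample, assay, version, modality, ext):
    """Compose an auditable file name from its required metadata components."""
    return f"{sample}_{assay}_v{version}_{modality}.{ext}"

def parse_name(name):
    """Return the metadata dict encoded in a file name, or None if malformed."""
    m = NAME_RE.fullmatch(name)
    return m.groupdict() if m else None

def sha256_of(path, chunk=1 << 20):
    """Stream a file in 1 MiB chunks so large FASTQs never load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()
```

Because the name is machine-parseable, any file whose name fails `parse_name` can be quarantined at ingestion before it pollutes downstream provenance records.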
Module 2: High-Throughput Variant Calling Workflows
- Compare GATK HaplotypeCaller, FreeBayes, and DeepVariant for sensitivity in low-coverage regions and structural variant detection.
- Configure joint calling pipelines to minimize batch effects across multi-cohort studies.
- Optimize gVCF merging strategies to balance computational load and cohort scalability in population-level analyses.
- Adjust base quality score recalibration (BQSR) parameters when working with non-standard sequencing chemistries or damaged DNA.
- Implement hard filtering thresholds for SNPs and indels when variant quality score recalibration (VQSR) is infeasible due to small cohort size.
- Validate germline variant calls using orthogonal technologies (e.g., genotyping arrays or Sanger sequencing).
- Manage memory and I/O bottlenecks in variant calling on high-coverage WGS data using containerized pipeline scaling.
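The hard-filtering bullet above can be made concrete. The thresholds below follow GATK's published hard-filtering starting points for germline short variants (e.g. QD < 2.0, FS > 60 for SNPs); treat them as defaults to tune per cohort, not fixed rules.

```python
# GATK's documented hard-filter starting points; tune per cohort.
SNP_RULES = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}
INDEL_RULES = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 200.0,
    "ReadPosRankSum": lambda v: v < -20.0,
}

def hard_filter(variant_type, info):
    """Return the names of the filters a variant fails, given its parsed
    INFO annotations. Missing annotations are skipped rather than failed,
    matching the usual behavior when an annotation cannot be computed."""
    rules = SNP_RULES if variant_type == "SNP" else INDEL_RULES
    return [name for name, fails in rules.items()
            if name in info and fails(info[name])]
```

An empty return list means PASS; a non-empty list can be written directly into the VCF FILTER column as a semicolon-joined string.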
Module 3: Structural Variant and Copy Number Analysis
- Integrate multiple callers (Manta, Delly, CNVnator) to improve detection sensitivity and reduce platform-specific biases.
- Resolve discordant SV calls across callers using breakpoint clustering and local assembly validation.
- Adjust read-depth normalization methods for tumor-normal pairs with variable ploidy and tumor purity.
- Apply GC-content and mappability corrections to CNV segmentation in low-complexity genomic regions.
- Classify complex rearrangements (chromothripsis, breakage-fusion-bridge) using pattern-based heuristics and cytogenetic correlation.
- Validate large deletions or duplications with qPCR or MLPA in clinical reporting contexts.
- Handle low-pass whole-genome sequencing data in population-scale CNV studies with imputation-aware segmentation.
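The breakpoint-clustering step for reconciling discordant SV calls can be sketched as a greedy single-linkage pass. The 500 bp window and the tuple layout are illustrative assumptions; production merging would also require reciprocal-overlap checks and local assembly validation as noted above.

```python
def cluster_sv_calls(calls, window=500):
    """Greedy single-linkage clustering of SV calls that share chromosome
    and SV type and whose start and end breakpoints each lie within
    `window` bp. Each call is a tuple (chrom, start, end, svtype, caller).
    Calls are sorted by position so comparing against a cluster's last
    member is sufficient for this sketch."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c[0], c[3], c[1])):
        chrom, start, end, svtype, _ = call
        for cluster in clusters:
            last = cluster[-1]
            if (last[0] == chrom and last[3] == svtype
                    and abs(last[1] - start) <= window
                    and abs(last[2] - end) <= window):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return clusters
```

Clusters supported by two or more callers are the natural candidates to keep; singletons can be routed to orthogonal validation (qPCR, MLPA) before clinical reporting.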
Module 4: Functional Annotation and Pathogenicity Assessment
- Select annotation sources (VEP, ANNOVAR, SnpEff) based on species support, plugin ecosystem, and regulatory region coverage.
- Customize transcript selection rules (MANE, canonical, tissue-specific) to align with clinical or research use cases.
- Integrate CADD, REVEL, and MetaLR scores into prioritization workflows with calibrated thresholds per variant class.
- Flag variants in non-coding regions with regulatory potential using ENCODE, Roadmap Epigenomics, and promoter capture Hi-C data.
- Resolve conflicting pathogenicity assertions from ClinVar submitters using evidence weighting and submission date filtering.
- Implement HGVS nomenclature compliance for variant reporting to meet ACMG and LOINC standards.
- Cache and version annotation databases locally to ensure reproducibility across analysis batches.
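The evidence-weighting approach to conflicting ClinVar assertions can be sketched as follows. The review-status weights and the 2016 date cutoff are illustrative choices, not ClinVar policy; real pipelines would calibrate both against their own validation set.

```python
from datetime import date

# Illustrative weights keyed by ClinVar review status (higher = more trusted).
REVIEW_WEIGHT = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, single submitter": 1,
    "no assertion criteria provided": 0,
}

def resolve_assertions(assertions, min_date=date(2016, 1, 1)):
    """Pick a consensus classification from conflicting submissions.
    Each assertion is (classification, review_status, submission_date).
    Submissions older than the (illustrative) cutoff are dropped, then
    classifications are scored by summed review-status weight."""
    scores = {}
    for cls, status, sub_date in assertions:
        if sub_date < min_date:
            continue
        scores[cls] = scores.get(cls, 0) + REVIEW_WEIGHT.get(status, 0)
    if not scores:
        return "Unresolved"
    return max(scores, key=scores.get)
```

Ties and "Unresolved" outcomes should fall through to manual curation rather than an automated call.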
Module 5: Population Genetics and Allele Frequency Filtering
- Choose population-matched controls from gnomAD, TOPMed, or HGDP to avoid spurious filtering in underrepresented groups.
- Adjust allele frequency thresholds for recessive vs. dominant inheritance models in rare disease analysis.
- Account for relatedness and cryptic population structure in cohort-level frequency estimation using KING or PC-Relate.
- Implement stratified filtering to preserve variants enriched in specific subpopulations without discarding true positives.
- Quantify batch effects in allele frequencies across sequencing centers using principal component analysis on common variants.
- Use linkage disequilibrium-aware pruning in GWAS preprocessing to reduce multicollinearity in regression models.
- Update internal frequency databases with project-specific data to refine filtering in longitudinal studies.
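Inheritance-aware, stratification-aware frequency filtering can be combined in one small check. The cutoffs below (1e-4 dominant, 5e-3 recessive) and the population keys are illustrative assumptions; real projects calibrate thresholds against disease prevalence and penetrance.

```python
# Illustrative cutoffs; calibrate per gene, prevalence, and penetrance.
AF_CUTOFF = {"dominant": 1e-4, "recessive": 5e-3}

def passes_frequency_filter(pop_afs, inheritance):
    """Keep a variant only if its maximum allele frequency across population
    strata (a popmax-style statistic) stays under the inheritance-model
    cutoff. pop_afs is a dict like {"nfe": 1e-5, "afr": 3e-4}. Using the
    per-population maximum rather than a pooled global AF avoids both
    false rescues from imbalanced cohorts and spurious filtering of
    variants that are rare in every stratum."""
    popmax = max(pop_afs.values(), default=0.0)
    return popmax <= AF_CUTOFF[inheritance]
```

A variant that is common in any single well-sampled population is an unlikely cause of a rare dominant disorder, which is exactly what the popmax comparison encodes.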
Module 6: Clinical Interpretation and ACMG Guidelines Implementation
- Map variant evidence codes (PS1, PM2, PP3, etc.) to automated rules while preserving manual override paths for complex cases.
- Integrate RNA-seq or splicing assay data into PVS1 strength assessment for predicted null variants.
- Configure automated review of de novo variants using trio phasing and Mendelian inconsistency checks.
- Document rationale for downgrading strong evidence (e.g., PM1 in low-specificity domains) in clinical reports.
- Implement audit trails for classification changes across reanalysis cycles in diagnostic pipelines.
- Align classification workflows with CAP/ACMG reporting requirements for somatic and germline variants.
- Manage reclassification policies for variants of uncertain significance (VUS) in longitudinal patient care.
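Mapping evidence codes to automated rules can be sketched with a simplified subset of the ACMG/AMP combining rules (Richards et al. 2015). Only the pathogenic-side combinations are shown; a production engine also needs the benign criteria, strength modifiers, and the manual-override paths called out above.

```python
def classify(codes):
    """Simplified pathogenic-side subset of the ACMG/AMP combining rules.
    `codes` is a list like ["PVS1", "PM2", "PP3"]. Benign criteria and
    strength overrides are deliberately omitted from this sketch."""
    pvs = sum(c.startswith("PVS") for c in codes)
    ps = sum(c.startswith("PS") for c in codes)
    pm = sum(c.startswith("PM") for c in codes)
    pp = sum(c.startswith("PP") for c in codes)

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm == 1 and pp == 1) or pp >= 2))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    if pathogenic:
        return "Pathogenic"

    likely = (
        (pvs == 1 and pm >= 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    if likely:
        return "Likely pathogenic"
    return "Uncertain significance"
```

Every automated call should still carry the contributing codes into the audit trail, so downgrades (e.g. PM1 in a low-specificity domain) remain traceable across reanalysis cycles.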
Module 7: Data Integration and Multi-Omics Correlation
- Align genomic variant coordinates with methylation array probes (e.g., EPIC) to assess cis-regulatory effects.
- Integrate eQTL databases (GTEx, eQTLGen) to prioritize non-coding variants affecting gene expression.
- Perform allele-specific expression analysis using RNA-seq data from matched tumor-normal samples.
- Map structural variants to 3D chromatin interaction domains (Hi-C) to identify disrupted enhancer-promoter loops.
- Correlate mutational signatures from WGS with transcriptomic subtypes in cancer cohorts.
- Resolve tissue-specific effects by filtering multi-omics associations using cell-type deconvolution results.
- Manage batch effects across omics layers using ComBat or surrogate variable analysis (SVA).
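The allele-specific expression bullet reduces to a statistical test on ref/alt read counts at heterozygous sites. A minimal sketch using an exact two-sided binomial test against a 50/50 null (ignoring reference-mapping bias, which real ASE pipelines must correct for):

```python
from math import comb

def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value: sum the probabilities of all
    outcomes at most as likely as the observed count."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    obs = pmf[k]
    return sum(x for x in pmf if x <= obs + 1e-12)

def ase_pvalue(ref_reads, alt_reads):
    """Test a heterozygous site's RNA-seq counts for allelic imbalance.
    Null hypothesis: both alleles are expressed equally (p = 0.5)."""
    n = ref_reads + alt_reads
    return binom_two_sided(alt_reads, n)
```

For genome-wide ASE scans the resulting p-values would be corrected for multiple testing, and sites with very low total depth excluded up front since the test has no power there.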
Module 8: Regulatory Compliance and Data Governance
- Classify genomic data under GDPR, HIPAA, or CCPA based on identifiability and re-identification risk assessments.
- Implement dynamic consent tracking for data reuse in biobank-scale research infrastructures.
- Configure access controls for tiered data (raw reads, variants, phenotypes) using attribute-based access control (ABAC).
- Design audit logs that capture data access, variant classification changes, and pipeline execution provenance.
- Apply data minimization principles when sharing variant data via Beacon or federated networks.
- Establish data retention and destruction policies aligned with IRB protocols and funding requirements.
- Navigate export control regulations (e.g., USML, Wassenaar) when transferring pathogen or dual-use genomic data.
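The tiered ABAC bullet can be sketched as a small policy table. The tier names, roles, and consent levels below are toy illustrations, not any standard vocabulary; a real deployment would externalize the policy (e.g. into a policy engine) rather than hard-code it.

```python
# Toy ABAC policy: each data tier lists the attributes a request must carry.
# Tier names, roles, and consent levels are illustrative only.
POLICY = {
    "raw_reads":  {"roles": {"bioinformatician"},              "min_consent": 2},
    "variants":   {"roles": {"bioinformatician", "clinician"}, "min_consent": 1},
    "phenotypes": {"roles": {"clinician"},                     "min_consent": 2},
}

def authorize(subject, tier):
    """Grant access when the subject's role is permitted for the tier and
    the participant's consent level (0 = none, 1 = restricted, 2 = broad)
    meets the tier's minimum. Unknown tiers are denied by default."""
    rule = POLICY.get(tier)
    if rule is None:
        return False
    return (subject.get("role") in rule["roles"]
            and subject.get("consent_level", 0) >= rule["min_consent"])
```

Deny-by-default for unknown tiers keeps the policy fail-safe, and every decision should also be written to the audit log described above.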
Module 9: Scalable Infrastructure and Pipeline Orchestration
- Select workflow languages (WDL, Nextflow, Snakemake) based on team expertise, cloud portability, and debugging tooling.
- Optimize autoscaling policies for bursty variant calling workloads on Kubernetes or AWS Batch.
- Implement checkpointing and resume functionality for long-running annotation jobs on interrupted nodes.
- Version control pipeline configurations using Git with semantic tagging and dependency pinning.
- Containerize tools with Singularity or Docker to ensure reproducibility across HPC and cloud environments.
- Monitor pipeline performance using metrics (CPU, memory, runtime) to identify bottlenecks in joint calling stages.
- Design disaster recovery strategies for pipeline metadata and intermediate files in distributed storage systems.
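The checkpoint-and-resume bullet can be illustrated with a marker-file sketch. This is a minimal pattern, not how Nextflow or Cromwell implement caching; those engines also hash step inputs so stale markers are invalidated when upstream data changes.

```python
import json
import time
from pathlib import Path

def run_step(name, func, workdir):
    """Run `func` at most once per work directory: if a done-marker already
    exists, return the recorded result instead of recomputing, so an
    interrupted pipeline resumes where it left off."""
    marker = Path(workdir) / f".{name}.done"
    if marker.exists():
        return json.loads(marker.read_text())["result"]
    result = func()  # the actual work: annotation, calling, etc.
    marker.write_text(json.dumps({"result": result, "finished_at": time.time()}))
    return result
```

Writing the marker only after `func` returns means a node killed mid-step leaves no marker, so the step reruns cleanly on resume; for full safety the marker write itself should be atomic (write to a temp file, then rename).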