
DNA Sequencing in Bioinformatics - From Data to Discovery

$299.00
How you learn:
Self-paced • Lifetime updates
Who trusts this:
Trusted by professionals in 160+ countries
When you get access:
Course access is prepared after purchase and delivered via email
Your guarantee:
30-day money-back guarantee — no questions asked
Toolkit Included:
Includes a practical, ready-to-use toolkit of implementation templates, worksheets, checklists, and decision-support materials designed to accelerate real-world application and reduce setup time.

This curriculum spans the technical and operational complexity of a multi-year bioinformatics initiative, comparable to establishing an internal sequencing analysis program within a research hospital or biotech startup. Across diverse use cases, it addresses how platform selection, data integrity, regulatory compliance, and cross-omics integration must be systematically operationalized.

Module 1: Foundations of DNA Sequencing Technologies

  • Selecting between short-read (Illumina) and long-read (PacBio, Oxford Nanopore) platforms based on project goals such as genome completeness versus cost-efficiency.
  • Evaluating error profiles of sequencing platforms when designing experiments for variant detection in low-frequency alleles.
  • Integrating multiplexing strategies with sample-specific barcodes to maximize throughput while minimizing cross-contamination risks.
  • Assessing DNA input requirements and library preparation kits for degraded or low-yield samples, such as FFPE tissue.
  • Setting sequencing depth by application: ~30x for human whole-genome sequencing, >100x for tumor-normal pairs, or variable depth in metagenomics (see the depth sketch after this list).
  • Documenting instrument run parameters and metadata for auditability in regulated research environments.
  • Managing data transfer from sequencers to secure storage, including handling real-time streaming data from nanopore devices.
  • Establishing protocols for instrument calibration and quality control between sequencing runs.
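
The depth bullet above reduces to a Lander-Waterman back-of-the-envelope calculation. Below is a minimal Python sketch; the genome size, read length, and target depths are illustrative assumptions, not course-mandated values.

```python
# Depth planning via the Lander-Waterman relation:
#   mean depth = (number of reads * read length) / genome size
def reads_needed(target_depth: float, genome_size_bp: int,
                 read_length_bp: int, paired_end: bool = True) -> int:
    """Reads (pairs if paired_end) needed for a target mean depth,
    ignoring duplicates, trimming losses, and unmappable regions."""
    bases_needed = target_depth * genome_size_bp
    bases_per_unit = read_length_bp * (2 if paired_end else 1)
    return int(bases_needed / bases_per_unit) + 1

HUMAN_GENOME_BP = 3_100_000_000  # approximate size, illustrative
print(f"30x WGS:    ~{reads_needed(30, HUMAN_GENOME_BP, 150):,} read pairs")
print(f"100x tumor: ~{reads_needed(100, HUMAN_GENOME_BP, 150):,} read pairs")
```

In practice you would pad these numbers for duplication rate and mapping losses before committing sequencing lanes.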

Module 2: Raw Data Quality Control and Preprocessing

  • Implementing FastQC and MultiQC workflows to detect sequencing artifacts such as adapter contamination or overrepresented sequences.
  • Choosing trimming tools (Trimmomatic, Cutadapt) and parameters based on library type and downstream analysis requirements.
  • Deciding whether to remove PCR duplicates at the read level for amplicon-based versus whole-genome sequencing.
  • Applying quality score recalibration in high-precision applications like clinical variant calling.
  • Validating base quality scores using known control samples to detect systematic biases.
  • Automating preprocessing pipelines with containerization (Docker/Singularity) for reproducibility across compute environments.
  • Setting pass/fail thresholds for sample inclusion based on metrics such as Q30 fraction and read length distribution (a Q30 sketch follows this list).
  • Managing metadata alignment between raw FASTQ files and experimental records in LIMS systems.
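
As a concrete take on the Q30 threshold bullet above, here is a minimal sketch that computes the fraction of bases at or above Q30 from a FASTQ file and applies a pass/fail cutoff. The file name and the 80% cutoff are illustrative assumptions; production QC would typically come from FastQC/MultiQC reports.

```python
import gzip

def q30_fraction(fastq_path: str) -> float:
    """Fraction of bases with Phred+33 quality >= 30 in a FASTQ file."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    total = q30 = 0
    with opener(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 3:  # the quality line of each 4-line record
                quals = [ord(c) - 33 for c in line.strip()]
                total += len(quals)
                q30 += sum(q >= 30 for q in quals)
    return q30 / total if total else 0.0

frac = q30_fraction("sample_R1.fastq.gz")  # hypothetical input file
print("PASS" if frac >= 0.80 else "FAIL", f"(Q30 = {frac:.2%})")
```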

Module 4: Genome Assembly and Structural Analysis

  • Selecting de novo assemblers (SPAdes, Canu, Flye) based on sequencing technology and genome complexity.
  • Optimizing k-mer size selection in short-read assembly to balance contiguity and misassembly rates.
  • Designing hybrid assembly strategies that combine short and long reads to improve scaffold N50 while minimizing computational cost.
  • Validating assembly completeness with BUSCO (lineage-specific gene sets) or Merqury (reference-free k-mer comparison against the read set).
  • Resolving repetitive regions using long-read data and manual curation in medically relevant loci.
  • Generating assembly metrics (N50, L50, contiguity) for reporting and comparison across projects (see the N50/L50 sketch after this list).
  • Handling polyploid or highly heterozygous genomes by adjusting assembler parameters or using specialized tools.
  • Archiving assembly versions with provenance tracking for reproducibility in longitudinal studies.
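
The N50/L50 metrics in the reporting bullet above reduce to a short computation over contig lengths. A minimal sketch, with illustrative contig sizes:

```python
def n50_l50(contig_lengths):
    """N50: length of the contig at which the descending cumulative sum
    first reaches half the assembly size; L50: how many contigs that takes."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0

contigs = [1_200_000, 800_000, 500_000, 300_000, 100_000, 50_000]  # illustrative
n50, l50 = n50_l50(contigs)
print(f"N50 = {n50:,} bp, L50 = {l50} contigs")
```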

Module 5: Variant Calling and Annotation

  • Choosing between haplotype-aware (GATK HaplotypeCaller) and pileup-based (BCFtools) variant callers based on ploidy and data type.
  • Implementing joint calling workflows for cohort studies to ensure consistent variant representation across samples.
  • Filtering variants using depth, quality scores, and strand bias metrics tailored to the sequencing protocol (a hard-filtering sketch follows this list).
  • Integrating population frequency databases (gnomAD, 1000 Genomes) to prioritize rare variants in clinical interpretation.
  • Using VEP or SnpEff to annotate functional consequences while managing local versus remote database access.
  • Handling structural variants (SVs) with specialized callers (Manta, Delly) and validating breakpoints using split-read evidence.
  • Addressing allelic imbalance in RNA-seq derived variants due to expression bias.
  • Documenting filtering rationale and thresholds for audit in diagnostic or regulatory submissions.
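
The hard-filtering bullet above can be illustrated with a minimal sketch that screens plain-text VCF records on depth (DP), site quality (QUAL), and Fisher strand bias (FS). The thresholds shown are illustrative defaults, not universally appropriate values for every assay.

```python
def passes_filters(vcf_line: str, min_dp: int = 10,
                   min_qual: float = 30.0, max_fs: float = 60.0) -> bool:
    """Hard-filter a single VCF data line on DP, QUAL, and FS."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = float(fields[5])  # VCF column 6 is QUAL
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    return (int(info.get("DP", 0)) >= min_dp
            and qual >= min_qual
            and float(info.get("FS", 0.0)) <= max_fs)

record = "chr1\t10177\t.\tA\tAC\t45.2\t.\tDP=23;FS=2.1;MQ=60\tGT\t0/1"
print("PASS" if passes_filters(record) else "FILTERED")
```

Recording these thresholds alongside the filtered VCF is exactly the documentation the audit bullet above calls for.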

Module 6: Functional Genomics and Regulatory Element Analysis

  • Integrating ChIP-seq or ATAC-seq data with variant calls to assess regulatory impact of non-coding SNPs.
  • Defining peak calling parameters (FDR thresholds, control inputs) in epigenomic assays to minimize false positives.
  • Mapping open chromatin regions to gene promoters using annotation resources such as ENCODE or Roadmap Epigenomics (see the overlap sketch after this list).
  • Linking eQTL data to GWAS hits for functional prioritization in complex trait studies.
  • Using motif analysis (HOMER, MEME) to evaluate transcription factor binding disruption by variants.
  • Normalizing signal across samples in functional assays using input controls or spike-ins.
  • Managing batch effects in multi-experiment regulatory datasets through ComBat or similar methods.
  • Storing and querying large functional genomics datasets using formats and array databases such as BigWig files or TileDB.
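
The peak-to-promoter mapping bullet above is, at its core, an interval-overlap test. A minimal sketch follows; the coordinates, gene names, and 2 kb promoter window are illustrative assumptions, and real pipelines would use bedtools or pyranges rather than hand-rolled loops.

```python
PROMOTER_WINDOW = 2_000  # bp upstream of the TSS (assumed convention)

genes = [("GENE_A", "chr1", 1_050_000, "+"),   # hypothetical annotations
         ("GENE_B", "chr1", 1_200_000, "-")]
peaks = [("chr1", 1_048_500, 1_049_200),       # hypothetical peak calls
         ("chr1", 1_300_000, 1_300_400)]

def promoter_interval(chrom, tss, strand):
    """Upstream window: lower coordinates on '+', higher on '-'."""
    if strand == "+":
        return chrom, tss - PROMOTER_WINDOW, tss
    return chrom, tss, tss + PROMOTER_WINDOW

for gene, chrom, tss, strand in genes:
    pchrom, pstart, pend = promoter_interval(chrom, tss, strand)
    for qchrom, qstart, qend in peaks:
        if qchrom == pchrom and qstart < pend and qend > pstart:
            print(f"peak {qchrom}:{qstart}-{qend} overlaps {gene} promoter")
```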

Module 7: Metagenomics and Microbiome Analysis

  • Selecting between marker-gene (16S rRNA) and shotgun metagenomic approaches based on resolution and cost constraints.
  • Removing host DNA reads from microbiome samples using reference-based subtraction (Bowtie2, Kraken2).
  • Choosing taxonomic classifiers (Kraken2, MetaPhlAn) and analysis platforms such as QIIME 2 based on database comprehensiveness and runtime.
  • Normalizing abundance data using rarefaction or cumulative sum scaling (CSS) to enable cross-sample comparisons.
  • Assessing alpha and beta diversity with appropriate statistical tests and correcting for confounding variables (a Shannon-index sketch follows this list).
  • Reconstructing metagenome-assembled genomes (MAGs) using binning tools (MetaBAT2, MaxBin) and evaluating completeness.
  • Handling contamination and strain heterogeneity in low-biomass microbiome samples.
  • Managing privacy risks when sharing microbiome data due to potential host DNA leakage.
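
As a concrete instance of the alpha-diversity bullet above, here is a minimal sketch of the Shannon index, H' = -Σ p_i ln p_i, computed from taxon count vectors. The counts are illustrative.

```python
import math

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over nonzero taxa."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

even_sample = [120, 80, 40, 10, 5]   # hypothetical taxon counts
skewed_sample = [250, 3, 1, 1]
print(f"H'(even)   = {shannon_index(even_sample):.3f}")   # more even, higher H'
print(f"H'(skewed) = {shannon_index(skewed_sample):.3f}")  # dominated, lower H'
```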

Module 8: Data Integration and Multi-Omics Workflows

  • Relating genomic variants to transcriptomic data to identify genotype-expression associations (eQTL mapping).
  • Using WGS and RNA-seq jointly to detect fusion genes and splicing aberrations in cancer genomics.
  • Integrating methylation (WGBS) and gene expression data to infer epigenetic regulation mechanisms.
  • Applying dimensionality reduction (PCA, UMAP) to visualize concordance across omics layers (see the PCA sketch after this list).
  • Building predictive models (LASSO, random forests) that combine variant, expression, and clinical data.
  • Resolving data resolution mismatches, such as linking single-nucleotide variants to pathway-level proteomics.
  • Using MOFA or iCluster for unsupervised integration of heterogeneous omics datasets.
  • Implementing version-controlled pipelines to ensure reproducibility when updating reference databases.
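
The PCA bullet above can be sketched with numpy alone: center and scale a concatenated multi-omics matrix, then take its SVD. The random blocks below stand in for real genotype and expression features and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20
genotype_block = rng.normal(size=(n_samples, 50))     # hypothetical features
expression_block = rng.normal(size=(n_samples, 200))  # hypothetical features

X = np.hstack([genotype_block, expression_block])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score each column

U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores = U[:, :2] * S[:2]              # sample coordinates on PC1/PC2
explained = (S**2 / (S**2).sum())[:2]  # variance fraction per component
print("variance explained by PC1/PC2:", np.round(explained, 3))
print("sample 0 coordinates:", np.round(scores[0], 3))
```

In a real integration, each omics layer would typically be scaled separately so the larger feature block does not dominate the principal axes.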

Module 9: Data Governance, Security, and Compliance

  • Classifying genomic data under GDPR, HIPAA, or CLIA based on identifiability and use context.
  • Implementing role-based access controls (RBAC) in shared analysis environments to restrict access to sensitive data (an RBAC sketch follows this list).
  • Encrypting genomic data at rest and in transit, especially when using cloud-based compute resources.
  • Establishing data retention and deletion policies aligned with IRB-approved protocols.
  • Auditing data access and pipeline execution logs for compliance in clinical reporting workflows.
  • Managing informed consent metadata to restrict data usage to approved research domains.
  • De-identifying genomic datasets, e.g., by stripping direct identifiers and limiting exposure of identifying variants, while preserving analytical utility.
  • Documenting data provenance using W3C PROV or similar standards for regulatory submissions.
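
To make the RBAC bullet above concrete, here is a minimal sketch of a role-to-permission policy check. The roles, actions, and policy table are illustrative assumptions, not a specific platform's API.

```python
POLICY = {
    "clinician":    {"read_clinical_vcf", "read_reports"},
    "analyst":      {"read_deidentified_vcf", "run_pipelines"},
    "data_steward": {"read_clinical_vcf", "read_deidentified_vcf",
                     "manage_consent_metadata"},
}

def is_allowed(role: str, action: str) -> bool:
    """Permit an action only if the role's policy explicitly grants it."""
    return action in POLICY.get(role, set())

# An analyst may run pipelines but never reads identifiable VCFs
assert is_allowed("analyst", "run_pipelines")
assert not is_allowed("analyst", "read_clinical_vcf")
print("RBAC policy checks passed")
```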