This curriculum spans the full workflow of a population genetics research programme, comparable in scope to a multi-phase bioinformatics project involving cohort design, sequencing analysis, association testing, and ethical data governance.
Module 1: Study Design and Cohort Selection in Genomic Research
- Determine inclusion and exclusion criteria for population cohorts based on ancestry, geographic origin, and phenotypic homogeneity to minimize confounding in association studies.
- Select appropriate sampling strategies (e.g., random, stratified, case-control) based on research objectives and expected allele frequency distributions.
- Address batch effects by planning sample processing order and integrating technical replicates across sequencing runs.
- Balance representation across subpopulations to avoid bias in downstream analyses while maintaining statistical power.
- Obtain informed consent that explicitly covers genomic data sharing, reanalysis, and potential identification risks.
- Design longitudinal sampling protocols when studying allele frequency changes over time or in response to selection pressures.
- Integrate metadata collection standards (e.g., MIxS, PhenX) to ensure interoperability with public databases.
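The stratified sampling strategy above can be sketched as proportional allocation of a fixed sample budget across subpopulations. This is a minimal illustration; the stratum names and counts are hypothetical.

```python
def proportional_allocation(stratum_sizes, total_n):
    """Allocate a total sample size across strata proportionally to
    stratum size, handing rounding remainders to the strata with the
    largest fractional entitlements."""
    pop_total = sum(stratum_sizes.values())
    raw = {s: total_n * size / pop_total for s, size in stratum_sizes.items()}
    alloc = {s: int(x) for s, x in raw.items()}
    remainder = total_n - sum(alloc.values())
    # hand out leftover slots by largest fractional part
    for s in sorted(raw, key=lambda s: raw[s] - alloc[s], reverse=True)[:remainder]:
        alloc[s] += 1
    return alloc

# hypothetical cohort of three subpopulations
sizes = {"popA": 5000, "popB": 3000, "popC": 2000}
print(proportional_allocation(sizes, 100))  # {'popA': 50, 'popB': 30, 'popC': 20}
```

In practice the allocation may also be weighted by expected case yield or variance per stratum rather than raw population size.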
Module 2: High-Throughput Sequencing Data Acquisition and Quality Control
- Choose sequencing platforms (Illumina, PacBio, Oxford Nanopore) based on required read length, accuracy, and variant detection goals.
- Implement FASTQ-level quality assessment using tools like FastQC and MultiQC to detect adapter contamination and base quality decay.
- Apply trimming and filtering protocols using Trimmomatic or Cutadapt to remove low-quality bases and sequencing adapters.
- Monitor sequencing depth per sample to ensure sufficient coverage for rare variant detection in population samples.
- Flag samples with abnormal GC content or duplication rates for reprocessing or exclusion.
- Validate library preparation consistency across batches using principal component analysis on k-mer frequencies.
- Document sequencing parameters and instrument runs for auditability and reproducibility.
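Production QC runs through FastQC and Trimmomatic/Cutadapt as listed above, but the core filtering idea can be shown in a few lines: parse FASTQ records and drop reads whose mean Phred quality falls below a threshold. The Q20 cutoff here is a hypothetical choice, and the parser assumes well-formed four-line records.

```python
import io

def read_fastq(handle):
    """Yield (name, seq, qual) tuples from a four-line-per-record FASTQ handle."""
    while True:
        name = handle.readline().rstrip()
        if not name:
            break
        seq = handle.readline().rstrip()
        handle.readline()                      # '+' separator line
        qual = handle.readline().rstrip()
        yield name[1:], seq, qual

def mean_phred(qual, offset=33):
    """Mean Phred score of a Sanger-encoded (Phred+33) quality string."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def quality_filter(records, min_mean_q=20):
    """Keep reads whose mean base quality meets the threshold."""
    return [r for r in records if mean_phred(r[2]) >= min_mean_q]

# tiny in-memory example: 'I' encodes Q40, '!' encodes Q0
fq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
kept = quality_filter(list(read_fastq(fq)))
print([name for name, _, _ in kept])           # ['r1']
```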
Module 3: Read Alignment and Variant Calling Pipelines
- Select reference genomes (e.g., GRCh38, T2T-CHM13) based on population ancestry and structural variant content.
- Align sequencing reads using BWA-MEM or minimap2, adjusting parameters for read length and error profiles.
- Index reference genomes and alignment files using samtools and HTSlib for efficient data access.
- Perform local realignment around indels (GATK 3's IndelRealigner) or apply BAQ adjustment in samtools/bcftools; note that in GATK 4 this step is superseded by HaplotypeCaller's local haplotype reassembly.
- Call SNPs and indels using joint calling workflows in GATK or FreeBayes to leverage population-level information.
- Apply hard filtering or VQSR (Variant Quality Score Recalibration) based on training resource availability and cohort size.
- Validate variant calls using known control samples (e.g., NA12878) and concordance metrics against gold-standard sets.
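The concordance check in the final step can be sketched as a comparison of two call sets keyed by genomic site. Real validation would use tools such as hap.py against GIAB truth sets; the mini call sets below are hypothetical.

```python
def genotype_concordance(truth, calls):
    """Compare two {(chrom, pos): genotype} dicts at their shared sites.

    Returns (n_shared, n_concordant, concordance_rate). Genotypes are
    treated as unordered allele pairs, so '0/1' matches '1|0'.
    """
    def norm(gt):
        return tuple(sorted(gt.replace("|", "/").split("/")))
    shared = truth.keys() & calls.keys()
    concordant = sum(1 for site in shared if norm(truth[site]) == norm(calls[site]))
    n = len(shared)
    return n, concordant, concordant / n if n else float("nan")

# hypothetical truth set (e.g. a NA12878-style control) vs. pipeline calls
truth = {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr1", 300): "0/0"}
calls = {("chr1", 100): "1|0", ("chr1", 200): "0/1", ("chr1", 400): "0/1"}
print(genotype_concordance(truth, calls))  # (2, 1, 0.5)
```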
Module 4: Population Structure and Ancestry Inference
- Generate genotype matrices in PLINK or VCF format for use in population structure analyses.
- Perform PCA using EIGENSOFT to identify major axes of genetic variation and detect outliers.
- Estimate individual ancestry proportions using ADMIXTURE or STRUCTURE with cross-validation to select K.
- Compare inferred clusters against known population labels to assess data integrity.
- Correct for population stratification in association studies using principal components as covariates.
- Identify cryptic relatedness using KING or PLINK to exclude or adjust for familial relationships.
- Interpret ADMIXTURE results in light of historical migration and admixture events relevant to the cohort.
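The PCA step above can be illustrated directly in numpy. This is a minimal sketch, not a replacement for EIGENSOFT/smartpca, but it uses the same per-variant normalisation (centre by 2p, scale by sqrt(p(1-p))) before the decomposition.

```python
import numpy as np

def genotype_pca(G, n_components=2):
    """PCA on an (individuals x variants) genotype dosage matrix (0/1/2).

    Columns are centred by twice the allele frequency and scaled by
    sqrt(p(1-p)), then decomposed by SVD. Returns the projections of
    individuals onto the top components.
    """
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0                   # per-variant allele frequency
    denom = np.sqrt(p * (1 - p))
    denom[denom == 0] = 1.0                    # guard monomorphic variants
    Z = (G - 2 * p) / denom
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_components] * S[:n_components]
```

On real data, outlier individuals show up as extreme coordinates on the leading components and are typically removed before the PCs are reused as association covariates.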
Module 5: Allele Frequency Estimation and Hardy-Weinberg Equilibrium Testing
- Calculate allele frequencies per population subgroup to identify variants with large inter-population differences.
- Apply stratified frequency estimation when subpopulations are known to avoid spurious signals.
- Test for Hardy-Weinberg equilibrium using PLINK or bcftools, filtering variants with significant deviations.
- Adjust HWE p-value thresholds based on multiple testing burden and minor allele frequency bins.
- Investigate HWE violations for potential genotyping errors, selection pressure, or inbreeding effects.
- Report frequency estimates with confidence intervals to reflect sampling uncertainty in smaller cohorts.
- Compare observed frequencies against gnomAD or 1000 Genomes to contextualize findings.
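The HWE test that PLINK's --hardy/--hwe flags perform can be sketched from genotype counts. PLINK defaults to an exact test; the classical chi-square version below illustrates the underlying computation.

```python
def hwe_chi2(n_AA, n_Aa, n_aa):
    """Chi-square Hardy-Weinberg test statistic from genotype counts (1 d.f.).

    Returns (allele_frequency_A, chi2). Compare chi2 against a chosen
    critical value (3.84 for alpha = 0.05) or convert it to a p-value.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)            # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, chi2

# hypothetical counts exactly at equilibrium: p = 0.5, expected 25/50/25
print(hwe_chi2(25, 50, 25))   # (0.5, 0.0)
```

A complete heterozygote deficit (e.g. 50/0/50) yields a very large statistic, the classic signature of genotyping error, Wahlund effect, or inbreeding noted above.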
Module 6: Detection of Natural Selection and Evolutionary Signatures
- Compute FST values between populations using Weir & Cockerham’s estimator to identify loci under differential selection.
- Scan for extended haplotype homozygosity using iHS or EHH to detect recent positive selection.
- Apply Tajima’s D tests per genomic window to infer deviations from neutral evolution.
- Integrate environmental or phenotypic data to interpret selection signals in functional context.
- Correct for demographic history using coalescent simulations to distinguish selection from drift.
- Validate selection candidates with replication in independent cohorts or functional assays.
- Use composite likelihood ratio tests (e.g., SweepFinder2) to improve power in detecting selective sweeps.
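The module names Weir & Cockerham's FST estimator; as a compact illustration, the related Hudson estimator (in the Bhatia et al. 2013 parameterisation) is easier to write down per locus, and the two broadly agree for balanced sample sizes.

```python
def hudson_fst(p1, n1, p2, n2):
    """Per-locus Hudson F_ST estimate (Bhatia et al. 2013 form).

    p1, p2: alternate-allele frequencies in each population;
    n1, n2: sampled chromosome counts (2 x diploid sample size).
    The sampling-correction terms keep the estimator near zero for
    undifferentiated loci.
    """
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num / den if den > 0 else float("nan")
```

Genome-wide, per-locus values are usually averaged as a ratio of summed numerators to summed denominators rather than a mean of ratios, which reduces bias at rare variants.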
Module 7: Genome-Wide Association Studies (GWAS) and Burden Testing
- Perform logistic or linear regression in PLINK or REGENIE, adjusting for covariates including principal components.
- Apply genomic control or use the LD score regression intercept to diagnose and adjust for residual population structure, distinguishing confounding from true polygenicity.
- Define gene-based units for burden tests using canonical transcripts and regulatory regions.
- Aggregate rare variants using SKAT or burden tests with MAF thresholds tailored to study power.
- Account for relatedness using mixed models (e.g., BOLT-LMM, SAIGE) in structured cohorts.
- Control for multiple testing using Bonferroni, FDR, or gene-based correction strategies.
- Validate top associations in replication cohorts with similar ancestry and phenotyping protocols.
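Real GWAS run through PLINK, REGENIE, or the mixed-model tools listed above, but the core single-variant test is plain regression. The numpy sketch below fits a quantitative trait on genotype dosage plus covariates (e.g. top principal components) and returns the genotype effect and its t-statistic; the simulated data in the test are hypothetical.

```python
import numpy as np

def linear_assoc(genotypes, phenotype, covariates):
    """Single-variant linear regression t-statistic for the genotype term.

    genotypes: (n,) dosage vector; phenotype: (n,) quantitative trait;
    covariates: (n, k) matrix (e.g. top principal components).
    An intercept column is added internally.
    """
    n = len(phenotype)
    X = np.column_stack([np.ones(n), genotypes, covariates])
    beta, _, _, _ = np.linalg.lstsq(X, phenotype, rcond=None)
    resid = phenotype - X @ beta
    dof = n - X.shape[1]
    sigma2 = resid @ resid / dof
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)   # Var(beta_hat)
    t_geno = beta[1] / np.sqrt(cov_beta[1, 1])
    return beta[1], t_geno
```

Scanning this over millions of variants is then a matter of vectorisation and the multiple-testing control described above; case-control phenotypes swap in logistic regression.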
Module 8: Data Integration, Annotation, and Functional Interpretation
- Annotate variants using Ensembl VEP or ANNOVAR with custom plugins for regulatory and non-coding elements.
- Integrate eQTL and chromatin interaction data (e.g., GTEx, Hi-C) to prioritize causal genes.
- Map significant loci to pathways using tools like g:Profiler or Enrichr with ancestry-matched background sets.
- Use CADD or Eigen scores to rank non-coding variants by predicted functional impact.
- Link GWAS hits to drug targets using Open Targets or DisGeNET for translational insights.
- Visualize genomic regions with LocusZoom or IGV to inspect linkage disequilibrium and annotation context.
- Maintain version-controlled annotation pipelines to ensure reproducible results across analyses.
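Annotation itself is the job of VEP or ANNOVAR, but the positional join underneath, assigning a hit to an overlapping gene model, can be sketched with a binary search over sorted intervals. Gene names and coordinates here are hypothetical, and the sketch assumes sorted, non-overlapping intervals on a single chromosome.

```python
import bisect

def map_hits_to_genes(hits, genes):
    """Assign hit positions to overlapping gene intervals on one chromosome.

    hits: iterable of positions; genes: list of (start, end, name) tuples,
    sorted by start and non-overlapping for this sketch.
    """
    starts = [g[0] for g in genes]
    assigned = {}
    for pos in hits:
        i = bisect.bisect_right(starts, pos) - 1
        if i >= 0 and genes[i][0] <= pos <= genes[i][1]:
            assigned[pos] = genes[i][2]
        else:
            assigned[pos] = None     # intergenic hit
    return assigned

# hypothetical gene models and GWAS hit positions
genes = [(100, 500, "GENE1"), (800, 1200, "GENE2")]
print(map_hits_to_genes([250, 600, 900], genes))
# {250: 'GENE1', 600: None, 900: 'GENE2'}
```

Intergenic hits (None) are the cases where the eQTL and chromatin-interaction evidence above becomes essential for assigning a candidate gene.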
Module 9: Data Sharing, Privacy, and Ethical Governance
- De-identify genomic datasets by removing direct identifiers and limiting metadata granularity.
- Apply data use limitations encoded with Data Use Ontology (DUO) terms in accordance with consent agreements and institutional review board requirements.
- Submit summary statistics to GWAS Catalog with standardized phenotypes and ancestry descriptors.
- Use controlled-access repositories (e.g., dbGaP, EGA) for individual-level data sharing.
- Implement data access committees (DACs) with defined review procedures and conflict-of-interest policies.
- Monitor for re-identification risks, e.g., with k-anonymity checks on the quasi-identifying metadata released alongside genotypes.
- Develop data transfer agreements that specify security standards and permitted use cases.
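A k-anonymity check on released metadata can be sketched by counting how many records share each combination of quasi-identifiers and flagging combinations rarer than k. The field names, records, and k = 2 threshold below are hypothetical.

```python
from collections import Counter

def k_anonymity_violations(records, quasi_identifiers, k=5):
    """Flag quasi-identifier combinations shared by fewer than k records.

    records: list of dicts of metadata fields; quasi_identifiers: the
    field names treated as potentially re-identifying in combination.
    Returns {combination: count} for every under-represented group.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return {combo: n for combo, n in counts.items() if n < k}

# hypothetical metadata released alongside genotype data
records = [
    {"birth_year": 1980, "sex": "F", "region": "North"},
    {"birth_year": 1980, "sex": "F", "region": "North"},
    {"birth_year": 1975, "sex": "M", "region": "South"},
]
print(k_anonymity_violations(records, ["birth_year", "sex", "region"], k=2))
# {(1975, 'M', 'South'): 1}
```

Flagged combinations are typically coarsened (e.g. birth year binned to decade, region widened) until every group reaches size k before release.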