This curriculum spans the full workflow of a population genetics research programme, comparable in scope to a multi-phase bioinformatics project involving cohort design, sequencing analysis, association testing, and ethical data governance.
Module 1: Study Design and Cohort Selection in Genomic Research
- Determine inclusion and exclusion criteria for population cohorts based on ancestry, geographic origin, and phenotypic homogeneity to minimize confounding in association studies.
- Select appropriate sampling strategies (e.g., random, stratified, case-control) based on research objectives and expected allele frequency distributions.
- Address batch effects by planning sample processing order and integrating technical replicates across sequencing runs.
- Balance representation across subpopulations to avoid bias in downstream analyses while maintaining statistical power.
- Obtain informed consent that explicitly covers genomic data sharing, reanalysis, and potential identification risks.
- Design longitudinal sampling protocols when studying allele frequency changes over time or in response to selection pressures.
- Integrate metadata collection standards (e.g., MIxS, PhenX) to ensure interoperability with public databases.
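The stratified sampling strategy above can be sketched as proportional allocation of a fixed sample budget across subpopulations. This is a minimal illustration; the stratum names and counts are hypothetical.

```python
def proportional_allocation(stratum_sizes, total_n):
    """Allocate a total sample size across strata proportionally to
    stratum size, handing rounding remainders to the strata with the
    largest fractional entitlements."""
    pop_total = sum(stratum_sizes.values())
    raw = {s: total_n * size / pop_total for s, size in stratum_sizes.items()}
    alloc = {s: int(x) for s, x in raw.items()}
    remainder = total_n - sum(alloc.values())
    # hand out leftover slots by largest fractional part
    for s in sorted(raw, key=lambda s: raw[s] - alloc[s], reverse=True)[:remainder]:
        alloc[s] += 1
    return alloc

# hypothetical cohort of three subpopulations
sizes = {"popA": 5000, "popB": 3000, "popC": 2000}
print(proportional_allocation(sizes, 100))  # {'popA': 50, 'popB': 30, 'popC': 20}
```

In practice the allocation may also be weighted by expected case yield or variance per stratum rather than raw population size.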
Module 2: High-Throughput Sequencing Data Acquisition and Quality Control
- Choose sequencing platforms (Illumina, PacBio, Oxford Nanopore) based on required read length, accuracy, and variant detection goals.
- Implement FASTQ-level quality assessment using tools like FastQC and MultiQC to detect adapter contamination and base quality decay.
- Apply trimming and filtering protocols using Trimmomatic or Cutadapt to remove low-quality bases and sequencing adapters.
- Monitor sequencing depth per sample to ensure sufficient coverage for rare variant detection in population samples.
- Flag samples with abnormal GC content or duplication rates for reprocessing or exclusion.
- Validate library preparation consistency across batches using principal component analysis on k-mer frequencies.
- Document sequencing parameters and instrument runs for auditability and reproducibility.
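Production QC runs through FastQC and Trimmomatic/Cutadapt as listed above, but the core filtering idea can be shown in a few lines: parse FASTQ records and drop reads whose mean Phred quality falls below a threshold. The Q20 cutoff here is a hypothetical choice, and the parser assumes well-formed four-line records.

```python
import io

def read_fastq(handle):
    """Yield (name, seq, qual) tuples from a four-line-per-record FASTQ handle."""
    while True:
        name = handle.readline().rstrip()
        if not name:
            break
        seq = handle.readline().rstrip()
        handle.readline()                      # '+' separator line
        qual = handle.readline().rstrip()
        yield name[1:], seq, qual

def mean_phred(qual, offset=33):
    """Mean Phred score of a Sanger-encoded (Phred+33) quality string."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def quality_filter(records, min_mean_q=20):
    """Keep reads whose mean base quality meets the threshold."""
    return [r for r in records if mean_phred(r[2]) >= min_mean_q]

# tiny in-memory example: 'I' encodes Q40, '!' encodes Q0
fq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
kept = quality_filter(list(read_fastq(fq)))
print([name for name, _, _ in kept])           # ['r1']
```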
Module 3: Read Alignment and Variant Calling Pipelines
- Select reference genomes (e.g., GRCh38, T2T-CHM13) based on population ancestry and structural variant content.
- Align sequencing reads using BWA-MEM or minimap2, adjusting parameters for read length and error profiles.
- Index reference genomes and alignment files using samtools and HTSlib for efficient data access.
- Perform local realignment around indels (GATK 3's IndelRealigner) or apply BAQ adjustment in samtools/bcftools; note that in GATK 4 this step is superseded by HaplotypeCaller's local haplotype reassembly.
- Call SNPs and indels using joint calling workflows in GATK or FreeBayes to leverage population-level information.
- Apply hard filtering or VQSR (Variant Quality Score Recalibration) based on training resource availability and cohort size.
- Validate variant calls using known control samples (e.g., NA12878) and concordance metrics against gold-standard sets.
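The concordance check in the final step can be sketched as a comparison of two call sets keyed by genomic site. Real validation would use tools such as hap.py against GIAB truth sets; the mini call sets below are hypothetical.

```python
def genotype_concordance(truth, calls):
    """Compare two {(chrom, pos): genotype} dicts at their shared sites.

    Returns (n_shared, n_concordant, concordance_rate). Genotypes are
    treated as unordered allele pairs, so '0/1' matches '1|0'.
    """
    def norm(gt):
        return tuple(sorted(gt.replace("|", "/").split("/")))
    shared = truth.keys() & calls.keys()
    concordant = sum(1 for site in shared if norm(truth[site]) == norm(calls[site]))
    n = len(shared)
    return n, concordant, concordant / n if n else float("nan")

# hypothetical truth set (e.g. a NA12878-style control) vs. pipeline calls
truth = {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr1", 300): "0/0"}
calls = {("chr1", 100): "1|0", ("chr1", 200): "0/1", ("chr1", 400): "0/1"}
print(genotype_concordance(truth, calls))  # (2, 1, 0.5)
```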
Module 4: Population Structure and Ancestry Inference
- Generate genotype matrices in PLINK or VCF format for use in population structure analyses.
- Perform PCA using EIGENSOFT to identify major axes of genetic variation and detect outliers.
- Estimate individual ancestry proportions using ADMIXTURE or STRUCTURE with cross-validation to select K.
- Compare inferred clusters against known population labels to assess data integrity.
- Correct for population stratification in association studies using principal components as covariates.
- Identify cryptic relatedness using KING or PLINK to exclude or adjust for familial relationships.
- Interpret ADMIXTURE results in light of historical migration and admixture events relevant to the cohort.
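The PCA step above can be illustrated directly in numpy. This is a minimal sketch, not a replacement for EIGENSOFT/smartpca, but it uses the same per-variant normalisation (centre by 2p, scale by sqrt(p(1-p))) before the decomposition.

```python
import numpy as np

def genotype_pca(G, n_components=2):
    """PCA on an (individuals x variants) genotype dosage matrix (0/1/2).

    Columns are centred by twice the allele frequency and scaled by
    sqrt(p(1-p)), then decomposed by SVD. Returns the projections of
    individuals onto the top components.
    """
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0                   # per-variant allele frequency
    denom = np.sqrt(p * (1 - p))
    denom[denom == 0] = 1.0                    # guard monomorphic variants
    Z = (G - 2 * p) / denom
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_components] * S[:n_components]
```

On real data, outlier individuals show up as extreme coordinates on the leading components and are typically removed before the PCs are reused as association covariates.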
Module 5: Allele Frequency Estimation and Hardy-Weinberg Equilibrium Testing
- Calculate allele frequencies per population subgroup to identify variants with large inter-population differences.
- Apply stratified frequency estimation when subpopulations are known to avoid spurious signals.
- Test for Hardy-Weinberg equilibrium using PLINK or bcftools, filtering variants with significant deviations.
- Adjust HWE p-value thresholds based on multiple testing burden and minor allele frequency bins.
- Investigate HWE violations for potential genotyping errors, selection pressure, or inbreeding effects.
- Report frequency estimates with confidence intervals to reflect sampling uncertainty in smaller cohorts.
- Compare observed frequencies against gnomAD or 1000 Genomes to contextualize findings.
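The HWE test that PLINK's --hardy/--hwe flags perform can be sketched from genotype counts. PLINK defaults to an exact test; the classical chi-square version below illustrates the underlying computation.

```python
def hwe_chi2(n_AA, n_Aa, n_aa):
    """Chi-square Hardy-Weinberg test statistic from genotype counts (1 d.f.).

    Returns (allele_frequency_A, chi2). Compare chi2 against a chosen
    critical value (3.84 for alpha = 0.05) or convert it to a p-value.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)            # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, chi2

# hypothetical counts exactly at equilibrium: p = 0.5, expected 25/50/25
print(hwe_chi2(25, 50, 25))   # (0.5, 0.0)
```

A complete heterozygote deficit (e.g. 50/0/50) yields a very large statistic, the classic signature of genotyping error, Wahlund effect, or inbreeding noted above.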
Module 6: Detection of Natural Selection and Evolutionary Signatures
- Compute FST values between populations using Weir & Cockerham’s estimator to identify loci under differential selection.
- Scan for extended haplotype homozygosity using iHS or EHH to detect recent positive selection.
- Apply Tajima’s D tests per genomic window to infer deviations from neutral evolution.
- Integrate environmental or phenotypic data to interpret selection signals in functional context.
- Correct for demographic history using coalescent simulations to distinguish selection from drift.
- Validate selection candidates with replication in independent cohorts or functional assays.
- Use composite likelihood ratio tests (e.g., SweepFinder2) to improve power in detecting selective sweeps.
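The module names Weir & Cockerham's FST estimator; as a compact illustration, the related Hudson estimator (in the Bhatia et al. 2013 parameterisation) is easier to write down per locus, and the two broadly agree for balanced sample sizes.

```python
def hudson_fst(p1, n1, p2, n2):
    """Per-locus Hudson F_ST estimate (Bhatia et al. 2013 form).

    p1, p2: alternate-allele frequencies in each population;
    n1, n2: sampled chromosome counts (2 x diploid sample size).
    The sampling-correction terms keep the estimator near zero for
    undifferentiated loci.
    """
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num / den if den > 0 else float("nan")
```

Genome-wide, per-locus values are usually averaged as a ratio of summed numerators to summed denominators rather than a mean of ratios, which reduces bias at rare variants.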
Module 7: Genome-Wide Association Studies (GWAS) and Burden Testing
- Perform logistic or linear regression in PLINK or REGENIE, adjusting for covariates including principal components.
- Apply genomic control or use the LD score regression intercept to diagnose and adjust for residual population structure, distinguishing confounding from true polygenicity.
- Define gene-based units for burden tests using canonical transcripts and regulatory regions.
- Aggregate rare variants using SKAT or burden tests with MAF thresholds tailored to study power.
- Account for relatedness using mixed models (e.g., BOLT-LMM, SAIGE) in structured cohorts.
- Control for multiple testing using Bonferroni, FDR, or gene-based correction strategies.
- Validate top associations in replication cohorts with similar ancestry and phenotyping protocols.
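Real GWAS run through PLINK, REGENIE, or the mixed-model tools listed above, but the core single-variant test is plain regression. The numpy sketch below fits a quantitative trait on genotype dosage plus covariates (e.g. top principal components) and returns the genotype effect and its t-statistic; the simulated data in the test are hypothetical.

```python
import numpy as np

def linear_assoc(genotypes, phenotype, covariates):
    """Single-variant linear regression t-statistic for the genotype term.

    genotypes: (n,) dosage vector; phenotype: (n,) quantitative trait;
    covariates: (n, k) matrix (e.g. top principal components).
    An intercept column is added internally.
    """
    n = len(phenotype)
    X = np.column_stack([np.ones(n), genotypes, covariates])
    beta, _, _, _ = np.linalg.lstsq(X, phenotype, rcond=None)
    resid = phenotype - X @ beta
    dof = n - X.shape[1]
    sigma2 = resid @ resid / dof
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)   # Var(beta_hat)
    t_geno = beta[1] / np.sqrt(cov_beta[1, 1])
    return beta[1], t_geno
```

Scanning this over millions of variants is then a matter of vectorisation and the multiple-testing control described above; case-control phenotypes swap in logistic regression.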
Module 8: Data Integration, Annotation, and Functional Interpretation
- Annotate variants using Ensembl VEP or ANNOVAR with custom plugins for regulatory and non-coding elements.
- Integrate eQTL and chromatin interaction data (e.g., GTEx, Hi-C) to prioritize causal genes.
- Map significant loci to pathways using tools like g:Profiler or Enrichr with ancestry-matched background sets.
- Use CADD or Eigen scores to rank non-coding variants by predicted functional impact.
- Link GWAS hits to drug targets using Open Targets or DisGeNET for translational insights.
- Visualize genomic regions with LocusZoom or IGV to inspect linkage disequilibrium and annotation context.
- Maintain version-controlled annotation pipelines to ensure reproducible results across analyses.
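Annotation itself is the job of VEP or ANNOVAR, but the positional join underneath, assigning a hit to an overlapping gene model, can be sketched with a binary search over sorted intervals. Gene names and coordinates here are hypothetical, and the sketch assumes sorted, non-overlapping intervals on a single chromosome.

```python
import bisect

def map_hits_to_genes(hits, genes):
    """Assign hit positions to overlapping gene intervals on one chromosome.

    hits: iterable of positions; genes: list of (start, end, name) tuples,
    sorted by start and non-overlapping for this sketch.
    """
    starts = [g[0] for g in genes]
    assigned = {}
    for pos in hits:
        i = bisect.bisect_right(starts, pos) - 1
        if i >= 0 and genes[i][0] <= pos <= genes[i][1]:
            assigned[pos] = genes[i][2]
        else:
            assigned[pos] = None     # intergenic hit
    return assigned

# hypothetical gene models and GWAS hit positions
genes = [(100, 500, "GENE1"), (800, 1200, "GENE2")]
print(map_hits_to_genes([250, 600, 900], genes))
# {250: 'GENE1', 600: None, 900: 'GENE2'}
```

Intergenic hits (None) are the cases where the eQTL and chromatin-interaction evidence above becomes essential for assigning a candidate gene.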
Module 9: Data Sharing, Privacy, and Ethical Governance
- De-identify genomic datasets by removing direct identifiers and limiting metadata granularity.
- Apply data use limitations encoded with Data Use Ontology (DUO) terms in accordance with consent agreements and institutional review board requirements.
- Submit summary statistics to GWAS Catalog with standardized phenotypes and ancestry descriptors.
- Use controlled-access repositories (e.g., dbGaP, EGA) for individual-level data sharing.
- Implement data access committees (DACs) with defined review procedures and conflict-of-interest policies.
- Monitor for re-identification risks, e.g., with k-anonymity checks on the quasi-identifying metadata released alongside genotypes.
- Develop data transfer agreements that specify security standards and permitted use cases.
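A k-anonymity check on released metadata can be sketched by counting how many records share each combination of quasi-identifiers and flagging combinations rarer than k. The field names, records, and k = 2 threshold below are hypothetical.

```python
from collections import Counter

def k_anonymity_violations(records, quasi_identifiers, k=5):
    """Flag quasi-identifier combinations shared by fewer than k records.

    records: list of dicts of metadata fields; quasi_identifiers: the
    field names treated as potentially re-identifying in combination.
    Returns {combination: count} for every under-represented group.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return {combo: n for combo, n in counts.items() if n < k}

# hypothetical metadata released alongside genotype data
records = [
    {"birth_year": 1980, "sex": "F", "region": "North"},
    {"birth_year": 1980, "sex": "F", "region": "North"},
    {"birth_year": 1975, "sex": "M", "region": "South"},
]
print(k_anonymity_violations(records, ["birth_year", "sex", "region"], k=2))
# {(1975, 'M', 'South'): 1}
```

Flagged combinations are typically coarsened (e.g. birth year binned to decade, region widened) until every group reaches size k before release.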