This curriculum spans the full workflow of molecular evolution analysis, from experimental design through reproducible reporting, and is comparable in scope to a multi-phase bioinformatics consultancy or a structured internal genomics training program.
Module 1: Defining Evolutionary Research Questions and Study Design
- Select appropriate phylogenetic scope (intra-species, inter-species, or pan-genomic) based on biological question and data availability.
- Justify inclusion or exclusion of taxa to balance representativeness and computational tractability.
- Determine whether to pursue gene tree or species tree inference based on expected levels of incomplete lineage sorting or horizontal gene transfer.
- Choose between de novo sequencing and public database reuse, weighing data quality against cost and novelty constraints.
- Establish criteria for outgroup selection to ensure rooting accuracy without introducing long-branch attraction artifacts.
- Design sampling strategies that account for geographic, temporal, and phenotypic diversity to avoid biased evolutionary interpretations.
- Define thresholds for sequence coverage and quality to ensure reliable variant calling in downstream analyses.
- Document metadata standards for sequences to support reproducibility and integration across studies.
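The coverage and quality thresholds above can be encoded as an explicit gating step so that inclusion decisions are reproducible rather than ad hoc. A minimal sketch, with illustrative cutoff values (30x coverage, Q30) that are not recommendations:

```python
# Minimal sketch of sample-level QC gating for study design.
# Threshold values are illustrative placeholders, not field standards.

def passes_qc(mean_coverage: float, mean_qual: float,
              min_coverage: float = 30.0, min_qual: float = 30.0) -> bool:
    """Return True if a sample meets the illustrative coverage/quality cutoffs."""
    return mean_coverage >= min_coverage and mean_qual >= min_qual

# Hypothetical samples: (mean coverage, mean base quality)
samples = {
    "isolate_A": (45.2, 35.1),
    "isolate_B": (12.8, 36.0),   # fails coverage
    "isolate_C": (60.0, 22.5),   # fails base quality
}
kept = [name for name, (cov, q) in samples.items() if passes_qc(cov, q)]
print(kept)  # ['isolate_A']
```

Recording the thresholds as function defaults (or in a config file) means they travel with the analysis and can be cited directly in the methods section.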
Module 2: Sequence Acquisition, Curation, and Quality Control
- Implement automated pipelines to retrieve orthologous sequences from GenBank, RefSeq, or SRA using Biopython or Entrez Direct.
- Apply strict filtering criteria for sequence completeness, excluding entries with large unsequenced regions or ambiguous annotations.
- Identify and remove chimeric sequences using BLAST-based validation or k-mer anomaly detection.
- Standardize sequence naming conventions to prevent misalignment during concatenation or batch processing.
- Assess contamination risks in metagenomic or environmental samples using taxon-specific k-mer profiling.
- Integrate quality scores from FASTQ files into trimming decisions using tools like Trimmomatic or Cutadapt.
- Validate coding sequence (CDS) annotations through ORF prediction and comparison with reference proteomes.
- Document provenance and versioning of all sequence datasets to support auditability and reanalysis.
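The completeness filter described above can be sketched as a simple screen on the fraction of ambiguous bases; the 5% cutoff below is an illustrative choice, and real pipelines would combine it with annotation and contamination checks:

```python
# Sketch of a sequence-completeness filter: drop entries whose fraction of
# ambiguous bases (N or other IUPAC ambiguity codes) exceeds a cutoff.
# The 5% cutoff is illustrative, not a community standard.

AMBIGUOUS = set("NRYSWKMBDHV")

def ambiguous_fraction(seq: str) -> float:
    """Fraction of positions that are not unambiguous A/C/G/T."""
    seq = seq.upper()
    return sum(base in AMBIGUOUS for base in seq) / len(seq)

def filter_complete(records: dict, max_ambiguous: float = 0.05) -> dict:
    """Keep only records below the ambiguity cutoff."""
    return {name: seq for name, seq in records.items()
            if ambiguous_fraction(seq) <= max_ambiguous}

records = {
    "seq1": "ATGCGTACGTTAGC",   # clean
    "seq2": "ATGNNNNNNNNAGC",   # large unsequenced region
}
print(sorted(filter_complete(records)))  # ['seq1']
```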
Module 3: Multiple Sequence Alignment and Homology Assessment
- Select alignment algorithm (MAFFT, Clustal Omega, or MUSCLE) based on dataset size and expected divergence.
- Decide between progressive and iterative alignment methods when dealing with highly divergent sequences.
- Apply masking strategies to remove poorly aligned regions using Gblocks or TrimAl without over-trimming conserved motifs.
- Validate alignment accuracy by inspecting conserved domain structures using Pfam or InterPro annotations.
- Assess homology at the amino acid versus nucleotide level depending on evolutionary distance and selection pressure.
- Handle frame shifts and indels in coding sequences by aligning at the protein level and back-translating to nucleotides.
- Integrate structural alignment data (e.g., from PDB) when available to guide homology modeling in ambiguous regions.
- Quantify alignment uncertainty using posterior probability scores from probabilistic aligners like PRANK.
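The protein-level alignment and back-translation step can be sketched as follows: each aligned residue is replaced by its source codon, and each gap becomes a codon-width gap, which preserves reading frame in the nucleotide alignment. This assumes the CDS is exactly three times the ungapped protein length:

```python
# Sketch of protein-guided codon alignment: align at the amino acid level,
# then back-translate the gap pattern onto the ungapped coding sequence.
# Assumes len(cds) == 3 * number of residues in the ungapped protein.

def back_translate(aligned_protein: str, cds: str) -> str:
    """Map each aligned residue to its codon; '-' becomes '---'."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    out, k = [], 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")
        else:
            out.append(codons[k])
            k += 1
    return "".join(out)

# The protein alignment introduced one gap after the first residue:
print(back_translate("M-KV", "ATGAAAGTT"))  # ATG---AAAGTT
```

Tools such as PAL2NAL automate this mapping, but the core operation is the codon-wise gap propagation shown here.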
Module 4: Phylogenetic Tree Inference and Model Selection
- Compare substitution models (GTR, HKY, etc.) using AIC or BIC scores to balance fit and overparameterization.
- Determine whether to use maximum likelihood (RAxML, IQ-TREE) or Bayesian (MrBayes, BEAST) methods based on dataset size and uncertainty requirements.
- Set branch support thresholds (e.g., bootstrap ≥70%, posterior probability ≥0.95) for clade interpretation.
- Partition data by gene, codon position, or functional domain and test for partition heterogeneity using PartitionFinder.
- Account for rate variation across sites using gamma-distributed rate categories or invariant sites models.
- Monitor MCMC convergence in Bayesian analyses using ESS values and trace plots in Tracer.
- Address long-branch attraction through taxon addition, model refinement, or site-heterogeneous models like CAT.
- Validate tree topology robustness via jackknife resampling or posterior predictive simulations.
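The AIC/BIC comparison above reduces to two small formulas (AIC = 2k − 2lnL; BIC = k·ln(n) − 2lnL, with n the number of alignment sites). A minimal sketch, using made-up log-likelihoods and illustrative parameter counts rather than real optimizer output:

```python
import math

# Sketch of information-criterion model comparison. The log-likelihoods and
# free-parameter counts below are fabricated for illustration; in practice
# they come from the ML software's model-fit output (e.g. IQ-TREE).

def aic(lnL: float, k: int) -> float:
    return 2 * k - 2 * lnL

def bic(lnL: float, k: int, n: int) -> float:
    return k * math.log(n) - 2 * lnL

n_sites = 1200
models = {
    "JC":  (-8450.0, 1),   # (log-likelihood, free parameters) -- illustrative
    "HKY": (-8320.0, 5),
    "GTR": (-8315.0, 9),
}
best = min(models, key=lambda m: bic(*models[m], n_sites))
print(best)  # HKY -- GTR fits slightly better but pays a larger penalty
```

Note how BIC's ln(n) penalty can prefer HKY even though GTR has the higher likelihood; this is exactly the fit-versus-overparameterization trade-off the bullet describes.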
Module 5: Molecular Clock Analysis and Divergence Time Estimation
- Select calibration points using fossil records, biogeographic events, or known sampling dates with documented uncertainty distributions.
- Choose between strict and relaxed molecular clock models based on empirical rate variation across branches.
- Assess clock-likeness using root-to-tip regression in TempEst before applying time-scaled models.
- Integrate tip-dating in BEAST for ancient DNA datasets with known radiocarbon dates.
- Define priors for substitution rates based on empirical data from related clades, avoiding overly informative assumptions.
- Quantify uncertainty in node ages by analyzing 95% highest posterior density intervals.
- Validate temporal signal by performing date-randomization tests to rule out spurious time correlations.
- Report clock model fit statistics (e.g., marginal likelihoods) when comparing alternative evolutionary scenarios.
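The root-to-tip regression that TempEst performs is an ordinary least-squares fit of root-to-tip divergence against sampling date: the slope approximates the substitution rate and the x-intercept estimates the root date. A sketch on fabricated data:

```python
# Sketch of a root-to-tip regression (clock-likeness check, as in TempEst).
# Distances and dates are fabricated for illustration; real values come from
# a rooted ML tree and tip sampling metadata.

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

dates = [2000, 2005, 2010, 2015, 2020]        # tip sampling years
dists = [0.010, 0.015, 0.020, 0.025, 0.030]   # root-to-tip divergence
rate, b = linear_fit(dates, dists)
root_date = -b / rate                          # x-intercept: estimated root age
print(round(rate, 4), round(root_date, 1))  # 0.001 1990.0
```

A strong positive correlation here supports applying a time-scaled model; a flat or noisy fit argues for a date-randomization test before trusting any divergence-time estimates.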
Module 6: Detection of Selection and Adaptive Evolution
- Apply codon-based models (e.g., PAML, HyPhy) to estimate dN/dS ratios across sites, branches, or clades.
- Interpret ω (dN/dS) values with caution, recognizing limitations in power for weak or episodic selection.
- Differentiate between pervasive purifying selection and episodic positive selection using branch-site models.
- Control for recombination by screening alignments with GARD or Phi-test before selection analysis.
- Validate signals of positive selection with complementary methods such as FUBAR or MEME.
- Integrate population genetic data (e.g., Tajima’s D, Fay & Wu’s H) to distinguish selection from demographic effects.
- Map positively selected sites onto protein structures to assess functional plausibility.
- Document multiple testing corrections (e.g., FDR) when scanning genome-wide datasets for selection signals.
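The FDR correction mentioned in the last bullet is typically Benjamini-Hochberg; a self-contained sketch on illustrative p-values (real scans would take these from per-site or per-branch selection tests):

```python
# Sketch of Benjamini-Hochberg FDR control for genome-wide selection scans.
# P-values below are illustrative, not from a real analysis.

def bh_reject(pvals, alpha=0.05):
    """Return the set of indices rejected at FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):   # ranks are 1-based
        if pvals[i] <= rank * alpha / m:
            k_max = rank                        # largest rank passing the test
    return {order[r] for r in range(k_max)}

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sorted(bh_reject(pvals)))  # [0, 1]
```

Reporting both the raw p-values and the FDR threshold used lets reviewers reproduce exactly which sites were called as under selection.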
Module 7: Handling Recombination and Horizontal Gene Transfer
- Screen alignments for recombination breakpoints using Phi-test, GENECONV, or RDP5.
- Decide whether to exclude recombinant sequences, partition them, or use recombination-aware phylogenetic models.
- Apply phylogenetic incongruence tests (e.g., AU test) to quantify conflict between gene trees.
- Use ClonalFrameML or Gubbins to infer recombination events and correct phylogenies in bacterial genomes.
- Interpret horizontal gene transfer (HGT) candidates by assessing anomalous GC content, codon usage, or phylogenetic placement.
- Validate HGT events with synteny analysis across closely related genomes.
- Adjust substitution rate estimates to account for recombination-induced homoplasy.
- Document recombination filters and thresholds in methods sections to ensure reproducibility.
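The GC-content screen for HGT candidates can be sketched as a z-score outlier test against the genome-wide GC distribution. The cutoff of 2 and the sequences below are illustrative; real screens combine GC, codon usage, and phylogenetic placement:

```python
# Sketch of a GC-content anomaly screen for HGT candidates: flag genes whose
# GC fraction deviates strongly from the genome-wide mean. The z-score cutoff
# and the toy sequences are illustrative only.

def gc_fraction(seq: str) -> float:
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def flag_outliers(genes: dict, z_cut: float = 2.0):
    """Return gene names whose GC fraction is more than z_cut SDs from the mean."""
    gcs = {name: gc_fraction(s) for name, s in genes.items()}
    vals = list(gcs.values())
    mean = sum(vals) / len(vals)
    sd = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
    return [name for name, v in gcs.items() if sd and abs(v - mean) / sd > z_cut]

genes = {f"gene{i}": "ATGCGATCGA" for i in range(9)}   # GC = 0.5
genes["geneX"] = "GGCCGGCCAT"                          # GC = 0.8, anomalous
print(flag_outliers(genes))  # ['geneX']
```

A flagged gene is only a candidate: the synteny and phylogenetic-placement checks above are still needed before calling it a transfer event.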
Module 8: Integration of Phenotypic and Functional Data
- Map discrete phenotypic traits (e.g., drug resistance, host specificity) onto phylogenies using ancestral state reconstruction.
- Test for phylogenetic signal in continuous traits using Pagel’s λ or Blomberg’s K.
- Perform phylogenetic generalized least squares (PGLS) to control for non-independence in trait evolution studies.
- Correlate molecular evolutionary rates with phenotypic innovation using clade-based rate tests.
- Integrate gene expression or protein abundance data to contextualize selection signals.
- Use phylotranscriptomic approaches to infer evolutionary changes in regulatory networks.
- Validate functional predictions from evolutionary analysis with in vitro or in vivo assays when feasible.
- Link positively selected sites to known functional domains using databases like UniProt or GO.
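Discrete ancestral state reconstruction, the first bullet in this module, can be illustrated with Fitch parsimony on a toy tree: sets of possible states are intersected up the tree, and each empty intersection counts one state change. This is a parsimony sketch; likelihood-based reconstruction (e.g. in ape or PastML) is usually preferred in practice:

```python
# Sketch of Fitch parsimony for discrete ancestral-state reconstruction,
# e.g. mapping a drug-resistance phenotype onto a phylogeny. The tree is a
# nested tuple of leaf names; the example data are illustrative.

def fitch(node, states):
    """Return (possible state set at node, minimum changes below node)."""
    if isinstance(node, str):                  # leaf: observed state
        return {states[node]}, 0
    left_set, left_n = fitch(node[0], states)
    right_set, right_n = fitch(node[1], states)
    inter = left_set & right_set
    if inter:                                  # states agree: no change needed
        return inter, left_n + right_n
    return left_set | right_set, left_n + right_n + 1   # one change inferred

tree = (("A", "B"), ("C", "D"))
states = {"A": "sensitive", "B": "sensitive",
          "C": "resistant", "D": "resistant"}
root_states, changes = fitch(tree, states)
print(changes)  # 1 -- a single origin of resistance on this topology
```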
Module 9: Data Visualization, Reproducibility, and Reporting
- Design publication-ready phylogenies using ggtree or FigTree, ensuring accurate scale bars and support values.
- Generate time-scaled trees with annotated traits and uncertainty intervals using BEAST output and IcyTree.
- Use interactive visualization tools (e.g., Microreact) for sharing spatiotemporal evolutionary patterns.
- Implement version-controlled analysis pipelines using Git and containerization (Docker/Singularity).
- Archive raw data, scripts, and intermediate files in public repositories (e.g., Zenodo, Dryad) with DOIs.
- Adopt workflow languages (Snakemake, Nextflow) to ensure reproducibility across computing environments.
- Report model assumptions, software versions, and parameter settings in detail for auditability.
- Produce supplementary materials that include alignment files, tree files, and model fit statistics for peer review.
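The reporting bullets above can be partially automated by emitting a machine-readable methods record next to the results. A minimal sketch; the field names and example parameter strings are illustrative, not a fixed schema:

```python
# Sketch of a machine-readable methods record: capture software versions and
# key parameters alongside results so an analysis can be audited and rerun.
# Field names and example settings are illustrative.

import json
import platform
import sys

def methods_record(params: dict) -> str:
    """Serialize runtime environment plus analysis parameters as JSON."""
    record = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "parameters": params,
    }
    return json.dumps(record, indent=2, sort_keys=True)

report = methods_record({
    "aligner": "mafft --auto",               # example settings, not prescriptions
    "tree_method": "iqtree2 -m MFP -B 1000",
    "seed": 42,
})
print(report)
```

Committing this record to the same Git repository as the pipeline, and archiving it with the Zenodo/Dryad deposit, closes the loop between the methods text and what was actually run.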