This curriculum covers the full workflow of a phylogenomic research project: data curation, model testing, species tree reconstruction, divergence dating, and reproducible workflow deployment. Its scope is comparable to a multi-phase bioinformatics initiative as conducted in academic or institutional bioinformatics cores.
Module 1: Foundations of Molecular Sequence Data Acquisition and Curation
- Select appropriate sequencing technologies (e.g., Sanger vs. NGS) based on taxon sampling depth and required read accuracy for downstream phylogenetic inference.
- Implement quality control pipelines using FastQC and Trimmomatic to remove adapter contamination and low-quality bases from raw sequence reads.
- Choose orthology detection methods (e.g., OrthoFinder, InParanoid) to identify single-copy gene families suitable for species tree estimation.
- Resolve ambiguities in sequence metadata (e.g., mislabeled taxa, inconsistent nomenclature) by cross-referencing with authoritative databases like NCBI Taxonomy.
- Decide on sequence alignment inclusion criteria, such as minimum sequence length and maximum gap proportion, to balance taxon coverage and alignment reliability.
- Document provenance and versioning of sequence datasets using structured metadata formats (e.g., NeXML) to ensure reproducibility across analysis stages.
- Assess the impact of missing data patterns on phylogenetic signal by conducting subsampling experiments across loci and taxa.
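The inclusion criteria above (minimum length, maximum gap proportion) reduce to a simple filter over aligned sequences. A minimal sketch; the thresholds (200 bp minimum, 50% gap cap) and taxon names are illustrative, not field-standard values.

```python
def passes_inclusion_criteria(seq, min_length=200, max_gap_prop=0.5):
    """Keep an aligned sequence only if its ungapped length and gap
    proportion meet the thresholds ('-' marks alignment gaps)."""
    ungapped = seq.replace("-", "")
    if len(ungapped) < min_length:
        return False
    gap_prop = (len(seq) - len(ungapped)) / len(seq)
    return gap_prop <= max_gap_prop

# toy alignment: every row is 600 columns long
alignment = {
    "taxonA": "ATG" * 200,             # 600 bp, no gaps: kept
    "taxonB": "ATG" * 50 + "-" * 450,  # 150 bp ungapped: too short
    "taxonC": "ATG" * 80 + "-" * 360,  # 60% gaps: too gappy
}
kept = [name for name, seq in alignment.items()
        if passes_inclusion_criteria(seq)]
```

Raising `max_gap_prop` trades alignment reliability for taxon coverage, which is exactly the balance the subsampling experiments above are meant to probe.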
Module 2: Multiple Sequence Alignment Strategies and Evaluation
- Select alignment algorithms (e.g., MAFFT, MUSCLE, Clustal Omega) based on dataset size, sequence divergence, and computational constraints.
- Apply consistency-based refinement in T-Coffee, or phylogeny-aware gap placement in PRANK, to improve alignment accuracy in regions with high indel rates.
- Choose between nucleotide and amino acid alignment strategies depending on evolutionary divergence and substitution saturation levels.
- Use GUIDANCE2 or ZORRO to identify and mask alignment columns with low confidence scores prior to tree inference.
- Compare structural alignment outputs (e.g., using Infernal for rRNA genes) against sequence-only methods to evaluate functional conservation.
- Integrate secondary structure constraints into RNA alignments using tools like LocARNA or MAFFT's Q-INS-i mode.
- Validate alignment robustness by running replicate alignments with varied gap opening/extension penalties.
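Column masking of the kind GUIDANCE2 performs can be sketched as a score-driven filter. The per-column scores below are made up, and the 0.93 cutoff is a commonly used GUIDANCE2 column-score threshold; check it against the tool's actual output.

```python
def mask_low_confidence_columns(alignment, column_scores, cutoff=0.93):
    """Replace alignment columns whose confidence score falls below
    `cutoff` with 'N', leaving well-supported columns untouched."""
    masked = {}
    for name, seq in alignment.items():
        masked[name] = "".join(
            base if score >= cutoff else "N"
            for base, score in zip(seq, column_scores)
        )
    return masked

aln = {"taxonA": "ATGCA", "taxonB": "ATGTA"}
scores = [0.99, 0.99, 0.40, 0.95, 0.88]  # columns 3 and 5 are unreliable
masked = mask_low_confidence_columns(aln, scores)
```

Masking (rather than deleting) columns keeps coordinates stable across the alignment, which simplifies downstream partitioning.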
Module 3: Substitution Model Selection and Model Fit Assessment
- Run PartitionFinder or ModelTest-NG to identify best-fit nucleotide substitution models per gene or codon position partition.
- Decide between time-reversible and non-reversible models based on tree topology stability and biological plausibility of root placement.
- Evaluate model adequacy using posterior predictive simulations in PhyloBayes, or simulation-based checks with IQ-TREE's AliSim, to detect systematic model violations.
- Address rate heterogeneity across sites by implementing gamma-distributed rates (+G) or invariant sites (+I) based on likelihood improvement.
- Assess amino acid substitution model fit (e.g., LG, WAG, JTT) using cross-validation in large protein datasets.
- Balance model complexity against overfitting by applying AIC, BIC, or AICc criteria when comparing nested models.
- Monitor branch-length artifacts caused by poor model fit, such as long-branch attraction, through simulation-based diagnostics.
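The information-criterion comparison above reduces to a few formulas. The log-likelihoods and parameter counts below are hypothetical, and `k` here counts substitution parameters only; a full count would also include branch lengths.

```python
import math

def aic(lnL, k):
    return 2 * k - 2 * lnL

def aicc(lnL, k, n):
    # small-sample correction; diverges as k approaches n
    return aic(lnL, k) + (2 * k * (k + 1)) / (n - k - 1)

def bic(lnL, k, n):
    return k * math.log(n) - 2 * lnL

# hypothetical fits: (log-likelihood, free substitution parameters)
models = {"JC69": (-4200.0, 1), "HKY85": (-4150.0, 4), "GTR+G": (-4140.0, 9)}
n_sites = 1000

best_aic = min(models, key=lambda m: aic(*models[m]))
best_bic = min(models, key=lambda m: bic(*models[m], n_sites))
```

With these numbers AIC prefers the richer GTR+G while BIC's stronger complexity penalty favors HKY85, illustrating why the criteria can disagree on the same data.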
Module 4: Phylogenetic Inference Using Maximum Likelihood and Bayesian Methods
- Configure IQ-TREE for large-scale analyses using ultrafast bootstrapping (UFBoot) and edge-linked partition models to reduce computation time.
- Set MCMC parameters in MrBayes or BEAST2 (e.g., chain length, sampling frequency) based on effective sample size (ESS) diagnostics.
- Diagnose convergence in Bayesian runs using Tracer to evaluate ESS values for the likelihood and continuous model parameters; topological convergence requires separate tools such as RWTY.
- Compare tree topologies from ML and Bayesian analyses to identify robust clades supported by both methods.
- Implement checkpointing in BEAST2 to resume interrupted runs without loss of sampling progress.
- Optimize parallelization strategies (e.g., MPI and thread-level parallelism) for RAxML-NG on HPC clusters.
- Handle polytomies in inferred trees by assessing whether they reflect uncertainty or true evolutionary radiations.
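ESS, the diagnostic Tracer reports per parameter, can be approximated from a raw MCMC trace with the initial-positive-sequence autocorrelation estimator. A simplified sketch, not Tracer's exact algorithm; the two toy chains are simulated.

```python
import random

def effective_sample_size(trace):
    """ESS = n / (1 + 2 * sum of autocorrelations), summing successive
    lags until the sample autocorrelation first becomes non-positive."""
    n = len(trace)
    mean = sum(trace) / n
    var = sum((x - mean) ** 2 for x in trace) / n
    if var == 0:
        return float(n)
    rho_sum = 0.0
    for lag in range(1, n // 2):
        acov = sum((trace[i] - mean) * (trace[i + lag] - mean)
                   for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0:  # truncate at first non-positive autocorrelation
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

rng = random.Random(42)
iid = [rng.gauss(0, 1) for _ in range(1000)]  # well-mixed chain
ar1, x = [], 0.0
for _ in range(1000):                          # sticky, autocorrelated chain
    x = 0.95 * x + rng.gauss(0, 1)
    ar1.append(x)
```

A sticky chain yields far fewer effectively independent samples than its raw length suggests, which is why chain length and sampling frequency are tuned against ESS rather than iteration count.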
Module 5: Species Tree Estimation in the Presence of Gene Tree Discordance
- Choose between concatenation and coalescent-based species tree methods (e.g., ASTRAL, SVDquartets) based on levels of incomplete lineage sorting.
- Quantify gene tree discordance using quartet scores in ASTRAL to identify loci contributing to topological conflict.
- Filter outlier gene trees influenced by paralogy or horizontal gene transfer before species tree inference.
- Apply SNP-based methods like SVDquartets to phylogenomic datasets with high missing data.
- Assess the impact of taxon sampling density on coalescent variance in species tree branch support.
- Compare results from summary methods (ASTRAL) versus full-likelihood methods (StarBEAST2) under different demographic scenarios.
- Interpret branch lengths in coalescent units as population-scaled divergence times, not absolute time without calibration.
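Topological conflict among gene trees can be quantified by counting shared clades, a simpler relative of the quartet scores ASTRAL reports. The minimal Newick parser below assumes rooted trees with no branch lengths or support values, and is an illustration rather than a general-purpose reader.

```python
def clades(newick):
    """Return the non-trivial clades (frozensets of leaf names) implied
    by a rooted Newick string such as '((A,B),(C,D));'."""
    stack, found, token = [[]], [], ""
    for ch in newick:
        if ch == "(":
            stack.append([])
        elif ch in ",);":
            if token:
                stack[-1].append(frozenset([token]))
                token = ""
            if ch == ")":
                clade = frozenset().union(*stack.pop())
                found.append(clade)
                stack[-1].append(clade)
        else:
            token += ch
    leaves = found[-1]  # the outermost clade spans every leaf
    return {c for c in found if 1 < len(c) < len(leaves)}

t1 = "((A,B),(C,D));"
t2 = "((B,A),(D,C));"   # same topology, rotated
t3 = "((A,C),(B,D));"   # conflicting topology
shared = clades(t1) & clades(t3)
```

Gene trees sharing no clades with the species tree (as `t1` versus `t3` here) are the loci worth inspecting for paralogy or horizontal transfer before species tree inference.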
Module 6: Molecular Dating and Divergence Time Estimation
- Select appropriate clock models (strict vs. relaxed) in BEAST2 based on root-to-tip regression and coefficient of variation of rates.
- Define calibration priors using fossil constraints with justified minimum bounds and soft maximum bounds to avoid overconfidence.
- Apply multiple fossil calibrations across the tree to improve precision and test temporal congruence.
- Apply tip-dating (total-evidence) methods in BEAST2 to combined molecular and morphological datasets that include fossil taxa.
- Assess the impact of calibration placement by running sensitivity analyses with alternative fossil placements.
- Integrate biogeographic events (e.g., land bridge formation) as secondary calibration points when fossil data are sparse.
- Report HPD intervals for node ages with explicit justification of prior distributions and model assumptions.
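An offset-lognormal calibration prior, with a hard minimum at the fossil age and a soft maximum at an upper quantile, can be explored by simulation. A sketch only: the hyperparameters and the 66 Ma minimum are illustrative, not a recommended calibration.

```python
import random

def sample_calibration_prior(min_age, mu=1.0, sigma=0.6, n=100_000, seed=1):
    """Draw node ages from an offset lognormal prior: a hard minimum
    bound (the fossil age) plus a lognormally distributed excess."""
    rng = random.Random(seed)
    return [min_age + rng.lognormvariate(mu, sigma) for _ in range(n)]

ages = sorted(sample_calibration_prior(min_age=66.0))
soft_max = ages[int(0.975 * len(ages))]  # the prior's "soft maximum"
```

Plotting or quantile-checking the prior like this, before any data are added, is a quick way to confirm the soft maximum is not unintentionally overconfident.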
Module 7: Phylogenetic Comparative Methods and Trait Evolution
- Fit models of continuous trait evolution (Brownian motion, Ornstein-Uhlenbeck) using phytools or nlme in R.
- Test for phylogenetic signal in discrete traits using Pagel’s λ or Blomberg’s K with significance assessed via permutation.
- Reconstruct ancestral states for categorical traits using stochastic character mapping in SIMMAP.
- Control for phylogenetic non-independence in regression models using PGLS for macroevolutionary hypotheses.
- Identify shifts in evolutionary rates using BAMM or l1ou, validating results against tree-wide rate homogeneity tests.
- Account for uncertainty in tree topology and branch lengths by conducting analyses across posterior tree distributions.
- Evaluate model fit of state-dependent diversification using BiSSE or HiSSE with likelihood ratio tests.
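Pagel's λ acts on the phylogenetic variance-covariance matrix by scaling its off-diagonal (shared-history) entries. A minimal sketch, using the VCV implied by the toy tree ((A:1,B:1):1,C:2).

```python
def pagel_lambda_transform(vcv, lam):
    """Multiply the off-diagonal entries of a phylogenetic VCV matrix by
    lambda: lambda = 1 keeps the tree's covariance structure intact,
    lambda = 0 collapses it to a star phylogeny (no signal)."""
    n = len(vcv)
    return [[vcv[i][j] if i == j else lam * vcv[i][j] for j in range(n)]
            for i in range(n)]

# VCV implied by the rooted ultrametric tree ((A:1,B:1):1,C:2);
vcv = [[2.0, 1.0, 0.0],
       [1.0, 2.0, 0.0],
       [0.0, 0.0, 2.0]]
star = pagel_lambda_transform(vcv, 0.0)  # star phylogeny: no covariance
```

Fitting λ by maximum likelihood (as phytools does) amounts to finding the value of this transform under which the trait data are most probable.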
Module 8: Visualization, Annotation, and Communication of Phylogenetic Results
- Generate publication-ready tree figures using ggtree in R, incorporating bootstrap values, divergence times, and trait data.
- Integrate geographic data into phylogenies using phylogeographic mapping tools in Microreact or iTOL.
- Export annotated trees in standard formats (e.g., Newick, Nexus) with embedded metadata for sharing in Dryad or TreeBASE.
- Use color schemes and layout types (radial, rectangular) to emphasize clades of interest without distorting branch lengths.
- Embed uncertainty in visualizations by showing credible sets of trees or heatmaps of clade support across analyses.
- Design interactive web displays using Phylo.io or OneZoom for large trees intended for collaborative review.
- Ensure accessibility of figures by adhering to colorblind-safe palettes and scalable vector formats (SVG, PDF).
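The coordinate pass behind a rectangular tree layout, the kind ggtree draws, can be sketched in a few lines. The nested-tuple tree encoding is an assumption of this sketch, not ggtree's internal data structure.

```python
def rectangular_layout(tree):
    """Assign (x, y) plotting coordinates for a rectangular layout:
    leaves take successive y rows, each internal node sits at the mean y
    of its children, and x is node depth in unit branch lengths.
    `tree` is nested tuples with strings as leaves."""
    coords = {}
    counter = iter(range(1_000_000))  # next free leaf row

    def place(node, depth):
        if isinstance(node, str):     # leaf
            y = float(next(counter))
            coords[node] = (depth, y)
            return y
        ys = [place(child, depth + 1) for child in node]
        y = sum(ys) / len(ys)         # center parent over its children
        coords[node] = (depth, y)
        return y

    place(tree, 0)
    return coords

tree = (("A", "B"), "C")  # ((A,B),C); in Newick
coords = rectangular_layout(tree)
```

Radial layouts differ only in mapping these (depth, row) pairs to (radius, angle), which is why switching layout types never changes the underlying branch lengths.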
Module 9: Reproducibility, Workflow Management, and High-Performance Computing
- Containerize analysis pipelines using Docker or Singularity to ensure software environment consistency.
- Orchestrate phylogenomic workflows using Nextflow or Snakemake to manage dependencies and parallelize tasks.
- Store intermediate files and final outputs in structured directory trees with standardized naming conventions.
- Version-control scripts and configuration files using Git, with annotated commits reflecting analytical decisions.
- Optimize memory and CPU allocation for BEAST2 or RAxML runs on shared HPC systems using job scheduler directives (e.g., SLURM).
- Implement checkpointing and error recovery mechanisms in long-running analyses to minimize reprocessing.
- Archive complete analysis workflows in repositories like Zenodo to ensure long-term reproducibility and DOI assignment.
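The checkpoint-and-resume pattern above can be sketched with a JSON progress file. Real workflow managers such as Nextflow and Snakemake track completion per rule, but the idea is the same; `run_with_checkpoint` and the locus task names are hypothetical.

```python
import json
import os
import tempfile

def run_with_checkpoint(tasks, ckpt_path, work):
    """Run `work(task)` for each task, persisting the list of completed
    tasks after every step so an interrupted run resumes without
    redoing finished work."""
    done = []
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as fh:
            done = json.load(fh)
    for task in tasks:
        if task in done:
            continue                      # finished in a previous run
        work(task)
        done.append(task)
        with open(ckpt_path, "w") as fh:  # checkpoint after every task
            json.dump(done, fh)
    return done

# simulate a resumed run: locus1 already completed before the "crash"
executed = []
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "progress.json")
    with open(ckpt, "w") as fh:
        json.dump(["locus1"], fh)
    done = run_with_checkpoint(["locus1", "locus2", "locus3"], ckpt,
                               executed.append)
```

Writing the checkpoint after every task (rather than once at the end) bounds the reprocessing cost of a crash to a single task, at the price of extra small writes.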