This curriculum spans the full workflow of a multi-investigator phylogenomics project, comparable in scope to an internal bioinformatics capability program that supports study design, data curation, model-based inference, and reproducible workflow automation across distributed research teams.
Module 1: Study Design and Data Acquisition in Phylogenomics
- Select appropriate sequencing strategies (e.g., whole-genome, targeted capture, transcriptome) based on taxon sampling and evolutionary divergence.
- Determine inclusion criteria for operational taxonomic units (OTUs) to balance phylogenetic breadth with data quality.
- Evaluate trade-offs between sequencing depth and number of taxa when constrained by budget and computational resources.
- Establish metadata standards for sample provenance, sequencing platform, and library preparation to ensure reproducibility.
- Assess contamination risks in environmental or ancient DNA samples prior to alignment and orthology detection.
- Implement data versioning and access protocols for multi-investigator collaborations involving distributed datasets.
- Navigate ethical and legal considerations for using sequence data from protected species or indigenous biota.
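The metadata bullet above can be made concrete with a small validation step run before samples enter a shared dataset. The sketch below is one possible shape, not a community standard: the record type `SampleRecord`, its field names, and the helper `missing_fields` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SampleRecord:
    """Minimal provenance record; every field name here is illustrative."""
    sample_id: str
    taxon: str
    platform: Optional[str] = None             # sequencing platform
    library_prep: Optional[str] = None         # library preparation protocol
    collection_locality: Optional[str] = None  # where the specimen was collected


def missing_fields(record: SampleRecord) -> List[str]:
    """Return the names of fields that are unset or empty strings."""
    return [
        f.name
        for f in fields(record)
        if getattr(record, f.name) in (None, "")
    ]
```

A record that returns a non-empty list can be held back from the shared dataset until the submitting lab completes it.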
Module 2: Sequence Alignment and Orthology Inference
- Choose between de novo and reference-based assembly methods for non-model organisms with limited genomic resources.
- Configure alignment parameters in MAFFT or MUSCLE to balance speed and accuracy for large multi-sequence datasets.
- Apply domain-aware masking (e.g., using HMMER) to remove low-complexity or non-homologous regions from alignments.
- Compare orthology inference methods (e.g., OrthoFinder, OrthoMCL) based on scalability and sensitivity for gene family clustering.
- Resolve paralogy through gene tree-species tree reconciliation when constructing species-level phylogenies.
- Integrate synteny information to validate ortholog calls in closely related species with recent duplications.
- Document alignment curation steps to maintain auditability in regulatory or publication contexts.
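As a simple illustration of handling paralogy, the sketch below keeps only orthogroups with at most one gene per species. This is a filtering shortcut, not gene tree-species tree reconciliation: it discards multi-copy groups rather than resolving them. The input layout (a dictionary mapping orthogroup names to `(species, gene_id)` pairs) and the `min_occupancy` parameter are assumptions, not the output format of OrthoFinder or OrthoMCL.

```python
from collections import Counter
from typing import Dict, List, Tuple

Gene = Tuple[str, str]  # (species, gene_id); a hypothetical layout


def single_copy_orthogroups(
    orthogroups: Dict[str, List[Gene]],
    species: List[str],
    min_occupancy: float = 1.0,
) -> Dict[str, List[Gene]]:
    """Keep orthogroups with at most one gene per species and enough species."""
    kept = {}
    for name, genes in orthogroups.items():
        counts = Counter(sp for sp, _ in genes)
        if any(n > 1 for n in counts.values()):
            continue  # some species has several copies: possible paralogy
        occupancy = len(counts) / len(species)
        if occupancy >= min_occupancy:
            kept[name] = genes
    return kept
```

Lowering `min_occupancy` trades taxon completeness for a larger number of retained loci.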
Module 3: Alignment Curation and Data Filtering
- Apply Gblocks or BMGE to remove ambiguously aligned regions while preserving phylogenetically informative sites.
- Quantify missing data per site and per taxon, and filter positions or taxa that exceed a chosen threshold to avoid topological artifacts.
- Assess compositional heterogeneity across taxa using chi-square tests or posterior predictive checks in PhyloBayes.
- Decide on inclusion/exclusion of fast-evolving sites based on site-rate heterogeneity models.
- Implement partition schemes based on gene, codon position, or functional domain prior to model selection.
- Use RogueNaRok to identify unstable taxa that degrade tree resolution and support values.
- Balance data retention with signal-to-noise ratio when filtering low-information partitions.
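The missing-data filtering described above can be sketched in a few lines. This is a minimal illustration, not a replacement for Gblocks or BMGE: the alignment is assumed to be a dictionary of equal-length strings, the set of characters counted as missing (`-` and `?` by default) is an assumption, and the 0.5 threshold is arbitrary.

```python
from typing import Dict


def missing_fraction_per_taxon(
    alignment: Dict[str, str], missing: str = "-?"
) -> Dict[str, float]:
    """Fraction of missing characters in each taxon's sequence."""
    return {
        name: sum(ch in missing for ch in seq) / len(seq)
        for name, seq in alignment.items()
    }


def filter_columns(
    alignment: Dict[str, str], max_missing: float = 0.5, missing: str = "-?"
) -> Dict[str, str]:
    """Drop columns whose fraction of missing characters exceeds max_missing."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("sequences must have equal length")
    keep = [
        i
        for i in range(length)
        if sum(alignment[n][i] in missing for n in names) / len(names)
        <= max_missing
    ]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}
```

For nucleotide data, `N` could be added to `missing`. It is left out of the default because `N` is a valid residue in amino acid alignments.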
Module 4: Substitution Model Selection and Partitioning
- Run ModelFinder, jModelTest2 (nucleotide data), or ProtTest (amino acid data) to identify best-fit substitution models per partition.
- Compare BIC, AIC, and AICc for model selection under different dataset sizes and parameter counts.
- Decide between linked and unlinked branch length models across partitions based on empirical fit.
- Test whether modeling among-site rate heterogeneity, with gamma-distributed rates or a proportion of invariant sites, improves model fit.
- Justify use of codon models versus amino acid models for detecting selection in protein-coding sequences.
- Validate model adequacy using posterior predictive simulations in Bayesian frameworks.
- Document model decisions for peer review and reproducibility in collaborative phylogenies.
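The three criteria compared above differ only in how they penalize parameter count. The sketch below computes them from a log-likelihood, the number of free parameters, and a sample size. Taking the sample size to be the number of alignment sites is a common convention but an assumption here, and the function name `information_criteria` is hypothetical.

```python
import math
from typing import Dict


def information_criteria(
    log_likelihood: float, n_params: int, n_sites: int
) -> Dict[str, float]:
    """Return AIC, AICc, and BIC for one fitted model (lower is better)."""
    k = n_params
    if n_sites - k - 1 <= 0:
        raise ValueError("AICc is undefined when n_sites <= n_params + 1")
    aic = 2 * k - 2 * log_likelihood
    # Small-sample correction; vanishes as n_sites grows relative to k.
    aicc = aic + (2 * k * (k + 1)) / (n_sites - k - 1)
    # BIC penalizes each parameter by ln(n) instead of 2.
    bic = k * math.log(n_sites) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}
```

Because the BIC penalty grows with alignment length, BIC tends to favor simpler models than AIC on long alignments, which is one reason the criteria can disagree.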
Module 5: Phylogenetic Inference Using Maximum Likelihood and Bayesian Methods
- Configure RAxML-NG or IQ-TREE for parallel execution on high-performance computing clusters.
- Set the number of bootstrap replicates and the search thoroughness so that support values stabilize, for example by applying a bootstrap convergence criterion.
- Monitor MCMC chain convergence in MrBayes or PhyloBayes using ESS values and trace plots.
- Adjust heating parameters and chain length in Bayesian analyses to keep chains from becoming trapped in local optima.
- Compare topology outputs from ML and Bayesian methods to assess robustness under different assumptions.
- Manage memory and runtime constraints when analyzing large supermatrices (>10,000 sites, >100 taxa).
- Implement checkpointing and job resubmission workflows for long-running tree searches.
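To show what an ESS value measures, the sketch below estimates it from a single parameter trace as N / (1 + 2 Σ ρ_k), summing autocorrelations until the first non-positive one. This truncation rule is a simple choice made for illustration; MrBayes, PhyloBayes, and tools such as Tracer use their own estimators, so values will not match theirs exactly. The function name is hypothetical.

```python
from typing import Sequence


def effective_sample_size(trace: Sequence[float]) -> float:
    """Estimate ESS of an MCMC trace from its sample autocorrelations."""
    n = len(trace)
    mean = sum(trace) / n
    centered = [x - mean for x in trace]
    variance = sum(c * c for c in centered) / n
    if variance == 0:
        raise ValueError("ESS is undefined for a constant trace")
    rho_sum = 0.0
    for lag in range(1, n):
        acov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = acov / variance
        if rho <= 0:
            break  # stop at the first non-positive autocorrelation
        rho_sum += rho
    return n / (1 + 2 * rho_sum)
```

A strongly autocorrelated chain yields an ESS far below the number of stored samples, which is why chain length alone does not indicate adequate sampling.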
Module 6: Species Tree Estimation and Gene Tree Discordance
- Apply ASTRAL or ASTRID to infer species trees from gene trees while accounting for incomplete lineage sorting.
- Quantify gene tree discordance using internode certainty or quartet similarity measures.
- Summarize and visualize gene tree conflict using PhyParts or DiscoVista, then evaluate candidate sources of discordance (e.g., ILS, hybridization, HGT).
- Integrate coalescent-based methods when population-level sampling is available within species.
- Assess impact of gene tree estimation error on species tree accuracy using simulation benchmarks.
- Use D-statistics (ABBA-BABA) to test for introgression between non-sister lineages.
- Balance computational cost with model realism when choosing between concatenation and coalescent frameworks.
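The D-statistic mentioned above is D = (ABBA − BABA) / (ABBA + BABA), counted over biallelic sites in the four-taxon arrangement (((P1, P2), P3), O). The sketch below counts patterns from one aligned sequence per taxon, which is a simplification: population-level analyses typically use allele frequencies, and significance is usually assessed with a block jackknife, which is not shown. The function name is hypothetical.

```python
from typing import Tuple


def d_statistic(p1: str, p2: str, p3: str, outgroup: str) -> Tuple[float, int, int]:
    """Return (D, n_ABBA, n_BABA) for four aligned sequences."""
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        site = (a, b, c, o)
        # Keep only unambiguous sites with exactly two alleles.
        if any(ch not in "ACGT" for ch in site) or len(set(site)) != 2:
            continue
        if a == o and b == c and b != o:
            abba += 1  # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            baba += 1  # P1 and P3 share the derived allele
    if abba + baba == 0:
        raise ValueError("no informative ABBA or BABA sites")
    return (abba - baba) / (abba + baba), abba, baba
```

Under incomplete lineage sorting alone the two patterns are expected in equal numbers, so D near zero is the null expectation. A consistent excess of one pattern suggests gene flow between P3 and either P1 or P2.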
Module 7: Phylogenetic Comparative Methods and Trait Evolution
- Reconstruct ancestral states for discrete traits using stochastic mapping in phytools or corHMM.
- Fit evolutionary models (Brownian motion, Ornstein-Uhlenbeck, early burst) to continuous traits using maximum likelihood.
- Test for phylogenetic signal using Blomberg’s K or Pagel’s λ across different clades.
- Control for phylogenetic non-independence in regression models using PGLS.
- Identify shifts in evolutionary rates or trait optima using BAMM or l1ou, with careful prior specification.
- Validate model assumptions (e.g., normality of residuals, tree calibration) before inference.
- Interpret clade-specific results in light of fossil calibration and biogeographic context.
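Phylogenetic non-independence can be illustrated with Felsenstein's independent contrasts, swapped in here for PGLS because it needs no linear algebra and is closely related under Brownian motion. The sketch below computes standardized contrasts on a fully bifurcating tree and uses their mean square as an estimate of the Brownian rate. The `Node` class and function names are hypothetical, and in practice packages such as ape or phytools would be used instead.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    length: float                 # branch length leading to this node
    name: Optional[str] = None    # set for tips only
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def contrasts(node: Node, traits: Dict[str, float], out: List[float]) -> Tuple[float, float]:
    """Return (node value, adjusted branch length); append contrasts to out."""
    if node.left is None or node.right is None:
        return traits[node.name], node.length
    x_l, v_l = contrasts(node.left, traits, out)
    x_r, v_r = contrasts(node.right, traits, out)
    out.append((x_l - x_r) / math.sqrt(v_l + v_r))  # standardized contrast
    # Weighted average of the children, weights inversely proportional to length.
    value = (x_l / v_l + x_r / v_r) / (1 / v_l + 1 / v_r)
    # Lengthen the branch to account for uncertainty in the estimated value.
    return value, node.length + v_l * v_r / (v_l + v_r)


def brownian_rate(root: Node, traits: Dict[str, float]) -> float:
    """Mean squared standardized contrast, an estimate of the BM rate."""
    out: List[float] = []
    contrasts(root, traits, out)
    return sum(c * c for c in out) / len(out)
```

The contrasts, unlike the raw tip values, are independent under Brownian motion, so they can be used in ordinary regression through the origin.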
Module 8: Visualization, Annotation, and Data Sharing
- Generate publication-quality tree figures using ggtree or iTOL with consistent color and label schemes.
- Annotate trees with metadata (e.g., geography, phenotype, divergence times) for exploratory analysis.
- Export trees in NHX or NeXML formats to preserve annotations and support interoperability.
- Submit final phylogenies and alignments to curated repositories (e.g., TreeBASE) and sequence data to GenBank, following the MIAPA reporting guidelines.
- Use FigTree or Dendroscope for interactive exploration of large or complex topologies.
- Version-control tree files and analysis scripts using Git or similar systems for audit trails.
- Design scalable visualization strategies for consensus networks or phylogenetic placement results.
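To show how NHX carries annotations, the sketch below serializes a small tree to Newick with `[&&NHX:key=value]` comments placed after each branch length. The `TreeNode` class is hypothetical, and the writer does no escaping or validation of names and tag values, so it is a sketch of the format rather than a robust exporter. `S` is the conventional NHX tag for species name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TreeNode:
    name: str = ""
    length: Optional[float] = None
    tags: Dict[str, str] = field(default_factory=dict)       # NHX key/value pairs
    children: List["TreeNode"] = field(default_factory=list)


def _format(node: TreeNode) -> str:
    text = ""
    if node.children:
        text += "(" + ",".join(_format(child) for child in node.children) + ")"
    text += node.name
    if node.length is not None:
        text += f":{node.length:g}"
    if node.tags:
        pairs = ":".join(f"{key}={value}" for key, value in node.tags.items())
        text += f"[&&NHX:{pairs}]"
    return text


def to_nhx(root: TreeNode) -> str:
    """Serialize a tree to a single NHX-annotated Newick string."""
    return _format(root) + ";"
```

Parsers that do not understand NHX generally treat the bracketed text as a comment, so the annotated file still loads as plain Newick in many tools.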
Module 9: Scalability, Reproducibility, and Workflow Automation
- Containerize analysis pipelines using Docker or Singularity for environment consistency.
- Orchestrate multi-step workflows using Snakemake or Nextflow with error handling and logging.
- Optimize job scheduling on HPC systems using SLURM or PBS for memory and CPU-intensive steps.
- Implement checksums and data integrity checks for large alignment and tree files.
- Design modular pipeline components to enable reuse across projects with different taxon sets.
- Integrate continuous integration testing for pipeline updates using synthetic or benchmark datasets.
- Archive intermediate results and logs to support audit, debugging, and reanalysis.
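The checksum bullet above can be sketched with the standard library alone. The example below hashes files in chunks, so large alignments do not need to fit in memory, and reports which files no longer match a stored manifest. The manifest layout (a dictionary from path to SHA-256 hex digest) and the function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths: Iterable[str]) -> Dict[str, str]:
    """Map each path to its current digest."""
    return {str(p): sha256sum(str(p)) for p in paths}


def changed_files(manifest: Dict[str, str]) -> List[str]:
    """Return paths that are missing or whose digest differs from the manifest."""
    return [
        path
        for path, digest in manifest.items()
        if not Path(path).exists() or sha256sum(path) != digest
    ]
```

Storing the manifest alongside archived intermediate results lets a later reanalysis confirm that its inputs are byte-identical to those of the original run.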