Description

This curriculum spans the full workflow of a multi-investigator phylogenomics project, comparable in scope to an internal bioinformatics capability program that supports study design, data curation, model-based inference, and reproducible workflow automation across distributed research teams.

Module 1: Study Design and Data Acquisition in Phylogenomics

Select appropriate sequencing strategies (e.g., whole-genome, targeted capture, transcriptome) based on taxon sampling and evolutionary divergence.
Determine inclusion criteria for operational taxonomic units (OTUs) to balance phylogenetic breadth with data quality.
Evaluate trade-offs between sequencing depth and number of taxa when constrained by budget and computational resources.
Establish metadata standards for sample provenance, sequencing platform, and library preparation to ensure reproducibility.
Assess contamination risks in environmental or ancient DNA samples prior to alignment and orthology detection.
Implement data versioning and access protocols for multi-investigator collaborations involving distributed datasets.
Navigate ethical and legal considerations for using sequence data from protected species or indigenous biota.
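
The depth-versus-taxa trade-off in this module comes down to simple arithmetic. The sketch below assumes a flat cost per sequenced gigabase and one library per taxon. The function name, its parameters, and the figures in the usage note are hypothetical placeholders, not real prices.

```python
def max_taxa(budget, genome_size_gb, target_depth, cost_per_gb, library_cost=0.0):
    """Return how many taxa fit in a budget at a target mean depth.

    Assumption (illustrative only): cost scales linearly with sequenced
    gigabases, and each taxon needs exactly one library.
    """
    if budget < 0 or library_cost < 0:
        raise ValueError("budget and library_cost must be non-negative")
    if genome_size_gb <= 0 or target_depth <= 0 or cost_per_gb <= 0:
        raise ValueError("genome size, depth, and cost per Gb must be positive")
    # Gigabases needed per taxon = genome size x mean depth.
    per_taxon = genome_size_gb * target_depth * cost_per_gb + library_cost
    return int(budget // per_taxon)
```

With these placeholder numbers, `max_taxa(10000, 1.0, 20, 10.0, 50.0)` allows 40 taxa, while doubling the depth to 40x drops the count to 22. This shows how quickly deeper coverage erodes taxon sampling under a fixed budget.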

Module 2: Sequence Alignment and Orthology Inference

Choose between de novo and reference-based assembly methods for non-model organisms with limited genomic resources.
Configure alignment parameters in MAFFT or MUSCLE to balance speed and accuracy for large multi-sequence datasets.
Apply domain-aware masking (e.g., using HMMER) to remove low-complexity or non-homologous regions from alignments.
Compare orthology inference methods (e.g., OrthoFinder, OrthoMCL) based on scalability and sensitivity for gene family clustering.
Resolve paralogy through gene tree-species tree reconciliation when constructing species-level phylogenies.
Integrate synteny information to validate ortholog calls in closely related species with recent duplications.
Document alignment curation steps to maintain auditability in regulatory or publication contexts.
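
A common first step before species-tree construction is to keep only orthogroups with exactly one gene per sampled species. The sketch below assumes a hypothetical in-memory layout (group id mapped to a list of `(species, gene_id)` pairs). It is not the output format of OrthoFinder or OrthoMCL, whose files would need to be parsed into this shape first.

```python
from collections import Counter

def single_copy_orthogroups(orthogroups, species):
    """Keep orthogroups that contain exactly one gene from every species.

    orthogroups: dict mapping group id -> list of (species, gene_id) pairs
                 (hypothetical layout chosen for this example).
    species:     iterable of species names that must all be present.
    """
    required = set(species)
    kept = {}
    for group_id, members in orthogroups.items():
        counts = Counter(sp for sp, _ in members)
        # Reject groups with missing species, extra species, or duplicates.
        if set(counts) == required and all(n == 1 for n in counts.values()):
            kept[group_id] = {sp: gene for sp, gene in members}
    return kept
```

Groups rejected here because of duplicates are the ones that would need reconciliation or synteny evidence, as described in the objectives above, rather than being silently discarded.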

Module 3: Alignment Curation and Data Filtering

Apply Gblocks or BMGE to remove ambiguously aligned regions while preserving phylogenetically informative sites.
Quantify and filter alignment positions with excessive missing data per taxon to avoid topological artifacts.
Assess compositional heterogeneity across taxa using chi-square tests or posterior predictive checks in PhyloBayes.
Decide on inclusion/exclusion of fast-evolving sites based on site-rate heterogeneity models.
Implement partition schemes based on gene, codon position, or functional domain prior to model selection.
Use RogueNaRok to identify unstable taxa that degrade tree resolution and support values.
Balance data retention with signal-to-noise ratio when filtering low-information partitions.
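
Missing-data filtering can be illustrated without any external tool. The sketch below drops alignment columns whose fraction of missing characters exceeds a threshold. The function name, the default threshold of 0.5, and the choice of `-` and `?` as missing symbols are assumptions for illustration. It does not reproduce the block-selection logic of Gblocks or BMGE.

```python
def filter_columns(alignment, max_missing=0.5, missing="-?"):
    """Drop columns where the fraction of missing characters is too high.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    Returns a new dict with the same taxa and only the retained columns.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("alignment must contain sequences of equal length")
    seqs = list(alignment.values())
    n_taxa = len(seqs)
    keep = [
        i for i in range(lengths.pop())
        if sum(seq[i] in missing for seq in seqs) / n_taxa <= max_missing
    ]
    return {taxon: "".join(seq[i] for i in keep)
            for taxon, seq in alignment.items()}
```

For nucleotide data, passing `missing="-?Nn"` would also treat ambiguous bases as missing. That choice is left to the caller because `N` is a valid residue in amino acid alignments.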

Module 4: Substitution Model Selection and Partitioning

Run ModelFinder (nucleotide or amino acid data), jModelTest2 (nucleotide data), or ProtTest (amino acid data) to identify best-fit substitution models per partition.
Compare BIC, AIC, and AICc for model selection under different dataset sizes and parameter counts.
Decide between linked and unlinked branch length models across partitions based on empirical fit.
Test for among-site rate heterogeneity by comparing models with gamma-distributed rates (+G) or a proportion of invariant sites (+I).
Justify use of codon models versus amino acid models for detecting selection in protein-coding sequences.
Validate model adequacy using posterior predictive simulations in Bayesian frameworks.
Document model decisions for peer review and reproducibility in collaborative phylogenies.
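
The three criteria compared in this module differ only in how they penalize parameters: AIC = 2k - 2 ln L, AICc = AIC + 2k(k + 1)/(n - k - 1), and BIC = k ln n - 2 ln L. The sketch below computes all three from a log-likelihood. It assumes that the sample size n is taken to be the number of alignment sites, which is a common convention but still a modelling choice. The function name is illustrative.

```python
import math

def information_criteria(log_likelihood, n_params, n_sites):
    """Return AIC, AICc, and BIC for one fitted model.

    n_sites is used as the sample size n (an assumption; see lead-in).
    """
    k, n = n_params, n_sites
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n_sites > n_params + 1")
    aic = 2 * k - 2 * log_likelihood
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)   # small-sample correction
    bic = k * math.log(n) - 2 * log_likelihood     # penalty grows with ln(n)
    return {"AIC": aic, "AICc": aicc, "BIC": bic}
```

Lower values indicate better fit under every criterion. Because the BIC penalty per parameter is ln n rather than 2, BIC tends to prefer simpler models than AIC once an alignment has more than about 7 sites, which is always the case in practice.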

Module 5: Phylogenetic Inference Using Maximum Likelihood and Bayesian Methods

Configure RAxML-NG or IQ-TREE for parallel execution on high-performance computing clusters.
Set the number of bootstrap replicates and the thoroughness of the tree search so that support values converge.
Monitor MCMC chain convergence in MrBayes or PhyloBayes using ESS values and trace plots.
Adjust heating parameters and chain length in Bayesian analyses to keep chains from becoming trapped in local optima.
Compare topology outputs from ML and Bayesian methods to assess robustness under different assumptions.
Manage memory and runtime constraints when analyzing large supermatrices (>10,000 sites, >100 taxa).
Implement checkpointing and job resubmission workflows for long-running tree searches.
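
To make the ESS diagnostic less of a black box, the sketch below estimates effective sample size from one parameter trace as N / (1 + 2 Σ ρ_k), summing autocorrelations until the first non-positive lag. This is a deliberately simple truncation rule chosen for illustration. It is not claimed to match the estimator used by MrBayes, PhyloBayes, or Tracer, and the function name is hypothetical.

```python
def effective_sample_size(samples):
    """Estimate ESS of an MCMC trace with a truncated autocorrelation sum."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    variance = sum(c * c for c in centered) / n
    if variance == 0:
        raise ValueError("trace has zero variance; ESS is undefined")
    rho_sum = 0.0
    for lag in range(1, n):
        autocov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = autocov / variance
        if rho <= 0:          # stop at the first non-positive autocorrelation
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)
```

A trace that moves in long, sticky runs yields an ESS far below its length, which is the signal to lengthen the chain or adjust proposals and heating.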

Module 6: Species Tree Estimation and Gene Tree Discordance

Apply ASTRAL or ASTRID to infer species trees from gene trees while accounting for incomplete lineage sorting.
Quantify gene tree discordance using internode certainty or quartet similarity measures.
Diagnose sources of discordance (e.g., ILS, hybridization, HGT) using PhyParts or DiscoVista.
Integrate coalescent-based methods when population-level sampling is available within species.
Assess impact of gene tree estimation error on species tree accuracy using simulation benchmarks.
Use D-statistics (ABBA-BABA) to test for introgression between non-sister lineages.
Balance computational cost with model realism when choosing between concatenation and coalescent frameworks.
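
The ABBA-BABA test counts two discordant site patterns on the tree (((P1, P2), P3), O) and computes D = (ABBA - BABA) / (ABBA + BABA). The sketch below works on one aligned sequence per lineage and treats the outgroup state as ancestral. Real analyses typically use allele frequencies and a block jackknife for significance, neither of which is shown here. The function name is illustrative.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Return (D, n_abba, n_baba) for four aligned nucleotide sequences."""
    if len({len(p1), len(p2), len(p3), len(outgroup)}) != 1:
        raise ValueError("sequences must have equal length")
    valid = set("ACGT")
    abba = baba = 0
    for a, b, c, o in zip(p1.upper(), p2.upper(), p3.upper(), outgroup.upper()):
        states = {a, b, c, o}
        # Use only biallelic sites with unambiguous bases.
        if not states <= valid or len(states) != 2:
            continue
        if a == o and b == c:        # P1 ancestral, P2 and P3 derived
            abba += 1
        elif b == o and a == c:      # P2 ancestral, P1 and P3 derived
            baba += 1
    if abba + baba == 0:
        raise ValueError("no ABBA or BABA sites found")
    return (abba - baba) / (abba + baba), abba, baba
```

Under incomplete lineage sorting alone the two patterns are expected in equal numbers, so D near zero is the null expectation. A positive D points to excess allele sharing between P2 and P3, and a negative D to excess sharing between P1 and P3.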

Module 7: Phylogenetic Comparative Methods and Trait Evolution

Reconstruct ancestral states for discrete traits using stochastic mapping in phytools or corHMM.
Fit evolutionary models (Brownian, OU, early burst) to continuous traits using maximum likelihood.
Test for phylogenetic signal using Blomberg’s K or Pagel’s λ across different clades.
Control for phylogenetic non-independence in regression models using PGLS.
Identify shifts in evolutionary rates using BAMM or l1ou, with careful prior specification.
Validate model assumptions (e.g., normality of residuals, tree calibration) before inference.
Interpret clade-specific results in light of fossil calibration and biogeographic context.
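
PGLS and Brownian-motion trait models both rest on a tip-by-tip covariance matrix in which each entry is the branch length two tips share between the root and their most recent common ancestor. The sketch below builds that matrix from a small tree. It assumes a hypothetical nested-tuple encoding, `(name, length)` for tips and `(list_of_children, length)` for internal nodes, and is not the data structure used by phytools or any other package.

```python
def bm_covariance(tree):
    """Return (tip_names, matrix) of shared root-to-ancestor path lengths."""
    cov = {}

    def visit(node, parent_depth):
        content, length = node
        depth = parent_depth + length      # distance from root to this node
        if isinstance(content, str):       # tip: variance is its root distance
            cov[(content, content)] = depth
            return [content]
        groups = [visit(child, depth) for child in content]
        # Tips in different child subtrees share the path down to this node.
        for i, left in enumerate(groups):
            for right in groups[i + 1:]:
                for a in left:
                    for b in right:
                        cov[(a, b)] = cov[(b, a)] = depth
        return [tip for group in groups for tip in group]

    tips = visit(tree, 0.0)
    return tips, [[cov[(a, b)] for b in tips] for a in tips]
```

For the tree `((A:1,B:1):1,C:2)`, tips A and B share one unit of history, while C shares none with either. That is exactly the non-independence that PGLS corrects for.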

Module 8: Visualization, Annotation, and Data Sharing

Generate publication-quality tree figures using ggtree or iTOL with consistent color and label schemes.
Annotate trees with metadata (e.g., geography, phenotype, divergence times) for exploratory analysis.
Export trees in NHX or NeXML formats to preserve annotations and support interoperability.
Submit final phylogenies and alignments to curated repositories (e.g., TreeBASE) and the underlying sequences to GenBank, following the MIAPA reporting standard.
Use FigTree or Dendroscope for interactive exploration of large or complex topologies.
Version-control tree files and analysis scripts using Git or similar systems for audit trails.
Design scalable visualization strategies for consensus networks or phylogenetic placement results.
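
NHX extends Newick by attaching `[&&NHX:key=value]` comments to nodes, which is how annotations survive export. The sketch below serializes a tree held as nested dictionaries. The dictionary keys (`name`, `length`, `children`, `annotations`) and the `region` tag in the usage note are hypothetical choices for this example. The writer does no escaping, so names and values containing characters such as `:`, `,`, or `]` would need handling in real use.

```python
def to_nhx(node):
    """Serialize one node (and its subtree) without the closing semicolon."""
    text = ""
    children = node.get("children", [])
    if children:
        text += "(" + ",".join(to_nhx(child) for child in children) + ")"
    text += node.get("name", "")
    if "length" in node:
        text += f":{node['length']}"
    notes = node.get("annotations", {})
    if notes:
        # NHX tags are colon-separated key=value pairs inside the comment.
        text += "[&&NHX:" + ":".join(f"{k}={v}" for k, v in notes.items()) + "]"
    return text

def write_nhx(root):
    """Return the full NHX string for a tree rooted at `root`."""
    return to_nhx(root) + ";"
```

A two-tip tree with one annotated tip serializes to `(A:0.1[&&NHX:region=Andes],B:0.2);`. Parsers that understand only plain Newick may or may not tolerate the bracketed comment, so it is worth confirming round-trip behavior in the target tool.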

Module 9: Scalability, Reproducibility, and Workflow Automation

Containerize analysis pipelines using Docker or Singularity for environment consistency.
Orchestrate multi-step workflows using Snakemake or Nextflow with error handling and logging.
Optimize job scheduling on HPC systems using SLURM or PBS for memory and CPU-intensive steps.
Implement checksums and data integrity checks for large alignment and tree files.
Design modular pipeline components to enable reuse across projects with different taxon sets.
Integrate continuous integration testing for pipeline updates using synthetic or benchmark datasets.
Archive intermediate results and logs to support audit, debugging, and reanalysis.
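
Integrity checks for large alignment and tree files can be handled with the Python standard library alone. The sketch below streams each file through SHA-256, writes a manifest, and later reports files whose digest has changed. The function names and the two-space manifest layout are assumptions for illustration, not a project standard.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large alignments never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(paths, manifest_path):
    """Write one '<digest>  <file name>' line per input file."""
    lines = [f"{sha256_of(p)}  {Path(p).name}" for p in paths]
    Path(manifest_path).write_text("\n".join(lines) + "\n")

def verify_manifest(manifest_path, directory):
    """Return the names of files whose current digest differs from the manifest."""
    failed = []
    for line in Path(manifest_path).read_text().splitlines():
        expected, name = line.split("  ", 1)
        if sha256_of(Path(directory) / name) != expected:
            failed.append(name)
    return failed
```

Running `verify_manifest` as the first step of a workflow rule, and failing the job when the returned list is non-empty, is one way to catch truncated transfers before they propagate into tree inference.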