This curriculum spans the full lifecycle of genome annotation. Its scope is equivalent to a multi-phase bioinformatics project involving iterative assembly, curation, and cross-team collaboration on complex eukaryotic genomes.
Module 1: Foundations of Genome Structure and Sequencing Technologies
- Select appropriate sequencing platforms (e.g., Illumina vs. PacBio) based on required read length, error profiles, and genome complexity.
- Evaluate trade-offs between short-read accuracy and long-read utility in resolving repetitive regions during genome assembly.
- Implement quality control protocols for raw sequencing data using tools like FastQC and Trimmomatic to filter low-quality bases and adapter contamination.
- Assess the impact of sequencing depth on assembly completeness and false duplication rates in diploid or polyploid genomes.
- Integrate multiple sequencing libraries (e.g., paired-end, mate-pair) to improve scaffold contiguity in de novo assemblies.
- Document metadata standards for sequencing runs to ensure reproducibility across annotation pipelines.
- Configure compute environments optimized for high-throughput sequence processing using containerization (e.g., Singularity/Docker).
- Establish data retention policies for raw reads and intermediate files in compliance with institutional and funding body requirements.
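The quality-filtering step in this module can be illustrated with a minimal sketch of sliding-window trimming, loosely modeled on the idea behind Trimmomatic's SLIDINGWINDOW step. It assumes Phred+33 encoded qualities. The function names, the window size of 4, and the mean-quality threshold of 20 are illustrative assumptions, and the logic is a simplification rather than Trimmomatic's actual implementation.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def sliding_window_trim(seq, quality, window=4, min_mean_q=20):
    """Cut the read at the start of the first window whose mean quality is too low.

    Hypothetical helper: a simplified stand-in for a production trimmer.
    """
    if len(seq) != len(quality):
        raise ValueError("sequence and quality strings must have equal length")
    scores = phred_scores(quality)
    for start in range(len(scores) - window + 1):
        mean_q = sum(scores[start:start + window]) / window
        if mean_q < min_mean_q:
            # Everything from this window onward is discarded.
            return seq[:start], quality[:start]
    return seq, quality  # no window fell below the threshold


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(sliding_window_trim("ACGTACGTAC", "IIIIII####"))
```

In practice the trimmed reads would be re-inspected with FastQC to confirm that per-base quality and adapter content have improved.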
Module 2: Genome Assembly and Quality Assessment
- Choose assemblers (e.g., SPAdes, Flye, Canu) based on input data type, genome size, and expected repeat content.
- Optimize assembler parameters such as k-mer size or error correction thresholds to balance contiguity and misassembly rates.
- Use QUAST to generate comparative reports on N50, L50, and total assembly size across multiple assembly attempts.
- Validate assembly completeness using BUSCO against lineage-specific gene sets to detect missing or fragmented genes.
- Identify and resolve misassemblies through read mapping and manual inspection in visualization tools like IGV.
- Implement hybrid assembly strategies combining short-read accuracy with long-read scaffolding for complex genomes.
- Assess heterozygosity levels in diploid assemblies and decide whether to collapse haplotypes or maintain allelic variants.
- Archive assembly versions with detailed provenance tracking for downstream annotation consistency.
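The contiguity metrics reported by QUAST can be recomputed by hand as a sanity check when comparing assembly attempts. The sketch below assumes contig lengths are already available as integers; `n50_l50` is a hypothetical helper name and is not part of QUAST.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a collection of contig lengths.

    N50 is the length of the shortest contig in the smallest set of longest
    contigs that together cover at least half of the assembly; L50 is the
    number of contigs in that set.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("no contigs supplied")


if __name__ == "__main__":
    # Toy assembly: total 300 bp, so half is 150 bp.
    print(n50_l50([100, 80, 60, 40, 20]))
```

A higher N50 alone does not indicate a better assembly, since misjoins also inflate it; it is best read alongside BUSCO completeness and read-mapping checks.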
Module 3: Repeat Identification and Masking Strategies
- Run RepeatModeler to generate de novo repeat libraries tailored to non-model organisms lacking reference repeat databases.
- Combine homology-based (RepeatMasker) and de novo (RepeatScout) methods to maximize repeat detection sensitivity.
- Adjust stringency thresholds in RepeatMasker to balance false positives (over-masking) and false negatives (under-masking).
- Classify identified repeats into known categories (e.g., LINEs, SINEs, LTRs) for functional interpretation.
- Preserve repeats in soft-masked format (lowercase) rather than hard-masked format (replaced with N) so that gene prediction tools can still use the underlying sequence when appropriate.
- Update species-specific repeat libraries in institutional repositories to support future annotation projects.
- Handle nested transposable elements by configuring hierarchical masking workflows to avoid overlapping annotations.
- Document masking coverage statistics to inform downstream gene prediction reliability in repeat-rich regions.
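Masking coverage for a soft-masked assembly can be summarized by counting lowercase bases. The following is a minimal sketch that assumes a plain FASTA input with repeats in lowercase; `masking_coverage` is a hypothetical helper, and the toy record is made up for illustration.

```python
def masking_coverage(fasta_lines):
    """Return (masked_bases, total_bases, masked_fraction) for soft-masked FASTA lines."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    fraction = masked / total if total else 0.0
    return masked, total, fraction


if __name__ == "__main__":
    toy = [">chr1", "ACGTacgtNN", "acGT"]
    print(masking_coverage(toy))
```

For a real assembly, an open file handle can be passed directly, for example `masking_coverage(open("genome.softmasked.fa"))`, where the file name is a placeholder.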
Module 4: Ab Initio and Evidence-Based Gene Prediction
- Select gene predictors (e.g., Augustus, GeneMark, SNAP) based on organism-specific training requirements and available evidence.
- Train ab initio predictors using high-confidence gene models from RNA-seq or homology evidence to improve accuracy.
- Integrate RNA-seq alignments (via HISAT2/StringTie) as extrinsic evidence to guide splice site and UTR predictions.
- Resolve conflicting gene models from multiple predictors using evidence consensus tools like EVidenceModeler.
- Adjust prediction parameters (e.g., intron length, GC content) to match genomic characteristics of the target species.
- Exclude pseudogenes and transposon-derived ORFs during prediction by cross-referencing with repeat annotations.
- Validate predicted CDS regions against known protein domains using InterProScan, flagging models whose truncated or split domains suggest an incorrect start or stop codon.
- Maintain separate gene model tracks for ab initio, evidence-supported, and consensus annotations for auditability.
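Cross-referencing gene models with repeat annotations can be sketched as an interval-overlap calculation. The example below assumes 1-based inclusive coordinates (as in GFF3) on a single sequence. The function names and the 0.5 overlap cutoff are illustrative assumptions, not a recommended standard.

```python
def repeat_overlap_fraction(gene, repeats):
    """Fraction of a gene span covered by repeat intervals.

    Coordinates are 1-based and inclusive. Overlapping repeats are counted once.
    """
    start, end = gene
    covered = 0
    cursor = start  # first position not yet counted as covered
    for r_start, r_end in sorted(repeats):
        lo = max(r_start, cursor)
        hi = min(r_end, end)
        if lo <= hi:
            covered += hi - lo + 1
            cursor = hi + 1
    return covered / (end - start + 1)


def flag_repeat_derived(genes, repeats, max_fraction=0.5):
    """Return IDs of gene models whose repeat overlap exceeds max_fraction."""
    return [
        gene_id
        for gene_id, span in genes.items()
        if repeat_overlap_fraction(span, repeats) > max_fraction
    ]


if __name__ == "__main__":
    repeats = [(90, 119), (110, 129), (180, 250), (300, 380)]
    genes = {"g1": (100, 199), "g2": (300, 399)}
    print(flag_repeat_derived(genes, repeats))
```

Flagged models are candidates for review rather than automatic removal, since genuine genes can contain or overlap repeats.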
Module 5: Functional Annotation and Ontology Mapping
- Perform BLAST searches (BLASTP, BLASTX) against curated databases (e.g., UniProt, RefSeq) to assign putative functions.
- Use DIAMOND for accelerated homology searches on large datasets, choosing a sensitivity mode (e.g., --sensitive, --very-sensitive) suited to the expected sequence divergence.
- Assign Gene Ontology (GO) terms based on sequence similarity and domain architecture using tools like InterPro2GO.
- Resolve ambiguous functional assignments by evaluating domain composition and phylogenetic context.
- Map enzymes to metabolic pathways using KEGG or MetaCyc, flagging gaps for experimental validation.
- Integrate multiple annotation sources into a unified database (e.g., a Chado schema, optionally populated from Apollo) for consistent querying.
- Flag hypothetical proteins lacking functional domains for prioritization in downstream experimental studies.
- Update local annotation databases on a defined schedule to incorporate new functional discoveries.
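Assigning putative functions from homology searches usually starts with selecting one best hit per query. The sketch below assumes the default 12-column BLAST tabular layout (outfmt 6), which DIAMOND also produces by default. The helper names, the E-value cutoff of 1e-5, and the accessions in the toy input are illustrative assumptions.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Keep the highest-bitscore hit per query from 12-column tabular output.

    Columns assumed: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    best = {}
    for line in tabular_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # too weak to support a functional assignment
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def unannotated(all_queries, hits):
    """Queries with no accepted hit, i.e. candidates for a 'hypothetical protein' label."""
    return sorted(q for q in all_queries if q not in hits)


if __name__ == "__main__":
    rows = [
        "g1\tP12345\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t350",
        "g1\tQ99999\t60.0\t180\t72\t2\t5\t184\t3\t182\t1e-20\t150",
        "g2\tP00001\t30.0\t50\t35\t1\t1\t50\t1\t50\t0.5\t25",
    ]
    hits = best_hits(rows)
    print(hits)
    print(unannotated(["g1", "g2", "g3"], hits))
```

A best hit is only a starting point: domain composition and orthology should be checked before a function is transferred, as the later bullets in this module and Module 6 describe.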
Module 6: Comparative Genomics and Orthology Analysis
- Construct ortholog groups across related species using OrthoFinder or InParanoid to infer evolutionary relationships.
- Distinguish orthologs from paralogs when transferring functional annotations to avoid erroneous assignments.
- Identify lineage-specific gene expansions or losses by comparing gene family sizes across phylogenies.
- Use synteny analysis (e.g., MCScanX) to validate gene models and detect structural rearrangements.
- Align whole genomes using Mauve or LASTZ to detect conserved non-coding elements and regulatory regions.
- Quantify selective pressure (dN/dS ratios) on orthologous gene pairs using PAML or HyPhy.
- Integrate pan-genome analyses for bacterial or fungal species to capture core and accessory gene content.
- Visualize comparative results using Circos or genoPlotR for internal review and collaboration.
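Comparing gene family sizes across species can be sketched from an orthogroup table. The example assumes a tab-separated layout with an `Orthogroup` column followed by one column per species, each cell holding comma-separated gene IDs (the layout OrthoFinder's Orthogroups.tsv is understood to use; verify against your own output). The helper names and the twofold ratio are illustrative assumptions.

```python
import csv
import io


def gene_counts(tsv_handle):
    """Return {orthogroup: {species: gene_count}} from an orthogroup table."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    counts = {}
    for row in reader:
        group = row["Orthogroup"]
        counts[group] = {
            species: len([g for g in (cell or "").split(",") if g.strip()])
            for species, cell in row.items()
            if species != "Orthogroup"
        }
    return counts


def expanded_in(counts, species, min_ratio=2.0):
    """Orthogroups where `species` has at least min_ratio times the mean count of the others."""
    flagged = []
    for group, per_species in counts.items():
        others = [n for name, n in per_species.items() if name != species]
        if not others:
            continue
        mean_other = sum(others) / len(others)
        if mean_other > 0 and per_species[species] >= min_ratio * mean_other:
            flagged.append(group)
    return flagged


if __name__ == "__main__":
    table = (
        "Orthogroup\tsp_a\tsp_b\n"
        "OG0000000\ta1, a2, a3, a4\tb1\n"
        "OG0000001\ta5\tb2, b3\n"
    )
    counts = gene_counts(io.StringIO(table))
    print(expanded_in(counts, "sp_a"))
```

A raw count ratio is only a screening heuristic. It does not account for phylogeny or annotation quality, so flagged families should be confirmed with a tree-aware analysis.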
Module 7: Non-Coding RNA and Regulatory Element Annotation
- Detect tRNAs using tRNAscan-SE with organism-specific parameters for bacteria, archaea, or eukaryotes.
- Identify rRNAs through BLASTN searches against the SILVA database with stringent coverage and identity thresholds.
- Use Infernal with Rfam covariance models to annotate snRNAs, snoRNAs, and other structured ncRNAs.
- Incorporate small RNA-seq data to validate and refine miRNA and siRNA predictions.
- Predict promoter regions using sequence motifs (e.g., TATA box, Inr) and epigenetic marks when available.
- Annotate CpG islands and methylation-prone regions using tools like EMBOSS cpgplot for epigenetic context.
- Integrate ATAC-seq or DNase-seq data to identify open chromatin regions in higher eukaryotes.
- Flag conserved non-coding elements from comparative genomics as candidate regulatory sequences for functional testing.
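The CpG island criteria applied by tools of this kind can be illustrated with a small calculation. The sketch below uses widely cited thresholds (length of at least 200 bp, GC content of at least 50%, observed/expected CpG ratio of at least 0.6) as default assumptions. It evaluates a single candidate sequence and does not perform the sliding-window scan that a real tool would.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (CpG count * length) / (C count * G count)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the three threshold checks to one candidate region."""
    return (
        len(seq) >= min_len
        and gc_fraction(seq) >= min_gc
        and cpg_obs_exp(seq) >= min_obs_exp
    )


if __name__ == "__main__":
    print(cpg_obs_exp("CGCGCGCG"))
    print(looks_like_cpg_island("CG" * 100))
    print(looks_like_cpg_island("AT" * 100))
```

These thresholds were derived for vertebrate genomes, so they may need adjusting for other lineages.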
Module 8: Curation, Visualization, and Collaborative Annotation
- Deploy Apollo instances for manual curation, enabling teams to refine gene models with community input.
- Establish curation guidelines defining evidence standards for modifying automated predictions.
- Resolve conflicting evidence (e.g., RNA-seq vs. homology) through consensus workflows with documented rationale.
- Track annotation changes using version-controlled GFF3 files and curation logs for audit trails.
- Coordinate distributed annotation efforts using role-based access controls in shared databases.
- Generate genome browser tracks (e.g., JBrowse, UCSC) for visual validation of gene structures and regulatory features.
- Export annotations in standard formats (GFF3, GenBank or EMBL flat file) for submission to public repositories like GenBank or ENA.
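- Implement pre-submission validation checks, combining quality metrics from tools like GenomeQC with the repositories' own validators (e.g., NCBI table2asn, ENA Webin-CLI) to meet INSDC requirements.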
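Before running a repository validator, basic structural checks on GFF3 files catch many problems early. The sketch below tests only a few rules from the GFF3 specification (nine tab-separated columns, start not greater than end, a valid strand symbol, and a phase of 0, 1, or 2 on CDS features). The function names and toy records are hypothetical, and the check is not a substitute for a full validator.

```python
VALID_STRANDS = {"+", "-", ".", "?"}
VALID_PHASES = {"0", "1", "2"}


def check_gff3_line(line, line_no):
    """Return a list of problem descriptions for one GFF3 feature line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"line {line_no}: expected 9 tab-separated columns, found {len(fields)}"]
    _seqid, _source, ftype, start, end, _score, strand, phase, _attrs = fields
    problems = []
    if not (start.isdigit() and end.isdigit()):
        problems.append(f"line {line_no}: start and end must be positive integers")
    elif int(start) < 1 or int(start) > int(end):
        problems.append(f"line {line_no}: start must be >= 1 and <= end")
    if strand not in VALID_STRANDS:
        problems.append(f"line {line_no}: invalid strand '{strand}'")
    if ftype == "CDS" and phase not in VALID_PHASES:
        problems.append(f"line {line_no}: CDS feature requires phase 0, 1, or 2")
    return problems


def check_gff3(lines):
    """Check every feature line, skipping blank lines and '#' directives/comments."""
    problems = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        problems.extend(check_gff3_line(line, line_no))
    return problems


if __name__ == "__main__":
    records = [
        "##gff-version 3",
        "chr1\texample\tCDS\t100\t200\t.\t+\t0\tID=cds1;Parent=mRNA1",
        "chr1\texample\tgene\t300\t200\t.\t+\t.\tID=gene2",
    ]
    for problem in check_gff3(records):
        print(problem)
```

This sketch does not handle an embedded `##FASTA` section, so sequence lines after that directive would need to be excluded first.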
Module 9: Data Management, Reproducibility, and Scalability
- Design hierarchical directory structures to organize raw data, intermediate files, and final annotations.
- Use workflow managers (Snakemake, Nextflow) to encode annotation pipelines for portability and versioning.
- Integrate checksums (e.g., MD5, SHA256) into data transfer and storage protocols to ensure file integrity.
- Estimate computational resource needs (CPU, RAM, storage) for each pipeline stage to avoid bottlenecks.
- Configure job schedulers (SLURM, PBS) to manage high-throughput annotation tasks on HPC clusters.
- Implement backup and disaster recovery plans for annotation databases and primary data assets.
- Document pipeline parameters and software versions using RO-Crate or similar packaging standards.
- Adapt annotation workflows for cloud environments (AWS, GCP) when local infrastructure is insufficient.
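Checksum verification during data transfer can be sketched with Python's standard `hashlib` module. The example reads files in chunks so that large FASTQ or BAM files do not need to fit in memory. The helper names and the 1 MiB chunk size are illustrative assumptions.

```python
import hashlib


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Compare a file's digest with a recorded value (case-insensitive)."""
    return sha256sum(path) == expected_hex.lower()


if __name__ == "__main__":
    import sys

    # Usage: python checksum.py <file> [expected_sha256]
    if len(sys.argv) >= 2:
        print(sha256sum(sys.argv[1]))
        if len(sys.argv) >= 3:
            print("match" if verify(sys.argv[1], sys.argv[2]) else "MISMATCH")
```

Recording the digest alongside each file at creation time, and re-verifying after every transfer or restore, gives the pipeline a simple integrity audit trail.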