Description

This curriculum covers the full lifecycle of genome annotation, from sequencing and assembly through curation and submission. Its scope is comparable to a multi-phase bioinformatics project involving iterative assembly, curation, and cross-team collaboration on complex eukaryotic genomes.

Module 1: Foundations of Genome Structure and Sequencing Technologies

Select appropriate sequencing platforms (e.g., Illumina vs. PacBio) based on required read length, error profiles, and genome complexity.
Evaluate trade-offs between short-read accuracy and long-read utility in resolving repetitive regions during genome assembly.
Implement quality control protocols for raw sequencing data using tools like FastQC and Trimmomatic to filter low-quality bases and adapter contamination.
Assess the impact of sequencing depth on assembly completeness and false duplication rates in diploid or polyploid genomes.
Integrate multiple sequencing libraries (e.g., paired-end, mate-pair) to improve scaffold contiguity in de novo assemblies.
Document metadata standards for sequencing runs to ensure reproducibility across annotation pipelines.
Configure compute environments optimized for high-throughput sequence processing using containerization (e.g., Singularity/Docker).
Establish data retention policies for raw reads and intermediate files in compliance with institutional and funding body requirements.
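
As a rough illustration of the depth assessment in this module, expected average coverage can be estimated as total sequenced bases divided by genome size. The function name and the run figures below are hypothetical and only sketch the arithmetic; real depth varies along the genome.

```python
def estimate_depth(read_count: int, mean_read_length: float, genome_size: int) -> float:
    """Expected average depth = (number of reads * mean read length) / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return read_count * mean_read_length / genome_size


# Hypothetical run: 200 million reads of 150 bp on a 1 Gb genome.
depth = estimate_depth(200_000_000, 150, 1_000_000_000)
print(f"{depth:.1f}x")  # prints "30.0x"
```

For paired-end data, count both mates as reads. The estimate is an upper bound because trimming and unmapped reads reduce usable depth.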

Module 2: Genome Assembly and Quality Assessment

Choose assemblers (e.g., SPAdes, Flye, Canu) based on input data type, genome size, and expected repeat content.
Optimize assembler parameters such as k-mer size or error correction thresholds to balance contiguity and misassembly rates.
Use QUAST to generate comparative reports on N50, L50, and total assembly size across multiple assembly attempts.
Validate assembly completeness using BUSCO against lineage-specific gene sets to detect missing or fragmented genes.
Identify and resolve misassemblies through read mapping and manual inspection in visualization tools like IGV.
Implement hybrid assembly strategies combining short-read accuracy with long-read scaffolding for complex genomes.
Assess heterozygosity levels in diploid assemblies and decide whether to collapse haplotypes or maintain allelic variants.
Archive assembly versions with detailed provenance tracking for downstream annotation consistency.
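
To make the contiguity metrics reported by QUAST concrete, the following minimal sketch computes N50 and L50 from a list of contig lengths. The function name and the example lengths are hypothetical; QUAST itself should be used for real comparisons.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total of
    lengths, sorted longest first, reaches half the assembly size.
    L50 is the number of contigs needed to reach that point.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs supplied")
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count


# Hypothetical assembly with five contigs (total 300 bp).
print(n50_l50([100, 80, 60, 40, 20]))  # prints (80, 2)
```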

Module 3: Repeat Identification and Masking Strategies

Run RepeatModeler to generate de novo repeat libraries tailored to non-model organisms lacking reference repeat databases.
Combine homology-based (RepeatMasker) and de novo (RepeatScout) methods to maximize repeat detection sensitivity.
Adjust stringency thresholds in RepeatMasker to balance false positives (over-masking) and false negatives (under-masking).
Classify identified repeats into known categories (e.g., LINEs, SINEs, LTRs) for functional interpretation.
Keep repeats soft-masked (lowercase) rather than hard-masked (replaced with N) so that gene prediction tools can still use the underlying sequence when appropriate.
Update species-specific repeat libraries in institutional repositories to support future annotation projects.
Handle nested transposable elements by configuring hierarchical masking workflows to avoid overlapping annotations.
Document masking coverage statistics to inform downstream gene prediction reliability in repeat-rich regions.
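
One simple masking coverage statistic is the fraction of bases that are lowercase in a soft-masked FASTA file. The sketch below assumes soft-masking by lowercase letters; the function name and the example records are hypothetical.

```python
def masked_fraction(fasta_lines):
    """Fraction of sequence characters that are lowercase (soft-masked).

    Header lines (starting with '>') and blank lines are skipped.
    Uppercase N characters (hard-masked or gap bases) count toward the
    total but not toward the masked count.
    """
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Hypothetical two-line record: 8 of 20 bases are lowercase.
example = [">scaffold_1", "ACGTacgtACGT", "nnnnACGT"]
print(masked_fraction(example))  # prints 0.4
```

On real data, pass an open file handle instead of a list; the function iterates line by line and does not load the genome into memory.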

Module 4: Ab Initio and Evidence-Based Gene Prediction

Select gene predictors (e.g., Augustus, GeneMark, SNAP) based on organism-specific training requirements and available evidence.
Train ab initio predictors using high-confidence gene models from RNA-seq or homology evidence to improve accuracy.
Integrate RNA-seq alignments (e.g., from HISAT2) and assembled transcripts (e.g., from StringTie) as extrinsic evidence to guide splice site and UTR predictions.
Resolve conflicting gene models from multiple predictors using evidence consensus tools like EVidenceModeler.
Adjust prediction parameters (e.g., intron length, GC content) to match genomic characteristics of the target species.
Exclude pseudogenes and transposon-derived ORFs during prediction by cross-referencing with repeat annotations.
Validate predicted CDS regions against known protein domains using InterProScan to detect likely false starts/stops.
Maintain separate gene model tracks for ab initio, evidence-supported, and consensus annotations for auditability.
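
Cross-referencing gene models with repeat annotations can be sketched as an interval overlap calculation. The example below is a minimal illustration, assuming 1-based inclusive coordinates on a single sequence and repeat intervals that do not overlap each other. The function name, coordinates, and any cutoff applied to the result are hypothetical.

```python
def overlap_fraction(feature, repeats):
    """Fraction of a (start, end) feature covered by repeat intervals.

    Coordinates are 1-based and inclusive. Repeat intervals are assumed
    to lie on the same sequence and not to overlap one another;
    overlapping repeats would be double counted.
    """
    start, end = feature
    covered = 0
    for r_start, r_end in repeats:
        lo = max(start, r_start)
        hi = min(end, r_end)
        if lo <= hi:
            covered += hi - lo + 1
    return covered / (end - start + 1)


# Hypothetical CDS at 101-200 with two annotated repeats nearby.
cds = (101, 200)
repeats = [(50, 150), (181, 300)]
print(overlap_fraction(cds, repeats))  # prints 0.7
```

A model whose coding sequence is mostly covered by repeats is a candidate transposon-derived ORF, but the cutoff for exclusion is a curation decision rather than a fixed rule.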

Module 5: Functional Annotation and Ontology Mapping

Perform BLAST searches (BLASTP, BLASTX) against curated databases (e.g., UniProt, RefSeq) to assign putative functions.
Use DIAMOND for accelerated homology searches on large datasets while maintaining alignment sensitivity.
Assign Gene Ontology (GO) terms based on sequence similarity and domain architecture using tools like InterPro2GO.
Resolve ambiguous functional assignments by evaluating domain composition and phylogenetic context.
Map enzymes to metabolic pathways using KEGG or MetaCyc, flagging gaps for experimental validation.
Integrate multiple annotation sources into a unified database using Chado or Apollo for consistent querying.
Flag hypothetical proteins lacking functional domains for prioritization in downstream experimental studies.
Update local annotation databases on a defined schedule to incorporate new functional discoveries.
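
Homology search results often need to be reduced to one candidate per query before functions are assigned. The sketch below keeps the highest-bitscore hit per query, assuming the default 12-column tabular output (format 6) that both BLAST+ and DIAMOND can produce, with the bitscore in the last column. The function name and the hit records are hypothetical.

```python
def best_hits(tabular_text):
    """Map each query ID to (subject ID, bitscore) of its top-scoring hit.

    Assumes tab-separated format 6 columns: qseqid, sseqid, pident,
    length, mismatch, gapopen, qstart, qend, sstart, send, evalue,
    bitscore. Lines with fewer than 12 columns are skipped.
    """
    best = {}
    for line in tabular_text.splitlines():
        cols = line.split("\t")
        if len(cols) < 12:
            continue
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Hypothetical hits for two predicted proteins.
hits = (
    "geneA\tsp|P1\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t390\n"
    "geneA\tsp|P2\t75.0\t180\t45\t2\t5\t184\t3\t182\t1e-40\t150\n"
    "geneB\tsp|P3\t60.0\t120\t48\t1\t1\t120\t10\t129\t1e-20\t95.5\n"
)
print(best_hits(hits))
```

A top hit is only a putative assignment; the domain and phylogenetic checks listed above still apply before a function is recorded.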

Module 6: Comparative Genomics and Orthology Analysis

Construct ortholog groups across related species using OrthoFinder or InParanoid to infer evolutionary relationships.
Distinguish orthologs from paralogs when transferring functional annotations to avoid erroneous assignments.
Identify lineage-specific gene expansions or losses by comparing gene family sizes across phylogenies.
Use synteny analysis (e.g., MCScanX) to validate gene models and detect structural rearrangements.
Align whole genomes using Mauve or LASTZ to detect conserved non-coding elements and regulatory regions.
Quantify selective pressure (dN/dS ratios) on orthologous gene pairs using PAML or HyPhy.
Integrate pan-genome analyses for bacterial or fungal species to capture core and accessory gene content.
Visualize comparative results using Circos or genoPlotR for internal review and collaboration.
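
A first-pass screen for lineage-specific expansions can compare gene counts per ortholog group across species. The sketch below flags groups where a focal species has at least a chosen multiple of the median count in the other species. The data layout, species labels, and the twofold default are assumptions for illustration; this is a crude filter, not a substitute for a phylogeny-aware statistical test.

```python
from statistics import median


def expanded_families(counts, focal, fold=2.0):
    """Return ortholog groups where the focal species looks expanded.

    counts maps group ID -> {species: gene count}. A group is flagged
    when the focal count is at least `fold` times the median count of
    the other species, and that median is above zero.
    """
    flagged = []
    for group, per_species in counts.items():
        others = [n for species, n in per_species.items() if species != focal]
        if not others:
            continue
        baseline = median(others)
        if baseline > 0 and per_species.get(focal, 0) >= fold * baseline:
            flagged.append(group)
    return flagged


# Hypothetical gene counts for three species in two ortholog groups.
counts = {
    "OG1": {"spA": 8, "spB": 2, "spC": 3},
    "OG2": {"spA": 1, "spB": 1, "spC": 2},
}
print(expanded_families(counts, focal="spA"))  # prints ['OG1']
```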

Module 7: Non-Coding RNA and Regulatory Element Annotation

Detect tRNAs using tRNAscan-SE with the search mode appropriate to the organism (bacterial, archaeal, or eukaryotic).
Identify rRNAs through BLASTN searches against the SILVA database with stringent coverage and identity thresholds.
Use Infernal with Rfam covariance models to annotate snRNAs, snoRNAs, and other structured ncRNAs.
Incorporate small RNA-seq data to validate and refine miRNA and siRNA predictions.
Predict promoter regions using sequence motifs (e.g., TATA box, Inr) and epigenetic marks when available.
Annotate CpG islands and methylation-prone regions using tools like EMBOSS cpgplot for epigenetic context.
Integrate ATAC-seq or DNase-seq data to identify open chromatin regions in higher eukaryotes.
Flag conserved non-coding elements from comparative genomics as candidate regulatory sequences for functional testing.
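
CpG island annotation typically rests on two window statistics: GC content and the observed/expected CpG ratio, computed as (number of CpG dinucleotides × window length) / (number of C × number of G). The sketch below computes both for a single sequence window. The function name and the example sequence are hypothetical, and the thresholds applied to these values are left to the chosen tool or criteria.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a sequence.

    Observed/expected = (CpG count * length) / (C count * G count).
    Returns 0.0 for the ratio when the window has no C or no G.
    """
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_content = (c_count + g_count) / n if n else 0.0
    if c_count and g_count:
        obs_exp = cpg_count * n / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_content, obs_exp


# Hypothetical 10 bp window: 3 C, 3 G, 3 CpG dinucleotides.
gc, ratio = cpg_stats("CGCGCGATAT")
print(f"GC={gc:.2f} obs/exp={ratio:.2f}")  # prints "GC=0.60 obs/exp=3.33"
```

Real analyses use sliding windows of a few hundred bases; a 10 bp window is shown only to keep the arithmetic easy to verify.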

Module 8: Curation, Visualization, and Collaborative Annotation

Deploy Apollo instances for manual curation, enabling teams to refine gene models with community input.
Establish curation guidelines defining evidence standards for modifying automated predictions.
Resolve conflicting evidence (e.g., RNA-seq vs. homology) through consensus workflows with documented rationale.
Track annotation changes using version-controlled GFF3 files and curation logs for audit trails.
Coordinate distributed annotation efforts using role-based access controls in shared databases.
Generate genome browser tracks (e.g., JBrowse, UCSC) for visual validation of gene structures and regulatory features.
Export annotations in standard formats (GFF3, GenBank or EMBL flat file) for submission to public repositories like GenBank or ENA.
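Implement pre-submission checks, using tools like GenomeQC for assembly and annotation quality metrics and the repositories' own validators (e.g., NCBI table2asn, ENA Webin-CLI) to meet INSDC requirements.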
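
Before running repository validators, a lightweight local check can catch basic structural errors in GFF3 feature lines. The sketch below tests only a few rules from the GFF3 specification (nine tab-separated columns, integer coordinates with start ≤ end, a valid strand symbol, and a phase of 0, 1, or 2 on CDS features). The function name and example lines are hypothetical, and passing this check does not imply that a submission will be accepted.

```python
def check_gff3_line(line):
    """Return a list of problems found in one GFF3 feature line.

    An empty list means none of the checked rules were violated.
    Comment and directive lines should be filtered out by the caller.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 columns, found {len(cols)}"]
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    problems = []
    try:
        start_i, end_i = int(start), int(end)
    except ValueError:
        problems.append("start and end must be integers")
    else:
        if start_i < 1 or start_i > end_i:
            problems.append("start must be >= 1 and <= end")
    if strand not in {"+", "-", ".", "?"}:
        problems.append(f"invalid strand '{strand}'")
    if ftype == "CDS" and phase not in {"0", "1", "2"}:
        problems.append("CDS features require phase 0, 1, or 2")
    return problems


# Hypothetical feature lines: one valid gene, one malformed CDS.
good = "chr1\tAUGUSTUS\tgene\t100\t500\t.\t+\t.\tID=gene1"
bad = "chr1\tAUGUSTUS\tCDS\t500\t100\t.\t+\t.\tID=cds1"
print(check_gff3_line(good))  # prints []
print(check_gff3_line(bad))
```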

Module 9: Data Management, Reproducibility, and Scalability

Design hierarchical directory structures to organize raw data, intermediate files, and final annotations.
Use workflow managers (Snakemake, Nextflow) to encode annotation pipelines for portability and versioning.
Integrate checksums (e.g., MD5, SHA256) into data transfer and storage protocols to ensure file integrity.
Estimate computational resource needs (CPU, RAM, storage) for each pipeline stage to avoid bottlenecks.
Configure job schedulers (SLURM, PBS) to manage high-throughput annotation tasks on HPC clusters.
Implement backup and disaster recovery plans for annotation databases and primary data assets.
Document pipeline parameters and software versions using RO-Crate or similar packaging standards.
Adapt annotation workflows for cloud environments (AWS, GCP) when local infrastructure is insufficient.
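
File integrity checks can be sketched with Python's standard `hashlib` module. The example below computes a SHA-256 digest by reading the file in chunks, so large FASTQ or BAM files do not need to fit in memory. The function name and the default chunk size are illustrative choices.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA-256 digest of the file at `path`.

    The file is read in binary mode in chunks of `chunk_size` bytes
    (1 MiB by default) to keep memory use constant.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the digest alongside each file at creation time, then recomputing it after every transfer, gives a simple way to detect corruption or truncation.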