Description

This curriculum covers the full technical workflow of sequence assembly, from raw reads to evaluated assemblies and reproducible pipelines. Its scope is comparable to a multi-phase internal capability program for establishing genome and metagenome assembly pipelines in a research or clinical sequencing lab.

Module 1: Understanding Sequencing Technologies and Data Formats

Selecting between Illumina, PacBio, and Oxford Nanopore data based on required read length, error profile, and project budget
Validating FASTQ file integrity by checking header consistency, matching sequence and quality string lengths, and quality encoding (Sanger Phred+33 vs. Illumina 1.5 Phred+64)
Assessing the impact of sequencing depth on assembly completeness and chimera formation in metagenomic samples
Converting raw instrument output to FASTQ (e.g., BCL files with bcl2fastq, or FAST5 signal data with Guppy basecalling) with appropriate demultiplexing
Evaluating the trade-off between real-time sequencing (MinION) and batch processing for time-sensitive clinical applications
Using unique dual-indexed libraries to detect and filter index-hopped reads in high-throughput runs on patterned flow cells
Implementing checksum verification and metadata logging for raw data handoff in regulated environments
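
The FASTQ integrity checks listed above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a dedicated validator: the function names validate_fastq and guess_phred_offset are hypothetical, the parser assumes unwrapped four-line records, and the encoding guess is only a heuristic based on the observed range of quality characters.

```python
import io


def validate_fastq(handle):
    """Check four-line FASTQ records; return (record_count, min_ord, max_ord).

    Raises ValueError on a bad header, a bad separator line, or a
    sequence/quality length mismatch. Assumes unwrapped records.
    """
    count = 0
    lo, hi = 255, 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"record {count + 1}: header does not start with '@'")
        if not plus.startswith("+"):
            raise ValueError(f"record {count + 1}: separator does not start with '+'")
        if len(seq) != len(qual):
            raise ValueError(f"record {count + 1}: sequence and quality lengths differ")
        if qual:
            codes = [ord(c) for c in qual]
            lo = min(lo, min(codes))
            hi = max(hi, max(codes))
        count += 1
    return count, lo, hi


def guess_phred_offset(lo, hi):
    """Heuristic guess of the quality offset from observed character codes.

    Returns 33, 64, or None when the observed range is ambiguous.
    """
    if lo < 59:
        return 33  # characters this low cannot occur in +64 encodings
    if hi > 74:
        return 64  # assumption: +33 data from short-read instruments rarely exceeds 'J'
    return None


example = "@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n#I\n"
n_records, lo, hi = validate_fastq(io.StringIO(example))
offset = guess_phred_offset(lo, hi)
```

A None result from guess_phred_offset is a signal to check the run metadata rather than to pick an offset silently.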

Module 2: Quality Control and Preprocessing of Raw Reads

Configuring FastQC parameters to detect overrepresented sequences and adapter contamination in non-model organisms
Choosing between Trimmomatic, Cutadapt, and fastp based on CPU efficiency and required trimming logic (sliding window vs. adapter-specific)
Setting dynamic quality thresholds for Phred scores depending on downstream use (e.g., variant calling vs. de novo assembly)
Removing PCR duplicates in amplicon-based datasets while preserving low-frequency variants in heterogeneous populations
Implementing length-based filtering to exclude reads below k-mer size limits for planned assemblers
Correcting erroneous k-mers using tools like Rcorrector without over-smoothing rare biological variants
Validating preprocessing outputs with MultiQC to ensure consistency across large sample cohorts
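
To make the sliding-window and length-filter ideas above concrete, here is a minimal sketch. It is loosely modeled on window-based trimming and does not reproduce the exact behavior of Trimmomatic, Cutadapt, or fastp; the function names, the default window of 4, and the default threshold of Q20 are assumptions chosen for illustration.

```python
def phred_scores(qual_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(c) - offset for c in qual_string]


def sliding_window_trim(seq, quals, window=4, min_mean_q=20):
    """Cut the read at the start of the first window whose mean quality
    falls below min_mean_q; return the (possibly shortened) seq and quals."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    last_start = max(len(quals) - window + 1, 1)
    for start in range(last_start):
        win = quals[start:start + window]
        if win and sum(win) / len(win) < min_mean_q:
            return seq[:start], quals[:start]
    return seq, quals


def passes_length_filter(seq, k):
    """A read shorter than k contributes no k-mers to a de Bruijn assembler."""
    return len(seq) >= k


read = "ACGTACGTAC"
quals = [30, 30, 30, 30, 30, 30, 10, 10, 10, 10]
trimmed_seq, trimmed_quals = sliding_window_trim(read, quals)
keep = passes_length_filter(trimmed_seq, k=5)
```

Stricter thresholds suit variant calling, while de novo assembly often tolerates milder trimming because the assembler's own error handling absorbs isolated low-quality bases.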

Module 3: Read Error Correction and Normalization

Running error correction with Lighter on short reads, or with LoRDEC, which uses short reads to correct long reads in hybrid datasets
Tuning k-mer size in error correction tools to balance sensitivity and false-positive correction rates
Applying digital normalization with BBNorm to reduce memory footprint without losing low-abundance transcript signals
Assessing the impact of normalization on heterozygous site representation in diploid genomes
Integrating error-corrected reads into downstream workflows while maintaining traceability to raw inputs
Managing memory for error correction of large metagenomic datasets by switching from in-memory tools to disk-backed alternatives
Validating correction efficacy by mapping corrected reads back to preliminary contigs and analyzing mismatch rates
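
The idea behind digital normalization can be sketched as follows: a read is kept only while the median abundance of its k-mers, among the reads kept so far, is below a cutoff. This is a toy, in-memory illustration of the principle and not BBNorm's implementation; the function names and the small k and cutoff values are assumptions, and real tools count canonical k-mers with memory-efficient data structures.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of seq, in order (non-canonical, for simplicity)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=5, cutoff=3):
    """Keep a read only if the median count of its k-mers is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts[x] for x in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Ten copies of an abundant read plus one rare read.
reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
kept = digital_normalize(reads)
```

In this toy input the abundant read is capped at the cutoff while the rare read survives, which is the behavior to verify when checking that normalization has not removed low-abundance signal.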

Module 4: De Novo Assembly Algorithms and Tool Selection

Choosing between long-read assemblers built on overlap or repeat graphs (Canu, Flye) and de Bruijn graph assemblers (SPAdes, MEGAHIT) based on data type and genome complexity
Configuring Canu with appropriate correction and assembly parameters for highly repetitive plant genomes
Running SPAdes with the --careful option, which enables mismatch correction, to reduce mismatches and short indels in bacterial isolate assemblies
Adjusting k-mer sizes in MEGAHIT for optimal contiguity in low-coverage metagenomes
Handling hybrid assemblies by combining Illumina and Nanopore reads in Unicycler with an appropriate bridging mode (conservative, normal, or bold)
Managing memory and runtime trade-offs when assembling large eukaryotic genomes on HPC clusters
Monitoring assembly progress through checkpoint files to enable restart after node failure
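
For readers new to de Bruijn graph assembly, the toy sketch below shows the core idea: k-mers become edges between (k-1)-mers, and contigs are read off unambiguous paths. It is deliberately simplified and is not how SPAdes or MEGAHIT are implemented; it ignores reverse complements, sequencing errors, and coverage, and the function names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Walk forward from start while the path is unambiguous:
    exactly one successor, and that successor has exactly one predecessor."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen or indegree[nxt] != 1:
            break  # stop at cycles and at merge points
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ACGTTGCA", "GTTGCAAG"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ACG")
```

Adding a read that introduces a second successor for one node (for example "TTGCTT" in this toy input) makes the walk stop at the branch, which is the same reason repeats fragment real assemblies and why the choice of k affects contiguity.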

Module 5: Hybrid and Long-Read Assembly Strategies

Aligning long reads to short-read contigs using minimap2 for scaffolding with LINKS or OPERA-MS
Polishing hybrid assemblies with Pilon using high-accuracy short reads to correct indels in homopolymer regions
Resolving structural variants in diploid genomes using phased assembly with HiFi reads and hifiasm
Integrating Hi-C or Bionano optical mapping data for chromosome-scale scaffolding in vertebrate genomes
Assessing chimeric junctions in scaffolds by validating with mate-pair read support and optical maps
Optimizing long-read polishing with Medaka by selecting model versions matching basecaller and chemistry
Managing compute resources for Canu’s correction step, which can consume hundreds of GB per sample
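
Because homopolymer regions are where indel errors in long-read assemblies tend to concentrate, it can help to list those regions before and after polishing and inspect them specifically. The sketch below is one simple way to do that; homopolymer_runs is a hypothetical helper, not part of Pilon or Medaka, and the default minimum run length of 5 is an assumption.

```python
import re


def homopolymer_runs(seq, min_len=5):
    """Return (start, end, base) for each run of a single base that is at
    least min_len long. Coordinates are 0-based and end-exclusive."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]


contig = "ACG" + "A" * 6 + "TG" + "C" * 5 + "G"
runs = homopolymer_runs(contig)
```

Comparing the run lengths at matching positions in the draft and polished assemblies gives a quick check on whether polishing changed homopolymer lengths as expected.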

Module 6: Assembly Evaluation and Quality Metrics

Calculating N50 and L50 while interpreting their limitations in highly fragmented or repetitive genomes
Running QUAST to compare assemblies across multiple tools and identify misassemblies using reference genomes
Using BUSCO to assess gene space completeness with lineage-specific ortholog sets
Interpreting k-mer spectrum analysis with Merqury to estimate consensus quality (QV) and detect haplotype duplications
Validating contiguity improvements against expected genome size to avoid over-merging due to repeats
Generating assembly reports with Assemblytics to visualize structural variants in pairwise comparisons
Documenting evaluation metrics in standardized formats for audit trails in clinical or regulatory submissions
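
N50 and L50 are simple to compute by hand, which makes their limitations easier to see. The sketch below sorts contig lengths in descending order and reports the length (N50) and count (L50) at which the running total reaches half of the assembly size. Passing an expected genome size instead gives the NG50-style variant. The function name and the genome_size parameter are illustrative assumptions.

```python
def n50_l50(lengths, genome_size=None):
    """Return (N50, L50) for a list of contig lengths.

    If genome_size is given, half of that value is used as the target
    instead of half of the assembly size. Returns (None, None) when the
    contigs never reach the target.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    total = genome_size if genome_size is not None else sum(ordered)
    target = total / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    return None, None


contig_lengths = [80, 70, 50, 40, 30, 20]
n50, l50 = n50_l50(contig_lengths)                      # assembly-based
ng50, lg50 = n50_l50(contig_lengths, genome_size=400)   # genome-size-based
```

Note that the assembly-based value rises if short contigs are simply discarded, which is one reason to report it alongside total assembly size and a genome-size-based value.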

Module 7: Post-Assembly Processing and Gap Closure

Filling gaps in scaffolds using GapFiller or Sealer with remaining unaligned reads and k-mer databases
Resolving ambiguous regions by targeted reassembly with local SPAdes on unmapped read clusters
Trimming vector and adapter sequences from plasmid or viral genome termini
Splitting chimeric contigs identified through inconsistent coverage or GC bias profiles
Annotating telomeric repeats in eukaryotic assemblies to define true ends versus assembly artifacts
Revising gene predictions after gap closure to correct truncated open reading frames
Validating closed regions by PCR and Sanger sequencing in high-stakes applications like reference genome curation
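
Before and after gap closure it is useful to have an inventory of the gaps themselves. The sketch below locates runs of N in a scaffold and splits the scaffold back into contigs at those runs. The function names and the min_len parameter are assumptions for illustration; gap-filling tools such as GapFiller or Sealer do their own gap detection.

```python
import re


def find_gaps(scaffold, min_len=1):
    """Return (start, end) for each run of N at least min_len long.
    Coordinates are 0-based and end-exclusive."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", scaffold)
        if m.end() - m.start() >= min_len
    ]


def split_at_gaps(scaffold, min_len=1):
    """Split a scaffold into contigs at gaps of at least min_len."""
    contigs = []
    previous_end = 0
    for start, end in find_gaps(scaffold, min_len):
        if start > previous_end:
            contigs.append(scaffold[previous_end:start])
        previous_end = end
    if previous_end < len(scaffold):
        contigs.append(scaffold[previous_end:])
    return contigs


scaffold = "ACGTNNNNNGGCCnTTA"
gaps = find_gaps(scaffold)
contigs = split_at_gaps(scaffold)
```

Running find_gaps on the scaffolds before and after closure gives a direct count of gaps closed and the coordinates of closed regions that still need validation.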

Module 8: Metagenomic and Single-Cell Assembly Challenges

Binning contigs by coverage and composition using MetaBAT2 or MaxBin2 in complex microbial communities
Employing differential coverage across multiple samples to separate co-assembled strains
Assembling low-input single-cell genomes with specialized pipelines like SCAVAGE or SPAdes in single-cell mode (--sc)
Handling amplification bias in MDA-based single-cell data by down-weighting high-coverage regions
Estimating contamination and completeness of metagenome-assembled genomes (MAGs) with CheckM
Resolving strain heterogeneity by iterative reassembly of binned contigs with strain-aware tools
Managing cross-sample contamination risks in shared sequencing lanes for clinical microbiome studies
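
The intuition behind marker-based completeness and contamination estimates for MAGs can be sketched as follows. This is a deliberate simplification: CheckM works with lineage-specific, collocated marker sets, whereas this toy version treats every marker independently. The function name and the marker identifiers (m1 to m4) are hypothetical.

```python
def estimate_completeness_contamination(marker_counts, expected_markers):
    """Estimate completeness and contamination (both in percent) from
    single-copy marker gene counts in one bin.

    completeness:  share of expected markers found at least once
    contamination: extra copies beyond the first, relative to the number
                   of expected markers
    """
    n_expected = len(expected_markers)
    if n_expected == 0:
        raise ValueError("expected_markers is empty")
    found = sum(1 for m in expected_markers if marker_counts.get(m, 0) >= 1)
    extra = sum(max(marker_counts.get(m, 0) - 1, 0) for m in expected_markers)
    return 100.0 * found / n_expected, 100.0 * extra / n_expected


expected = ["m1", "m2", "m3", "m4"]
observed = {"m1": 1, "m2": 2, "m3": 1}  # m4 missing, m2 duplicated
completeness, contamination = estimate_completeness_contamination(observed, expected)
```

A duplicated marker can reflect either a contaminating contig or closely related strains binned together, so a high value is a prompt for inspection rather than proof of contamination.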

Module 9: Workflow Integration and Reproducible Pipelines

Containerizing assembly workflows using Docker or Singularity for environment consistency across clusters
Orchestrating multi-step pipelines with Snakemake or Nextflow to manage dependencies and resume on failure
Version-controlling pipeline scripts and configuration files using Git with protected branches for production use
Logging resource usage (CPU, memory, disk I/O) per sample to forecast infrastructure needs
Implementing checksum-based caching to avoid redundant processing in iterative development
Generating audit-compliant execution reports with timestamps, software versions, and parameter settings
Deploying pipelines on cloud platforms (AWS, GCP) with spot instance fallback and data egress cost controls
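
Checksum-based caching, as listed above, can be sketched with the Python standard library alone. The key combines the SHA-256 of each input file with the parameters and tool version, so a change to any of them invalidates the cached result. The function names, the JSON cache layout, and the example parameter names are assumptions; Snakemake and Nextflow provide their own resume mechanisms, which this sketch does not imitate.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large FASTQ files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_key(input_paths, params, tool_version):
    """Derive a key from input checksums (in the given order), parameters,
    and the tool version."""
    payload = {
        "inputs": [sha256_of_file(p) for p in input_paths],
        "params": params,
        "tool_version": tool_version,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def run_cached(cache_dir, key, compute):
    """Return (result, cache_hit). compute is a zero-argument callable that
    returns a JSON-serializable result; it runs only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{key}.json"
    if target.exists():
        return json.loads(target.read_text()), True
    result = compute()
    target.write_text(json.dumps(result))
    return result, False
```

A typical use is to wrap an expensive step, for example run_cached("cache", cache_key(["reads.fastq"], {"k": 31}, "1.0"), run_step), where run_step and the file name are placeholders. Storing the key alongside the output also gives the execution report a compact record of exactly which inputs and settings produced each result.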

Sequence Assembly in Bioinformatics - From Data to Discovery