This curriculum spans the full technical workflow of bioinformatics sequence assembly, comparable in scope to a multi-phase internal capability program for establishing genome and metagenome assembly pipelines in a research or clinical sequencing lab.
Module 1: Understanding Sequencing Technologies and Data Formats
- Selecting between Illumina, PacBio, and Oxford Nanopore data based on required read length, error profile, and project budget
- Validating FASTQ file integrity by checking header consistency, matching sequence and quality string lengths, and quality encoding (Sanger/Phred+33 vs. Illumina 1.5/Phred+64)
- Assessing the impact of sequencing depth on assembly completeness and chimera formation in metagenomic samples
- Converting raw base calls or signal data from native formats to FASTQ (e.g., .bcl with bcl2fastq, FAST5 with Guppy) with appropriate demultiplexing
- Evaluating the trade-off between real-time sequencing (MinION) and batch processing for time-sensitive clinical applications
- Using unique dual-indexed libraries to detect and filter index hopping in high-throughput runs on patterned flow cells
- Implementing checksum verification and metadata logging for raw data handoff in regulated environments
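
The FASTQ integrity checks listed in this module can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a dedicated validator: the function names `parse_fastq` and `guess_phred_offset` are hypothetical, it assumes four-line (unwrapped) records, and the encoding guess is only a heuristic based on the lowest quality character seen.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality); raise ValueError on malformed records.

    Assumes strict four-line records (no line-wrapped sequences).
    """
    buf: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line and not buf:
            continue  # tolerate blank lines between records
        buf.append(line)
        if len(buf) < 4:
            continue
        header, seq, plus, qual = buf
        buf = []
        if not header.startswith("@"):
            raise ValueError(f"header does not start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"separator does not start with '+': {plus!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header, seq, qual
    if buf:
        raise ValueError("truncated record at end of input")


def guess_phred_offset(quality_strings: Iterable[str]) -> int:
    """Heuristically guess the quality offset: 33 or 64.

    Characters below ASCII 59 (';') cannot occur in any +64 encoding, so their
    presence implies Phred+33. Otherwise 64 is assumed, which can be wrong for
    Phred+33 data that contains only high quality scores.
    """
    lowest = min((ord(c) for q in quality_strings for c in q), default=None)
    if lowest is None:
        raise ValueError("no quality characters to inspect")
    return 33 if lowest < 59 else 64
```

In practice the guess would be run on a sample of reads (for example the first few thousand records) rather than on the whole file.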
Module 2: Quality Control and Preprocessing of Raw Reads
- Configuring FastQC parameters to detect overrepresented sequences and adapter contamination in non-model organisms
- Choosing between Trimmomatic, Cutadapt, and fastp based on CPU efficiency and required trimming logic (sliding window vs. adapter-specific)
- Setting dynamic quality thresholds for Phred scores depending on downstream use (e.g., variant calling vs. de novo assembly)
- Removing PCR duplicates in amplicon-based datasets while preserving low-frequency variants in heterogeneous populations
- Implementing length-based filtering to exclude reads below k-mer size limits for planned assemblers
- Correcting erroneous k-mers using tools like Rcorrector without over-smoothing rare biological variants
- Validating preprocessing outputs with MultiQC to ensure consistency across large sample cohorts
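
To make the trimming and length-filtering bullets concrete, the sketch below shows a simple sliding-window quality trim followed by a minimum-length filter. It is written in the spirit of sliding-window trimmers but does not reproduce the exact behavior of Trimmomatic, Cutadapt, or fastp; the function names, the default window of 4, and the default threshold of Q20 are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Phred scores."""
    return [ord(c) - offset for c in quality]


def sliding_window_trim(
    seq: str, qual: str, window: int = 4, min_mean_q: float = 20.0, offset: int = 33
) -> Tuple[str, str]:
    """Cut the read at the start of the first window whose mean quality
    falls below min_mean_q. Reads shorter than the window are left untouched."""
    scores = phred_scores(qual, offset)
    cut = len(seq)
    for start in range(max(len(scores) - window + 1, 0)):
        if sum(scores[start:start + window]) / window < min_mean_q:
            cut = start
            break
    return seq[:cut], qual[:cut]


def preprocess_read(
    seq: str,
    qual: str,
    min_len: int,
    window: int = 4,
    min_mean_q: float = 20.0,
    offset: int = 33,
) -> Optional[Tuple[str, str]]:
    """Trim, then drop the read (return None) if it is shorter than min_len,
    e.g. shorter than the largest k-mer size planned for the assembler."""
    trimmed_seq, trimmed_qual = sliding_window_trim(seq, qual, window, min_mean_q, offset)
    if len(trimmed_seq) < min_len:
        return None
    return trimmed_seq, trimmed_qual
```

Setting `min_len` from the assembler's largest k keeps the length filter tied to the downstream tool rather than to an arbitrary constant.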
Module 3: Read Error Correction and Normalization
- Running error correction with Lighter on short reads, or with LoRDEC using hybrid short- and long-read data to improve long-read accuracy
- Tuning k-mer size in error correction tools to balance sensitivity and false-positive correction rates
- Applying digital normalization with BBNorm to reduce memory footprint without losing low-abundance transcript signals
- Assessing the impact of normalization on heterozygous site representation in diploid genomes
- Integrating error-corrected reads into downstream workflows while maintaining traceability to raw inputs
- Managing memory for correction tools on large metagenomic datasets, switching to disk-backed alternatives when in-memory tools exceed available RAM
- Validating correction efficacy by mapping corrected reads back to preliminary contigs and analyzing mismatch rates
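
The idea behind digital normalization can be illustrated with a streaming median-abundance filter: a read is kept only while the median count of its k-mers is still below a target coverage. This is a toy sketch of the general technique, not BBNorm's implementation. It uses an exact in-memory counter, ignores reverse complements, and the names `digital_normalize`, `k`, and `target` are assumptions for illustration.

```python
from collections import Counter
from statistics import median
from typing import Iterable, List


def kmers(seq: str, k: int) -> List[str]:
    """Return all overlapping k-mers of seq (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads: Iterable[str], k: int = 5, target: int = 2) -> List[str]:
    """Keep a read only if the median abundance of its k-mers, counted over
    the reads kept so far, is below the target coverage."""
    counts: Counter = Counter()
    kept: List[str] = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k carries no k-mer information
        if median(counts[km] for km in read_kmers) < target:
            kept.append(read)
            counts.update(read_kmers)
    return kept
```

Because low-abundance reads always fall below the target, they are retained, which is why the choice of target matters for rare transcripts and heterozygous sites.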
Module 4: De Novo Assembly Algorithms and Tool Selection
- Choosing between long-read overlap-based assemblers (Canu's overlap-layout-consensus, Flye's repeat graph) and de Bruijn graph assemblers (SPAdes, MEGAHIT) based on data type and genome complexity
- Configuring Canu with appropriate correction and assembly parameters for highly repetitive plant genomes
- Running SPAdes in careful mode (--careful), which enables mismatch correction, to reduce mismatches and short indels in bacterial isolate assemblies
- Adjusting k-mer sizes in MEGAHIT for optimal contiguity in low-coverage metagenomes
- Handling hybrid assemblies by combining Illumina and Nanopore reads in Unicycler with an appropriate bridging mode (conservative, normal, or bold)
- Managing memory and runtime trade-offs when assembling large eukaryotic genomes on HPC clusters
- Monitoring assembly progress through checkpoint files to enable restart after node failure
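
For readers new to de Bruijn graph assemblers, the toy sketch below builds a graph from reads and extends a contig along unambiguous edges. It is a heavily simplified illustration: real assemblers such as SPAdes and MEGAHIT also handle reverse complements, sequencing errors, coverage, and in-degree checks. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer prefix to the sorted (k-1)-mer suffixes that follow
    it in at least one k-mer of the reads."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


def extend_unambiguous(graph: Dict[str, List[str]], start: str) -> str:
    """Walk from start while the current node has exactly one successor.
    Stops at a branch, a dead end, or when a node would be revisited."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, [])) == 1:
        nxt = graph[node][0]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig
```

Larger k values resolve more repeats but need higher coverage to keep the graph connected, which is the trade-off behind k-mer tuning in low-coverage metagenomes.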
Module 5: Hybrid and Long-Read Assembly Strategies
- Linking short-read contigs with long reads for scaffolding, using minimap2 alignments or tools such as LINKS or OPERA-MS
- Polishing hybrid assemblies with Pilon using high-accuracy short reads to correct indels in homopolymer regions
- Resolving structural variants in diploid genomes using phased assembly with HiFi reads and hifiasm
- Integrating Hi-C or Bionano optical mapping data for chromosome-scale scaffolding in vertebrate genomes
- Assessing chimeric junctions in scaffolds by validating with mate-pair read support and optical maps
- Optimizing long-read polishing with Medaka by selecting model versions matching basecaller and chemistry
- Managing compute resources for Canu’s correction step, which can consume hundreds of GB per sample
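
As a sketch of how long-read alignments can suggest scaffold links, the code below parses minimap2 PAF output and counts reads that align to two different contigs at adjacent positions along the read. It relies on the standard 12 mandatory PAF columns; the function name and the `min_mapq` and `min_aln_len` thresholds are illustrative assumptions. Orientation and gap size are ignored, which dedicated scaffolders do take into account.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def contig_links(
    paf_lines: Iterable[str], min_mapq: int = 20, min_aln_len: int = 500
) -> Counter:
    """Count long reads whose consecutive alignments hit two different contigs.

    PAF columns used (0-based): 0 query name, 2 query start, 5 target name,
    10 alignment block length, 11 mapping quality.
    """
    hits: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        qname, qstart, tname = fields[0], int(fields[2]), fields[5]
        aln_len, mapq = int(fields[10]), int(fields[11])
        if mapq >= min_mapq and aln_len >= min_aln_len:
            hits[qname].append((qstart, tname))

    links: Counter = Counter()
    for read_hits in hits.values():
        # Order alignments by their position along the read.
        ordered = [target for _, target in sorted(read_hits)]
        for left, right in zip(ordered, ordered[1:]):
            if left != right:
                links[tuple(sorted((left, right)))] += 1
    return links
```

Links supported by only one read are weak evidence; requiring several independent reads per contig pair is a common way to limit chimeric joins.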
Module 6: Assembly Evaluation and Quality Metrics
- Calculating N50 and L50 while interpreting their limitations in highly fragmented or repetitive genomes
- Running QUAST to compare assemblies across multiple tools and identify misassemblies using reference genomes
- Using BUSCO to assess gene space completeness with lineage-specific ortholog sets
- Interpreting k-mer spectrum analysis with Merqury to estimate consensus quality (QV) and detect haplotype duplications
- Validating contiguity improvements against expected genome size to avoid over-merging due to repeats
- Generating assembly reports with Assemblytics to visualize structural variants in pairwise comparisons
- Documenting evaluation metrics in standardized formats for audit trails in clinical or regulatory submissions
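
The contiguity metrics in this module are easy to compute directly, which helps when interpreting their limitations. The sketch below computes N50 and L50 from contig lengths, and NG50/LG50 when an expected genome size is supplied. It also shows the generic Phred relation QV = -10 log10(error rate). The function names are illustrative, and the QV helper shows only the scale conversion, not how Merqury derives an error rate from k-mers.

```python
import math
from typing import Iterable, Optional, Tuple


def n50_l50(lengths: Iterable[int], genome_size: Optional[int] = None) -> Tuple[int, int]:
    """Return (N50, L50) for the given contig lengths.

    If genome_size is given, the threshold is half the expected genome size
    instead of half the assembly size, which yields (NG50, LG50).
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    total = genome_size if genome_size is not None else sum(ordered)
    half = total / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("assembly covers less than half of the expected genome size")


def phred_qv(error_rate: float) -> float:
    """Convert a per-base error rate into a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)
```

Comparing N50 with NG50 for the same assembly is a quick check on whether a contiguity gain simply reflects an assembly that is smaller or larger than the expected genome size.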
Module 7: Post-Assembly Processing and Gap Closure
- Filling gaps in scaffolds using GapFiller or Sealer with remaining unaligned reads and k-mer databases
- Resolving ambiguous regions by targeted reassembly with local SPAdes on unmapped read clusters
- Trimming vector and adapter sequences from plasmid or viral genome termini
- Splitting chimeric contigs identified through inconsistent coverage or GC bias profiles
- Annotating telomeric repeats in eukaryotic assemblies to define true ends versus assembly artifacts
- Revising gene predictions after gap closure to correct truncated open reading frames
- Validating closed regions by PCR and Sanger sequencing in high-stakes applications like reference genome curation
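
Before and after gap closure it is useful to list the gaps that remain in each scaffold. The sketch below locates runs of N in a sequence string and splits a scaffold back into contigs at gaps above a minimum length. The function names and the default minimum gap length of 10 are assumptions for illustration, and coordinates are 0-based, half-open.

```python
import re
from typing import List, Tuple


def find_gaps(sequence: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) coordinates of runs of N/n at least min_len long."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]


def split_at_gaps(sequence: str, min_len: int = 10) -> List[str]:
    """Split a scaffold into contigs at gaps of at least min_len bases.
    Shorter N runs are kept inside the contig sequence."""
    pieces: List[str] = []
    previous_end = 0
    for start, end in find_gaps(sequence, min_len):
        if start > previous_end:
            pieces.append(sequence[previous_end:start])
        previous_end = end
    if previous_end < len(sequence):
        pieces.append(sequence[previous_end:])
    return pieces
```

Running `find_gaps` on the same scaffold before and after gap filling gives a simple record of which regions were closed and therefore need validation.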
Module 8: Metagenomic and Single-Cell Assembly Challenges
- Binning contigs by coverage and composition using MetaBAT2 or MaxBin2 in complex microbial communities
- Employing differential coverage across multiple samples to separate co-assembled strains
- Assembling low-input single-cell genomes with specialized pipelines like SCAVAGE or SPAdes in single-cell mode (--sc)
- Handling amplification bias in MDA-based single-cell data by down-weighting high-coverage regions
- Estimating contamination and completeness of metagenome-assembled genomes (MAGs) with CheckM
- Resolving strain heterogeneity by iterative reassembly of binned contigs with strain-aware tools
- Managing cross-sample contamination risks in shared sequencing lanes for clinical microbiome studies
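
The intuition behind marker-based completeness and contamination estimates for MAGs can be shown with a simplified calculation over single-copy marker genes. This is not CheckM's algorithm, which uses lineage-specific, collocated marker sets. The function name and the flat marker list are assumptions for illustration.

```python
from typing import Dict, Iterable, Tuple


def estimate_completeness_contamination(
    marker_counts: Dict[str, int], expected_markers: Iterable[str]
) -> Tuple[float, float]:
    """Return (completeness %, contamination %) for one genome bin.

    completeness  = expected single-copy markers found at least once
    contamination = extra copies of expected markers beyond the first
    Both are expressed as a percentage of the number of expected markers.
    """
    expected = list(expected_markers)
    if not expected:
        raise ValueError("expected_markers must not be empty")
    present = sum(1 for marker in expected if marker_counts.get(marker, 0) >= 1)
    extra = sum(max(marker_counts.get(marker, 0) - 1, 0) for marker in expected)
    return 100.0 * present / len(expected), 100.0 * extra / len(expected)
```

A bin with high completeness but many duplicated markers often contains more than one strain or genome, which is a cue for the strain-aware reassembly mentioned above.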
Module 9: Workflow Integration and Reproducible Pipelines
- Containerizing assembly workflows using Docker or Singularity for environment consistency across clusters
- Orchestrating multi-step pipelines with Snakemake or Nextflow to manage dependencies and resume on failure
- Version-controlling pipeline scripts and configuration files using Git with protected branches for production use
- Logging resource usage (CPU, memory, disk I/O) per sample to forecast infrastructure needs
- Implementing checksum-based caching to avoid redundant processing in iterative development
- Generating audit-compliant execution reports with timestamps, software versions, and parameter settings
- Deploying pipelines on cloud platforms (AWS, GCP) with spot instance fallback and data egress cost controls
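
Checksum-based caching can be sketched as deriving a cache key from the content of the input files, the parameters, and the tool version, so that a step reruns only when one of them changes. Workflow managers such as Snakemake and Nextflow provide their own mechanisms for this. The helper names below are hypothetical and show only the general idea.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Iterable


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks so that large
    FASTQ or FASTA files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_key(inputs: Iterable[Path], params: Dict[str, object], tool_version: str) -> str:
    """Derive a deterministic key from input contents, parameters, and version.

    Input order is preserved because it can be meaningful (e.g. R1 before R2).
    Parameters must be JSON-serializable.
    """
    payload = {
        "inputs": [file_sha256(Path(p)) for p in inputs],
        "params": params,
        "tool_version": tool_version,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

The same per-file digests can be written to the execution report, which gives the audit trail and the cache a shared source of truth.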