This curriculum spans the full lifecycle of an RNA-seq project: experimental design, data processing, statistical analysis, and cross-team collaboration, at a scope comparable to a multi-phase bioinformatics initiative in academic or industry research settings.
Module 1: Study Design and Experimental Planning for RNA-Seq
- Determine appropriate sample size based on expected effect size, biological variability, and statistical power using pilot data or published benchmarks.
- Select between bulk RNA-seq, single-cell RNA-seq, or spatial transcriptomics based on research question and tissue heterogeneity.
- Decide on paired versus unpaired experimental designs when comparing conditions (e.g., tumor vs. normal, pre- vs. post-treatment).
- Implement randomization of sample processing order to minimize batch effects during library preparation and sequencing runs.
- Define inclusion and exclusion criteria for patient or model organism samples to ensure cohort homogeneity and reproducibility.
- Coordinate with wet-lab teams to standardize RNA extraction methods, RNA integrity number (RIN) thresholds, and preservation protocols.
- Choose stranded versus non-stranded library preparation based on need to resolve antisense transcription or overlapping gene annotations.
- Allocate sequencing depth per sample considering transcriptome complexity and detection goals (e.g., 20M–40M reads for mRNA, higher for lncRNA).
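The sample-size step above can be sketched with a normal-approximation power calculation. This is a minimal illustration, not a substitute for dedicated RNA-seq power tools; the effect size, CV, and the log-scale SD approximation are all illustrative assumptions.

```python
# Sketch: two-group sample-size estimate via normal approximation.
# effect_log2fc, cv, alpha, and power values are illustrative assumptions.
from math import ceil, log2
from statistics import NormalDist

def samples_per_group(effect_log2fc, cv, alpha=0.05, power=0.8):
    """Approximate n per group to detect a given log2 fold change when
    expression varies with biological coefficient of variation cv."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    # On the log2 scale, per-group SD is roughly log2(1 + cv)
    sd = log2(1 + cv)
    return ceil(2 * ((z_alpha + z_beta) * sd / effect_log2fc) ** 2)

n = samples_per_group(effect_log2fc=1.0, cv=0.4)
print(n)  # 4 per group under these assumed parameters
```

In practice, pilot-data dispersion estimates (e.g., from DESeq2) should replace the CV guess, and tools built for count data give more realistic numbers.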
Module 2: Raw Data Acquisition and Quality Control
- Validate FASTQ file integrity by verifying read pairing, header formatting, and absence of adapter contamination.
- Evaluate per-base sequence quality using FastQC and set thresholds for trimming (e.g., Phred score < 20).
- Detect and quantify adapter sequences using tools like Cutadapt, fastp, or Skewer to inform the trimming strategy (FastQ Screen is better suited to screening reads against reference genomes for cross-species contamination).
- Assess GC content distribution across samples to identify potential library preparation biases or contamination.
- Compare quality metrics across sequencing batches to detect systematic technical variation.
- Implement automated quality control pipelines using MultiQC to aggregate reports across large cohorts.
- Decide whether to exclude samples based on low read counts, high duplication rates, or low RIN values.
- Document quality control decisions in metadata logs for auditability and reproducibility.
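The sample-exclusion decision above can be encoded as an explicit, auditable filter. A minimal sketch, assuming hypothetical per-sample metric names and illustrative thresholds:

```python
# Sketch: flag samples for exclusion from per-sample QC metrics.
# Metric keys ("reads", "dup_rate", "rin") and thresholds are assumptions.
def flag_samples(qc, min_reads=10_000_000, max_dup=0.6, min_rin=7.0):
    """Return {sample: [reasons]} for samples failing any QC threshold."""
    failures = {}
    for sample, m in qc.items():
        reasons = []
        if m["reads"] < min_reads:
            reasons.append("low read count")
        if m["dup_rate"] > max_dup:
            reasons.append("high duplication")
        if m["rin"] < min_rin:
            reasons.append("low RIN")
        if reasons:
            failures[sample] = reasons
    return failures

qc = {
    "S1": {"reads": 25_000_000, "dup_rate": 0.35, "rin": 8.9},
    "S2": {"reads": 4_000_000, "dup_rate": 0.72, "rin": 6.1},
}
flagged = flag_samples(qc)
print(flagged)  # S2 fails all three checks; S1 passes
```

Logging the returned reasons alongside the cohort metadata satisfies the auditability requirement directly.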
Module 3: Read Alignment and Transcript Assembly
- Select reference genome build (e.g., GRCh38 vs. T2T) and annotation source (e.g., GENCODE, RefSeq) based on species and research context.
- Choose between splice-aware aligners (STAR, HISAT2) based on speed, memory requirements, and sensitivity for novel junction detection.
- Configure aligner parameters such as maximum intron length, seed length, and mismatch tolerance based on organism biology.
- Generate genome indexes locally to ensure version control and reproducibility across compute environments.
- Validate alignment rates and splice junction counts to detect mapping artifacts or contamination.
- Use transcript assembly tools (StringTie, or the older Cufflinks in legacy pipelines) when working with non-model organisms or investigating novel isoforms.
- Assess chimeric read rates in STAR output to identify potential fusion genes or technical artifacts.
- Filter multimapping reads based on downstream application (e.g., retain for gene-level counts, exclude for isoform analysis).
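The alignment-rate and chimeric-read checks above can be automated by parsing the aligner's summary. A sketch against a STAR-style Log.final.out; the exact label strings are assumptions about STAR's log format and may need adjusting for your version:

```python
# Sketch: parse uniquely-mapped and chimeric percentages from a
# STAR-style "key | value" summary and flag low alignment rates.
def parse_star_log(text):
    metrics = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        key, _, value = line.partition("|")
        metrics[key.strip()] = value.strip()
    return metrics

log = """\
                   Uniquely mapped reads % |	92.40%
        % of reads mapped to multiple loci |	4.10%
                       % of chimeric reads |	0.30%
"""
m = parse_star_log(log)
unique = float(m["Uniquely mapped reads %"].rstrip("%"))
print("PASS" if unique >= 80 else "CHECK")  # 80% is an illustrative cutoff
```

Running this across a cohort and plotting the rates is an easy way to spot contamination or index mismatches before quantification.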
Module 4: Quantification and Normalization Strategies
- Choose between gene-level (featureCounts, HTSeq) and transcript-level (Salmon, kallisto) quantification based on analysis goals.
- Decide whether to use alignment-based or pseudoalignment methods based on computational resources and need for speed.
- Apply TPM, FPKM, or counts for downstream analysis based on compatibility with statistical models (e.g., counts for DESeq2).
- Correct for gene length and sequencing depth during normalization to enable cross-sample comparisons.
- Address GC bias in count data using conditional quantile normalization (CQN) when diagnostic plots of expression versus GC content reveal sample-specific bias.
- Integrate spike-in controls (e.g., ERCC) for absolute quantification when comparing across experiments with variable RNA input.
- Handle overlapping gene features by defining counting strategies (e.g., union, intersection, fractional counting).
- Validate quantification consistency by comparing technical replicates before proceeding to differential expression.
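The length and depth corrections above are exactly what TPM encodes. A minimal pure-Python sketch (toy counts and lengths):

```python
# Sketch: counts -> TPM, showing the two corrections explicitly.
def tpm(counts, lengths_bp):
    """counts: {gene: int}, lengths_bp: {gene: int} -> {gene: float}."""
    # Reads per kilobase: corrects for gene length
    rpk = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    # Scale so each sample sums to one million: corrects for depth
    scale = sum(rpk.values()) / 1_000_000
    return {g: r / scale for g, r in rpk.items()}

vals = tpm({"A": 100, "B": 300}, {"A": 1000, "B": 3000})
print(vals)  # equal length-normalized rates, so both genes get 500000.0
```

Note the caveat from the list above: TPM is convenient for visualization and cross-sample comparison, but DESeq2 and edgeR expect raw counts and apply their own normalization.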
Module 5: Differential Expression and Statistical Modeling
- Select appropriate statistical framework (DESeq2, edgeR, limma-voom) based on sample size, dispersion estimation, and count distribution.
- Model batch effects as covariates in the design matrix to prevent confounding in differential expression results.
- Set significance thresholds using adjusted p-values (e.g., FDR < 0.05) and log2 fold change cutoffs (e.g., |log2FC| > 1).
- Assess mean-variance relationship in count data to validate dispersion estimates and model fit.
- Filter genes with consistently low counts (e.g., requiring a minimum count in a minimum number of samples) to stabilize dispersion estimates and reduce the multiple-testing burden.
- Validate model assumptions using residual plots and Cook’s distance to identify influential outliers.
- Perform contrast testing for complex designs (e.g., time-series, multi-factor experiments) using interaction terms.
- Generate diagnostic plots (MA plots, PCA, heatmaps) to interpret global patterns and detect technical artifacts.
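The FDR and fold-change thresholding above can be illustrated with a pure-Python Benjamini-Hochberg adjustment; the toy p-values and fold changes below are illustrative, and real analyses would take these from DESeq2/edgeR/limma output:

```python
# Sketch: Benjamini-Hochberg adjustment plus a |log2FC| cutoff.
def bh_adjust(pvalues):
    """Return BH-adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank_from_end, i in enumerate(reversed(order)):
        rank = n - rank_from_end  # 1-based rank of pvalues[i]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.04, 0.03, 0.8]
log2fc = [2.1, -1.5, 0.4, 3.0]
padj = bh_adjust(pvals)
hits = [i for i in range(len(pvals)) if padj[i] < 0.05 and abs(log2fc[i]) > 1]
print(hits)  # only gene 0 passes both the FDR and fold-change thresholds
```

Note that the fold-change cutoff is applied after adjustment; filtering p-values by fold change before adjustment would invalidate the FDR control.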
Module 6: Functional Enrichment and Pathway Analysis
- Select gene set databases (e.g., GO, KEGG, Reactome, MSigDB) based on biological context and pathway granularity.
- Choose between over-representation analysis (ORA) and gene set enrichment analysis (GSEA) based on hypothesis structure.
- Adjust for gene length bias in enrichment results (e.g., with goseq), since longer genes accumulate more reads and are more likely to be called differentially expressed.
- Define background gene sets for enrichment tests to reflect detectable transcripts in the experiment.
- Interpret enrichment results in light of directionality (up vs. downregulated genes) and effect size.
- Validate enrichment findings using complementary tools (e.g., Enrichr, g:Profiler) to assess robustness.
- Integrate pathway topology using tools like SPIA or Pathway-Express when mechanistic insight is required.
- Report enrichment results with precise gene sets, statistical methods, and multiple testing corrections to avoid overinterpretation.
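The ORA branch above reduces to a hypergeometric tail test against the detectable-gene background. A minimal sketch with illustrative numbers:

```python
# Sketch: over-representation p-value via the hypergeometric distribution.
# The background is restricted to detectable genes, as recommended above.
from math import comb

def ora_pvalue(k, set_size, hits, background):
    """P(overlap >= k) when drawing `hits` DE genes from `background`
    genes, of which `set_size` belong to the pathway."""
    total = comb(background, hits)
    return sum(
        comb(set_size, i) * comb(background - set_size, hits - i)
        for i in range(k, min(set_size, hits) + 1)
    ) / total

# Illustrative: 8 of 50 DE genes fall in a 100-gene pathway,
# against a background of 10,000 detectable genes
p = ora_pvalue(k=8, set_size=100, hits=50, background=10_000)
print(f"{p:.2e}")
```

Using the full genome as background instead of the ~10,000 detectable genes would inflate significance, which is why the background-definition bullet above matters.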
Module 7: Alternative Splicing and Isoform Analysis
- Select splicing analysis tools (rMATS, SUPPA2, LeafCutter) based on ability to detect specific event types (e.g., exon skipping, intron retention).
- Define minimum read coverage thresholds for splice junctions to ensure reliable detection of alternative events.
- Quantify percent spliced in (PSI) values and test for significant differences between conditions using appropriate statistical models.
- Validate novel splice junctions using independent methods (e.g., RT-PCR) when pursuing experimental follow-up.
- Integrate isoform-level expression from Salmon or StringTie to assess differential transcript usage (DTU).
- Resolve ambiguity in isoform assignment using long-read sequencing data when short-read evidence is inconclusive.
- Filter low-abundance isoforms to reduce false positives in differential splicing analysis.
- Visualize splicing events using Sashimi plots to communicate complex patterns to collaborators.
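The PSI quantification and coverage-filter steps above can be sketched for an exon-skipping event. Effective-length normalization of junction counts (as rMATS performs) is omitted for clarity; counts are assumed already comparable:

```python
# Sketch: percent spliced in (PSI) with a minimum-coverage filter.
# The coverage threshold of 10 junction reads is an illustrative choice.
def psi(inclusion_reads, skipping_reads, min_coverage=10):
    """Return PSI in [0, 1], or None when junction coverage is too low."""
    total = inclusion_reads + skipping_reads
    if total < min_coverage:
        return None  # unreliable: too few supporting junction reads
    return inclusion_reads / total

print(psi(30, 10))  # 0.75: exon included in three quarters of transcripts
print(psi(3, 2))    # None: below the coverage threshold
```

Differential splicing tests then compare PSI distributions between conditions, which is why low-coverage events must be filtered first rather than treated as PSI of 0 or 1.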
Module 8: Data Integration and Multi-Omics Correlation
- Match RNA-seq samples with corresponding genomic (e.g., WES), epigenomic (e.g., ChIP-seq), or proteomic datasets using sample identifiers and metadata.
- Normalize and batch-correct multi-omics data using ComBat-seq or similar methods before integration.
- Perform correlation analysis between gene expression and copy number variation (CNV) to identify dosage effects.
- Use WGCNA to construct co-expression networks and identify modules correlated with clinical traits or other molecular data.
- Apply integrative clustering (iCluster, MOFA) to discover molecular subtypes across data modalities.
- Map eQTLs using genotype and expression data to identify regulatory variants influencing transcript levels.
- Validate integrative findings using orthogonal datasets or public repositories (e.g., GTEx, TCGA).
- Maintain traceability of data versions and processing steps to ensure reproducibility in cross-platform analyses.
Module 9: Reproducibility, Reporting, and Data Sharing
- Containerize analysis pipelines using Docker or Singularity to ensure computational reproducibility.
- Version-control code and workflows using Git with descriptive commit messages and branching strategies.
- Use workflow managers (Snakemake, Nextflow) to orchestrate complex, multi-step RNA-seq analyses.
- Generate comprehensive metadata using MINSEQE or ISA-Tab standards for public data deposition.
- Deposit raw and processed data in public repositories (e.g., GEO, SRA, EGA) with appropriate access controls.
- Share analysis code via public repositories (e.g., GitHub, GitLab) with detailed READMEs and dependency specifications.
- Produce automated reports using R Markdown or Jupyter Notebooks to document analytical decisions and results.
- Implement checksum validation for data transfers and storage to detect corruption or version mismatches.
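The checksum-validation step above can be implemented with a chunked SHA-256 digest so that multi-gigabyte FASTQ files never need to fit in memory. A minimal sketch:

```python
# Sketch: SHA-256 checksum validation for transferred files.
# Chunked reads keep memory bounded for large FASTQ/BAM files.
import hashlib
import os
import tempfile

def sha256sum(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_hex):
    return sha256sum(path) == expected_hex

# Demo: round-trip a small FASTQ-like payload through a temp file
payload = b"@read1\nACGT\n+\nIIII\n"
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh.write(payload)
    path = fh.name
ok = verify(path, hashlib.sha256(payload).hexdigest())
os.remove(path)
print(ok)  # True: file contents match the recorded checksum
```

In practice the expected digests come from a manifest generated at the data source (e.g., alongside the sequencing run), and verification runs after every transfer and before archival.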