This curriculum spans the full lifecycle of an RNA-seq project: experimental design, data processing, statistical analysis, and cross-team collaboration, at a scope comparable to a multi-phase bioinformatics initiative in academic or industry research settings.
Module 1: Study Design and Experimental Planning for RNA-Seq
- Determine appropriate sample size based on expected effect size, biological variability, and statistical power using pilot data or published benchmarks.
- Select between bulk RNA-seq, single-cell RNA-seq, or spatial transcriptomics based on research question and tissue heterogeneity.
- Decide on paired versus unpaired experimental designs when comparing conditions (e.g., tumor vs. normal, pre- vs. post-treatment).
- Implement randomization of sample processing order to minimize batch effects during library preparation and sequencing runs.
- Define inclusion and exclusion criteria for patient or model organism samples to ensure cohort homogeneity and reproducibility.
- Coordinate with wet-lab teams to standardize RNA extraction methods, RNA integrity number (RIN) thresholds, and preservation protocols.
- Choose stranded versus non-stranded library preparation based on need to resolve antisense transcription or overlapping gene annotations.
- Allocate sequencing depth per sample considering transcriptome complexity and detection goals (e.g., 20M–40M reads for mRNA, higher for lncRNA).
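The sample-size step above can be sketched with a normal-approximation power calculation. This is a minimal illustration, not a substitute for dedicated RNA-seq power tools; the effect size, CV, and the log-scale SD approximation are all illustrative assumptions.

```python
# Sketch: two-group sample-size estimate via normal approximation.
# effect_log2fc, cv, alpha, and power values are illustrative assumptions.
from math import ceil, log2
from statistics import NormalDist

def samples_per_group(effect_log2fc, cv, alpha=0.05, power=0.8):
    """Approximate n per group to detect a given log2 fold change when
    expression varies with biological coefficient of variation cv."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    # On the log2 scale, per-group SD is roughly log2(1 + cv)
    sd = log2(1 + cv)
    return ceil(2 * ((z_alpha + z_beta) * sd / effect_log2fc) ** 2)

n = samples_per_group(effect_log2fc=1.0, cv=0.4)
print(n)  # 4 per group under these assumed parameters
```

In practice, pilot-data dispersion estimates (e.g., from DESeq2) should replace the CV guess, and tools built for count data give more realistic numbers.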
Module 2: Raw Data Acquisition and Quality Control
- Validate FASTQ file integrity by verifying read pairing, header formatting, and absence of adapter contamination.
- Evaluate per-base sequence quality using FastQC and set thresholds for trimming (e.g., Phred score < 20).
- Detect and quantify adapter sequences using tools like Cutadapt, fastp, or Skewer to inform the trimming strategy (FastQ Screen is better suited to screening reads against reference genomes for cross-species contamination).
- Assess GC content distribution across samples to identify potential library preparation biases or contamination.
- Compare quality metrics across sequencing batches to detect systematic technical variation.
- Implement automated quality control pipelines using MultiQC to aggregate reports across large cohorts.
- Decide whether to exclude samples based on low read counts, high duplication rates, or low RIN values.
- Document quality control decisions in metadata logs for auditability and reproducibility.
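The sample-exclusion decision above can be encoded as an explicit, auditable filter. A minimal sketch, assuming hypothetical per-sample metric names and illustrative thresholds:

```python
# Sketch: flag samples for exclusion from per-sample QC metrics.
# Metric keys ("reads", "dup_rate", "rin") and thresholds are assumptions.
def flag_samples(qc, min_reads=10_000_000, max_dup=0.6, min_rin=7.0):
    """Return {sample: [reasons]} for samples failing any QC threshold."""
    failures = {}
    for sample, m in qc.items():
        reasons = []
        if m["reads"] < min_reads:
            reasons.append("low read count")
        if m["dup_rate"] > max_dup:
            reasons.append("high duplication")
        if m["rin"] < min_rin:
            reasons.append("low RIN")
        if reasons:
            failures[sample] = reasons
    return failures

qc = {
    "S1": {"reads": 25_000_000, "dup_rate": 0.35, "rin": 8.9},
    "S2": {"reads": 4_000_000, "dup_rate": 0.72, "rin": 6.1},
}
flagged = flag_samples(qc)
print(flagged)  # S2 fails all three checks; S1 passes
```

Logging the returned reasons alongside the cohort metadata satisfies the auditability requirement directly.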
Module 3: Read Alignment and Transcript Assembly
- Select reference genome build (e.g., GRCh38 vs. T2T) and annotation source (e.g., GENCODE, RefSeq) based on species and research context.
- Choose between splice-aware aligners (STAR, HISAT2) based on speed, memory requirements, and sensitivity for novel junction detection.
- Configure aligner parameters such as maximum intron length, seed length, and mismatch tolerance based on organism biology.
- Generate genome indexes locally to ensure version control and reproducibility across compute environments.
- Validate alignment rates and splice junction counts to detect mapping artifacts or contamination.
- Use transcript assembly tools (StringTie, or the older Cufflinks in legacy pipelines) when working with non-model organisms or investigating novel isoforms.
- Assess chimeric read rates in STAR output to identify potential fusion genes or technical artifacts.
- Filter multimapping reads based on downstream application (e.g., retain for gene-level counts, exclude for isoform analysis).
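The alignment-rate and chimeric-read checks above can be automated by parsing the aligner's summary. A sketch against a STAR-style Log.final.out; the exact label strings are assumptions about STAR's log format and may need adjusting for your version:

```python
# Sketch: parse uniquely-mapped and chimeric percentages from a
# STAR-style "key | value" summary and flag low alignment rates.
def parse_star_log(text):
    metrics = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        key, _, value = line.partition("|")
        metrics[key.strip()] = value.strip()
    return metrics

log = """\
                   Uniquely mapped reads % |	92.40%
        % of reads mapped to multiple loci |	4.10%
                       % of chimeric reads |	0.30%
"""
m = parse_star_log(log)
unique = float(m["Uniquely mapped reads %"].rstrip("%"))
print("PASS" if unique >= 80 else "CHECK")  # 80% is an illustrative cutoff
```

Running this across a cohort and plotting the rates is an easy way to spot contamination or index mismatches before quantification.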
Module 4: Quantification and Normalization Strategies
- Choose between gene-level (featureCounts, HTSeq) and transcript-level (Salmon, kallisto) quantification based on analysis goals.
- Decide whether to use alignment-based or pseudoalignment methods based on computational resources and need for speed.
- Apply TPM, FPKM, or counts for downstream analysis based on compatibility with statistical models (e.g., counts for DESeq2).
- Correct for gene length and sequencing depth during normalization to enable cross-sample comparisons.
- Address GC bias in count data using conditional quantile normalization (CQN) when diagnostic plots of expression versus GC content reveal sample-specific bias.
- Integrate spike-in controls (e.g., ERCC) for absolute quantification when comparing across experiments with variable RNA input.
- Handle overlapping gene features by defining counting strategies (e.g., union, intersection, fractional counting).
- Validate quantification consistency by comparing technical replicates before proceeding to differential expression.
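The length and depth corrections above are exactly what TPM encodes. A minimal pure-Python sketch (toy counts and lengths):

```python
# Sketch: counts -> TPM, showing the two corrections explicitly.
def tpm(counts, lengths_bp):
    """counts: {gene: int}, lengths_bp: {gene: int} -> {gene: float}."""
    # Reads per kilobase: corrects for gene length
    rpk = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    # Scale so each sample sums to one million: corrects for depth
    scale = sum(rpk.values()) / 1_000_000
    return {g: r / scale for g, r in rpk.items()}

vals = tpm({"A": 100, "B": 300}, {"A": 1000, "B": 3000})
print(vals)  # equal length-normalized rates, so both genes get 500000.0
```

Note the caveat from the list above: TPM is convenient for visualization and cross-sample comparison, but DESeq2 and edgeR expect raw counts and apply their own normalization.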
Module 5: Differential Expression and Statistical Modeling
- Select appropriate statistical framework (DESeq2, edgeR, limma-voom) based on sample size, dispersion estimation, and count distribution.
- Model batch effects as covariates in the design matrix to prevent confounding in differential expression results.
- Set significance thresholds using adjusted p-values (e.g., FDR < 0.05) and log2 fold change cutoffs (e.g., |log2FC| > 1).
- Assess mean-variance relationship in count data to validate dispersion estimates and model fit.
- Filter genes with consistently low counts (e.g., requiring a minimum count in a minimum number of samples) to stabilize dispersion estimates and reduce the multiple-testing burden.
- Validate model assumptions using residual plots and Cook’s distance to identify influential outliers.
- Perform contrast testing for complex designs (e.g., time-series, multi-factor experiments) using interaction terms.
- Generate diagnostic plots (MA plots, PCA, heatmaps) to interpret global patterns and detect technical artifacts.
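The FDR and fold-change thresholding above can be illustrated with a pure-Python Benjamini-Hochberg adjustment; the toy p-values and fold changes below are illustrative, and real analyses would take these from DESeq2/edgeR/limma output:

```python
# Sketch: Benjamini-Hochberg adjustment plus a |log2FC| cutoff.
def bh_adjust(pvalues):
    """Return BH-adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank_from_end, i in enumerate(reversed(order)):
        rank = n - rank_from_end  # 1-based rank of pvalues[i]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.04, 0.03, 0.8]
log2fc = [2.1, -1.5, 0.4, 3.0]
padj = bh_adjust(pvals)
hits = [i for i in range(len(pvals)) if padj[i] < 0.05 and abs(log2fc[i]) > 1]
print(hits)  # only gene 0 passes both the FDR and fold-change thresholds
```

Note that the fold-change cutoff is applied after adjustment; filtering p-values by fold change before adjustment would invalidate the FDR control.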
Module 6: Functional Enrichment and Pathway Analysis
- Select gene set databases (e.g., GO, KEGG, Reactome, MSigDB) based on biological context and pathway granularity.
- Choose between over-representation analysis (ORA) and gene set enrichment analysis (GSEA) based on hypothesis structure.
- Adjust for gene length bias in enrichment results (e.g., with goseq), since longer genes accumulate more reads and are more likely to be called differentially expressed.
- Define background gene sets for enrichment tests to reflect detectable transcripts in the experiment.
- Interpret enrichment results in light of directionality (up vs. downregulated genes) and effect size.
- Validate enrichment findings using complementary tools (e.g., Enrichr, g:Profiler) to assess robustness.
- Integrate pathway topology using tools like SPIA or Pathway-Express when mechanistic insight is required.
- Report enrichment results with precise gene sets, statistical methods, and multiple testing corrections to avoid overinterpretation.
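The ORA branch above reduces to a hypergeometric tail test against the detectable-gene background. A minimal sketch with illustrative numbers:

```python
# Sketch: over-representation p-value via the hypergeometric distribution.
# The background is restricted to detectable genes, as recommended above.
from math import comb

def ora_pvalue(k, set_size, hits, background):
    """P(overlap >= k) when drawing `hits` DE genes from `background`
    genes, of which `set_size` belong to the pathway."""
    total = comb(background, hits)
    return sum(
        comb(set_size, i) * comb(background - set_size, hits - i)
        for i in range(k, min(set_size, hits) + 1)
    ) / total

# Illustrative: 8 of 50 DE genes fall in a 100-gene pathway,
# against a background of 10,000 detectable genes
p = ora_pvalue(k=8, set_size=100, hits=50, background=10_000)
print(f"{p:.2e}")
```

Using the full genome as background instead of the ~10,000 detectable genes would inflate significance, which is why the background-definition bullet above matters.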
Module 7: Alternative Splicing and Isoform Analysis
- Select splicing analysis tools (rMATS, SUPPA2, LeafCutter) based on ability to detect specific event types (e.g., exon skipping, intron retention).
- Define minimum read coverage thresholds for splice junctions to ensure reliable detection of alternative events.
- Quantify percent spliced in (PSI) values and test for significant differences between conditions using appropriate statistical models.
- Validate novel splice junctions using independent methods (e.g., RT-PCR) when pursuing experimental follow-up.
- Integrate isoform-level expression from Salmon or StringTie to assess differential transcript usage (DTU).
- Resolve ambiguity in isoform assignment using long-read sequencing data when short-read evidence is inconclusive.
- Filter low-abundance isoforms to reduce false positives in differential splicing analysis.
- Visualize splicing events using Sashimi plots to communicate complex patterns to collaborators.
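The PSI quantification and coverage-filter steps above can be sketched for an exon-skipping event. Effective-length normalization of junction counts (as rMATS performs) is omitted for clarity; counts are assumed already comparable:

```python
# Sketch: percent spliced in (PSI) with a minimum-coverage filter.
# The coverage threshold of 10 junction reads is an illustrative choice.
def psi(inclusion_reads, skipping_reads, min_coverage=10):
    """Return PSI in [0, 1], or None when junction coverage is too low."""
    total = inclusion_reads + skipping_reads
    if total < min_coverage:
        return None  # unreliable: too few supporting junction reads
    return inclusion_reads / total

print(psi(30, 10))  # 0.75: exon included in three quarters of transcripts
print(psi(3, 2))    # None: below the coverage threshold
```

Differential splicing tests then compare PSI distributions between conditions, which is why low-coverage events must be filtered first rather than treated as PSI of 0 or 1.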
Module 8: Data Integration and Multi-Omics Correlation
- Match RNA-seq samples with corresponding genomic (e.g., WES), epigenomic (e.g., ChIP-seq), or proteomic datasets using sample identifiers and metadata.
- Normalize and batch-correct multi-omics data using ComBat-seq or similar methods before integration.
- Perform correlation analysis between gene expression and copy number variation (CNV) to identify dosage effects.
- Use WGCNA to construct co-expression networks and identify modules correlated with clinical traits or other molecular data.
- Apply integrative clustering (iCluster, MOFA) to discover molecular subtypes across data modalities.
- Map eQTLs using genotype and expression data to identify regulatory variants influencing transcript levels.
- Validate integrative findings using orthogonal datasets or public repositories (e.g., GTEx, TCGA).
- Maintain traceability of data versions and processing steps to ensure reproducibility in cross-platform analyses.
Module 9: Reproducibility, Reporting, and Data Sharing
- Containerize analysis pipelines using Docker or Singularity to ensure computational reproducibility.
- Version-control code and workflows using Git with descriptive commit messages and branching strategies.
- Use workflow managers (Snakemake, Nextflow) to orchestrate complex, multi-step RNA-seq analyses.
- Generate comprehensive metadata using MINSEQE or ISA-Tab standards for public data deposition.
- Deposit raw and processed data in public repositories (e.g., GEO, SRA, EGA) with appropriate access controls.
- Share analysis code via public repositories (e.g., GitHub, GitLab) with detailed READMEs and dependency specifications.
- Produce automated reports using R Markdown or Jupyter Notebooks to document analytical decisions and results.
- Implement checksum validation for data transfers and storage to detect corruption or version mismatches.
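The checksum-validation step above can be implemented with a chunked SHA-256 digest so that multi-gigabyte FASTQ files never need to fit in memory. A minimal sketch:

```python
# Sketch: SHA-256 checksum validation for transferred files.
# Chunked reads keep memory bounded for large FASTQ/BAM files.
import hashlib
import os
import tempfile

def sha256sum(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_hex):
    return sha256sum(path) == expected_hex

# Demo: round-trip a small FASTQ-like payload through a temp file
payload = b"@read1\nACGT\n+\nIIII\n"
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh.write(payload)
    path = fh.name
ok = verify(path, hashlib.sha256(payload).hexdigest())
os.remove(path)
print(ok)  # True: file contents match the recorded checksum
```

In practice the expected digests come from a manifest generated at the data source (e.g., alongside the sequencing run), and verification runs after every transfer and before archival.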