This curriculum spans the full workflow of gene expression analysis in bioinformatics. Its technical depth and decision-making structure mirror a multi-phase research initiative that integrates experimental design, multi-omics data processing, and the reproducible analysis frameworks used in academic consortia and pharmaceutical discovery programs.
Module 1: Foundations of Gene Expression Technologies
- Select RNA-seq over microarrays when detecting novel transcripts or requiring a broader dynamic range in expression quantification.
- Choose stranded RNA-seq protocols to resolve antisense transcription and overlapping gene annotations in complex genomes.
- Decide between bulk and single-cell RNA-seq based on the biological question: resolving tissue heterogeneity versus measuring population-level expression trends.
- Implement spike-in controls (e.g., ERCC) to normalize technical variation in low-input or degraded RNA samples.
- Evaluate library preparation kits for compatibility with degraded samples (e.g., FFPE tissues) and sequencing platform constraints.
- Design multiplexed sequencing runs balancing sample throughput, read depth, and cost per sample.
- Establish minimum read depth thresholds (e.g., 20–30 million reads) based on transcriptome complexity and detection sensitivity needs.
- Document batch information during sample processing to enable downstream batch correction in analysis.
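The throughput/depth/cost trade-off above can be sketched as a small budgeting calculation. The yield, depth, and cost figures below are illustrative assumptions, not vendor specifications; substitute values from your platform and kit documentation.

```python
# Sketch: how many samples fit on one sequencing run at a target depth.
# All numeric inputs here are hypothetical examples.

def samples_per_run(run_yield_reads: float, target_depth: float,
                    overhead: float = 0.1) -> int:
    """Max samples per run, reserving `overhead` fraction of yield for
    PhiX spike-in, index hopping, and demultiplexing losses."""
    usable = run_yield_reads * (1 - overhead)
    return int(usable // target_depth)

def cost_per_sample(run_cost: float, n_samples: int) -> float:
    """Sequencing cost per sample for a fully multiplexed run."""
    return run_cost / n_samples

# Example: a 400 M-read run, targeting 25 M reads per sample.
n = samples_per_run(run_yield_reads=400e6, target_depth=25e6)
print(n)                         # samples that fit at this depth
print(cost_per_sample(1500.0, n))  # cost assuming a $1500 run
```

Raising the target depth toward the 30 M end of the range shrinks the number of samples per run, which is exactly the trade-off a multiplexing design must balance.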
Module 2: Experimental Design and Sample Quality Control
- Randomize sample processing order to minimize batch effects confounded with biological conditions.
- Set RIN (RNA Integrity Number) thresholds (e.g., ≥7) for inclusion in downstream analysis, excluding degraded samples.
- Include biological replicates (minimum n=3 per condition) to enable statistical power for differential expression detection.
- Integrate negative controls (e.g., no-template RT controls) to monitor contamination in library prep.
- Balance cohort composition across covariates (e.g., age, sex, batch) to avoid confounding in analysis.
- Use PCA on preliminary expression data to identify outliers prior to formal analysis.
- Define exclusion criteria for samples based on low alignment rates or high ribosomal RNA content.
- Implement blinding during sample processing to reduce operator bias in handling.
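The PCA-based outlier screen mentioned above can be sketched in a few lines. This is a minimal illustration, not a full QC pipeline: `pca_outliers` is a hypothetical helper, and the 3-SD cutoff is an assumption to tune per cohort.

```python
import numpy as np

def pca_outliers(expr: np.ndarray, n_sd: float = 3.0) -> np.ndarray:
    """Flag sample outliers along the first principal component.

    expr: samples x genes matrix of (log-transformed) expression values.
    Returns a boolean mask, True for samples more than `n_sd` standard
    deviations from the mean of the PC1 scores.
    """
    centered = expr - expr.mean(axis=0)
    # SVD of the centered matrix yields principal components directly.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pc1 = u[:, 0] * s[0]                    # sample scores on PC1
    z = (pc1 - pc1.mean()) / pc1.std()
    return np.abs(z) > n_sd
```

Samples flagged here warrant inspection against the other exclusion criteria (alignment rate, rRNA content) before removal, so that biology is not discarded as a technical artifact.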
Module 3: Raw Data Processing and Alignment
- Select alignment tools (e.g., STAR vs. HISAT2) based on speed, memory footprint, and splice junction sensitivity.
- Build custom genome indices when working with non-reference strains or engineered organisms.
- Trim adapter sequences and low-quality bases using tools like Trimmomatic or Cutadapt before alignment.
- Assess alignment metrics (e.g., % uniquely mapped reads, splice junctions detected) for quality assurance.
- Handle multimapping reads based on study goals: exclude them for gene-level counts, or resolve them with probabilistic methods.
- Filter ribosomal RNA alignments using pre-mapping or post-alignment subtraction with reference databases.
- Standardize file formats (e.g., BAM, CRAM) and indexing for efficient data access and sharing.
- Validate alignment reproducibility across replicates using correlation of coverage profiles.
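The alignment-metric checks above can be automated by parsing the aligner's summary report. The sketch below assumes STAR's `Log.final.out` layout (`metric | value` lines); the 70% uniquely-mapped cutoff is an illustrative threshold, not a standard.

```python
# Sketch: parse a STAR Log.final.out report and apply an illustrative
# QC threshold. Field names follow STAR's summary format; cutoffs are
# assumptions to adapt per project.

def parse_star_log(text: str) -> dict:
    """Return {metric_name: value_string} from a STAR summary report."""
    metrics = {}
    for line in text.splitlines():
        if "|" in line:
            key, value = line.split("|", 1)
            metrics[key.strip()] = value.strip()
    return metrics

def passes_qc(metrics: dict, min_unique_pct: float = 70.0) -> bool:
    """Check the uniquely-mapped-read percentage against a cutoff."""
    pct = float(metrics["Uniquely mapped reads %"].rstrip("%"))
    return pct >= min_unique_pct

example_log = """\
          Number of input reads |  30000000
  Uniquely mapped reads number |  25000000
       Uniquely mapped reads % |  83.33%
"""
print(passes_qc(parse_star_log(example_log)))  # True at the 70% cutoff
```

The same pattern extends to splice-junction counts or mismatch rates: parse once, then assert every replicate clears the project's thresholds.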
Module 4: Quantification and Normalization Strategies
- Choose featureCounts or HTSeq for gene-level counts when prioritizing simplicity and compatibility with DE tools.
- Use transcript-level quantifiers (e.g., Salmon, kallisto) with alignment-free methods for improved isoform resolution.
- Apply TMM normalization in edgeR for library size and composition bias correction in differential expression.
- Compare normalization methods (e.g., TPM, FPKM, DESeq2’s median-of-ratios) based on downstream use case.
- Adjust for gene length and GC content when comparing expression across genes or studies.
- Retain raw counts for statistical testing, avoiding pre-normalized data that limits reanalysis options.
- Account for sequencing depth differences when integrating datasets from multiple batches or studies.
- Monitor the impact of normalization on variance structure using PCA before differential expression analysis.
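To make the normalization comparison concrete, here is a minimal sketch of DESeq2's median-of-ratios scheme applied to raw counts. This is a from-scratch illustration of the idea, not DESeq2's implementation; real usage should go through the package itself.

```python
import math

def median_of_ratios(counts: list[list[int]]) -> list[float]:
    """Per-sample size factors via the median-of-ratios scheme.

    counts: genes x samples matrix of raw counts. Genes with a zero
    count in any sample are skipped (their geometric mean is zero).
    """
    n_samples = len(counts[0])
    # Geometric mean of each gene across samples (log-space for stability).
    log_geo = []
    for row in counts:
        if min(row) > 0:
            log_geo.append(sum(math.log(c) for c in row) / n_samples)
        else:
            log_geo.append(None)
    factors = []
    for j in range(n_samples):
        ratios = sorted(math.log(counts[i][j]) - g
                        for i, g in enumerate(log_geo) if g is not None)
        mid = len(ratios) // 2
        med = ratios[mid] if len(ratios) % 2 else (ratios[mid - 1] + ratios[mid]) / 2
        factors.append(math.exp(med))
    return factors
```

Dividing each sample's counts by its size factor corrects for library size and composition bias while leaving the raw counts intact for statistical testing, in line with the "retain raw counts" principle above.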
Module 5: Differential Expression Analysis
- Select DESeq2, edgeR, or limma-voom based on data distribution, sample size, and design complexity.
- Model batch effects as covariates in the design matrix to prevent false positives.
- Set significance thresholds using adjusted p-values (e.g., FDR < 0.05) and a minimum fold change (e.g., |log2FC| ≥ 1.0).
- Validate dispersion estimates and mean-variance trends to ensure model fit in count-based methods.
- Perform power analysis post-hoc to interpret non-significant results in underpowered studies.
- Use contrasts to test specific hypotheses (e.g., time-point comparisons, interaction effects).
- Generate MA and volcano plots with gene labels to communicate results to domain experts.
- Export ranked gene lists for pathway enrichment and prioritization in validation experiments.
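The dual FDR/fold-change thresholding above can be sketched with a hand-rolled Benjamini-Hochberg adjustment. In practice the DE package reports adjusted p-values itself; this illustration just makes the filtering logic explicit, and the cutoffs are the example values from the list.

```python
def bh_adjust(pvals: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in original order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adj = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):       # step-up from the largest p-value
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adj[i] = running_min
    return adj

def significant(pvals, log2fc, fdr=0.05, min_lfc=1.0):
    """Indices passing both the FDR and |log2 fold change| thresholds."""
    adj = bh_adjust(pvals)
    return [i for i in range(len(pvals))
            if adj[i] < fdr and abs(log2fc[i]) >= min_lfc]
```

Genes that pass only one of the two filters are exactly the ones worth labeling on MA and volcano plots, since they are where readers most often over-interpret.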
Module 6: Functional Enrichment and Pathway Analysis
- Map gene identifiers consistently across databases (e.g., Ensembl, Entrez, HGNC) to avoid annotation mismatches.
- Select background gene sets that reflect detection capability (e.g., expressed genes) rather than the whole genome.
- Compare over-representation analysis (ORA) with gene set enrichment analysis (GSEA) based on data continuity.
- Adjust for gene length bias in enrichment results when using RNA-seq data with positional biases.
- Use curated pathway databases (e.g., Reactome, MSigDB) with version-controlled annotations.
- Interpret enrichment results in context of tissue-specific expression and known biological roles.
- Validate enrichment findings with orthogonal data (e.g., protein levels, phenotypic assays).
- Report multiple testing correction methods applied to enrichment p-values (e.g., FDR, Bonferroni).
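The ORA branch of the comparison above reduces to a hypergeometric test against a detection-aware background. The sketch below makes that explicit; `ora_pvalue` is a hypothetical helper, and restricting every set to `background` is the key step that implements the "expressed genes, not whole genome" advice.

```python
from math import comb

def hypergeom_sf(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(N population, K successes, n draws)."""
    total = comb(N, n)
    return sum(comb(K, x) * comb(N - K, n - x)
               for x in range(k, min(K, n) + 1)) / total

def ora_pvalue(de_genes: set, pathway: set, background: set) -> float:
    """Over-representation p-value for one pathway, restricted to the
    detected-gene background rather than the whole genome."""
    pw = pathway & background
    de = de_genes & background
    k = len(de & pw)
    return hypergeom_sf(k, len(pw), len(de), len(background))
```

Run once per pathway, the resulting p-values then feed into the multiple-testing correction that the last bullet says must be reported.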
Module 7: Single-Cell RNA-Seq Analysis Pipeline
- Set UMI and cell barcode thresholds to distinguish real cells from ambient RNA and empty droplets.
- Apply doublet detection algorithms (e.g., Scrublet, DoubletFinder) in droplet-based scRNA-seq data.
- Select dimensionality reduction methods (PCA, UMAP, t-SNE) based on interpretability and computational load.
- Choose clustering resolution parameters to balance granularity and biological coherence.
- Annotate cell types using marker genes from reference atlases or literature, avoiding over-clustering artifacts.
- Correct batch effects across samples using integration methods (e.g., Harmony, Seurat’s CCA) without removing biological variation.
- Filter mitochondrial gene percentage and total UMI counts to remove low-quality or stressed cells.
- Validate pseudotime inference results with known differentiation markers and trajectory topology.
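The UMI-count and mitochondrial-content filters above can be sketched as a per-cell predicate. The thresholds below (500 UMIs, 15% mitochondrial) are illustrative assumptions; in practice they are tuned by inspecting the QC metric distributions first.

```python
# Sketch: per-cell QC on total UMIs and mitochondrial fraction.
# Threshold values are hypothetical examples, not recommendations.

def keep_cell(total_umis: int, mito_umis: int,
              min_umis: int = 500, max_mito_frac: float = 0.15) -> bool:
    """True if the cell passes minimum-depth and mitochondrial-content QC."""
    if total_umis < min_umis:
        return False    # likely an empty droplet or low-quality cell
    # A high mitochondrial fraction suggests a stressed or lysed cell.
    return mito_umis / total_umis <= max_mito_frac

# Hypothetical barcodes mapped to (total UMIs, mitochondrial UMIs).
cells = {"AAACCTG": (4200, 300), "TTTGGTA": (350, 20), "CCGATAA": (5000, 1200)}
passed = {bc for bc, (t, m) in cells.items() if keep_cell(t, m)}
print(passed)  # only AAACCTG clears both filters
```

Doublet detection runs on the cells that survive this filter, since ambient-RNA droplets removed here would otherwise distort the doublet score distribution.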
Module 8: Data Integration and Multi-Omics Considerations
- Align genomic coordinates and gene annotations across data types (e.g., RNA-seq, ChIP-seq, methylation) using consistent reference builds.
- Use WGCNA or MOFA+ to identify co-expression modules correlated with epigenetic or clinical traits.
- Match sample IDs rigorously across omics layers, resolving discrepancies in naming or processing dates.
- Normalize each data modality separately before integration to preserve scale-specific variance.
- Apply statistical models (e.g., mediation analysis) to infer regulatory relationships between methylation and expression.
- Visualize integrated results using heatmaps with dendrograms or Circos plots for cross-omic interactions.
- Assess data missingness patterns in multi-omics datasets and apply imputation cautiously.
- Document provenance of each dataset to ensure reproducibility in joint analyses.
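The cross-layer sample-ID matching above often comes down to canonicalizing IDs before comparing them. The sketch below assumes a specific discrepancy pattern (case, separators, a trailing YYYYMMDD processing-date suffix); real cohorts need the normalization rules audited against their own naming conventions.

```python
import re

def canonical_id(sample_id: str) -> str:
    """Normalize a sample ID: uppercase, drop a trailing processing-date
    suffix like '_20230115' (an assumed pattern), strip separators."""
    sid = sample_id.strip().upper()
    sid = re.sub(r"[_-]20\d{6}$", "", sid)   # drop YYYYMMDD suffix
    return re.sub(r"[^A-Z0-9]", "", sid)

def match_layers(rnaseq_ids, methylation_ids):
    """Intersect two omics layers on canonical IDs; report mismatches."""
    rna = {canonical_id(s): s for s in rnaseq_ids}
    meth = {canonical_id(s): s for s in methylation_ids}
    shared = sorted(rna.keys() & meth.keys())
    only_rna = sorted(rna.keys() - meth.keys())
    only_meth = sorted(meth.keys() - rna.keys())
    return shared, only_rna, only_meth
```

IDs landing in the `only_*` lists are exactly the discrepancies that should be resolved manually and recorded in the provenance documentation rather than silently dropped.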
Module 9: Reproducibility, Governance, and Data Sharing
- Use version-controlled workflows (e.g., Snakemake, Nextflow) to ensure analysis reproducibility.
- Archive raw and processed data in public repositories (e.g., GEO, SRA) with MIAME/MINSEQE-compliant metadata.
- Apply controlled vocabulary (e.g., EDAM, OBI) in metadata to enhance dataset discoverability.
- Implement checksums for data files to detect corruption during transfer or storage.
- Define data access levels and consent restrictions for human-derived expression data.
- Document software versions, parameters, and environment configurations using containerization (e.g., Docker).
- Structure project directories following standards (e.g., NIH Data Commons) for team collaboration.
- Conduct periodic audit trails of analysis steps to support regulatory or publication review.
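The checksum practice above can be implemented with the standard library alone. A minimal sketch, streaming so that large FASTQ/BAM files never need to fit in memory:

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: str, expected_hex: str) -> bool:
    """Compare against a checksum recorded at deposit time
    (e.g., in a transfer manifest)."""
    return sha256_of_file(path) == expected_hex
```

Recording these digests in the project's metadata at deposit time turns later transfer or storage corruption into a detectable, auditable event.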