This curriculum spans the full bioinformatics workflow, from experimental design to translational handoff. In scope it resembles a multi-phase research program, integrating data generation, computational analysis, and cross-functional collaboration across wet and dry labs.
Module 1: Defining Biological Objectives and Study Design
- Select appropriate tissue types and developmental stages for sampling based on the biological hypothesis, balancing relevance with availability and ethical constraints.
- Determine case-control or time-series experimental design depending on whether the goal is differential expression or dynamic gene behavior analysis.
- Establish sample size requirements using power calculations informed by expected effect sizes and variability from pilot data or literature.
- Decide between bulk RNA-seq and single-cell RNA-seq based on cellular heterogeneity concerns and resolution needs.
- Negotiate access to clinical metadata while complying with patient privacy regulations and institutional review board requirements.
- Coordinate with wet-lab teams to standardize collection, preservation, and shipping protocols across multiple sites.
- Define primary and secondary endpoints for clustering outcomes, such as biomarker identification or pathway enrichment.
- Document experimental design decisions in a data management plan to ensure reproducibility and audit readiness.
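The power calculation mentioned above can be sketched with the standard normal-approximation formula for a two-sided, two-sample comparison; the effect size here (Cohen's d = 0.5) is illustrative, not a recommendation, and real studies should use pilot-data estimates of variability.

```python
import math
from scipy.stats import norm

def samples_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided two-sample comparison,
    using n = 2 * ((z_{1-alpha/2} + z_{1-beta}) / d)^2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# A moderate effect size at 80% power and alpha = 0.05
n = samples_per_group(0.5)
```

The normal approximation slightly underestimates n relative to an exact t-test calculation; dedicated tools refine this with a t-distribution correction.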
Module 2: Data Acquisition and Quality Control
- Validate raw sequencing data integrity using FastQC and verify expected read lengths, GC content, and adapter contamination levels.
- Implement automated pipelines to flag samples with low sequencing depth or high ribosomal RNA content for exclusion.
- Compare alignment rates across samples using STAR or HISAT2 to detect batch effects or technical outliers.
- Apply sample-level filtering thresholds based on gene detection rates and total read counts to remove low-quality libraries.
- Integrate external datasets only after confirming compatibility in library preparation, sequencing platform, and read length.
- Document all quality control decisions and thresholds in a standardized report for audit and replication.
- Use principal component analysis (typically on log-transformed counts) to identify unintended sources of variation such as sex or batch.
- Establish a data freeze point after QC to prevent uncontrolled reprocessing during downstream analysis.
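The automated sample-flagging step above can be sketched as a simple threshold pass over per-sample QC metrics; the metric names, values, and cutoffs here are hypothetical placeholders, since real thresholds depend on the protocol and sequencing platform.

```python
import pandas as pd

# Hypothetical per-sample QC metrics collected after FastQC/alignment
qc = pd.DataFrame({
    "sample": ["S1", "S2", "S3", "S4"],
    "total_reads_M": [42.0, 8.5, 37.2, 29.8],   # millions of reads
    "rrna_fraction": [0.04, 0.06, 0.31, 0.05],  # fraction of rRNA reads
}).set_index("sample")

MIN_DEPTH_M = 20.0  # illustrative minimum sequencing depth
MAX_RRNA = 0.20     # illustrative maximum tolerated rRNA fraction

qc["fail_depth"] = qc["total_reads_M"] < MIN_DEPTH_M
qc["fail_rrna"] = qc["rrna_fraction"] > MAX_RRNA
flagged = qc.index[qc["fail_depth"] | qc["fail_rrna"]].tolist()
```

Recording the boolean columns alongside the flag list makes the exclusion rationale auditable, in line with the standardized QC report above.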
Module 3: Preprocessing and Normalization Strategies
- Choose between TPM, FPKM, and DESeq2's median-of-ratios depending on downstream use in clustering versus differential expression.
- Apply log-transformation with pseudocount adjustment, evaluating the impact on distribution symmetry and outlier sensitivity.
- Remove lowly expressed genes using mean-variance trends and minimum counts-per-million thresholds.
- Correct for library size variation using TMM normalization when combining datasets from different sources.
- Assess the influence of housekeeping genes on normalization stability across conditions.
- Decide whether to preserve or remove batch effects at this stage based on study objectives and confounding risks.
- Implement gene filtering based on variance across samples to reduce noise before clustering.
- Validate normalization efficacy using density plots and MA plots to confirm centering and spread consistency.
Module 4: Dimensionality Reduction and Feature Selection
- Compare PCA, t-SNE, and UMAP embeddings to assess preservation of global versus local structure in gene expression space.
- Select the number of principal components using elbow plots or cumulative variance thresholds, typically 80–90%.
- Evaluate feature stability in high-variance gene lists across resampling iterations to prevent overfitting.
- Apply surrogate variable analysis (SVA) to capture hidden confounders not accounted for in experimental design.
- Use mutual information or correlation-based filtering to remove redundant genes before clustering.
- Validate dimensionality reduction outputs by overlaying known cell type markers or experimental conditions.
- Balance computational efficiency and interpretability when choosing between linear and nonlinear methods.
- Store reduced feature sets with metadata linking back to original gene identifiers and selection criteria.
Module 5: Clustering Algorithm Selection and Execution
- Choose between k-means, hierarchical, and DBSCAN clustering based on expected cluster shapes and density distribution.
- Determine optimal k using the gap statistic, silhouette score, or elbow method, validating across multiple metrics.
- Run consensus clustering with bootstrapping to assess cluster stability and reduce algorithmic sensitivity.
- Adjust distance metrics (Euclidean, Pearson, Spearman) based on data distribution and biological interpretability.
- Handle outliers by either isolating them or using robust clustering methods that accommodate noise.
- Parallelize clustering jobs across gene subsets or parameter grids to manage computational load.
- Preserve intermediate clustering results for parameter tuning and comparative analysis.
- Implement reproducible execution using containerized environments with fixed random seeds.
Module 6: Cluster Validation and Biological Interpretation
- Calculate cluster cohesion and separation using within-cluster sum of squares and between-cluster distances.
- Compare clustering solutions using adjusted Rand index when integrating multiple algorithms or parameters.
- Annotate clusters with overrepresented GO terms using hypergeometric tests and correct for multiple testing.
- Map clusters to known pathways using KEGG or Reactome and evaluate enrichment significance with FDR thresholds.
- Identify hub genes within clusters using intramodular connectivity measures from WGCNA.
- Validate cluster robustness by subsampling genes and recalculating membership consistency.
- Integrate transcription factor binding data to infer regulatory drivers of cluster-specific expression.
- Flag clusters with ambiguous or mixed biological signatures for re-evaluation or splitting.
Module 7: Cross-Dataset Integration and Reproducibility
- Apply ComBat or Harmony to correct batch effects while preserving biological variation across datasets.
- Use reciprocal PCA or CCA to align gene expression spaces from different studies or platforms.
- Validate cluster conservation by projecting new data onto existing clustering models using label transfer.
- Assess generalizability of clusters by testing enrichment in independent cohorts with similar phenotypes.
- Document version control for gene annotations and reference databases to ensure consistency.
- Standardize gene identifiers across datasets using biomaRt or UniProt mapping with conflict resolution rules.
- Archive processed data matrices and clustering labels in standardized formats (e.g., HDF5, Seurat objects).
- Implement checksums and metadata logs to track data transformations across integration steps.
Module 8: Reporting and Translational Handoff
- Generate cluster-specific gene lists with fold changes, p-values, and functional annotations for wet-lab validation.
- Produce publication-ready heatmaps with dendrograms, sample annotations, and color-coded metadata.
- Export interactive visualizations using Plotly or Shiny for stakeholder exploration.
- Define minimal reporting standards including clustering parameters, software versions, and preprocessing steps.
- Prepare data packages for deposition in GEO or ArrayExpress with MIAME/MINSEQE-compliant metadata.
- Coordinate with biologists to prioritize candidate genes for knockout or overexpression assays.
- Document limitations such as potential overclustering or underrepresentation of rare cell types.
- Establish a change log for all analytical updates to support audit and regulatory review.
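The minimal reporting standards above can be captured as a machine-readable provenance manifest shipped alongside the cluster labels; the parameter values and preprocessing steps listed here are hypothetical examples of what such a record might contain.

```python
import json
import platform

# Hypothetical provenance record accompanying a clustering result
report = {
    "clustering": {"algorithm": "k-means", "k": 3,
                   "distance": "euclidean", "random_seed": 0},
    "preprocessing": ["CPM normalization", "log2(x + 1)",
                      "top 2000 variable genes"],
    "software": {"python": platform.python_version()},
}
# Sorted keys and fixed indentation keep the manifest diff-friendly
manifest = json.dumps(report, indent=2, sort_keys=True)
```

Committing the manifest to the same change log as the analysis code ties each result to the exact parameters and software versions that produced it.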