This curriculum covers the full lifecycle of a multi-workshop bioinformatics project: an internal capability program for establishing end-to-end microarray analysis in a research organisation, from experimental design through regulatory-compliant data sharing.
Module 1: Experimental Design and Sample Selection for Microarray Studies
- Determine appropriate sample size using power analysis based on expected effect size and biological variability in pilot data.
- Select matched case-control pairs or randomized cohorts to minimize confounding in differential expression analysis.
- Define inclusion and exclusion criteria for patient-derived samples considering comorbidities, medication use, and sample collection timing.
- Balance batch effects by randomizing sample processing order across experimental groups.
- Decide between one-color and two-color microarray platforms based on experimental goals and available reference samples.
- Document metadata rigorously, including tissue preservation method, RNA extraction protocol, and patient demographics for reproducibility.
- Integrate sex, age, and batch variables as covariates during design to enable downstream adjustment.
- Plan replicate structure—technical vs biological—based on variance components estimated from prior studies.
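The power analysis in the first bullet can be sketched with the standard normal approximation for a two-sample t-test; the function name and defaults below are illustrative, not from any specific package, and a t-distribution correction (as in R's pwr package) would add a few samples:

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided two-sample t-test
    via the normal approximation n = 2 * ((z_a + z_b) / d)^2,
    where d is Cohen's d (group difference in SD units, ideally
    estimated from pilot data)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_b = NormalDist().inv_cdf(power)          # power quantile
    return ceil(2 * ((z_a + z_b) / effect_size) ** 2)

# detecting a one-SD shift at 80% power needs ~16 samples per group
```

Halving the detectable effect size roughly quadruples the required n, which is why pilot estimates of biological variability matter so much at the design stage.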
Module 2: Microarray Platform Selection and Data Acquisition
- Evaluate probe content coverage for target genes of interest across Affymetrix, Illumina, and Agilent platforms.
- Compare probe design specificity and cross-hybridization risks using BLAST alignment against the reference genome.
- Assess dynamic range and sensitivity of platforms for low-abundance transcripts in the tissue type under study.
- Negotiate data format delivery (e.g., CEL files, IDAT files) with core facility to retain raw data access.
- Validate scanner calibration logs and PMT settings to ensure signal linearity across arrays.
- Implement checksum verification for data transfer from the array core facility to local storage.
- Establish naming conventions for samples that encode experimental group, batch, and processing date.
- Configure automated file ingestion pipelines to parse vendor-specific file structures upon receipt.
Module 3: Raw Data Preprocessing and Quality Control
- Generate array-level QC metrics including mean intensity, background levels, and presence/absence calls.
- Identify outlier arrays using PCA and hierarchical clustering on unnormalized data.
- Apply RLE (Relative Log Expression) and NUSE (Normalized Unscaled Standard Errors) plots to detect hybridization artifacts.
- Filter out probes that fail detection (high detection p-values) in more than 50% of samples.
- Choose among RMA, MAS5, and GCRMA normalization based on background correction needs and platform type.
- Remove probes overlapping known SNPs or repetitive genomic regions to reduce false signals.
- Correct for spatial artifacts on arrays using pseudo-image inspection tools such as affyPLM.
- Document QC decisions in a standardized report for audit and replication.
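The detection-based probe filter can be sketched as follows; the data layout (probe ID mapped to per-sample detection p-values) and the 0.05 cutoff are assumptions for illustration:

```python
def filter_detected_probes(detection_p, p_cut=0.05, max_fail_frac=0.5):
    """Keep probes whose detection p-value passes p_cut in at least
    half the samples; probes undetected in more than max_fail_frac
    of samples are dropped as unreliably measured."""
    kept = []
    for probe, pvals in detection_p.items():
        fail_frac = sum(p >= p_cut for p in pvals) / len(pvals)
        if fail_frac <= max_fail_frac:
            kept.append(probe)
    return kept
```

The retained probe list should be recorded in the QC report alongside the cutoff used, so the filter can be reproduced exactly.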
Module 4: Normalization and Batch Effect Adjustment
- Apply quantile normalization for one-color arrays while preserving inter-array comparability.
- Use ComBat to adjust for known batch effects, protecting the experimental condition in the model matrix; no method can cleanly separate batch from condition when the two are fully confounded.
- Assess effectiveness of batch correction using PCA before and after adjustment.
- Retain uncorrected data as backup in case overcorrection removes biological signal.
- Apply surrogate variable analysis (SVA) to estimate and adjust for hidden confounders, reserving frozen SVA (fSVA) for applying a trained model to new samples.
- Validate normalization success with density plot alignment across arrays.
- Compare limma’s normalizeBetweenArrays with alternative methods for multi-batch studies.
- Exclude arrays with extreme GC-content bias post-normalization from downstream analysis.
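Quantile normalization (the first bullet) forces every array onto one shared intensity distribution: each value is replaced by the mean of the values holding the same rank across arrays. A minimal pure-Python sketch, ignoring the tie-averaging that production implementations such as limma's normalizeBetweenArrays handle:

```python
def quantile_normalize(columns):
    """columns: one list of intensities per array, equal lengths.
    Builds a reference distribution from rank-wise means, then maps
    each array's values onto it by rank (ties not averaged)."""
    n = len(columns[0])
    # mean of the k-th smallest value across all arrays, for each k
    ref = [sum(vals) / len(columns)
           for vals in zip(*(sorted(c) for c in columns))]
    out = []
    for col in columns:
        order = sorted(range(n), key=col.__getitem__)  # indices by rank
        new = [0.0] * n
        for rank, idx in enumerate(order):
            new[idx] = ref[rank]
        out.append(new)
    return out
```

After normalization every array has identical sorted values, which is exactly what the density-plot alignment check in this module should confirm.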
Module 5: Differential Expression Analysis and Statistical Modeling
- Fit linear models using limma with empirical Bayes moderation for small sample sizes.
- Incorporate covariates such as age, sex, and batch into the design matrix to control confounding.
- Define significance thresholds using adjusted p-values (FDR < 0.05) and fold-change cutoffs (|log2FC| > 1).
- Validate model assumptions using residual plots and mean-variance trends.
- Perform contrasts for multi-group designs (e.g., time-course or dose-response) using appropriate coefficient combinations.
- Apply duplicate correlation adjustment for repeated measurements on the same subject.
- Recognize that count-based tools such as edgeR and DESeq2 assume RNA-seq counts, not log-transformed microarray intensities, and keep limma as the primary engine for array data.
- Flag genes with high variability across replicates for manual inspection of probe behavior.
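The FDR threshold above relies on Benjamini-Hochberg adjustment, which limma's topTable applies by default (adjust.method="BH"); the step-up procedure itself is short enough to sketch:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values: scale each p-value by
    m/rank, then enforce monotonicity from the largest p downwards
    so adjusted values never exceed those of larger raw p-values."""
    m = len(pvals)
    order = sorted(range(m), key=pvals.__getitem__)  # ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Genes are then called significant when the adjusted p-value falls below 0.05 and |log2FC| exceeds 1, combining statistical and effect-size evidence.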
Module 6: Functional Enrichment and Pathway Analysis
- Select gene sets from MSigDB, Reactome, or KEGG based on biological relevance to the study domain.
- Apply over-representation analysis (ORA) using Fisher’s exact test with proper background gene filtering.
- Use GSEA (Gene Set Enrichment Analysis) to detect subtle coordinated changes in gene sets.
- Adjust enrichment p-values for multiple testing across gene sets using FDR or Bonferroni.
- Interpret leading-edge analysis in GSEA to identify core genes driving enrichment signals.
- Validate pathway results against independent datasets or literature evidence.
- Filter out broad or redundant gene sets (e.g., “regulation of cellular process”) to improve interpretability.
- Generate reproducible enrichment reports using RMarkdown or Quarto with embedded visualizations.
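The over-representation test above reduces to a hypergeometric tail probability (equivalently a one-sided Fisher's exact test): given the overlap between the differentially expressed list and a gene set within a filtered background, how surprising is an overlap that large? A sketch with exact integer arithmetic:

```python
from math import comb

def ora_pvalue(overlap, de_genes, set_genes, background):
    """P(X >= overlap) when de_genes are drawn without replacement
    from `background` genes, `set_genes` of which belong to the
    gene set (hypergeometric upper tail)."""
    upper = min(de_genes, set_genes)
    tail = sum(comb(set_genes, k) * comb(background - set_genes, de_genes - k)
               for k in range(overlap, upper + 1))
    return tail / comb(background, de_genes)
```

The choice of background matters: it should be the genes actually measured and retained after filtering, not the whole genome, or p-values will be inflated.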
Module 7: Data Integration with External Omics Datasets
- Map microarray probe IDs to consistent gene symbols using up-to-date annotation packages (e.g., hugene11sttranscriptcluster.db).
- Integrate microarray expression with TCGA RNA-seq data using cross-platform normalization methods.
- Perform correlation analysis between gene expression and methylation or CNV data from the same cohort.
- Use WGCNA to identify co-expression modules and correlate eigengenes with clinical traits.
- Align sample identifiers across datasets using harmonized patient IDs and remove mismatches.
- Address platform-specific biases when merging data from different microarray versions.
- Apply cross-dataset batch correction only after confirming biological comparability of tissues.
- Validate integrated findings using external validation cohorts from GEO or ArrayExpress.
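Probe-to-gene mapping usually leaves several probes per gene, so cross-platform integration needs a collapsing rule first. One common heuristic (offered, for example, by WGCNA's collapseRows) keeps the probe with the highest mean intensity per gene; the dict-based data layout here is an assumption:

```python
def collapse_to_genes(expression, probe_to_gene):
    """expression: probe ID -> intensities across samples.
    probe_to_gene: probe ID -> gene symbol (unmapped probes dropped).
    Returns gene -> intensities of its highest-mean probe."""
    best = {}  # gene -> (mean intensity, values)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe has no current annotation
        mean = sum(values) / len(values)
        if gene not in best or mean > best[gene][0]:
            best[gene] = (mean, values)
    return {gene: values for gene, (_, values) in best.items()}
```

Whatever rule is chosen, apply the same one to every dataset being merged, or the collapsing step itself becomes a platform-specific bias.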
Module 8: Visualization and Interpretation of Results
- Generate publication-ready heatmaps with dendrograms using pheatmap or ComplexHeatmap.
- Plot volcano plots with labeled significant genes and effect size thresholds.
- Create interactive visualizations using plotly for exploratory analysis by collaborators.
- Use gene expression trajectory plots for time-series or developmental studies.
- Integrate pathway diagrams with expression data using tools like Pathview.
- Ensure color palettes are colorblind-safe and printer-friendly for manuscript submission.
- Display probe-level data alongside gene-level summaries to expose probe discordance.
- Produce multi-panel figures that link differential expression, enrichment, and network results.
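The point classes behind a volcano plot come straight from the Module 5 thresholds; a small helper that assigns each gene's plotting category (the cutoffs are illustrative defaults, matching FDR < 0.05 and |log2FC| > 1):

```python
def volcano_class(log2fc, adj_p, fc_cut=1.0, p_cut=0.05):
    """Return the plotting category for one gene: 'up' or 'down'
    for significant genes past the fold-change cutoff, 'ns'
    (not significant) otherwise."""
    if adj_p < p_cut and log2fc > fc_cut:
        return "up"
    if adj_p < p_cut and log2fc < -fc_cut:
        return "down"
    return "ns"
```

Mapping these three classes to a colorblind-safe palette (e.g. blue/orange/grey rather than red/green) keeps the figure readable in print and for colorblind reviewers.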
Module 9: Data Sharing, Reproducibility, and Regulatory Compliance
- Deposit raw and processed data in GEO or ArrayExpress with MIAME-compliant metadata.
- Version control analysis scripts using Git with descriptive commit messages and branching.
- Containerize analysis pipelines using Docker to ensure computational reproducibility.
- Obtain IRB approval for data sharing when patient-derived samples are involved.
- De-identify clinical metadata according to HIPAA or GDPR standards before public release.
- Archive intermediate data files with checksums to enable pipeline re-execution.
- Use workflow managers like Snakemake or Nextflow to document data provenance.
- Respond to data access requests with data use agreements when required by funding bodies.
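De-identification under the HIPAA Safe Harbor rule means removing direct identifiers and top-coding ages at 90. A sketch over flat metadata records; the field names are assumptions about the local schema, and a real pipeline must cover all 18 Safe Harbor identifier categories, not this subset:

```python
# Hypothetical subset of direct-identifier fields in the local schema.
DIRECT_IDENTIFIERS = {"name", "mrn", "address", "phone", "email",
                      "date_of_birth"}

def deidentify(record):
    """Drop direct-identifier fields and top-code age (Safe Harbor
    groups everyone aged 90 or older into a single '90+' category)."""
    clean = {k: v for k, v in record.items()
             if k not in DIRECT_IDENTIFIERS}
    if isinstance(clean.get("age"), (int, float)) and clean["age"] >= 90:
        clean["age"] = "90+"
    return clean
```

Running this transformation as a scripted, version-controlled step (rather than manual spreadsheet edits) makes the de-identification itself auditable and reproducible.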