This curriculum covers the full lifecycle of a multi-workshop bioinformatics project: an internal capability program for establishing end-to-end microarray analysis in a research organisation, from experimental design through regulatory-compliant data sharing.
Module 1: Experimental Design and Sample Selection for Microarray Studies
- Determine appropriate sample size using power analysis based on expected effect size and biological variability in pilot data.
- Select matched case-control pairs or randomized cohorts to minimize confounding in differential expression analysis.
- Define inclusion and exclusion criteria for patient-derived samples considering comorbidities, medication use, and sample collection timing.
- Balance batch effects by randomizing sample processing order across experimental groups.
- Decide between one-color and two-color microarray platforms based on experimental goals and available reference samples.
- Document metadata rigorously, including tissue preservation method, RNA extraction protocol, and patient demographics for reproducibility.
- Integrate sex, age, and batch variables as covariates during design to enable downstream adjustment.
- Plan replicate structure—technical vs biological—based on variance components estimated from prior studies.
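The power analysis in the first bullet can be sketched with the standard normal approximation for a two-sample t-test; the function name and defaults below are illustrative, not from any specific package, and a t-distribution correction (as in R's pwr package) would add a few samples:

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided two-sample t-test
    via the normal approximation n = 2 * ((z_a + z_b) / d)^2,
    where d is Cohen's d (group difference in SD units, ideally
    estimated from pilot data)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_b = NormalDist().inv_cdf(power)          # power quantile
    return ceil(2 * ((z_a + z_b) / effect_size) ** 2)

# detecting a one-SD shift at 80% power needs ~16 samples per group
```

Halving the detectable effect size roughly quadruples the required n, which is why pilot estimates of biological variability matter so much at the design stage.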
Module 2: Microarray Platform Selection and Data Acquisition
- Evaluate probe content coverage for target genes of interest across Affymetrix, Illumina, and Agilent platforms.
- Compare probe design specificity and cross-hybridization risks using BLAST alignment against the reference genome.
- Assess dynamic range and sensitivity of platforms for low-abundance transcripts in the tissue type under study.
- Negotiate data format delivery (e.g., CEL files, IDAT files) with core facility to retain raw data access.
- Validate scanner calibration logs and PMT settings to ensure signal linearity across arrays.
- Implement checksum verification for data transfer from the array core facility to local storage.
- Establish naming conventions for samples that encode experimental group, batch, and processing date.
- Configure automated file ingestion pipelines to parse vendor-specific file structures upon receipt.
Module 3: Raw Data Preprocessing and Quality Control
- Generate array-level QC metrics including mean intensity, background levels, and presence/absence calls.
- Identify outlier arrays using PCA and hierarchical clustering on unnormalized data.
- Apply RLE (Relative Log Expression) and NUSE (Normalized Unscaled Standard Errors) plots to detect hybridization artifacts.
- Filter out probes that fail detection (high detection p-values) in more than 50% of samples.
- Choose among RMA, MAS5, and GCRMA normalization based on background correction needs and platform type.
- Remove probes overlapping known SNPs or repetitive genomic regions to reduce false signals.
- Correct for spatial artifacts on arrays using pseudo-image inspection tools such as affyPLM.
- Document QC decisions in a standardized report for audit and replication.
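The detection-based probe filter can be sketched as follows; the data layout (probe ID mapped to per-sample detection p-values) and the 0.05 cutoff are assumptions for illustration:

```python
def filter_detected_probes(detection_p, p_cut=0.05, max_fail_frac=0.5):
    """Keep probes whose detection p-value passes p_cut in at least
    half the samples; probes undetected in more than max_fail_frac
    of samples are dropped as unreliably measured."""
    kept = []
    for probe, pvals in detection_p.items():
        fail_frac = sum(p >= p_cut for p in pvals) / len(pvals)
        if fail_frac <= max_fail_frac:
            kept.append(probe)
    return kept
```

The retained probe list should be recorded in the QC report alongside the cutoff used, so the filter can be reproduced exactly.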
Module 4: Normalization and Batch Effect Adjustment
- Apply quantile normalization for one-color arrays while preserving inter-array comparability.
- Use ComBat to adjust for known batch effects, protecting the experimental condition in the model matrix; no method can cleanly separate batch from condition when the two are fully confounded.
- Assess effectiveness of batch correction using PCA before and after adjustment.
- Retain uncorrected data as backup in case overcorrection removes biological signal.
- Apply surrogate variable analysis (SVA) to estimate and adjust for hidden confounders, reserving frozen SVA (fSVA) for applying a trained model to new samples.
- Validate normalization success with density plot alignment across arrays.
- Compare limma’s normalizeBetweenArrays with alternative methods for multi-batch studies.
- Exclude arrays with extreme GC-content bias post-normalization from downstream analysis.
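Quantile normalization (the first bullet) forces every array onto one shared intensity distribution: each value is replaced by the mean of the values holding the same rank across arrays. A minimal pure-Python sketch, ignoring the tie-averaging that production implementations such as limma's normalizeBetweenArrays handle:

```python
def quantile_normalize(columns):
    """columns: one list of intensities per array, equal lengths.
    Builds a reference distribution from rank-wise means, then maps
    each array's values onto it by rank (ties not averaged)."""
    n = len(columns[0])
    # mean of the k-th smallest value across all arrays, for each k
    ref = [sum(vals) / len(columns)
           for vals in zip(*(sorted(c) for c in columns))]
    out = []
    for col in columns:
        order = sorted(range(n), key=col.__getitem__)  # indices by rank
        new = [0.0] * n
        for rank, idx in enumerate(order):
            new[idx] = ref[rank]
        out.append(new)
    return out
```

After normalization every array has identical sorted values, which is exactly what the density-plot alignment check in this module should confirm.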
Module 5: Differential Expression Analysis and Statistical Modeling
- Fit linear models using limma with empirical Bayes moderation for small sample sizes.
- Incorporate covariates such as age, sex, and batch into the design matrix to control confounding.
- Define significance thresholds using adjusted p-values (FDR < 0.05) and fold-change cutoffs (|log2FC| > 1).
- Validate model assumptions using residual plots and mean-variance trends.
- Perform contrasts for multi-group designs (e.g., time-course or dose-response) using appropriate coefficient combinations.
- Apply duplicate correlation adjustment for repeated measurements on the same subject.
- Recognize that count-based tools such as edgeR and DESeq2 assume RNA-seq counts, not log-transformed microarray intensities, and keep limma as the primary engine for array data.
- Flag genes with high variability across replicates for manual inspection of probe behavior.
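The FDR threshold above relies on Benjamini-Hochberg adjustment, which limma's topTable applies by default (adjust.method="BH"); the step-up procedure itself is short enough to sketch:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values: scale each p-value by
    m/rank, then enforce monotonicity from the largest p downwards
    so adjusted values never exceed those of larger raw p-values."""
    m = len(pvals)
    order = sorted(range(m), key=pvals.__getitem__)  # ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Genes are then called significant when the adjusted p-value falls below 0.05 and |log2FC| exceeds 1, combining statistical and effect-size evidence.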
Module 6: Functional Enrichment and Pathway Analysis
- Select gene sets from MSigDB, Reactome, or KEGG based on biological relevance to the study domain.
- Apply over-representation analysis (ORA) using Fisher’s exact test with proper background gene filtering.
- Use GSEA (Gene Set Enrichment Analysis) to detect subtle coordinated changes in gene sets.
- Adjust enrichment p-values for multiple testing across gene sets using FDR or Bonferroni.
- Interpret leading-edge analysis in GSEA to identify core genes driving enrichment signals.
- Validate pathway results against independent datasets or literature evidence.
- Filter out broad or redundant gene sets (e.g., “regulation of cellular process”) to improve interpretability.
- Generate reproducible enrichment reports using RMarkdown or Quarto with embedded visualizations.
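The over-representation test above reduces to a hypergeometric tail probability (equivalently a one-sided Fisher's exact test): given the overlap between the differentially expressed list and a gene set within a filtered background, how surprising is an overlap that large? A sketch with exact integer arithmetic:

```python
from math import comb

def ora_pvalue(overlap, de_genes, set_genes, background):
    """P(X >= overlap) when de_genes are drawn without replacement
    from `background` genes, `set_genes` of which belong to the
    gene set (hypergeometric upper tail)."""
    upper = min(de_genes, set_genes)
    tail = sum(comb(set_genes, k) * comb(background - set_genes, de_genes - k)
               for k in range(overlap, upper + 1))
    return tail / comb(background, de_genes)
```

The choice of background matters: it should be the genes actually measured and retained after filtering, not the whole genome, or p-values will be inflated.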
Module 7: Data Integration with External Omics Datasets
- Map microarray probe IDs to consistent gene symbols using up-to-date annotation packages (e.g., hugene11sttranscriptcluster.db).
- Integrate microarray expression with TCGA RNA-seq data using cross-platform normalization methods.
- Perform correlation analysis between gene expression and methylation or CNV data from the same cohort.
- Use WGCNA to identify co-expression modules and correlate eigengenes with clinical traits.
- Align sample identifiers across datasets using harmonized patient IDs and remove mismatches.
- Address platform-specific biases when merging data from different microarray versions.
- Apply cross-dataset batch correction only after confirming biological comparability of tissues.
- Validate integrated findings using external validation cohorts from GEO or ArrayExpress.
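Probe-to-gene mapping usually leaves several probes per gene, so cross-platform integration needs a collapsing rule first. One common heuristic (offered, for example, by WGCNA's collapseRows) keeps the probe with the highest mean intensity per gene; the dict-based data layout here is an assumption:

```python
def collapse_to_genes(expression, probe_to_gene):
    """expression: probe ID -> intensities across samples.
    probe_to_gene: probe ID -> gene symbol (unmapped probes dropped).
    Returns gene -> intensities of its highest-mean probe."""
    best = {}  # gene -> (mean intensity, values)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe has no current annotation
        mean = sum(values) / len(values)
        if gene not in best or mean > best[gene][0]:
            best[gene] = (mean, values)
    return {gene: values for gene, (_, values) in best.items()}
```

Whatever rule is chosen, apply the same one to every dataset being merged, or the collapsing step itself becomes a platform-specific bias.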
Module 8: Visualization and Interpretation of Results
- Generate publication-ready heatmaps with dendrograms using pheatmap or ComplexHeatmap.
- Plot volcano plots with labeled significant genes and effect size thresholds.
- Create interactive visualizations using plotly for exploratory analysis by collaborators.
- Use gene expression trajectory plots for time-series or developmental studies.
- Integrate pathway diagrams with expression data using tools like Pathview.
- Ensure color palettes are colorblind-safe and printer-friendly for manuscript submission.
- Display probe-level data alongside gene-level summaries to expose probe discordance.
- Produce multi-panel figures that link differential expression, enrichment, and network results.
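The point classes behind a volcano plot come straight from the Module 5 thresholds; a small helper that assigns each gene's plotting category (the cutoffs are illustrative defaults, matching FDR < 0.05 and |log2FC| > 1):

```python
def volcano_class(log2fc, adj_p, fc_cut=1.0, p_cut=0.05):
    """Return the plotting category for one gene: 'up' or 'down'
    for significant genes past the fold-change cutoff, 'ns'
    (not significant) otherwise."""
    if adj_p < p_cut and log2fc > fc_cut:
        return "up"
    if adj_p < p_cut and log2fc < -fc_cut:
        return "down"
    return "ns"
```

Mapping these three classes to a colorblind-safe palette (e.g. blue/orange/grey rather than red/green) keeps the figure readable in print and for colorblind reviewers.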
Module 9: Data Sharing, Reproducibility, and Regulatory Compliance
- Deposit raw and processed data in GEO or ArrayExpress with MIAME-compliant metadata.
- Version control analysis scripts using Git with descriptive commit messages and branching.
- Containerize analysis pipelines using Docker to ensure computational reproducibility.
- Obtain IRB approval for data sharing when patient-derived samples are involved.
- De-identify clinical metadata according to HIPAA or GDPR standards before public release.
- Archive intermediate data files with checksums to enable pipeline re-execution.
- Use workflow managers like Snakemake or Nextflow to document data provenance.
- Respond to data access requests with data use agreements when required by funding bodies.
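De-identification under the HIPAA Safe Harbor rule means removing direct identifiers and top-coding ages at 90. A sketch over flat metadata records; the field names are assumptions about the local schema, and a real pipeline must cover all 18 Safe Harbor identifier categories, not this subset:

```python
# Hypothetical subset of direct-identifier fields in the local schema.
DIRECT_IDENTIFIERS = {"name", "mrn", "address", "phone", "email",
                      "date_of_birth"}

def deidentify(record):
    """Drop direct-identifier fields and top-code age (Safe Harbor
    groups everyone aged 90 or older into a single '90+' category)."""
    clean = {k: v for k, v in record.items()
             if k not in DIRECT_IDENTIFIERS}
    if isinstance(clean.get("age"), (int, float)) and clean["age"] >= 90:
        clean["age"] = "90+"
    return clean
```

Running this transformation as a scripted, version-controlled step (rather than manual spreadsheet edits) makes the de-identification itself auditable and reproducible.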