This curriculum spans the full lifecycle of microarray analysis, equivalent in depth to a multi-phase bioinformatics project involving experimental planning, regulatory-grade data processing, cross-platform integration, and stakeholder-specific reporting within a research-intensive organisation.
Module 1: Experimental Design and Sample Selection
- Determine appropriate sample size using power analysis to detect biologically relevant expression differences while accounting for expected variability in tissue sources.
- Select matched controls and cases to minimize confounding variables such as age, sex, and comorbidities in clinical studies.
- Decide between one-color and two-color microarray platforms based on throughput needs, cost constraints, and availability of reference RNA.
- Implement randomization of sample processing order to reduce batch effects during hybridization and scanning.
- Establish criteria for sample exclusion due to poor RNA quality (e.g., RIN < 7) prior to array hybridization.
- Coordinate with clinical teams to ensure ethical compliance and proper annotation of patient-derived samples.
- Balance biological replicates versus technical replicates based on project budget and statistical requirements.
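The sample-size and randomization points above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: it uses the normal approximation for a two-sided, two-group comparison of a single gene, and the effect size (a log2 difference of 1.0), the standard deviation (0.7), and the function names are all hypothetical choices. A real study would usually tighten `alpha` to reflect multiple testing across probes.

```python
import math
import random
from statistics import NormalDist


def samples_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided, two-sample comparison.

    Uses the normal approximation n = 2 * ((z_alpha + z_beta) * sigma / delta)^2,
    where delta is the expression difference to detect (e.g. on the log2 scale)
    and sigma is the assumed per-group standard deviation.
    """
    if delta <= 0 or sigma <= 0:
        raise ValueError("delta and sigma must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


def randomized_run_order(sample_ids, seed):
    """Return a reproducible random processing order for the given samples."""
    order = list(sample_ids)
    random.Random(seed).shuffle(order)  # seeded so the order can be documented
    return order


# Hypothetical inputs: detect a log2 difference of 1.0 with SD 0.7.
n_per_group = samples_per_group(delta=1.0, sigma=0.7)
run_order = randomized_run_order(["case_1", "case_2", "ctrl_1", "ctrl_2"], seed=42)
```

Because the normal approximation ignores the extra uncertainty of estimating the variance, it tends to give slightly smaller numbers than a t-based calculation, so the result is best read as a lower bound.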
Module 2: Microarray Platform Selection and Data Acquisition
- Evaluate probe specificity and genome coverage when selecting between Affymetrix, Agilent, and Illumina platforms for a given organism.
- Negotiate access to proprietary array formats and ensure compatibility with institutional core facility equipment.
- Configure scanner settings (e.g., PMT voltage) to maximize signal dynamic range without saturation.
- Implement standardized protocols for RNA labeling, fragmentation, and hybridization to ensure reproducibility.
- Monitor hybridization efficiency using spike-in controls and assess spatial artifacts on raw image files.
- Validate probe performance by checking for cross-hybridization risks using in silico alignment tools.
- Document all instrument settings and reagent lots for audit and replication purposes.
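As a rough illustration of the saturation check implied by the scanner-settings point above, the sketch below computes the fraction of feature intensities at the ceiling of a 16-bit scan. The function names and the 0.1% rescan threshold are assumptions for the example, not a platform recommendation.

```python
SATURATION_16BIT = 65535  # largest value an unsigned 16-bit pixel can hold


def saturated_fraction(intensities, ceiling=SATURATION_16BIT):
    """Fraction of feature intensities at or above the scanner ceiling."""
    if not intensities:
        raise ValueError("intensities must not be empty")
    saturated = sum(1 for value in intensities if value >= ceiling)
    return saturated / len(intensities)


def needs_rescan(intensities, max_fraction=0.001):
    """Flag a scan when more than max_fraction of features are saturated.

    The default of 0.1% is a hypothetical threshold chosen for illustration.
    """
    return saturated_fraction(intensities) > max_fraction


# Hypothetical intensities from four features, two of them saturated.
example = [100, 65535, 200, 65535]
fraction = saturated_fraction(example)
```

A flagged scan would typically be repeated at a lower PMT voltage, with both settings recorded alongside the reagent lots.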
Module 3: Preprocessing and Quality Control
- Apply background correction methods (e.g., the convolution model used in RMA or the zone-based adjustment used in MAS5) based on array type and noise distribution characteristics.
- Identify outlier arrays using PCA plots, density distributions, and hierarchical clustering of raw intensities.
- Correct for spatial artifacts and grid misalignment during image gridding using platform-specific software.
- Implement quantile normalization for one-color arrays while preserving biological variation across samples.
- Assess RNA degradation effects by analyzing 3’/5’ probe intensity ratios for housekeeping genes.
- Filter low-intensity probes that fall below detection thresholds across multiple samples.
- Generate standardized QC reports using Bioconductor packages (e.g., arrayQualityMetrics) for team review.
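To make the quantile normalization step above concrete, here is a minimal pure-Python sketch. It assumes each sample is a list of probe intensities of equal length, and it breaks ties by position rather than averaging them, which is a simplification compared with established implementations.

```python
def quantile_normalize(columns):
    """Quantile-normalize samples so they share one intensity distribution.

    columns: list of samples, each a list of probe intensities (same length).
    Returns a new list of samples in the original probe order.
    Ties are broken by position rather than averaged (a simplification).
    """
    n_probes = len(columns[0])
    if any(len(col) != n_probes for col in columns):
        raise ValueError("all samples must have the same number of probes")

    # Mean intensity at each rank across the sorted samples.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(col[rank] for col in sorted_cols) / len(columns)
        for rank in range(n_probes)
    ]

    normalized = []
    for col in columns:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two hypothetical samples with three probes each.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

After normalization every sample contains the same set of values, so any remaining difference between samples lies in which probes occupy which ranks.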
Module 4: Normalization and Batch Effect Correction
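- Choose between global and intensity-dependent normalization methods based on whether MA plots show intensity-dependent bias (curvature or asymmetry around M = 0).
- Detect batch effects using surrogate variable analysis (SVA) when samples are processed in different labs or at different time points and the sources of unwanted variation are unrecorded or only partly known.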
- Apply ComBat to adjust for known batches while preserving biological signal in differential expression analysis.
- Validate correction efficacy by checking cluster separation in PCA before and after adjustment.
- Retain metadata on processing dates, personnel, and reagent lots to support batch modeling.
- Assess overcorrection risks when removing batch effects that may correlate with biological conditions.
- Document normalization parameters and software versions for reproducibility in regulatory contexts.
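The overcorrection risk noted above is easiest to see by cross-tabulating batch against biological condition before any adjustment. The sketch below is a simple check, not part of ComBat or SVA; the function names and sample labels are hypothetical.

```python
from collections import Counter, defaultdict


def batch_condition_table(batches, conditions):
    """Count samples per (batch, condition) pair from parallel label lists."""
    if len(batches) != len(conditions):
        raise ValueError("batches and conditions must have the same length")
    table = defaultdict(Counter)
    for batch, condition in zip(batches, conditions):
        table[batch][condition] += 1
    return {batch: dict(counts) for batch, counts in table.items()}


def confounded_batches(batches, conditions):
    """Return batches that contain samples from only one condition.

    In such batches the batch effect cannot be separated from the
    biological effect, so adjusting for batch may remove real signal.
    """
    table = batch_condition_table(batches, conditions)
    return sorted(batch for batch, counts in table.items() if len(counts) == 1)


# Hypothetical metadata: batch b2 contains only cases.
batches = ["b1", "b1", "b2", "b2"]
conditions = ["case", "control", "case", "case"]
flagged = confounded_batches(batches, conditions)
```

Running this on the recorded processing dates, personnel, and reagent lots gives an early warning before batch adjustment is applied.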
Module 5: Differential Expression Analysis
- Select statistical models (e.g., limma, SAM) based on sample size, design complexity, and variance stability.
- Define fold-change thresholds in conjunction with p-value adjustments to prioritize biologically meaningful genes.
- Apply empirical Bayes moderation of variances to improve stability in small sample studies.
- Adjust for multiple testing by controlling the false discovery rate (Benjamini-Hochberg) rather than the family-wise error rate (Bonferroni), which is usually too conservative across thousands of probes.
- Incorporate covariates (e.g., age, tumor stage) into linear models to isolate primary effects of interest.
- Validate findings using qRT-PCR on a subset of top differentially expressed genes.
- Flag genes with inconsistent probe set behavior for manual curation or exclusion.
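The Benjamini-Hochberg adjustment and the combined fold-change and p-value filter mentioned above can be sketched as follows. The cutoffs (absolute log2 fold change of at least 1, adjusted p-value of at most 0.05) and the function names are illustrative assumptions, not recommended defaults.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def select_genes(genes, log2_fold_changes, pvalues, lfc_cutoff=1.0, fdr_cutoff=0.05):
    """Keep genes passing both the fold-change and the adjusted p-value cutoff."""
    adjusted = benjamini_hochberg(pvalues)
    return [
        gene
        for gene, lfc, padj in zip(genes, log2_fold_changes, adjusted)
        if abs(lfc) >= lfc_cutoff and padj <= fdr_cutoff
    ]


# Hypothetical raw p-values for four probes.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

This sketch operates on p-values that have already been computed. The variance moderation step would normally be done upstream in a package such as limma.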
Module 6: Functional Enrichment and Pathway Analysis
- Select annotation databases (e.g., GO, KEGG, Reactome) based on organism and pathway coverage completeness.
- Resolve gene identifier discrepancies between array probes and pathway databases using mapping files.
- Choose between over-representation analysis (ORA) and gene set enrichment analysis (GSEA) based on hypothesis structure.
- Adjust significance thresholds for redundant or correlated pathways to avoid overinterpretation.
- Filter out broad GO terms (e.g., “cellular process”) that lack biological specificity.
- Integrate tissue-specific expression data to prioritize relevant pathways in interpretation.
- Visualize results using pathway diagrams with expression directionality and fold-change overlays.
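Over-representation analysis, as mentioned above, is commonly based on the hypergeometric test (equivalent to a one-sided Fisher's exact test). The sketch below computes that upper-tail probability with the standard library only. The function and argument names are hypothetical, and the counts in the example are made up.

```python
from math import comb


def ora_pvalue(universe_size, pathway_size, selected_size, overlap):
    """Upper-tail hypergeometric p-value for over-representation.

    universe_size: number of genes measured on the array (background)
    pathway_size:  number of background genes annotated to the pathway
    selected_size: number of differentially expressed genes
    overlap:       number of selected genes that are in the pathway
    Returns P(X >= overlap) under random selection without replacement.
    """
    if not (0 <= pathway_size <= universe_size and 0 <= selected_size <= universe_size):
        raise ValueError("pathway and selection must fit inside the universe")
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    upper = min(pathway_size, selected_size)
    total = comb(universe_size, selected_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, selected_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 genes measured, 3 in the pathway, 4 selected, 2 overlapping.
p_value = ora_pvalue(10, 3, 4, 2)
```

The choice of background matters: using only the genes represented on the array, rather than the whole genome, avoids inflating significance.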
Module 7: Data Integration and Multi-Omics Correlation
- Align microarray expression data with genomic variants (e.g., SNPs) to identify expression quantitative trait loci (eQTLs).
- Map probe locations to promoter regions when integrating with ChIP-seq or methylation data.
- Normalize data across platforms using Z-scores or rank-based methods for combined analysis.
- Apply canonical correlation analysis (CCA) to detect coordinated patterns between mRNA and protein levels.
- Resolve gene symbol conflicts across datasets using authoritative sources like HGNC.
- Use time-series microarray data to infer regulatory networks with dynamic Bayesian models.
- Flag discordant results between microarray and RNA-seq for technical or biological investigation.
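The Z-score and rank-based options for cross-platform normalization listed above can be illustrated with two small helpers. This is a sketch under the assumption that each call receives one gene's values across the samples of a single platform. The function names are hypothetical.

```python
from statistics import mean, stdev


def zscore(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        raise ValueError("cannot standardize values with zero variance")
    return [(value - mu) / sd for value in values]


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


# Hypothetical expression values for one gene on one platform.
standardized = zscore([1.0, 2.0, 3.0])
ranked = average_ranks([10, 20, 20, 5])
```

Ranks discard the scale of each platform entirely, which makes them robust to differences in dynamic range at the cost of losing magnitude information.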
Module 8: Data Archiving and Regulatory Compliance
- Annotate and format datasets according to MIAME guidelines for submission to public repositories (e.g., GEO, ArrayExpress).
- Encrypt and store raw scanner images and probe-level intensity files (e.g., .TIF, .DAT, .CEL) for audit and reanalysis requirements.
- Obtain IRB or ethics committee approval documentation, and confirm that consent terms permit sharing human-derived expression data under GDPR or HIPAA.
- Define data retention schedules for raw and processed files based on institutional policies.
- Assign persistent identifiers (DOIs) to datasets to support citation and reproducibility.
- Document all preprocessing steps in metadata using controlled vocabularies (e.g., EDAM ontology).
- Restrict access to sensitive datasets using tiered permission systems in institutional databases.
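One way to support the audit and reanalysis requirements above is to record a checksum for every archived raw file, so that later copies can be verified. The sketch below covers only integrity checking, not encryption or access control. The function names and the default suffix list are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, suffixes=(".cel", ".tif")):
    """Map each raw file under directory to its SHA-256 digest.

    Keys are paths relative to directory. Suffix matching is
    case-insensitive, and files with other extensions are skipped.
    """
    root = Path(directory)
    return {
        str(path.relative_to(root)): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file() and path.suffix.lower() in suffixes
    }
```

Storing the manifest alongside the dataset's persistent identifier lets a later reanalysis confirm that the files are unchanged.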
Module 9: Visualization and Stakeholder Reporting
- Design publication-ready heatmaps with dendrograms and annotation tracks using ComplexHeatmap in R.
- Generate interactive dashboards for clinicians using Shiny to explore gene expression patterns.
- Select color palettes that are perceptually uniform and accessible to colorblind users.
- Summarize key findings in static summary figures for inclusion in regulatory dossiers.
- Balance detail and clarity when annotating volcano plots with gene labels.
- Produce dynamic reports using R Markdown or Quarto to link analysis code with visual output.
- Validate figure resolution and font sizes for both digital and print publication formats.
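For the volcano plot labelling trade-off above, one simple approach is to label only the strongest hits and cap their number. The sketch below assumes each result is a dictionary with `gene`, `log2fc`, and `padj` keys. Those key names, the cutoffs, and the gene names are hypothetical.

```python
def volcano_labels(results, max_labels=10, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Choose which genes to label on a volcano plot.

    Keeps genes passing both cutoffs, orders them by adjusted p-value
    (ties broken by larger absolute fold change), and returns at most
    max_labels gene names.
    """
    hits = [
        row
        for row in results
        if abs(row["log2fc"]) >= lfc_cutoff and row["padj"] <= padj_cutoff
    ]
    hits.sort(key=lambda row: (row["padj"], -abs(row["log2fc"])))
    return [row["gene"] for row in hits[:max_labels]]


# Hypothetical differential expression results.
results = [
    {"gene": "GENE_A", "log2fc": 2.5, "padj": 0.001},
    {"gene": "GENE_B", "log2fc": -1.2, "padj": 0.0001},
    {"gene": "GENE_C", "log2fc": 0.4, "padj": 0.00001},
    {"gene": "GENE_D", "log2fc": 3.0, "padj": 0.2},
    {"gene": "GENE_E", "log2fc": -2.0, "padj": 0.01},
]
labels = volcano_labels(results, max_labels=2)
```

Capping the label count keeps the figure readable, and the full ranked list can still be provided as a supplementary table.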