This curriculum spans the full analytical lifecycle of DNA methylation studies, comparable in scope to a multi-phase bioinformatics consulting engagement supporting epigenetic discovery projects from experimental design through data sharing.
Module 1: Fundamentals of DNA Methylation Biology and Epigenetic Mechanisms
- Select appropriate CpG island definitions based on genomic context (e.g., promoter vs. intergenic regions) when annotating methylation sites.
- Determine the biological relevance of 5-methylcytosine (5mC) versus 5-hydroxymethylcytosine (5hmC) in tissue-specific gene regulation.
- Evaluate the impact of methylation at different genomic elements (promoters, enhancers, gene bodies) on transcriptional outcomes.
- Assess the role of DNMT and TET enzyme families in dynamic methylation changes during cellular differentiation.
- Integrate histone modification data to interpret bivalent chromatin states in stem cell and cancer epigenomes.
- Decide when to include non-CpG methylation (CpA, CpT, CpC) in analyses based on cell type (e.g., neurons, embryonic cells).
- Interpret allele-specific methylation in the context of genomic imprinting and X-chromosome inactivation.
- Account for age-related methylation drift in longitudinal study designs and control selection.
Module 2: Experimental Design and Platform Selection for Methylation Profiling
- Choose between whole-genome bisulfite sequencing (WGBS), reduced representation bisulfite sequencing (RRBS), and methylation arrays based on budget, coverage needs, and sample type.
- Optimize input DNA quantity and quality thresholds for bisulfite conversion efficiency across degraded samples (e.g., FFPE).
- Balance multiplexing capacity with per-sample depth when designing Illumina TruSeq Methyl Capture panels.
- Implement spike-in controls (e.g., unmethylated lambda DNA) to monitor bisulfite conversion rates.
- Design case-control or cohort studies with appropriate matching for age, sex, and cell composition.
- Decide on batch processing strategies to minimize technical variation in multi-center studies.
- Select between single-end and paired-end sequencing for RRBS based on insert size distribution and alignment accuracy.
- Validate array-based findings (e.g., 450K/EPIC) with targeted bisulfite sequencing in follow-up experiments.
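
The trade-off between multiplexing capacity and per-sample depth noted in this module reduces to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 100 Gb lane yield, 100 Mb target size, 80% on-target fraction, and 20% duplicate rate are assumed placeholder values, not vendor specifications.

```python
import math


def mean_target_depth(lane_yield_gb, n_samples, target_size_mb,
                      on_target_frac, duplicate_frac):
    """Approximate mean depth per sample across the capture target."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    # Bases that are both on target and not PCR duplicates.
    usable_bases = lane_yield_gb * 1e9 * on_target_frac * (1.0 - duplicate_frac)
    return usable_bases / (n_samples * target_size_mb * 1e6)


def max_samples_per_lane(lane_yield_gb, min_depth, target_size_mb,
                         on_target_frac, duplicate_frac):
    """Largest multiplexing level that still reaches min_depth per sample."""
    depth_single = mean_target_depth(lane_yield_gb, 1, target_size_mb,
                                     on_target_frac, duplicate_frac)
    return math.floor(depth_single / min_depth)


# Assumed example values (hypothetical run, not a recommendation).
depth = mean_target_depth(100, 10, 100, 0.8, 0.2)
capacity = max_samples_per_lane(100, 30, 100, 0.8, 0.2)
print(f"{depth:.1f}x per sample at 10-plex; up to {capacity} samples at 30x")
```

In practice the on-target and duplicate fractions should come from a pilot run, since both vary with input quality (FFPE samples in particular).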
Module 3: Raw Data Preprocessing and Quality Control
- Trim adapter sequences and low-quality bases from bisulfite-converted reads using tools like Trim Galore! with proper parameter tuning.
- Assess bisulfite conversion efficiency by calculating C-to-T conversion rates in non-CpG contexts.
- Filter out low-confidence alignments and flag samples with poor mapping efficiency against the bisulfite-converted reference genome (e.g., Bismark with Bowtie 2).
- Remove PCR duplicates using molecular barcodes (UMIs) or alignment-based methods depending on library prep.
- Generate sample-level QC metrics (coverage depth, CpG coverage uniformity, mitochondrial read proportion) for outlier detection.
- Compare beta-value distributions across samples to detect technical artifacts or batch effects.
- Use control probes on methylation arrays to assess background signal and dye bias.
- Apply sex checks using X/Y chromosome methylation patterns to verify sample identity against recorded metadata.
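
The conversion-efficiency check in this module can be sketched as follows. This is a minimal example that assumes per-sample counts of methylated and unmethylated calls in CHG and CHH contexts are already available (for instance from an aligner's summary report); the function names and the 0.99 cutoff are hypothetical choices. It also assumes true non-CpG methylation is negligible, which does not hold for neurons or embryonic cells, where a spike-in control is the safer estimate.

```python
def bisulfite_conversion_rate(meth_chg, unmeth_chg, meth_chh, unmeth_chh):
    """Fraction of non-CpG cytosines read as converted (unmethylated)."""
    total = meth_chg + unmeth_chg + meth_chh + unmeth_chh
    if total == 0:
        raise ValueError("no non-CpG cytosine calls available")
    return (unmeth_chg + unmeth_chh) / total


def flag_low_conversion(rates_by_sample, min_rate=0.99):
    """Return sorted sample IDs whose conversion rate falls below min_rate."""
    return sorted(s for s, r in rates_by_sample.items() if r < min_rate)


# Hypothetical counts for two samples.
rates = {
    "sample_A": bisulfite_conversion_rate(10, 990, 20, 1980),
    "sample_B": bisulfite_conversion_rate(60, 940, 90, 1910),
}
print(rates, flag_low_conversion(rates, min_rate=0.98))
```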
Module 4: Alignment, Methylation Calling, and Data Normalization
- Select alignment tools optimized for bisulfite data (e.g., Bismark, BSMAP) based on speed and sensitivity requirements.
- Resolve ambiguous alignments in repetitive regions by adjusting seed length and mismatch tolerance.
- Calculate beta and M-values from raw methylation counts, choosing appropriate metrics for downstream analysis.
- Apply functional normalization (FunNorm) to remove technical variation captured by control probes, and BMIQ to correct type I versus type II probe design bias in array data.
- Use reference-based or reference-free methods (e.g., RefFreeEWAS) to adjust for cell type heterogeneity in whole blood samples.
- Implement quantile normalization cautiously, preserving biological variation in heterogeneous tissue samples.
- Handle missing methylation data using imputation methods (e.g., k-nearest-neighbor imputation with impute.knn) or exclusion based on missingness thresholds.
- Integrate multiple batches using ComBat or SVA while preserving known biological covariates.
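
The beta and M-value calculation in this module can be illustrated with a short sketch. It assumes methylated and unmethylated signal intensities (or read counts) per CpG; the offset of 100 follows the common array convention and should be set to 0 for sequencing counts. Function names and the clipping constant are hypothetical.

```python
import math


def beta_value(meth, unmeth, offset=100.0):
    """Beta = M / (M + U + offset); negative intensities are floored at 0."""
    meth = max(meth, 0.0)
    unmeth = max(unmeth, 0.0)
    denom = meth + unmeth + offset
    if denom == 0:
        raise ValueError("no signal and no offset; beta is undefined")
    return meth / denom


def m_value_from_beta(beta, eps=1e-6):
    """Logit (base 2) of beta, clipped away from 0 and 1 to stay finite."""
    clipped = min(max(beta, eps), 1.0 - eps)
    return math.log2(clipped / (1.0 - clipped))


def beta_from_m_value(m):
    """Inverse transform: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


b = beta_value(300, 100, offset=0)  # sequencing-style counts, no offset
print(b, m_value_from_beta(b))
```

Beta-values are easier to interpret biologically, while M-values are closer to homoscedastic and are usually the better input for linear models.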
Module 5: Differential Methylation Analysis and Region-Based Detection
- Choose between site-level (e.g., limma, methylKit) and region-based (e.g., DMRcate, bumphunter, DSS) methods based on study hypothesis.
- Define differentially methylated positions (DMPs) using thresholds for delta-beta, p-value, and FDR-adjusted significance.
- Aggregate adjacent CpGs into differentially methylated regions (DMRs) using sliding windows or clustering algorithms.
- Adjust statistical models for confounding variables such as age, batch, and estimated cell proportions.
- Validate DMRs using permutation testing to assess significance under the null distribution.
- Interpret directionality of methylation changes (hyper- vs. hypomethylation) in the context of gene regulatory elements.
- Compare effect sizes across genomic contexts to prioritize functionally relevant DMRs.
- Apply region-set enrichment analysis (e.g., LOLA, GREAT) to identify annotations and pathways enriched for methylation changes.
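
The DMP definition in this module combines an effect-size threshold with FDR control. The sketch below implements the Benjamini-Hochberg adjustment in plain Python and applies both filters; the per-CpG p-values are assumed to come from an upstream model, and the function names and the 0.1 delta-beta and 0.05 FDR cutoffs are hypothetical defaults.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def call_dmps(delta_betas, pvalues, min_delta=0.1, fdr=0.05):
    """Indices of CpGs passing both the |delta-beta| and FDR thresholds."""
    qvalues = benjamini_hochberg(pvalues)
    return [i for i, (d, q) in enumerate(zip(delta_betas, qvalues))
            if abs(d) >= min_delta and q <= fdr]


# Hypothetical results for four CpGs.
deltas = [0.20, 0.05, -0.15, 0.30]
pvals = [0.01, 0.04, 0.03, 0.005]
print(call_dmps(deltas, pvals))
```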
Module 6: Integration with Transcriptomic and Genomic Data
- Correlate cis methylation with gene expression using matched RNA-seq and methylation data from the same samples.
- Identify methylation quantitative trait loci (meQTLs) by integrating SNP genotypes with methylation levels.
- Overlay DMRs with chromatin accessibility (ATAC-seq) peaks to infer regulatory potential.
- Use promoter methylation to stratify expression outliers in cancer samples (e.g., TCGA).
- Assess concordance between methylation silencing and copy number loss in tumor suppressor genes.
- Build integrative multi-omic models (shared latent factors or joint clusters) using tools like MOFA or iCluster.
- Validate predicted regulatory relationships using public databases (e.g., ENCODE, Roadmap Epigenomics).
- Resolve discordant signals (e.g., hypermethylation with increased expression) by considering alternative promoters or enhancers.
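
The methylation-expression correlation in this module can be sketched with a rank-based (Spearman) coefficient, which is less sensitive to the bounded, often bimodal distribution of beta-values than Pearson correlation. This is a minimal pure-Python illustration; the function names and the example values are hypothetical, and a real analysis would also need p-values and multiple-testing correction across genes.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical promoter beta-values and matched log-expression for one gene.
promoter_beta = [0.1, 0.3, 0.5, 0.7, 0.9]
log_expression = [9.0, 7.5, 6.0, 3.0, 1.0]
print(spearman(promoter_beta, log_expression))
```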
Module 7: Functional Annotation and Pathway Enrichment
- Map DMRs to nearest genes while considering topologically associating domains (TADs) for distal regulatory effects.
- Use GREAT or ChIP-Enrich to assign biological meaning to non-promoter DMRs based on regulatory domain models.
- Perform gene ontology (GO) and KEGG pathway analysis with proper multiple testing correction.
- Filter enriched terms based on specificity and avoid overinterpretation of broad categories (e.g., "cellular process").
- Integrate transcription factor binding site (TFBS) databases (e.g., JASPAR) to identify potential regulatory drivers.
- Assess enrichment of DMRs in known super-enhancers or disease-associated loci from GWAS.
- Compare functional profiles across conditions (e.g., tumor vs. normal) to identify context-specific pathways.
- Use tissue-specific regulatory annotations to prioritize findings in relevant biological systems.
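
The GO and KEGG enrichment step in this module is commonly based on a one-sided hypergeometric test. The sketch below shows that calculation in plain Python; the function name and gene counts are hypothetical. Note the assumption that every background gene is equally likely to be selected, which is violated on arrays where genes differ in probe count (tools such as missMethyl's gometh adjust for this).

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_term, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, of which n_in_term belong to the term."""
    if not (0 <= n_in_term <= n_background and 0 <= n_selected <= n_background):
        raise ValueError("term and selection sizes must fit in the background")
    lower = max(n_overlap, 0)
    upper = min(n_in_term, n_selected)
    # Sum exact integer counts first, then divide once to limit rounding error.
    numerator = sum(
        comb(n_in_term, i) * comb(n_background - n_in_term, n_selected - i)
        for i in range(lower, upper + 1)
    )
    return numerator / comb(n_background, n_selected)


# Hypothetical example: 20,000 background genes, 200 in the pathway,
# 500 genes linked to DMRs, 15 of them in the pathway.
print(hypergeom_enrichment_p(20000, 200, 500, 15))
```

The resulting p-values across all tested terms still need multiple-testing correction before interpretation.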
Module 8: Methylation Clocks and Biomarker Development
- Select epigenetic clock algorithms (Horvath, Hannum, PhenoAge) based on tissue type and phenotypic focus.
- Calculate epigenetic age acceleration and interpret its association with disease or environmental exposures.
- Validate clock performance in non-European populations to assess generalizability.
- Develop custom biomarkers using elastic net or random forest models trained on methylation data.
- Assess biomarker robustness across batches, platforms, and sample collection methods.
- Define clinically actionable thresholds for methylation-based classifiers (e.g., cancer detection).
- Estimate minimal sample size for biomarker validation using power calculations for AUC.
- Implement cross-validation strategies to avoid overfitting in high-dimensional methylation datasets.
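
Epigenetic age acceleration, as used in this module, is often defined as the residual from regressing predicted DNA methylation age on chronological age. The sketch below shows that residual definition with ordinary least squares in plain Python; the function name and the ages are hypothetical, and the DNAm ages are assumed to have been produced by a clock upstream.

```python
def age_acceleration_residuals(chron_age, dnam_age):
    """Residuals of DNAm age regressed on chronological age (simple OLS).

    Positive values indicate a sample that looks epigenetically older
    than expected for its chronological age within this cohort.
    """
    n = len(chron_age)
    if n != len(dnam_age) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mean_x = sum(chron_age) / n
    mean_y = sum(dnam_age) / n
    sxx = sum((x - mean_x) ** 2 for x in chron_age)
    if sxx == 0:
        raise ValueError("chronological ages are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(chron_age, dnam_age))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(chron_age, dnam_age)]


# Hypothetical cohort of four individuals.
residuals = age_acceleration_residuals([30, 40, 50, 60], [32, 41, 55, 60])
print([round(r, 2) for r in residuals])
```

Because the residuals are cohort-relative, they are uncorrelated with chronological age by construction, which is why they are preferred over the raw difference between DNAm age and chronological age in association tests.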
Module 9: Data Sharing, Reproducibility, and Ethical Considerations
- Prepare methylation data for public deposition in GEO or dbGaP with complete metadata and experimental details.
- Use standardized ontologies (e.g., OBI, EFO) to describe sample characteristics and protocols.
- Document bioinformatics workflows using containers (Docker/Singularity) and workflow languages (Snakemake, Nextflow).
- Archive intermediate files and version control scripts to ensure reproducibility.
- Address privacy risks in methylation data due to potential identification from epigenetic signatures.
- Implement data access controls for sensitive studies involving minors or stigmatized conditions.
- Report sex and ancestry estimates derived from methylation arrays in compliance with ethical guidelines.
- Disclose conflicts of interest when developing commercializable biomarkers or diagnostic tools.