This curriculum covers an end-to-end bioinformatics initiative in a research-intensive organization, comparable in scope to a multi-phase project that integrates experimental design, large-scale data analysis, structural modeling, and cross-team data governance.
Module 1: Foundations of RNA Biology and Data Types
- Select appropriate RNA sequencing protocols (e.g., total RNA-seq, small RNA-seq, long-read sequencing) based on target RNA classes and biological questions.
- Evaluate trade-offs between sequencing depth and read length when profiling low-abundance transcripts or splice variants.
- Integrate metadata standards (e.g., MINSEQE for sequencing experiments, MIAME for microarrays) into experimental design to ensure reproducibility and data reuse.
- Assess quality of RNA input using RIN (RNA Integrity Number) and adjust library preparation protocols accordingly.
- Choose between poly-A selection and ribosomal RNA depletion based on sample type and transcript targets.
- Implement spike-in controls for normalization in differential expression analyses involving degraded or limited samples.
- Define criteria for batch effect detection when combining datasets from different labs or platforms.
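The depth trade-off for low-abundance transcripts can be made concrete with a back-of-the-envelope estimate. The sketch below assumes reads are sampled in proportion to TPM times transcript length; the function names and the `mean_transcript_len` input (the TPM-weighted mean length of expressed transcripts) are illustrative assumptions, not part of any sequencing toolkit.

```python
import math


def transcript_share(tpm, transcript_len, mean_transcript_len):
    """Fraction of the library expected to come from one transcript.

    Assumes reads are drawn in proportion to TPM x length. Because TPM
    values sum to 1e6, the denominator is 1e6 times the TPM-weighted
    mean transcript length (an input the caller must estimate).
    """
    return (tpm * transcript_len) / (1e6 * mean_transcript_len)


def expected_reads(total_reads, tpm, transcript_len, mean_transcript_len):
    """Expected number of reads from the transcript at a given depth."""
    return total_reads * transcript_share(tpm, transcript_len, mean_transcript_len)


def depth_for_target(min_reads, tpm, transcript_len, mean_transcript_len):
    """Library size at which the expected read count reaches min_reads."""
    share = transcript_share(tpm, transcript_len, mean_transcript_len)
    return math.ceil(min_reads / share)


if __name__ == "__main__":
    # Hypothetical scenario: a 1 TPM transcript of average length.
    print(expected_reads(30_000_000, 1.0, 2000, 2000))
    print(depth_for_target(10, 0.5, 1000, 2000))
```

This is an expectation only; it ignores sampling noise, mapping loss, and protocol-specific biases, so real designs usually add a safety margin.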
Module 2: Preprocessing and Quality Control of RNA-seq Data
- Configure adapter trimming tools (e.g., Trimmomatic, Cutadapt) with sequence-specific parameters to preserve non-polyA transcript ends.
- Set quality score thresholds and read length cutoffs that balance data retention with downstream alignment accuracy.
- Diagnose strandedness issues in alignment outputs by analyzing antisense read distributions across known gene bodies.
- Implement FastQC and MultiQC pipelines in continuous integration workflows for automated QC reporting.
- Adjust base quality recalibration parameters when working with FFPE or single-cell RNA-seq data.
- Filter ribosomal RNA reads using SortMeRNA or Bowtie2 against reference rRNA databases prior to transcript assembly.
- Validate technical reproducibility by running PCA and computing sample correlation matrices on log-transformed counts, since raw counts are dominated by library size and a few highly expressed genes.
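The strandedness check in this module can be sketched as a simple classification of the sense-read fraction over annotated genes. The cutoffs below (0.9 for stranded libraries, 0.4 to 0.6 for unstranded ones) are illustrative assumptions rather than values prescribed by any tool, and the read counts are expected to come from an upstream counting step.

```python
def infer_strandedness(sense_reads, antisense_reads,
                       stranded_cutoff=0.9, unstranded_band=(0.4, 0.6)):
    """Classify a library from sense/antisense read counts over known genes.

    Returns (call, sense_fraction). The thresholds are assumptions made
    for illustration and should be tuned to the data at hand.
    """
    total = sense_reads + antisense_reads
    if total == 0:
        raise ValueError("no reads overlap the annotated gene bodies")
    sense_fraction = sense_reads / total
    low, high = unstranded_band
    if sense_fraction >= stranded_cutoff:
        call = "forward-stranded"
    elif sense_fraction <= 1 - stranded_cutoff:
        call = "reverse-stranded"
    elif low <= sense_fraction <= high:
        call = "unstranded"
    else:
        # Neither clearly stranded nor clearly unstranded: worth a closer
        # look (e.g. wrong library-type setting or sample contamination).
        call = "ambiguous"
    return call, sense_fraction


if __name__ == "__main__":
    print(infer_strandedness(950, 50))
    print(infer_strandedness(750, 250))
```

An "ambiguous" call is the case most worth investigating, because it often points to a mis-specified library type in the aligner or counter.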
Module 3: Transcriptome Assembly and Quantification
- Select de novo assemblers (e.g., Trinity, SOAPdenovo-Trans) versus reference-guided tools (e.g., StringTie, Cufflinks) based on species annotation availability.
- Optimize k-mer size and coverage cutoffs in de novo assembly to reduce fragmentation and chimeric transcripts.
- Resolve isoform ambiguity using expectation-maximization algorithms in tools like Salmon or kallisto with proper bias correction.
- Compare transcript-level quantification outputs across tools to assess consistency in low-expression genes.
- Integrate long-read sequencing data (e.g., PacBio, Oxford Nanopore) to improve isoform resolution in complex loci.
- Validate novel transcript predictions using RT-PCR and Sanger sequencing in follow-up wet-lab experiments.
- Manage memory and disk I/O requirements when assembling large transcriptomes on high-performance computing clusters.
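To illustrate how expectation-maximization resolves isoform ambiguity, here is a toy version of the idea. It is not the algorithm as implemented in Salmon or kallisto: it assumes all transcripts have equal effective length, applies no bias correction, and takes a hypothetical `compat` list in which each read is represented by the transcript indices it is compatible with.

```python
def em_abundances(compat, n_transcripts, n_iter=200):
    """Estimate relative transcript abundances from read compatibility sets.

    compat: list of tuples; each tuple holds the indices of transcripts a
            read could have come from (hypothetical input format).
    Returns a list of abundances that sums to 1.
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        # E-step: split each read across its compatible transcripts in
        # proportion to the current abundance estimates.
        for read in compat:
            total = sum(theta[t] for t in read)
            for t in read:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the fractional counts.
        n = sum(counts)
        theta = [c / n for c in counts]
    return theta


if __name__ == "__main__":
    # 6 reads unique to transcript 0, 2 unique to transcript 1,
    # and 4 reads compatible with both.
    reads = [(0,)] * 6 + [(1,)] * 2 + [(0, 1)] * 4
    print(em_abundances(reads, 2))
```

In this example the ambiguous reads end up split according to the evidence from the uniquely mapping reads, which is the behaviour the EM step is meant to provide.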
Module 4: Structural RNA Detection and Annotation
- Apply covariance models (e.g., via Infernal) with the Rfam database to identify non-coding RNA families with structural homology.
- Tune E-value and bit score thresholds to minimize false positives in ncRNA detection across divergent species.
- Combine sequence conservation and RNAfold-predicted stability to prioritize functional RNA structures in genomic regions.
- Use SHAPE-Seq or DMS-Seq data to constrain in silico folding predictions and improve secondary structure accuracy.
- Annotate riboswitches and RNA thermometers by scanning UTRs for conserved structural motifs and ligand-binding pockets.
- Integrate RNA structure probing data into genome browsers for visualization alongside expression and conservation tracks.
- Address annotation conflicts between different databases (e.g., GENCODE, RefSeq, Rfam) in multi-source pipelines.
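Threshold tuning and conflict resolution between overlapping annotations can be sketched on generic hit records. The `Hit` class, its field names, and the default cutoffs below are hypothetical; they do not mirror the output format of Infernal or any database, so real hits would first need to be parsed into this shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A generic ncRNA hit (hypothetical record layout)."""
    family: str
    seq_id: str
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    strand: str
    bit_score: float
    e_value: float


def filter_hits(hits, max_e_value=1e-5, min_bit_score=20.0):
    """Keep hits that pass both thresholds (illustrative defaults)."""
    return [h for h in hits
            if h.e_value <= max_e_value and h.bit_score >= min_bit_score]


def resolve_overlaps(hits):
    """Greedily keep the best-scoring hit among same-strand overlaps."""
    kept = []
    for hit in sorted(hits, key=lambda h: h.bit_score, reverse=True):
        clash = any(k.seq_id == hit.seq_id and k.strand == hit.strand
                    and hit.start <= k.end and k.start <= hit.end
                    for k in kept)
        if not clash:
            kept.append(hit)
    return sorted(kept, key=lambda h: (h.seq_id, h.start))


if __name__ == "__main__":
    hits = [
        Hit("family_A", "chr1", 100, 172, "+", 60.0, 1e-12),
        Hit("family_B", "chr1", 120, 180, "+", 35.0, 1e-6),
        Hit("family_C", "chr1", 500, 560, "-", 15.0, 0.5),
    ]
    print(resolve_overlaps(filter_hits(hits)))
```

Greedy selection by bit score is only one possible policy; a pipeline might instead prefer a trusted source database or keep both annotations and flag the conflict.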
Module 5: RNA Secondary Structure Prediction and Modeling
- Choose between minimum free energy (MFE), partition function, and suboptimal folding methods based on required confidence metrics.
- Adjust thermodynamic parameters for non-standard conditions (e.g., high Mg²⁺, temperature shifts) in folding simulations.
- Validate predicted structures using cross-linking data (e.g., PARIS, COMRADES) to assess long-range interactions.
- Implement ensemble defect analysis to evaluate the reliability of predicted base pairs across folding algorithms.
- Compare outputs from RNAfold (part of the ViennaRNA package) and mfold to assess consensus structures in ambiguous regions.
- Model pseudoknots using specialized tools (e.g., HotKnots, pknotsRG) when standard dynamic programming for nested structures cannot represent the crossing base pairs.
- Scale structure prediction workflows using parallelization across clusters for genome-wide analyses.
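Comparing predictions from different folding tools usually starts from their dot-bracket output. The sketch below parses pseudoknot-free dot-bracket strings into base-pair sets and reports the pairs shared by all predictions; the function names are illustrative, and the agreement score (shared pairs divided by all predicted pairs) is one simple choice among several possible metrics.

```python
def parse_dot_bracket(structure):
    """Return the set of (i, j) base pairs (0-based) in a dot-bracket string.

    Only '(', ')' and '.' are accepted, so pseudoknot notation is rejected.
    """
    stack, pairs = [], set()
    for i, char in enumerate(structure):
        if char == "(":
            stack.append(i)
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.add((stack.pop(), i))
        elif char != ".":
            raise ValueError(f"unsupported character {char!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def consensus_pairs(structures):
    """Base pairs predicted by every structure in the list."""
    if len({len(s) for s in structures}) != 1:
        raise ValueError("structures must all have the same length")
    return set.intersection(*(parse_dot_bracket(s) for s in structures))


def pair_agreement(struct_a, struct_b):
    """Shared pairs divided by all pairs predicted by either structure."""
    a, b = parse_dot_bracket(struct_a), parse_dot_bracket(struct_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    print(consensus_pairs(["((..))", "(....)"]))
    print(pair_agreement("((..))", "(....)"))
```

Regions where the agreement is low are natural candidates for experimental probing or for the ensemble-based analyses listed above.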
Module 6: Functional Analysis of RNA Structure-Function Relationships
- Correlate structural accessibility in 5' UTRs with ribosome profiling data to infer translational regulation mechanisms.
- Map SNPs and mutations onto predicted RNA structures to assess disruption of functional elements (e.g., miRNA binding sites).
- Integrate CLIP-seq data (e.g., HITS-CLIP, iCLIP) to identify protein-binding sites coinciding with structural motifs.
- Design compensatory mutations to test structural hypotheses in functional assays (e.g., luciferase reporters).
- Quantify structural changes under different conditions using reactivity data from chemical probing experiments.
- Link RNA structural dynamics to alternative splicing outcomes by analyzing splice site accessibility.
- Use deep mutational scanning data to validate structural models at nucleotide resolution.
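Mapping a mutation onto a predicted structure and designing its compensatory partner can be sketched as follows. This minimal example uses 0-based positions, a pseudoknot-free dot-bracket structure, and Watson-Crick pairing only (G-U wobble pairs are ignored); all function names are illustrative.

```python
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}


def pair_table(structure):
    """Map each paired position to its partner (0-based)."""
    stack, table = [], {}
    for i, char in enumerate(structure):
        if char == "(":
            stack.append(i)
        elif char == ")":
            j = stack.pop()
            table[i], table[j] = j, i
    return table


def compensatory_mutation(seq, structure, pos, new_base):
    """Return (partner_pos, partner_base) that restores pairing, or None.

    None means the mutated position is unpaired in the predicted
    structure, so there is no partner to compensate.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure lengths differ")
    partner = pair_table(structure).get(pos)
    if partner is None:
        return None
    return partner, WATSON_CRICK[new_base]


def apply_mutations(seq, mutations):
    """Return a copy of seq with {position: base} substitutions applied."""
    bases = list(seq)
    for pos, base in mutations.items():
        bases[pos] = base
    return "".join(bases)


if __name__ == "__main__":
    seq, structure = "GGGAAACCC", "(((...)))"
    partner_pos, partner_base = compensatory_mutation(seq, structure, 0, "A")
    # Disrupting single mutant versus rescuing double mutant.
    print(apply_mutations(seq, {0: "A"}))
    print(apply_mutations(seq, {0: "A", partner_pos: partner_base}))
```

In a reporter assay, the single mutant is expected to disrupt the helix and the double mutant to restore it, provided the structural hypothesis holds.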
Module 7: Integration of Multi-Omics Data for Regulatory Insights
- Align RNA structure data with epigenomic data (e.g., histone-mark ChIP-seq, ATAC-seq chromatin accessibility) to explore co-regulation mechanisms.
- Overlay RNA modifications (e.g., m⁶A from MeRIP-seq) onto structural models to assess impact on folding.
- Construct regulatory networks linking lncRNAs, miRNAs, and mRNA targets using expression and structural compatibility.
- Use time-series RNA-seq and structure probing to model dynamic RNA conformational changes during cellular responses.
- Integrate proteomics data to identify RNA-binding proteins associated with structural motifs.
- Apply causal inference methods to distinguish whether structural changes drive expression changes or vice versa.
- Manage data harmonization challenges when combining public datasets with different processing pipelines.
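One recurring harmonization problem is that public datasets label the same gene with different identifier versions. The sketch below assumes Ensembl-style identifiers with a version suffix after a dot and count tables held as plain dictionaries; the identifiers and function names are illustrative.

```python
def strip_version(gene_id):
    """Drop the version suffix, e.g. 'ENSG00000000001.4' -> 'ENSG00000000001'."""
    return gene_id.split(".", 1)[0]


def normalize_ids(counts):
    """Re-key a {gene_id: count} table by unversioned identifier."""
    normalized = {}
    for gene_id, value in counts.items():
        key = strip_version(gene_id)
        if key in normalized:
            # Two versions of one gene in a single table need a manual decision.
            raise ValueError(f"duplicate identifier after stripping: {key}")
        normalized[key] = value
    return normalized


def harmonize(counts_a, counts_b):
    """Return {gene_id: (count_a, count_b)} for genes present in both tables."""
    a, b = normalize_ids(counts_a), normalize_ids(counts_b)
    return {gene: (a[gene], b[gene]) for gene in sorted(a.keys() & b.keys())}


if __name__ == "__main__":
    dataset_a = {"ENSG00000000001.4": 10, "ENSG00000000002.1": 5}
    dataset_b = {"ENSG00000000001.7": 12, "ENSG00000000003": 8}
    print(harmonize(dataset_a, dataset_b))
```

Matching identifiers is only the first step; differences in annotation release, aligner, and quantification method still need to be handled, typically by reprocessing from raw reads or by modelling the pipeline as a batch variable.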
Module 8: Scalable Infrastructure and Reproducible Workflows
- Containerize RNA analysis pipelines using Docker or Singularity to ensure cross-platform reproducibility.
- Design Snakemake or Nextflow workflows that resolve data dependencies, retry or resume failed jobs, and execute steps conditionally.
- Implement version control for reference genomes, annotations, and software to track analytical provenance.
- Optimize cloud storage costs by tiering raw data, intermediate files, and final results across storage classes.
- Configure job scheduling parameters (e.g., memory, CPU, walltime) based on empirical resource profiling.
- Apply checksum validation at each pipeline stage to detect data corruption during transfer or processing.
- Enforce metadata capture at ingestion using schema-compliant databases (e.g., Chado, BioSQL).
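Checksum validation between pipeline stages can be sketched with Python's standard `hashlib` module. The choice of SHA-256 and the helper names are assumptions made for illustration; the file is read in chunks so that large FASTQ or BAM files do not need to fit in memory.

```python
import hashlib


def sha256sum(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """True if the file's digest matches the recorded checksum."""
    return sha256sum(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    import os
    import tempfile

    # Write a small temporary file and confirm it verifies against itself.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"ACGU\n")
    try:
        recorded = sha256sum(tmp.name)
        print(verify_checksum(tmp.name, recorded))
    finally:
        os.remove(tmp.name)
```

Recording the digest when a file is first produced and re-checking it after each transfer makes silent corruption detectable at the stage where it occurred.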
Module 9: Ethical, Legal, and Collaborative Data Practices
- Apply GDPR and HIPAA compliance measures when handling human RNA-seq data, including de-identification protocols.
- Navigate data access agreements (e.g., dbGaP, EGA) for controlled-access datasets in multi-institutional projects.
- Establish data use limitations for sensitive findings (e.g., incidental germline variants) in RNA analyses.
- Implement audit trails for data access and analysis in shared environments using logging frameworks.
- Coordinate data sharing timelines with publication embargoes and consortium policies.
- Document model assumptions and limitations in structural predictions to prevent overinterpretation.
- Engage domain experts (e.g., clinicians, molecular biologists) in interpreting functional implications of structural findings.
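An audit trail for data access can be sketched with Python's standard `logging` module, writing one JSON record per event. The record fields (`user`, `action`, `resource`) and the helper names are hypothetical; a production system would also need tamper-resistant storage and access controls, which are out of scope here.

```python
import io
import json
import logging
from datetime import datetime, timezone


def make_audit_logger(stream, name="audit"):
    """Create a logger that writes one message per line to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False          # keep audit events out of other logs
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def log_access(logger, user, action, resource):
    """Write a JSON audit record with a UTC timestamp and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "resource": resource,
    }
    logger.info(json.dumps(record, sort_keys=True))
    return record


if __name__ == "__main__":
    buffer = io.StringIO()
    audit = make_audit_logger(buffer)
    log_access(audit, "analyst_01", "read", "cohort_A/sample_001.bam")
    print(buffer.getvalue(), end="")
```

Writing to a file or a central log service only requires swapping the stream handler; the JSON-per-line layout keeps the trail easy to query later.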