Description

This curriculum covers an end-to-end bioinformatics initiative in a research-intensive organisation, comparable in scope to a multi-phase project that integrates experimental design, large-scale data analysis, structural modelling, and cross-team data governance.

Module 1: Foundations of RNA Biology and Data Types

Select appropriate RNA sequencing protocols (e.g., total RNA-seq, small RNA-seq, long-read sequencing) based on target RNA classes and biological questions.
Evaluate trade-offs between sequencing depth and read length when profiling low-abundance transcripts or splice variants.
Integrate metadata standards (e.g., MIAME, MINSEQE) into experimental design to ensure reproducibility and data reuse.
Assess RNA input quality using the RNA Integrity Number (RIN) and adjust library preparation protocols accordingly.
Choose between poly-A selection and ribosomal RNA depletion based on sample type and transcript targets.
Implement spike-in controls for normalization in differential expression analyses involving degraded or limited samples.
Define criteria for batch effect detection when combining datasets from different labs or platforms.
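
As a rough illustration of the spike-in normalization objective above, the sketch below derives per-sample size factors from spike-in read totals and rescales endogenous counts with them. It assumes the same amount of spike-in was added to every sample; the function names and the input layout (plain dictionaries) are hypothetical choices for this example, not part of any specific tool.

```python
from statistics import geometric_mean


def spike_in_size_factors(spike_totals):
    """Return a size factor per sample from total spike-in read counts.

    spike_totals: dict mapping sample name -> total reads assigned to spike-ins.
    Assumption: every sample received the same amount of spike-in material.
    Factors are scaled so that their geometric mean is 1.
    """
    totals = {sample: float(count) for sample, count in spike_totals.items()}
    if any(value <= 0 for value in totals.values()):
        raise ValueError("every sample needs a positive spike-in count")
    reference = geometric_mean(list(totals.values()))
    return {sample: value / reference for sample, value in totals.items()}


def normalize_counts(counts, size_factors):
    """Divide each sample's raw gene counts by that sample's size factor.

    counts: dict mapping sample name -> dict of gene name -> raw count.
    """
    return {
        sample: {gene: value / size_factors[sample] for gene, value in genes.items()}
        for sample, genes in counts.items()
    }


if __name__ == "__main__":
    factors = spike_in_size_factors({"sample_a": 100, "sample_b": 400})
    scaled = normalize_counts(
        {"sample_a": {"gene1": 50}, "sample_b": {"gene1": 200}}, factors
    )
    print(factors)
    print(scaled)
```

In this toy input, the fourfold difference in spike-in recovery accounts for the fourfold difference in the raw counts of gene1, so the normalized values coincide.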

Module 2: Preprocessing and Quality Control of RNA-seq Data

Configure adapter trimming tools (e.g., Trimmomatic, Cutadapt) with sequence-specific parameters to preserve non-polyA transcript ends.
Set quality score thresholds and read length cutoffs that balance data retention with downstream alignment accuracy.
Diagnose strandedness issues in alignment outputs by analyzing antisense read distributions across known gene bodies.
Implement FastQC and MultiQC pipelines in continuous integration workflows for automated QC reporting.
Adjust base quality recalibration parameters when working with FFPE or single-cell RNA-seq data.
Filter ribosomal RNA reads using SortMeRNA or Bowtie2 against reference rRNA databases prior to transcript assembly.
Validate technical reproducibility using PCA and sample-to-sample correlation matrices computed on log-transformed raw counts before formal normalization.
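
The strandedness check listed above can be sketched as a simple fraction test. The example assumes that reads overlapping annotated genes have already been counted as sense or antisense relative to the gene strand; the 0.8 threshold and the function name are illustrative assumptions, not values taken from any particular tool.

```python
def infer_strandedness(sense_reads, antisense_reads, threshold=0.8):
    """Classify a library from sense/antisense read counts over known genes.

    sense_reads / antisense_reads: reads mapping to the same / opposite
    strand as the annotated gene (single-end reads, or read 1 of a pair).
    threshold: hypothetical cutoff on the sense fraction for a confident call.
    Returns (call, sense_fraction).
    """
    total = sense_reads + antisense_reads
    if total == 0:
        raise ValueError("no reads overlap annotated genes")
    sense_fraction = sense_reads / total
    if sense_fraction >= threshold:
        call = "forward-stranded"
    elif sense_fraction <= 1 - threshold:
        call = "reverse-stranded"
    else:
        # Roughly balanced counts suggest an unstranded protocol,
        # a mixed batch, or a problem with the library or annotation.
        call = "unstranded or mixed"
    return call, sense_fraction


if __name__ == "__main__":
    print(infer_strandedness(900, 100))
    print(infer_strandedness(100, 900))
    print(infer_strandedness(520, 480))
```

A call that contradicts the documented library protocol is the signal to revisit the strand setting passed to the aligner or quantifier.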

Module 3: Transcriptome Assembly and Quantification

Select de novo assemblers (e.g., Trinity, SOAPdenovo-Trans) versus reference-guided tools (e.g., StringTie, Cufflinks) based on species annotation availability.
Optimize k-mer size and coverage cutoffs in de novo assembly to reduce fragmentation and chimeric transcripts.
Resolve isoform ambiguity using expectation-maximization algorithms in tools like Salmon or kallisto with proper bias correction.
Compare transcript-level quantification outputs across tools to assess consistency in low-expression genes.
Integrate long-read sequencing data (e.g., PacBio, Oxford Nanopore) to improve isoform resolution in complex loci.
Validate novel transcript predictions using RT-PCR and Sanger sequencing in follow-up wet-lab experiments.
Manage memory and disk I/O requirements when assembling large transcriptomes on high-performance computing clusters.
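
To make the expectation-maximization idea behind isoform quantification more concrete, here is a deliberately simplified sketch. It works on equivalence classes (sets of transcripts that a group of reads is compatible with) and ignores effective transcript lengths and bias correction, so it is not a description of how Salmon or kallisto are implemented; the function name and input layout are assumptions made for this example.

```python
def em_abundances(eq_classes, n_iter=200, tol=1e-10):
    """Estimate relative transcript abundances with a toy EM loop.

    eq_classes: list of (transcript_ids, read_count) tuples, where
    transcript_ids is a tuple of transcripts the reads are compatible with.
    Simplification: no effective-length or bias terms are modelled.
    Returns dict transcript -> estimated fraction of reads.
    """
    transcripts = sorted({t for ids, _ in eq_classes for t in ids})
    total_reads = sum(count for _, count in eq_classes)
    if total_reads == 0:
        raise ValueError("no reads to assign")

    # Start from a uniform guess.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        # E-step: split each class's reads in proportion to current estimates.
        assigned = dict.fromkeys(transcripts, 0.0)
        for ids, count in eq_classes:
            denom = sum(theta[t] for t in ids)
            for t in ids:
                assigned[t] += count * theta[t] / denom
        # M-step: re-estimate fractions from the fractional assignments.
        new_theta = {t: assigned[t] / total_reads for t in transcripts}
        delta = max(abs(new_theta[t] - theta[t]) for t in transcripts)
        theta = new_theta
        if delta < tol:
            break
    return theta


if __name__ == "__main__":
    classes = [(("A",), 30), (("B",), 10), (("A", "B"), 20)]
    print(em_abundances(classes))
```

In this example the 20 ambiguous reads end up split 3:1 between A and B, mirroring the ratio of uniquely assigned reads.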

Module 4: Structural RNA Detection and Annotation

Apply covariance models (e.g., Infernal) with the Rfam database to identify non-coding RNA families with structural homology.
Tune E-value and bit score thresholds to minimize false positives in ncRNA detection across divergent species.
Combine sequence conservation and RNAfold-predicted stability to prioritize functional RNA structures in genomic regions.
Use SHAPE-Seq or DMS-Seq data to constrain in silico folding predictions and improve secondary structure accuracy.
Annotate riboswitches and RNA thermometers by scanning UTRs for conserved structural motifs and ligand-binding pockets.
Integrate RNA structure probing data into genome browsers for visualization alongside expression and conservation tracks.
Address annotation conflicts between different databases (e.g., GENCODE, RefSeq, Rfam) in multi-source pipelines.
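
One possible way to handle the annotation conflicts mentioned above is a source-priority rule: when features from different databases overlap on the same strand, keep the one from the preferred source. The sketch below illustrates that rule only; the `Feature` class, the priority order, and the quadratic overlap scan are assumptions chosen for readability, and a production pipeline would likely use an interval index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str
    source: str
    name: str


def overlaps(a, b):
    """True if two features share at least one base on the same strand."""
    return (
        a.chrom == b.chrom
        and a.strand == b.strand
        and a.start < b.end
        and b.start < a.end
    )


def resolve_conflicts(features, priority):
    """Keep the highest-priority feature among overlapping ones.

    priority: list of source names, most trusted first (a hypothetical
    policy; every feature's source must appear in this list).
    Overlapping features from the same source: the leftmost one is kept.
    """
    rank = {source: i for i, source in enumerate(priority)}
    kept = []
    for feat in sorted(features, key=lambda f: (rank[f.source], f.chrom, f.start)):
        if not any(overlaps(feat, other) for other in kept):
            kept.append(feat)
    return sorted(kept, key=lambda f: (f.chrom, f.start))


if __name__ == "__main__":
    feats = [
        Feature("chr1", 150, 250, "+", "RefSeq", "refseq_tx"),
        Feature("chr1", 100, 200, "+", "GENCODE", "gencode_tx"),
        Feature("chr1", 300, 350, "+", "Rfam", "rfam_hit"),
    ]
    for f in resolve_conflicts(feats, ["GENCODE", "RefSeq", "Rfam"]):
        print(f.source, f.name, f.start, f.end)
```

Whatever rule is chosen, recording the dropped features alongside the kept ones makes the decision auditable later.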

Module 5: RNA Secondary Structure Prediction and Modeling

Choose between minimum free energy (MFE), partition function, and suboptimal folding methods based on required confidence metrics.
Adjust thermodynamic parameters for non-standard conditions (e.g., high Mg²⁺, temperature shifts) in folding simulations.
Validate predicted structures using cross-linking data (e.g., PARIS, COMRADES) to assess long-range interactions.
Implement ensemble defect analysis to evaluate the reliability of predicted base pairs across folding algorithms.
Compare outputs from RNAfold (part of the ViennaRNA package) and mfold to assess consensus structures in ambiguous regions.
Model pseudoknots using specialized tools (e.g., HotKnots, pknotsRG) when standard dynamic programming algorithms, which assume nested base pairs, cannot represent the structure.
Scale structure prediction workflows using parallelization across clusters for genome-wide analyses.
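
The ensemble defect objective above can be illustrated with a small calculation. Given a target structure in dot-bracket notation and base-pair probabilities from a partition-function calculation, the ensemble defect sums, over all nucleotides, the probability that a nucleotide is not in its target state. The sketch assumes pseudoknot-free dot-bracket input and a plain dictionary of pair probabilities; both function names are hypothetical.

```python
def parse_dot_bracket(structure):
    """Return a list where entry i is the pairing partner of i, or None."""
    stack = []
    partner = [None] * len(structure)
    for i, char in enumerate(structure):
        if char == "(":
            stack.append(i)
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced ')' in structure")
            j = stack.pop()
            partner[i], partner[j] = j, i
        elif char != ".":
            raise ValueError(f"unsupported character {char!r}")
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return partner


def ensemble_defect(structure, pair_probs):
    """Expected number of nucleotides that deviate from the target structure.

    pair_probs: dict mapping (i, j) with i < j (0-based) to the probability
    that i pairs with j in the equilibrium ensemble.
    """
    partner = parse_dot_bracket(structure)
    # Probability that each nucleotide is paired with anything at all.
    paired_prob = [0.0] * len(structure)
    for (i, j), prob in pair_probs.items():
        paired_prob[i] += prob
        paired_prob[j] += prob

    defect = 0.0
    for i, j in enumerate(partner):
        if j is None:
            # Target says unpaired: the defect is the chance of being paired.
            defect += paired_prob[i]
        else:
            # Target says i pairs with j: the defect is the chance it does not.
            defect += 1.0 - pair_probs.get((min(i, j), max(i, j)), 0.0)
    return defect


if __name__ == "__main__":
    print(ensemble_defect("(..)", {(0, 3): 0.9, (1, 2): 0.05}))
```

Dividing the result by the sequence length gives a normalized value that is easier to compare across RNAs of different sizes.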

Module 6: Functional Analysis of RNA Structure-Function Relationships

Correlate structural accessibility in 5' UTRs with ribosome profiling data to infer translational regulation mechanisms.
Map SNPs and mutations onto predicted RNA structures to assess disruption of functional elements (e.g., miRNA binding sites).
Integrate CLIP-seq data (e.g., HITS-CLIP, iCLIP) to identify protein-binding sites coinciding with structural motifs.
Design compensatory mutations to test structural hypotheses in functional assays (e.g., luciferase reporters).
Quantify structural changes under different conditions using reactivity data from chemical probing experiments.
Link RNA structural dynamics to alternative splicing outcomes by analyzing splice site accessibility.
Use deep mutational scanning data to validate structural models at nucleotide resolution.
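
The compensatory-mutation design mentioned above follows a simple pattern: break a predicted base pair from either side, then restore it by swapping both bases. The sketch below generates those three constructs for one pair. It assumes an RNA alphabet (U rather than T), 0-based positions, and a pair that is Watson-Crick or G-U in the wild type; the function name and output keys are hypothetical.

```python
CANONICAL_PAIRS = {
    ("G", "C"), ("C", "G"),
    ("A", "U"), ("U", "A"),
    ("G", "U"), ("U", "G"),
}


def compensatory_mutants(seq, i, j):
    """Design disruption and rescue mutants for the base pair (i, j).

    Returns a dict with the wild type, two single mutants that each break
    the pair, and a double mutant in which the two bases are swapped so
    that pairing potential is restored.
    """
    seq = seq.upper()
    if not 0 <= i < j < len(seq):
        raise ValueError("positions must satisfy 0 <= i < j < len(seq)")
    five_prime, three_prime = seq[i], seq[j]
    if (five_prime, three_prime) not in CANONICAL_PAIRS:
        raise ValueError("positions do not form a canonical or G-U pair")

    def substitute(changes):
        chars = list(seq)
        for pos, base in changes.items():
            chars[pos] = base
        return "".join(chars)

    return {
        "wild_type": seq,
        # Copying the partner's base creates a same-base mismatch (e.g. C-C).
        "disrupt_5p": substitute({i: three_prime}),
        "disrupt_3p": substitute({j: five_prime}),
        # Swapping both bases restores a pair with the opposite orientation.
        "compensatory": substitute({i: three_prime, j: five_prime}),
    }


if __name__ == "__main__":
    for label, construct in compensatory_mutants("GGGAAACCC", 0, 8).items():
        print(label, construct)
```

If reporter activity drops in both single mutants and recovers in the double mutant, the data support a role for the pairing itself rather than the sequence identity.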

Module 7: Integration of Multi-Omics Data for Regulatory Insights

Align RNA structure data with epigenomic profiles (e.g., histone-mark ChIP-seq, ATAC-seq chromatin accessibility) to explore co-regulation mechanisms.
Overlay RNA modifications (e.g., m⁶A from MeRIP-seq) onto structural models to assess impact on folding.
Construct regulatory networks linking lncRNAs, miRNAs, and mRNA targets using expression and structural compatibility.
Use time-series RNA-seq and structure probing to model dynamic RNA conformational changes during cellular responses.
Integrate proteomics data to identify RNA-binding proteins associated with structural motifs.
Apply causal inference methods to distinguish whether structural changes drive expression changes or vice versa.
Manage data harmonization challenges when combining public datasets with different processing pipelines.
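
As a minimal illustration of overlaying RNA modifications onto a structural model, the sketch below tallies how many modification sites fall on paired versus unpaired positions of a dot-bracket structure. MeRIP-seq peaks are broad, so the example assumes the sites have already been refined to single-nucleotide, 0-based positions on the same transcript coordinates as the structure; the function name is hypothetical.

```python
def modification_context(structure, sites):
    """Count modification sites in paired vs. unpaired positions.

    structure: dot-bracket string ('.' = unpaired, brackets = paired).
    sites: iterable of 0-based positions (assumed single-nucleotide calls).
    Returns counts plus the paired fraction (None if no sites are given).
    """
    paired = 0
    unpaired = 0
    for pos in sites:
        if not 0 <= pos < len(structure):
            raise ValueError(f"site {pos} lies outside the structure")
        if structure[pos] == ".":
            unpaired += 1
        else:
            paired += 1
    total = paired + unpaired
    return {
        "paired": paired,
        "unpaired": unpaired,
        "paired_fraction": paired / total if total else None,
    }


if __name__ == "__main__":
    print(modification_context("((..))", [0, 2, 3]))
```

A tally like this is descriptive only; assessing the impact on folding would require comparing structure predictions or probing data with and without the modification.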

Module 8: Scalable Infrastructure and Reproducible Workflows

Containerize RNA analysis pipelines using Docker or Singularity to ensure cross-platform reproducibility.
Design Snakemake or Nextflow workflows that handle data dependencies, conditional execution, and retries of failed jobs.
Implement version control for reference genomes, annotations, and software to track analytical provenance.
Optimize cloud storage costs by tiering raw data, intermediate files, and final results across storage classes.
Configure job scheduling parameters (e.g., memory, CPU, walltime) based on empirical resource profiling.
Apply checksum validation at each pipeline stage to detect data corruption during transfer or processing.
Enforce metadata capture at ingestion using databases built on established schemas (e.g., Chado, BioSQL).
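
The checksum validation step above can be sketched with the Python standard library. The example streams a file in chunks so that large FASTQ or BAM files do not need to fit in memory, then compares the digest with an expected value recorded at an earlier stage. SHA-256 and the function names are choices made for this illustration; a pipeline may equally rely on MD5 sums supplied by a sequencing provider.

```python
import hashlib


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Raise ValueError if the file's digest differs from the expected one."""
    actual = sha256sum(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(
            f"checksum mismatch for {path}: expected {expected_hex}, got {actual}"
        )
    return True


if __name__ == "__main__":
    import tempfile

    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"ACGU" * 1000)
    recorded = sha256sum(tmp.name)          # digest stored after stage N
    print(verify_checksum(tmp.name, recorded))  # re-checked before stage N+1
```

Storing the digest next to each output, and verifying it before the next stage starts, localizes corruption to a single transfer or processing step.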

Module 9: Ethical, Legal, and Collaborative Data Practices

Apply GDPR and HIPAA compliance measures when handling human RNA-seq data, including de-identification protocols.
Navigate data access agreements (e.g., dbGaP, EGA) for controlled-access datasets in multi-institutional projects.
Establish data use limitations for sensitive findings (e.g., incidental germline variants) in RNA analyses.
Implement audit trails for data access and analysis in shared environments using logging frameworks.
Coordinate data sharing timelines with publication embargoes and consortium policies.
Document model assumptions and limitations in structural predictions to prevent overinterpretation.
Engage domain experts (e.g., clinicians, molecular biologists) in interpreting functional implications of structural findings.