
RNA Secondary Structure

Prediction of RNA secondary structure using the Nussinov dynamic programming algorithm — implemented in C++.

C++17 License: MIT


A from-scratch implementation of the Nussinov algorithm for predicting RNA secondary structure. Given an RNA sequence, it finds the folding with the maximum number of Watson-Crick and wobble base pairs using dynamic programming, then recovers the structure via recursive traceback. Built during the CS361-UCSP Artificial Intelligence course.

Quickstart

git clone https://github.com/RayverAimar/secondary-structure-RNA.git
cd secondary-structure-RNA

g++ -O2 -o rna main.cpp
./rna

Output

Running on the default sequence ACUCGAUUCCGAG:

Sequence  : ACUCGAUUCCGAG
Structure : .(((((.).))))
Base pairs: 5

DP matrix (cell value = negative base pairs in that subrange):
       A   C   U   C   G   A   U   U   C   C   G   A   G
   A   0   0  -1  -1  -2  -2  -3  -3  -3  -3  -4  -5  -5
   C   0   0   0   0  -1  -2  -2  -2  -2  -2  -3  -4  -5
   U   0   0   0   0  -1  -2  -2  -2  -2  -2  -3  -4  -5
   C   0   0   0   0  -1  -1  -2  -2  -2  -2  -3  -4  -4
   G   0   0   0   0   0   0  -1  -2  -2  -2  -3  -3  -4
   A   0   0   0   0   0   0  -1  -1  -1  -1  -2  -3  -3
   U   0   0   0   0   0   0   0   0   0   0  -1  -2  -3
   U   0   0   0   0   0   0   0   0   0   0  -1  -2  -2
   C   0   0   0   0   0   0   0   0   0   0  -1  -1  -2
   C   0   0   0   0   0   0   0   0   0   0  -1  -1  -1
   G   0   0   0   0   0   0   0   0   0   0   0   0   0
   A   0   0   0   0   0   0   0   0   0   0   0   0   0
   G   0   0   0   0   0   0   0   0   0   0   0   0   0

The structure in dot-bracket notation — . means unpaired, ( and ) mark the two ends of a base pair:

A C U C G A U U C C G  A  G
. ( ( ( ( ( . ) . ) )  )  )
0 1 2 3 4 5 6 7 8 9 10 11 12

Recovered pairs: C(1)–G(12), U(2)–A(11), C(3)–G(10), G(4)–C(9), A(5)–U(7) — all Watson-Crick valid.
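
The pair list above can be read back from the dot-bracket string with a stack: each `(` is pushed, and each `)` closes the most recently opened bracket. The following is a minimal sketch of that idea; `parseDotBracket` is a hypothetical helper name and is not part of `main.cpp`.

```cpp
#include <stack>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not in the repository): lists the base pairs encoded
// by a dot-bracket string as (opening index, closing index), innermost first.
std::vector<std::pair<int, int>> parseDotBracket(const std::string& structure) {
    std::vector<std::pair<int, int>> pairs;
    std::stack<int> open;  // indices of '(' still waiting for a partner
    for (int i = 0; i < static_cast<int>(structure.size()); ++i) {
        const char c = structure[i];
        if (c == '(') {
            open.push(i);
        } else if (c == ')') {
            if (open.empty()) throw std::invalid_argument("unmatched ')'");
            pairs.emplace_back(open.top(), i);  // closes the latest open bracket
            open.pop();
        } else if (c != '.') {
            throw std::invalid_argument("unexpected character");
        }
    }
    if (!open.empty()) throw std::invalid_argument("unmatched '('");
    return pairs;
}
```

Called on `.(((((.).))))`, this yields the five index pairs listed above, starting with the innermost pair (5, 7) and ending with (1, 12).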

Theory

RNA base pairing rules

| Pair | Type |
|------|------|
| A – U | Watson-Crick |
| U – A | Watson-Crick |
| C – G | Watson-Crick |
| G – C | Watson-Crick |
| G – U | Wobble |
| U – G | Wobble |
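
The pairing rules translate directly into a predicate. This is a sketch only: the name `canPair` is taken from the traceback description in this README, but the exact signature used in `main.cpp` is an assumption.

```cpp
// Sketch of the pairing predicate. The signature is an assumption;
// only the name canPair appears in the README.
bool canPair(char a, char b) {
    switch (a) {
        case 'A': return b == 'U';              // Watson-Crick
        case 'U': return b == 'A' || b == 'G';  // Watson-Crick or wobble
        case 'C': return b == 'G';              // Watson-Crick
        case 'G': return b == 'C' || b == 'U';  // Watson-Crick or wobble
        default:  return false;                 // any other character never pairs
    }
}
```

The predicate is symmetric, so `canPair(x, y)` and `canPair(y, x)` always agree.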

DP formulation (Nussinov)

dp[i][j] stores the minimum cost for the subsequence i..j, where cost is defined as the negative number of base pairs (minimizing cost = maximizing pairs).

Base case:

$$dp[i][i] = 0 \qquad dp[i][j] = 0 \quad \text{if } j \leq i$$

Recurrence:

$$dp[i][j] = \min \begin{cases} dp[i+1][j-1] - 1 & \text{if bases } i \text{ and } j \text{ can pair} \\ \displaystyle\min_{i \leq k < j}\; dp[i][k] + dp[k+1][j] & \text{bifurcation (always evaluated)} \end{cases}$$

The bifurcation case is always evaluated — even when $i$ and $j$ can pair — because splitting the interval at some $k$ may yield more total pairs than a single $i$–$j$ pairing.
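
As a small worked example (the sequence GCAU is chosen here for illustration and does not come from the repository), bases 0 (G) and 3 (U) form a wobble pair, but the enclosed bases C and A cannot pair, so $dp[1][2] = 0$. The two adjacent pairs G–C and A–U give $dp[0][1] = dp[2][3] = -1$:

$$\begin{aligned} \text{pairing } (0,3):&\quad dp[1][2] - 1 = 0 - 1 = -1 \\ \text{bifurcation at } k = 1:&\quad dp[0][1] + dp[2][3] = (-1) + (-1) = -2 \\ dp[0][3] &= \min(-1,\; -2) = -2 \end{aligned}$$

Skipping the bifurcation whenever $i$ and $j$ can pair would report one pair here instead of two.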

The answer is $-\,dp[0][n-1]$.

Complexity: $O(n^3)$ time, $O(n^2)$ space.
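
The formulation above can be sketched as a bottom-up table fill over increasing range lengths. This is an illustration under stated assumptions, not a copy of `main.cpp`: the function name `nussinov`, the `at` lambda for the base case, and the `canPair` signature are all chosen here.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Pairing predicate (assumed signature): Watson-Crick plus G-U wobble.
bool canPair(char a, char b) {
    return (a == 'A' && b == 'U') || (a == 'U' && b == 'A') ||
           (a == 'C' && b == 'G') || (a == 'G' && b == 'C') ||
           (a == 'G' && b == 'U') || (a == 'U' && b == 'G');
}

// Fills the cost table: dp[i][j] = -(maximum base pairs in seq[i..j]).
std::vector<std::vector<int>> nussinov(const std::string& seq) {
    const int n = static_cast<int>(seq.size());
    std::vector<std::vector<int>> dp(n, std::vector<int>(n, 0));
    // Base case: an empty or single-base range (j <= i) costs 0.
    auto at = [&](int i, int j) { return j <= i ? 0 : dp[i][j]; };
    for (int len = 2; len <= n; ++len) {          // shorter ranges first
        for (int i = 0; i + len - 1 < n; ++i) {
            const int j = i + len - 1;
            int best = 0;                          // costs are never positive
            if (canPair(seq[i], seq[j])) best = at(i + 1, j - 1) - 1;
            // Bifurcation is evaluated unconditionally, as in the recurrence.
            for (int k = i; k < j; ++k)
                best = std::min(best, dp[i][k] + dp[k + 1][j]);
            dp[i][j] = best;
        }
    }
    return dp;
}
```

For the default sequence, `-nussinov("ACUCGAUUCCGAG")[0][12]` evaluates to 5, matching the output shown earlier. The three nested loops over `len`, `i`, and `k` account for the cubic running time.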

Traceback

The optimal structure is recovered recursively, starting from dp[0][n-1] and stopping when $i \geq j$. For each range $(i, j)$:

  1. If canPair(i, j) and dp[i][j] == dp[i+1][j-1] - 1 → pair $(i, j)$, recurse on $(i{+}1,\; j{-}1)$
  2. Else if dp[i][j] == dp[i+1][j] → $i$ is unpaired, recurse on $(i{+}1,\; j)$
  3. Else if dp[i][j] == dp[i][j-1] → $j$ is unpaired, recurse on $(i,\; j{-}1)$
  4. Else find $k$ such that dp[i][j] == dp[i][k] + dp[k+1][j] → recurse on both halves
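
The four steps can be sketched as a recursive function that writes into a string pre-filled with dots. This is a sketch only: the names `fill`, `cost`, `traceback`, and `fold`, and all signatures, are assumptions made for this example rather than the names used in `main.cpp`.

```cpp
#include <algorithm>
#include <string>
#include <vector>

using Table = std::vector<std::vector<int>>;

// Pairing predicate (assumed signature): Watson-Crick plus G-U wobble.
bool canPair(char a, char b) {
    return (a == 'A' && b == 'U') || (a == 'U' && b == 'A') ||
           (a == 'C' && b == 'G') || (a == 'G' && b == 'C') ||
           (a == 'G' && b == 'U') || (a == 'U' && b == 'G');
}

// dp[i][j] with the base case applied: ranges with j <= i cost 0.
int cost(const Table& dp, int i, int j) { return j <= i ? 0 : dp[i][j]; }

// Same recurrence as in the DP formulation section.
Table fill(const std::string& seq) {
    const int n = static_cast<int>(seq.size());
    Table dp(n, std::vector<int>(n, 0));
    for (int len = 2; len <= n; ++len)
        for (int i = 0; i + len - 1 < n; ++i) {
            const int j = i + len - 1;
            int best = 0;
            if (canPair(seq[i], seq[j])) best = cost(dp, i + 1, j - 1) - 1;
            for (int k = i; k < j; ++k)
                best = std::min(best, dp[i][k] + dp[k + 1][j]);
            dp[i][j] = best;
        }
    return dp;
}

// Applies steps 1-4 in order; `out` starts as all dots.
void traceback(const Table& dp, const std::string& seq, int i, int j,
               std::string& out) {
    if (i >= j) return;  // empty or single-base range: nothing to pair
    if (canPair(seq[i], seq[j]) && dp[i][j] == cost(dp, i + 1, j - 1) - 1) {
        out[i] = '(';    // step 1: i pairs with j
        out[j] = ')';
        traceback(dp, seq, i + 1, j - 1, out);
    } else if (dp[i][j] == cost(dp, i + 1, j)) {
        traceback(dp, seq, i + 1, j, out);  // step 2: i unpaired
    } else if (dp[i][j] == cost(dp, i, j - 1)) {
        traceback(dp, seq, i, j - 1, out);  // step 3: j unpaired
    } else {
        for (int k = i; k < j; ++k)         // step 4: find the split point
            if (dp[i][j] == dp[i][k] + dp[k + 1][j]) {
                traceback(dp, seq, i, k, out);
                traceback(dp, seq, k + 1, j, out);
                return;
            }
    }
}

std::string fold(const std::string& seq) {
    std::string out(seq.size(), '.');
    if (seq.empty()) return out;
    const Table dp = fill(seq);
    traceback(dp, seq, 0, static_cast<int>(seq.size()) - 1, out);
    return out;
}
```

With these definitions, `fold("ACUCGAUUCCGAG")` returns `.(((((.).))))`, the structure shown in the Output section. Because several structures can share the same optimal pair count, changing the order of the checks may return a different but equally optimal folding.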

The result is written in dot-bracket notation: . for unpaired, ( and ) for paired positions.

Bugs fixed

The original implementation had three correctness issues:

| # | Bug | Fix |
|---|-----|-----|
| 1 | DP if-else: bifurcation was only considered when $i$–$j$ could NOT pair, missing cases where a split yields more pairs than a single pairing | Removed the else, so the bifurcation loop always runs after the pairing check |
| 2 | Traceback false positives: the pair check dp[i][j] == dp[i+1][j-1] - 1 passed for non-pairable bases due to coincidental DP values | Added a canPair(seq[i], seq[j]) guard to the traceback condition |
| 3 | Traceback incompleteness: the while (i < mid && j >= mid) loop stopped at the midpoint, silently missing all nested pairs contained entirely within one half | Replaced the iterative loop with a fully recursive traceback |

Project structure

secondary-structure-RNA/
├── main.cpp    # Nussinov DP + recursive traceback
├── LICENSE
└── README.md
