RNA Folding Problem

using Dynamic Programming over intervals


You can think of DNA as a sequence of bases, represented by A(Adenine), C(Cytosine), G(Guanine), and T((Thymine)). In DNA, A pairs with T, and C pairs with G, which keeps the two strands connected. However, RNA doesn't have a complementary "second strand" like DNA does, so it often folds back on itself, creating pairs within its own sequence. This folding and pairing give rise to the secondary structure of the RNA molecule.


Problem Statement

A single-stranded RNA molecule can be understood as a string of symbols (bases) selected from {A, C, G, U}. Let's represent this string as B = b1b2 ... bn, where each bi belongs to {A, C, G, U}. Initially, we can simplify its secondary structure as follows: A pairs with U, and C pairs with G. Moreover, each base can only form a pair with one other base, creating what's termed as a matching. Additionally, secondary structures usually lack knots.

So specifically, we define a secondary structure on B as a set of pairs S = {(i, j)}, where i, j belong to {1, 2, ..., n}, and it adheres to the following criteria:

  1. Each pair within S maintains a minimum separation of four bases between their starting and ending positions; in other words, if (i, j) ∈ S, then i < j - 4.
  2. The constituents of any pair within S are exclusively either {A, U} or {C, G} (in any arrangement).
  3. S forms a matching: no base is shared among multiple pairs.
  4. (The noncrossing condition.) If (i, j) and (k, l) are two pairs in S, then we cannot have i < k < j < l.

Structurally speaking, condition (i) emerges because the RNA molecule lacks the flexibility to bend sharply, while conditions (ii) and (iii) represent the foundational principles of base-pairing, known as the Watson-Crick rules. Condition (iv) is particularly notable, as its natural validity isn't immediately apparent. However, although occasional exceptions exist in actual molecules (referred to as pseudo-knotting), it generally serves as a reliable approximation of the spatial limitations on real RNA secondary structures.

We want an efficient algorithm for basic RNA Secondary Structure prediction problem that takes a single-stranded RNA molecule and determines a Secondary Structure S with the maximum possible number of base pairs.


Approaches used to solve the RNA Folding problem

Dynamic Programming using single parameter

The challenge arises when attempting to formulate a recurrence that defines OPT(j) based on solutions to smaller subproblems. Progress can be made to a certain extent: within the optimal secondary structure on b1b2 ... bj, one of the following holds true:

  1. j isn't engaged in a pair, or
  2. j forms a pair with t for a t < j - 4.

The Following inferences can be drawn from the above points :

  1. In the initial scenario, we simply refer to our solution for OPT(j - 1).
  2. In the subsequent scenario, due to the noncrossing constraint, it's evident that no pair can have one end positioned between 1 and t − 1 while the other end falls between t + 1 and j − 1. Consequently, we've effectively identified two distinct subproblems: one concerning the bases b1b2 ... bt−1 and the other concerning the bases bt+1 ... bj−1. While the former is addressed by OPT(t − 1), the latter isn't included in our list of subproblems since it doesn't commence with b1.

This realization prompts us to introduce an additional variable. We require the ability to handle subproblems that don't commence with b1; hence, we must consider subproblems on bibi+1 ... bj for all i ≤ j.

Dynamic Programming over intervals

Let OPT(i, j) denote the maximum number of base pairs in a secondary structure on bibi+1 ... bj. The no-sharp-turns condition lets us initialize OPT(i, j) = 0 whenever i ≥ j − 4.

Now, in the optimal sub-structure on bi,bi+1,...,bj, we have the same alternatives as before :

  1. j isn't engaged in a pair, or
  2. j forms a pair with t for a t < j - 4.

The Following inferences can be drawn from the above points :

  1. In the first case, we have OPT(i, j) = OPT(i, j - 1).
  2. In the second case, we recur on the two subproblems OPT(i, t - 1) and OPT(t + 1, j - 1), the noncrossing condition has isolated these two subproblems from each other.

Recurrence Relation

Algo2

It is easy to bound the running time: there are O(n^2) subproblems to solve, and evaluating the recurrence takes time O(n) for each. Thus the running time is O(n^3).



Dynamic Programming Algorithm

Algo1

Time Complexity Analysis

Algo1

Results

Actual Secondary Structure

Secondary Structure obtained from Algorithm