You can think of DNA as a sequence of bases, represented by A(Adenine), C(Cytosine), G(Guanine), and T((Thymine)). In DNA, A pairs with T, and C pairs with G, which keeps the two strands connected. However, RNA doesn't have a complementary "second strand" like DNA does, so it often folds back on itself, creating pairs within its own sequence. This folding and pairing give rise to the secondary structure of the RNA molecule.
A single-stranded RNA molecule can be understood as a string of symbols (bases) selected from {A, C, G, U}. Let's represent this string as B = b1b2 ... bn, where each bi belongs to {A, C, G, U}. Initially, we can simplify its secondary structure as follows: A pairs with U, and C pairs with G. Moreover, each base can only form a pair with one other base, creating what's termed as a matching. Additionally, secondary structures usually lack knots.
So specifically, we define a secondary structure on B as a set of pairs S = {(i, j)}, where i, j belong to {1, 2, ..., n}, and it adheres to the following criteria:
Structurally speaking, condition (i) emerges because the RNA molecule lacks the flexibility to bend sharply, while conditions (ii) and (iii) represent the foundational principles of base-pairing, known as the Watson-Crick rules. Condition (iv) is particularly notable, as its natural validity isn't immediately apparent. However, although occasional exceptions exist in actual molecules (referred to as pseudo-knotting), it generally serves as a reliable approximation of the spatial limitations on real RNA secondary structures.
We want an efficient algorithm for basic RNA Secondary Structure prediction problem that takes a single-stranded RNA molecule and determines a Secondary Structure S with the maximum possible number of base pairs.
The challenge arises when attempting to formulate a recurrence that defines OPT(j) based on solutions to smaller subproblems. Progress can be made to a certain extent: within the optimal secondary structure on b1b2 ... bj, one of the following holds true:
The Following inferences can be drawn from the above points :
This realization prompts us to introduce an additional variable. We require the ability to handle subproblems that don't commence with b1; hence, we must consider subproblems on bibi+1 ... bj for all i ≤ j.
Let OPT(i, j) denote the maximum number of base pairs in a secondary structure on bibi+1 ... bj. The no-sharp-turns condition lets us initialize OPT(i, j) = 0 whenever i ≥ j − 4.
Now, in the optimal sub-structure on bi,bi+1,...,bj, we have the same alternatives as before :
The Following inferences can be drawn from the above points :
It is easy to bound the running time: there are O(n^2) subproblems to solve, and evaluating the recurrence takes time O(n) for each. Thus the running time is O(n^3).