An Eulerian path approach to DNA fragment assembly Presented by: Zohar Barak [email protected] Dor Alon [email protected] PAV E L A P E V Z N E R , H A I X U TA N G , A N D M I C H A E L S W AT E R M A N . A N E U L E R I A N PAT H APPROACH TO DNA F R A G M E N T A S S E M B LY. P R O C E E D I N G S O F T H E N AT I O N A L A C A D E M Y O F S C I E N C E S , 98(17):9748{9753, 2001 1 What are we going to see

Background and motivation De Bruijn Graph Eulerian Path Approach Error and data correction

Eulerian Superpath Problem Using DB data Results and conclusions 2 Some Terminology

L-tuples = k-mers = strings of constant size (l for l-tuple, k for k-mer) EP or EPP Eulerian Path or Eulerian Path Problem ES or ESP Eulerian Superpath or Eulerian Superpath Problem NM Neisseria meningitidis (used its sequence data for examples throughout the article)

Was considered to be one of the most difficult-to-assemble and repeat-rich bacterial genomes completed EULER The assembler that uses the methods discussed in this article 3 Background and motivation overlap-layout-consensus (OLC) In the period of 20 years before the article was out (2001), fragment assembly in DNA sequencing followed the overlaplayoutconsensus paradigm that was used in all available assembly tools. The OLC algorithm has 3 steps (1) Overlap - build overlap graph

(2) Layout - Bundle stretches of the overlap graph into contigs (3) Consensus - Pick most likely nucleotide sequence for each contig 4 Problems with OLC - Huge graph Memory and build time - Long time to assemble * HamPath problem - Problem with repeats 5 6

Many problems encountered in OLC Suggesting a new approach: An Eulerian path approach. 7 Reminder of de Bruijn Graph Vertex Edges (directed) We use the name de Bruijn graph also for subgraphs induced by a subset of the vertices

De Bruijn sequence Euler cycle on G - includes each 4mer once 0000110010111101 8 An easy example 9 Building the de Bruijn Graph Given a set of reads S = {s1, , sn} define - the set of all l-tuples from S.

Given we can construct the de-Bruijn graph G(Sl) with vertex set Sl1 (the set of all (l 1)tuples from S) as follows: An (l 1)-tuple v Sl1 is joined by a directed edge with an (l 1)-tuple w Sl1, if Sl contains an l-tuple for which the first l 1 nucleotides coincide with v and the last l 1 nucleotides coincide with w. Each l-tuple from Sl corresponds to an edge in G We glue usages of the same edge (same l-tuple) together

10 Comparison 11 Eulerian Path Finding Well solve the assembly problem by finding an Eulerian path to the graph EP problem can be solved in linear time

12 Error Correction Unlike OLC assemblers, EULER makes the consensus step error correction, the first step. If genome G is known, then error correction in a read s can be done by aligning the read s against G.

Catch 22! To assemble a genome, wed like to correct errors in reads first, but to correct errors in reads, one has to assemble the genome first. 13 Error Correction Bypassing Catch-22 - The set of all l-tuples in genome G Solid l-tuple an l-tuple that belongs to more than M reads, for some threshold M. If a tuple is not solid well call it weak.

We use an approximation of rather than G to correct sequencing errors. A natural approximation of is the set of all solid l-tuples 14 Spectral Alignment Problem Let T be a collection of l-tuples called a spectrum.

A string s is called a T-string if all its l-tuples belong to T Spectral Alignment Problem - Given a string s and a spectrum T, find the minimum number of mutations in s that transform s into a T-string. Aligning a read with the set solid tuples changes the sets of weak and solid and l-tuples. Iteratively aligning the set of all reads with the set of solid l-tuples gradually reduces the number of weak l-tuples and increase the number of solid l-tuples. Leads to elimination of many read errors: Usually, reads with poor alignments represents different errors. It is common to ignore these reads.

But we can do better! 15 Error Correction Problem Given a collection of r, and an integer the spectrum of S is a set of all l-tuples from the reads and . - Upper bound on the number of errors

in each DNA read. Error Correction Problem given S, introduce up to corrections in each read in S so is minimized. An error in a single read affects at most l l-tuples in s and l l-tuples in and usually creates 2l erroneous l-tuples that point to the same error ( for positions within from the endpoints)

16 Error Correction Problem EULER uses a greedy approach and looks to correct errors in the reads that reduces by We do this by correcting mutations with multiplicity to match mutations with higher multiplicity This eliminates 86.5% of the errors in the reads Then another more involved procedure is conducted. This eliminates 97.7% of the errors. It transformed the original test data with 4.8 errors per read on average into almost error-free data with 0.11 errors per read on average.

17 Data correction or Data corruption? However, this procedure isnt perfect while deciding which nucleotide (for example A or T) is correct in a given l-tuple within a read. If the correct nucleotide is A, but T is also present in some reads from the same region, we might assign T instead of A to all reads- introducing an error rather than to correct it.

These errors are easy to fix later, and it is much more important for the reads of the same region to be consistent thus reducing the complexity of the De-Bruijn graph. For the NM sequencing project, EULER eliminated 234,410 errors and introduced 1452 errors. 18 Back to graphs 19

Repeats A path v1 vn in the de Bruijn graph is called a repeat if indegree(v1) > 1, outdegree(vn) > 1, and indegree (vn) = outdegree(vi) = 1 for 1 i n 1 Edges entering the vertex v1 are called entrances into a repeat, whereas edges leaving the vertex vn are called exits from a repeat 20 A repeat v1 vn and a system of paths overlapping with this repeat.

Branching vertex Entrances Branching vertex Exists 21 Tangles A repeat is called a tangle if there is no read-path containing this repeat.

Tangles create problems in fragment assembly, because pairings of entrances and exits in a tangle cannot be resolved via the analysis of read-paths. To address this issue, we formulate the following generalization of the Eulerian Path Problem: Eulerian Superpath Problem 22 Eulerian Superpath Problem (ESP) ESP - Given an Eulerian graph G and a collection of paths in this graph, find an Eulerian path in

this graph that contains all these paths as subpaths. To solve the Eulerian Superpath Problem, we transform both the graph G and the system of paths in this graph into a new graph in this graph into a new graph G1 with a new system of paths in this graph into a new graph1. Such transformation is called equivalent if there exists a one-to-one correspondence between Eulerian superpaths in (G, ) and (G 1, ). 23 Eulerian Superpath Problem

(ESP) Our goal is to make a series of equivalent transformations that lead to a system of paths , with every path being a single edge. Because all transformations on the way from (G, ) to (G k, ) are equivalent, every solution of the Eulerian Path Problem in (Gk, ) provides a solution of the Eulerian Superpath Problem in (G, ). Well look into 2 cases:

Graph has multiple edges Graph doesnt have multiple edges 24 Case: no multiple edges Let x = (vin, vmid) and y = (vmid, vout) be two consecutive edges in graph G, and let in this graph into a new graphx,y be a collection of all paths from in this graph into a new graph that include both these edges as a subpath. Define in this graph into a new graphx as a collection of paths from in this graph into a new graph that end with x and in this graph into a new graphy as a collection of paths from in this graph into a new graph that start with y.

25 Case: no multiple edges The x, y-detachment is a transformation that adds a new edge z = (vin, vout) and deletes the edges x and y from G This detachment alters the system of paths in this graph into a new graph as follows: I.

substitute z for x, y in all paths from in this graph into a new graphx,y II. substitute z for x in all paths from in this graph into a new graphx III. substitute z for y in all paths from in this graph into a new graphy Because every detachment reduces the number of edges in G, the detachments will eventually shorten all paths from in this graph into a new graph to single edges and will reduce the Eulerian Superpath Problem to the Eulerian Path Problem. 26

Case: got multiple edges in the case of graphs with multiple edges, the detachment procedure may lead to errors, because directing all paths from the set in this graph into a new graphx through a new edge z may not be an equivalent transformation. In this case, the edge x may be visited many times in the Eulerian path, and it may or may not be followed by the edge y on some of these visits. 27

For illustration purposes, let us consider a simple case when the vertex vmid has the only incoming edge x = (vin, vmid) with multiplicity 2 and two outgoing edges y1 = (vmid, vout1) and y2 = (vmid, vout2), each with multiplicity 1. In this case, the Eulerian path visits the edge x twice; in one case, it is followed by y1, and in another case, it is followed by y2. Consider an x,y1-detachment that adds a new edge z = (vin, vout1) after deleting the edge y1 and one of two copies of the edge x. This detachment: (i) (ii) shortens all paths in in this graph into a new graphx,y1 by substitution of x, y1 by a single

edge z substitutes z for y1 in every path from in this graph into a new graphy1. 28 If in this graph into a new graphx is not empty, it is not clear whether the last edge of a path P in this graph into a new graphx should be assigned to the edge z or to the (remaining copy of) edge x. To resolve this dilemma, one has to analyze every path P in this graph into a new graphx and decide whether it relates to in this graph into a new graphx,y1 (in this case, it should be directed through z) or to in this graph into a new graphx,y2 (in this case, it should be directed through x). By relates to in this graph into a new graphx,y1 ( in this graph into a new graphx,y2), we mean that every Eulerian superpath visits y1 (y2)

immediately after visiting P. 29 More definitions Two paths are called consistent if their union is a path again. A path P is consistent with a set of paths in this graph into a new graph if it is consistent with all paths in in this graph into a new graph and inconsistent otherwise (i.e., if it is inconsistent with at least one path in in this graph into a new graph). are consistent because is a path.

are not consistent because is not a path 30 There are three possibilities: (a) P is consistent with exactly one of the sets in this graph into a new graphx,y1 and in this graph into a new graphx,y2 (b) P is inconsistent with both in this graph into a new graphx,y1and in this graph into a new graphx,y2 (c) P is consistent with both in this graph into a new graphx,y1 and in this graph into a new graphx,y2. 31 Case (a): Consistent with one In the first case, the path P is called resolvable, because it can be unambiguously related to either in this graph into a new graphx,y1 or in this graph into a new graphx,y2. An edge x is called resolvable if all paths in in this graph into a new graphx are

resolvable. If the edge x is resolvable, then the described x, ydetachment is an equivalent transformation after the correct assignments of last edges in every path from in this graph into a new graphx. In the analysis of the NM project, the researchers found that 18,026 among 18,962 edges in the de Bruijn graph are resolvable. 32 Case (b): Inconsistent with both The second condition implies that the Eulerian Superpath Problem has no solution, because P, in this graph into a new graphx,y1, and in this graph into a new graphx,y2 impose three different scenarios for just two visits of the edge x.

After discarding the poor-quality and chimeric reads, the researchers did not encounter this condition in the NM project. 33 Case (c): Consistent with both The last condition (P is consistent with both in this graph into a new graphx,y1 and in this graph into a new graphx,y2) corresponds to the most difficult situation. If this condition holds for at least one path in in this graph into a new graphx , the edge x is called unresolvable, and we postpone analysis of this edge until all resolvable edges are analyzed. The researchers observed that equivalent transformation of other resolvable edges often resolves previously unresolvable edges.

However, some edges cannot be resolved even after the detachments of all resolvable edges are completed. Such situations usually correspond to tangles, and they have to be addressed by another equivalent transformation called a cut. 34 Cuts Consider a fragment of graph G with 5 edges and 4 paths y3 x, y4 x, x y1, and x y2.

In this symmetric situation, x is a tangle, and there is no information available to relate any of paths y3 x and y4 x to any of paths x y1 and x y2. An edge x = (v, w) is removable if: (i) it is the only outgoing edge for v and the only incoming edge for w (ii) x is either the initial or the terminal edge for every path P in this graph into a new graph containing x. An x-cut transforms in this graph into a new graph into a new system of paths by simply removing x from all paths in in this graph into a new graphx and in this graph into a new graphx without affecting the graph G itself .

An x-cut is an equivalent transformation if x is a removable edge. 35 Detachments and cuts proved to be powerful techniques to untangle the de Bruijn graph and to reduce the fragment assembly to the Eulerian Path Problem for all studied bacterial genomes. However, there is still a gap in the theoretical analysis of the Eulerian Superpath Problem in

the case when the system of paths is not amenable to either detachments or cuts. 36 Using DB data Still have tangles to solve using DB data to solve it DB data Double Barreled data

Mate pair ( 2 reads, at both ends of an unknown sequence with roughly known length 1 ( 1 , 2) 2 37 Using DB data EULER-DB maps every read into some edge(s) of the de Bruijn graph

After this mapping, most mate-pairs of reads correspond to paths that connect the positions of these reads in the de Bruijn graph. We call these paths that connect a mate pair: matepaths. We define to be the distance between the reads in the de Bruijn graph (length of the matepath of 38 Using DB data

We compare to and use this information to our advantage. In most cases such path is unique and its length approximately matches the distance between the mate-pair 39 Using DB data In the case of multiple paths between r1 and r2 in the de Bruijn graph, there are three possibilities: 1. 2. 3.

The difference between and is beyond an acceptable variation, it is most likely an error in the DB data matches the length of exactly one path between . We transform the mate-pair into materead. matches the length of more than one path between . In this case there is no sufficient information (yet) for mate-read. The mate-pair is kept in the DB data in the hope it will be resolved at the following iterations. 40 Last case example Red mate-pair corresponds to 2 different paths

Blue corresponds to a unique path. Paths generated lead to resolution of both tangles 41 Insert length is the length of the sequence in between a pair of reads.

In the NM project (with insert length up to 1,800), EULER left only 5 unresolved tangles of length 3,610, 3,215, 2,741, 2,503, and 1,831. After completing EULER-DB, we build scaffolds by using DB data as bridges between different contigs (EULER-SF). EULER-SF combines the 91 contigs into 60 scaffolds, thus closing most gaps that are shorter than the insert length 42 43 More result

s 44 45 Conclusions EULER bypasses the repeat problem, because the Eulerian Superpath approach transforms most repeats into different paths in the de Bruijn graph. As a result, EULER does not even notice repeats unless they are long perfect repeats. Tangles caused by long repeats may be resolved by DB information (EULER-DB)

EULER has excellent scaling potential for eukaryotic genomes, because there exist linear-time algorithms for the Eulerian Path Problem. 46 Thank you for listening! 47