The workshop DSB “Data Structures in Bioinformatics” (DSB) is an incubator of ideas and facilitates exchanges as well as collaborations on topics related to data structures in bioinformatics. Speakers will present state of the art techniques and recent advances in this field. Also see the program of DSB 2015/2016.

Please note that registration for DSB2017 has been closed.

Dates

Venue

The workshop will take place at CWI Amsterdam (travel information, map) in room L016. There is a direct train connection to Amsterdam Science Park from Amsterdam Central Station (5 min) and Schiphol airport (30-50 min). We suggest the nearby hotels CASA (5 min bus ride) and The Manor (2 min train ride).


Participants

Name Affiliation
Mohamed Abouelhoda KACST/KFSHRC, Saudi Arabia
Mai Alzamel King’s College London
Lorraine Ayad King’s College London
Jasmijn Baaijens CWI Amsterdam
Paola Bonizzoni University of Milano-Bicocca
Rayan Chikhi University of Lille
Anthony Cox Illumina UK
Luca Denti University of Milano-Bicocca
Daniel Dörr Bielefeld University
Johannes Fischer TU Dortmund
Ehsan Haghshenas Simon Fraser University
Guillaume Holley Bielefeld University
Leandro Ishi INRIA
Johannes Köster CWI Amsterdam
Thierry Lecroq University of Rouen
Jasper Linthorst TU Delft
Tom Mokveld TU Delft
Marco Previtali University of Milano-Bicocca
Simon Puglisi University of Helsinki
Sven Rahmann University of Duisburg-Essen
Eric Rivals University of Montpellier
Kamil Salikhov Université Paris-Est Marne-la-Vallée
Alexander Schönhuth CWI Amsterdam
Tizian Schulz Bielefeld University
Siavash Sheikhizadeh Wageningen University
Jouni Sirén Wellcome Trust Sanger Institute
Sandra Smit Wageningen University
Tina Zekic Bielefeld University

Organizing Committee

Name Affiliation
Jasmijn Baaijens CWI Amsterdam, the Netherlands
Johannes Köster CWI Amsterdam, the Netherlands
Alexander Schönhuth CWI Amsterdam, the Netherlands

Workshop Schedule

Monday, Feb 20 2017

Tuesday, Feb 21 2017

9:00-9:10: Arrival at CWI

Start Speaker  
9:10 Sven Rahmann From discrete to continuous - new algorithms for fused LASSO on graphs
9:55 Thierry Lecroq FM-index of Alignment with Gaps
10:40   Coffee Break
11:00 Jouni Siren Relative Data Structures
11:45 Mohamed Abouelhoda Big Data Challenges for Clinical Bioinformatics
12:30   Lunch
14:00 Eric Rivals Superstrings, cycles covers, and multiplicities
14:45 Jasmijn Baaijens Haplotype-resolved genome assembly using overlap graphs
15:30   Coffee Break
16:00 Rayan Chikhi Efficient construction of compacted de Bruijn graphs
16:45 Ehsan Haghshenas Polishing draft genome assemblies using de Bruijn graph
17:30   Open Discussion

Each talk of 25 minutes is followed by up to twenty minutes of questions/discussion

19:00: Dinner at restaurant Tomaz

Wednesday, Feb 22 2017

Start Speaker  
9:00 Johannes Fischer New Results on Wavelet Tree/Matrix Construction
9:45 Siavash Sheikhizadeh PanTools: representation, storage and exploration of pan-genomic data
10:30   Coffee Break
11:00 Anthony Cox Putting the “p” in BWT: the positional Burrows-Wheeler transform and its applications
11:45 Jasper Linthorst, Tom Mokveld Practical algorithms for the creation and use of multi-genome reference graphs
12:30   Lunch
14:00 Kamil Salikhov ProPhyle – a memory efficient BWT-based metagenomic classifier
14:45 Mai Alzamel Palindromic Decompositions with Gaps and Errors
15:30 Coffee Break  
16:00 Lorraine Ayad MARS: improving multiple circular sequence alignment using refined sequences
16:45   Open Discussion

Each talk of 25 minutes is followed by up to twenty minutes of questions/discussion

Talks (preliminary)

Palindromic Decompositions with Gaps and Errors

Mai Alzamel, Michał Adamczyk, Panagiotis Charalampopoulos, Costas Iliopoulos, and Jakub Radoszewski.

Identifying palindromes in sequences has been an interesting line of research after the discovery of the relation of palindromes in the DNA sequence with the HIV virus. Efficient algorithms for the factorisation of sequences into palindromes and maximal palindromes have been devised in recent years. We extend these studies by imposing a lower bound to the length of acceptable palindromes and allowing gaps and errors in the decompositions. We present an algorithm for obtaining such a palindromic decomposition of a string of length n with the minimal total gap length in time O(n log n.g) and space O(n.g), where g is the number of allowed gaps. We further present an O(n (g + 𝛿 ))-time and O(n. g)-space algorithm for the decomposition of a string of length n in maximal palindromes with at most 𝛿 errors each, under the edit or Hamming distance, and g allowed gaps.

New Results on Wavelet Tree/Matrix Construction

Johannes Fischer

The wavelet tree (Grossi et al. [SODA, 2003]) and wavelet matrix (Claude et al. [Inf. Syst., 47:15–32, 2015]) are compact indices for texts over an alphabet [0,sigma) that support rank, select and access queries in O(log sigma) time. We first present new practical sequential and parallel algorithms for wavelet matrix construction. Their unifying characteristics is that they construct the wavelet matrix bottom-up, i.e., they compute the last level first. We also show that this bottom-up construction can easily be adapted to wavelet trees.

FM-index of Alignment with Gaps

Thierry Lecroq, Joong Chae Na, Hyunjoon Kim, Seunghwan Min, Heejin Park, Martine Leonard, Laurent Mouchard and Kunsoo Park

Recently a compressed index for similar strings, called the FM-index of alignment (FMA), has been proposed with the functionalities of pattern search and random access. The FMA is quite efficient in space requirement and pattern search time, but it is applicable only for an alignment of strings without gaps. In this talk we propose the FM-index of alignment with gaps, a realistic index for similar strings, which allows gaps in their alignment. For this, we design a new version of the suffix array of alignment by using an alignment transformation and a new definition of the alignment-suffix. The new suffix array of alignment enables us to support the LF-mapping and backward search, the key functionalities of the FM-index, regardless of gap existence in the alignment. We experimentally compared our index with RLCSA due to Makinen et al. and related indexes GCSA due to Siren et al. and GCSA2 due to Siren on genome sequences from the 1000 Genomes Project. The index size of our index is smaller than those of other indexes.

Relative Data Structures

Jouni Siren

Relative data compression encodes the target dataset relative to a similar reference dataset. Given the reference and the target compressed relative to the reference, we can then recover the target dataset. With data structures, we may want to go further than that. Given a data structure for the reference and a relative data structure for the target, we may want to simulate the target data structure efficiently.

In this talk, I will present relative FM-indexes and suffix trees for assembled genomes. I will discuss the techniques used in the data structures and show benchmarks against ordinary FM-indexes and compressed suffix trees.

From discrete to continuous - new algorithms for fused LASSO on graphs

Sven Rahmann, Elias Kuthe

Frequently, solving a discrete optimization problem is much harder than the corresponding continuous problem (e.g., general integer linear programming vs. (continuous) linear programming).

We are interested in very large (“genome-wide”) instances of the fused LASSO problem, a continuous convex optimization problem. Given a graph G=(V,E), observed data y=(y_v) on the nodes, weights w=(w_v) on the nodes and weights u=(u_e) on the edges, find values x=(x_v) on the nodes that are both close to the observed data (quadratic error) and close to neighboring values (in the l1 sense).

Formally, we seek x to minimize f(x) := sum_v w_v * (x_v - y_v)^2 + sum_{e={v,v’}} u_e * |x_v - x_v’|. This particular target function ensures that the solution x is close to the data y and regionally constant (nonzero differences are sparse). In principle, standard convex optimization methods will do the job. However, due to the special structure of the problem (and of the particular graph in a given application!), one can hope to do much better. Indeed, on a grid graph this formulation is frequently used for image denoising.

Here, we consider ladder-like graphs that model a signal across a chromosome in several individuals belonging to two classes. Immediate applications include methylation levels and copy number variations. We show that a forward-like algorithm efficiently solves a discretized version of the problem, which is often already sufficient in practice. We further show that by iteratively refining the discretization, we are able to solve the continuous problem to any desired accuracy.

This appears to be a rare case where taking the detour via a discrete problem yields an efficient algorithm for the original continuous problem. The efficiency of the method, however, crucially depends on the graph structure.

MARS: improving multiple circular sequence alignment using refined sequences.

Lorraine Ayad, Solon Pissis

A fundamental assumption of all widely-used multiple sequence alignment techniques is that the left- and right-most positions of the input sequences are relevant to the alignment. However, the position where a sequence starts or ends can be totally arbitrary. This is relevant, for instance, in the process of multiple sequence alignment of mitochondrial DNA, viroid, viral or other genomes, which have a circular molecular structure.

A solution for these inconsistencies would be to identify a suitable rotation (cyclic shift) for each sequence; these refined sequences may in turn lead to improved multiple sequence alignments using the preferred multiple sequence alignment program. MARS is a new heuristic method for improving Multiple circular sequence Alignment using Refined Sequences. The program computes the rotations (cyclic shifts) required to best align a set of input sequences, where the output can then be input into any multiple sequence alignment program.

Experimental results, using real and synthetic data are presented, showing that MARS improves the alignments, with respect to standard genetic measures and the inferred maximum-likelihood-based phylogenies, and outperforms state-of-the-art methods both in terms of accuracy and efficiency.

Efficient construction of compacted de Bruijn graphs

Rayan Chikhi

The de Bruijn graph is a widely used data structure in assembly algorithms. Compacting the graph is an important data reduction step where long simple paths are compacted into single vertices. Construction of the compacted graph has recently become the bottleneck in assembly pipelines, and improving its running time and memory usage is an important problem. In this talk I’ll present an algorithm and a tool BCALM 2 for the compaction of de Bruijn graphs. BCALM 2 is a parallel algorithm that distributes the input based on a minimizer hashing technique, allowing for good balance of memory usage throughout its execution. It is at least an order of magnitude more efficient than other available methods. Source code: https://github.com/GATB/bcalm

ProPhyle – a memory efficient BWT-based metagenomic classifier

Karel Brinda, Kamil Salikhov, Simone Pignotti

Metagenomics is a powerful approach to study genetic content of environmental samples that has been strongly promoted by NGS technologies. A way to improve the accuracy of metagenomic classification is to match the metagenome against a largest possible set of known genomic sequences. With many thousands of completed microbial genomes available today, modern metagenomic projects match their samples against genomic databases of tens of billions of bp. To cope with increasingly large metagenomic projects, alignment-free methods have recently come into use. The most popular tool - Kraken - performs metagenomic classification of NGS reads based on the analysis of shared k-mers between an input read and each genome from a pre-compiled database. Kraken provides an extremely rapid read classification, but its index suffers from two major limitations. First, Kraken’s enormous memory consumption, due to a large hash table, does not allow one to perform classification other than on high-performance clusters. Second, each k-mer K in the index is represented by the lowest common ancestor of all nodes that contains K, which can result in an inaccurate classification. We present ProPhyle, a metagenomic classifier based on BWT-index. ProPhyle uses a classification algorithm similar to Kraken, but with an indexing strategy based on a bottom-up propagation of k-mers in the tree, assembling contigs at each node and matching using a standard full-text search. Matching can be done using two different approaches: “restarted search”, when each k-mer is processed independently, and “rolling window”, which utilizes the fact that two consecutive k-mers in a read have a suffix-prefix overlap of length k - 1. The latter approach, which is based on kLCP data structure, allows to decrease the assignment time so that it can be comparable to the Kraken’s one. The obtained index occupies only a fraction of RAM compared to Kraken – 13 GB instead of 120 GB for index construction and 14 GB instead of 75 GB for index querying. The resulting index is also more expressive as it can, for every queried k-mer, retrieve a list of all genomes in which the k-mer occurs.

Practical algorithms for the creation and use of multi-genome reference graphs

Jasper Linthorst, Tom Mokveld

The application of multi-reference representations holds a great promise for improving the quality of genetic variant detection. We present a practical algorithm for the construction of multi-genome reference graphs. We introduce the concept of recursive exact matching and show how this can be applied to graph-based multi-genome representations. Subsequently we outline an algorithm that enables the alignment of reads through such a graph, by indexing all haplotype informed paths.

Big Data Challenges for Clinical Bioinformatics

Mohamed Abouelhoda

The recent advancement of Next Generation Sequencing technologies in terms of cost, throughput, and accuracy has already changed the clinical practice largely for diagnosis and growingly for treatment. However, the analysis requirements regarding quality, accuracy, speed, and integrated knowledge sources are higher, compared to samples from other organisms. Such sophistications along with the sheer volume of data shape a real big data problem with a number of challenges. In this talk, we will address a number of computational challenges for processing thousands of samples for large genome projects and for clinical use. We will discuss some criteria for selecting the data management and analysis methods, considering the recent advances in computational infrastructure and storage systems that can help overcoming the Big Data challenge. The discussion will be based on best practices in the Saudi Human Genome Program, which targets to sequence 100,000 genomes over a period of five years. We will also briefly shed light on the still coming challenges with new sequencing techniques and applications in the clinic.

Superstrings, cycles covers, and multiplicities.

Eric Rivals, Bastien Cazaux

In genome or transcriptome assembly, dealing with repeated sequence remains a major bottleneck. The number of repeated elements in a genome varies greatly between species, but both vertebrates and plants often exhibit a repetitive fraction of their genome that goes beyond 40%. Reads located in different copies of a repeated region tend to collapse in a single region, the coverage of which provides an estimation of the number of copies. It is thus crucial to develop algorithms able to reconstruct a target sequence and at the same time to satisfy the constraints imposed by the information on the /multiplicity/ of each input word, where the multiplicity means the desired number of copies for each “read”. Here, we will describe an algorithm for computing a cycle cover of a set of input words with multiplicities. This algorithm uses a specific graph to account for these multiplicities and remains linear in complexity. It thus provides a way to compute approximate shortest superstring with multiplicities.

Putting the “p” in BWT: the positional Burrows-Wheeler transform and its applications

Anthony Cox

Introduced by Richard Durbin in 2014, the positional Burrows-Wheeler transform (pBWT) is facilitating new ways to speed up some of the key computational tasks in population genetics, much as the “regular” Burrows-Wheeler transform has become a cornerstone of text search and compression algorithms in bioinformatics and elsewhere. Algorithms based on the pBWT are enabling computations such as phasing and IBD detection to scale to the cohorts of hundreds of thousands of individuals being genotyped by population-scale sequencing and microarray studies. I will discuss some of these algorithms and outline the relationship of the pBWT to the regular Burrows-Wheeler transform.

PanTools: representation, storage and exploration of pan-genomic data

Siavash Sheikhizadeh, M. Eric Schranz, Mehmet Akdel, Dick de Ridder and Sandra Smit

Thanks to the impressive advances in sequencing technologies, the availability of eukaryotic genome assemblies is growing rapidly. The fact that many species are now represented by multiple (reference) genomes necessitates a transition from single-genome to pan-genome analyses. We define the pan-genome as a comprehensive representation of multiple genomes, facilitating comparative analyses at the nucleotide, gene, and function level. Current pan-genomic approaches do not thoroughly address scalability, functionality, and usability issues. We propose a graph structure consisting of a hierarchy of nucleotide, annotation, and function layers, which is stored in a Neo4j graph database making it scalable to large datasets. We have developed an online algorithm to construct a generalized De Bruijn graph as the nucleotide layer of the pan-genome. In addition, our toolbox PanTools provides useful functionalities for adding new genomes and annotations, retrieving sequence of genes and genomic regions, reconstructing the constituent genomes, and inferring orthology relationships in the pan-genome. Efficiency and accuracy of PanTools have been validated on various eukaryotic data sets.

Haplotype-resolved genome assembly using overlap graphs

Jasmijn Baaijens, Amal Zine El Aabidine, Eric Rivals, Alexander Schönhuth

While most genome assemblers focus on reconstructing a consensus genome, an even more challenging task is to find the individual haplotype sequences present in an NGS sample. These haplotypes can be of high clinical interest, for example when studying a viral infection: such a sample contains multiple, closely related virus strains (i.e., a viral quasispecies) which may react differently to a given medical treatment. We present SAVAGE, a computational tool for reconstructing individual haplotypes from a polyploid, highly diverse sample. SAVAGE makes use of the overlap graph assembly paradigm in combination with a sound statistical model, so that edges in the graph reflect that two reads originate from identical haplotypic sequences, and iteratively extends contigs into individual haplotypes. In benchmark experiments on both simulated and real data, SAVAGE drastically outperforms generic de novo assemblers as well as the only specialized de novo viral quasispecies assembler so far. When using an ad-hoc consensus reference, SAVAGE also compares very favorably with the state-of-the-art reference-guided viral quasispecies assemblers.

Polishing draft genome assemblies using de Bruijn graph

Ehsan Haghshenas

The advent of Single Molecule Sequencing (SMS) technologies greatly increased the feasibility of obtaining more contiguous assemblies compared to Next Generation Sequencing (NGS) approaches. However, unlike NGS-based assemblies, contigs generated by SMS-based methods lack a high accuracy due to the significantly higher error rate. Although hybrid assembly approaches can be a good alternative to get more accurate contigs, the current state of the art hybrid assemblers work only for bacterial-size genomes.

In order to get higher accuracy contigs, usually, a polishing stage is done after obtaining the contigs. However, the current polishing approaches are time-consuming. In this talk, we investigate the possibility of using a de Bruijn graph built from NGS reads to polish assembled draft contigs.