In recent years, alignment-free sequence analysis methods have gained importance because they are considerably faster than traditional mapping- and alignment-based methods while producing equivalent results. Recently, methods have emerged that can index very large collections of sequenced DNA samples (e.g. every genome ever sequenced).
The basis of each alignment-free method is a so-called k-mer dictionary (or key-value store) that associates a value (e.g., a transcript ID, chromosome number, species ID or counter) with each DNA substring of length k (from a genome or a sequenced sample). Almost always, such a dictionary is implemented via hashing. Ideally, considering that billions of k-mers have to be processed, such a hash table is both small and fast. Designing fast and small hash tables for a given task is both a science and an art.
This tutorial is addressed to bioinformaticians who have heard about or used alignment-free methods and would like to know more about the underlying hashing algorithms. The tutorial will also be of interest to algorithmically oriented scientists who have not followed the advances in hashing methods over the past few years.
Following the tutorial will enable you to better understand the underlying methods (and their limitations) of many state-of-the-art sequence analysis tools in genomics, transcriptomics, metagenomics and pangenomics. It will also help you to design your own method efficiently when the need arises.
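To make the idea of a k-mer dictionary concrete, the following minimal sketch counts k-mers using Python's built-in hash table. It is an illustration under stated assumptions, not the implementation of any particular tool: the 2-bit encoding (A=0, C=1, G=2, T=3), the function names `kmer_codes`, `count_kmers` and `encode`, and the choice to skip windows containing non-ACGT characters are all assumptions made for this example.

```python
from collections import Counter

# Assumed 2-bit encoding of the DNA alphabet (illustrative choice).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_codes(seq, k):
    """Yield the integer code of every k-mer in seq.

    Windows that contain a character other than A, C, G, T are skipped.
    """
    mask = (1 << (2 * k)) - 1   # keeps only the lowest 2k bits
    code = 0
    valid = 0                   # consecutive valid bases seen so far
    for base in seq.upper():
        c = CODE.get(base)
        if c is None:           # e.g. 'N': restart the window
            code = 0
            valid = 0
            continue
        code = ((code << 2) | c) & mask
        valid += 1
        if valid >= k:
            yield code


def count_kmers(seq, k):
    """Return a dictionary mapping encoded k-mers to their counts."""
    return Counter(kmer_codes(seq, k))


def encode(kmer):
    """Encode a single k-mer string (A, C, G, T only) as an integer."""
    code = 0
    for base in kmer:
        code = (code << 2) | CODE[base]
    return code


counts = count_kmers("ACGTACGT", 3)
print(counts[encode("ACG")])  # number of occurrences of ACG
```

Here the value stored for each k-mer is a counter; a transcript ID or species ID could be stored in the same way. A general-purpose hash table such as this one is convenient but far from small, which is the motivation for the specialised designs discussed in the tutorial.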
We will cover the following topics:
– Introductory examples from alignment-free methods (error correction, transcriptomics, metagenomics, …)
– Design goals of k-mer key-value stores (exact vs. probabilistic filtering, static vs. dynamic, optimising for speed vs. space)
– Hashing and classical strategies of collision resolution
– Modern hashing and collision resolution strategies: Cuckoo, Hopscotch, Robin Hood, …
– Probabilistic hashing / filtering: Bloom filters and variants, Cuckoo filters, …
– Typical hash functions on DNA sequences and their properties
– Practical construction of minimal perfect hash functions for k-mers
– The role of modern hardware: CPU caches, prefetching, parallelism
– Examples: Analysis and discussion of the hashing strategies implemented in several state-of-the-art tools
– Bring your own application or idea and design a hashing strategy for it
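As a small taste of the probabilistic filtering topic listed above, here is a hedged sketch of a Bloom filter for k-mers. It is not taken from any of the tools discussed in the tutorial: the class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, and the use of BLAKE2b with double hashing to derive the bit positions are assumptions chosen for brevity, not for speed.

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, kmer):
        # Derive num_hashes bit positions from two 64-bit values
        # (double hashing); an illustrative choice of hash functions.
        digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, kmer):
        for p in self._positions(kmer):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, kmer):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(kmer))


bf = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in ["ACGTA", "CGTAC", "GTACG"]:
    bf.add(kmer)
print("ACGTA" in bf)  # an inserted k-mer is always reported as present
```

A query for a k-mer that was never inserted usually returns `False`, but may return `True` with a probability that depends on `num_bits`, `num_hashes` and the number of inserted k-mers. This trade-off between space and error rate is one of the design goals covered in the tutorial.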