Fast kmer counting table algorithm using perfect hash function: C++ pseudo-code integration into R using Rcpp API

Macherki, M E

Abstract

Counting kmers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We propose a simple algorithm that counts kmers using a perfect hash table. It is implemented in C++ and exported to R through the Rcpp API. A PDF version is available: Fast kmer counting table algorithm using perfect hash function: C++ pseudo-code integration into R using Rcpp API
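The abstract does not reproduce the code, so the following is only a minimal sketch of the general idea, not the implementation from the PDF. It assumes a 2-bit encoding (A=0, C=1, G=2, T=3), under which the base-4 value of a kmer is a collision-free index into a table of 4^k counters. The names `base_code` and `count_kmers`, the cap on k, and the choice to skip windows containing non-ACGT characters are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Map a nucleotide to a 2-bit code; returns -1 for any other character (e.g. N).
inline int base_code(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Count every kmer of `seq` in a direct-address table with 4^k slots.
// The slot of a kmer is its base-4 value, so distinct kmers never collide.
// Hypothetical helper: the cap k <= 12 is an illustrative memory limit.
std::vector<std::uint32_t> count_kmers(const std::string& seq, unsigned k) {
    if (k == 0 || k > 12) {
        throw std::invalid_argument("k must be between 1 and 12");
    }
    const std::size_t table_size = std::size_t(1) << (2 * k);  // 4^k
    const std::size_t mask = table_size - 1;                   // keeps the last k bases
    std::vector<std::uint32_t> counts(table_size, 0);

    std::size_t hash = 0;   // rolling base-4 value of the current window
    std::size_t valid = 0;  // number of consecutive valid bases seen so far
    for (char c : seq) {
        const int code = base_code(c);
        if (code < 0) {     // ambiguous base: restart the window
            hash = 0;
            valid = 0;
            continue;
        }
        hash = ((hash << 2) | static_cast<std::size_t>(code)) & mask;
        if (++valid >= k) {
            ++counts[hash];
        }
    }
    return counts;
}
```

Because the hash is updated with one shift and one mask per base, the whole sequence is processed in a single pass. To call such a function from R, one option would be to mark a wrapper with `// [[Rcpp::export]]` and compile it with `Rcpp::sourceCpp`; that wrapper is not shown here.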

In the daylight of biotechnology:Rcpp and R applications

Monday, June 13, 2016