Fast kmer counting table algorithm using perfect hash function: C++ pseudo-code integration into R using Rcpp API

Macherki, M E

Saturday, October 01, 2016

Shell sorting algorithm: review and implementation example using Rcpp

There are several reasons for studying t simple sorting algorithms in some detail. First, they provide a relatively painless way to learn terminology and basic mechanisms for sorting algorithms so that we get an adequate background for studying the more sophisticated algorithms. Second, there are many applications of sorting where it’s better to use these simple methods than the more powerful general-purpose methods. Finally, some of the simple methods extend to better general purpose methods or can be used to improve the efficiency of more powerful methods. The most prominent example of this is seen in recursive sorts which “divide and conquer” big files into many small ones. Obviously, it is advantageous to know the best way to deal with small files in such situations.

As mentioned above, there are several sorting applications in which a relatively simple algorithm may be the method of choice. Sorting programs are often used only once (or only a few times). If the number of items to be sorted is not too large (say, less than five hundred elements), it may well be more efficient just to run a simple method than to implement and debug a complicated method. Elementary methods are always suitable for small files (say, less than fifty elements); it is unlikely that a sophisticated algorithm would be justified for a small file, unless a very large number of such files are to be sorted. Other types of files that are relatively easy to sort are ones that are already almost sorted (or already sorted) or ones that contain large numbers of equal keys. Simple methods can do much better on such well-structured files than general-purpose methods.

Insertion sort is the method often used by people to sort bridge hands: consider the elements one at a time, inserting each in its proper place among those already considered (keeping them sorted). The element being considered is inserted merely by moving larger elements one position to the right, then inserting the element into the vacated position. The code for this algorithm is straightforward.

Insertion Sort by Diminishing Increment A refinement of the straight insertion sort was proposed by D. L. Shell in l959 (Shell 1959). The method is explained and demonstrated on our standard example of eight items (see Table 1). First, all items that are four positions apart are grouped and sorted separately. This process is called a 4-sort. In this example of eight items, each group contains exactly two items. After this first pass, the items are regrouped into groups with items two positions apart and then sorted anew. This process is called a 2-sort. Finally, in a third pass, all items are sorted in an ordinary sort or 1-sort. One may at first wonder if the necessity of several sorting passes, each of which involves all items, does not introduce more work than it saves. However, each sorting step over a chain either involves relatively few items or the items are already quite well ordered and comparatively few rearrangements are required. It is obvious that the method results in an ordered array, and it is fairly obvious that each pass profits from previous passes (since each i-sort combines two groups sorted in the preceding 2i-sort).

It is also obvious that any sequence of increments is acceptable, as long as the last one is unity, because in the worst case the last pass does all the work. It is, however, much less obvious that the method of diminishing increments yields even better results with increments other than powers of 2.

The procedure is therefore developed without relying on a specific sequence of increments. The T increments are denoted by h0, h1, ... , hT-1 with the conditions ht-1 = 1, hi+1 < hi

Analysis of Shellsort.

The analysis of this algorithm poses some very difficult mathematical problems, many of which have not yet been solved. In particular, it is not known which choice of increments yields (gap function) the best results. One surprising fact, however, is that they should not be multiples of each other. This will avoid the phenomenon evident from the example given above in which each sorting pass combines two chains that before had no interaction whatsoever. It is indeed desirable that interaction between various chains takes place as often as possible, and the following theorem holds: If a k-sorted sequence is i-sorted, then it remains k-sorted.

1, 4, 13, 40, 121, ...where hk-1 = 3hk+1, ht = 1, and t = k×log3(n) - 1.He also recommends the sequence 1, 3, 7, 15, 31, ... where hk-1 = 2hk+1, ht = 1, and t = k×log2(n) - 1(Nievergelt and Hinrichs 1999)

For the latter choice, mathematical analysis yields an effort proportional to n2 required for sorting n items with the Shellsort algorithm. Although this is a significant improvement over n2 , we will not expound further on this method, since even better algorithms are known.

Deciding which gap sequence to use is difficult. Every gap sequence that contains 1 yields a correct sort; however, the properties of thus obtained versions of Shellsort may be very different.

The table below compares most proposed gap sequences published so far((Shell 1959), (Frank and Lazarus 1960), (Hibbard 1963), (Papernov and Stasevich 1965), (Incerpi and Sedgewick 1985) (Sedgewick 1986) (Gonnet, Baeza-Yates et al. 1991) (Tokuda 1992) (Ciura 2001)). Some of them have decreasing elements that depend on the size of the sorted array (N). Others are increasing infinite sequences, whose elements less than N should be used in reverse order.

R and example of code and benchmarking:

The Rcpp package has become the most widely used language extension for R, the powerful environment and language for computing with data. As of September 2016, 778 packages on CRAN and a further 76 on BioConductor deploy Rcpp to extend R, to accelerate computations and to connect to other C++ projects. Rcpp provides a powerful API on top of R, permitting direct interchange of rich R objects (including S3, S4 or Reference Class objects) between R and C++. Rcpp sugar gives syntactic sugar such as vectorised C++ expression; Rcpp modules provide easy extensibility using declarations and Rcpp attributes greatly facilitates code integration.

Shell sort is used naturally in R using order and sort function in the basic package. The source file of the Rcpp implementation is available at: M. E., Macherki (2016): Shell Sort Algorithm:Rcpp. figshare.

https://dx.doi.org/10.6084/m9.figshare.3386284.v1

References

Ciura, M. (2001). Best increments for the average case of shellsort. International Symposium on Fundamentals of Computation Theory, Springer.

Frank, R. and R. Lazarus (1960). "A high-speed sorting procedure." Communications of the ACM 3(1): 20-22.

Gonnet, G. H., et al. (1991). Lexicographical indices for text: Inverted les vs pat trees, Technical Report TR-OED-91-01, University of Waterloo.

Hibbard, T. N. (1963). "A simple sorting algorithm." Journal of the ACM (JACM) 10(2): 142-150.

Incerpi, J. and R. Sedgewick (1985). "Improved upper bounds on Shellsort." Journal of Computer and System Sciences 31(2): 210-224.

Nievergelt, J. and K. H. Hinrichs (1999). Algorithms & data structures, vdf Hochschulverl. an der ETH.

Papernov, A. and G. Stasevich (1965). "A method of information sorting in computer memories." Problemy Peredachi Informatsii 1(3): 81-98.

Sedgewick, R. (1986). "A new upper bound for Shellsort." Journal of Algorithms 7(2): 159-173.

Shell, D. L. (1959). "A high-speed sorting procedure." Communications of the ACM 2(7): 30-32.

Tokuda, N. (1992). An improved Shellsort. Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture-Information Processing'92, Volume 1-Volume I, North-Holland Publishing Co.

Monday, June 13, 2016

Fast kmer counting table algorithm using perfect hash function: C++ pseudo-code integration into R using Rcpp API

Abstract

Counting kmers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We proposed a simple algorithm to calculate the kmer count using perfect hash table implemented in C++ and using Rcpp API to be able exported into R. The pdf version is available at: Fast kmer counting table algorithm using perfect hash function: C++ pseudo-code integration into R using Rcpp API

References

1. Deorowicz, S., A. Debudaj-Grabysz, and S. Grabowski, Disk-based k-mer counting on a PC. BMC bioinformatics, 2013. 14(1): p. 1.

2. Melsted, P. and J.K. Pritchard, Efficient counting of k-mers in DNA sequences using a bloom filter. BMC bioinformatics, 2011. 12(1): p. 1.

3. Zhang, Q., et al., These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PloS one, 2014. 9(7): p. e101271.

4. Pages, H., et al., String objects representing biological sequences, and matching algorithms. R package version, 2009. 2(2).

Monday, May 30, 2016

Fast text file reading into R using Rcpp functions

Using the C++ I/O system provides fast input functions which can be used in this Rcpp code with two ways: 1-using std::stringstream 2-using std::ios system This codes includes also a function to overcome delimiter problems and simple running examples..

Monday, May 23, 2016

Algorithm of binary search :Rcpp code

A binary search divides a range of values(sorted array) into halves by the median point(noted k), and continues tonarrow down the field of search until the pattern(noted x) is found. It is the classic example of a "divide and conquer" algorithm.

Wednesday, February 24, 2016

Seqan an Rcpp

https://uclrdata.wordpress.com/2013/04/05/unix-sequencing-tools-vs-bioconductor/

Seqan and rcpp

https://uclrdata.wordpress.com/2013/05/01/rcpp-rcpparmadillo-seqan-sequence-analysis-part-3/

Tuesday, February 16, 2016

Rcpp: the Shell sort code with gaps function exemples

shellsoert.htm


/*
 * The method starts by sorting pairs of elements far apart from each other, then progressively reducing the gap between elements to be compared.
 * The running time of Shellsort is heavily dependent on the gap sequence it uses.
 * Many gap function are evaluable see Link.
 * source file:Rcpp: the Shell sort code with gaps function exemples

 * 
 * The best profile was effectuated when we use sort_shell_sed function.
 * 
 *                                 MACHERKI M E.
 *                                 17/02/2016
 * 
 * 
 */

#include <Rcpp.h>
using namespace Rcpp;



//The following code is the typical body of the shell sort algorithm:
inline void shell ( NumericVector unsorted, int size, NumericVector Gaps)
{
  int gn=Gaps.size();
  int j,gap ;
  
  for (int t = gn-1; t > -1; t--)
  {
    gap=Gaps[t];
    for (int i = gap; i < size; ++i)
    {
      double temp = unsorted[i];
      for (j = i; j >= gap && temp < unsorted[j - gap]; j -= gap)
      {
        unsorted[j] = unsorted[j - gap];
      }
      unsorted[j] = temp;
    } 
  }
}
//******************gaps functions*************************

//Many gaps function can be used 
/*Shell sort function using slicing by two gaps (Shell, 1959)
 Gaps= N/2^k={N/2,N/4,N/8...1}
*/
inline NumericVector gapshell( int size ) {
  NumericVector result;
  for(int i=size/2;i>0;i/=2)
    result.push_back(i);
  return rev(result);}
/*Method of Frank & Lazarus, 1960
 Gaps=2*(N/2^k+1)+1
*/
inline NumericVector gapFrank( int N ) {
  int tmp=10;
  NumericVector result;
  for(int k=2;tmp>1;k++){
    tmp=2*(N/pow(2,k))+1;
    result.push_back(tmp);}
  return rev(result);}

/*Method of Hibbard, 1963
 Gaps=2^k-1
 */
inline NumericVector gapHibbard( int N ) {
  NumericVector result;
  int tmp=0;
  for(int k=1;tmp<N;k++){
    tmp=pow(2,k)-1;
    result.push_back(tmp);}
  return result;}

/*Method of Pratt, 1971
 Gaps=(3^k-1)/2
 */
inline NumericVector gappratt( int N ) {
  int n=N/3;
  NumericVector result;
  int tmp=0;
  for(int k=1;tmp<n;k++){
    tmp=(pow(3,k)-1)/2;
    result.push_back(tmp);}
  return result;}
/*Method of Sedgewick, 1986
 Gaps=4^k+3*2^(k-1)+1
 */
inline NumericVector gapsed( int N ) {
  NumericVector result;
  result.push_back(1);
    int tmp=0;
  for(int k=1;;k++){
    tmp=pow(4,k)+(3*pow(2,k-1))+1;
    if(tmp>N)break;
    result.push_back(tmp);}
  return result;}
/*Method of Tokuda, 1992
 Gaps=(9^k-4^k)/(5*4^(k-1))
*/

inline NumericVector gaptocuda( int N ) {
  NumericVector result;
  int tmp=0;
  for(int k=1;;k++){
    tmp=(pow(9,k)-pow(4,k))/(5*pow(4,k-1));
    if(tmp>N)break;
    result.push_back(tmp);}//we can use tmp+1
  result[0]=1;
  return result;}
/*Method of Tokuda, 1992:generic
 Gaps=h[i]=(h[i-1])*2.25+1; h[1]=1,1 base indexing
*/

inline NumericVector gaptocuda_emp( int N ) {
  NumericVector result;
  result.push_back(1);
  int tmp=1;
  for(int k=1;;k++){
    tmp=2.25*tmp+1;
    if(tmp>N)break;
    result.push_back(tmp);}//we can use tmp+1
  return result;}





//using empirically derived gaps (Ciura, 2001) N<2000

NumericVector gapCiura( int sz ) { 
  NumericVector result;
  result.push_back(1);
  result.push_back(4);
  result.push_back(10);
  result.push_back(23);
  result.push_back(57);
  result.push_back(132);
  result.push_back(301);
  result.push_back(701);
  result.push_back(1705);
  return result;}

/******************sorting functions*************************
************************************************************/







// [[Rcpp::export]]
NumericVector sort_shell_gap(NumericVector unsorted) { //run
  int N =unsorted.size();
  NumericVector gaps=gapshell(N);
  NumericVector result=clone(unsorted);
  shell(result,N,gaps);
  return result;}
// [[Rcpp::export]]
NumericVector sort_shell_frunk(NumericVector unsorted) { //run
  int N =unsorted.size();
  NumericVector gaps=gapFrank(N);
  NumericVector result=clone(unsorted);
  shell(result,N,gaps);
  return result;}
// [[Rcpp::export]]
NumericVector sort_shell_hibbard(NumericVector unsorted) { //run
  int N =unsorted.size();
  NumericVector gaps=gapHibbard(N);
  NumericVector result=clone(unsorted);
  shell(result,N,gaps);
  return result;}
// [[Rcpp::export]]
NumericVector sort_shell_patt(NumericVector unsorted) { //run
  int N =unsorted.size();
  NumericVector gaps=gappratt(N);
  NumericVector result=clone(unsorted);
  shell(result,N,gaps);
  return result;}
// [[Rcpp::export]]
NumericVector sort_shell_sed(NumericVector unsorted) { //run
  int N =unsorted.size();
  NumericVector gaps=gapsed(N);
  NumericVector result=clone(unsorted);
  shell(result,N,gaps);
  return result;}
// [[Rcpp::export]]
NumericVector sort_shell_tacuda(NumericVector unsorted) { //run
  int N =unsorted.size();
  NumericVector gaps=gaptocuda(N);
  NumericVector result=clone(unsorted);
  shell(result,N,gaps);
  return result;}
// [[Rcpp::export]]
NumericVector sort_shell_tacudaemp(NumericVector unsorted) { //run
  int N =unsorted.size();
  NumericVector gaps=gaptocuda_emp(N);
  NumericVector result=clone(unsorted);
  shell(result,N,gaps);
  return result;}
// [[Rcpp::export]]
NumericVector sort_shell_Ciura(NumericVector unsorted) { //run
  int N =unsorted.size();
  NumericVector Ciura=gapCiura(0);
  NumericVector result=clone(unsorted);
  shell(result,N,Ciura);
  return result;}
/***R
##trial mode
x<-runif(1000)
base<-sort(x)
a<-sort_shell_gap(x)
b<-sort_shell_frunk(x)
c<-sort_shell_hibbard(x)
d<-sort_shell_patt(x)
e<-sort_shell_sed(x)
f<-sort_shell_tacuda(x)
g<-sort_shell_tacudaemp(x)
h<-sort_shell_Ciura(x)
all.equal.list(base,a,b,c,e,f,g,h)
## timing :
x<-runif(1000000)
system.time(base<-sort(x))
system.time(a<-sort_shell_gap(x))
system.time(b<-sort_shell_frunk(x))
system.time(c<-sort_shell_hibbard(x))
system.time(d<-sort_shell_patt(x))
system.time(e<-sort_shell_sed(x))
system.time(f<-sort_shell_tacuda(x))
system.time(g<-sort_shell_tacudaemp(x))
system.time(h<-sort_shell_Ciura(x))

*/