In the daylight of biotechnology:Rcpp and R applications: Shell sorting algorithm: review and implementation example using Rcpp

There are several reasons for studying t simple sorting algorithms in some detail. First, they provide a relatively painless way to learn terminology and basic mechanisms for sorting algorithms so that we get an adequate background for studying the more sophisticated algorithms. Second, there are many applications of sorting where it’s better to use these simple methods than the more powerful general-purpose methods. Finally, some of the simple methods extend to better general purpose methods or can be used to improve the efficiency of more powerful methods. The most prominent example of this is seen in recursive sorts which “divide and conquer” big files into many small ones. Obviously, it is advantageous to know the best way to deal with small files in such situations.

As mentioned above, there are several sorting applications in which a relatively simple algorithm may be the method of choice. Sorting programs are often used only once (or only a few times). If the number of items to be sorted is not too large (say, less than five hundred elements), it may well be more efficient just to run a simple method than to implement and debug a complicated method. Elementary methods are always suitable for small files (say, less than fifty elements); it is unlikely that a sophisticated algorithm would be justified for a small file, unless a very large number of such files are to be sorted. Other types of files that are relatively easy to sort are ones that are already almost sorted (or already sorted) or ones that contain large numbers of equal keys. Simple methods can do much better on such well-structured files than general-purpose methods.

Insertion sort is the method often used by people to sort bridge hands: consider the elements one at a time, inserting each in its proper place among those already considered (keeping them sorted). The element being considered is inserted merely by moving larger elements one position to the right, then inserting the element into the vacated position. The code for this algorithm is straightforward.

Insertion Sort by Diminishing Increment A refinement of the straight insertion sort was proposed by D. L. Shell in l959 (Shell 1959). The method is explained and demonstrated on our standard example of eight items (see Table 1). First, all items that are four positions apart are grouped and sorted separately. This process is called a 4-sort. In this example of eight items, each group contains exactly two items. After this first pass, the items are regrouped into groups with items two positions apart and then sorted anew. This process is called a 2-sort. Finally, in a third pass, all items are sorted in an ordinary sort or 1-sort. One may at first wonder if the necessity of several sorting passes, each of which involves all items, does not introduce more work than it saves. However, each sorting step over a chain either involves relatively few items or the items are already quite well ordered and comparatively few rearrangements are required. It is obvious that the method results in an ordered array, and it is fairly obvious that each pass profits from previous passes (since each i-sort combines two groups sorted in the preceding 2i-sort).

It is also obvious that any sequence of increments is acceptable, as long as the last one is unity, because in the worst case the last pass does all the work. It is, however, much less obvious that the method of diminishing increments yields even better results with increments other than powers of 2.

The procedure is therefore developed without relying on a specific sequence of increments. The T increments are denoted by h0, h1, ... , hT-1 with the conditions ht-1 = 1, hi+1 < hi

Analysis of Shellsort.

The analysis of this algorithm poses some very difficult mathematical problems, many of which have not yet been solved. In particular, it is not known which choice of increments yields (gap function) the best results. One surprising fact, however, is that they should not be multiples of each other. This will avoid the phenomenon evident from the example given above in which each sorting pass combines two chains that before had no interaction whatsoever. It is indeed desirable that interaction between various chains takes place as often as possible, and the following theorem holds: If a k-sorted sequence is i-sorted, then it remains k-sorted.

1, 4, 13, 40, 121, ...where hk-1 = 3hk+1, ht = 1, and t = k×log3(n) - 1.He also recommends the sequence 1, 3, 7, 15, 31, ... where hk-1 = 2hk+1, ht = 1, and t = k×log2(n) - 1(Nievergelt and Hinrichs 1999)

For the latter choice, mathematical analysis yields an effort proportional to n2 required for sorting n items with the Shellsort algorithm. Although this is a significant improvement over n2 , we will not expound further on this method, since even better algorithms are known.

Deciding which gap sequence to use is difficult. Every gap sequence that contains 1 yields a correct sort; however, the properties of thus obtained versions of Shellsort may be very different.

The table below compares most proposed gap sequences published so far((Shell 1959), (Frank and Lazarus 1960), (Hibbard 1963), (Papernov and Stasevich 1965), (Incerpi and Sedgewick 1985) (Sedgewick 1986) (Gonnet, Baeza-Yates et al. 1991) (Tokuda 1992) (Ciura 2001)). Some of them have decreasing elements that depend on the size of the sorted array (N). Others are increasing infinite sequences, whose elements less than N should be used in reverse order.

R and example of code and benchmarking:

The Rcpp package has become the most widely used language extension for R, the powerful environment and language for computing with data. As of September 2016, 778 packages on CRAN and a further 76 on BioConductor deploy Rcpp to extend R, to accelerate computations and to connect to other C++ projects. Rcpp provides a powerful API on top of R, permitting direct interchange of rich R objects (including S3, S4 or Reference Class objects) between R and C++. Rcpp sugar gives syntactic sugar such as vectorised C++ expression; Rcpp modules provide easy extensibility using declarations and Rcpp attributes greatly facilitates code integration.

Shell sort is used naturally in R using order and sort function in the basic package. The source file of the Rcpp implementation is available at: M. E., Macherki (2016): Shell Sort Algorithm:Rcpp. figshare.

https://dx.doi.org/10.6084/m9.figshare.3386284.v1

References

Ciura, M. (2001). Best increments for the average case of shellsort. International Symposium on Fundamentals of Computation Theory, Springer.

Frank, R. and R. Lazarus (1960). "A high-speed sorting procedure." Communications of the ACM 3(1): 20-22.

Gonnet, G. H., et al. (1991). Lexicographical indices for text: Inverted les vs pat trees, Technical Report TR-OED-91-01, University of Waterloo.

Hibbard, T. N. (1963). "A simple sorting algorithm." Journal of the ACM (JACM) 10(2): 142-150.

Incerpi, J. and R. Sedgewick (1985). "Improved upper bounds on Shellsort." Journal of Computer and System Sciences 31(2): 210-224.

Nievergelt, J. and K. H. Hinrichs (1999). Algorithms & data structures, vdf Hochschulverl. an der ETH.

Papernov, A. and G. Stasevich (1965). "A method of information sorting in computer memories." Problemy Peredachi Informatsii 1(3): 81-98.

Sedgewick, R. (1986). "A new upper bound for Shellsort." Journal of Algorithms 7(2): 159-173.

Shell, D. L. (1959). "A high-speed sorting procedure." Communications of the ACM 2(7): 30-32.

Tokuda, N. (1992). An improved Shellsort. Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture-Information Processing'92, Volume 1-Volume I, North-Holland Publishing Co.

In the daylight of biotechnology:Rcpp and R applications

Saturday, October 01, 2016

Shell sorting algorithm: review and implementation example using Rcpp

No comments:

Post a Comment