Algorithms for External Memory Sorting
Milind Gokhale
mgokhale@indiana.edu
December 10, 2014
1. Abstract
Sorting has long been one of the fundamental operations for processing data in any
database. As data volumes grow, sorting data within a stipulated time has become a major
concern for almost all applications. We study two papers on algorithms for external memory
(EM) sorting and describe two algorithms with good I/O complexity.
2. Introduction
With the increasing complexity of applications and the huge volumes of data associated with
them, common operations such as scanning, sorting, searching, and producing query output
can take a long time. Even with the computing power of advanced processors, internal memory
is limited, and the frequent movement of data between internal and external memory has made
the cost of data movement as important as time and space complexity in the design of
algorithms.
The problem of how to sort efficiently has strong practical and theoretical merit and has
motivated many studies in the analysis of algorithms and computational complexity. Studies
[1] confirm that sorting continues to account for roughly one-fourth of all computer cycles.
Much of those resources are consumed by external sorts, in which the file is too large to fit in
internal memory and must reside in slower secondary storage. The bottleneck in external
sorting is the time for input/output (I/O) operations between internal memory and secondary
storage. We study two papers that discuss sorting in external memory (EM) and present
optimal algorithms for EM sorting.
Paper 1: The Input/Output Complexity of Sorting and Related Problems, by Alok Aggarwal
and Jeffrey Scott Vitter, published in Communications of the ACM, volume 31, number 9,
1988.
Area: Algorithms and Data Structures
Paper 2: Algorithms and Data Structures for External Memory, by Jeffrey Scott Vitter,
published in Foundations and Trends in Theoretical Computer Science, volume 2, number 4,
Chapter 5, 2006.
Area: Algorithms and Data Structures, External Memory Algorithms
In Section 3 we describe two EM sorting algorithms, distribution sort and merge sort. In
Section 4 we compare how the two papers approach different aspects of these algorithms
with the common goal of reducing I/O complexity.
3. Techniques/Algorithms
3.1. External Sorting
Even today, sorting accounts for a significant percentage of computer operations and so
is an important paradigm in the design of efficient external memory algorithms. Both
papers present variants of merge sort and distribution sort as the optimal algorithms for
external sorting. They also characterize the I/O complexity of sorting in the following
theorem.
The average-case and worst-case number of I/Os required for sorting N = nB data items
using D disks is
$$\mathrm{Sort}(N) = \Theta\!\left(\frac{n}{D}\,\log_m n\right)$$
where n, m, and D are the parameters defined in Section 3.2.
The algorithms are based upon the generic sort-merge strategy. In the sorting phase,
chunks of data small enough to fit in main memory are read, sorted, and written out to a
temporary file. In the merge phase, the sorted sub-files are combined into a single larger file.
Distribution sort with randomized cycling and simple randomized merge sort are the methods
of choice for external sorting.
3.2. Parameters
• N - # records to sort
• M - # records that can fit into internal memory
• B - # records that can be transferred in a single block
• n = N/B - # blocks of records to sort
• m = M/B - # blocks of records that can fit into internal memory
• D - # independent disk drives
• P - # CPUs
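To make the parameters concrete, the following minimal sketch derives n and m from N, M, and B and evaluates the expression inside the Θ(·) of the theorem in Section 3.1. The function name `sort_io_bound` and the sample values are assumptions chosen for illustration; constant factors are ignored, so the result is an order of magnitude, not an exact I/O count.

```python
import math

def sort_io_bound(N, M, B, D=1):
    """Return (n, m, bound), where bound = (n / D) * log_m(n).

    Hypothetical helper: it evaluates the expression inside Theta(.) only,
    so constant factors are ignored.
    """
    n = math.ceil(N / B)   # blocks of records to sort
    m = M // B             # blocks that fit into internal memory
    if m < 2:
        raise ValueError("internal memory must hold at least two blocks")
    # Assumption: count at least one pass over the data when n <= m.
    bound = (n / D) * max(1.0, math.log(n, m))
    return n, m, bound

# Sample values (assumed): 10^9 records, 10^6 fit in memory, 10^3 per block, 4 disks.
n, m, bound = sort_io_bound(N=10**9, M=10**6, B=10**3, D=4)
print(n, m, round(bound))
```

With these sample values n = 10^6 and m = 10^3, so log_m n = 2 and the bound is on the order of 5 × 10^5 parallel I/O steps.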
3.3. Distribution Sort
In this algorithm the elements to be sorted are partitioned into subsets called buckets.
The important property of the partitioning is that all the items in one bucket precede
all the items in the next bucket. The sorting is completed by recursively sorting the individual
buckets and concatenating them to form a single fully sorted list.
Figure 1: The sorting process [3]
3.3.1. The Sorting Process
The elements stored in external memory are read and partitioned into S buckets. To
partition the elements, S-1 partitioning elements are found first. The input is then
partitioned into S buckets of roughly equal size such that any element in bucket i is greater
than or equal to the (i-1)-th partitioning element and less than or equal to the i-th
partitioning element [2]. This process is repeated recursively until each bucket holds at
most M elements.
Partitioning invariant: $e_{i-1} \le x \le e_i$ for every element x in bucket i, where
$e_i$ denotes the i-th partitioning element.
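The recursive process can be sketched in a few lines of Python. This is an in-memory illustration only: Python lists stand in for files on disk, the name `distribution_sort` is hypothetical, and the partitioning elements are taken from a sorted random sample, which is an assumption made for brevity and not the deterministic method discussed in the papers.

```python
import bisect
import random

def distribution_sort(items, M, S):
    """Sort `items` by recursive S-way distribution.

    Buckets of at most M items are sorted directly, standing in for an
    internal-memory sort of a bucket that fits in memory.
    """
    if len(items) <= M:
        return sorted(items)
    # Assumption: choose S-1 partitioning elements from a sorted random sample.
    sample = sorted(random.sample(items, min(len(items), 4 * S)))
    step = len(sample) / S
    pivots = [sample[int(i * step)] for i in range(1, S)]
    # Bucket i receives every x with pivots[i-1] <= x < pivots[i].
    buckets = [[] for _ in range(S)]
    for x in items:
        buckets[bisect.bisect_right(pivots, x)].append(x)
    result = []
    for bucket in buckets:
        if len(bucket) == len(items):
            # No progress (for example, all keys equal): sort directly.
            result.extend(sorted(bucket))
        else:
            result.extend(distribution_sort(bucket, M, S))
    return result

data = [random.randrange(1000) for _ in range(200)]
print(distribution_sort(data, M=16, S=4) == sorted(data))
```

Because every item in one bucket precedes every item in the next, concatenating the recursively sorted buckets yields the sorted output without any merging step.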
3.3.2. Finding the partitioning elements
Figure 2: Recursively Distribute Items into S buckets until each bucket contains at most M elements. [2]
When the S-1 partitioning elements split the input into buckets of roughly equal size, the
bucket size decreases from one recursion level to the next by a relative factor of Θ(S), so
there are $O(\log_S n)$ levels of recursion. During each level of recursion, the data is
scanned into internal memory and partitioned into S buckets. Each bucket has a buffer of one
block; when the buffer is full, its block is written to disk and another buffer is used to
store the next set of incoming items for the bucket. The maximum number of buckets is
therefore Θ(M/B) = Θ(m), and thus the number of levels of recursion is $\Theta(\log_m n)$.
3.3.3. Analysis of Distribution Sort
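The I/O complexity of the above algorithm can be described by the following recurrence,
where n = N/B and m = M/B:
$$T(N) = \begin{cases} S \cdot T(N/S) + O(n) & \text{if } N > M \\ O(n) & \text{if } N \le M \end{cases}$$
Therefore,
$$T(N) = O\!\left(n\left(1 + \log_S \frac{n}{m}\right)\right)$$
If $S = \Theta(m)$,
then
$$T(N) = O\!\left(n\left(1 + \log_m \frac{n}{m}\right)\right) = O(n \log_m n)$$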
Thus the number of partitioning elements S is Θ min , .
However, no deterministic algorithm is known that finds m partitioning elements using O(n)
I/O operations, so we instead find $\sqrt{m}$ partitioning elements using O(n) I/O operations.
Now, $n \log_{\sqrt{m}} n = O(n \log_m n)$.
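The change of base behind this step can be written out explicitly. The numeric values in the second line are illustrative assumptions, not figures taken from the papers.

```latex
\[
\log_{\sqrt{m}} n \;=\; \frac{\log n}{\log \sqrt{m}}
               \;=\; \frac{\log n}{\tfrac{1}{2}\log m}
               \;=\; 2\,\log_m n
\]
% Example with assumed values m = 2^{10} and n = 2^{20}:
\[
\log_{\sqrt{m}} n = \log_{32} 2^{20} = 4,
\qquad
\log_m n = \log_{1024} 2^{20} = 2
\]
```

Using $\sqrt{m}$ instead of m partitioning elements therefore only doubles the number of recursion levels, which does not change the asymptotic bound.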
The number of I/O operations needed to implement distribution sort is therefore
$$O\!\left(n \log_{\sqrt{m}} n\right) = O(n \log_m n) = O\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)$$
3.3.4. Algorithm to identify partitioning elements
We now describe an algorithm that identifies partitioning elements (pivots)
deterministically using O(n) I/Os. It applies the selection algorithm, which finds the i-th
element in sorted order.
1. First, the array of given data items is split into pieces of size 5 each.
2. Then we find the median of each piece of 5 elements.
3. We recursively select the median of N/5 selected elements.
4. Distribute the elements into two lists using computed median.
5. Then we recursively select in one of the two lists.
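Steps 2 and 4 are performed in O(N/B) I/Os, while step 5 recurses on at most 7N/10
elements. Thus the cost of identifying a partitioning element is given by the following
recurrence:
$$T(N) = O\!\left(\frac{N}{B}\right) + T\!\left(\frac{N}{5}\right) + T\!\left(\frac{7N}{10}\right) = O\!\left(\frac{N}{B}\right) \text{ I/Os}$$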
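The five steps above can be sketched as an in-memory Python function. The name `select` and the 0-based rank are assumptions for illustration. The sketch distributes into three lists in step 4, not two, so that inputs with duplicate keys still terminate.

```python
def select(items, i):
    """Return the i-th smallest element (0-based) of the non-empty list `items`."""
    if len(items) <= 5:
        return sorted(items)[i]
    # Steps 1-2: split into pieces of size 5 and take the median of each piece.
    medians = []
    for j in range(0, len(items), 5):
        piece = sorted(items[j:j + 5])
        medians.append(piece[len(piece) // 2])
    # Step 3: recursively select the median of the selected medians.
    pivot = select(medians, len(medians) // 2)
    # Step 4: distribute the elements around the computed median.
    lower = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    upper = [x for x in items if x > pivot]
    # Step 5: recurse into the one list that contains rank i.
    if i < len(lower):
        return select(lower, i)
    if i < len(lower) + len(equal):
        return pivot
    return select(upper, i - len(lower) - len(equal))

data = [(k * 37) % 101 for k in range(101)]  # a permutation of 0..100
print(select(data, 50))
```

In the external-memory setting, steps 1, 2, and 4 are plain scans over the data, which is where the O(N/B) term of the recurrence comes from.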
3.4. Merge Sort
The merge sort algorithm for external memory uses the sort-and-merge strategy to sort a
huge data file in external memory. It sorts chunks that fit in main memory and then merges
the sorted chunks into a single larger file. It can thus be divided into two phases: the run
formation phase (sorting chunks) and the merging phase.
3.4.1. Run formation
The n blocks of data are scanned, one memory load at a time. Each memory load, consisting
of m blocks, is sorted into a single run and written out to a series of stripes on the disks.
Thus there are N/M = n/m runs, each sorted and stored in stripes on the disks.
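Run formation can be illustrated with a minimal sketch in which a Python list stands in for the file on disk and M is counted in records. The function name `form_runs` is an assumption, and striping across disks is not modeled.

```python
def form_runs(records, M):
    """Split `records` into memory loads of at most M records and sort each load.

    Each returned list plays the role of one sorted run written back to disk.
    """
    return [sorted(records[i:i + M]) for i in range(0, len(records), M)]

runs = form_runs([10, 9, 8, 7, 6, 5, 4, 3, 2, 1], M=4)
print(runs)  # [[7, 8, 9, 10], [3, 4, 5, 6], [1, 2]]
```

Ten records with M = 4 give three runs, matching the count of N/M runs rounded up.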
3.4.2. Merging Phase
Figure 3: Merge Phase [2]
After the initial runs are formed, the merging phase begins, in which groups of R runs are
merged. For each merge, the R runs are scanned and merged in an online manner as they
stream through internal memory. With double buffering to overlap I/O and computation, at
most R = Θ(m) runs can be merged at a time, and hence the number of passes needed to merge
all the runs is $O(\log_m n)$ [2].
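A merging phase of order R can be sketched with the standard library's `heapq.merge`, which lazily merges already-sorted inputs. The names `merge_pass` and `merge_all` are hypothetical, lists again stand in for runs on disk, and double buffering is not modeled.

```python
import heapq

def merge_pass(runs, R):
    """One pass: merge each group of R sorted runs into a single longer run."""
    return [list(heapq.merge(*runs[i:i + R])) for i in range(0, len(runs), R)]

def merge_all(runs, R):
    """Repeat merge passes until one run remains; return (sorted_run, passes)."""
    if R < 2:
        raise ValueError("merge order R must be at least 2")
    passes = 0
    while len(runs) > 1:
        runs = merge_pass(runs, R)
        passes += 1
    return (runs[0] if runs else []), passes

initial_runs = [[7, 8, 9, 10], [3, 4, 5, 6], [1, 2]]
print(merge_all(initial_runs, R=2))
```

With three initial runs, R = 2 needs two passes while R = 3 needs only one, which shows why a larger merge order reduces the number of passes over the data.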
3.4.3. Analysis of External Merge-Sort algorithm
Figure 4: I/O complexity of External merge-sort algorithm [4]
The run formation phase, which creates N/M = n/m memory-sized sorted lists, takes O(N/B)
I/O operations.
During the merging phase, groups of R runs are merged together repeatedly. As seen in the
I/O complexity diagram (Figure 4), this forms a recursion tree with the N/M runs at the
leaves. Since M/B runs are merged at every level, the height of the tree is
$\log_{M/B}(N/M)$, and each level costs O(N/B) I/Os. The I/O complexity of the merging
phase therefore becomes
$$O\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{M}\right), \text{ which is } O\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)$$
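The pass count behind this bound can be checked numerically with a small sketch. It assumes a single disk, a merge order of exactly R = m (ignoring the buffers reserved for output and double buffering), and that run formation and every merge pass each read and write all n blocks once. The function name `merge_sort_io` is hypothetical.

```python
def merge_sort_io(n, m, R=None):
    """Return (passes, ios) for external merge sort on n blocks with m memory blocks.

    Assumptions: single disk, merge order R (default m), and 2n block
    transfers for run formation and for every merge pass.
    """
    R = m if R is None else R
    if m < 1 or R < 2:
        raise ValueError("need m >= 1 and R >= 2")
    runs = -(-n // m)          # ceil(n / m) initial runs
    passes = 0
    while runs > 1:
        runs = -(-runs // R)   # each pass merges groups of R runs
        passes += 1
    ios = 2 * n * (1 + passes)
    return passes, ios

print(merge_sort_io(10**6, 10**3))  # (1, 4000000)
print(merge_sort_io(10**6, 100))    # (2, 6000000)
```

Shrinking memory from 1000 to 100 blocks adds one merge pass, in line with the logarithmic dependence on M/B.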
4. Comparison
The first paper focuses on establishing tight upper and lower bounds on the number of
I/Os between internal and external memory required for five sorting-related problems:
sorting, the Fast Fourier Transform (FFT), permutation networks, permuting, and matrix
transposition. The second paper, on the other hand, focuses on the many variants of
distribution sort and merge sort and on their improvements under the parallel disk model
(PDM) with multiple disks. It also analyzes the effects of partial and complete striping on
the external sorting algorithms.
4.1. Improvements in External Sorting Algorithms
4.1.1. Load Balancing across multiple disks in Distribution Sort:
To meet the I/O bound of the theorem in Section 3.1, the buckets at each recursion level
must be formed using O(n/D) I/Os. This is easy for a single disk (D = 1). With multiple
disks, each input I/O step and each output I/O step during the bucket formation stage of
distribution sort must involve, on average, Θ(D) blocks. So the challenge in distribution
sort is to output the blocks of the buckets to the disks in an online manner and achieve a
global load balance by the end of the partitioning, so that each bucket can be input
efficiently during the next level of the recursion.
Partial striping can also be effective in reducing the amount of information that must be
stored in internal memory [2]. Here the disks are grouped into clusters of size C, and data
is output in logical blocks of size CB. Choosing C = √D changes the sorting time by no more
than a constant factor, whereas full striping (C = D) can give a non-optimal I/O bound.
4.1.2. Randomized Cycling Distribution Sort
With the standard striping method, the blocks that belong to a given stripe belong to
multiple buckets, and the buckets themselves are not striped across the disks. Thus, in the
output phase, each bucket must keep track of the last block output to each disk so that the
blocks for the bucket can be linked together, which is an overhead. A better approach is to
stripe the contents of each bucket across the disks, so that input operations can be done in
a striped manner, while during each output step multiple buckets transmit to multiple
stripes [2].
The basic loop of the distribution sort algorithm is the same as before: stream the data
items through internal memory and partition them into S buckets. The blocks of each
individual bucket reside on the disks in stripes, so each block has a predefined disk to which
it must be output. If the disks are ordered for striping by the normal round-robin method,
the blocks of different buckets can collide. To solve this problem, randomized cycling is
used instead of round-robin. For each of the S buckets, the ordering of the disks in the
stripe for that bucket is determined by a random permutation of {1,2,…,D}, chosen
independently of the permutations of the other S-1 buckets [2]. The resulting sorting
algorithm, called randomized cycling distribution sort (RCD), achieves the optimal sorting
bound of the theorem in Section 3.1 with extremely small constant factors.
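The disk assignment used by randomized cycling can be sketched as follows. This covers only the choice of disk for each block, not the buffering or the I/O scheduling of the full algorithm. The function names and the optional `seed` parameter are assumptions, and disks are numbered from 0 here instead of from 1.

```python
import random

def rcd_disk_orders(S, D, seed=None):
    """Draw one independent random permutation of the disks 0..D-1 per bucket."""
    rng = random.Random(seed)
    orders = []
    for _ in range(S):
        perm = list(range(D))
        rng.shuffle(perm)
        orders.append(perm)
    return orders

def disk_for_block(orders, bucket, k):
    """Disk that receives the k-th block (0-based) of `bucket`.

    The bucket cycles through its own permutation, so any D consecutive
    blocks of one bucket land on D distinct disks.
    """
    perm = orders[bucket]
    return perm[k % len(perm)]

orders = rcd_disk_orders(S=3, D=4, seed=1)
print([disk_for_block(orders, 0, k) for k in range(8)])
```

Each bucket is still striped, so it can be read back with all disks busy, while the independent permutations make it unlikely that many buckets queue blocks for the same disk at once.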
4.1.3. Simple randomized merge sort
The striping technique can also be used in merge sort to stripe runs across multiple disks.
However, disk striping uses too much internal memory to cache blocks not yet merged, so
the effective order of the merge is reduced to R = Θ(m/D), which gives a non-optimal result.
Simple randomized merge sort uses much less internal memory for caching blocks and thus
allows the merge order R to be much larger. Each run is striped across the disks, but with
a random starting point. During the merging process, the next block needed from each disk is
input into internal memory, and if the memory is full, the least needed blocks are flushed
back to disk. Simple randomized merge sort is not optimal for some parameter values, but it
outperforms disk striping.
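The layout step alone, striping a run across the disks from a random starting disk, can be sketched as below. The merging logic and the flushing policy are not modeled. The name `stripe_run` and the 0-based disk numbering are assumptions.

```python
import random

def stripe_run(num_blocks, D, rng=random):
    """Return the disk index for each block of one run.

    The run is striped cyclically across the D disks, starting at a disk
    chosen uniformly at random.
    """
    start = rng.randrange(D)
    return [(start + k) % D for k in range(num_blocks)]

rng = random.Random(0)
for run_id in range(3):
    print(run_id, stripe_run(num_blocks=6, D=4, rng=rng))
```

Because each run draws its own starting disk, the next blocks needed from different runs tend to be spread over different disks.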
5. Conclusion
Although the two EM sorting algorithms, distribution sort and merge sort, have nearly the
same I/O complexity, they differ in the way they work. Distribution sort works by
partitioning the unsorted values into smaller "buckets" that can be sorted in main memory,
while merge sort works by merging sorted sub-lists. The earlier paper, by Aggarwal and
Vitter (1988), shows that the standard merge sort algorithm is an optimal external sorting
method, while the later paper (2006) points out that distribution sort algorithms have an
advantage over merge approaches in that they typically make better use of the lower levels
of cache in the memory hierarchy of real systems. Merge approaches, however, can take
advantage of replacement selection to start off with larger run sizes.
References
1. A. Aggarwal and J. S. Vitter, "The Input/Output Complexity of Sorting and Related
Problems", Communications of the ACM, vol. 31, no. 9, pp. 1116–1127, 1988.
2. J. S. Vitter, "Algorithms and Data Structures for External Memory", Foundations and
Trends® in Theoretical Computer Science, vol. 2, no. 4, pp. 305–474, 2006.
3. N. Sitchinava, "EM Model", lecture, Algorithms for Memory Hierarchies, October 17, 2012.
4. L. Arge, "I/O-Algorithms", lecture, Spring 2009, Aarhus, January 2009.

Mais conteúdo relacionado

Mais procurados

Introductiont To Aray,Tree,Stack, Queue
Introductiont To Aray,Tree,Stack, QueueIntroductiont To Aray,Tree,Stack, Queue
Introductiont To Aray,Tree,Stack, QueueGhaffar Khan
 
Algorithms
AlgorithmsAlgorithms
AlgorithmsDevMix
 
Data Structure # vpmp polytechnic
Data Structure # vpmp polytechnicData Structure # vpmp polytechnic
Data Structure # vpmp polytechniclavparmar007
 
Review on Sorting Algorithms A Comparative Study
Review on Sorting Algorithms A Comparative StudyReview on Sorting Algorithms A Comparative Study
Review on Sorting Algorithms A Comparative StudyCSCJournals
 
Data structures and algorithm analysis in java
Data structures and algorithm analysis in javaData structures and algorithm analysis in java
Data structures and algorithm analysis in javaMuhammad Aleem Siddiqui
 
23. Advanced Datatypes and New Application in DBMS
23. Advanced Datatypes and New Application in DBMS23. Advanced Datatypes and New Application in DBMS
23. Advanced Datatypes and New Application in DBMSkoolkampus
 
Abstract data types
Abstract data typesAbstract data types
Abstract data typesHoang Nguyen
 
Data structure
Data structureData structure
Data structureMohd Arif
 
Data structure lecture 1
Data structure lecture 1Data structure lecture 1
Data structure lecture 1Kumar
 
8 query processing and optimization
8 query processing and optimization8 query processing and optimization
8 query processing and optimizationKumar
 
Presentation on Data Structure
Presentation on Data StructurePresentation on Data Structure
Presentation on Data StructureA. N. M. Jubaer
 
data structure
data structuredata structure
data structurehashim102
 

Mais procurados (19)

1816 1819
1816 18191816 1819
1816 1819
 
Data Structures & Algorithms
Data Structures & AlgorithmsData Structures & Algorithms
Data Structures & Algorithms
 
Introductiont To Aray,Tree,Stack, Queue
Introductiont To Aray,Tree,Stack, QueueIntroductiont To Aray,Tree,Stack, Queue
Introductiont To Aray,Tree,Stack, Queue
 
Algorithms
AlgorithmsAlgorithms
Algorithms
 
Data Structure # vpmp polytechnic
Data Structure # vpmp polytechnicData Structure # vpmp polytechnic
Data Structure # vpmp polytechnic
 
Introduction to data structure and algorithms
Introduction to data structure and algorithmsIntroduction to data structure and algorithms
Introduction to data structure and algorithms
 
Data Structures
Data StructuresData Structures
Data Structures
 
Review on Sorting Algorithms A Comparative Study
Review on Sorting Algorithms A Comparative StudyReview on Sorting Algorithms A Comparative Study
Review on Sorting Algorithms A Comparative Study
 
ch13
ch13ch13
ch13
 
Data structures and algorithm analysis in java
Data structures and algorithm analysis in javaData structures and algorithm analysis in java
Data structures and algorithm analysis in java
 
23. Advanced Datatypes and New Application in DBMS
23. Advanced Datatypes and New Application in DBMS23. Advanced Datatypes and New Application in DBMS
23. Advanced Datatypes and New Application in DBMS
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
 
Data structure
Data structureData structure
Data structure
 
Data structure lecture 1
Data structure lecture 1Data structure lecture 1
Data structure lecture 1
 
8 query processing and optimization
8 query processing and optimization8 query processing and optimization
8 query processing and optimization
 
Data structure
Data structureData structure
Data structure
 
Data struters
Data strutersData struters
Data struters
 
Presentation on Data Structure
Presentation on Data StructurePresentation on Data Structure
Presentation on Data Structure
 
data structure
data structuredata structure
data structure
 

Destaque

3.9 external sorting
3.9 external sorting3.9 external sorting
3.9 external sortingKrish_ver2
 
Introduction to datastructure and algorithm
Introduction to datastructure and algorithmIntroduction to datastructure and algorithm
Introduction to datastructure and algorithmPratik Mota
 
Sorting Algorithm
Sorting AlgorithmSorting Algorithm
Sorting AlgorithmAl Amin
 
Overview of physical storage media
Overview of physical storage mediaOverview of physical storage media
Overview of physical storage mediaSrinath Sri
 
Introduction of data structure
Introduction of data structureIntroduction of data structure
Introduction of data structureeShikshak
 
8 queens problem using back tracking
8 queens problem using back tracking8 queens problem using back tracking
8 queens problem using back trackingTech_MX
 
Time and space complexity
Time and space complexityTime and space complexity
Time and space complexityAnkit Katiyar
 
Quick Sort , Merge Sort , Heap Sort
Quick Sort , Merge Sort ,  Heap SortQuick Sort , Merge Sort ,  Heap Sort
Quick Sort , Merge Sort , Heap SortMohammed Hussein
 

Destaque (14)

3.9 external sorting
3.9 external sorting3.9 external sorting
3.9 external sorting
 
Merging
MergingMerging
Merging
 
Introduction to datastructure and algorithm
Introduction to datastructure and algorithmIntroduction to datastructure and algorithm
Introduction to datastructure and algorithm
 
Merge sort
Merge sortMerge sort
Merge sort
 
Sorting Algorithm
Sorting AlgorithmSorting Algorithm
Sorting Algorithm
 
Overview of physical storage media
Overview of physical storage mediaOverview of physical storage media
Overview of physical storage media
 
3.8 quicksort
3.8 quicksort3.8 quicksort
3.8 quicksort
 
Mergesort
MergesortMergesort
Mergesort
 
Introduction of data structure
Introduction of data structureIntroduction of data structure
Introduction of data structure
 
8 queens problem using back tracking
8 queens problem using back tracking8 queens problem using back tracking
8 queens problem using back tracking
 
Time and space complexity
Time and space complexityTime and space complexity
Time and space complexity
 
Sorting algorithms
Sorting algorithmsSorting algorithms
Sorting algorithms
 
Quick Sort , Merge Sort , Heap Sort
Quick Sort , Merge Sort ,  Heap SortQuick Sort , Merge Sort ,  Heap Sort
Quick Sort , Merge Sort , Heap Sort
 
Merge sort
Merge sortMerge sort
Merge sort
 

Semelhante a Algorithms for External Memory Sorting

Query Processing, Query Optimization and Transaction
Query Processing, Query Optimization and TransactionQuery Processing, Query Optimization and Transaction
Query Processing, Query Optimization and TransactionPrabu U
 
Introduction to data structure
Introduction to data structureIntroduction to data structure
Introduction to data structuresunilchute1
 
PROPOSAL OF A TWO WAY SORTING ALGORITHM AND PERFORMANCE COMPARISON WITH EXIST...
PROPOSAL OF A TWO WAY SORTING ALGORITHM AND PERFORMANCE COMPARISON WITH EXIST...PROPOSAL OF A TWO WAY SORTING ALGORITHM AND PERFORMANCE COMPARISON WITH EXIST...
PROPOSAL OF A TWO WAY SORTING ALGORITHM AND PERFORMANCE COMPARISON WITH EXIST...IJCSEA Journal
 
Buffer Trees - Utility and Applications for External Memory Data Processing
Buffer Trees - Utility and Applications for External Memory Data ProcessingBuffer Trees - Utility and Applications for External Memory Data Processing
Buffer Trees - Utility and Applications for External Memory Data ProcessingMilind Gokhale
 
Query optimization
Query optimizationQuery optimization
Query optimizationPooja Dixit
 
data stage-material
data stage-materialdata stage-material
data stage-materialRajesh Kv
 
Data and File Structure Lecture Notes
Data and File Structure Lecture NotesData and File Structure Lecture Notes
Data and File Structure Lecture NotesFellowBuddy.com
 
DataSructure-Time and Space Complexity.pptx
DataSructure-Time and Space Complexity.pptxDataSructure-Time and Space Complexity.pptx
DataSructure-Time and Space Complexity.pptxLakshmiSamivel
 
Samsung DeepSort
Samsung DeepSortSamsung DeepSort
Samsung DeepSortRyo Jin
 
DS-UNIT 1 FINAL (2).pptx
DS-UNIT 1 FINAL (2).pptxDS-UNIT 1 FINAL (2).pptx
DS-UNIT 1 FINAL (2).pptxprakashvs7
 
Data Structures_Introduction
Data Structures_IntroductionData Structures_Introduction
Data Structures_IntroductionThenmozhiK5
 
Matrix transposition
Matrix transpositionMatrix transposition
Matrix transposition동호 이
 
b,Sc it data structure.pptx
b,Sc it data structure.pptxb,Sc it data structure.pptx
b,Sc it data structure.pptxclassall
 
b,Sc it data structure.pptx
b,Sc it data structure.pptxb,Sc it data structure.pptx
b,Sc it data structure.pptxclassall
 
b,Sc it data structure.ppt
b,Sc it data structure.pptb,Sc it data structure.ppt
b,Sc it data structure.pptclassall
 
Study on Sorting Algorithm and Position Determining Sort
Study on Sorting Algorithm and Position Determining SortStudy on Sorting Algorithm and Position Determining Sort
Study on Sorting Algorithm and Position Determining SortIRJET Journal
 
Ba2419551957
Ba2419551957Ba2419551957
Ba2419551957IJMER
 
Introduction to Data Structures, Data Structures using C.pptx
Introduction to Data Structures, Data Structures using C.pptxIntroduction to Data Structures, Data Structures using C.pptx
Introduction to Data Structures, Data Structures using C.pptxpoongothai11
 

Semelhante a Algorithms for External Memory Sorting (20)

Query Processing, Query Optimization and Transaction
Query Processing, Query Optimization and TransactionQuery Processing, Query Optimization and Transaction
Query Processing, Query Optimization and Transaction
 
Binary Sort
Binary SortBinary Sort
Binary Sort
 
Introduction to data structure
Introduction to data structureIntroduction to data structure
Introduction to data structure
 
PROPOSAL OF A TWO WAY SORTING ALGORITHM AND PERFORMANCE COMPARISON WITH EXIST...
PROPOSAL OF A TWO WAY SORTING ALGORITHM AND PERFORMANCE COMPARISON WITH EXIST...PROPOSAL OF A TWO WAY SORTING ALGORITHM AND PERFORMANCE COMPARISON WITH EXIST...
PROPOSAL OF A TWO WAY SORTING ALGORITHM AND PERFORMANCE COMPARISON WITH EXIST...
 
Buffer Trees - Utility and Applications for External Memory Data Processing
Buffer Trees - Utility and Applications for External Memory Data ProcessingBuffer Trees - Utility and Applications for External Memory Data Processing
Buffer Trees - Utility and Applications for External Memory Data Processing
 
Query optimization
Query optimizationQuery optimization
Query optimization
 
data stage-material
data stage-materialdata stage-material
data stage-material
 
Data and File Structure Lecture Notes
Data and File Structure Lecture NotesData and File Structure Lecture Notes
Data and File Structure Lecture Notes
 
DataSructure-Time and Space Complexity.pptx
DataSructure-Time and Space Complexity.pptxDataSructure-Time and Space Complexity.pptx
DataSructure-Time and Space Complexity.pptx
 
Samsung DeepSort
Samsung DeepSortSamsung DeepSort
Samsung DeepSort
 
DS-UNIT 1 FINAL (2).pptx
DS-UNIT 1 FINAL (2).pptxDS-UNIT 1 FINAL (2).pptx
DS-UNIT 1 FINAL (2).pptx
 
Data Structures_Introduction
Data Structures_IntroductionData Structures_Introduction
Data Structures_Introduction
 
Matrix transposition
Matrix transpositionMatrix transposition
Matrix transposition
 
b,Sc it data structure.pptx
b,Sc it data structure.pptxb,Sc it data structure.pptx
b,Sc it data structure.pptx
 
b,Sc it data structure.pptx
b,Sc it data structure.pptxb,Sc it data structure.pptx
b,Sc it data structure.pptx
 
b,Sc it data structure.ppt
b,Sc it data structure.pptb,Sc it data structure.ppt
b,Sc it data structure.ppt
 
Study on Sorting Algorithm and Position Determining Sort
Study on Sorting Algorithm and Position Determining SortStudy on Sorting Algorithm and Position Determining Sort
Study on Sorting Algorithm and Position Determining Sort
 
Ba2419551957
Ba2419551957Ba2419551957
Ba2419551957
 
Data Structures 7
Data Structures 7Data Structures 7
Data Structures 7
 
Introduction to Data Structures, Data Structures using C.pptx
Introduction to Data Structures, Data Structures using C.pptxIntroduction to Data Structures, Data Structures using C.pptx
Introduction to Data Structures, Data Structures using C.pptx
 

Mais de Milind Gokhale

Yelp Dataset Challenge 2015
Yelp Dataset Challenge 2015Yelp Dataset Challenge 2015
Yelp Dataset Challenge 2015Milind Gokhale
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemMilind Gokhale
 
Technology Survey and Design
Technology Survey and DesignTechnology Survey and Design
Technology Survey and DesignMilind Gokhale
 
Epics and User Stories
Epics and User StoriesEpics and User Stories
Epics and User StoriesMilind Gokhale
 
Aloha Social Networking Portal - SRS
Aloha Social Networking Portal - SRSAloha Social Networking Portal - SRS
Aloha Social Networking Portal - SRSMilind Gokhale
 
Aloha Social Networking Portal - Design Document
Aloha Social Networking Portal - Design DocumentAloha Social Networking Portal - Design Document
Aloha Social Networking Portal - Design DocumentMilind Gokhale
 
Android games analysis final presentation
Android games analysis final presentationAndroid games analysis final presentation
Android games analysis final presentationMilind Gokhale
 
Android gamesanalysis hunger-gamesfinal
Android gamesanalysis hunger-gamesfinalAndroid gamesanalysis hunger-gamesfinal
Android gamesanalysis hunger-gamesfinalMilind Gokhale
 
Building effective teams in Amdocs-TECC - project report
Building effective teams in Amdocs-TECC - project reportBuilding effective teams in Amdocs-TECC - project report
Building effective teams in Amdocs-TECC - project reportMilind Gokhale
 
Building effective teams in Amdocs TECC - Presentation
Building effective teams in Amdocs TECC - PresentationBuilding effective teams in Amdocs TECC - Presentation
Building effective teams in Amdocs TECC - PresentationMilind Gokhale
 
Internet marketing report
Internet marketing reportInternet marketing report
Internet marketing reportMilind Gokhale
 
Change: to be or not to be
Change: to be or not to beChange: to be or not to be
Change: to be or not to beMilind Gokhale
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree LearningMilind Gokhale
 

Mais de Milind Gokhale (20)

Yelp Dataset Challenge 2015
Yelp Dataset Challenge 2015Yelp Dataset Challenge 2015
Yelp Dataset Challenge 2015
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation System
 
Sprint Plan1
Sprint Plan1Sprint Plan1
Sprint Plan1
 
Technology Survey and Design
Technology Survey and DesignTechnology Survey and Design
Technology Survey and Design
 
Market Survey Report
Market Survey ReportMarket Survey Report
Market Survey Report
 
Epics and User Stories
Epics and User StoriesEpics and User Stories
Epics and User Stories
 
Visualforce
VisualforceVisualforce
Visualforce
 
Aloha Social Networking Portal - SRS
Aloha Social Networking Portal - SRSAloha Social Networking Portal - SRS
Aloha Social Networking Portal - SRS
 
Aloha Social Networking Portal - Design Document
Aloha Social Networking Portal - Design DocumentAloha Social Networking Portal - Design Document
Aloha Social Networking Portal - Design Document
 
Wsd final paper
Wsd final paperWsd final paper
Wsd final paper
 
Android games analysis final presentation
Android games analysis final presentationAndroid games analysis final presentation
Android games analysis final presentation
 
Android gamesanalysis hunger-gamesfinal
Android gamesanalysis hunger-gamesfinalAndroid gamesanalysis hunger-gamesfinal
Android gamesanalysis hunger-gamesfinal
 
One sample runs test
One sample runs testOne sample runs test
One sample runs test
 
Building effective teams in Amdocs-TECC - project report
Building effective teams in Amdocs-TECC - project reportBuilding effective teams in Amdocs-TECC - project report
Building effective teams in Amdocs-TECC - project report
 
Building effective teams in Amdocs TECC - Presentation
Building effective teams in Amdocs TECC - PresentationBuilding effective teams in Amdocs TECC - Presentation
Building effective teams in Amdocs TECC - Presentation
 
Internet marketing report
Internet marketing reportInternet marketing report
Internet marketing report
 
Internet marketing
Internet marketingInternet marketing
Internet marketing
 
Indian it industry
Indian it industryIndian it industry
Indian it industry
 
Change: to be or not to be
Change: to be or not to beChange: to be or not to be
Change: to be or not to be
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
