Project 3: IO-efficient sorting

Implement k-way mergesort as discussed in class to sort files of integers. The input to your program should be a file of integers, and the output a file of sorted integers.

I will provide test files in binary form (here). Your output should be a binary file.

The fan-out k should be given as a command-line argument. In class we discussed that k should be chosen on the order of M/B. Experiment with different values of k and try to find the one that performs best. To decide the size of a run, you will need to know the size of main memory. The simplest way is to let the user provide this information as a command-line argument.

iosort -i filename -o filename -k value -m value
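To get a feel for how the -k and -m values interact, the following Python sketch estimates the run size, the number of initial runs, and the number of merge passes. It is only an illustration: the function name plan_sort is hypothetical, and the 4-byte integer width is an assumption, since the assignment does not state the integer size of the test files.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def plan_sort(n_items, mem_bytes, k, item_bytes=4):
    """Return (items_per_run, n_runs, n_merge_passes) for a k-way mergesort.

    item_bytes=4 is an assumption; the assignment does not fix the width.
    """
    if k < 2:
        raise ValueError("fan-out k must be at least 2")
    # One run is as many items as fit in main memory at once.
    items_per_run = max(1, mem_bytes // item_bytes)
    n_runs = ceil_div(n_items, items_per_run)
    # Each merge pass combines up to k runs into one.
    passes = 0
    runs = n_runs
    while runs > 1:
        runs = ceil_div(runs, k)
        passes += 1
    return items_per_run, n_runs, passes
```

For example, with 2^30 items, 256MB of memory, and k = 8, this gives 16 initial runs and 2 merge passes; raising k to 16 brings it down to a single pass.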

Try to optimize efficiency as much as you can both with respect to IO and CPU.
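One common way to keep the IO cost down is to read and write in large blocks rather than one integer at a time. The Python sketch below shows this for run formation, the first phase of the mergesort: it reads a memory-sized chunk, sorts it, and writes it out as a run file. The function name form_runs is hypothetical, and the typecode "i" (a native C int, typically 4 bytes) is an assumption about the file format, not something the assignment specifies.

```python
import os
import tempfile
from array import array


def form_runs(input_path, scratch_dir, items_per_run, typecode="i"):
    """Split a binary file of integers into sorted run files.

    Returns the list of run file paths, in the order they were written.
    typecode="i" (native C int) is an assumption about the input format.
    """
    item_bytes = array(typecode).itemsize
    run_paths = []
    with open(input_path, "rb") as src:
        while True:
            # Read one memory-sized chunk in a single large request.
            chunk = src.read(items_per_run * item_bytes)
            if not chunk:
                break
            # Ignore a trailing partial item, if the file has one.
            usable = len(chunk) - len(chunk) % item_bytes
            run = array(typecode)
            run.frombytes(chunk[:usable])
            run = array(typecode, sorted(run))
            fd, path = tempfile.mkstemp(suffix=".run", dir=scratch_dir)
            with os.fdopen(fd, "wb") as dst:
                run.tofile(dst)
            run_paths.append(path)
    return run_paths
```

The run files are then the inputs to the merge phase.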

Setup

Run your code on one of the Linux machines. For timings, use the grid.

If you need to use temporary files, place them in /tmp/scratch. This is a hard disk that's local to the machine, and thus not on NFS. The location of the scratch space may differ slightly from one machine to another, so make it a parameter on the command line.

iosort -i filename -o filename -k value -m value -s scratchlocation
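A minimal sketch of parsing this command line in Python with the standard argparse module is shown below. The attribute names (input, output, k, mem, scratch) are illustrative choices. Two things are assumptions here rather than part of the assignment: the -m value is passed through as a plain integer without fixing its unit, and -s falls back to /tmp/scratch when omitted.

```python
import argparse


def parse_args(argv=None):
    """Parse the iosort command line; argv=None means use sys.argv."""
    p = argparse.ArgumentParser(prog="iosort")
    p.add_argument("-i", dest="input", required=True, help="input file")
    p.add_argument("-o", dest="output", required=True, help="output file")
    p.add_argument("-k", dest="k", type=int, required=True,
                   help="merge fan-out")
    # The unit of -m is not specified by the assignment (assumption: the
    # caller and the program agree on one, e.g. bytes or megabytes).
    p.add_argument("-m", dest="mem", type=int, required=True,
                   help="main memory size")
    # Defaulting to /tmp/scratch is an assumption of this sketch.
    p.add_argument("-s", dest="scratch", default="/tmp/scratch",
                   help="directory for temporary files")
    args = p.parse_args(argv)
    if args.k < 2:
        p.error("-k must be at least 2")
    if args.mem <= 0:
        p.error("-m must be positive")
    return args
```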

Run your experiments with 256MB of memory and datasets of various sizes. Focus on data sizes that show the IO-efficiency of your sort (that is, don't run lots of experiments for small datasets that fit in memory).

It is not clear in advance exactly how long any of the experiments will take. There is no need to let an algorithm run for more than a day just to verify that it takes a long time. Write scripts. Let experiments run overnight while you work on other things. Keep me updated on your progress, so that we can adapt the schedule if necessary.

Add a timer to your code to measure the total running time. Include the time to read the input file and to write the output file.
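As a rough sketch of such a timer in Python, the helper below wraps a whole call and reports both wall-clock time and CPU time. The helper name timed is hypothetical. Reporting CPU time separately is an addition that may help with the CPU versus IO analysis; the total running time asked for above is the wall-clock figure.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real time, includes IO waits
    cpu_start = time.process_time()    # CPU time used by this process only
    result = fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu
```

Wrapping the entire sort, from opening the input file to closing the output file, ensures the reading and writing time is included.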

Here is the code for a heap, which you may need to use (if you do, you'll need to adapt it to your problem): pqueue
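To illustrate where a heap fits in, here is a small Python sketch of a k-way merge. It uses Python's standard heapq module in place of the provided pqueue code, and it merges in-memory sequences rather than run files, so treat it as an outline of the idea and not as the expected implementation. The name kway_merge is hypothetical.

```python
import heapq


def kway_merge(runs):
    """Yield the integers from the sorted iterables in `runs` in order."""
    iters = [iter(r) for r in runs]
    heap = []
    # Seed the heap with the first item of every non-empty run.
    for idx, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, idx))
    heapq.heapify(heap)
    while heap:
        value, idx = heap[0]           # smallest remaining item overall
        yield value
        nxt = next(iters[idx], None)   # refill from the same run
        if nxt is None:
            heapq.heappop(heap)        # that run is exhausted
        else:
            heapq.heapreplace(heap, (nxt, idx))
```

The heap never holds more than k entries, so each output item costs O(log k) comparisons. For example, list(kway_merge([[1, 4, 7], [2, 5], [], [3, 6, 8]])) yields the integers 1 through 8 in order.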

Hand in

Email me the code so that I can test it. Bring to class a hardcopy of the code, and a paper. The paper should include:

a clear description of your assignment
a description of the algorithm(s) you implemented
an analysis of the running time, both CPU and IO, expressed using Theta-notation.
a description of all the experiments you did and the results.
a discussion of the experimental results
conclusions (about which algorithm works best under which circumstances).

Last modified: Thu Apr 21 15:41:11 EDT 2011