[Bowdoin Computer Science]

CSci 345: When RAM is Not Enough: Computing with Massive Data Sets

Spring 2011

Mon, Wed 11:30 - 12:55

The traditional measure of an algorithm's efficiency -- number of instructions performed -- assumes that all the data fits in main memory. Massive data sets, which are becoming more common, do not fit into main memory, and this has led to a new measure of the efficiency of an algorithm. This measure takes into account disk accesses, which are orders of magnitude slower than main memory accesses and usually dominate the running time. IO-efficient algorithms try to minimize both the number of instructions and the number of disk accesses.
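As a rough illustration of why counting disk accesses matters, the C sketch below counts block transfers in a simple two-level model in which data moves between disk and main memory in blocks of b consecutive items. The function names and the sample numbers are assumptions made for this example and are not taken from the course materials.

```c
#include <assert.h>
#include <stdint.h>

/* Block transfers needed to read n items sequentially when each
 * transfer moves b consecutive items: ceil(n / b). */
uint64_t scan_ios(uint64_t n, uint64_t b)
{
    assert(b > 0);
    return n / b + (n % b != 0);
}

/* Worst-case block transfers for n accesses to unrelated locations
 * with no reuse between accesses: each access may fetch a new block. */
uint64_t random_access_ios(uint64_t n)
{
    return n;
}
```

For instance, with n = 1,000,000 items and b = 1,000 items per block, scan_ios gives 1,000 transfers while random_access_ios gives 1,000,000, so the two access patterns differ by a factor of b even though both touch every item once.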

This class covers basic algorithms and data structures, techniques and paradigms, and applications. It looks at examples where IO-efficient algorithms make a difference in practice, and at the extent of this difference.

The class will consist of lectures, paper readings and presentations, discussions, and programming projects.

The class was developed with the support of NSF award no. 0728780.

Prerequisites: CSci 210 (Data Structures)

Office hours: Mon, Tue 3-4:30pm. For quick questions you can come to my office anytime.

Class Email: csci345 at bowdoin

Class webpage: http://www.bowdoin.edu/~ltoma/teaching/cs345/spring11/. All material will be available from this page throughout the semester. This class does not have a Blackboard site.

Approximate course outline

Week(s)   Topic
1-3       C, Makefiles, Emacs, Linux.
4-5       Project: Experiencing the IO bottleneck.
6         Paging and the VMM in the OS.
7-10      The IO model and IO-efficient algorithms (B-trees, IO-efficient sorting, list ranking, IO-efficient priority queues, IO-efficient flow accumulation, IO-efficient visibility, TBD). Techniques to improve data locality (space-filling curves).
11-14     Projects.
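The outline lists space-filling curves as a technique to improve data locality. As one possible illustration, the C sketch below computes the Z-order (Morton) index of a grid cell by interleaving the bits of its row and column. The outline does not say which curve the class uses, so the choice of Z-order and the names spread_bits and morton_index are assumptions made for this example.

```c
#include <stdint.h>

/* Spread the low 16 bits of x so that they occupy the even bit
 * positions of a 32-bit word (bit i moves to bit 2i). */
static uint32_t spread_bits(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Z-order (Morton) index of cell (row, col): column bits go to the
 * even positions, row bits to the odd positions. */
uint32_t morton_index(uint16_t row, uint16_t col)
{
    return (spread_bits(row) << 1) | spread_bits(col);
}
```

Storing grid cells in order of this index keeps each aligned 2x2, 4x4, 8x8, ... square of cells contiguous on disk, so cells that are close in the grid tend to fall in the same block.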

Here is what happened in class.

  1. Week 1, 2: Intro to C, Makefiles, Linux, Emacs.
    Materials:
  2. Week 3, 4: The IO bottleneck
    Materials:
  3. Week 5, 6, 7: The IO model. IO efficiency. Cache-aware vs. cache-oblivious algorithms.
    Analysis of scanning, random access, quicksort, mergesort, list ranking, matrix transposition.
    Materials:
    Project 1 presentation and papers (first draft): Octavian | Ben | Sam
    Project 2: Matrix layouts and matrix operations. Report and presentations due Monday March 28th.
  4. Week 8, 9, 10, 11: Fundamental IO-efficient algorithms and data structures
  5. Week 12, 13, 14: IO techniques
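Matrix transposition is one of the operations analyzed in weeks 5 to 7. A minimal C sketch of a blocked (cache-aware) version follows; it assumes a square, row-major matrix and non-overlapping source and destination arrays. The function name and the tile parameter are hypothetical and are not taken from the class materials.

```c
#include <assert.h>
#include <stddef.h>

/* Transpose the n x n row-major matrix src into dst, one tile at a time.
 * tile is a tuning parameter (an assumption here): it should be small
 * enough that two tile x tile tiles fit in the faster level of memory.
 * src and dst must not overlap. */
void transpose_blocked(const double *src, double *dst, size_t n, size_t tile)
{
    assert(tile > 0);
    for (size_t i0 = 0; i0 < n; i0 += tile) {
        for (size_t j0 = 0; j0 < n; j0 += tile) {
            /* Clip the tile at the matrix boundary. */
            size_t i_end = (i0 + tile < n) ? i0 + tile : n;
            size_t j_end = (j0 + tile < n) ? j0 + tile : n;
            for (size_t i = i0; i < i_end; i++)
                for (size_t j = j0; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

A naive double loop over the whole matrix reads src row by row but writes dst column by column, so for large n nearly every write can touch a different block. Working tile by tile keeps both the reads and the writes inside a small region that stays in fast memory while it is being processed.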


The final (5/18 at 9am in 224) will consist of one of the problems listed below. You'll have one hour, in class, closed notes, to describe the algorithm and the analysis.