CSci 345: Computing with Massive Data

CSci 345: When RAM is Not Enough: Computing with Massive Data Sets

Spring 2011

Mon, Wed 11:30 - 12:55

The traditional measure of an algorithm's efficiency, the number of instructions performed, assumes that all the data fits in main memory. Massive data sets that do not fit in main memory are becoming more common, and they have led to a new measure of an algorithm's efficiency. This measure takes into account disk accesses, which are orders of magnitude slower than main memory accesses and usually dominate the running time. IO-efficient algorithms try to minimize both the number of instructions and the number of disk accesses.
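To make the disk-access measure concrete, here is a minimal sketch in C that counts block transfers for two access patterns. It assumes the usual simplification that one disk access moves a block of b consecutive items; the function names and the parameters n and b are illustrative and do not come from the course materials.

```c
#include <assert.h>
#include <stdint.h>

/* Disk accesses (block transfers) needed to read n items stored
   contiguously, when each access moves one block of b items: ceil(n / b). */
uint64_t scan_ios(uint64_t n, uint64_t b) {
    assert(b > 0);
    return n / b + (n % b != 0);
}

/* Pessimistic count for n accesses to unrelated locations, assuming no
   useful caching: every access may land in a different block. */
uint64_t random_access_ios(uint64_t n) {
    return n;
}
```

With hypothetical values n = 1,000,000 and b = 1,000, the scan costs 1,000 block transfers while the pessimistic random-access pattern costs 1,000,000, a gap of a factor of b.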

This class covers basic algorithms and data structures, techniques and paradigms, and applications. It looks at examples where IO-efficient algorithms make a difference in practice, and at the extent of this difference. The class will consist of lectures, paper readings and presentations, discussions, and programming projects.

The class was developed with the support of NSF award no. 0728780.

Prerequisites: CSci 210 (Data Structures)

Office hours: Mon, Tue 3-4:30pm. For quick questions you can come to my office anytime.

Class Email: csci345 at bowdoin

Class webpage: http://www.bowdoin.edu/~ltoma/teaching/cs345/spring11/. All material will be available from this page throughout the semester. This class does not have a Blackboard site.

Approximate course outline

Week	Topic
Weeks 1, 2, 3	C, Makefiles, Emacs, Linux.
Weeks 4, 5	Project: Experiencing the IO bottleneck.
Week 6	Paging and the VMM in the OS.
Weeks 7, 8, 9, 10	The IO model and IO-efficient algorithms (B-trees, IO-efficient sorting, list ranking, IO-efficient priority queues, IO-efficient flow accumulation, IO-efficient visibility, TBD). Techniques to improve data locality (space-filling curves).
Weeks 11, 12, 13, 14	Projects.

Here is what happened in class.

Week 1, 2: Intro to C, Makefiles, Linux, Emacs.
Materials:
- Pointers.pdf
- Emacs: Emacs reference card | Emacs quick reference
- The C programming language (wikipedia)
- The C Programming language
- The history of C
- C programming tutorial
- Assignment 1: Implement a first-in first-out doubly linked queue using the following specification: qlistspec.tar. Fill in the functions in lqueue.c so that when you run your program you get this. You may not change lqmain.c. For an elegant linked list implementation, you may want to use a dummy head (this will avoid all special cases of insert and delete).
- Assignment 2: malloc limits
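The dummy-head hint in Assignment 1 can be illustrated with a minimal sketch of a circular doubly linked FIFO queue. This is not the interface from qlistspec.tar: the type and function names (Node, Queue, queue_init, queue_enqueue, queue_dequeue) and the int payload are hypothetical, chosen only to show why the dummy node removes the special cases.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names throughout; the assignment's lqueue.c may differ. */
typedef struct Node {
    int value;
    struct Node *prev, *next;
} Node;

typedef struct {
    Node head;    /* dummy node: head.next is the front, head.prev the back */
    size_t size;
} Queue;

/* The dummy node points to itself when the queue is empty. Because of
   these self-pointers, a Queue must not be copied after initialization. */
void queue_init(Queue *q) {
    assert(q != NULL);
    q->head.next = q->head.prev = &q->head;
    q->size = 0;
}

/* Append at the back. Returns 0 on success, -1 if malloc fails.
   The same four pointer updates work for empty and non-empty queues. */
int queue_enqueue(Queue *q, int value) {
    Node *n = malloc(sizeof *n);
    if (n == NULL) return -1;
    n->value = value;
    n->prev = q->head.prev;
    n->next = &q->head;
    n->prev->next = n;
    n->next->prev = n;
    q->size++;
    return 0;
}

/* Remove from the front into *out. Returns 0 on success, -1 if empty. */
int queue_dequeue(Queue *q, int *out) {
    Node *n = q->head.next;
    if (n == &q->head) return -1;   /* only the dummy is left */
    n->prev->next = n->next;
    n->next->prev = n->prev;
    *out = n->value;
    free(n);
    q->size--;
    return 0;
}
```

Since every real node always has a valid prev and next (possibly the dummy itself), neither insertion nor deletion needs to test for a NULL neighbor or update a separate front or back pointer.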
Week 3, 4: The IO bottleneck
Materials:
- The memory hierarchy
- A Latex template
- Reading: What your computer does while you wait
- Project 1: The IO bottleneck. Report and presentations due Wednesday, February 23rd.
- Reading: Chapter 9 (Virtual Memory) from Operating System Concepts, by Silberschatz, Galvin and Gagne | VM.pdf
Week 5, 6, 7: The IO model. IO efficiency. Cache-aware vs. cache-oblivious algorithms.
Analysis of scanning, random access, quicksort, mergesort, list ranking, and matrix transposition.
Materials:
- Nice slides (Haverkort)
- Cache oblivious algorithms and data structures (Demaine) (chap 1, 2, 3.1 and 3.2)
- Memory hierarchies (Sanders) (up to 1.5)
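As one instance of the mergesort analysis listed above, the following C sketch estimates the number of block transfers for an external mergesort. It assumes the common textbook setup: n items, a memory holding m items, blocks of b items, memory-sized initial runs, and a merge fan-in k with k + 1 blocks fitting in memory. The function and parameter names are illustrative, and the count ignores partially filled blocks inside runs.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t ceil_div(uint64_t a, uint64_t b) {
    return a / b + (a % b != 0);
}

/* Estimated block transfers for external mergesort.
   The first pass sorts memory-sized runs; each later pass merges k runs
   at a time. Every pass reads and writes all the data once. */
uint64_t mergesort_ios(uint64_t n, uint64_t m, uint64_t b, uint64_t k) {
    assert(b > 0 && k >= 2 && k + 1 <= m / b);
    if (n == 0) return 0;
    uint64_t runs = ceil_div(n, m);   /* runs after the run-formation pass */
    uint64_t passes = 1;
    while (runs > 1) {
        runs = ceil_div(runs, k);     /* one merge pass */
        passes++;
    }
    return passes * 2 * ceil_div(n, b);
}
```

With hypothetical values n = 1000, m = 100, and b = 10, a two-way merge (k = 2) takes 5 passes and 1000 transfers, while the largest allowed fan-in (k = 9) takes 3 passes and 600 transfers, which is why IO-efficient sorting merges as many runs at once as memory allows.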
Project 1 presentation and papers (first draft): Octavian | Ben | Sam
Project 2: Matrix layouts and matrix operations. Report and presentations due Monday March 28th.
Week 8, 9, 10, 11: Fundamental IO-efficient algorithms and data structures
- Project 3 (and final): IO-efficient sorting.
- IO-efficient sorting.
- B-trees.
- Cache-oblivious (COB) static search trees (van Emde Boas layout).
- IO-efficient priority queues.
- IO-efficient list ranking.
Week 12, 13, 14: IO techniques
- Application of list ranking: topological order on trees
- Time-forward processing: circuit evaluation and flow on terrains
- Distributed sweeping: segment intersection and visibility

The final (5/18 at 9am in 224) will consist of one of the problems listed below. You'll have one hour, in class, closed notes, to describe the algorithm and the analysis.

COB matrix transposition.
COB matrix multiplication.
IO-efficient sorting.
IO-efficient priority queue.
B-trees.
COB static search trees.
IO-efficient list ranking.
Computing a topological ordering on trees IO-efficiently.
IO-efficient circuit evaluation.
IO-efficient flow accumulation on terrains.