Csci 210 Lab: Small World

(Laura Toma)
(inspired by Sedgewick & Wayne, Stanford and Jeff Forbes, Duke)

Overview

In this project you will investigate the degrees of separation of Hollywood actors, also known as the Kevin Bacon game. As you may know, Kevin Bacon is a prolific actor who has appeared in many movies. We assign Kevin Bacon himself a Kevin-Bacon-number of 0. Any actor (except Kevin Bacon himself) who has starred in a movie with Kevin Bacon has a Kevin-Bacon-number of 1. Any remaining actor who has been in the same cast as an actor whose Kevin-Bacon-number is 1 has a Kevin-Bacon-number of 2, and so on.

For example, Meryl Streep has a Kevin-Bacon-number of 1 because she appeared in The River Wild with Kevin Bacon. Nicole Kidman has a Kevin-Bacon-number of 2 because she did not appear with Kevin Bacon in any movie, but she was in Cold Mountain with Donald Sutherland, and Sutherland appeared in Animal House with Kevin Bacon.

Check out the Wiki page on the six degrees of Kevin Bacon. And check out an online version of this game, The Oracle of Bacon.

Generally speaking, the goal of this lab is to:

Find Kevin-Bacon-numbers: given the name of an actor, find his/her Kevin-Bacon-number and the shortest alternating sequence of actor-movie pairs that leads to Kevin Bacon.
What is the average Kevin-Bacon-number in Hollywood? (For this, we'll ignore the actors, if any, that are not connected to Kevin Bacon.) This gives a measure of how good Kevin Bacon is as "the center of Hollywood".

You may ask, is there anything special about Kevin Bacon? One should be able to compute shortest paths between any two actors, and one should be able to evaluate any actor as a center of Hollywood. Your lab should handle the following:

Find the link from Actor A to Actor B. In other words, find the A-number of B.
Evaluate how good a center Actor A is. In other words, find the average A-number in Hollywood.

Note that the Kevin-Bacon game is a special case of this more general game (where A=Kevin Bacon).

For Kevin Bacon, you'll see that the average KB-number is much smaller than you might expect. This is known as the small-world phenomenon, or six degrees of separation. The concept was discovered in the social sciences in the 1960s and has been researched ever since in many disciplines. Take a few minutes to search for "six degrees of separation" on the Internet; it is a fascinating topic. When it comes to Hollywood, the idea is that even if every actor has a relatively small number of co-actors, there is a relatively short chain of movies/actors separating any two actors from each other. If the theory of six degrees of separation holds for Hollywood, most actors will have a KB-number of 6 or less, and you would expect the average KB-number to be below 6. Checking this theory is one of your tasks for the lab. The other task is to allow the user to evaluate the "centerness" of other actors and, therefore, to find other actors who are more "central" than Kevin Bacon.

Here are various lists of movies and the actors that you'll be using:

cast.06.txt: movies released in 2006 [movies=8780, actors=84236]
cast.00-06.txt: movies released since 2000 [movies=52195, actors=348497]
cast.all.txt: movies [movies=285462, actors=933874]
cast.action.txt: action movies [movies=14938, actors=139861]
cast.rated.txt: popular movies [movies=4527, actors=122406]

Each line gives the name of a movie followed by the cast. Since names have spaces and commas in them, the / character is used as a delimiter.

'Breaker' Morant (1980)/Fitz-Gerald, Lewis/Steele, Rob (I)/Wilson, Frank (II)/Tingwell, Charles 'Bud'/Cassell, Alan (I)/Rodger, Ron/Knez, Bruno/Woodward, Edward/Cisse, Halifa/Quin, Don/Kiefel, Russell/Meagher, Ray/Procanin, Michael/Bernard, Hank/Gray, Ian (I)/Brown, Bryan (I)/Ball, Ray (I)/Mullinar, Rod/Donovan, Terence (I)/Ball, Vincent (I)/Pfitzner, John/Currer, Norman/Thompson, Jack (I)/Nicholls, Jon/Haywood, Chris (I)/Smith, Chris (I)/Mann, Trevor (I)/Henderson, Dick (II)/Lovett, Alan/Bell, Wayne (I)/Waters, John (III)/Osborn, Peter/Peterson, Ron/Cornish, Bridget/Horseman, Sylvia/Seidel, Nellie/West, Barbara/Radford, Elspeth/Reed, Maria/Erskine, Ria/Dick, Judy/Walton, Laurie (I)
'burbs, The (1989)/Gage, Kevin/Hahn, Archie/Feldman, Corey/Gordon, Gale/Drier, Moosie/Theodore, Brother/Katt, Nicky/Miller, Dick (I)/Hanks, Tom/Dern, Bruce/Turner, Arnold F./Howard, Rance/Ducommun, Rick/Danziger, Cory/Ajaye, Franklyn/Scott, Carey/Kramer, Jeffrey (I)/Olsen, Dana (I)/Gains, Courtney/Picardo, Robert/Hays, Gary/Davis, Sonny Carl/Gibson, Henry (I)/Jayne, Billy/Stevenson, Bill (I)/Katz, Phyllis/Vorgan, Gigi/Darbo, Patrika/Schaal, Wendy/French, Leigh/Fisher, Carrie/Benner, Brenda/Newman, Tracy (I)/Stewart, Lynne Marie/Haase, Heather (I)
...
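A minimal sketch of how one such line could be split, assuming the first field is always the movie title and every remaining field is one actor. The class name CastLine and its fields are made up for illustration; they are not part of the lab's required interface.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: holds the movie title and cast parsed from one line.
class CastLine {
    final String movie;
    final List<String> actors;

    CastLine(String movie, List<String> actors) {
        this.movie = movie;
        this.actors = actors;
    }

    // Split the line on '/': field 0 is the movie, fields 1..n-1 are the actors.
    // Commas and spaces inside names are left untouched.
    static CastLine parse(String line) {
        String[] fields = line.split("/");
        List<String> actors = Arrays.asList(fields).subList(1, fields.length);
        return new CastLine(fields[0], actors);
    }
}
```

For instance, CastLine.parse("'burbs, The (1989)/Gage, Kevin/Hanks, Tom") would give the movie "'burbs, The (1989)" and the two actors "Gage, Kevin" and "Hanks, Tom".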

Reading and representing the data

Your first task will be to read the data and load it into memory, in a data structure that will facilitate computing degrees of separation between actors.

You have movies, and you have actors. Actors are linked to the movies that they played in, and the other way around. The mathematical model for such a structure that stores pairwise connections between entities is called a graph.

A graph consists of a set of vertices and a set of edges. Each edge represents a connection between two vertices. A graph represents a network on the set of vertices. Many, many problems in the world can be modeled as graphs, from telephone and computer networks, to transportation networks, to the Internet (websites and links), to social networks, to genetic and neural networks.

Not surprisingly, you'll use graphs to model the movie-actor relationship. The first question is how to model the Hollywood world with a graph:

What should the vertices and edges in this graph be?
Should the vertices be movies, with links between movies if they share a common actor?
Should the vertices be actors, with edges connecting two actors if they both played in the same movie?
Should we have vertices for both movies and actors, and have edges connecting movies to the actors who appear in them?

To decide on a representation you need to understand what exactly you need to do with the graph. Think of the pros and cons of each of the options above. Keep in mind that whatever structure you choose to represent the graph, you have to build it based on one of the text files above.

The second question is how to store the graph. The graph consists of a set of vertices, which you can store as an array/vector, list, or map. For each vertex, you need a list of the edges that are connected to it; you can store these "adjacency lists" as arrays/vectors, lists, or maps.

Once you decide what the graph represents and what data structure you'll use to represent it, you'll start developing a MovieGraph class. This class should be able to construct a movie-graph from a file. Encapsulate all necessary getters and setters, and all basic functionality that you may expect from a class that implements a MovieGraph. For example,

//create an empty movie graph
MovieGraph()

//read graph from the file
MovieGraph(String fname)

//add edge u-v
void addEdge(String u, String v)

//number of vertices
int nV()

//number of edges
int nE()

//return the vertices adjacent to vertex v
Iterable<String> neighbors(String v)

//return the degree of vertex v (degree = nb of neighbors)
int degree(String v)

//is v a vertex in the graph
boolean hasVertex(String v)

//is u-v an edge in the graph
boolean hasEdge(String u, String v)

Include testing functions that allow you to print the vertices and edges in your graph.
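To make the storage question concrete, here is one possible sketch of the adjacency-list idea, assuming a map from each vertex name to the set of its neighbors and undirected edges. The class name SketchGraph is hypothetical, and the choice of HashMap and HashSet is just one of the options mentioned above, not the required design. What the vertices mean (actors, movies, or both) is left open.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: undirected graph stored as vertex -> set of neighbors.
class SketchGraph {
    private final Map<String, Set<String>> adj = new HashMap<>();
    private int edges = 0;

    // Add undirected edge u-v, creating either vertex if it is new.
    void addEdge(String u, String v) {
        if (hasEdge(u, v)) return;                 // ignore duplicate edges
        adj.computeIfAbsent(u, k -> new HashSet<>()).add(v);
        adj.computeIfAbsent(v, k -> new HashSet<>()).add(u);
        edges++;
    }

    int nV() { return adj.size(); }
    int nE() { return edges; }

    // Read-only view of the neighbors; empty if v is not in the graph.
    Set<String> neighbors(String v) {
        return Collections.unmodifiableSet(adj.getOrDefault(v, Collections.emptySet()));
    }

    int degree(String v) { return neighbors(v).size(); }
    boolean hasVertex(String v) { return adj.containsKey(v); }
    boolean hasEdge(String u, String v) { return neighbors(u).contains(v); }
}
```

With a hash map, hasVertex, hasEdge and degree take expected constant time, which matters on the larger cast files. An array or list of vertices would need an extra name-to-index lookup.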

Querying the graph

Once you have created the graph, you want to add the capability to query it with the following two types of questions:

Given an actor, find all the movies he/she played in.
Given a movie, find all the actors who starred in the movie.

Write methods that take a movie or actor as an argument, and print out the result of the query. Give me a way to test this functionality. For example, you could add a test function with a text interface that looks something like this:

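import java.util.Scanner;

void queryMovies() {
    Scanner in = new Scanner(System.in);
    while (true) {
        // ask the user to enter a movie name or Q to exit
        System.out.print("Movie name (Q to exit): ");
        // treat end of input the same as Q
        String movie = in.hasNextLine() ? in.nextLine().trim() : "Q";
        if (movie.equals("Q")) break;
        // call queryMovie on the movie that the user entered
        queryMovie(movie);
    }
}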

Computing Kevin-Bacon numbers. More generally, computing Actor-A-numbers

Given two vertices in a graph, a path is a sequence of edges connecting them. There may be more than one path in a graph connecting two vertices. A shortest path is a path with minimum length among all paths between two vertices; here the length of a path is the number of edges on the path.

Note that to find the Kevin-Bacon-number of an actor X, we need to find the shortest path connecting X to Kevin Bacon. Generally speaking, for an arbitrary actor A, we need to find the shortest path connecting X to A.

Your goal is to write a method that takes two actor names A and B, finds the A-number of B (that is, a shortest path from A to B), and neatly displays the movie-actor chain from B to A. Shortest paths are not necessarily unique; that is, there may be several paths of the same minimum length connecting A to B. In this case, we just want to compute one of them (it does not matter which one).

It turns out that you can compute shortest paths in a graph using a strategy that you have seen while searching: breadth-first search (BFS). Start from the vertex representing the source (actor A); add all its neighbors to a queue. These are all the actors with an A-number of 1. Then add to the queue all neighbors of these neighbors, and so on. It is not hard to see (and we'll argue this in class) that using breadth-first search from A you find the shortest paths from A to all other vertices (that are connected to A).
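The BFS strategy described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the graph is passed in as a plain map from each vertex to a list of its neighbors, the vertices are actors (so the BFS distance is the A-number directly), and the class name BfsDistances is invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of BFS on an actor-actor graph given as adjacency lists.
class BfsDistances {
    // For every vertex reachable from source, returns the number of edges
    // on a shortest path from source. Unreachable vertices are absent.
    static Map<String, Integer> from(Map<String, List<String>> adj, String source) {
        Map<String, Integer> dist = new HashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        dist.put(source, 0);
        queue.add(source);
        while (!queue.isEmpty()) {
            String u = queue.remove();
            for (String v : adj.getOrDefault(u, Collections.emptyList())) {
                if (!dist.containsKey(v)) {        // first time v is seen
                    dist.put(v, dist.get(u) + 1);  // one edge farther than u
                    queue.add(v);
                }
            }
        }
        return dist;
    }
}
```

Because the queue is first-in first-out, all vertices at distance 1 are processed before any at distance 2, and so on, so the first time a vertex is reached is along a shortest path. If the graph instead has vertices for both movies and actors, the distance between two actors counted in edges is twice the A-number.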

Some things to think of:

How to represent a node in the queue while doing BFS. Well, it is a String representing a vertex in a MovieGraph. But you also need to keep track of the actual path of a queue node to the start vertex. At the end, you want to trace back the path to A. Hint: think of how we stored the path out of a maze (we went over this in class).
How do you handle duplicate nodes in the queue? That is, you may want to enqueue a vertex that is already in the queue. Hint: you'll need to mark nodes.
How do you keep track of the cost of a node in the queue to the start node? At the end, you need to print this distance, which is actually the A-number. Hint: you'll need to store the distance of each node.
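One way the hints above could fit together, shown as a rough sketch rather than the expected solution: each vertex remembers the vertex it was first reached from (its parent), and that same map doubles as the "marked" set. The example assumes a graph with vertices for both movies and actors, given as a map of adjacency lists. The class name BfsPath is made up.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: BFS with parent pointers, then trace the path back.
class BfsPath {
    // Returns one shortest path source..target, or an empty list if none exists.
    static List<String> shortestPath(Map<String, List<String>> adj,
                                     String source, String target) {
        Map<String, String> parent = new HashMap<>(); // key present = marked
        Queue<String> queue = new ArrayDeque<>();
        parent.put(source, null);                     // source has no parent
        queue.add(source);
        while (!queue.isEmpty()) {
            String u = queue.remove();
            if (u.equals(target)) break;              // stop early once found
            for (String v : adj.getOrDefault(u, Collections.emptyList())) {
                if (!parent.containsKey(v)) {         // never enqueue twice
                    parent.put(v, u);
                    queue.add(v);
                }
            }
        }
        LinkedList<String> path = new LinkedList<>();
        if (!parent.containsKey(target)) return path; // target not reachable
        for (String x = target; x != null; x = parent.get(x)) {
            path.addFirst(x);                         // walk back to source
        }
        return path;
    }
}
```

In a movie-and-actor graph the returned list alternates actor, movie, actor, and so on, so the A-number is (path.size() - 1) / 2. The distance does not need to be stored separately in this variant because it can be read off the path length.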

Note that there is nothing special about Kevin Bacon, and that the same approach can be used to compute shortest paths between any two actors in Hollywood. You want to make your methods general enough, not customized for Kevin Bacon.

In terms of style, you will probably want to implement computing paths as a separate class. Call it MoviePath. This class has to essentially perform BFS from a given vertex on a given graph and has to store all the necessary data for this as class instance variables. I imagine you will have a couple of methods in MoviePath. First, you'll have a constructor that takes as parameters a MovieGraph and a vertex in this graph and runs BFS from this vertex in the graph. Then you'll have functions that will return the actual path and distance to the source vertex.

Efficiency: One thing to think about is efficiency. Some of the graphs are very large. Note that, to compute a path from A to B, you need to run BFS from A until reaching B. So, one way to compute the average path length from A for all actors is to run this process for each actor B. This is extremely inefficient, and you will not be able to use it on anything but the smallest graph. Instead, think about running BFS from A once until the end (until reaching all nodes that can be reached), and computing all the paths from A at the same time.
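As a rough illustration of the single-pass idea, the sketch below runs one BFS from A and accumulates the sum of distances as vertices are discovered, so the whole computation visits each vertex and edge once. It assumes an actor-actor graph given as a map of adjacency lists, and it assumes the average is taken over the reachable actors other than A itself; both choices, and the class name CenterScore, are assumptions for the example.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: average A-number computed with a single BFS from A.
class CenterScore {
    // Average shortest-path distance from source to every other reachable
    // vertex. Unreachable vertices are ignored. Returns 0.0 if none is reached.
    static double averageDistance(Map<String, List<String>> adj, String source) {
        Map<String, Integer> dist = new HashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        dist.put(source, 0);
        queue.add(source);
        long sum = 0;      // sum of distances of all discovered vertices
        int reached = 0;   // number of discovered vertices, excluding source
        while (!queue.isEmpty()) {
            String u = queue.remove();
            for (String v : adj.getOrDefault(u, Collections.emptyList())) {
                if (dist.containsKey(v)) continue;   // already marked
                int d = dist.get(u) + 1;
                dist.put(v, d);
                sum += d;
                reached++;
                queue.add(v);
            }
        }
        return reached == 0 ? 0.0 : (double) sum / reached;
    }
}
```

Running a separate search per target actor repeats most of this work for every actor B, whereas this version does the work once per source.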

Final comments

This is a long lab. Hopefully it will show you some interesting facts about graphs (and about movies). The interface is open ended, so feel free to shine. However, as usual, you should give the interface the lowest priority---first get your lab to work well (that is, correctly and efficiently), even if it has a (lousy) text interface.

It is due on Wednesday, December 2nd. You can work with one partner. You are strongly encouraged to find a partner. Once you have the background, working with a partner is both fun and challenging.

When you turn in the code, include a brief README file that describes the structure of your code, instructs the user how to run it, and specifies how each team member contributed to the lab.

Since the lab gives you little guidance on how to structure the code, you will find that the amount of time you put into this project is inversely proportional to how clean your design is.

These are some things to think of as you think of how to model the problem. You need to understand that there is not one "right" way to do it. There are easier ways, and there are harder ways. There are more efficient ways, and less efficient ways. There are ways that will be easy to program, and there are ways that will take a lot of effort to make work. YOU are the creator of your world. Understand what it is that your world needs to do, decide how to model your world, keep it consistent, and make it work.

Lessons to learn:

Think before you start! Sketch the layout. Encapsulate the functionality.
Develop incrementally. Write a few lines, compile, test, debug, repeat.
Keep testing and checking.
Performance matters.

Have fun!