How to Read 1,000,000 Books

Crystal Hall

What's it about? The course asks how we read text and what a text is, with specific focus on the text found in books of literary fiction (at scales from one book to several thousand) and on the data available for analysis from the Google Books collection. The big question that frames the course is: should we design computer algorithms to read millions of texts in the same way that we read one text? To help answer that question, students practice the habits of novice literary scholars while also learning basic programming skills in the R language to understand how computers "read" text.

What? The computational aspect of the course uses the R statistical programming language to store text, transform it for more direct (but perhaps not more appropriate) comparison with other books, graph patterns of word usage, run basic statistical analyses on the results, and query large (2.2-terabyte) datasets.
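A minimal sketch of that workflow in base R, not the course's actual code: tokenize a passage, count word frequencies, and graph the most common words. The sample sentence stands in for a full novel, and the regular expression used for word splitting is a deliberately crude illustrative choice.

```r
# Sample passage standing in for the text of a novel (illustrative only).
text <- "The Name of the Rose begins in the year of Our Lord 1327"

# Lowercase the text, then split it into words on runs of non-word characters.
words <- unlist(strsplit(tolower(text), "[^a-z0-9']+"))
words <- words[nchar(words) > 0]          # drop any empty tokens

# Count each word and sort from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 5))

# Graph the pattern of word usage as a bar chart.
barplot(head(freq, 10), las = 2)
```

The same handful of lines scales from one novel to a corpus: wrap the counting step in a loop over files and the per-book frequency tables become the input to the comparisons and statistical tests the course describes.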

How? Students start by reading a physical copy of an Italian mystery novel at the same time that they are learning how to perform basic computational analysis on the text file of that same novel. Discussions of functions in the code happen alongside close readings of the text. For example, one popular transformation of text involves putting the entire document in lowercase. As we learn about that function, I ask students to find a place in the reading that relies on capitalization to make meaning, and other places where the loss of capitalization would allow us to ask new questions about the book. We will apply our guidelines for what to do with one digital text to a corpus of 150 mystery novels, likely revise those guidelines, and then investigate Google Books.

Why? When are the assumptions of computation and digitization compatible or incompatible with the assumptions of my discipline? In literary studies, the explosion of digital editions and collections of books gives us unprecedented access to rare individual texts and massive bodies of textual cultural material. How does large-scale textual analysis relate to (or obscure) traditional “close reading” of texts? What kinds of new literary analysis do the algorithms for text analysis make possible? How could the analysis of tweets, status updates, blogs, and comments benefit from a more nuanced understanding of reading, writing, and the human processes of meaning making that they imply?