DCS 1100: Intro to DCS

Python Assignment 5

50 Points

Due Date

Tuesday, November 15 (11:59pm)

Assignment Type

This is an individual assignment. Sharing or showing your code is not allowed.
For example, do not show your computer screen or dictate code to another student.

Submission Instructions

Complete the activities in this notebook and then submit this notebook file on Blackboard. Do not submit any other file. There's also no need to specify your name.

Part I: Review of Lists and Strings

This part is designed to give you some additional coding practice. The best way to approach the questions in this part is to first work out a solution with pencil and paper and then type it into a cell. Using pencil and paper will also help you on the final exam.

Problem 1 (10 points)

Write a function that takes a list of phone numbers as a parameter and prints only the phone numbers that start with 207. Also, write the code to test your function (that is, write the code to call your function).

Note: you should store the phone numbers as strings in the list.

In [ ]:
#Your solution goes here

Problem 2 (10 points)

Write a function that takes a list of animal names as a parameter and prints only the animal names having four or fewer letters. Also, write the code to test your function.

In [ ]:
#Your solution goes here

Problem 3 (10 points)

Write a function that takes two lists as parameters: (1) a list of animal names and (2) a list of the typical life spans of those animals (in the same order). Return a list of the animals that live more than 10 years. Also, write the code to test your function.

In [ ]:
#Your solution goes here

Part II: Sentiment Analysis

This is a guided project, where the objective is to apply and extend our Python skills to a real-world problem and to see how a complex task is decomposed into smaller pieces and how those pieces are connected together. The new Python elements you'll learn here are tuples and dictionaries.

Sentiment analysis uses natural language processing and text analysis to extract the "sentiment" implicitly embedded in textual data. This is a very broad area where a lot of interesting research is currently taking place. We take a simplistic approach here. We start with a set of tweets labeled pos for positive sentiment and neg for negative. We use the same model as the spam-filtering example shown in class: we find the words associated with positive/negative sentiment, and then, given a test tweet, we classify it as pos or neg using the Naive Bayes classifier. Below are the tasks.

1. Check out the potus.tsv file

Open the potus.tsv file in Microsoft Excel (or any text editor) and take a look at the structure of the file. Each line starts with a tag pos or neg, followed by a tab character and then the tweet text. These tags are the labels determined by previous Bowdoin students. You may disagree with some of the labels, but we'll take this file as our ground truth or training data.
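
For example, a labeled line has the following form, where <TAB> stands for a single tab character (these two sample lines reuse tweets that appear later in this notebook):

pos<TAB>Obama has done a great job
neg<TAB>I am tired of Obama.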

2. Import necessary modules

We'll need the following modules. Note the different ways of importing modules. Run the following cell (and every code cell afterwards).

In [ ]:
import nltk
from codecs import * #This is to account for some strange characters in tweets
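
Depending on your NLTK installation, you may also need a one-time download of the tokenizer and stop-word data used later in this notebook (skip this if the later cells run without complaints):

import nltk
nltk.download('punkt')      #Tokenizer data used by nltk.word_tokenize
nltk.download('stopwords')  #Stop word lists used by nltk.corpus.stopwords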

3. Read the potus.tsv file

The function in the next cell returns the labeled data in the following format.

labeled_data = [('Obama has done a great job', 'pos'),
    ('Thanks Obama for supporting us!', 'pos'),
    ('I like the Obama administration policies', 'pos'),
    ('Wow Obama knows how to give an excellent speach.', 'pos'),
    ("I loved the book Obama wrote", 'pos'),
    ('I do not like Obama', 'neg'),
    ('I am tired of Obama.', 'neg'),
    ("Obama is a bad joke", 'neg'),
    ('When will Obama resign?', 'neg'),
    ('Obama is the worst President ever', 'neg')]

Above, labeled_data is a list of tuples. In this case, each tuple is an ordered pair surrounded by parentheses. For example, ('Thanks Obama for supporting us!', 'pos') is a tuple of two strings. The main difference between tuples and lists is that tuples are immutable.
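
As a quick illustration of how tuples behave (this snippet is for illustration only and is not part of the assignment code):

pair = ('Thanks Obama for supporting us!', 'pos')
print(pair[0]) #Indexing works just like a list: prints the tweet text
print(pair[1]) #Prints the label: pos
#pair[1] = 'neg' #Uncommenting this line would raise a TypeError, because tuples are immutable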

Note that the function is not called yet. We'll call it later.

In [ ]:
#This function takes a tsv file name as parameter and
#returns a list of labeled data
def get_labeled_data(file_name):
    #Read the content of the file
    file_object = open(file_name, "r", encoding='utf-8', errors='ignore')
    big_str = file_object.read()
    file_object.close()
    
    labeled_data = [] #This will be populated by (tweet, sentiment)-pairs
    
    #Google's newline separator for tsv is "\r\n"
    big_list = big_str.split("\r\n")
    for line in big_list:
        sentiment = line[0:3] #get 'pos' or 'neg'
        #Check that the line is actually labeled (and append it to the list); ignore unlabeled data
        if sentiment == 'pos' or sentiment == 'neg': #yes, that line is labeled
            tweet = line[4:] #Tweet follows 'pos\t' or 'neg\t'
            labeled_data.append( (tweet, sentiment) ) #Append (tweet, sentiment)-pair

    return labeled_data

4. Build a vocabulary by taking all the words

The following function returns all the words as a set (which is just like a list, except that it doesn't allow duplicates). Stop words like "a", "the", etc. are excluded from the word_list, and an imperfect length-based check (keeping only words of four to six letters) further trims the vocabulary. The returned set of words will later be used as features.
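
As a reminder of how sets behave (an illustration only):

words = ['obama', 'great', 'obama', 'job']
print(set(words)) #Prints something like {'job', 'great', 'obama'}: duplicates are gone, order is arbitrary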

In [ ]:
#Given some labeled data, return all the words excluding the stop words
#Here labeled_data is a list of (tweet, sentiment)-tuples
def get_all_words(labeled_data):
    word_list = [] #This list will be populated with all the words
    for sample in labeled_data: #Each sample is a tuple of (tweet, sentiment)
        tweet = sample[0] #The first thing in sample is the tweet text
        #Now add words from tweet to the word_list, barring stop words
        for word in nltk.word_tokenize(tweet.lower()): #For each word in the tweet
            if 3 < len(word) < 7 and word not in nltk.corpus.stopwords.words('english'): #Keep 4-6 letter words that aren't stop words
                word_list.append(word) #Add it to word_list
    return set(word_list) #Return a set to get rid of duplicate words

5. Get the features for a specific tweet

The following function takes a tweet string and the vocabulary (all_words) as parameters and returns the features for the tweet as a dictionary. This dictionary stores, for each word of the vocabulary, whether that word is present in the tweet or absent from it. (Think about the table of check marks in the spam filtering example shown in class.)

In [ ]:
#Take the tweet text and all the words (or vocabulary) as parameters and
#return the features (which words among all_words are present in that tweet?)
def get_features(tweet, all_words):
    features = {} #Initialize features to an empty dictionary
    for word in all_words: #For each word in all_words
        if word in nltk.word_tokenize(tweet.lower()): #If the word exists in the tweet
            features[word] = True #Add the word:True pair to the dictionary
        else:
            features[word] = False #Add the word:False pair to the dictionary
    return features
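
Once you've run the cell above, you can see the dictionary format with a tiny made-up vocabulary (the real vocabulary built from potus.tsv is much larger):

tiny_vocabulary = {'great', 'tired', 'speech'} #Hypothetical three-word vocabulary
print(get_features('Obama has done a great job', tiny_vocabulary))
#Prints {'great': True, 'tired': False, 'speech': False} (key order may vary)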

6. Make a classifier

There are several small steps in the following classify( ) function.

First, we call the get_labeled_data function and save its return value in a variable named labeled_data.

We then call the get_all_words function from the classify function and save the return value in the variable named all_words.

After that, the following code builds the training set. It will take several minutes. Be patient.

#Now build the training set (a list)
#Each member of training_set is a (feature, sentiment)-pair
training_set = []
for sample in labeled_data: #sample is a (tweet, sentiment)-pair
    features = get_features(sample[0], all_words) #sample[0] is tweet
    training_set.append( (features, sample[1]) ) #sample[1] is sentiment

The following code builds the Naive Bayes classifier (i.e., calculates the probabilities).

classifier = nltk.NaiveBayesClassifier.train(training_set)

We can also get the most informative features (words that most often distinguish positive sentiment from negative).

classifier.show_most_informative_features()

We are now ready to test the classifier. In the test_tweet variable, you can store any string you wish.

#Test the classifier
test_tweet = "I'll never understand why anyone would vote for him!"
test_features = get_features(test_tweet, all_words)
print("Test result:", classifier.classify(test_features))

All of the above code is collected in the classify( ) function below.

In [ ]:
#This function is the main engine
def classify():
    #labeled_data is a list of "tuples" (pairs of tweet and sentiment here)
    labeled_data = get_labeled_data('potus.tsv')
    all_words = get_all_words(labeled_data)
    print("There are", len(labeled_data), "labeled tweets")
    print("There are a total of", len(all_words), "words (i.e., features)")
    print("Wait till I compute a total of", len(labeled_data), "X", len(all_words), "=", len(labeled_data)*len(all_words)/1000000.0, "million features...")
    
    #Now build the training set (a list)
    #Each member of training_set is a (feature, sentiment)-pair
    training_set = []
    for sample in labeled_data: #sample is a (tweet, sentiment)-pair
        features = get_features(sample[0], all_words) #sample[0] is tweet
        training_set.append( (features, sample[1]) ) #sample[1] is sentiment

    print("Completed building the training set!")
    
    #Build a Naive Bayes classifier
    classifier = nltk.NaiveBayesClassifier.train(training_set)
    classifier.show_most_informative_features()

    print("Classifier has been fine-tuned!")

    #Test the classifier
    test_tweet = "I'll never understand why anyone would vote for him!"
    test_features = get_features(test_tweet, all_words)
    print("Test tweet:", test_tweet)
    print("Test result:", classifier.classify(test_features))

classify()

Question: How does the training set matter? [20 points]

You will experiment with the above code to think about the impact of the training set on a machine learning task. Instead of using all 700 labeled tweets (as done above), use only a portion of them as the training set. Use slicing in the following line of code inside the classify( ) function to experiment with this.

classifier = nltk.NaiveBayesClassifier.train(training_set)
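
For example, to train on only the first 100 labeled tweets (100 is an arbitrary choice; try several different slice sizes), you could change that line to:

classifier = nltk.NaiveBayesClassifier.train(training_set[0:100])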

  1. Did you come up with the same results as before?
  2. If not, why do you think they were different? Does the number of tweets in the training set matter?
  3. What are the implications of this result for studying sentiment in any textual data set?

Answer (edit this cell):