Twitter represents a fundamentally new instrument for making social measurements. Millions of people voluntarily express opinions on any topic imaginable --- this data source is incredibly valuable for both research and business.
For example, researchers have shown that the "mood" of communication on Twitter reflects biological rhythms and can even be used to predict the stock market. A student here at UW used geocoded tweets to plot a map of locations where "thunder" was mentioned in the context of a storm system in Summer 2012.
Researchers from Northeastern University and Harvard University who study the characteristics and dynamics of Twitter have produced an excellent resource for learning more about how Twitter can be used to analyze moods at a national scale.
In this assignment, you will collect data from the live Twitter stream and compute a simple sentiment score for each tweet you collect.
Some points to keep in mind:
Twitter provides a very rich REST API for querying the system, accessing data, and controlling your account. You can read more about the Twitter API in Twitter's developer documentation.
If you are new to Python, you may find it valuable to work through the Codecademy Python tutorials. Focus on tutorials 1-9, plus tutorial 12 on File I/O. In addition, many students have recommended Google's Python class.
You will need to establish a Python programming environment to complete this assignment. You can install Python yourself by downloading it from the Python website, or you can use the class virtual machine.
Strings in the Twitter data prefixed with the letter "u" are unicode strings. For example:
u"This is a string"
Unicode is a standard for representing a much larger variety of characters beyond the Roman alphabet (Greek, Russian, mathematical symbols, logograms from non-phonetic writing systems such as kanji, etc.).
In most circumstances, you will be able to use a unicode object just like a string.
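For instance, common string operations work directly on unicode objects (a small illustration; the specific strings below are just examples):
greeting = u"hello, " + "world"   # mixing unicode and ordinary strings works; the result is unicode
print greeting.upper()            # string methods such as upper() behave as usual
print "world" in greeting         # membership tests work too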
If you encounter an error when printing unicode, you can use the encode method to print the international characters properly, like this:
unicode_string = u"aaaà çççñññ"
encoded_string = unicode_string.encode('utf-8')  # encode to a UTF-8 byte string
print encoded_string
Once again: If you are new to Python, many students have recommended Google's Python class.
To access the live stream, you will need to install the oauth2 library so you can properly authenticate.
This library may be installed on your virtual machine. If not, you can install it yourself in your Python environment.
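For example, one common way to install it (assuming pip is available on your machine) is:
$ pip install oauth2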
The steps below will help you set up your Twitter account to be able to access the live 1% stream.
api_key = "<Enter api key>"
api_secret = "<Enter api secret>"
access_token_key = "<Enter your access token key here>"
access_token_secret = "<Enter your access token secret here>"
$ python twitterstream.py > output.txt
This command redirects the output to a file. Stop the program with Ctrl-C, but wait at least 3 minutes for data to accumulate. Keep the file output.txt for the duration of the assignment; we will be reusing it in later problems. Don't use someone else's file; we will check for uniqueness in other parts of the assignment.
For reference, a REST API query for tweets mentioning a particular term (here, "microsoft") looks like this:
https://api.twitter.com/1.1/search/tweets.json?q=microsoft
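As an illustration only (not part of the required twitterstream.py skeleton), the credentials entered above could be used with the oauth2 library to issue an authenticated request against that endpoint; the variable names are assumed to match the configuration block above:
import oauth2 as oauth

# Build an OAuth client from the credentials entered above.
consumer = oauth.Consumer(key=api_key, secret=api_secret)
token = oauth.Token(key=access_token_key, secret=access_token_secret)
client = oauth.Client(consumer, token)

# Issue a signed GET request to the search endpoint shown above and print the raw JSON response.
url = "https://api.twitter.com/1.1/search/tweets.json?q=microsoft"
response, content = client.request(url, method="GET")
print content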
What to turn in: The first 20 lines of the Twitter data you downloaded from the web. You can save the first 20 lines to a file problem_1_submission.txt by using the following command:
$ head -n 20 output.txt > problem_1_submission.txt
For this part, you will compute the sentiment of each tweet based on the sentiment scores of the terms it contains. The sentiment of a tweet is simply the sum of the sentiment scores of each of its terms.
You are provided with a skeleton file tweet_sentiment.py, which accepts two arguments on the command line: a sentiment file and a tweet file like the one you generated in Problem 1. You can run the skeleton program like this:
$ python tweet_sentiment.py AFINN-111.txt output.txt
The file AFINN-111.txt contains a list of pre-computed sentiment scores. Each line in the file contains a word or phrase followed by a sentiment score. Each word or phrase that is found in a tweet but not found in AFINN-111.txt should be given a sentiment score of 0. See the file AFINN-README.txt for more information.
To use the data in the AFINN-111.txt file, you may find it useful to build a dictionary. Note that the AFINN-111.txt file format is tab-delimited, meaning that the term and the score are separated by a tab character. A tab character is written "\t" in Python. The following snippet may be useful:
afinnfile = open("AFINN-111.txt")
scores = {} # initialize an empty dictionary
for line in afinnfile:
    term, score = line.split("\t")  # The file is tab-delimited. "\t" means "tab character"
    scores[term] = int(score)  # Convert the score to an integer.
print scores.items() # Print every (term, score) pair in the dictionary
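Once the dictionary is built, one possible approach (a sketch, not the required implementation) is to sum the scores of the terms in a tweet's text. The helper name tweet_score and the whitespace-based split below are only illustrative assumptions:
def tweet_score(text, scores):
    # Sum the sentiment scores of the terms in the text.
    # Terms that do not appear in the dictionary contribute 0.
    total = 0
    for term in text.split():
        total += scores.get(term, 0)
    return total

print tweet_score(u"good day but terrible traffic", scores)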
The data in the tweet file you generated in Problem 1 is represented as JSON, which stands for JavaScript Object Notation. It is a simple format for representing nested structures of data --- lists of lists of dictionaries of lists of .... you get the idea.
Each line of output.txt represents a streaming message. Most, but not all, will be tweets. (The skeleton program will tell you how many lines are in the file.)
It is straightforward to convert a JSON string into a Python data structure; there is a library to do so called json.
To use this library, add the following to the top of tweet_sentiment.py
import json
Then, to parse the data in output.txt, you want to apply the function json.loads to every line in the file.
This function will parse the JSON data and return a Python data structure; in this case, it returns a dictionary. If needed, take a moment to read the documentation for Python dictionaries.
You can read the Twitter documentation to understand what information each tweet contains and how to access it, but it's not too difficult to deduce the structure by direct inspection.
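For example, a minimal sketch of that workflow might look like the following (it assumes output.txt is in the current directory and that tweets carry their message in the "text" field, which other streaming messages lack):
import json

tweet_file = open("output.txt")
for line in tweet_file:
    data = json.loads(line)               # parse one streaming message per line
    if "text" in data:                    # only actual tweets have a "text" field
        print data["text"].encode('utf-8')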
Your script should print to stdout the sentiment of each tweet in the file, one numeric sentiment score per line. The first score should correspond to the first tweet, the second score should correspond to the second tweet, and so on. If you sort the scores, they won't match up. If you sort the tweets, they won't match up. If you put the tweets into a dictionary, the order will not be preserved. Once again: The nth line of the file you submit should contain only a single number that represents the score of the nth tweet in the input file!
NOTE: You must provide a score for every tweet in the sample file, even if that score is zero. You can assume the sample file will only include English tweets and no other types of streaming messages.
To grade your submission, we will run your program on a tweet file formatted the same way as the output.txt file you generated in Problem 1.
Hint: This is real-world data, and it can be messy! Refer to the Twitter documentation to understand more about the data structure you are working with. Don't get discouraged, and ask for help on the forums if you get stuck!
What to turn in: The file tweet_sentiment.py, after you've verified that it returns the correct answers.