In previous articles, I have focused on what Twitter can tell us about people's past behaviour. While this can be extremely useful, Twitter can also be used to predict future events. In most of these cases, researchers attempt to find a correlation between Twitter activity (such as mentions or sentiment) and some real-world outcome. Correlation is a start, but on its own it is not a predictive model.
Predictive models take a sequence of data, usually a time series, and calculate what the next value in the sequence should be. The most successful predictive models I have seen in the academic literature work best when the value to be predicted is reported with a delay. For example, this study on predicting flu severity uses Twitter data to predict the number of flu cases in the United States. The Centers for Disease Control and Prevention (CDC) publishes its flu case numbers with a two-week delay. Using Twitter data, the study was able to predict the CDC's numbers well before their official release.
Let’s put this kind of predictive methodology to the test, and try to predict the outcome of American Idol. I know, I know, American Idol has probably jumped the shark, but it is a great test case for predicting results. First, the results are delayed by about 20 hours from the voting window. This means we can make our prediction before the outcome is announced. Second, American Idol is still relatively popular and will generate enough tweets for analysis. Third, one contestant is voted off American Idol every week, so we can generate a time series to use in our prediction, with weekly results on which to train the predictive model.
Rather than working with live data, I created a sample of people who discussed American Idol from season 12, which aired in early 2013. Using Conditional Independence Coupling, I gathered a sample of 256 American Idol fans who mentioned their favourite contestants. I used this sample as a virtual focus group to see which contestants they discussed each week and which had dropped off their radar.
The underlying model for who gets voted off the show is quite simple. Each week, between the broadcast of the previous week's results and that of the current week's, I counted how many people in my virtual focus group had been discussing each candidate. These counts are normalized against the average weekly count of total candidate mentions, giving a number between 0 and 1 representing the fraction of the group discussing each candidate that week. This is the model input.
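As a rough sketch of the input model, here is one plausible reading of the normalization: divide each contestant's weekly count by the total candidate mentions that week. The contestant names and counts below are invented for illustration, not real data from my sample.

```python
# Sketch of the input model: weekly counts of focus-group members
# discussing each contestant, normalized to a 0-1 share.
# Names and counts here are made up for illustration.
weekly_mentions = {"Contestant A": 120, "Contestant B": 84, "Contestant C": 36}

total = sum(weekly_mentions.values())  # total candidate mentions this week
normalized = {name: count / total for name, count in weekly_mentions.items()}

# Each value is the fraction of the group discussing that contestant.
print(normalized)
```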
Next, to calculate the model output, we need to assign a value between 0 and 1 to each candidate based on the American Idol results. It seems like American Idol goes out of its way to make predicting the results difficult. They don't give voting rankings, such as who came in first or second place. Instead, American Idol simply reports who got the fewest votes. Given this sparsity of data, I assume that each week the contestants who have previously been voted off receive no conversation (a value of 0), the contestant voted off that week also receives no conversation (again, a value of 0), and the remaining contestants split the conversation equally. Table 1 shows how this output model looks.
| Results date | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| March 7, 2013 | 0.00 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 |
| March 14, 2013 | 0.00 | 0.13 | 0.13 | 0.13 | 0.13 | 0.13 | 0.13 | 0.13 | 0.13 | 0.00 |
| March 21, 2013 | 0.00 | 0.14 | 0.14 | 0.14 | 0.00 | 0.14 | 0.14 | 0.14 | 0.14 | 0.00 |
| March 28, 2013 | 0.00 | 0.17 | 0.17 | 0.00 | 0.00 | 0.17 | 0.17 | 0.17 | 0.17 | 0.00 |
| April 4, 2013 | 0.00 | 0.20 | 0.00 | 0.00 | 0.00 | 0.20 | 0.20 | 0.20 | 0.20 | 0.00 |
| April 11, 2013 | 0.00 | 0.25 | 0.00 | 0.00 | 0.00 | 0.25 | 0.25 | 0.25 | 0.00 | 0.00 |
| April 18, 2013 | 0.00 | 0.25 | 0.00 | 0.00 | 0.00 | 0.25 | 0.25 | 0.25 | 0.00 | 0.00 |
| April 25, 2013 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.33 | 0.00 | 0.00 |
| May 2, 2013 | 0.00 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.50 | 0.00 | 0.00 |
| May 9, 2013 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 |
Table 1: Weekly output model
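The rows of Table 1 can be generated mechanically. A minimal sketch, where the helper name `output_row` and the zero-based contestant indices are my own, not from the article (Table 1 shows the shares rounded to two decimal places):

```python
def output_row(n_contestants, eliminated):
    """Output-model values for one results week.

    eliminated: indices of contestants voted off up to and including
    this week; they get 0, and the survivors split the conversation
    equally.
    """
    remaining = n_contestants - len(eliminated)
    share = 1.0 / remaining if remaining else 0.0
    return [0.0 if i in eliminated else share for i in range(n_contestants)]

# Second results week: two contestants are out, eight remain,
# so each survivor gets 1/8 = 0.125 (shown as 0.13 in Table 1).
print(output_row(10, {0, 9}))
```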
Since the input and output run between 0 and 1, we will use a logistic vector autoregression with exogenous inputs (logistic VARX). This is run using the R package Dynamic Systems Estimation (DSE), fitting a VARX model after applying the logit transform, log(x/(1-x)), to both the input and the output.
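The logit transform maps the unit interval onto the whole real line, which is what lets a linear VARX model work with share data. A small sketch in Python; the epsilon clipping is my own workaround for the exact-zero values of eliminated contestants, where log(x/(1-x)) is undefined:

```python
import math

def logit(x, eps=1e-6):
    # Clip away from 0 and 1: eliminated contestants have value exactly 0,
    # where log(x / (1 - x)) diverges.
    x = min(max(x, eps), 1 - eps)
    return math.log(x / (1 - x))

def inv_logit(z):
    # Inverse transform, to map model output back onto a 0-1 share.
    return 1.0 / (1.0 + math.exp(-z))
```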
When running the VARX model, weekly data up to and including the current week is used as the input, while the output is truncated to include only results up to the previous week. This ensures the model works exclusively with data known at the time the prediction is made. Every week, the remaining contestants are run through the model, and the lowest scorer is declared the loser.
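The walk-forward bookkeeping above can be sketched as follows. This is not the DSE VARX fit itself; as a stand-in, it fits a single shared least-squares map from input share to output share, and everything here, function names included, is illustrative:

```python
import numpy as np

def predict_loser(inputs, outputs, week, active):
    """Predict this week's loser using only data known at prediction time.

    inputs:  (weeks, contestants) array of mention shares, available up to
             and including `week` (tweets arrive before results do).
    outputs: (weeks, contestants) array of output-model values, used only
             up to `week - 1` (results lag the voting window).
    active:  indices of contestants still in the competition.
    """
    X = inputs[:week].ravel()    # training inputs, weeks 0 .. week-1
    Y = outputs[:week].ravel()   # matching known outputs, same weeks
    A = np.column_stack([X, np.ones_like(X)])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)   # shared slope, intercept
    scores = coef[0] * inputs[week] + coef[1]      # score the current week
    return min(active, key=lambda i: scores[i])    # lowest score is the loser
```

The point of the sketch is only the truncation: `outputs` is cut off before `week`, so the prediction never peeks at the current week's results.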
So how does this predictive model perform? The results of the model are shown in Table 2:
Table 2: Predicted losers from Twitter
Prior to the final five contestants, there was not enough time-series data to fit a statistically significant predictive model. Once there was, the model correctly predicted four of the five contestants who were voted off the show. This span also includes what were considered major upsets of the season, such as Angie (a clear fan favourite) being voted off in week nine.
It’s encouraging to see that Twitter data can indeed be used to predict the future. In this case, we were able to forecast the weekly loser from season 12 of American Idol. Of course, this model was trained knowing the complete results. The next test is to apply this model against the current season of American Idol. Stay tuned!
Featured Image courtesy of Wikimedia Commons, the free media repository
- Achrekar, H., Gandhe, A., Lazarus, R., Yu, S. H., & Liu, B. (2012). Twitter improves seasonal influenza prediction. In HEALTHINF (pp. 61-70).
- White, K., Li, G., & Japkowicz, N. (2012, December). Sampling online social networks using coupling from the past. In 2012 IEEE 12th International Conference on Data Mining Workshops (ICDMW) (pp. 266-272). IEEE.