Back in June, Ontario held an election for its provincial government and by extension its premier (equivalent to a state governor). There were strong candidates from the three main parties: Kathleen Wynne, the Liberal incumbent; Tim Hudak, representing the Progressive Conservatives; and Andrea Horwath of the New Democratic Party (NDP). Election day was June 12 and the results were 38.7% for the Liberals, 31.2% for the Conservatives, and 23.7% for the NDP.
This isn’t going to be a post about using Twitter to predict who won. Several academic studies have explored the correlation between mentions on Twitter and election results. Early research found that in the 2009 German federal election and the 2011 Irish general election, the share of tweets mentioning each party was remarkably similar to its share of the vote. Further research on other elections suggested these examples may have been flukes: for each successful example of a “prediction” there are other elections where Twitter failed to predict the correct outcome. For the record, naively counting tweets from the most recent Ontario election would have “predicted” the Progressive Conservatives winning with 49.4% of the vote, the Liberals with 32.5%, and the NDP with 18.1%, results extremely different from the actual outcome.
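The naive approach amounts to treating each party’s share of Twitter mentions as if it were its vote share. Here is a minimal sketch; the mention counts are hypothetical, scaled so the output matches the percentages reported above:

```python
def naive_tweet_shares(mention_counts):
    """Naive 'prediction': each party's share of total party mentions,
    treated as if it were a vote share (rounded to one decimal)."""
    total = sum(mention_counts.values())
    return {party: round(100.0 * n / total, 1)
            for party, n in mention_counts.items()}

# Hypothetical counts, chosen to reproduce the shares reported above.
shares = naive_tweet_shares({"PC": 494, "Liberal": 325, "NDP": 181})
```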
The largest objection to using social media to predict anything is that most “prediction” models are post-hoc: the researchers know the results and look for correlations between the known outcomes and the data they have collected. With all of the various measures and statistical techniques available to the modern researcher, it is possible to predict practically anything from almost any data when you know the answer. Rather than looking backwards and finding correlations, a fairer test would be to predict results a priori: before the final results are known and (if possible) in real time.
Once we start working in real time, we leave the realm of prediction and move into forecasting. Prediction says “this is what the outcome is going to be.” Forecasting, on the other hand, says “given the available data, here is what the outcome could be, but that will change as more data comes in.” Think of the weather forecast (and notice that we don’t call it weather prediction). The weekend weather forecast could be very different on Friday than it was on Wednesday. It’s not that the weather has changed, only that the meteorologist has more information on Friday than she did on Wednesday.
So this will be a blog post about forecasting, not predicting, election results. And when it comes to forecasting election results, we already have an extremely accurate and time-tested tool: polls.
Every political junkie is familiar with how polls work. Independent organizations survey a random sample of the population, either over the telephone or online, asking them how they plan to vote. This raw data is weighted according to population demographics to provide what should be an accurate view of how the electorate is feeling on any given day.
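The demographic weighting step is essentially post-stratification: each respondent group is reweighted so that its influence matches its share of the population rather than its share of the sample. A simplified sketch with hypothetical groups and counts (real pollsters weight on several demographics at once):

```python
def weighted_vote_share(responses, population_share):
    """Post-stratification sketch.
    responses: {group: {party: respondent_count}}
    population_share: {group: fraction of the overall population}"""
    group_sizes = {g: sum(v.values()) for g, v in responses.items()}
    n = sum(group_sizes.values())
    weighted = {}
    for group, votes in responses.items():
        # Weight = population share / sample share for this group.
        w = population_share[group] / (group_sizes[group] / n)
        for party, count in votes.items():
            weighted[party] = weighted.get(party, 0.0) + w * count
    total = sum(weighted.values())
    return {party: 100.0 * v / total for party, v in weighted.items()}
```

In this example, if young voters make up half the population but only a fifth of the sample, each young respondent counts 2.5 times toward the weighted total.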
This measurement is not static. Events change how people feel about the candidates. Undecided voters may choose a side. People can change their minds. To account for this, polls are updated frequently. Some of the most accurate forecasts come from aggregating polls from multiple sources. In Canada, the threehundredeight.com blog reports the current forecast for elections based on about half a dozen independent polls.
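Aggregation itself is conceptually a weighted average of recent polls, with more weight given to larger and fresher samples. The scheme below (sample size decayed with a half-life) is a hypothetical simplification of my own, not threehundredeight.com’s actual methodology:

```python
def aggregate_polls(polls, half_life_days=7.0):
    """polls: list of (days_ago, sample_size, {party: pct}).
    Assumes every poll reports every party. Each poll is weighted
    by its sample size, decayed by its age with the given half-life."""
    agg, weight_sum = {}, 0.0
    for days_ago, sample_size, result in polls:
        w = sample_size * 0.5 ** (days_ago / half_life_days)
        weight_sum += w
        for party, pct in result.items():
            agg[party] = agg.get(party, 0.0) + w * pct
    return {party: v / weight_sum for party, v in agg.items()}
```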
So what are the limitations with polls? Polls take time to collect, analyze, and report.
Typically, poll results are delayed by 24 to 48 hours. For example, in this recent Ontario election, there was a leaders’ debate on June 3. Going into the debate, the most recent polls (from June 2) showed the Liberals with a seven-point lead over the Progressive Conservatives. After the debate, many political pundits suggested that PC leader Tim Hudak had given the best performance. However, post-debate poll numbers didn’t start appearing until June 5, when it first became clear that, as a result of the debate, the PCs had gained considerably and were now tied with the Liberals.
For a full day following the debate, the best polling forecasts were showing an (inaccurate) 6-7 point lead for the Liberals.
This is the crux of the problem. While polls are an accurate forecasting tool, they do not react in real time to the changing electoral landscape.
This is precisely the gap that social media can fill. Rather than forecasting the overall election results, social media is well suited for forecasting polls, providing real-time adjustments to the poll numbers based on what people are talking about right now.
To forecast the polling results, I used an autoregressive moving average (ARMA) model with exogenous variables (often called ARMAX). The ARMA part of the model looks at the existing poll data and tries to predict the next point from the previous points; this is often called momentum analysis or trend analysis, and is fairly standard in forecasting. An exogenous variable is data from a different source, in this case information from Twitter. By looking at the historical relationship between the output variable (the polls) and the input variable (the Twitter data), the model adjusts the value from the standard ARMA model either up or down based on the additional data.
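The full model includes proper ARMA terms, but the core mechanism can be sketched with a single autoregressive lag plus one exogenous input: regress today’s poll number on yesterday’s poll number and a Twitter-derived signal, then use the fitted coefficients to project one step ahead. Everything below (the model form, the function names, the data) is a simplified illustration of the idea, not the actual model:

```python
def fit_ar1_exog(polls, signal):
    """Least-squares fit of y[t] = a*y[t-1] + b*x[t].
    polls: poll percentages over time; signal: a Twitter-derived
    exogenous input of the same length. Returns (a, b)."""
    y = polls[1:]          # targets
    lag = polls[:-1]       # autoregressive term
    x = signal[1:]         # exogenous term, aligned with targets
    # Solve the 2x2 normal equations by Cramer's rule.
    sll = sum(l * l for l in lag)
    sxx = sum(v * v for v in x)
    slx = sum(l * v for l, v in zip(lag, x))
    sly = sum(l * t for l, t in zip(lag, y))
    sxy = sum(v * t for v, t in zip(x, y))
    det = sll * sxx - slx * slx
    a = (sly * sxx - sxy * slx) / det
    b = (sll * sxy - slx * sly) / det
    return a, b

def forecast_next(polls, signal_now, a, b):
    """One-step-ahead forecast from the last observed poll."""
    return a * polls[-1] + b * signal_now
```

The autoregressive coefficient `a` carries the poll momentum; the exogenous coefficient `b` is what lets the Twitter signal pull the forecast up or down between poll releases.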
Using this technique I can adjust the forecast using information from Twitter. At the end of June 4, the first full day following the leaders’ debate, I forecasted that the Progressive Conservatives would go from 30% (their standing at the last poll of June 2) to 32±2.8%, and that the Liberals would lose ground from 39% to 36.7±1.3% (see Figure 1). This was based solely on the forecast from the ARMA model with input from Twitter and no additional poll information. On June 5 the first post-debate poll came out, showing the PCs at 30.9% and the Liberals at 35.7%, well within the margins of error on the forecast. Based on what was happening on Twitter post-debate, the model adjusted the Liberals forecast down and the PC forecast up.
To test the forecasting technique, I recorded the forecasted poll numbers from June 5-11 at the end of each day, before any official poll numbers were released. Figure 2 shows how these daily forecasts compare to the aggregated poll numbers. The forecast tracks the actual poll results remarkably well, each day predicting what the aggregate polls will be, but doing so before the polls are released the next day.
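Evaluating a run of daily forecasts comes down to two questions: how far off was each forecast on average, and how often did the published poll land inside the forecast’s margin of error? A small sketch (the helper name is mine; the numbers in the usage line are the June 4 forecasts and June 5 poll results quoted above):

```python
def evaluate_forecasts(forecasts, actuals, margins):
    """forecasts/actuals: percentages; margins: forecast margins of error.
    Returns (mean absolute error, fraction of polls within the margin)."""
    n = len(forecasts)
    mae = sum(abs(f, ) if False else abs(f - a) for f, a in zip(forecasts, actuals)) / n
    coverage = sum(abs(f - a) <= m
                   for f, a, m in zip(forecasts, actuals, margins)) / n
    return mae, coverage

# June 4 forecasts (PC, Liberal) against the June 5 poll numbers.
mae, coverage = evaluate_forecasts([32.0, 36.7], [30.9, 35.7], [2.8, 1.3])
```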
I won’t go so far as to claim that Twitter can predict elections. That is a complex problem that requires much more work. However, Twitter can forecast poll results in real time, 24 to 48 hours before those survey results are made available. Furthermore, it can forecast changes in poll numbers based on events as they happen.
The drawback to this approach is that it is only as accurate as the poll data. If the poll data is inaccurate or incomplete, this method will not adjust for those errors.
At this point you may be wondering how much improvement Twitter data provides over traditional forecasting techniques, or whether there is any improvement at all. I’ll address that question in the next blog post.
Featured image courtesy of Wikimedia Commons.
- Tumasjan, Andranik, Timm Oliver Sprenger, Philipp G. Sandner, and Isabell M. Welpe. “Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment.” ICWSM 10 (2010): 178-185.
- Bermingham, Adam, and Alan F. Smeaton. “On Using Twitter to Monitor Political Sentiment and Predict Election Results.” (2011).
- Gayo-Avello, Daniel, Panagiotis Takis Metaxas, and Eni Mustafaraj. “Limits of Electoral Predictions Using Twitter.” ICWSM (2011).