What Went Wrong with the Poll Forecast in Ontario’s Last Election?

In my two previous posts about the June 2014 Ontario provincial election[1][2], I reported on the results of forecasting opinion polls in real time using Twitter, and compared how that forecast fared against traditional forecasting techniques. Those techniques were very successful, but somewhat unsatisfying to me personally: they required outside information, such as the aggregate polling history, and they mirrored the polls quite closely. That is fine when the polls are accurate, but it means they cannot do better than the polls when the polls are off. In this particular election, for example, the polls called for a slight Liberal victory, but not one large enough to give the Liberals a majority government. The election, however, produced a much larger gap between the Liberals and the Progressive Conservatives than predicted.

I am curious how direct measurement from Twitter would fare against the forecast method and the actual election results. In this post I will investigate a variety of direct measurement techniques and compare them to the aggregate polls and the actual election results. Please keep in mind that these predictions were not run during the election, but after the fact. As a result, I have the benefit of hindsight to pick and choose which methods are the best predictors.

Prior Work

In previous work by Tumasjan et al.[3] it was suggested that raw counts of mentions correlated with party victory. They studied the 2009 German election by examining 104,003 tweets published in the weeks leading up to the federal vote on Sept. 27, 2009. They looked for mentions of the six main parties. When looking at the percentage of tweets mentioning each party, they found that the party ranking by number of mentions was the same as the election outcomes. They also examined the relative volume of tweets and found that the percentages mirrored the actual election percentage results. Their conclusion was that the number of tweets was a plausible reflection of vote share.

Bermingham and Smeaton[4] conducted similar research on the Feb. 25, 2011 general election in Ireland. They collected 32,578 tweets relevant to the five main parties between Feb. 8-25, 2011. In addition to looking at raw counts (as in the previous study), this group also looked at the sentiment associated with each party based on party mentions in tweets. They found that the relative volume of tweets and the relative volume of positive tweets were accurate predictors of the election results. However, they found that the mean sentiment associated with each party was not an accurate predictor of the election results.

A third study, by Gayo-Avello et al.[5], looked at the 2010 U.S. Senate special election in Massachusetts and five contested Senate seats in the U.S. general congressional election that same year. They collected 234,697 tweets from 56,165 different users between Jan. 13-20, 2010, and 13,019 tweets from 6,970 different users from Oct. 26-Nov. 1, 2010. They looked at both Twitter volume and mean sentiment as predictors for the six Senate races. Twitter volume predicted the outcome in only three of the six races; sentiment analysis also predicted only three of the six correctly. Further, the two methods agreed for only one of the six races; in the other five, Twitter volume and sentiment analysis gave conflicting results. Their conclusion was that using Twitter to predict elections was no better than random chance.

Gayo-Avello followed up with some practical suggestions for improving election prediction using social media[6]. These points can be summarized as:

  • Work with proper random samples rather than streams.
  • Adjust for demographic skews.
  • Calculate sentiment based on political issues, not positive/negative words.
  • Compare against established baselines, such as incumbency and traditional polling.
  • Actually predict an election!

At this point, I have the tools in place to address these objectives. In the first article, I ran a prediction in real time, not a post hoc analysis (although this particular analysis will be post hoc). I can create random samples of Twitter users, with equal selection probability, that are as good as the random digit dialling used by most professional political pollsters[7]. I can accurately classify demographics and properly weight the social data to overcome skews[8]. And I have sentiment models that can be trained from hand-classified data.

With these tools in place, I looked to apply the different direct measures suggested in the literature to the last Ontario election.

Methodology

I used my sampling methodology to collect 2,500,821 tweets from 32,339 Ontario Twitter accounts between May 1, 2014 and 8pm on June 12, 2014 (when the polls closed). I explicitly looked in the tweets for mentions of the three main provincial party leaders: Kathleen Wynne (Liberals), Tim Hudak (Progressive Conservatives), and Andrea Horwath (NDP).

As I noted in previous articles, there were 21,960 tweets from 2,171 people mentioning the PC leader, 14,390 tweets from 1,780 users mentioning the Liberal leader, and 7,998 tweets from 1,203 accounts mentioning the NDP leader. Based on the raw counts, the Progressive Conservatives should have won the election.
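
As a quick sanity check, the raw-count share of voice for each party can be computed directly from these totals. A minimal sketch (note that Table 1 below uses election-day counts only, so its numbers differ slightly from these campaign-long shares):

```python
# Campaign-long mention counts reported above (tweets mentioning each leader).
mentions = {"PC": 21960, "Liberal": 14390, "NDP": 7998}

total = sum(mentions.values())

# Share of voice by raw tweet volume, as a percentage of all leader mentions.
shares = {party: 100.0 * count / total for party, count in mentions.items()}

for party, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{party}: {share:.1f}%")
```

By raw counts the PCs hold roughly 49.5% of mentions, which is why every volume-based measure calls the election for the wrong party.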

For 5% of the collected tweets that mentioned a candidate, I rated the tweet by hand as being either positive or negative toward the leader in question. For each candidate, two naive Bayes classifiers were trained on the hand-classified data: one for positive mentions and one for negative mentions. The remaining tweets were then assigned a sentiment score based on the probability that they were positive or negative, with the sentiment value normalized to be between 0 (negative sentiment) and 1 (positive sentiment).
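
The scoring step can be sketched as follows. This is a minimal, hand-rolled naive Bayes over word counts, not the actual classifiers used, and the training tweets shown are purely hypothetical; the two class likelihoods are normalized so each tweet gets a score between 0 (negative) and 1 (positive):

```python
import math
from collections import Counter

def train_counts(tweets):
    """Word frequencies for one class (positive or negative mentions)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts

def log_likelihood(tweet, counts, vocab_size):
    """Log P(words | class) with add-one smoothing."""
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size))
        for w in tweet.lower().split()
    )

def sentiment_score(tweet, pos_counts, neg_counts):
    """Normalize the two class likelihoods to a 0..1 positive-sentiment score."""
    vocab = set(pos_counts) | set(neg_counts)
    lp = log_likelihood(tweet, pos_counts, len(vocab))
    ln = log_likelihood(tweet, neg_counts, len(vocab))
    # Logistic of the log-likelihood difference gives P(positive | words).
    return 1.0 / (1.0 + math.exp(ln - lp))

# Hypothetical hand-classified examples (illustration only).
pos = train_counts(["great debate performance", "strong plan for ontario"])
neg = train_counts(["terrible debate performance", "weak plan will cost jobs"])

print(sentiment_score("great plan for ontario", pos, neg))
```

A tweet whose words lean toward the positive training set scores above 0.5; one leaning negative scores below it.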

Each unique user mentioning a party leader was classified according to four demographic groups:

  • Type (person or organization)
  • Gender
  • Age
  • Ethnicity

Classification was done manually with assistance from a machine classifier. The machine classifier suggested a value for each demographic group based on the account's first and last name, profile image, and the content of the Twitter stream. The researcher then either accepted the machine classification or adjusted it based on the Twitter account information.

I looked at six suggested measures:

  • Tweet Volume (number of tweets mentioning a candidate each day).
  • Tweet SoV (share of voice: number of unique people mentioning a candidate each day).
  • Positive Volume (number of tweets mentioning a candidate positively each day).
  • Positive SoV (number of unique people mentioning a candidate positively each day).
  • Mean Sentiment Volume (mean sentiment score for all tweets mentioning a candidate each day).
  • Mean Sentiment SoV (mean sentiment score for each unique person mentioning a candidate each day).
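
The two sentiment measures differ only in the unit of analysis: tweets versus unique users. A minimal sketch of both, over hypothetical (user, party, sentiment score) records for a single day:

```python
from collections import defaultdict

# Hypothetical scored mentions for one day: (user, party, sentiment in 0..1).
mentions = [
    ("u1", "Liberal", 0.9), ("u1", "Liberal", 0.7),
    ("u2", "Liberal", 0.2),
    ("u3", "PC", 0.4), ("u4", "PC", 0.6),
]

def mean_sentiment_volume(records, party):
    """Mean score over all tweets mentioning the party."""
    scores = [s for _, p, s in records if p == party]
    return sum(scores) / len(scores)

def mean_sentiment_sov(records, party):
    """Mean of per-user mean scores: each person counts once."""
    by_user = defaultdict(list)
    for user, p, s in records:
        if p == party:
            by_user[user].append(s)
    user_means = [sum(v) / len(v) for v in by_user.values()]
    return sum(user_means) / len(user_means)

print(mean_sentiment_volume(mentions, "Liberal"))
print(mean_sentiment_sov(mentions, "Liberal"))
```

Here the per-tweet mean and per-user mean disagree because one enthusiastic user tweets twice; SoV measures damp exactly that kind of heavy-poster effect.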

For each of the six measures, I looked at the results unweighted by the demographics and the results when the tweets were weighted to produce a demographic distribution similar to the Ontario demographic distribution from the 2011 Census. The value recorded was that for June 12, 2014: the day of the election.
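
The demographic correction is a standard post-stratification: each user is weighted by the census share of their demographic cell divided by that cell's share of the sample. A minimal sketch using hypothetical gender cells only (the actual weighting crossed type, gender, age, and ethnicity):

```python
from collections import Counter

# Hypothetical sample: (user's demographic cell, that user's mean sentiment).
sample = [("male", 0.8), ("male", 0.6), ("male", 0.7), ("female", 0.2)]

# Hypothetical census distribution over the same cells.
census = {"male": 0.49, "female": 0.51}

cell_counts = Counter(cell for cell, _ in sample)
n = len(sample)

# Weight = population share / sample share for each cell.
weights = {cell: census[cell] / (cell_counts[cell] / n) for cell in cell_counts}

weighted_mean = (
    sum(weights[cell] * score for cell, score in sample)
    / sum(weights[cell] for cell, _ in sample)
)
print(weighted_mean)
```

Because men are over-represented in this toy sample relative to the census, the weighted mean is pulled toward the under-sampled female score.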

Results

Table 1 summarizes the results from the 12 separate experiments. They represent what was measured from Twitter on election day. There are several interesting things to note from this experiment.

                        Without Demographics         With Demographics
                        Lib     PC      NDP          Lib     PC      NDP
Tweet Volume            30.2%   50.7%   19.1%        32.4%   46.7%   20.9%
Tweet SoV               32.3%   50.4%   17.4%        35.1%   50.3%   14.6%
Positive Volume         34.2%   54.9%   10.9%        35.0%   51.9%   13.1%
Positive SoV            38.5%   50.3%   11.2%        40.5%   49.0%   10.5%
Mean Sentiment Volume   39.5%   37.4%   23.1%        38.8%   39.0%   22.2%
Mean Sentiment SoV      41.1%   35.7%   23.2%        40.1%   33.7%   26.2%
Election Results*       41.3%   33.3%   25.3%        41.3%   33.3%   25.3%

Table 1: Results of different methods compared. *Election results are scaled to the top three parties.

Tweet counts were very wrong for this election. It did not matter if the counts were of raw tweets, positive tweets, or of the users making the tweets. In every scenario the PCs were well ahead of the Liberals on election day. It is only when considering the daily mean sentiment for each party that the results from Twitter favour the Liberals.

Second, demographic weighting has a noticeable impact on the measurements. In every case, adding demographic weighting changes the party scores by two to four percentage points, a large difference in political races. The best model, Mean Sentiment SoV, is one that includes the demographic weighting.

Finally, it should be noted that one of the models (Mean Sentiment SoV with demographics) came very close to the actual results for June 12. This being a post hoc study, I wouldn't read too much into that result. Still, it is intriguing and worth testing against other elections.

It is interesting to look at how the Mean Sentiment SoV compares to the polls over the entire campaign. Figure 1 shows the results from Twitter (solid lines) against the polls (dashed lines). The Twitter results have been averaged over a three-day moving window to smooth out day-to-day fluctuations and to better approximate the results of multi-day polls. The Twitter measurements mostly track the poll results for the latter half of the campaign, although Twitter consistently shows a larger lead for the Liberals over the Progressive Conservatives than the polls suggested.
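
The smoothing is a simple moving average; a sketch, assuming a trailing three-day window over one measurement per day (the daily values here are hypothetical):

```python
def moving_average(series, window=3):
    """Trailing moving average: each point averages up to `window` recent days."""
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1) : i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

daily = [38.0, 42.0, 40.0, 44.0, 41.0]  # hypothetical daily Liberal SoV (%)
print(moving_average(daily))
```

The first points average fewer days since the window has not yet filled, mirroring how multi-day polls accumulate respondents at the start of fieldwork.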


Figure 1: Twitter measurements (solid lines) vs. polling (dashed lines)

Discussion

Many people consider the polling for the 2014 Ontario election to have been a failure[9]. The resulting Liberal victory caught many people by surprise. In hindsight (and re-analysis of the polling data) it is possible that the Liberals and Conservatives were never as close as the polls suggested. This is consistent with the measurements made with the winning Twitter model, measuring average sentiment per user when weighted by the demographics.

I’m pleased that there exists a model that may have been able to predict the outcome of the Ontario election better than the polls, or even my previous forecast method. I am a bit suspicious that the other models fared so poorly. This could very much be a case where, with hindsight, I selected a model that is tuned to this particular election. The only way to tell is to run the models against more elections, both past and future. Check back here as I post more results!

Featured Image courtesy of Wikimedia Commons, the free media repository

Footnotes

  1. http://xplane.us/right-here-right-now-how-social-media-is-great-at-forecasting-polls/
  2. http://xplane.us/how-much-can-social-media-improve-poll-forecasting-youll-be-surprised/
  3. Tumasjan, Andranik, Timm Oliver Sprenger, Philipp G. Sandner, and Isabell M. Welpe. “Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment.” ICWSM 10 (2010): 178-185.
  4. Bermingham, Adam, and Alan F. Smeaton. “On using Twitter to monitor political sentiment and predict election results.” (2011).
  5. Gayo-Avello, Daniel, Panagiotis Takis Metaxas, and Eni Mustafaraj. “Limits of electoral predictions using twitter.” In ICWSM. 2011.
  6. Gayo-Avello, Daniel. “‘I Wanted to Predict Elections with Twitter and All I Got Was This Lousy Paper’: A Balanced Survey on Election Prediction using Twitter Data.” arXiv preprint arXiv:1204.6441 (2012).
  7. White, K., J. Li, and N. Japkowicz. “Sampling Online Social Networks Using Coupling From the Past.” 2012 IEEE 12th International Conference on Data Mining Workshop on Data Mining in Networks, pp. 266-272.
  8. http://xplane.us/older-crowd-loves-hortons-while-youth-flock-to-starbucks-and-mcdonalds/
  9. http://www.threehundredeight.com/2014/07/the-alternate-and-truer-history-of.html
