Applying Data Science and Mathematical Modelling to Predict Election Verdict
By: Jobin Wilson
Principal R&D Architect - Data Science
Exit polls, or combinations of exit polls, have traditionally been used to predict election results. In the absence of exit polls, an interesting question is whether the results of an election can be predicted through mathematical models using publicly available data, including social media sources such as Twitter. Conducting exit polls poses logistical challenges and, most importantly, they are conducted after voters have already cast their votes, so they are of little use before an election. In the US Presidential Elections, Obama’s campaign made heavy use of data science to track public opinion and how it changed over time, so that campaigning could be optimized. Our attempt is to apply similar predictive models in the Indian context to predict election outcomes in advance, leveraging our expertise in data science and mathematical modelling.
As the outcome of the Lok Sabha election depends on the outcome in each state (or union territory), which in turn depends on the individual outcome in each constituency, it is appropriate to approach the prediction problem at the level of constituencies. The election outcome for a constituency is the aggregated preference of the voters in that constituency. Hence, a possible method of predicting the election outcome is to estimate voting preference through statistical models, making use of historical data (previous election outcomes) and changing voter preferences over time, as expressed through opinion polls and social media.
Flytxt made its first attempt at predicting the results of the 2014 Lok Sabha elections for Kerala and Delhi, using a predictive statistical model based on Bayesian inference. Building the model and generating the predictions was, per se, only part of the problem we attempted to solve. Many weeks of data collection, cleaning and preparation were needed to reach the model-building stage. Raw data on prior Lok Sabha and Assembly elections were obtained from the Election Commission of India website. Opinion poll results were collected from Wikipedia. Raw data (tweets) for Twitter sentiment analysis were collected using the Twitter API.
Tweets were used in our analysis because Twitter is a heavily used social media platform for online political deliberation, where people generally express their opinions and views on political issues. The sentiment of a random sample of users towards the political parties can be obtained by analysing their tweets using standard NLP (Natural Language Processing) techniques. The final sentiment score for each tweet was calculated by taking into account the importance of the tweet and that of the user who posted it (see Figures 1(A) and 1(B) for the aggregated scores).
Twitter sentiment analysis
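The weighting of tweet sentiment by tweet and user importance can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from our actual pipeline: the `Tweet` fields, the idea that the two importance values are multiplied together, and the use of a weighted mean as the per-party aggregate. The polarity values are assumed to come from some NLP sentiment classifier that is not shown.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    party: str           # party the tweet refers to (hypothetical label)
    polarity: float      # sentiment in [-1, 1] from an NLP classifier (not shown)
    tweet_weight: float  # importance of the tweet, e.g. from retweets (assumption)
    user_weight: float   # importance of the user, e.g. from followers (assumption)


def aggregate_sentiment(tweets):
    """Return the importance-weighted mean polarity for each party."""
    totals = {}
    weights = {}
    for t in tweets:
        w = t.tweet_weight * t.user_weight  # assumed multiplicative combination
        totals[t.party] = totals.get(t.party, 0.0) + w * t.polarity
        weights[t.party] = weights.get(t.party, 0.0) + w
    # Skip parties whose total weight is zero to avoid division by zero.
    return {p: totals[p] / weights[p] for p in totals if weights[p] > 0}


# Made-up sample data, for illustration only.
sample = [
    Tweet("A", 0.8, 2.0, 1.0),
    Tweet("A", -0.4, 1.0, 1.0),
    Tweet("B", 0.5, 1.0, 3.0),
]
scores = aggregate_sentiment(sample)
```

A weighted mean keeps the aggregate on the same scale as the individual polarities, so a heavily shared tweet from an influential user moves the party score more than an isolated one without pushing it outside the classifier's range.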
The predictive model is chosen on the assumption that there are trends in the typical voter preference for each party. Whenever an election is held and its results are observed, an opinion poll is conducted and its results are observed, or voters share their views on Twitter, a perturbed (noisy) version of this typical voter preference emerges. Since voter preferences do change (sometimes drastically) over the period leading up to the elections, it is non-trivial to learn this trend as a function of time from these perturbed observations and apply it for predictive purposes. Our predictive model learns the trend of the typical voter (using a Bayesian approach) from prior election results, opinion polls, and Twitter sentiment, and uses the trend to predict the voter preference on the day of the election. The voter preference on election day is used to obtain two types of forecasts: (a) the probability of a party (or alliance) winning in a particular constituency (as in Figures 2(A) and 2(B)), which can be interpreted as a belief about the chance of that party winning there, and (b) a set of possibilities for the number of seats won in a particular state (as in Figures 3(A) and 3(B)), each with a score that shows the chance of that possibility occurring among the set of possibilities considered.
In (b) above, the set of possibilities has been selected heuristically and includes possibilities usually suggested by opinion polls.
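To make the idea of learning a trend from noisy observations more concrete, here is a simplified sketch. It is not our actual model. It assumes a single party's vote share follows a Gaussian random walk and that each observation (past result, opinion poll, Twitter sentiment) is the true share plus Gaussian noise with a source-specific variance, so the Bayesian update reduces to a one-dimensional Kalman filter. It further assumes a two-way contest, so that a share above 0.5 means a win, and independent constituencies when counting seats. All function names, parameters and numbers are hypothetical.

```python
import math


def kalman_forecast(observations, election_day, drift_var, mean0=0.5, var0=0.05):
    """Posterior mean and variance of the vote share on election_day.

    observations: list of (day, observed_share, noise_variance), sorted by day.
    drift_var: assumed variance of the preference drift per day.
    """
    mean, var, last_day = mean0, var0, observations[0][0]
    for day, share, noise_var in observations:
        var += drift_var * (day - last_day)  # preference drifts between observations
        gain = var / (var + noise_var)       # weight given to the new observation
        mean += gain * (share - mean)        # Bayesian update of the mean
        var *= 1.0 - gain                    # uncertainty shrinks after the update
        last_day = day
    var += drift_var * (election_day - last_day)  # extrapolate to election day
    return mean, var


def win_probability(mean, var, threshold=0.5):
    """P(share > threshold) under a normal posterior (two-way contest assumed)."""
    z = (mean - threshold) / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def seat_distribution(win_probs):
    """Distribution of the number of seats won, assuming independent constituencies."""
    dist = [1.0]
    for p in win_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # seat lost
            nxt[k + 1] += mass * p      # seat won
        dist = nxt
    return dist


def score_possibilities(dist, candidates):
    """Renormalise probabilities over a heuristically chosen set of seat counts."""
    total = sum(dist[k] for k in candidates)
    return {k: dist[k] / total for k in candidates}


# Made-up observations: (day, share, noise variance). A past result is treated
# as precise, a poll as noisier, and Twitter sentiment as noisiest.
obs = [(0, 0.48, 0.0004), (300, 0.52, 0.0025), (330, 0.55, 0.01)]
mean, var = kalman_forecast(obs, election_day=360, drift_var=1e-5)
p = win_probability(mean, var)               # forecast type (a)
dist = seat_distribution([p, 0.3, 0.7])      # other two probabilities are made up
scores = score_possibilities(dist, [1, 2])   # forecast type (b)
```

Renormalising over the candidate set mirrors the description of (b): each score is the chance of a possibility relative to the other possibilities considered, not relative to every conceivable seat count.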
Other methods of combining opinion polls, prior election results, and expert opinions have been used earlier to forecast election results, especially in the US presidential elections. The most famous of these forecasts is that by Nate Silver.
The current version of our model can also be applied to obtain forecasts for other states and can therefore be used to estimate the total seat count for the Lok Sabha. However, as this is our first foray into election forecasting using our analytics expertise, Kerala and Delhi were chosen as test cases for validating the model. Delhi was chosen especially because of its strong social media presence. Future versions of the predictive model are planned in which it would be possible to attribute voter preference behaviour to specific causes.