The US presidential election is scheduled to be held around a week from now. On 8th November, the electorate of the most powerful nation in the world will vote to elect its leader for the next four years.
This year's long and entertaining election cycle has enjoyed wall-to-wall press coverage not only within the United States but also outside it. Journalists, predominantly based in the US but also some from abroad, have captured our imagination with analytical pieces explaining the issues at stake, the candidates on the ballot and the preferences of the voters. Well-researched conventional journalism has covered the election in all its gory detail and has produced a number of genuine ‘breaking news’ stories that have had a deep, lasting impact on the movement in the polls.
One group stands out among this motley bunch of commentators, and it is a particularly nerdy one: the kind of people you do not usually associate with the sensationalism of 24x7 news culture or the clichéd platitudes of horse-race pundits. They are the data journalists, a relatively new breed of journalists who build their stories on data and statistics rather than on hearsay and armchair analysis. Instead of relying on their sources or their instincts, they depend on their Excel sheets to get to the bottom of an issue.
Nate Silver, the most famous of this bunch, gained fame during the 2008 Democratic primary, when his demographics-based model predicted the winners of the bruising state-by-state fight between Hillary Clinton and Barack Obama better than many of the polls. He then signed a contract with the New York Times, where he ran the FiveThirtyEight blog for some time. He attracted more attention during the 2012 election, when he correctly predicted the result in each of the 50 states. He now maintains his own website, owned by ESPN, which is one of the few sites focused solely on data journalism. Other prominent data wizards who have dabbled in statistical political analysis and election forecasting include Nate Cohn (who writes for the Upshot blog at the New York Times), Sam Wang (of Princeton University) and the members of the Monkey Cage blog at the Washington Post.
Thanks to the efforts of these data journalists, it is now possible to get a first-hand estimate of each candidate's probability of victory in the presidential election in each state, as well as in the individual Senate races. For example, most of these websites now put the probability of a Hillary Clinton victory at around 75%-90%, in keeping with the strong lead that she has maintained in the state and national polls. Similarly, the data nerds currently give the Democrats slightly better than an even chance of flipping the Senate in their favour on Election Day.
These statistical experts base their forecasts on complex mathematical models, using a combination of demographic data, economic data, polling data and other indicators to predict the winner of each election. With largely accurate predictions over the last few election cycles, they have gradually captured the imagination of the voters as well as the respect of the mainstream media. Unlike in past election cycles, it is now rare to find experts who make gut-feel predictions that run strongly against the accepted narrative of how the election is proceeding, a narrative that is more or less defined by what these data-based models say.
Given the success of these data journalists in the US, it is pertinent to ask why such high-profile election forecasters have not yet been able to make their mark in Indian politics. After all, unlike in the US, Indian elections are held almost every year, sometimes multiple times a year, with some state or the other holding its assembly elections. With data arriving at such high frequency, and with the opportunity to outshine dreary conventional punditry and hold it to account, it is indeed surprising that we are mostly clutching at straws when it comes to predicting the results of any election with any degree of accuracy.
There are actually a number of reasons why India does not have its Nate Silver yet. In other words, there are reasons why the chaotic, multi-party, first-past-the-post electoral politics practised in a vast and diverse country like India lends itself far less readily to statistical forecasting than the rather predictable, bland, two-party, winner-take-all politics of the USA.
- India has a first-past-the-post system
The federal government in the US has a complicated structure, including a president and the two chambers of Congress. The President is not elected directly by the voters, but by the members of the Electoral College, who are in turn elected by the voters. If a presidential candidate wins a state, he or she gets all the electoral votes assigned to that state. The members of the Senate are elected by the voters of their respective states, while the members of the House are elected by the voters of their respective congressional districts.
While this may sound complicated to Indians who are used to the relative simplicity of the Westminster model that India follows, from a forecasting point of view it is much easier to forecast the results of the US presidential election or even the Senate elections. This is because in presidential and Senate elections, the candidate who wins the most votes in a state wins that state. It is as simple as that (except in Maine and Nebraska, which also allot some of their electoral college votes based on the performance of the candidates in the congressional districts, but let’s set aside that complication here). For example, if Hillary Clinton wins California on 8th November (which she almost certainly will), she will receive all 55 electoral votes of that state.
As a result, to put it simply, the task of forecasting the election in the US is reduced to forecasting the winner in each state and then summing up the electoral votes of the candidates. Further, although elections are held in all 50 states, only around a dozen are considered swing states, i.e. states where either party has a reasonable probability of winning. The partisan tides are so strong that everyone knows who will win the remaining states. In the current election cycle, the swing states include Arizona, Georgia, North Carolina, Florida, Ohio, Iowa, Nevada, New Hampshire, Pennsylvania, Wisconsin, Colorado and Virginia. Even within this subset, some states have a pronounced Republican or Democratic tilt. For example, if the polls are pointing to a tight election on Election Day, it is more or less certain that states like Georgia and Arizona will vote for the Republican candidate, while Virginia and Pennsylvania will vote for the Democratic one.
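The "call each state, then add up the electoral votes" logic described above can be sketched in a few lines of Python. This is only an illustration: the electoral vote counts are the real 2016 allocations for four states, but the state-level calls in `predicted_winner` are hypothetical placeholders and not a forecast, and the function name `tally_electoral_votes` is my own.

```python
from collections import defaultdict

# Electoral votes for a handful of states (2016 allocation).
ELECTORAL_VOTES = {"California": 55, "Texas": 38, "Florida": 29, "Ohio": 18}

# Hypothetical state-level calls, purely for illustration; not a forecast.
predicted_winner = {
    "California": "Democrat",
    "Texas": "Republican",
    "Florida": "Democrat",
    "Ohio": "Republican",
}

def tally_electoral_votes(winners, electoral_votes):
    """Sum each party's electoral votes under winner-take-all allocation."""
    totals = defaultdict(int)
    for state, party in winners.items():
        # The statewide winner takes every electoral vote of that state.
        totals[party] += electoral_votes[state]
    return dict(totals)

totals = tally_electoral_votes(predicted_winner, ELECTORAL_VOTES)
print(totals)  # {'Democrat': 84, 'Republican': 56}
```

A real model would replace the hard-coded calls with win probabilities per state and simulate many elections, but the final step is still this simple sum.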
India, on the other hand, practises a first-past-the-post parliamentary system. This means it is not enough to get right which party will win a particular state; one also has to translate that into the number of seats the party will win in that state. And in a first-past-the-post system, a swing of just a few percentage points in votes may result in a dramatic swing in the number of seats gained or lost. For example, in the 2014 general election, a number of analysts would have correctly estimated that the BJP would get the highest vote share in Uttar Pradesh. However, only the bravest of the brave would have predicted that the BJP would go on to win 71 of the 80 seats while the Bahujan Samaj Party (BSP) would end up drawing a blank. Similarly, during the Delhi Assembly election of 2015, a number of analysts had predicted a narrow or even a decisive victory for the Aam Aadmi Party (AAP), but no one had foreseen that it would go on to win 67 of the 70 seats.
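To see how a small vote swing can produce a large seat swing, consider the toy example below. The five constituencies and their vote shares are invented for illustration, the parties A, B and C are hypothetical, and applying the same swing in every seat (a "uniform swing") is a simplifying assumption rather than how real elections behave.

```python
def seats_won(constituencies):
    """Count seats per party: the highest vote share in each seat wins."""
    seats = {}
    for shares in constituencies:
        winner = max(shares, key=shares.get)
        seats[winner] = seats.get(winner, 0) + 1
    return seats

def apply_swing(constituencies, gainer, loser, swing):
    """Move `swing` percentage points from `loser` to `gainer` in every seat."""
    swung = []
    for shares in constituencies:
        new_shares = dict(shares)
        new_shares[gainer] += swing
        new_shares[loser] -= swing
        swung.append(new_shares)
    return swung

# Hypothetical vote shares (in percent) for three parties in five seats.
# Parties A and B both average 39% across the five seats.
base = [
    {"A": 38, "B": 40, "C": 22},
    {"A": 39, "B": 41, "C": 20},
    {"A": 37, "B": 40, "C": 23},
    {"A": 45, "B": 35, "C": 20},
    {"A": 36, "B": 39, "C": 25},
]

print(seats_won(base))                            # {'B': 4, 'A': 1}
print(seats_won(apply_swing(base, "A", "B", 2)))  # {'A': 5}
```

In this made-up state, a two-point swing from B to A takes A from one seat out of five to a clean sweep, even though A's average vote share only moves from 39% to 41%.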
The difficulty of predicting the number of seats won by a party in a first-past-the-post system is also evident from the poor track record of forecasters in UK parliamentary elections. Forecasters, including FiveThirtyEight, famously flubbed the UK elections in both 2010 and 2015. In 2010, they hugely overestimated the number of seats won by the Liberal Democrats, whereas in 2015 they predicted a hung parliament (with Labour as the largest party) while in reality the Conservative Party won an outright majority. If predicting a first-past-the-post election in a smaller, more heavily polled and more homogeneous country like the UK has proved so daunting, you can only imagine the challenges any such model would run into in a complex, vast and less polled country like India.
- India has a multi-party system
The US political system has two main parties competing for votes. While third-party candidates, predominantly from the Libertarian Party, the Reform Party, the Socialist Party and the Green Party, do contest the presidential elections, it is extremely difficult for them to achieve a respectable showing. There are exceptional years like 1992 and 1968 when third-party candidates did better than their historical showings, but in most elections they have been also-rans who fail to have much of an impact on the polls. Even if you zoom further into the make-up of Congress and the state governors’ houses, the dominance of the two parties stands out. There are only two independent senators (out of 100) in the Senate, no independent members (out of 435) in the House of Representatives and only one independent governor (out of 50) in the states.
The two-party system makes life much easier for forecasters. This is because undecided voters will typically vote for one party or the other, so one party’s gain generally translates into an equivalent loss for the other. Since both parties are present across the states, it is also easier to observe the demographic traits of the voters supporting each party and to predict which of the two a particular voter or a particular geographical unit will end up supporting.
India, on the other hand, has a bewildering array of small and big parties competing for a slice of the vote share. In a number of states, there are three- or four-cornered contests. Add to this the complexity of regional parties that contest in only one or two states. Then there are smaller parties and independents with limited local appeal in a particular region of a particular state. Parties also often splinter, and new parties are formed between elections; some of these flame out while others perform exceptionally well. Without a track record of performance, however, it is impossible to predict what kind of voters these new parties will attract and how much impact they will have on the election.
It makes sense to view the American and, to an extent, the UK political systems as stable systems where changes happen incrementally and in stages. Indian democracy, on the other hand, is an extremely dynamic system where political loyalty is ephemeral and changes happen in a jiffy. It is very difficult to make predictions in a rapidly changing system where the parameters do not remain stable.
Interestingly, the US primary process is the closest approximation of a multi-party race within the US political system. A number of candidates, some fresh faces and some veterans, compete for the votes of the party's supporters. The primary process shows how much harder it is to model a multi-party race than a two-party one. For example, the 2008 and 2016 Democratic primaries, which were reduced to two-candidate races, were largely predictable along demographic lines. But the 2016 Republican primary (which at one stage featured as many as 17 candidates) stumped many forecasters, who failed to take seriously the prospect of Donald Trump emerging as the nominee. Apart from this, primaries over the years have thrown up many surprises, with unheralded candidates often crushing the more touted ones, which points to the inherent unpredictability of such multi-candidate and multi-party races.
- India does not have high quality polling firms
The US has some of the best polling companies in the world, which produce frequent, high-quality polls throughout the election season. Organisations like Gallup and Pew Research were pioneers of the polling industry. More recently, a number of pollsters, media companies and even universities have commissioned their own national and state-level polls, which come in at regular intervals, with the frequency generally increasing as Election Day gets closer.
This is important because most of the forecasting models in use are heavily influenced by polling data, while the impact of demographic factors gradually wanes as the election date approaches. The presence of high-quality polling from multiple firms thus enables these models to become more confident and accurate in their forecasts. Contests that are not polled regularly, or not polled by credible pollsters, often carry a high degree of uncertainty about the final result.
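One simple way to picture this shifting balance is a weighted average whose weight on the polls grows as the election nears. The sketch below is my own illustration and not the method of any forecaster named here: the function names, the exponential weighting schedule and the 30-day half-life are all assumptions chosen for simplicity.

```python
def poll_weight(days_to_election, half_life_days=30.0):
    """Weight on polls: 1.0 on election day, 0.5 at `half_life_days` out.

    The exponential schedule and the 30-day default are assumptions.
    """
    return 0.5 ** (days_to_election / half_life_days)

def blended_forecast(poll_average, fundamentals_estimate, days_to_election):
    """Blend a polling average with a demographics-based estimate.

    Both inputs are vote shares in percent for the same candidate.
    """
    w = poll_weight(days_to_election)
    return w * poll_average + (1 - w) * fundamentals_estimate

# Hypothetical inputs: polls say 48%, the demographic model says 52%.
print(blended_forecast(48.0, 52.0, days_to_election=30))  # 50.0
print(blended_forecast(48.0, 52.0, days_to_election=0))   # 48.0
```

Thirty days out, the two inputs count equally; on election day the forecast is simply the polling average. This is also why a model of this kind is only as good as the polls it is fed.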
In India, unfortunately, the polling industry does not have a very sound track record. Even exit polls, which interview voters as they leave the booths, are not very good at predicting the final outcome. The track record of pre-election surveys is even worse. The difficulties include designing a random sample that can adequately capture the myriad diversities of the electorate and the logistics of carrying out a survey in remote areas. Yet, many a time, Indian pollsters are unable to correctly predict even a landslide win in a small state (for example, a number of polls showed the BJP ahead in Delhi during the 2015 assembly elections).
Thus, even if you construct a model where the polling results are part of the inputs (similar to the practice followed by forecasters in the US), in the absence of quality polling data the output may very well be gibberish. This is a serious problem for which no appealing solution appears to be in sight.
This problem is further exacerbated by the fact that there are no quality exit polls in India that capture political preferences along demographic lines. To my knowledge, CSDS is the only organization that carries out such surveys, but the results are not publicly available. This is in stark contrast to the US, where mainstream media outlets devote reams of paper to analyzing how each and every voter bloc voted and how their preferences changed over time and across states.
- India does not have good quality demographic data
Continuing from the last point, India also does not have high-quality demographic data available. Much of the data, such as caste and income for particular geographical units, is not available at all. Other data, such as population and religion, is updated infrequently and is not very reliable.
The US, on the other hand, has county level, congressional district level and state level data for a variety of demographic factors including racial makeup, income, educational qualification, religion, sex, age wise break-up, place of birth, ancestry, disability status, housing occupancy, etc. These data are also updated periodically and are far more reliable for usage in a statistical analysis.
In the absence of high-quality demographic data, it again becomes difficult to construct a model that uses demographic variables as inputs. For example, I may want to use the number of voters of a particular caste or religion in an assembly constituency as one of the input variables for predicting the vote shares of various parties. However, in the absence of the relevant data, it is impossible to do so.
To sum up, India has a messy democratic process, with an ever-increasing number of parties and politicians whose appeal does not stretch beyond particular areas, a first-past-the-post process that is difficult to predict, and a lack of quality polls and demographic data that makes any such forecasting even more difficult. Add to this the humungous size of the electorate, where voting patterns may change drastically within the same state, not to mention across the country, and you have a forecasting challenge on your hands. But while it is easy to talk about the difficulties, it is also true, at least to my knowledge, that not many serious and successful attempts have been made to construct statistical models that can capture the behaviour of voters in India. With a little bit of imagination and ingenuity, maybe someone will succeed in such an endeavour.
Of course, an election forecasting tool is the shiny object of political data journalism, the one that attracts readers and draws clicks; however, the regular insights obtained through data-based analysis might be more useful. Not all analysis has to result in a forecast; a proper analysis of historical data also tells us a lot about the future. Consider, for example, Harry Enten of FiveThirtyEight analyzing the exit poll data from the Iowa Senate election to conclude that the state may gradually become more Republican in the near future. Similarly, consider the excellent analysis by Nate Cohn of how the poor performance of Ted Cruz among less conservative voters in the Iowa caucus portends a struggle in more moderate states and in larger states with primaries. FiveThirtyEight also ran an excellent article on how, contrary to popular wisdom, Donald Trump's voters are actually not doing so badly economically.
Sadly though, similarly impactful, number-driven articles are hard to come by in the Indian media, where old-school conventional punditry still rules the roost. This is partly a function of the factors discussed above and partly a function of a lack of imagination and initiative. The time has definitely come for Indian readers to be offered data-based, razor-sharp analysis rather than more of the same mundane, hackneyed stuff. Considering how rapidly data journalism has gained popularity in the US, with a bit of effort it may go on to capture the imagination of readers in India too.