Friday, May 13, 2016

What Can We Learn from the Stat Wars?

A lot of friends and fellow nerds have posted their take on the controversial analysis by David Yap and Antonio Contreras suggesting that the VP race is tainted with electoral fraud. There is already enough content going around challenging the validity of their claims and analysis, so I won’t go into too much detail on that (references to those posts below). Instead, my focus will be on what we can learn about the basic principles of data analysis that are often overlooked (even by myself) when we are “in the zone” of analyzing data.
 
I can neither prove nor disprove the fraud accusations. What I can do is point out other inconsistencies in the accusations and how they could have been avoided. This is especially important today, with the proliferation of data analysts and data scientists in the workplace. With more information available to us and higher expectations of the people who analyze it, it is critical that we are aware of the principles of data analysis so we can avoid costly errors. The errors committed by Yap and Contreras can easily be the same mistakes we make ourselves when analyzing data to make business or organizational decisions.
 
DISCLAIMER: No, I am not related to David Yap. Hehe. And since we’re talking about being cautious with analysis errors, feel free to comment if you think I made a mistake as well in my analysis.
 
To non-stat friends, forgive me for the stat jargon and focus on the lessons in analysis. I think this applies to any form of critical thinking, not just in statistics or data analysis.
 

1. “Statistics means never having to say you’re certain.”

 

- Dr. Philip B. Stark, Professor of Statistics at the University of California, Berkeley.
(Quite fittingly, Dr. Stark says that his research includes applications in “election auditing”, among others.)
 
One of the issues I have with the analysis posted by Yap and Contreras is the way they call fraud, as if it were definite and beyond reasonable doubt.

In statistics, we never say that we are certain about something. You have error estimates, confidence intervals, and measures of (statistical) bias. Nothing is certain, because we do not have all the data. These concepts are taught in elementary-level statistics. A true statistician knows how to report results in relation to uncertainty.
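To make this concrete, here is a toy sketch in R (made-up numbers, not the election data): even when a linear fit looks very strong, the slope comes with a confidence interval, a range of plausible values rather than a single “certain” number.

# Toy example: report a slope with its uncertainty, not as a single number
set.seed(1)
x <- 1:50
y <- 3 * x + rnorm(50, sd = 10)   # a strong but noisy linear relationship
fit <- lm(y ~ x)

coef(fit)["x"]      # point estimate of the slope
confint(fit, "x")   # 95% confidence interval: a range, not a certainty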
 
If we really wanted to prove electoral fraud, we should look at the electronic ballots, vote transmissions, and printed receipts instead of sample data from a partial and unofficial count. If BBM’s team can provide that kind of hard evidence, then we can reach a sounder conclusion about the fraud accusations.
 

2. Always validate your model.

 

Data analysis is not about selecting a model with the best fit (or in this case, “suspicious” R-squared values). Quoting statistician George Box, “all models are wrong, but some are useful.”
 
We use models to try to understand data, but each model comes with its own share of assumptions and prerequisites. It is our responsibility to check for model validity or “usefulness” before jumping into any conclusions. Consider the following texts:

SUMMARY: R-squared is not enough, plot the residuals.


This step is often overlooked, especially when we are trying to pull insights under time pressure. But in the end, there is no excuse for trusting an inappropriate model. 
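A quick way to see why (a side sketch using R’s built-in anscombe dataset, not the election data): two of Anscombe’s famous datasets produce practically the same R-squared, yet their residual plots look nothing alike.

# Anscombe's quartet: different datasets, nearly identical R-squared values
data(anscombe)
fit1 <- lm(y1 ~ x1, data = anscombe)   # roughly linear with noise
fit2 <- lm(y2 ~ x2, data = anscombe)   # actually a smooth curve

summary(fit1)$r.squared   # about 0.67
summary(fit2)$r.squared   # also about 0.67

# The residual plots, not the R-squared, reveal the difference
par(mfrow = c(1, 2))
plot(fitted(fit1), resid(fit1), main = "Dataset 1"); abline(h = 0, lty = 2)
plot(fitted(fit2), resid(fit2), main = "Dataset 2"); abline(h = 0, lty = 2)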
To add to that, here’s Prof. Stark again on linear regression diagnostics: 

“If the data are heteroscedastic, nonlinearly associated, or have outliers, the regression line is not a good summary of the data, and it is not appropriate to use regression to summarize the data. Heteroscedasticity, nonlinearity and outliers are easier to see in a residual plot than in a scatterplot of the raw data. There can be errors of arithmetic in calculating the regression line, so that the slope or intercept is wrong. It is easy to catch such errors by looking at residual plots, where they show up as a nonzero mean or a trend.” (Heteroscedastic means that the variability of the data differs across different subsets; hetero = different, skedasis = dispersion.)
 
Basically, what this means is that we should examine the plot of residuals (the differences between the actual values and the values predicted by the model) and check for anomalies. Ideally, there should be no patterns or outliers. We will do this graphically because it is easier to see and understand that way.
 
Together with the standard xy-plot, I plotted the residuals of the analysis from Yap’s post and came up with this (see bottom of the post for the data and code):
Note: The data plotted here is a subset of the full VP race data taken from GMA’s website, shared by Earl Bautista. Although this is not the exact dataset used by Yap and Contreras, a quick check on some data points shows that the numbers are the same; Bautista’s version just has more data points in between.
 
The subset runs from May 9, 7:40pm (after the supposed glitch, or the peak of BBM’s lead) to May 10, 5:50pm. This is the period I call #LeniStrikesBack, heh.

The infamous trend line.


The linear model applied to this gives an R-squared of 0.99+, consistent with Yap’s analysis.
> summary(leniStrikesBack.lm)

Call:
lm(formula = BBM_Lead ~ Transmission_Rate, data = leniStrikesBack)

Residuals:
   Min     1Q Median     3Q    Max
-73776 -13331   -104  16010  44524

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       3741068.9    32441.9   115.3   <2e-16 ***
Transmission_Rate  -42798.5      382.8  -111.8   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24680 on 51 degrees of freedom
Multiple R-squared:  0.9959,  Adjusted R-squared:  0.9959
F-statistic: 1.25e+04 on 1 and 51 DF,  p-value: < 2.2e-16
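For anyone following along in R, a residual plot like the one below takes only a couple of lines (this is just a sketch of one common way to draw it; the exact data and script I used are linked at the bottom of the post):

# Residuals vs. fitted values for the model summarized above
plot(fitted(leniStrikesBack.lm), resid(leniStrikesBack.lm),
     xlab = "Fitted BBM lead", ylab = "Residual")
abline(h = 0, lty = 2)   # points should scatter randomly around this line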

Plotting the residuals, we get:



Curve centered at 0 with points clustered towards the right side.

If the model were an appropriate fit, the plot would show a random dispersion of points around the 0 line. Unfortunately, the plot shows a parabolic curve approximately centered at 0. No random dispersion here.

An article on r-bloggers.com provides a guide to interpreting residual plots. See here:

Different types of residual plots and their interpretations


Ideally, our plot should be unbiased and homoscedastic (a). Unfortunately, it clearly is not. It actually falls under (f) Biased and Heteroscedastic because of its curved form and uneven dispersion (many points clustered towards the end). #bias
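To see how a pattern like (f) arises from known causes, here is a small simulated sketch (made-up data, not the election counts): fitting a straight line to a curved relationship whose noise grows with x produces residuals that are both biased and heteroscedastic.

# Simulating a biased, heteroscedastic residual pattern (type f)
set.seed(2)
x <- seq(1, 10, length.out = 100)
y <- x^2 + rnorm(100, sd = x)   # curved trend, spread grows with x
bad.fit <- lm(y ~ x)            # force a straight line through it

plot(fitted(bad.fit), resid(bad.fit))
abline(h = 0, lty = 2)
# The residuals curve away from zero and fan out as x grows:
# biased and heteroscedastic, just like panel (f)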

If Yap and Contreras had done this from the start, it would have been clear that linear regression is not appropriate in this case. There are other factors (non-linearity, non-randomness of the samples, other relationships, etc.) affecting the two variables (% Transmission and BBM’s Lead). This should have been a warning sign for them to consider explanations such as those presented by other posters.
 
This does not settle the question of cheating either way. Again, we are never certain. It just says that the linear model is not appropriate, so other models must be considered.
 

3. Dig deeper into your data.

 

One of the claims made was that the trend line suggested Leni was gaining ~40k votes per 1% of transmitted ballots. The “straightness” of the line probably further suggested that this was happening at a constant rate, which is what made it look dubious.
 
Model assumptions aside, an easy way to check this claim is to go one step down into the data and look at the rate of change of BBM’s lead over the transmission rate. This is essentially just getting the slope of the line between each pair of consecutive data points. If the initial accusation were correct, we would expect a relatively flat line at around 40k when plotting the slopes against the observation numbers. Instead, we get this:

Not exactly a straight, horizontal line at 40k.

The slopes average close to 40k in favor of Leni, but you can argue that the plot looks parabolic, maybe even like a higher-order polynomial curve. The important point here is that the rate is not constant, so the relationship does not appear linear. This is another warning sign that the initial conclusion by Yap and Contreras is faulty and that the assumptions need to be checked.
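For reference, this point-to-point slope check takes only a couple of lines in R (a sketch assuming the same BBM_Lead and Transmission_Rate columns used in the regression above; the slopes are negative because BBM’s lead is shrinking):

# Slope between consecutive observations: change in BBM's lead per 1% transmitted
slopes <- diff(leniStrikesBack$BBM_Lead) / diff(leniStrikesBack$Transmission_Rate)

plot(slopes, type = "b",
     xlab = "Observation", ylab = "Change in lead per 1% transmitted")
abline(h = -40000, lty = 2)   # the claimed ~40k-per-percent rate, as a drop in BBM's lead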
 

Putting it all together

 

I can summarize all 3 points by saying that any data analysis requires the analyst to always be skeptical, no matter the results. You always have to consider uncertainty and model validity when modeling, and go through your own sanity checks once you have initial conclusions. As shown by the examples above, a few extra steps to make sure these points are covered can go a long way in helping us understand the validity of our analysis.
 
Data analysis is extremely difficult because of the need to consider all these things. I have tremendous respect for anyone who is an expert in this field. But it is inevitable that we make faulty assumptions or analyses from time to time, so keep being skeptical about your conclusions. Show your work to other subject matter experts, your colleagues, or other nerdy friends. Be eager to be proven wrong! It is only in being proven wrong that we are able to progress forward with knowledge. Those who are always right (or think they are always right) never learn.
 
On the bright side, it is exciting to see the energy with which the math/stat community has approached this issue. With President-elect Duterte pushing for freedom of information, I see this as a new hope for our society. We may find ourselves with more public data to play around with, more collaboration on analysis projects, and hopefully, more informed decisions across different sectors.
 
To Mr. Yap and Mr. Contreras, and to all others involved in this, thank you for sharing your thoughts and insights. I am looking forward to a more progressive society under our newly elected leaders. Mabuhay tayong lahat! (Long live us all!)

Sources:
Data:
R Script:
