Data validation is more than limiting entries to a drop-down list in a spreadsheet or making sure that columns are all formatted correctly. It is a systematic procedure to ensure accuracy, consistency, and quality in your data at every step of the extract, transform, load, and processing cycles.
The 2020 Iowa caucus debacle illustrates the importance of clean data, shows data auditing and validation in practice, and offers a painful lesson in how costly an error in your master data can be.
When the tabulation app was found to be reporting vote totals different from what the precinct chairs were entering, many precincts, counties, and caucus supervisors posted their raw data online. With access to the raw data, processing procedures, and recorded steps, amateur internet sleuths and news outlets were able to provide their own independent verification of the results. What kind of tests did they perform?
1) Duplication checks: check if any precincts were entered twice in base data.
2) Intermediate checksum: examine vote totals to ensure that the second round did not have more voters than the first round.
3) Process validation: ensure that candidates who did not meet the viability threshold were not awarded delegates.
4) Final checksum: ensure excess delegates were not awarded.
5) Mathematical validation: check that delegates were properly calculated.
6) Improper data mapping: check whether votes assigned to a viable candidate were switched to another candidate from one round to the next.
7) Error frequency: tally the kinds of errors and produce a histogram to determine whether they are random or point to a systematic bias.
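Several of these checks are mechanical enough to express in a few lines of code. The following is a minimal sketch in Python, not the procedure any auditor actually used. The `PrecinctReport` record, its field names, the error labels, and the 15% default viability threshold are all assumptions made for illustration. It covers checks 1 through 4 and the tally behind check 7.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PrecinctReport:
    # Hypothetical record layout; real caucus data was published in other shapes.
    precinct_id: str
    first_round: Dict[str, int]   # candidate -> first-round count
    final_round: Dict[str, int]   # candidate -> second-round count
    delegates: Dict[str, int]     # candidate -> delegates awarded
    delegates_available: int      # delegates the precinct may award


def validate(reports: List[PrecinctReport],
             viability_pct: int = 15) -> List[Tuple[str, str]]:
    """Return (precinct_id, error_kind) pairs for every failed check."""
    errors = []
    seen = Counter(r.precinct_id for r in reports)
    for r in reports:
        # 1) Duplication check: the same precinct appears more than once.
        if seen[r.precinct_id] > 1:
            errors.append((r.precinct_id, "duplicate"))

        first_total = sum(r.first_round.values())
        final_total = sum(r.final_round.values())

        # 2) Intermediate checksum: round two cannot have more voters.
        if final_total > first_total:
            errors.append((r.precinct_id, "final_exceeds_first"))

        # 3) Process validation: no delegates below the viability threshold.
        #    Integer arithmetic avoids floating-point edge cases.
        for candidate, awarded in r.delegates.items():
            count = r.final_round.get(candidate, 0)
            if awarded > 0 and count * 100 < viability_pct * first_total:
                errors.append((r.precinct_id, "nonviable_awarded"))

        # 4) Final checksum: no more delegates awarded than available.
        if sum(r.delegates.values()) > r.delegates_available:
            errors.append((r.precinct_id, "excess_delegates"))
    return errors


def error_histogram(errors: List[Tuple[str, str]]) -> Counter:
    """7) Error frequency: count how often each kind of error occurs."""
    return Counter(kind for _, kind in errors)
```

Running `validate` over every published report and passing the result to `error_histogram` gives a quick view of which kinds of errors dominate. This sketch measures the threshold against first-round attendance, which is an assumption; a real audit would use whatever attendance figure and rounding the rules specify.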
After a performance- and time-intensive operation, the very worst outcome is an output that is wrong but looks right, because it leads to a sober and rational decision that is in fact misinformed.
That is exactly what happened in Iowa. While the results are still being litigated, these independent audits found that the caucus failed every one of these validation checks. The New York Times estimated that as many as 10% of precincts may have one or more of these errors. Voters in subsequent states are making decisions informed by news coverage based on these flawed reports. The potential second- and third-order consequences are huge.
Since all these errors have been identified and reported, it should be easy enough to re-canvass and find accurate results now that the errors in the process and data import are known, right?
No, because there is a fundamental flaw in the master data here. Unlike a primary, where there are ballots to go back to, the caucus entailed getting people in a room and following a process to generate the master data. The validation checks found that many precinct secretaries did not follow the realignment rules properly, lowering the numbers for viable candidates, with some candidates going from viable to not viable. This resulted in the master data being “incorrect” under the established rules.
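The realignment problem described above can be detected, though not repaired, with a simple comparison between rounds. The sketch below assumes that a candidate who is viable after the first round should not end the second round with fewer supporters. The function name, its arguments, and the 15% default threshold are hypothetical choices for illustration.

```python
def realignment_violations(first_round, final_round, attendees, viability_pct=15):
    """Return candidates who were viable in round one but lost supporters in round two.

    first_round and final_round map candidate name -> supporter count.
    attendees is the precinct's total attendance (an assumed input).
    """
    violations = []
    for candidate, first_count in first_round.items():
        # Integer comparison avoids floating-point rounding at the threshold.
        was_viable = first_count * 100 >= viability_pct * attendees
        if was_viable and final_round.get(candidate, 0) < first_count:
            violations.append(candidate)
    return violations
```

A check like this can flag a precinct, but it cannot say what the correct numbers should have been, which is exactly the limitation of flawed master data.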
Iowa can re-canvass and recount all it wants, but this flaw in the master data cannot be corrected. It is likely we will never know who “won” the 2020 Iowa caucus, only who was ahead when we stopped fighting over it.
What are the expensive consequences of this? It appears that Iowa will lose its status as the first contest of the presidential election cycle. The economic drop-off from the loss of tourism, advertising, and spending will be in the tens of millions of dollars, with hundreds of millions more in intangible value.
It is further likely that the loss of political status will bring a drop-off in government expenditures and subsidies that benefit the state. In the aggregate, this error in the master data will do billions of dollars of economic damage to the state.
Very costly, for such a little thing.