Cleaning up bad data is one of the less romantic aspects of data science, but it is an unfortunate necessity. In our haste to make our data as clean as possible, though, might we be overdoing it a little?
Obviously we all want our data to be free from mistakes; error-strewn information is of no use to anyone. But when cleaning it up, these bad inputs and outliers often get discarded without a second thought, when they can actually be useful. Isn't it better to understand where an error originates, so as to prevent it happening again, than simply to throw the bad result away?
When you get a bad reading, there are myriad possible explanations. From faulty equipment or inexperienced operators to localised anomalies, data can be negatively affected by all sorts of factors.
When these results come in and it is immediately apparent that the data is faulty, it is common practice to remove it to prevent it affecting the other results. But some data analysis companies have encouraged their employees to treat bad data as an outlier and exclude it when aggregating the results. Crucially, it isn't deleted immediately. This means these outliers can be analysed later to determine what caused the bad reading in the first place. So next time you have to cleanse your big data, keep hold of the bad results and use them to try to prevent more errors in the future.
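One way to put this into practice is to flag suspicious values rather than drop them. The sketch below (a minimal illustration, assuming a simple z-score rule and a made-up sensor dataset) marks outliers in place so they can be excluded from summary statistics but kept for later investigation:

```python
import statistics

def flag_outliers(readings, z_threshold=3.0):
    """Flag readings more than z_threshold standard deviations from
    the mean, instead of deleting them outright."""
    mean = statistics.mean(readings)
    stdev = statistics.stdev(readings)
    flagged = []
    for value in readings:
        z = (value - mean) / stdev if stdev else 0.0
        flagged.append({"value": value, "is_outlier": abs(z) > z_threshold})
    return flagged

# Hypothetical temperature readings; the last one looks like a sensor fault.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 85.0]
flagged = flag_outliers(readings, z_threshold=2.0)

# Clean values feed the analysis; outliers are retained for root-cause review.
clean = [r["value"] for r in flagged if not r["is_outlier"]]
outliers = [r["value"] for r in flagged if r["is_outlier"]]
```

The point of the design is that `outliers` survives as its own list: instead of vanishing during cleansing, the bad readings remain available to trace back to a faulty instrument or an operator mistake.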