Why is Data-Cleaning important?
Google claims that its services are good not because of high-quality models, but because of high-quality data.
If this statement doesn't stress the importance of data-cleaning in machine learning, then I don't know what will. Data-cleaning is one of the most important steps in machine learning. No matter how well designed your model is, if the dataset it is trained on is not properly cleaned, that model might as well end up in the trash can.
Data Analogy
Data is like food, and training the model is like a workout. No matter how much time you spend in the gym, if your food isn't high quality and nutritious, the workout is WASTED.
Steps for Data Cleaning :-
Steps that you can take in the right direction are:-
- Collect data from reliable sources. Do not blindly trust any source to provide high-quality data; always be skeptical about the usefulness of your dataset. You can collect data from reliable, publicly available sources like government surveys, Kaggle datasets, university datasets, etc.
- Even if your data comes from a reliable source, make sure to check its validity. The data you collected might be so old that it makes no sense to use it now.
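- Check your data for unwanted biases. For example, the Boston housing dataset that shipped with scikit-learn has a feature (B) derived from the proportion of Black residents in each town, and scikit-learn has since removed the dataset for that reason. Training on this data might make your model racist. That is why you must take these seemingly little problems into account.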
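To make the last two steps a bit more concrete, here is a minimal sketch of what a first-pass audit could look like. It is not a complete cleaning pipeline: the function name `audit_dataset`, the `SENSITIVE_HINTS` keyword list, the five-year age limit, and the sample records are all assumptions made up for illustration. It uses plain Python so it runs anywhere; in practice you would probably do the same checks with a dataframe library.

```python
from datetime import date

# Hypothetical keywords that suggest a column encodes a protected attribute.
SENSITIVE_HINTS = ("race", "ethnicity", "gender", "religion")


def audit_dataset(rows, date_field, today, max_age_days=365 * 5):
    """Report staleness, missing values, and suspicious column names."""
    columns = sorted({key for row in rows for key in row})

    # Validity: how old is the newest record?
    dates = [row[date_field] for row in rows if row.get(date_field) is not None]
    newest = max(dates) if dates else None
    is_stale = newest is None or (today - newest).days > max_age_days

    # Validity: count missing values per column.
    missing = {
        col: sum(1 for row in rows if row.get(col) is None) for col in columns
    }

    # Bias: flag columns whose name hints at a protected attribute.
    flagged = [
        col for col in columns if any(hint in col.lower() for hint in SENSITIVE_HINTS)
    ]

    return {
        "newest": newest,
        "is_stale": is_stale,
        "missing": missing,
        "flagged": flagged,
    }


# Made-up sample records, for illustration only.
sample = [
    {"collected_on": date(2012, 3, 1), "price": 21.5, "race_index": 0.4},
    {"collected_on": date(2013, 7, 9), "price": None, "race_index": 0.1},
]
report = audit_dataset(sample, "collected_on", today=date(2024, 1, 1))
print(report)
```

Keep in mind that matching on column names is only a rough heuristic. A column with an innocent-looking name (like B in the Boston dataset) would slip through, so a check like this is a starting point for a manual review, not a replacement for it.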
I know this was a relatively short blog on the importance of Data-Cleaning. More awesome stuff is coming up! Meanwhile, please follow me on Twitter and LinkedIn too.