Handling Missing Values in Datasets for Machine Learning
Start by figuring out why the data is missing. Ask yourself: 'Is this value missing because it wasn't recorded or because it doesn't exist?' A value that doesn't exist is usually best left as missing, while a value that simply wasn't recorded is a candidate for estimation. Answering that question relies on a skill often called 'data intuition': looking thoroughly at the dataset, establishing why it is in its current state, and working out how that state will affect your analysis.
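A quick first step toward that intuition is to count how many values are missing in each column. The sketch below assumes the dataset is a pandas DataFrame; the toy sample_dataset and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; np.nan and None both count as missing.
sample_dataset = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Oslo", "Lima", None, "Pune"],
    "score": [88.0, 92.5, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = sample_dataset.isnull().sum()
print(missing_per_column)

# Share of all cells in the dataset that are missing, as a percentage.
total_cells = sample_dataset.shape[0] * sample_dataset.shape[1]
percent_missing = missing_per_column.sum() / total_cells * 100
print(f"{percent_missing:.1f}% of cells are missing")
```

Columns with a high share of missing values are the ones worth investigating first.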
Imputation is the practice of estimating what a missing value should have been, based on other values in the same column and row.
Ways for handling missing values:
- Drop the missing values: This option is not the best for dataset accuracy, since it throws away every row or column that has a gap, but if you're in a rush it can be a viable way of handling missing data. The dropna() method can be used for this purpose. Called without any arguments (or with axis=0), it deletes every row containing at least one missing value, whereas passing axis=1 deletes every column containing at least one missing value. By default, dropna() returns a new dataset containing the remaining values and leaves the original untouched; pass inplace=True if you need to drop the data from the current dataset itself. Here is an example of its usage:
sample_dataset.dropna(axis=0, inplace=True)
Note:
- The total number of columns in a dataset can be read from the shape attribute, whose second element, shape[1], is the column count. For example:
sample_dataset.shape[1]
# Outputs the number of columns in the dataset
- The total number of rows in a dataset can be obtained by passing it to the built-in len() function (shape[0] gives the same number). For a dataset named sf_permits:
len(sf_permits)
- Fill in the missing values via Imputation: The second option is to try and fill in the missing values using the fillna() method. For example, the following line of code:
sample_dataset.fillna(5)
would return a copy of the dataset in which all NaN values are replaced with the number 5.
There is also another way of filling in missing values: copy the last or next valid observation into the gap. This can be done with the method='bfill' argument of fillna(), which uses the next valid observation to replace a given NaN value, or with the method='ffill' argument, which uses the last valid observation. Recent pandas releases deprecate the method argument in favor of the equivalent bfill() and ffill() methods.
- Fill in the missing values with an extension to Imputation: The third option builds on Imputation by adding an extra column that records whether or not the value in the current row was imputed. A True value in that extra column tells the reader of the dataset that the row's value was imputed from the adjacent values rather than observed, and that it should be treated with some skepticism. This approach helps improve the predictive power of models in some cases and seems to have no effect in others.
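A minimal sketch of this third option, assuming a pandas DataFrame. The toy data, the rooms column, and the rooms_was_missing indicator name are hypothetical, and filling with the column mean is just one possible imputation choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with gaps in the "rooms" column.
sample_dataset = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, np.nan],
    "price": [200, 250, 400, 180],
})

# Record which rows are missing BEFORE filling them in;
# once the gaps are filled, that information is gone.
sample_dataset["rooms_was_missing"] = sample_dataset["rooms"].isnull()

# Impute the gaps; the column mean is used here as one simple choice.
rooms_mean = sample_dataset["rooms"].mean()
sample_dataset["rooms"] = sample_dataset["rooms"].fillna(rooms_mean)

print(sample_dataset)
```

The order matters: the indicator column has to be created first, because after fillna() an imputed value is indistinguishable from an observed one.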
Here is a visual example of this method:
Figure 3: Before applying Imputation Extension method
Figure 4: After applying Imputation Extension method
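For comparison, the first two options from the list above can be sketched on a similar toy dataset. This assumes pandas; the data and column names are made up for illustration. The dedicated ffill() and bfill() methods are used here for the forward and backward fills.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: one gap in "a", one gap in "b", none in "c".
sample_dataset = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

# Option 1: drop rows (axis=0) or columns (axis=1) that contain a NaN.
rows_kept = sample_dataset.dropna(axis=0)      # only complete rows survive
columns_kept = sample_dataset.dropna(axis=1)   # only complete columns survive

# Option 2: fill every NaN with a constant ...
filled_constant = sample_dataset.fillna(5)

# ... or copy a neighbouring valid observation into the gap.
filled_forward = sample_dataset.ffill()    # last valid value above the gap
filled_backward = sample_dataset.bfill()   # next valid value below the gap

print(rows_kept.shape, columns_kept.shape)
print(filled_forward)
print(filled_backward)
```

None of these calls modify sample_dataset itself, because inplace=True is not passed; each one returns a new DataFrame.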
Conclusion
Well, that's it for this post! Thanks for following along. If you have any questions or concerns, please feel free to leave a comment and I will get back to you when I find the time.
If you found this article helpful, please share it, and make sure to follow me on Twitter and GitHub, connect with me on LinkedIn, and subscribe to my YouTube channel.