The Basics about Missing Data and What to Do About It

Data analysis depends on complete and accurate data, but you will often come up short because there are blank spaces in your tables. This is normal, and there are routine ways to deal with it that you can eventually automate in your cleaning process. As a beginner, however, it is good to understand the rationale and circumstances behind how and when the dreaded NA is disposed of.

The first thing to note is that many statistical summary functions allow you to easily remove your missing data from any calculations.

If you are an R user this means the function:

na.omit( )

or perhaps setting an argument within another function (such as mean):

na.rm = TRUE
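As a small sketch of the difference between the two, assuming a made-up vector called scores (not part of the original example):

```r
# Hypothetical example vector with one missing value
scores <- c(10, 20, NA, 40)

# Without na.rm the missing value propagates into the result
mean(scores)                 # NA

# na.rm = TRUE drops the NA before the mean is calculated
mean(scores, na.rm = TRUE)   # 23.33333

# na.omit() returns the vector with the missing values removed
clean_scores <- na.omit(scores)
length(clean_scores)         # 3
```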

For your maths, charts and model fitting, the missing data is simply removed. However, this is neither an easy fix nor the end of the story. When you remove missing values, the entire observation is often removed with them. You could be left with significant holes in your data that actually harm your analysis. Sometimes it is more important to determine how many missing values there are and/or to replace them in some way.
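To illustrate how whole observations disappear, here is a minimal sketch using a hypothetical data frame called people. The column names and values are invented for illustration, and filling in the mean is only one of many possible replacement strategies, not necessarily the right one for your data.

```r
# Hypothetical data frame: each row is one observation
people <- data.frame(
  age    = c(38, 20, NA, 41),
  income = c(52000, NA, 61000, 48000)
)

# na.omit() drops every row that contains at least one NA
complete_rows <- na.omit(people)
nrow(people)          # 4
nrow(complete_rows)   # 2

# Count the missing values per column before deciding what to do
colSums(is.na(people))

# One simple (and debatable) replacement: fill missing ages with the mean age
people$age[is.na(people$age)] <- mean(people$age, na.rm = TRUE)
```

Half of the rows are lost here even though only two individual values were missing, which is why it is worth counting the NAs first.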

Starting from the very basics, you should know that you can’t just do a logical test on a vector to find the NA values:

age <- c(38, 20, NA, 41, 49, NA, 28, 32)

age == NA

[1] NA NA NA NA NA NA NA NA

This is because NA stands for an unknown value, so R cannot say whether any value is equal to it, and every comparison with NA simply returns NA. To determine whether a value is actually missing you need a dedicated function. One of the most popular is is.na( ).

is.na(age)

[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE

This makes more sense. is.na( ) is one of a whole series of “is.x” functions that allow you to efficiently test whether data has a particular property or type.
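A short sketch of a few related checks, reusing the age vector from above; the choice of which functions to show is mine rather than the original author's:

```r
age <- c(38, 20, NA, 41, 49, NA, 28, 32)

# Any comparison with an unknown value is itself unknown
NA == NA            # NA
NA > 1              # NA

# Positions of the missing values
which(is.na(age))   # 3 6

# Quick check for whether any value at all is missing
anyNA(age)          # TRUE

# Other is.* functions test type rather than missingness
is.numeric(age)     # TRUE
is.character(age)   # FALSE
```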

You can count the number of cases of missing data or even generate a table showing the number of missing and non-missing cases. For example:
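The original example is not available here, so the following is a minimal sketch of one common way to do this, reusing the age vector from above:

```r
age <- c(38, 20, NA, 41, 49, NA, 28, 32)

# TRUE counts as 1, so summing the logical vector counts the missing values
sum(is.na(age))     # 2

# Proportion of values that are missing
mean(is.na(age))    # 0.25

# Table of non-missing (FALSE) versus missing (TRUE) cases
table(is.na(age))
```

The table reports six FALSE (non-missing) and two TRUE (missing) cases for this vector.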
