Data Cleaning

'Data Cleaning' is the process of finding and either removing or fixing 'bad data'.

By ‘bad data’ we mean missing, corrupt and/or inaccurate data points.

# Imports
import numpy as np
import pandas as pd

Missing Values

Missing Values are simply data points that are missing.

Missing values can be indicated in several ways.

Values may be literally empty, or encoded as a special value, such as Python's ‘None’, or ‘NaN’ (short for ‘not a number’), which numpy represents as a special float value.

Sometimes missing values are indicated by an arbitrarily chosen sentinel value, for example an impossible value such as ‘-999’.

Missing values usually need dealing with before any analysis.

Python - None Type

# Python has the special value 'None', which can encode a missing, or null value
data_none = None
# None is actually its own type
print(type(None))
<class 'NoneType'>
# Note that 'None' evaluates as False, so asserting it raises an AssertionError
assert data_none
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-9-55488c15a8f5> in <module>
      1 # Note that 'None' evaluates as False, so asserting it raises an AssertionError
----> 2 assert data_none

AssertionError: 
# Since None is a null type, basic operations will fail when None is in the data
data_lst = [1, 2, 3, None]
sum(data_lst) / len(data_lst)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-69cfc6b74806> in <module>
      1 # Since None is a null type, basic operations will fail when None is in the data
      2 data_lst = [1, 2, 3, None]
----> 3 sum(data_lst) / len(data_lst)

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

Numpy - NaN

# Numpy also has a special value for 'not a number' - NaN
data_nan = np.nan
# It's actually a special float value
type(data_nan)
float
# Unlike None, NaN does not evaluate as False - so this assert passes
assert data_nan
# Numpy has multiple ways to write NaN - but they are all actually the same value
#  (note: the np.NaN and np.NAN aliases were removed in NumPy 2.0 - use np.nan)
np.nan is np.NaN is np.NAN
True
# NaN values won't fail (unlike None) but they will return undefined (NaN) answers
dat_a = np.array([1, 2, 3, np.nan])
print(np.mean(dat_a))
nan
# Numpy can ignore NaN values in calculations, but you have to explicitly ask it to (e.g. np.nanmean)
np.nanmean(np.array([1, 2, 3, np.nan]))
2.0

The ‘Art’ of Data Cleaning

Dealing with missing data is a decision point: what do you do? (Each option is sketched in code after this list.)

  • Do you drop the observation?

    • What if this entails dropping a lot of observations?

  • Do you keep it, but ignore it in any calculations?

    • What if you end up with different N’s in different calculations?

  • Do you recode that data point?

    • What do you recode it to?
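
As a rough illustration of these options, here is a minimal sketch using a small made-up pandas Series (the data and variable names here are hypothetical):

# A small made-up example series, with one missing value
example = pd.Series([1.0, 2.0, np.nan, 4.0])

# Option 1: drop the missing observation entirely
dropped = example.dropna()

# Option 2: keep it, but ignore it in calculations (note that N drops from 4 to 3)
mean_ignoring_nan = example.mean()   # pandas skips NaN by default
n_used = example.count()             # counts only the non-missing values

# Option 3: recode the missing point, for example to the mean of the observed values
recoded = example.fillna(example.mean())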

Impossible Values

Be wary that datasets may also encode missing data as a special value - for example using ‘-999’ for missing age.

These have to be dealt with, or they will skew your results.

Data cleaning includes checking for and dealing with impossible values. Impossible values can also occur due to encoding or data entry errors.
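
One common way of dealing with them (sketched here with a hypothetical dataframe) is to recode the sentinel value to NaN, so that it is treated as missing data from then on:

# Sketch: recode an impossible sentinel value (-999) to NaN, using a made-up dataframe
df_example = pd.DataFrame({'age': [21, -999, 35]})
df_example['age'] = df_example['age'].replace(-999, np.nan)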

Data Cleaning in Pandas

Example problem: we have two separate files that collectively have id number, age, weight, and height for a group of people.

Let’s say that ultimately, we are interested in how age relates to height. We might want to ask, for example, if older people have a lower average height than younger people (is it really true that older people shrink!?)

Data Files:

  • messy_data.json, has id & height information

  • messy_data.csv, has id, age, & weight information

# Load in the json file
df1 = pd.read_json('files/messy_data.json')

# Since JSON files read in columns alphabetically, re-arrange columns
df1 = df1[['id', 'height']]
# Check out the data. We have a NaN value!
df1
   id  height
0   1   168.0
1   2   155.0
2   3     NaN
3   4   173.0
# Let's use pandas to drop the NaN value(s)
#  Note the inplace argument: this performs the operation on the dataframe we call it on,
#   without having to return and re-save the dataframe to a new variable
df1.dropna(inplace=True)
# Check out the data after dropping NaNs
df1
   id  height
0   1   168.0
1   2   155.0
3   4   173.0
# Read in the CSV data file
df2 = pd.read_csv('files/messy_data.csv')
# Check out the data
df2
   id   age  weight
0   1    20    11.0
1   2    27     NaN
2   3    25    14.0
3   4  -999    12.0

Note that we have another NaN value! However, it is in the weight column, a feature we are not actually planning to use for our current analysis. If we drop NaN’s from this dataframe, we would be throwing away good data - we would lose the subject with id 2, who does have the age and height information we need.
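
As an aside, another option here (a sketch only - the next cell instead drops the weight column) would be to restrict NaN-dropping to the columns we actually need, using the subset argument of dropna:

# Alternative sketch: only drop rows that are missing values in the columns we need
df2_subset = df2.dropna(subset=['id', 'age'])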

# So, since we don't need it, lets drop the weight column instead
df2.drop('weight', axis=1, inplace=True)
# Let's check if there are any NaN values in the age column (that we do need)
#  isnull() returns a boolean for each data point, indicating whether it is NaN or not
#    We can sum across the boolean array to see how many NaN values we have
sum(df2['age'].isnull())
0

There aren’t any NaN values in the data column that we need! Let’s proceed!

# Now lets merge our data together
#  Note that here we specify to use the 'id' column to combine the data
#    This means that data points will be combined based on them having the same id.
df = pd.merge(df1, df2, on='id')
# Check out our merged dataframe
df
   id  height   age
0   1   168.0    20
1   2   155.0    27
2   4   173.0  -999
# Check out basic descriptive statistics to see if things look reasonable
df.describe()
             id      height         age
count  3.000000    3.000000    3.000000
mean   2.333333  165.333333 -317.333333
std    1.527525    9.291573  590.351026
min    1.000000  155.000000 -999.000000
25%    1.500000  161.500000 -489.500000
50%    2.000000  168.000000   20.000000
75%    3.000000  170.500000   23.500000
max    4.000000  173.000000   27.000000

So, it looks like our average age is about -300. That… doesn’t seem right.

At some point in data collection, missing age values seem to have been encoded as -999. We need to deal with these data.

# Drop all rows with an impossible age
df = df[df['age'] > 0]
# So what is the actual average age?
df['age'].mean()
23.5
# Check out the cleaned data frame! It is now ready for doing real analysis with!
df
   id  height  age
0   1   168.0   20
1   2   155.0   27

Note that in this example the problematic or missing values were relatively easy to locate - since we could see all our data. In real datasets, we may have hundreds to thousands of rows and potentially dozens of columns. In those cases, searching manually for missing or problematic values will not work very well. Strategies and programmatic approaches for identifying and dealing with any bad values are necessary for any data analysis project.
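
As a minimal sketch of such programmatic checks (assuming a dataframe like the df above, with an 'age' column):

# Count missing values in each column
print(df.isnull().sum())

# Check descriptive statistics for anything suspicious (e.g. impossible minimum values)
print(df.describe())

# Flag rows with values outside a plausible range (here: ages between 0 and 120)
suspicious = df[(df['age'] < 0) | (df['age'] > 120)]
print(suspicious)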

Data Cleaning Notes

This is really just the start of data cleaning - getting data into shape for analysis can involve a considerable amount of exploration and work to ensure that high quality data goes into the analysis.

Tips for data cleaning:

  • Read any documentation for the dataset you have

    • Things like missing values might be arbitrarily encoded, but should (hopefully) be documented somewhere

  • Check that data types are as expected. If you are reading in mixed type data, make sure you end up with the correct encodings (a brief check for this is sketched after this list)

    • Having numbers read in as strings, for example, is a common way data wrangling can go wrong, and this can cause analysis errors

  • Visualize your data! Have a look to check that the distributions seem reasonable (more on this later)

  • Check basic statistics. df.describe() can give you a sense of whether the data are really skewed

  • Keep in mind how your data were collected

    • If anything comes from humans entering information into forms, this might take a lot of cleaning

      • Fixing data entry errors (typos)

      • Dealing with inputs using different units / formats / conventions

    • Cleaning this kind of data is likely to take more manual work (since mistakes are likely idiosyncratic)
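
As a brief sketch of the data type tip above (shown on the example dataframe from this section; in practice you would run such checks right after loading the data):

# Check the data type of each column
print(df.dtypes)

# If a numeric column had been read in as strings, coerce it back to numbers
#  Any values that cannot be parsed become NaN, which can then be handled as above
df['age'] = pd.to_numeric(df['age'], errors='coerce')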

Note that in many real cases, visually scanning through data tables to look for missing or bad data is likely intractable, and/or very inefficient. Looking at your data will likely entail looking at distributions and descriptive statistics, as opposed to raw data.

Quartz has a useful Bad Data Guide, and the Pandas tutorials have lots of relevant materials, including a chapter (#7) on data cleaning.