Clean your data efficiently using Python
Everyone has seen messy data before: extra spaces, symbols mixed with numbers, inconsistent spelling or data formats. All of these make analysing data more difficult.
The data itself is not “bad” data, it is just not ready for analysis yet!
In today’s session, I want to show you how to check whether your data is “messy” and how to get it ready for your analysis.
You may now wonder: “Why should I clean my data? Can’t I just analyse the messy data?” There are multiple answers to this, but in general, messy data can lead to wrong results, slow down your analysis, increase your chance of mistakes and make your work a lot harder to reproduce. If “Amsterdam”, for example, appears in three different spellings in your data, your analysis might count three cities instead of one.
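To see what I mean, here is a minimal sketch with made-up values:

```python
import pandas as pd

# Three spellings of the same city look like three different values
cities = pd.Series(["Amsterdam", "amsterdam", " Amsterdam "])
print(cities.nunique())  # 3

# After trimming whitespace and unifying the case, only one city remains
print(cities.str.strip().str.capitalize().nunique())  # 1
```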
Let’s learn together how to spot messy data, how to fix it and how to log the changes so that everything stays transparent and reproducible!
Before we begin, I want to show you an example of messy data! Let’s have a look at the data, open the file and try to spot some irregularities together. I want you to think of issues such as unwanted characters, inconsistent spelling, mixed data types, extra whitespace, duplicate rows and missing values.
You can download the data here and find the Jupyter notebook here.
While we did the first examination with our own eyes, we can also use pandas (the Python package) to inspect the data.
# First, let's import pandas (or install if you have not installed it yet)
import pandas
# Let's read the csv file
df = pandas.read_csv("messy_HR_data.csv")
# Look at the first five rows
df.head()
|  | Name | Age | Salary | Gender | Department | Position | Joining Date | Performance Score | Email | Phone Number |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | grace | 25 | 50000 | Male | HR | Manager | April 5, 2018 | D | email@example.com | NaN |
| 1 | david | NaN | 65000 | Female | Finance | Director | 2020/02/20 | F | user@domain.com | 123-456-7890 |
| 2 | hannah | 35 | SIXTY THOUSAND | Female | Sales | Director | 01/15/2020 | C | email@example.com | 098-765-4321 |
| 3 | eve | NaN | 50000 | Female | IT | Manager | April 5, 2018 | A | name@company.org | |
| 4 | grace | NaN | NAN | Female | Finance | Manager | 01/15/2020 | F | name@company.org | 098-765-4321 |
# And the general information
df.info()
<class 'pandas.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 1000 non-null str
1 Age 841 non-null str
2 Salary 1000 non-null str
3 Gender 1000 non-null str
4 Department 1000 non-null str
5 Position 1000 non-null str
6 Joining Date 1000 non-null str
7 Performance Score 1000 non-null str
8 Email 610 non-null str
9 Phone Number 815 non-null str
dtypes: str(10)
memory usage: 78.3 KB
Do you see the first problem already? Some columns do not have 1000 non-null values: Age only has 841, Email 610 and Phone Number 815. Let’s remember this and fix it later.
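On the real file, df.isna().sum() gives us these counts directly; here is a sketch on a tiny made-up frame:

```python
import pandas as pd

# A tiny made-up frame standing in for our csv file
df = pd.DataFrame({
    "Age": [25, None, 35, None],
    "Email": ["a@b.com", None, None, "c@d.org"],
})

# isna() marks missing cells, sum() counts them per column
print(df.isna().sum())  # Age: 2, Email: 2
```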
# Let's have a look at the summary of the data
df.describe(include="all")
|  | Name | Age | Salary | Gender | Department | Position | Joining Date | Performance Score | Email | Phone Number |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000 | 841 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 610 | 815 |
| unique | 10 | 5 | 6 | 3 | 5 | 5 | 5 | 5 | 3 | 4 |
| top | alice | thirty | 65000 | Male | Finance | Assistant | 2020/02/20 | B | user@domain.com | 123-456-7890 |
| freq | 118 | 176 | 184 | 355 | 218 | 214 | 232 | 225 | 213 | 236 |
This overview gives us another hint. Of course, our file is only an example, but in real data you might wonder why one and the same phone number appears 236 times.
# Let's have a look at the unique values of each column and check whether we find something there
df.Name.unique()
<StringArray>
[ ' grace ', ' david ', ' hannah ', ' eve ', ' jack ', ' charlie ',
' frank ', ' bob ', ' alice ', ' ivy ']
Length: 10, dtype: str
The names look normal at first glance, but if we look closer, we can see that there is whitespace around them. Is that needed? I do not think so! Let’s remove it.
WAIT! Logging?
We just wanted to start deleting; however, we also want to keep track of everything. For that, I want to write a small function that we can reuse.
import datetime
def log_change(description):
    """Append the description and a timestamp to the log file."""
    with open("log_file.txt", mode="a", encoding="utf-8") as log_file:
        print(description, datetime.datetime.now(), sep="\t", file=log_file)
log_change("Started process!")
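To check that the logging works, we can read the file back. Here is a small sketch that repeats the function so it runs on its own (the file name log_file.txt matches the one above):

```python
import datetime

def log_change(description):
    # Append the description and a timestamp, separated by a tab
    with open("log_file.txt", mode="a", encoding="utf-8") as log_file:
        print(description, datetime.datetime.now(), sep="\t", file=log_file)

log_change("Started process!")

# Read the log back: every call produced one tab-separated line
with open("log_file.txt", encoding="utf-8") as log_file:
    for line in log_file:
        description, timestamp = line.rstrip("\n").split("\t")
        print(description, "->", timestamp)
```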
Back to deleting whitespace!
Deleting Whitespace
# We can do that using strip()
df.Name = df.Name.str.strip()
log_change("Deleted whitespace in column Name")
df.Name.unique()
<StringArray>
[ 'grace', 'david', 'hannah', 'eve', 'jack', 'charlie', 'frank',
'bob', 'alice', 'ivy']
Length: 10, dtype: str
df.Name.str.strip() removes whitespace from the beginning and end of each string. The .str accessor tells pandas to apply the string method to every value in the column.
Since we are looking at names, we may also want to have them capitalized, right?
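The same idea on a plain Python string, so we can see what strip() does in isolation:

```python
# strip() only removes whitespace at the ends; inner spaces stay untouched
print("  grace  ".strip())    # 'grace'
print("  mary ann ".strip())  # 'mary ann'

# lstrip() and rstrip() work on one side only
print("  grace  ".lstrip())   # 'grace  '
print("  grace  ".rstrip())   # '  grace'
```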
Manipulate cases
We can do that using str.capitalize(). In a similar manner, we can convert to lowercase, uppercase or other variants:
- str.lower(): Converts all characters to lowercase.
- str.upper(): Converts all characters to uppercase.
- str.title(): Converts first character of each word to uppercase and remaining to lowercase.
- str.capitalize(): Converts first character to uppercase and remaining to lowercase.
- str.swapcase(): Converts uppercase to lowercase and lowercase to uppercase.
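On a plain Python string, the five methods behave like this (pandas applies exactly these methods through the .str accessor):

```python
name = "ada LOVELACE"

print(name.lower())       # ada lovelace
print(name.upper())       # ADA LOVELACE
print(name.title())       # Ada Lovelace
print(name.capitalize())  # Ada lovelace
print(name.swapcase())    # ADA lovelace
```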
df.Name = df.Name.str.capitalize()
log_change("Capitalized strings in column Name")
Let’s have a look at the age next…
df.Age.unique()
<StringArray>
['25', nan, '35', '40', 'thirty', '50']
Length: 6, dtype: str
It seems that age is sometimes written out as a string “thirty” and sometimes saved as a number.
Replacing values
If there are only a few such instances, it may be handy to just replace the string with the number.
df.Age = df.Age.replace("thirty", "30")
log_change("Replacing thirty with 30 in column Age")
df.Age.unique()
<StringArray>
['25', nan, '35', '40', '30', '50']
Length: 6, dtype: str
# We can also ask pandas to convert all values to numbers and to set everything that is not a number to NA
# errors="coerce" means that invalid parsing will be set to NaN
df.Age = pandas.to_numeric(df.Age, downcast="integer", errors="coerce")
log_change("Converting Age to numeric")
df.Age.unique()
array([25., nan, 35., 40., 30., 50.])
Let’s check whether we find a similar problem in the Salary.
df.Salary.unique()
<StringArray>
['50000', '65000', 'SIXTY THOUSAND', ' NAN ', '70000', '55000']
Length: 6, dtype: str
# Let's replace SIXTY THOUSAND by the number
df.Salary = df.Salary.replace("SIXTY THOUSAND", "60000")
log_change("Replace SIXTY THOUSAND with 60000 in column Salary")
# And tell pandas that we have numerics here
df.Salary = pandas.to_numeric(df.Salary, downcast="integer",errors="coerce")
log_change("Converting Salary to numeric")
df.Salary.unique()
array([50000., 65000., 60000., nan, 70000., 55000.])
Let’s have a short look at Gender…
df.Gender.unique()
<StringArray>
['Male', 'Female', 'Other']
Length: 3, dtype: str
Ok! Looks good to me! But maybe we want to shorten the values to “m”, “f” and “d”? For that, we can create a so-called mapping: a dictionary whose keys are the words currently in the dataset and whose values are the words we want to use instead. With this, we can also standardize the spelling!
gender_mapping = {'Male': 'm', 'Female': 'f', 'Other':'d'}
# Series.replace also accepts a dictionary and swaps every key for its value
df.Gender = df.Gender.replace(gender_mapping)
log_change("Replace Male, Female and Other with m, f and d in column Gender")
df.Gender
0 m
1 f
2 f
3 f
4 f
..
995 f
996 m
997 m
998 d
999 m
Name: Gender, Length: 1000, dtype: str
I did not find any problems in the Department and Position columns. However, I do not like that the column names “Joining Date”, “Performance Score” and “Phone Number” consist of two words. That makes working with them more cumbersome, because we have to write df['Joining Date'] instead of the shorter df.Joining_Date.
Renaming columns
Let’s rename them!
df = df.rename(columns={"Joining Date": "Joining_Date", "Performance Score": "Performance_Score", "Phone Number": "Phone_Number"})
log_change("Renaming Column Joining Date, Performance Score and Phone Number to Joining_Date, Performance_Score and Phone_Number")
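If many columns contained spaces, renaming them one by one would get tedious. As an alternative (sketched here on a made-up frame that only reuses our column names), we could transform all column names at once:

```python
import pandas as pd

# A made-up frame with the same column names as our dataset
df = pd.DataFrame(columns=["Joining Date", "Performance Score", "Phone Number"])

# Replace every space in every column name with an underscore
df.columns = df.columns.str.replace(" ", "_")
print(list(df.columns))  # ['Joining_Date', 'Performance_Score', 'Phone_Number']
```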
Missing values and using na
While having a first look at our data, we saw that Age, Salary, Email and Phone Number were not fully filled and contained empty cells. We also saw that “NaN” was not used consistently for the non-existing values. Let’s unify that!
# First we want to replace empty cells with a NA value
df = df.replace("", pandas.NA)
log_change("Filling all empty cells with NA")
# pandas already stores missing values as NaN/NA internally (read_csv and
# to_numeric produced NaN), so together with the replacement above all
# missing values now share a single marker
log_change("Unify NAs to pandas NA")
df
| Name | Age | Salary | Gender | Department | Position | Joining_Date | Performance_Score | Phone_Number | ||
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Grace | 25.0 | 50000.0 | m | HR | Manager | April 5, 2018 | D | email@example.com | NaN |
| 1 | David | NaN | 65000.0 | f | Finance | Director | 2020/02/20 | F | user@domain.com | 123-456-7890 |
| 2 | Hannah | 35.0 | 60000.0 | f | Sales | Director | 01/15/2020 | C | email@example.com | 098-765-4321 |
| 3 | Eve | NaN | 50000.0 | f | IT | Manager | April 5, 2018 | A | name@company.org | |
| 4 | Grace | NaN | NaN | f | Finance | Manager | 01/15/2020 | F | name@company.org | 098-765-4321 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Jack | 50.0 | 65000.0 | f | HR | Manager | 2020/02/20 | F | NaN | 098-765-4321 |
| 996 | Jack | 30.0 | 50000.0 | m | Finance | Analyst | April 5, 2018 | C | NaN | 555-555-5555 |
| 997 | Hannah | 30.0 | 70000.0 | m | IT | Assistant | 01/15/2020 | D | user@domain.com | NaN |
| 998 | Bob | 25.0 | 65000.0 | d | Marketing | Manager | April 5, 2018 | D | email@example.com | NaN |
| 999 | Ivy | 30.0 | 60000.0 | m | Finance | Manager | 2020/02/20 | C | user@domain.com | 123-456-7890 |
1000 rows × 10 columns
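What we do with the missing values afterwards depends on the analysis. Two common options, sketched on a toy Series (not a recommendation for this particular dataset), are dropping and imputing:

```python
import pandas as pd

ages = pd.Series([25.0, None, 35.0, None, 40.0])

# Option 1: drop the rows with missing values
print(ages.dropna().tolist())               # [25.0, 35.0, 40.0]

# Option 2: fill them, e.g. with the median of the column
print(ages.fillna(ages.median()).tolist())  # [25.0, 35.0, 35.0, 35.0, 40.0]
```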
Unifying time stamps
You may have guessed it already: the last aspect we want to clean are the inconsistent time stamps. The dataset mixes formats such as “April 5, 2018” (month name, day, year), “2020/02/20” (year/month/day) and “01/15/2020” (month/day/year).
Let’s unify that using pandas. A simple approach would be df.Joining_Date = pandas.to_datetime(df.Joining_Date, errors="coerce"). However, since our data mixes several formats, we have to try them out one by one.
def parse_date(date):
    """Try to parse the given date with every format we expect in the data."""
    for date_format in ["%B %d, %Y", "%Y/%m/%d", "%m/%d/%Y", "%m-%d-%Y", "%Y.%m.%d"]:
        try:
            return pandas.to_datetime(date, format=date_format)
        except (ValueError, TypeError):
            continue
    return pandas.NaT
# Call the function
df.Joining_Date = df.Joining_Date.apply(parse_date)
log_change("Unifying time stamps in column Joining_Date")
# Unify it to a human-readable version
df.Joining_Date = df.Joining_Date.dt.strftime("%Y-%m-%d")
log_change("Converting time stamps to %Y-%m-%d in column Joining_Date")
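As a side note: recent pandas versions (2.0 and newer) also offer format="mixed", which lets pandas infer the format for each element separately. Depending on your pandas version, this one-liner may replace our loop (a sketch with three of our date strings):

```python
import pandas as pd

dates = pd.Series(["April 5, 2018", "2020/02/20", "01/15/2020"])

# format="mixed" parses each element individually;
# unparseable entries become NaT because of errors="coerce"
parsed = pd.to_datetime(dates, format="mixed", errors="coerce")
print(parsed.dt.strftime("%Y-%m-%d").tolist())  # ['2018-04-05', '2020-02-20', '2020-01-15']
```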
# Let's have a final look at our data set! It looks so much cleaner!
df
| Name | Age | Salary | Gender | Department | Position | Joining_Date | Performance_Score | Phone_Number | ||
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Grace | 25.0 | 50000.0 | m | HR | Manager | 2018-04-05 | D | email@example.com | NaN |
| 1 | David | NaN | 65000.0 | f | Finance | Director | 2020-02-20 | F | user@domain.com | 123-456-7890 |
| 2 | Hannah | 35.0 | 60000.0 | f | Sales | Director | 2020-01-15 | C | email@example.com | 098-765-4321 |
| 3 | Eve | NaN | 50000.0 | f | IT | Manager | 2018-04-05 | A | name@company.org | |
| 4 | Grace | NaN | NaN | f | Finance | Manager | 2020-01-15 | F | name@company.org | 098-765-4321 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Jack | 50.0 | 65000.0 | f | HR | Manager | 2020-02-20 | F | NaN | 098-765-4321 |
| 996 | Jack | 30.0 | 50000.0 | m | Finance | Analyst | 2018-04-05 | C | NaN | 555-555-5555 |
| 997 | Hannah | 30.0 | 70000.0 | m | IT | Assistant | 2020-01-15 | D | user@domain.com | NaN |
| 998 | Bob | 25.0 | 65000.0 | d | Marketing | Manager | 2018-04-05 | D | email@example.com | NaN |
| 999 | Ivy | 30.0 | 60000.0 | m | Finance | Manager | 2020-02-20 | C | user@domain.com | 123-456-7890 |
1000 rows × 10 columns
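One item from our initial checklist, duplicate rows, we have not touched yet. On our df this would be df.duplicated().sum() and df.drop_duplicates(); here is a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Grace", "Grace", "David"],
                   "Age": [25, 25, 40]})

# duplicated() flags every row that repeats an earlier one
print(df.duplicated().sum())      # 1

# drop_duplicates() keeps only the first occurrence of each row
print(len(df.drop_duplicates()))  # 2
```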
# Now we can start doing some statistics!
df.describe(include="all")
| Name | Age | Salary | Gender | Department | Position | Joining_Date | Performance_Score | Phone_Number | ||
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000 | 841.000000 | 833.000000 | 1000 | 1000 | 1000 | 1000 | 1000 | 610 | 815 |
| unique | 10 | NaN | NaN | 3 | 5 | 5 | 5 | 5 | 3 | 4 |
| top | Alice | NaN | NaN | m | Finance | Assistant | 2020-02-20 | B | user@domain.com | 123-456-7890 |
| freq | 118 | NaN | NaN | 355 | 218 | 214 | 232 | 225 | 213 | 236 |
| mean | NaN | 35.802616 | 60216.086435 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| std | NaN | 8.551889 | 7127.032811 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | NaN | 25.000000 | 50000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | NaN | 30.000000 | 55000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | NaN | 35.000000 | 60000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | NaN | 40.000000 | 65000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| max | NaN | 50.000000 | 70000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# And we can save the results in a file
df.to_csv("new_file.csv", index=False, na_rep="NaN")