Clean your data efficiently using Python
Everyone has seen messy data before: extra spaces, symbols mixed with numbers, inconsistent spelling or data formats. All of these make analysing data more difficult.
The data itself is not “bad” data, it is just not ready for analysis yet!
In today’s session, I want to show you how to check whether your data is “messy” and how to get it ready for your analysis.
You may now wonder: “Why should I clean my data? Can’t I just analyse the messy data?” There are multiple answers to this, but in general, messy data can lead to wrong results, slow down your analysis, increase your chance of mistakes and make your work a lot harder to reproduce. If “Amsterdam”, for example, appears in three different spellings in your data, your analysis might count three cities instead of one.
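To see what I mean, here is a minimal sketch with made-up values:

```python
import pandas as pd

# Three spellings of the same city look like three different values
cities = pd.Series(["Amsterdam", "amsterdam", " Amsterdam "])
print(cities.nunique())  # 3

# After trimming whitespace and unifying the case, only one city remains
print(cities.str.strip().str.capitalize().nunique())  # 1
```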
Let’s learn together how to spot messy data, how to fix it and how to log the changes so that everything stays transparent and reproducible!
Before we begin, I want to show you an example of messy data! Let’s have a look at the data, open the file and try to spot some irregularities together. I want you to think of issues such as unwanted characters, inconsistent spelling, mixed data types, extra whitespace, duplicate rows and missing values.
You can download the data here and find the Jupyter notebook here.
While we did the first examination with our own eyes, we can also use pandas (the Python package) to inspect the data.
# First, let's import pandas (or install if you have not installed it yet)
import pandas
# Let's read the csv file
df = pandas.read_csv("messy_HR_data.csv")
# Look at the first five rows
df.head()
|  | Name | Age | Salary | Gender | Department | Position | Joining Date | Performance Score | Email | Phone Number |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | grace | 25 | 50000 | Male | HR | Manager | April 5, 2018 | D | email@example.com | NaN |
| 1 | david | NaN | 65000 | Female | Finance | Director | 2020/02/20 | F | user@domain.com | 123-456-7890 |
| 2 | hannah | 35 | SIXTY THOUSAND | Female | Sales | Director | 01/15/2020 | C | email@example.com | 098-765-4321 |
| 3 | eve | NaN | 50000 | Female | IT | Manager | April 5, 2018 | A | name@company.org | |
| 4 | grace | NaN | NAN | Female | Finance | Manager | 01/15/2020 | F | name@company.org | 098-765-4321 |
# And the general information
df.info()
<class 'pandas.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 1000 non-null str
1 Age 841 non-null str
2 Salary 1000 non-null str
3 Gender 1000 non-null str
4 Department 1000 non-null str
5 Position 1000 non-null str
6 Joining Date 1000 non-null str
7 Performance Score 1000 non-null str
8 Email 610 non-null str
9 Phone Number 815 non-null str
dtypes: str(10)
memory usage: 78.3 KB
Do you see the first problem already? Some columns do not have 1000 non-null values: Age only has 841, Email 610 and Phone Number 815. Let’s remember this and fix it later.
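On the real file, df.isna().sum() gives us these counts directly; here is a sketch on a tiny made-up frame:

```python
import pandas as pd

# A tiny made-up frame standing in for our csv file
df = pd.DataFrame({
    "Age": [25, None, 35, None],
    "Email": ["a@b.com", None, None, "c@d.org"],
})

# isna() marks missing cells, sum() counts them per column
print(df.isna().sum())  # Age: 2, Email: 2
```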
# Let's have a look at the summary of the data
df.describe(include="all")
|  | Name | Age | Salary | Gender | Department | Position | Joining Date | Performance Score | Email | Phone Number |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000 | 841 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 610 | 815 |
| unique | 10 | 5 | 6 | 3 | 5 | 5 | 5 | 5 | 3 | 4 |
| top | alice | thirty | 65000 | Male | Finance | Assistant | 2020/02/20 | B | user@domain.com | 123-456-7890 |
| freq | 118 | 176 | 184 | 355 | 218 | 214 | 232 | 225 | 213 | 236 |
This overview gives us another hint. Of course, our file is only an example, but in real data you might wonder why one and the same phone number appears 236 times.
# Let's have a look at the unique values of each column and check whether we find something there
df.Name.unique()
<StringArray>
[ ' grace ', ' david ', ' hannah ', ' eve ', ' jack ', ' charlie ',
' frank ', ' bob ', ' alice ', ' ivy ']
Length: 10, dtype: str
The names look normal at first glance, but if we look closer, we can see that there is whitespace around them. Is that needed? I do not think so! Let’s remove it.
WAIT! Logging?
We just wanted to start deleting; however, we also want to keep track of everything. For that, I want to write a small function that we can reuse.
import datetime
def log_change(description):
    """Append the description and a timestamp to the log file."""
    with open("log_file.txt", mode="a", encoding="utf-8") as log_file:
        print(description, datetime.datetime.now(), sep="\t", file=log_file)
log_change("Started process!")
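To check that the logging works, we can read the file back. Here is a small sketch that repeats the function so it runs on its own (the file name log_file.txt matches the one above):

```python
import datetime

def log_change(description):
    # Append the description and a timestamp, separated by a tab
    with open("log_file.txt", mode="a", encoding="utf-8") as log_file:
        print(description, datetime.datetime.now(), sep="\t", file=log_file)

log_change("Started process!")

# Read the log back: every call produced one tab-separated line
with open("log_file.txt", encoding="utf-8") as log_file:
    for line in log_file:
        description, timestamp = line.rstrip("\n").split("\t")
        print(description, "->", timestamp)
```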
Back to deleting whitespace!
Deleting Whitespace
# We can do that using strip()
df.Name = df.Name.str.strip()
log_change("Deleted whitespace in column Name")
df.Name.unique()
<StringArray>
[ 'grace', 'david', 'hannah', 'eve', 'jack', 'charlie', 'frank',
'bob', 'alice', 'ivy']
Length: 10, dtype: str
df.Name.str.strip() removes whitespace from the beginning and end of each string. The .str accessor tells pandas to apply the string method to every value in the column.
Since we are looking at names, we may also want to have them capitalized, right?
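The same idea on a plain Python string, so we can see what strip() does in isolation:

```python
# strip() only removes whitespace at the ends; inner spaces stay untouched
print("  grace  ".strip())    # 'grace'
print("  mary ann ".strip())  # 'mary ann'

# lstrip() and rstrip() work on one side only
print("  grace  ".lstrip())   # 'grace  '
print("  grace  ".rstrip())   # '  grace'
```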
Manipulate cases
We can do that using str.capitalize(). In a similar manner, we can convert to lowercase, uppercase or other variants:
- str.lower(): Converts all characters to lowercase.
- str.upper(): Converts all characters to uppercase.
- str.title(): Converts first character of each word to uppercase and remaining to lowercase.
- str.capitalize(): Converts first character to uppercase and remaining to lowercase.
- str.swapcase(): Converts uppercase to lowercase and lowercase to uppercase.
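On a plain Python string, the five methods behave like this (pandas applies exactly these methods through the .str accessor):

```python
name = "ada LOVELACE"

print(name.lower())       # ada lovelace
print(name.upper())       # ADA LOVELACE
print(name.title())       # Ada Lovelace
print(name.capitalize())  # Ada lovelace
print(name.swapcase())    # ADA lovelace
```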
df.Name = df.Name.str.capitalize()
log_change("Capitalized strings in column Name")
Let’s have a look at the age next…
df.Age.unique()
<StringArray>
['25', nan, '35', '40', 'thirty', '50']
Length: 6, dtype: str
It seems that age is sometimes written out as a string “thirty” and sometimes saved as a number.
Replacing values
If there are only a few such instances, it may be handy to just replace the string with the number.
df.Age = df.Age.replace("thirty", "30")
log_change("Replacing thirty with 30 in column Age")
df.Age.unique()
<StringArray>
['25', nan, '35', '40', '30', '50']
Length: 6, dtype: str
# We can also ask pandas to convert all values to numbers and to set everything that is not a number to NA
# errors="coerce" means that invalid parsing will be set to NaN
df.Age = pandas.to_numeric(df.Age, downcast="integer", errors="coerce")
log_change("Converting Age to numeric")
df.Age.unique()
array([25., nan, 35., 40., 30., 50.])
Let’s check whether we find a similar problem in the Salary.
df.Salary.unique()
<StringArray>
['50000', '65000', 'SIXTY THOUSAND', ' NAN ', '70000', '55000']
Length: 6, dtype: str
# Let's replace SIXTY THOUSAND by the number
df.Salary = df.Salary.replace("SIXTY THOUSAND", "60000")
log_change("Replace SIXTY THOUSAND with 60000 in column Salary")
# And tell pandas that we have numerics here
df.Salary = pandas.to_numeric(df.Salary, downcast="integer",errors="coerce")
log_change("Converting Salary to numeric")
df.Salary.unique()
array([50000., 65000., 60000., nan, 70000., 55000.])
Let’s have a short look at Gender…
df.Gender.unique()
<StringArray>
['Male', 'Female', 'Other']
Length: 3, dtype: str
Ok! Looks good to me! But maybe we want to shorten the values to “m”, “f” and “d”? For that, we can create a so-called mapping: a dictionary whose keys are the words currently in the dataset and whose values are the words we want to use instead. With this, we can also standardize the spelling!
gender_mapping = {'Male': 'm', 'Female': 'f', 'Other':'d'}
# Series.replace also accepts a dictionary and swaps every key for its value
df.Gender = df.Gender.replace(gender_mapping)
log_change("Replace Male, Female and Other with m, f and d in column Gender")
df.Gender
0 m
1 f
2 f
3 f
4 f
..
995 f
996 m
997 m
998 d
999 m
Name: Gender, Length: 1000, dtype: str
I did not find any problems in the Department and Position columns. However, I do not like that the column names “Joining Date”, “Performance Score” and “Phone Number” consist of two words. That makes working with them more cumbersome, because we have to write df['Joining Date'] instead of the shorter df.Joining_Date.
Renaming columns
Let’s rename them!
df = df.rename(columns={"Joining Date": "Joining_Date", "Performance Score": "Performance_Score", "Phone Number": "Phone_Number"})
log_change("Renaming Column Joining Date, Performance Score and Phone Number to Joining_Date, Performance_Score and Phone_Number")
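If many columns contained spaces, renaming them one by one would get tedious. As an alternative (sketched here on a made-up frame that only reuses our column names), we could transform all column names at once:

```python
import pandas as pd

# A made-up frame with the same column names as our dataset
df = pd.DataFrame(columns=["Joining Date", "Performance Score", "Phone Number"])

# Replace every space in every column name with an underscore
df.columns = df.columns.str.replace(" ", "_")
print(list(df.columns))  # ['Joining_Date', 'Performance_Score', 'Phone_Number']
```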
Missing values and using na
While having a first look at our data, we saw that Age, Salary, Email and Phone Number were not fully filled and contained empty cells. We also saw that “NaN” was not used consistently for the non-existing values. Let’s unify that!
# First we want to replace empty cells with a NA value
df = df.replace("", pandas.NA)
log_change("Filling all empty cells with NA")
# pandas already stores missing values as NaN/NA internally (read_csv and
# to_numeric produced NaN), so together with the replacement above all
# missing values now share a single marker
log_change("Unify NAs to pandas NA")
df
| Name | Age | Salary | Gender | Department | Position | Joining_Date | Performance_Score | Phone_Number | ||
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Grace | 25.0 | 50000.0 | m | HR | Manager | April 5, 2018 | D | email@example.com | NaN |
| 1 | David | NaN | 65000.0 | f | Finance | Director | 2020/02/20 | F | user@domain.com | 123-456-7890 |
| 2 | Hannah | 35.0 | 60000.0 | f | Sales | Director | 01/15/2020 | C | email@example.com | 098-765-4321 |
| 3 | Eve | NaN | 50000.0 | f | IT | Manager | April 5, 2018 | A | name@company.org | |
| 4 | Grace | NaN | NaN | f | Finance | Manager | 01/15/2020 | F | name@company.org | 098-765-4321 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Jack | 50.0 | 65000.0 | f | HR | Manager | 2020/02/20 | F | NaN | 098-765-4321 |
| 996 | Jack | 30.0 | 50000.0 | m | Finance | Analyst | April 5, 2018 | C | NaN | 555-555-5555 |
| 997 | Hannah | 30.0 | 70000.0 | m | IT | Assistant | 01/15/2020 | D | user@domain.com | NaN |
| 998 | Bob | 25.0 | 65000.0 | d | Marketing | Manager | April 5, 2018 | D | email@example.com | NaN |
| 999 | Ivy | 30.0 | 60000.0 | m | Finance | Manager | 2020/02/20 | C | user@domain.com | 123-456-7890 |
1000 rows × 10 columns
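What we do with the missing values afterwards depends on the analysis. Two common options, sketched on a toy Series (not a recommendation for this particular dataset), are dropping and imputing:

```python
import pandas as pd

ages = pd.Series([25.0, None, 35.0, None, 40.0])

# Option 1: drop the rows with missing values
print(ages.dropna().tolist())               # [25.0, 35.0, 40.0]

# Option 2: fill them, e.g. with the median of the column
print(ages.fillna(ages.median()).tolist())  # [25.0, 35.0, 35.0, 35.0, 40.0]
```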
Unifying time stamps
You may have guessed it already: the last aspect we want to clean are the inconsistent time stamps. The dataset mixes formats such as “April 5, 2018” (month name, day, year), “2020/02/20” (year/month/day) and “01/15/2020” (month/day/year).
Let’s unify that using pandas. A simple approach would be df.Joining_Date = pandas.to_datetime(df.Joining_Date, errors="coerce"). However, since our data mixes several formats, we have to try them out one by one.
def parse_date(date):
    """Try to parse the given date with every format we expect in the data."""
    for date_format in ["%B %d, %Y", "%Y/%m/%d", "%m/%d/%Y", "%m-%d-%Y", "%Y.%m.%d"]:
        try:
            return pandas.to_datetime(date, format=date_format)
        except (ValueError, TypeError):
            continue
    return pandas.NaT
# Call the function
df.Joining_Date = df.Joining_Date.apply(parse_date)
log_change("Unifying time stamps in column Joining_Date")
# Unify it to a human-readable version
df.Joining_Date = df.Joining_Date.dt.strftime("%Y-%m-%d")
log_change("Converting time stamps to %Y-%m-%d in column Joining_Date")
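As a side note: recent pandas versions (2.0 and newer) also offer format="mixed", which lets pandas infer the format for each element separately. Depending on your pandas version, this one-liner may replace our loop (a sketch with three of our date strings):

```python
import pandas as pd

dates = pd.Series(["April 5, 2018", "2020/02/20", "01/15/2020"])

# format="mixed" parses each element individually;
# unparseable entries become NaT because of errors="coerce"
parsed = pd.to_datetime(dates, format="mixed", errors="coerce")
print(parsed.dt.strftime("%Y-%m-%d").tolist())  # ['2018-04-05', '2020-02-20', '2020-01-15']
```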
# Let's have a final look at our data set! It looks so much cleaner!
df
| Name | Age | Salary | Gender | Department | Position | Joining_Date | Performance_Score | Phone_Number | ||
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Grace | 25.0 | 50000.0 | m | HR | Manager | 2018-04-05 | D | email@example.com | NaN |
| 1 | David | NaN | 65000.0 | f | Finance | Director | 2020-02-20 | F | user@domain.com | 123-456-7890 |
| 2 | Hannah | 35.0 | 60000.0 | f | Sales | Director | 2020-01-15 | C | email@example.com | 098-765-4321 |
| 3 | Eve | NaN | 50000.0 | f | IT | Manager | 2018-04-05 | A | name@company.org | |
| 4 | Grace | NaN | NaN | f | Finance | Manager | 2020-01-15 | F | name@company.org | 098-765-4321 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Jack | 50.0 | 65000.0 | f | HR | Manager | 2020-02-20 | F | NaN | 098-765-4321 |
| 996 | Jack | 30.0 | 50000.0 | m | Finance | Analyst | 2018-04-05 | C | NaN | 555-555-5555 |
| 997 | Hannah | 30.0 | 70000.0 | m | IT | Assistant | 2020-01-15 | D | user@domain.com | NaN |
| 998 | Bob | 25.0 | 65000.0 | d | Marketing | Manager | 2018-04-05 | D | email@example.com | NaN |
| 999 | Ivy | 30.0 | 60000.0 | m | Finance | Manager | 2020-02-20 | C | user@domain.com | 123-456-7890 |
1000 rows × 10 columns
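One item from our initial checklist, duplicate rows, we have not touched yet. On our df this would be df.duplicated().sum() and df.drop_duplicates(); here is a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Grace", "Grace", "David"],
                   "Age": [25, 25, 40]})

# duplicated() flags every row that repeats an earlier one
print(df.duplicated().sum())      # 1

# drop_duplicates() keeps only the first occurrence of each row
print(len(df.drop_duplicates()))  # 2
```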
# Now we can start doing some statistics!
df.describe(include="all")
| Name | Age | Salary | Gender | Department | Position | Joining_Date | Performance_Score | Phone_Number | ||
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000 | 841.000000 | 833.000000 | 1000 | 1000 | 1000 | 1000 | 1000 | 610 | 815 |
| unique | 10 | NaN | NaN | 3 | 5 | 5 | 5 | 5 | 3 | 4 |
| top | Alice | NaN | NaN | m | Finance | Assistant | 2020-02-20 | B | user@domain.com | 123-456-7890 |
| freq | 118 | NaN | NaN | 355 | 218 | 214 | 232 | 225 | 213 | 236 |
| mean | NaN | 35.802616 | 60216.086435 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| std | NaN | 8.551889 | 7127.032811 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | NaN | 25.000000 | 50000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | NaN | 30.000000 | 55000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | NaN | 35.000000 | 60000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | NaN | 40.000000 | 65000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| max | NaN | 50.000000 | 70000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# And we can save the results in a file
df.to_csv("new_file.csv", index=False, na_rep="NaN")