Machine Learning: Cleaning data

Tavish Aggarwal

June 18, 2020

In the previous post, we saw how to import data from different types of file sources using the various packages available in Python. If you haven't gone through Importing data using Python, I recommend reading it before stepping into this post. 

Once we have the data imported into a dataset, it's very important to understand it before proceeding further, starting with the attributes (columns) the dataset provides.

Whenever we obtain a new dataset, our first task is to do exploratory analysis to get a better understanding of the data.

Once we have enough knowledge about the dataset and are confident that the answers we expect can actually be obtained from it, we can proceed one step further.

There is a high chance that the data provided to us is not clean. What do we mean by unclean or dirty data? Below are some of the points we need to check in order to make our data clean:

  1. Check the data type of each column
  2. Check the data in each column. Do we need to split any columns?
  3. Check the max, min and median of the columns
  4. Check for duplication in the dataset
  5. Is reshaping the data required?
  6. Does the dataset contain unnecessary white spaces in text?
  7. Does the dataset contain null values?

In this post, we will be exploring the points mentioned above and will try to clean the dataset provided. As part of this post, we will be working with the TMDB 5000 Movies and Avocado datasets from Kaggle.

Note: I recommend downloading the above-mentioned datasets so that you can follow along and get hands-on practice side by side.

As discussed earlier, the first step is to explore the dataset. Below are a few of the commands we can use to explore a pandas DataFrame:

  1. head() and tail()
  2. columns
  3. shape
  4. info()
  5. describe()
  6. value_counts()
  7. median()

Let's take a deeper look at the above commands and see what information they can reveal, starting with head() and tail().

head() and tail()

The head() command by default shows the first 5 rows of the dataset, and the tail() command shows the last 5 rows.

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
print(tmdbDataSet.head())
print(tmdbDataSet.tail())

We can also limit head() and tail() to the number of rows we want to see. Suppose I want to see only the first and the last row of the dataset; I can call head() and tail() with an argument as shown below:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
print(tmdbDataSet.head(1))
print(tmdbDataSet.tail(1))

columns

We can see which columns our dataset has. Code to view the columns of a dataset:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
print(tmdbDataSet.columns)

shape

We can also see the shape of the data, i.e. the number of rows and columns the dataset has.

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
print(tmdbDataSet.shape)

The output is in the form of (number of rows, number of columns).

info()

The info() method describes the data type of each column in our dataset. It also tells us how many non-null values each column has.

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
print(tmdbDataSet.info())


# Observation

# The output shows that the homepage column has null values for many rows. 
# There are a total of 3091 rows (4803 - 1712) for which the homepage is null.

# Also, there is one movie which doesn't have a release date,
# and two of the rows don't have a runtime.

# And 844 rows don't have tagline values.

# These are important observations at this point. 
# We have taken good steps towards knowing our dataset.

Another way to get the data types of the columns is the dtypes attribute.

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
print(tmdbDataSet.dtypes)

We can also change the data type of a column. To change the data type, refer to the code shown below:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
tmdbDataSet['status'] = tmdbDataSet['status'].astype('category')

# We can also convert it back to string 
tmdbDataSet['status'] = tmdbDataSet['status'].astype(str)

To convert the data type of a column to numeric, pandas provides the to_numeric function.

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
tmdbDataSet['vote_count'] = pd.to_numeric(tmdbDataSet['vote_count'], errors='coerce')

Notice that the errors parameter is passed to the function. With errors='coerce', any value which pandas is not able to convert to numeric is replaced with NaN. If we don't specify it, the function will raise an error for such values.
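
As a minimal illustration (using a made-up series, not the TMDB data), a value that cannot be parsed becomes NaN when errors='coerce' is used:

import pandas as pd

# hypothetical series with one value that cannot be parsed as a number
s = pd.Series(['10', '20', 'not a number'])
print(pd.to_numeric(s, errors='coerce'))  # the unparseable value becomes NaN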

describe()

The describe function in pandas returns the following statistics:

  1. count - the number of non-null entries in a numeric column, e.g. df['col'].count()
  2. mean - the average of the series ignoring null entries, e.g. df['col'].mean()
  3. std - the standard deviation of the non-null values, e.g. df.std()
  4. min - the smallest observation, e.g. df.min()
  5. 25% - the 25th percentile (first quartile) of the dataset
  6. 50% - the 50th percentile of the dataset. Also, note that the 50th percentile is equal to the median
  7. 75% - the 75th percentile (third quartile) of the dataset
  8. max - the largest observation, e.g. df.max()

By default, describe only summarises the numeric columns. Look at the example shown below:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
print(tmdbDataSet.describe())


# Observations

# describe only summarises numeric columns, and from the info() method we know that 
# there are only 7 columns whose values are integer/float. 
# So the output of the describe method has 7 columns.

We can also use the describe method on non-numeric columns. To do this, we need to call describe on that specific column.

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
print(tmdbDataSet.status.describe())

# Observations
# What we observe is that out of 4803 movies, 4795 movies have the status 'Released'. 
# And there are a total of three unique values in the status column

Also, note that we can calculate percentiles of the data manually using the quantile function. Using this we can calculate any percentile of the data. Refer to the script shown below to calculate the 55th percentile of the dataset.

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

print(tmdbDataSet.vote_count.quantile([0.55]))


# Observations
# It may seem a bit confusing how the quantile is calculated.
# The easiest way to see it is shown below.
# We know our dataset has 4803 records,
# and 55% of 4803 is roughly 2641.
# As we are calculating the quantile of vote_count, we sort by that column
# and look at the first 2641 rows.

tempDataSet = tmdbDataSet.sort_values(['vote_count']).head(2641)
print(tempDataSet.vote_count.tail(1))

# You will notice that the output from both print commands will (approximately) match

value_counts()

The value_counts method returns the frequency counts for each unique value in a column.

This method also has an optional parameter called dropna, which is True by default. When it is True, we don't get the count of missing values, if any. But if it is set to False, we get the count of missing values along with the frequency counts.

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
print(tmdbDataSet.release_date.value_counts(dropna=False))

# Observations

# This is again an important observation. 
# There are 10 movies in the dataset which were released on '2006-01-01'.
# Also, there is one movie which doesn't have a release date, 
# which we already know from the info() method.


print(tmdbDataSet.status.value_counts())

# Observations

# Previously, we observed that there are 4795 movies whose status is 'Released'.
# Also, status doesn't have any null values, so we haven't used the dropna flag here.
# There are three different unique values, which we saw with the describe method.
# The above command revealed the different values in the status column and their counts.

median()

The median of the dataset can be calculated with the following command:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
print(tmdbDataSet.median())

The command returns the median of each numeric column. We can calculate the median of integer/float columns only; if we try to calculate the median of a non-numeric column, the command will return an error, which makes sense.
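
For example, the median of a single numeric column such as runtime can be calculated as shown below:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
print(tmdbDataSet['runtime'].median())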

Now that we have an understanding of the data we are dealing with, it's time to explore more and clean our dataset.

There is a paper written by Hadley Wickham ('Tidy Data') where he states the rules a dataset has to follow in order to be called a clean (tidy) dataset. These are:

  1. Columns represent separate variables
  2. Each row represents a separate observation
  3. Each type of observational unit forms a table

To meet the above rules, we often have to reshape our data. Pandas provides the melt function, which helps us reshape our data the way we want. Let's look at an example:

Melting data

Melting data is the process of turning columns of your data into rows. To melt the data, pandas provides the melt() function. There are two parameters of the melt function which we should know. These are:

  • id_vars - It represents the columns of the data you do not want to melt (i.e. don't want to reshape)
  • value_vars - It represents the columns that we wish to melt into rows.

Let's see how melting works in action:

import pandas as pd

avocadoDataSet = pd.read_csv('avocado.csv')
del avocadoDataSet['Unnamed: 0'] # removing unwanted column from dataset

avocadoDataSet_melt_4770 = pd.melt(avocadoDataSet, id_vars=['Date', 'AveragePrice',
    'Total Volume', '4046', '4225', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags',
    'type', 'year', 'region'], var_name='avocados_variety', value_name='4770')

avocadoDataSet_melt_4225 = pd.melt(avocadoDataSet_melt_4770, id_vars=['Date',
    'AveragePrice', 'Total Volume', '4046', 'Total Bags', 'Small Bags', 'Large Bags',
    'XLarge Bags', 'type', 'year', 'region'], var_name='avocados_variety', value_name='4225')

avocadoDataSet_melt_4046 = pd.melt(avocadoDataSet_melt_4225, id_vars=['Date',
    'AveragePrice', 'Total Volume', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags',
    'type', 'year', 'region'], var_name='avocados_variety', value_name='4046')

print(avocadoDataSet_melt_4046.head())


# Observations
# Melting the TMDB dataset is out of scope, so we have taken a new dataset from Kaggle.
# In this dataset we have three different varieties of avocado: 4770, 4225 and 4046.
# This is not the best way to melt multiple columns,
# but the above example is useful to understand how melt works.
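
For reference, a simpler sketch is to melt the three variety columns in a single call by passing them as value_vars (the names avocadoDataSet_melted and volume below are just illustrative choices):

import pandas as pd

avocadoDataSet = pd.read_csv('avocado.csv')
del avocadoDataSet['Unnamed: 0'] # removing unwanted column from dataset

avocadoDataSet_melted = pd.melt(avocadoDataSet, id_vars=['Date', 'AveragePrice',
    'Total Volume', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags',
    'type', 'year', 'region'], value_vars=['4046', '4225', '4770'],
    var_name='avocados_variety', value_name='volume')

print(avocadoDataSet_melted.head())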

This looks great. But sometimes we have a dataset where we need to turn rows into columns. What do we do then? The answer is pivoting the dataset.

Pivoting data

Pivoting data is the opposite of melting the dataset. Pandas provides the pivot_table() method to pivot the dataset. There are three parameters of the pivot_table function which we should know. These are:

  1. index - It represents the columns that we don't want to pivot
  2. columns - It represents the column that we want to pivot
  3. values - It represents the values to be used when the column is pivoted

Let's see how pivoting works in action:

# After melting the dataset, here we will see how pivoting works with the dataset
# Pivoting of dataset
import numpy as np

avocadoDataSet_melt_4046['4046'] = pd.to_numeric(avocadoDataSet_melt_4046['4046'],
                                                 errors='coerce')

avocadoDataSet_pivot = avocadoDataSet_melt_4046.pivot_table(index=['Date'],
    columns='type', values='4046', aggfunc=np.mean)

print(avocadoDataSet_pivot.head())


# Observations
# We saw that the type column has only two distinct values.
# So what if we want to see the mean of an avocado variety by type?
# We can do this by pivoting the dataset as we did above.

NOTE: 

  1. The default aggregation function used by .pivot_table() is the mean (np.mean). So we could have run the above script without specifying the aggfunc parameter of the pivot_table() function.
  2. Sometimes you may observe that the index of the dataset after pivoting is not what you want. To correct it, you can reset the index using the command shown below:
avocadoDataSet_pivot = avocadoDataSet_pivot.reset_index()

We can also pivot the dataset without specifying the values attribute. In this case, every remaining column in the dataset will be split by the column we specify in the columns argument.

# Pivoting of dataset without specifying values
avocadoDataSet_pivot = avocadoDataSet_melt_4046.pivot_table(index=['Date'], 
columns='type')

print(avocadoDataSet_pivot.head())

# Observations
# Every column of the dataset now has two sub-columns,
# one for each value of type, as specified in the columns parameter

We can also pass the margins property to the pivot_table function, which adds an 'All' row and column showing the aggregate (the mean by default) of every individual column as well.

# Pivoting of dataset with margin
avocadoDataSet_pivot_margin = avocadoDataSet_melt_4046.pivot_table(index=['Date'], 
columns='type', margins=True)

print(avocadoDataSet_pivot_margin.tail())

Apart from the techniques we have discussed above, there are some other techniques to clean our dataset. These include:

  1. Splitting columns
  2. Concatenating data or files
  3. Merging data
  4. Using regular expressions and apply function
  5. Replacing duplicate/missing data

Let's discuss the above concepts in detail and understand how they help to clean our datasets.

Stacking and unstacking dataset

Stacking and unstacking are other ways to rearrange data into the form we want. We will understand them more clearly with the help of examples.

Stacking dataset

Stacking a DataFrame means moving the innermost column index to become the innermost row index. Consider an example as shown below:

# Stacking datasets
avocadoDataSet_stacked = avocadoDataSet_pivot.stack() 
# Or avocadoDataSet_pivot.stack(level=1)

display(avocadoDataSet_stacked.head())

The above example describes the technique to stack the dataset. Here we have used the same dataset that we produced in our pivoting example. 

In our avocadoDataSet_pivot dataset, each and every column is further divided into conventional and organic.

We can also pass an argument called 'level', which decides the column index level that we want to move to the row index.

Unstacking dataset

Unstacking, on the other hand, is the opposite of stacking. Unstacking a DataFrame means moving the innermost row index to become the innermost column index. Consider the example shown below:

# Unstacking the stacked dataset
display(avocadoDataSet_stacked.unstack())

In an example shown above, we are unstacking the stacked dataset to get back the original pivoted dataset.

We can also pass an argument called 'level', which decides the row index level that we want to move to the column index.

NOTE: The default value of level for both stacking and unstacking is the innermost column or row index level, respectively.
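
As a small sketch, assuming the avocadoDataSet_pivot and avocadoDataSet_stacked variables from the examples above are still in scope (the pivoted columns have two levels: the original column names and the type values), the level argument controls which level is moved:

# Moving the outer column level (the original column names) to the row index
print(avocadoDataSet_pivot.stack(level=0).head())

# Moving the innermost row index back to the columns (same as the default)
print(avocadoDataSet_stacked.unstack(level=-1).head())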

Splitting columns

Sometimes a single column in the dataset may contain more information than expected, for example the first name and last name in the same column. Let's see how we can split information from one column into multiple columns:

import pandas as pd

avocadoDataSet = pd.read_csv('avocado.csv')


month = {1: 'JAN', 2: 'FEB', 3: 'MAR', 4: 'APR',5: 'MAY', 6: 'JUNE',
                7: 'JULY', 8: 'AUG', 9: 'SEP', 10: 'OCT', 11: 'NOV', 12: 'DEC' }

avocadoDataSet['Month'] = avocadoDataSet.Date.str.split('-').str.get(1)
avocadoDataSet['Month'] = pd.to_numeric(avocadoDataSet['Month'], errors='coerce')
avocadoDataSet['Month'] = avocadoDataSet['Month'].map(month)

print(avocadoDataSet.head())


# Observations
# We have extracted the Month column from Date using the split function

The above example shows the approach to extract the month from the Date column.
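
Similarly, if a hypothetical column held both a first and a last name, str.split with expand=True would split it into two new columns. A minimal sketch (the people DataFrame and column names below are made up for illustration):

import pandas as pd

# hypothetical dataframe with first and last name stored in a single column
people = pd.DataFrame({'full_name': ['Jane Doe', 'John Smith']})
people[['first_name', 'last_name']] = people['full_name'].str.split(' ', expand=True)

print(people.head())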

Concatenating data or files

We may get our data from multiple files. Possible reasons for this may be:

  1. A huge file needed to be divided into parts
  2. Our source generates one data file per day

Pandas provides the concat function, using which we can concatenate multiple datasets into a single dataset.

import pandas as pd

# dataSet_1, dataSet_2 and dataSet_3 are DataFrames loaded earlier
row_concat = pd.concat([dataSet_1, dataSet_2, dataSet_3])

We can also do column-wise concatenation of our datasets. For this, we again use the concat function, but with an additional argument, 'axis'.

import pandas as pd

col_concat = pd.concat([dataSet_1, dataSet_2], axis=1)

NOTE: The default value of axis is 0, which means row-wise concatenation.

The above technique is useful when we have a limited number of datasets to concatenate. But what if we have, say, 1000 files that we need to concatenate? 

To solve this, we can use a package called glob. The glob package has a function called glob that takes a pattern as an argument and returns a list of the files in the working directory matching that pattern. Let's see an example:

import glob
import pandas as pd

pattern = '*.csv'
csv_files = glob.glob(pattern)

# read every matching file and concatenate them into a single dataset
frames = [pd.read_csv(csv_file) for csv_file in csv_files]
dataset = pd.concat(frames)

Merging data

Merging two datasets into one is similar to joining two tables in a relational database. We can join datasets based on columns of the tables. 

o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')
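
Here site and visited are just placeholder DataFrames. A minimal self-contained sketch with made-up data (the values below are purely illustrative) looks like this:

import pandas as pd

# hypothetical dataframes sharing a common key
site = pd.DataFrame({'name': ['DR-1', 'DR-3'], 'lat': [-49.85, -47.15]})
visited = pd.DataFrame({'site': ['DR-1', 'DR-1', 'DR-3'],
                        'dated': ['1927-02-08', '1927-02-10', '1939-01-07']})

o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')
print(o2o)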

Using regular expressions and apply function

Using regular expressions, we can test the data for particular patterns. Python provides the 're' package, using which we can perform regular-expression-related tasks. Let's see how we can do this with the help of an example:

import re

prog = re.compile(r'\d{3}-\d{3}-\d{4}')
result = prog.match('123-456-7890')
print(bool(result))  # True, the string matches the pattern

result = prog.match('1123-456-7890')
print(bool(result))  # False, the string does not match the pattern

The 're' package also provides the findall function, using which we can find all matches of a pattern in a string.

matches = re.findall(r'\d+', 'Out of 3 houses, 2 houses have 4 people staying.')
print(matches)

# Observation
# The above code finds all the numbers in the string

We have already learned how to write functions in Python for Data Science - part 2. Here we will see how we can apply a function to column values, row by row.

# Encode the type column: conventional -> 1, organic -> 0
def encode_avocado_type(value):
    if value == 'conventional':
        return 1
    elif value == 'organic':
        return 0

avocadoDataSet['type_coded'] = avocadoDataSet.type.apply(encode_avocado_type)

print(avocadoDataSet.head())

We have also read about lambda functions in Python for Data Science - part 2. Here we will see how we can use a lambda function (the shorthand notation for a function) to manipulate the dataset.

import pandas as pd
import re

avocadoDataSet = pd.read_csv('avocado.csv')
del avocadoDataSet['Unnamed: 0'] # removing unwanted column from dataset

avocadoDataSet['Date_replace'] = avocadoDataSet['Date'].apply(lambda x: x.replace('-', '/'))

avocadoDataSet['Date_find'] = avocadoDataSet['Date'].apply(lambda x: re.findall(r'\d', x))

print(avocadoDataSet.head())

Replacing duplicate/missing data

If we have duplicate or missing values in the dataset, they will cause issues in the results of our analysis, and real-world data very often has this problem. Duplicate data also consumes additional space and memory. So it's really important to address this problem. 

The drop_duplicates function is used to drop the duplicate rows in our datasets.

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
avocadoDataSet = pd.read_csv('avocado.csv')
del avocadoDataSet['Unnamed: 0'] # removing unwanted column from dataset

print(tmdbDataSet.info())
print(avocadoDataSet.info())

avocadoDataSet = avocadoDataSet.drop_duplicates()
tmdbDataSet = tmdbDataSet.drop_duplicates()

print(tmdbDataSet.info())
print(avocadoDataSet.info())

# Observations
# Since there are no duplicate rows in either of the datasets,
# we don't see any change in the number of rows returned

We can also drop rows from the dataset where column values are missing. To do so, we can use the dropna function. Consider the example shown below:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

print(tmdbDataSet.shape)

# remove a row if any column has a null value
print(tmdbDataSet.dropna(how='any').shape)

# remove a row only if all columns have null values
print(tmdbDataSet.dropna(how='all').shape)

# keep only rows that have at least 19 non-null values
print(tmdbDataSet.dropna(thresh=19).info())

# drop rows where a specific column has a null value
print(tmdbDataSet.dropna(subset=['runtime']).shape)

Handling missing data in the dataset is a crucial problem. Our analysis should start by checking how much missing data we have. There are many ways to handle missing data, depending on our needs:

  1. We can remove rows with missing values from the dataset. But we need to be careful: suppose we have 100 rows of data and 40 rows have missing values, is removing those rows still the correct option?
  2. We can fill missing values in a column with the mean of that column.
  3. We can use a forward-fill or backward-fill strategy.

Let's look at the code to fill missing values with the mean of the column.

runtime_mean = tmdbDataSet.runtime.mean()
tmdbDataSet['runtime'] = tmdbDataSet['runtime'].fillna(runtime_mean)

print(tmdbDataSet.info())

# Observations
# This fills the two rows for which the runtime value was null

Forward fill and backward fill

We have seen how to replace missing column values with the mean of the column. Now, let's see how to use forward fill and backward fill to fill the missing values in our dataset.

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

# Forward fill and backward fill
tmdbDataSet['homepage_ffill'] = tmdbDataSet['homepage'].fillna(method='ffill')
tmdbDataSet['homepage_bfill'] = tmdbDataSet['homepage'].fillna(method='bfill')
print(tmdbDataSet[tmdbDataSet['homepage_ffill'].isnull()])

Using forward fill, a null value in a column is filled with the value from the row above it.

Whereas, using backward fill, a null value is filled with the value from the row below it.
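
A tiny sketch on a made-up series makes the difference easy to see:

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.fillna(method='ffill'))  # the NaN values become 1.0 (the value above them)
print(s.fillna(method='bfill'))  # the NaN values become 4.0 (the value below them)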

Please note that choosing a strategy to replace missing values depends on the dataset you are analyzing. No strategy is 100% correct or wrong.

Cleaning the dataset is one of the most important processes in the entire analysis, and data analysts spend most of their time in this stage. In this post, I have tried to cover the common techniques that are used to clean a dirty dataset. Once we are done with cleaning the dataset, the next step is to look for the answers we want from it.

We have also created a Jupyter Notebook where you can practice and enhance the dataset, making it cleaner according to your goals. I hope you have learned something new from this post. Keep supporting and subscribe for the latest updates.

    Author Info

    Tavish Aggarwal

    Website: http://tavishaggarwal.com

    Living in Hyderabad and working as a research-based Data Scientist with a specialization to improve the major key performance business indicators in the area of sales, marketing, logistics, and plant productions. He is an innovative team leader with data wrangling out-of-the-box capabilities such as outlier treatment, data discovery, data transformation with a focus on yielding high-quality results.