Improving the Supervised Learning Model using Python
December 13, 2023
Typically in Machine Learning, we generate a model and, once it is generated, use multiple techniques to measure its accuracy. But a question might be coming to your mind: "The accuracy of the model doesn't match my expectations. What can I do to improve it?"
A valid thought. In this post, we will look at a few best practices for generating models. In other words, we will look at techniques to tune the model.
Let's get started. We will cover the following techniques in detail:
- Cleaning the dataset
- Categorical Data
- Normalizing Data
Cleaning the dataset
This is the most crucial step in the entire process of generating Data Science models. If you don't have a clean dataset, for example:
- the dataset has missing values
- the dataset has duplicate records, etc.
then your accuracy will probably suffer a lot.
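Before building any model, it helps to check how dirty the dataset actually is. Here is a minimal sketch using pandas; the DataFrame df and its values are made up for illustration:
import pandas as pd
import numpy as np
# df is a hypothetical DataFrame made up for illustration
df = pd.DataFrame({'age': [25, np.nan, 30, 30], 'score': [88, 92, np.nan, np.nan]})
print(df.isna().sum())     # count missing values per column
df = df.drop_duplicates()  # remove duplicate records
The remaining missing values can then be imputed, as shown next.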
We can use the SimpleImputer transformer provided by sklearn (the successor to the old Imputer) to clean the dataset. Let's look at the example shown below:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np
# Pipeline: impute missing values, then classify with an SVM
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ('SVM', SVC())]
pipeline = Pipeline(steps)
# X and y are assumed to hold the features and labels of your dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
In the post Machine Learning: Cleaning data, I have explained in detail the most common techniques used by data scientists to clean a dataset. I recommend you go through it for a better understanding and to get into the right mindset.
Categorical Data
There is a high chance that our dataset has a few columns containing categorical data. Encoding such columns numerically can have a drastic effect on the accuracy of the model. Consider an example where we have a student_status column with two values, 'PASSED' and 'FAILED'.
Since the column values are text, scikit-learn and many other packages cannot deal with them directly. So we need to convert the column to integers to make the data usable for modelling; here we can encode 'PASSED' as 1 and 'FAILED' as 0. Consider the example shown below:
import pandas as pd
# df is assumed to hold the student data, including the text column student_status
df_students = pd.get_dummies(df)
df_students = pd.get_dummies(df, drop_first=True)  # drop_first removes the redundant first dummy column
print(df_students.columns)
You can observe in the above example that we are using the get_dummies function provided by pandas to convert text columns into dummy (indicator) columns. You might also have noticed that the new column names are created in the format columnName_categoryName.
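To make that naming convention concrete, here is a minimal, self-contained sketch with a hypothetical student_status column:
import pandas as pd
# hypothetical data for illustration
df = pd.DataFrame({'student_status': ['PASSED', 'FAILED', 'PASSED']})
print(pd.get_dummies(df).columns)                   # student_status_FAILED, student_status_PASSED
print(pd.get_dummies(df, drop_first=True).columns)  # student_status_PASSED only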
Once the categorical columns are encoded, you can continue with the process of generating the model on the new dataset. There is a good chance the model's performance will improve.
Normalizing Data
There are algorithms, like knn (the k-nearest neighbors algorithm), which use distance to make decisions. What if our features vary a lot in scale (i.e. the dataset has high standard deviations)? Can we still expect accurate models?
We need to normalize our data so that features with large scales don't dominate the distance calculations. There are various techniques to normalize the data. These are as shown below:
- Standardization: subtract the mean and divide by the standard deviation
- Min-max scaling: subtract the minimum and divide by the range (see the sketch after this list)
- Scaling the data to a range between -1 and +1 (also sketched below)
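Standardization is covered in detail in the next section. For the other two techniques, here is a minimal sketch using scikit-learn's MinMaxScaler and MaxAbsScaler; the sample matrix is made up for illustration:
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler
import numpy as np
X = np.array([[1.0, -200.0], [2.0, 0.0], [4.0, 400.0]])  # hypothetical features
print(MinMaxScaler().fit_transform(X))  # each feature rescaled to [0, 1]
print(MaxAbsScaler().fit_transform(X))  # each feature rescaled to [-1, 1]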
Standardization
We can use the scale function from sklearn.preprocessing to standardize the values. Consider the example shown below:
from sklearn.preprocessing import scale
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
irisData = load_iris()
X = irisData.data
y = irisData.target
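# standardize each feature to zero mean and unit variance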
X_scaled = scale(X)
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(
    X_scaled, y, test_size=0.5, random_state=24)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state=24)
knn_scaled = KNeighborsClassifier().fit(X_train_scaled, y_train_scaled)
knn = KNeighborsClassifier().fit(X_train, y_train)
print(knn_scaled.score(X_test_scaled, y_test_scaled))
print(knn.score(X_test, y_test))
print("Standard Deviation of features before scaling: ", np.std(X))
print("Standard Deviation of features after scaling: ", np.std(X_scaled))
In the example shown above, to show the difference between scaled and unscaled data, I have:
- computed the knn accuracy score for both scaled and unscaled data, and
- computed the standard deviation for both scaled and unscaled data.
Please run the code to see for yourself the difference between scaled and unscaled data.
We can also create a pipeline and perform operations to standardize the data. Consider an example shown below:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
irisData = load_iris()
X = irisData.data
y = irisData.target
steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state=24)
knn_scaled = pipeline.fit(X_train, y_train)
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
print('Accuracy after scaling: ',knn_scaled.score(X_test, y_test))
print('Accuracy without scaling: ', knn_unscaled.score(X_test, y_test))
It is similar to the first example we saw earlier. Here, instead of scaling the data and then splitting it, we are generating a pipeline to scale the data. This has a subtle advantage: the scaler is fitted only on the training data, so no information from the test set leaks into the preprocessing step.
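Because scaling and classification are bundled into one estimator, the pipeline can also be passed straight to cross-validation utilities. Here is a minimal sketch, reusing the pipeline, X, and y defined above:
from sklearn.model_selection import cross_val_score
# each fold fits the scaler on its own training portion, so no test data leaks in
scores = cross_val_score(pipeline, X, y, cv=5)
print('Mean cross-validated accuracy: ', scores.mean())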
Summary
In this post, we have learned techniques that should be followed to generate a well-tuned model. We learned that cleaning the dataset, encoding categorical data, and normalizing the data can result in much better performance.
We should treat the mentioned techniques as must-haves for our dataset (where applicable). As a newbie, there is a high chance that your dataset is not following these practices and that you are scratching your head over how to improve model performance.
Are you cleaning, encoding, and normalizing your dataset before generating models? Please let me know in the comment section below. Happy Learning!