3 ways to evaluate and improve machine learning models

Training performance evaluation, prediction performance evaluation and baseline modeling can refine machine learning models. Learn how they work together to improve predictions.

Arcitura Education, Guest Contributor

Published: 27 Jul 2021

This article is excerpted from the course "Fundamental Machine Learning," part of the Machine Learning Specialist certification program from Arcitura Education. It is the twelfth part of the 13-part series, "Using machine learning algorithms, practices and patterns."

This article provides a set of machine learning techniques dedicated to measuring the effectiveness of trained models. These model-evaluation techniques are crucial in machine learning model development: Their application helps to determine how well a model performs. As explained in Part 4, these techniques are documented in a standard pattern profile format.

Training performance evaluation: Overview

Requirement: How can confidence be established in the efficacy of a machine learning model at training time?
Problem: A trained machine learning model may make predictions that are correct or incorrect at random, or may make more incorrect predictions than correct ones. Deploying such a model can seriously jeopardize the effectiveness and reliability of a system. Alternatively, different models can be developed to solve a particular category of machine learning problem; without knowing which model works best, a suboptimal model may be chosen for the production system, resulting in below-par system performance.
Solution: The model's performance is quantified via established model evaluation techniques that make it possible to estimate the performance of a single model or to compare different models.
Application: Depending on the type of machine learning problem (classification, clustering or regression), various statistics and visualizations are generated, including accuracy, confusion matrix, receiver operating characteristic (ROC) curve, cluster distortion and mean squared error (MSE).

Training performance evaluation: Explained

Problem

When solving machine learning problems, simply training a model with a problem-specific machine learning algorithm does not guarantee that the resulting model fully captures the underlying concept hidden in the training data, nor that the optimal parameter values were chosen for model training. Failing to test a model's performance means an underperforming model could be deployed on the production system, resulting in incorrect predictions. Choosing one model from the many available options based on intuition alone is risky. (See Figure 1.)

Figure 1. A training data set is prepared (1). It is then used to train a binary classifier (2, 3). However, it is unclear whether the optimal set of values is being used for the model's parameters (4).

Solution

By generating different metrics, the efficacy of the model can be assessed. These metrics reveal how well the model fits the data on which it was trained. With this empirical evidence, the model can be improved iteratively, or different models can be compared to pick the most effective one.

Application

The machine learning software package used for model training normally provides a score or evaluate function to generate various model evaluation metrics. For regression, this includes mean squared error (MSE) and R squared (Figure 2).
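As a rough illustration of the two regression metrics named above, the following minimal sketch computes MSE and R squared by hand rather than through any particular package's score or evaluate function. The function names and the sample values are assumptions made up for this example.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus the ratio of residual variation to total variation around the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Hypothetical target values and model outputs, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MSE: {mse:.3f}, R squared: {r2:.3f}")
```

A lower MSE and an R squared closer to 1 both indicate a closer fit to the data being scored.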

Classification metrics include the following:

Accuracy is the proportion of correctly identified instances out of all identified instances.
Error rate, also known as the misclassification rate, is the proportion of incorrectly identified instances out of all identified instances.
Sensitivity, also known as the true positive rate (TPR), is the proportion of actual positive instances that are correctly identified as positive.
Specificity, also known as the true negative rate (TNR), is the proportion of actual negative instances that are correctly identified as negative.
Together, sensitivity and specificity show how reliably a classifier separates the positive class from the negative class.
Recall is the same as sensitivity. In document retrieval terms, it is the proportion of correctly classified retrieved documents out of the set of all documents belonging to the class of interest.
Precision is the proportion of correctly classified retrieved documents out of the set of all retrieved documents; in other words, the proportion of positive predictions that are actually positive.
F-score, also called F-measure and F1 score, combines precision and recall into a single measure by taking their harmonic mean.
The confusion matrix, also known as a contingency table, is a cross-tabulation that shows a summary of the predicted class values against the actual class values. Columns in a confusion matrix contain the number of instances belonging to the predicted classes, and rows contain the number of instances belonging to the actual classes.

Figure 2. A confusion matrix summarizing the results of a classification task that involves classifying 100 animals as mammals.
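The classification metrics listed above can all be derived from the four cells of a binary confusion matrix. The sketch below shows one way to do that, assuming rows hold actual classes and columns hold predicted classes. The counts are hypothetical and are not taken from Figure 2; the function name is likewise made up for this example.

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive common metrics from the four cells of a binary confusion matrix."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # same as sensitivity / TPR
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),  # TNR
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for 100 instances (NOT the values from Figure 2).
tp, fn, fp, tn = 40, 5, 10, 45

# Rows are actual classes, columns are predicted classes.
confusion_matrix = [
    [tp, fn],  # actual positive: predicted positive, predicted negative
    [fp, tn],  # actual negative: predicted positive, predicted negative
]

metrics = classification_metrics(tp, fn, fp, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that accuracy and error rate always sum to 1, so reporting one of them is usually enough.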

A receiver operating characteristic (ROC) curve is a visual evaluation technique used to compare the performance of different models, or of different variations of the same model, generally obtained by trying out different values of the model parameters. The X-axis plots the false positive rate (FPR), which equals 1 - specificity, while the Y-axis plots the TPR, which equals sensitivity, obtained from different training runs of a model. The graph helps to find the combination of model parameters that results in a balance between the sensitivity and specificity of a model. On the graph, this is the point where the curve starts to plateau: beyond it, further gains in TPR shrink while the FPR keeps increasing (Figure 3).

Figure 3. The diagonal red dashed line represents a baseline model that predicts at random, hence the same values of TPR and FPR. The straight green dashed line on the vertical and horizontal axis represents an ideal model with TPR equal to 1 and FPR equal to 0. It maintains a TPR of 1 before it starts to incorrectly predict the negative labels as positive. The blue solid curve represents a good model with a good increase in TPR before FPR starts to pick up.
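To make the axes of a ROC curve concrete, the following sketch computes the (FPR, TPR) points that would be plotted. It rests on an assumption that differs from the text: instead of retraining with different parameter values, it sweeps the decision threshold of a single model that outputs scores, which is a common way to trace a ROC curve. Each resulting point plays the same role as one training run. The labels, scores and function name are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per decision threshold, from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical actual labels (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A random baseline would produce points near the diagonal where FPR equals TPR, while a stronger model reaches a high TPR while the FPR is still low.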

With clustering, a cluster's degree of homogeneity can be measured by calculating the cluster's distortion. A cluster's distortion is the sum of the squared distances between each data point in the cluster and the cluster's centroid. The lower the distortion, the higher the homogeneity, and vice versa.
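The distortion calculation can be sketched in a few lines. The example below assumes the centroid is the mean of the cluster's points and uses squared Euclidean distance; the two sample clusters and the function names are made up for illustration.

```python
def centroid(points):
    """Mean of the points along each dimension."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def distortion(points):
    """Sum of squared Euclidean distances from each point to the cluster centroid."""
    center = centroid(points)
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, center))
        for point in points
    )


# Two hypothetical clusters: one tightly packed, one spread out.
tight_cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
loose_cluster = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]

print("tight:", distortion(tight_cluster))
print("loose:", distortion(loose_cluster))
```

The tightly packed cluster yields the smaller distortion, which corresponds to the higher homogeneity.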

The training performance evaluation and prediction performance evaluation patterns are normally applied together so that model performance can also be evaluated on unseen data. Applying this pattern together with the baseline modeling pattern brings a further benefit: With a reference model available, different models can easily be compared against each other as well as against the minimum acceptable performance (Figure 4).

Figure 4. A training data set is prepared (1). It is then used to train a binary classifier (2, 3). Various performance metrics are generated for each training run (4). Based on the results, the parameter values are retuned and the model is retrained (5, 6). The metrics are evaluated again, and the parameter values are once again retuned and the model is trained for a third time (7, 8, 9). After the third training run, the model is evaluated for one last time (10). After training the model three times, the performance of the model is satisfactory and the parameters used in the third training run are selected as the final values of the model parameters (11).
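The comparison against a reference model mentioned above can be sketched as follows. This example assumes one simple kind of baseline, a model that always predicts the most frequent class seen in training, and treats its accuracy as the minimum acceptable performance. The labels, the two candidate models' predictions and all names are hypothetical.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Proportion of predictions that match the actual labels."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


def majority_class(train_labels):
    """The most frequent label in the training data (assumed baseline strategy)."""
    return Counter(train_labels).most_common(1)[0][0]


# Hypothetical labels and candidate-model predictions, for illustration only.
train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_labels = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
candidate_predictions = {
    "model_a": [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    "model_b": [0, 0, 0, 1, 1, 0, 1, 0, 1, 0],
}

# The baseline predicts the majority class for every test instance.
baseline_label = majority_class(train_labels)
baseline_accuracy = accuracy(test_labels, [baseline_label] * len(test_labels))

results = {name: accuracy(test_labels, preds)
           for name, preds in candidate_predictions.items()}
acceptable = [name for name, score in results.items() if score > baseline_accuracy]

print(f"baseline: {baseline_accuracy:.2f}")
print(results)
print("models beating the baseline:", acceptable)
```

In this made-up case only one candidate improves on the baseline, which is the kind of signal a reference model is meant to provide.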

Prediction performance evaluation: Overview

Requirement: How can confidence be established that a model's performance will not drop when it is put into production and will remain on par with its training-time performance?

Problem: A model's performance, as reported at training time, may suggest a high-performing model. However, when deployed in a production environment, the same model may not perform as well as the training-time performance metrics suggest.

Solution: Rather than training the model on the entire available data set, some parts of the data set are held back to be used for evaluating the model before deploying it in a production environment.

Application: Techniques such as hold-out and cross-validation are applied to divide the available data set into subsets so that there is always one subset of data that the model has not seen before. That subset is used to evaluate the model's performance on unseen data, thereby simulating production environment data.
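The two splitting techniques named above can be sketched with the standard library alone. The example below is a minimal illustration, not a replacement for the utilities most machine learning packages provide. The function names, the 20% test fraction, the fixed seed and the choice of five folds are all assumptions made for this example.

```python
import random


def hold_out_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold back a fraction of it for evaluation."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]  # (train, test)


def k_fold_splits(data, k=5):
    """Yield (train, test) pairs in which each fold serves as the test set once."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, folds[i]


# Hypothetical data set: ten record identifiers.
data = list(range(10))

train, test = hold_out_split(data)
print("hold-out:", len(train), "training records,", len(test), "held back")

for fold_number, (fold_train, fold_test) in enumerate(k_fold_splits(data), start=1):
    print(f"fold {fold_number}: test on {fold_test}")
```

In both cases the model would be trained only on the training portion and scored on the portion it has not seen.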