What is supervised learning?
Supervised learning is an approach to creating artificial intelligence (AI) where a computer algorithm is trained on input data that has been labeled for a particular output. The model is trained until it can detect the underlying patterns and relationships between the input data and the output labels, enabling it to yield accurate labeling results when presented with never-before-seen data.
In supervised learning, the aim is to make sense of data within the context of a specific question. Supervised learning is good at classification and regression problems, such as determining what category a news article belongs to or predicting the volume of sales for a given future date. Organizations can use supervised learning in processes like anomaly detection, fraud detection, image classification, risk assessment and spam filtering.
Unsupervised machine learning takes the opposite approach. There, the algorithm is presented with unlabeled data and is designed to detect patterns or similarities on its own, a process described in more detail below.
How does supervised learning work?
Like all machine learning algorithms, supervised learning is based on training. During its training phase, the system is fed labeled data sets, which tell the system which output is associated with each specific input value. The trained model is then presented with test data: data that has been labeled, but whose labels are not revealed to the algorithm. The aim of the test data is to measure how accurately the algorithm performs on data it has not seen before.
The basic steps in implementing supervised learning include the following:
- Determine the type of training data that will be used as a training set.
- Collect labeled training data.
- Divide the labeled data into training, validation and test data sets.
- Determine an algorithm to use for the machine learning model.
- Run the algorithm with the training data set.
- Evaluate the model's accuracy on the test data set. The more often its predictions match the held-back labels, the more accurate the model is.
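The steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the toy cat/dog measurements are invented, and the nearest-centroid classifier is a stand-in chosen for brevity, not a method the article prescribes.

```python
import random

def split_data(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle labeled examples and split them into train/validation/test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def train_centroids(train):
    """'Training' here just averages the feature vectors of each label."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, features):
    """Return the label whose centroid is closest to the input."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(predict(model, f) == y for f, y in examples) / len(examples)

# Hypothetical labeled data: [weight_kg, height_cm] -> "cat" or "dog".
rng = random.Random(42)
data = [([rng.gauss(4, 0.5), rng.gauss(25, 2)], "cat") for _ in range(50)]
data += [([rng.gauss(20, 3), rng.gauss(55, 5)], "dog") for _ in range(50)]

train, val, test = split_data(data)      # divide the labeled data
model = train_centroids(train)           # run the algorithm on the training set
print("validation accuracy:", accuracy(model, val))
print("test accuracy:", accuracy(model, test))
```

The validation set is typically used while comparing algorithms or settings, and the test set is kept aside for the final accuracy estimate.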
As an example, an algorithm could be trained to identify images of cats and dogs by being fed an ample amount of training data consisting of different labeled images of cats and dogs. This training data would be a subset of photos from a much larger data set of images. After training, the model should be capable of predicting whether an image shows a cat or a dog. Another set of images can then be run through the model to validate it.
In neural network algorithms, the supervised learning process is improved by constantly measuring the resulting outputs of the model and fine-tuning the system to get closer to its target accuracy. The level of accuracy obtainable depends on two things: the available labeled data and the algorithm that is used. In addition, the following factors affect the process:
- Training data must be balanced and cleaned. Garbage or duplicate data skews the AI's understanding -- hence, data scientists must be careful with the data the model is trained on.
- The diversity of the data determines how well the AI performs when presented with new cases; if there are not enough samples in the training data set, the model falters and fails to yield reliable answers.
- High accuracy, paradoxically, is not necessarily a good indication; it could also mean the model is suffering from overfitting -- i.e., it is overtuned to its particular training data set. Such a model might perform well in test scenarios but fail miserably when presented with real-world challenges. To guard against overfitting, it is important that the test data is different from the training data, so that the model cannot simply draw answers from its previous experience and its inference has to generalize.
- The algorithm, on the other hand, determines how that data can be put to use. For instance, deep learning models with billions of parameters can be trained on large data sets and reach unprecedented levels of accuracy, as demonstrated by OpenAI's GPT-3.
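The overfitting point can be made concrete with a deliberately extreme sketch. The data here is hypothetical (the number of exclamation marks in a message, labeled "ham" or "spam"), and both models are toy assumptions: one memorizes its training set, the other learns a single cut-off.

```python
def train_memorizer(train):
    """Overfit on purpose: store every training input with its label."""
    return {x: label for x, label in train}

def predict_memorizer(model, x):
    # An input never seen in training cannot be answered from memory.
    return model.get(x, "unknown")

def train_threshold(train):
    """A simpler model: one cut-off halfway between the two class means."""
    ham = [x for x, label in train if label == "ham"]
    spam = [x for x, label in train if label == "spam"]
    return (sum(ham) / len(ham) + sum(spam) / len(spam)) / 2

def predict_threshold(cutoff, x):
    return "spam" if x > cutoff else "ham"

def accuracy(predict, model, examples):
    return sum(predict(model, x) == label for x, label in examples) / len(examples)

train = [(0, "ham"), (1, "ham"), (2, "ham"), (6, "spam"), (7, "spam"), (9, "spam")]
test = [(3, "ham"), (5, "spam"), (8, "spam"), (10, "spam")]  # inputs not in train

memorizer = train_memorizer(train)
cutoff = train_threshold(train)

print(accuracy(predict_memorizer, memorizer, train))  # 1.0 on data it has seen
print(accuracy(predict_memorizer, memorizer, test))   # 0.0 on unseen data
print(accuracy(predict_threshold, cutoff, test))      # 1.0 on unseen data
```

Perfect accuracy on the training set says nothing here about performance on new inputs, which is why the held-out test data has to differ from the training data.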
Apart from neural networks, there are many other supervised learning algorithms. They are divided into two types according to the kind of result they generate: classification and regression.
A classification algorithm aims to sort inputs into a given number of categories -- or classes -- based on the labeled data it was trained on. Classification algorithms can be used for binary classifications, such as classifying an image as a dog or cat; filtering email into spam or nonspam; and categorizing customer feedback as positive or negative.
Examples of classification machine learning techniques include the following:
- A decision tree separates data points by repeatedly splitting them into groups of similar items, from a tree trunk to branches and then leaves, creating smaller categories within categories.
- Logistic regression analyzes independent variables to determine a binary outcome that falls into one of two categories.
- A random forest is a collection of decision trees that gathers results from multiple predictors. It is better at generalization but is less interpretable when compared to decision trees.
- A support vector machine finds a line, or a hyperplane in higher dimensions, that separates the data in a given set into specific classes during model training and maximizes the margin between the classes. These algorithms can be used to compare relative financial performance, value and investment gains.
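As one illustration of the techniques listed above, here is a minimal sketch of logistic regression with a single input variable, fitted by plain gradient descent. The data (hours studied versus pass or fail), the learning rate and the number of epochs are all assumptions picked for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight w and bias b by gradient descent on the average log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def probability(model, x):
    """Estimated probability that x belongs to class 1."""
    w, b = model
    return sigmoid(w * x + b)

def predict(model, x):
    """Binary outcome: class 1 if the estimated probability is at least 0.5."""
    return 1 if probability(model, x) >= 0.5 else 0

# Hypothetical labeled data: hours studied -> failed (0) or passed (1).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 5.0, 5.5]
passed = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

model = train_logistic(hours, passed)
print(predict(model, 1.0), predict(model, 5.0))
```

The model outputs a probability, and the 0.5 cut-off turns it into one of the two categories.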
Regression tasks are different, as they expect the model to produce a numerical relationship between the input and output data. Examples of regression models include predicting real estate prices based on ZIP code, predicting click rates in online ads in relation to time of day and determining how much customers would be willing to pay for a certain product based on their age.
Algorithms commonly used in supervised learning programs include the following:
- Bayesian logic analyzes statistical models, while incorporating previous knowledge about model parameters or the model itself.
- Linear regression predicts a variable's value based on another variable's value.
- Nonlinear regression is used when the output cannot be modeled as a linear function of the inputs. In this case, the data points share a nonlinear relationship; for example, the data might follow a curved trend.
- A regression tree is a decision tree whose target variable can take continuous values.
When choosing a supervised learning algorithm, there are a few things to consider. The first is the bias and variance of the algorithm, as there is a fine line between being flexible enough and too flexible. Another is the complexity of the model or function that the system is trying to learn. The heterogeneity, accuracy, redundancy and linearity of the data should also be analyzed before choosing an algorithm.
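For the simplest case in the list above, linear regression with one input, the least-squares line can be computed directly. The sketch below assumes made-up data (floor area in square meters versus price in thousands) that happens to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(line, x):
    slope, intercept = line
    return slope * x + intercept

# Hypothetical labeled data: floor area (m^2) -> price (thousands).
areas = [50, 70, 90, 110]
prices = [150, 210, 270, 330]

line = fit_line(areas, prices)
print(line)                 # (3.0, 0.0)
print(predict(line, 100))   # 300.0
```

Unlike a classifier, the model returns a number rather than a category.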
Supervised vs. unsupervised learning
The chief difference between unsupervised and supervised learning is in how the algorithm learns.
In unsupervised learning, the algorithm is given unlabeled data as a training set. Unlike supervised learning, there are no correct output values; the algorithm determines the patterns and similarities within the data, as opposed to relating it to some external measurement. In other words, algorithms can function freely to learn more about the data and discover interesting or unexpected findings that human beings weren't looking for.
Unsupervised learning is commonly used for clustering (the act of uncovering groups within data) and association (the act of finding rules that describe the data).
Because the machine learning model works on its own to discover patterns in data, the model might not make the same classifications as in supervised learning. In the cats-and-dogs example, the unsupervised learning model might mark the differences, similarities and patterns between cats and dogs but can't label them as cats or dogs.
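A minimal k-means sketch shows what this looks like in practice. The points are hypothetical two-dimensional measurements, and the evenly spaced initialization is a simplification chosen to keep the example deterministic.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Group points into k clusters; no labels are used anywhere."""
    # Simplified deterministic start: pick k points spread through the list.
    centroids = [points[i * len(points) // k] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assignment, centroids

# Hypothetical unlabeled measurements of six animals.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

assignment, centroids = kmeans(points, k=2)
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

The output contains only cluster numbers: the algorithm separates the two groups but has no way to name either one "cat" or "dog".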
Benefits and limitations
Supervised learning models have some advantages over the unsupervised approach, but they also have limitations. Benefits include the following:
- Supervised learning systems are more likely to make judgments that humans can relate to because humans have provided the basis for decisions.
- Performance criteria are optimized with the help of the human experience captured in the labels.
- It can perform classification and regression tasks.
- Users control the number of classes used in the training data.
- Models can make predictive outputs based on previous experience.
- The classes of objects are labeled in exact terms.
Limitations of supervised learning include the following:
- In the case of a retrieval-based method, supervised learning systems have trouble dealing with new information. If a system with categories for cats and dogs is presented with new data -- say, a zebra -- the new item would have to be incorrectly lumped into one category or the other. If the AI system were generative -- that is, unsupervised -- however, it might not know what the zebra is, but it would be able to recognize it as belonging to a separate category.
- Supervised learning also typically requires large amounts of correctly labeled data to reach acceptable performance levels, and such data may not always be available. Unsupervised learning does not suffer from this problem and can work with unlabeled data as well.
- Supervised models need time to be trained prior to use.
In cases where supervised learning is needed but there is a lack of quality data, semisupervised learning may be the appropriate learning method. This learning model resides between supervised learning and unsupervised; it accepts data that is partially labeled, i.e., most of the data lacks labels.
Semisupervised learning determines the correlations between the data points -- just like unsupervised learning -- and then uses the labeled data to mark those data points. Finally, the entire model is trained based on the newly applied labels.
Semisupervised learning can yield accurate results and is applicable to many real-world problems, where the small amount of labeled data would prevent supervised learning algorithms from functioning properly. As a rule of thumb, a data set with at least 25% labeled data is suitable for semisupervised learning.
Facial recognition, for instance, is ideal for semisupervised learning; the vast number of images of different people is clustered by similarity and then made sense of with a labeled picture, giving identity to the clustered photos.
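The cluster-then-label idea can be sketched as follows. Everything here is an assumption made for illustration: each photo is reduced to a single hypothetical number, only two of the eight photos carry a name, and the clustering is a bare-bones two-cluster k-means.

```python
from collections import Counter

def two_means_1d(values, iterations=10):
    """Cluster 1-D values into two groups; returns one cluster index per value."""
    centers = [min(values), max(values)]
    assignment = []
    for _ in range(iterations):
        assignment = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                      for v in values]
        for c in (0, 1):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assignment

def propagate_labels(values, known_labels):
    """Give every point the majority label of the labeled points in its cluster.

    known_labels maps an index in `values` to its human-provided label.
    """
    assignment = two_means_1d(values)
    cluster_label = {}
    for c in (0, 1):
        votes = Counter(label for i, label in known_labels.items()
                        if assignment[i] == c)
        if votes:
            cluster_label[c] = votes.most_common(1)[0][0]
    # Points in a cluster without any labeled member stay unlabeled (None).
    return [cluster_label.get(a) for a in assignment]

# Hypothetical 1-D "face features" for eight photos; only two are labeled.
features = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
known = {0: "alice", 4: "bob"}

labels = propagate_labels(features, known)
print(labels)
```

The resulting labels for all eight photos could then serve as training data for an ordinary supervised model.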
Example of a supervised learning project
A possible use case of supervised learning is in news categorization. One approach is to determine which category each piece of news belongs to, such as business, finance, technology or sports. To solve this problem, a supervised model would be the best fit.
Humans would present the model with various news articles and their categories and have the model learn what kind of news belongs to each category. This way, the model becomes capable of recognizing the news category of any article it looks at based on its previous training experience.
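A minimal sketch of this workflow, using a small bag-of-words naive Bayes classifier, might look like the following. The headlines and categories are invented, and naive Bayes is only one assumed choice of algorithm; the article does not name a specific one.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_articles):
    """Count how often each word appears in each category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, category in labeled_articles:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary

def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, n_docs in doc_counts.items():
        total_words = sum(word_counts[category].values())
        score = math.log(n_docs / total_docs)
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                count = word_counts[category][word]
                # Add-one smoothing avoids a zero probability for absent words.
                score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[category] = score
    return max(scores, key=scores.get)

# Hypothetical labeled training articles (headlines only, for brevity).
training = [
    ("stocks fall as markets react to interest rates", "business"),
    ("company profits rise after strong quarterly sales", "business"),
    ("team wins championship after dramatic final match", "sports"),
    ("star player scores twice in league match", "sports"),
    ("new smartphone chip promises faster ai software", "technology"),
    ("startup releases open source software for cloud computing", "technology"),
]

model = train_naive_bayes(training)
print(classify(model, "player scores in final match"))   # sports
print(classify(model, "markets react to falling sales")) # business
```

The model can only ever answer with one of the categories it was trained on, which is the limitation the next paragraphs discuss.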
However, humans might also conclude that classifying news based on the predetermined categories is not sufficiently informative or flexible, as some news may talk about climate change technologies or the workforce problems in an industry. There are billions of news articles out there, and separating them into 40 or 50 categories may be an oversimplification.
Instead, a better approach could be to find the similarities between the news articles and group the news accordingly. That would be looking at news clusters instead, where similar articles would be grouped together. There are no specific categories anymore.
This is what unsupervised learning achieves by determining the patterns and similarities within the data, as opposed to relating it to some external measurement.
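As a rough sketch of grouping by similarity, the snippet below puts headlines together when they share enough words. The headlines, the Jaccard word-overlap measure and the 0.2 threshold are all assumptions made for the example; real systems use richer text representations.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sets: 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b)

def group_by_similarity(texts, threshold=0.2):
    """Greedily put each text into the first group containing a similar text."""
    word_sets = [set(text.lower().split()) for text in texts]
    groups = []  # each group is a list of indices into `texts`
    for i, words in enumerate(word_sets):
        for group in groups:
            if any(jaccard(words, word_sets[j]) >= threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])  # nothing similar yet: start a new group
    return groups

# Hypothetical unlabeled headlines.
headlines = [
    "solar panels cut energy costs for factories",
    "factories adopt solar panels to cut emissions",
    "striker scores late goal in cup final",
    "late goal decides cup final",
]

print(group_by_similarity(headlines))  # [[0, 1], [2, 3]]
```

No category names are involved: the groups emerge from the articles themselves, and the number of groups is not fixed in advance.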