What is supervised learning?
Supervised learning is an approach to creating artificial intelligence (AI), where a computer algorithm is trained on input data that has been labeled for a particular output. The model is trained until it can detect the underlying patterns and relationships between the input data and the output labels, enabling it to yield accurate labeling results when presented with never-before-seen data.
Supervised learning is good at classification and regression problems, such as determining what category a news article belongs to or predicting the volume of sales for a given future date. In supervised learning, the aim is to make sense of data within the context of a specific question.
Unsupervised learning takes the opposite approach. The algorithm is presented with unlabeled data and is designed to detect patterns or similarities on its own, a process described in more detail below.
How does supervised learning work?
Like all machine learning algorithms, supervised learning is based on training. During the training phase, the system is fed labeled data sets, which tell it which output corresponds to each input value. The trained model is then presented with test data: data that has been labeled, but whose labels are not revealed to the algorithm. The purpose of the test data is to measure how accurately the algorithm will perform on unlabeled data.
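The train-then-test workflow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production recipe: the data set is made up, and the "model" is a hypothetical one-number threshold chosen only because it is easy to read. The helper names (`split_train_test`, `train_threshold`, `accuracy`) are assumptions introduced for this example.

```python
import random

def split_train_test(examples, test_fraction=0.25, seed=0):
    """Shuffle labeled examples and hold out a fraction as test data."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train_threshold(train):
    """Learn a cutoff halfway between the mean input of each class."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def predict(threshold, x):
    """Label an input 1 if it lies above the learned cutoff, else 0."""
    return 1 if x > threshold else 0

def accuracy(threshold, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict(threshold, x) == y for x, y in examples)
    return correct / len(examples)

# Hypothetical labeled data: (input value, label).
data = [(x, 0) for x in range(0, 20)] + [(x, 1) for x in range(30, 50)]

train, test = split_train_test(data)
threshold = train_threshold(train)

# The test labels are used only for scoring, never for training.
print("held-out accuracy:", accuracy(threshold, test))
```

The important design choice is that `test` never reaches `train_threshold`, so the score reflects performance on data the model has not seen.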
In neural network algorithms, the supervised learning process is improved by constantly measuring the resulting outputs of the model and fine-tuning the system to get closer to its target accuracy. The level of accuracy obtainable depends on two things: the available labeled data and the algorithm that is used. In addition:
- Training data must be balanced and cleaned. Garbage or duplicate data will skew the AI's understanding -- hence, data scientists must be careful with the data the model is trained on.
- The diversity of the data determines how well the AI will perform when presented with new cases; if there are not enough samples in the training data set, the model will falter and fail to yield reliable answers.
- High accuracy on the training data, paradoxically, is not necessarily a good sign; it could also mean the model is suffering from overfitting -- i.e., it is overtuned to its particular training data set. Such a model might perform well in test scenarios but fail miserably when presented with real-world challenges. To detect overfitting, it is important that the test data is different from the training data, so that the model cannot draw answers from its previous experience and its inference is shown to generalize.
- The algorithm, on the other hand, determines how that data can be put to use. For instance, deep learning models can learn billions of parameters from their training data and reach unprecedented levels of accuracy, as demonstrated by OpenAI's GPT-3.
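The overfitting point in the list above can be made concrete with a small experiment. The sketch below assumes synthetic data in which about 20% of the labels are deliberately flipped, and uses a 1-nearest-neighbor classifier as a stand-in for a model that memorizes its training set. Both choices are assumptions made for illustration.

```python
import random

def nearest_label(train, x):
    """1-nearest-neighbor: copy the label of the closest training input."""
    return min(train, key=lambda example: abs(example[0] - x))[1]

def accuracy(train, examples):
    """Fraction of examples the memorizing model labels correctly."""
    hits = sum(nearest_label(train, x) == y for x, y in examples)
    return hits / len(examples)

def make_data(n, rng, noise=0.2):
    """True rule: label is 1 when x > 0.5; a fraction of labels is flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y  # simulate a labeling error
        data.append((x, y))
    return data

rng = random.Random(0)
train = make_data(200, rng)
test = make_data(200, rng)

train_acc = accuracy(train, train)  # scored on data the model has memorized
test_acc = accuracy(train, test)    # scored on data the model has never seen

print("accuracy on training data:", train_acc)
print("accuracy on held-out data:", test_acc)
```

Because the model has memorized the noisy labels, it scores perfectly on its own training data while doing noticeably worse on the held-out set. That gap, rather than the training score alone, is what signals overfitting.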
Apart from neural networks, there are many other supervised learning algorithms (see below). Supervised learning algorithms primarily generate two kinds of results: classification and regression.
A classification algorithm aims to sort inputs into a given number of categories or classes, based on the labeled data it was trained on. Classification algorithms can be used for binary classifications such as filtering email into spam or non-spam and categorizing customer feedback as positive or negative. Feature recognition, such as recognizing handwritten letters and numbers or classifying drugs into many different categories, is another classification problem solved by supervised learning.
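As a rough illustration of classification, the sketch below sorts customer feedback into positive or negative using a nearest-centroid rule. The features (counts of positive and negative words per message), the sample data, and the function names are all hypothetical. Nearest-centroid is used here only because it is compact, not because it is the standard choice for this task.

```python
import math
from collections import defaultdict

def fit_centroids(examples):
    """Average the feature vectors of each class into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def classify(centroids, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

# Hypothetical features: (positive-word count, negative-word count).
labeled_feedback = [
    ((3, 0), "positive"),
    ((4, 1), "positive"),
    ((0, 3), "negative"),
    ((1, 4), "negative"),
]

centroids = fit_centroids(labeled_feedback)
print(classify(centroids, (5, 0)))
print(classify(centroids, (0, 2)))
```

The same two functions work unchanged for more than two classes, since each additional label simply contributes another centroid.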
Regression tasks are different, as they expect the model to predict a continuous numerical value by learning the relationship between the input and output data. Examples of regression problems include predicting real estate prices based on zip code, predicting click rates in online ads in relation to time of day, and determining how much customers would be willing to pay for a certain product based on their age.
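A minimal regression sketch, assuming a single input feature and ordinary least squares as the fitting method: the model learns a slope and an intercept from labeled pairs and then predicts a number for a new input. The floor areas and prices below are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Evaluate the fitted line at a new input."""
    return slope * x + intercept

# Hypothetical training data: floor area (square meters) -> price (thousands).
areas = [50, 70, 90, 110]
prices = [150, 200, 260, 310]

slope, intercept = fit_line(areas, prices)
print("predicted price for 100 square meters:", predict(slope, intercept, 100))
```

Unlike the classifiers above, the output is not one of a fixed set of labels but a value on a continuous scale.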
Algorithms commonly used in supervised learning programs include the following:
- linear regression
- logistic regression
- neural networks
- linear discriminant analysis
- decision trees
- similarity learning
- Bayesian logic
- support vector machines (SVMs)
- random forests
When choosing a supervised learning algorithm, there are a few things that should be considered. The first is the bias and variance that exist within the algorithm, as there is a fine line between being flexible enough and too flexible: a model that is too rigid misses real patterns, while one that is too flexible fits the noise in its training data. Another is the complexity of the model or function that the system is trying to learn. Finally, the heterogeneity, accuracy, redundancy and linearity of the data should also be analyzed before choosing an algorithm.
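One common way to weigh these trade-offs empirically is k-fold cross-validation, which scores each candidate algorithm on data it was not trained on. The sketch below is an assumption-laden illustration: it compares a deliberately rigid candidate (always predict the mean) with a deliberately flexible one (copy the nearest training example) on made-up, perfectly linear data. The function names are hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; fold f holds out items f, f+k, f+2k, ..."""
    for fold in range(k):
        test_idx = list(range(fold, n, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        yield train_idx, test_idx

def cross_val_mse(fit, xs, ys, k=5):
    """Average held-out mean squared error of a model-fitting function."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        squared = [(model(xs[i]) - ys[i]) ** 2 for i in test_idx]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / len(fold_errors)

def fit_mean(xs, ys):
    """Rigid (high-bias) candidate: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_nearest(xs, ys):
    """Flexible (high-variance) candidate: copy the nearest training target."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Hypothetical data following y = 3x with no noise.
xs = list(range(20))
ys = [3 * x for x in xs]

for name, fit in [("mean", fit_mean), ("nearest", fit_nearest)]:
    print(name, cross_val_mse(fit, xs, ys))
```

On this particular data the flexible candidate wins because there is no noise to overfit; with noisy labels the ranking could change, which is why the comparison has to be run on the data at hand.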
Supervised vs. unsupervised learning
The chief difference between unsupervised and supervised learning is in how the algorithm learns. In unsupervised learning, the algorithm is given unlabeled data as a training set. Unlike in supervised learning, there are no correct output values; the algorithm determines the patterns and similarities within the data, as opposed to relating it to some external measurement. In other words, algorithms are free to explore the data and surface interesting or unexpected patterns that human beings weren't looking for. Unsupervised learning is popular in applications of clustering (the act of uncovering groups within data) and association (the act of discovering rules that describe the data).
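To make the contrast concrete, here is a minimal clustering sketch using k-means (Lloyd's algorithm) on one-dimensional data. No labels are supplied; the algorithm only receives the values and a starting guess for the group centers. The data and the initial centers are assumptions chosen for readability.

```python
def k_means_1d(values, centers, iterations=10):
    """Lloyd's algorithm on scalars, starting from the given initial centers.

    Returns the final centers and the clusters from the last assignment step.
    """
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled measurements.
values = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]

centers, clusters = k_means_1d(values, centers=[0.0, 5.0])
print(centers)   # [1.5, 9.5]
print(clusters)  # [[1.0, 1.5, 2.0], [9.0, 9.5, 10.0]]
```

The algorithm finds two groups, but it cannot say what they mean; naming the groups would require labels or human interpretation.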
Benefits and limitations
Supervised learning models have some advantages over the unsupervised approach, but they also have limitations. Supervised learning systems are more likely to make judgments that humans can relate to, for example, because humans have provided the basis for decisions.
However, supervised learning systems have trouble dealing with information that falls outside the categories they were trained on. If a system with categories for cars and trucks is presented with a bicycle, for example, it would incorrectly lump the bicycle into one category or the other. An unsupervised system, by contrast, may not know what the bicycle is, but it would be able to recognize it as belonging to a separate group.
Supervised learning also typically requires large amounts of correctly labeled data to reach acceptable performance levels, and such data may not always be available. Unsupervised learning does not suffer from this problem and can work with unlabeled data as well.
In cases where supervised learning is needed but there is a lack of quality data, semi-supervised learning may be the appropriate learning method. This learning model resides between supervised learning and unsupervised; it accepts data that is partially labeled -- i.e., the majority of the data lacks labels.
Semi-supervised learning determines the correlations between the data points -- just like unsupervised learning -- and then uses the labeled data to mark those data points. Finally, the entire model is trained based on the newly applied labels.
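The three steps above (group the data, spread the few known labels across each group, then train on the result) can be sketched as follows. This is one simple cluster-then-label variant under stated assumptions, not the only form of semi-supervised learning. The one-dimensional data, the two known labels, the initial centers, and all function names are hypothetical.

```python
from collections import Counter

def nearest_center(v, centers):
    """Index of the center closest to value v."""
    return min(range(len(centers)), key=lambda i: abs(v - centers[i]))

def find_centers(values, centers, iterations=10):
    """Step 1: group all values, labeled or not, with k-means style updates."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            groups[nearest_center(v, centers)].append(v)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def pseudo_label(values, known, centers):
    """Step 2: give every value the majority known label of its cluster."""
    votes = [Counter() for _ in centers]
    for index, label in known.items():
        votes[nearest_center(values[index], centers)][label] += 1
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]
    return [(v, cluster_label[nearest_center(v, centers)]) for v in values]

def train_on_labels(labeled):
    """Step 3: supervised training on the newly labeled data (nearest class mean)."""
    by_label = {}
    for v, label in labeled:
        if label is not None:
            by_label.setdefault(label, []).append(v)
    means = {label: sum(vs) / len(vs) for label, vs in by_label.items()}
    return lambda v: min(means, key=lambda label: abs(v - means[label]))

# Eight hypothetical measurements; only items 0 and 7 carry a label.
values = [1.0, 1.2, 1.4, 1.6, 8.0, 8.2, 8.4, 8.6]
known = {0: "short", 7: "long"}

centers = find_centers(values, [0.0, 10.0])
labeled = pseudo_label(values, known, centers)
model = train_on_labels(labeled)
print(model(2.0), model(7.0))
```

The quality of the result depends on the clusters lining up with the true classes; if they do not, the propagated labels will be wrong and the final model inherits those errors.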
Semi-supervised learning has proven to yield accurate results and is applicable to many real-world problems where the small amount of labeled data would prevent supervised learning algorithms from functioning properly. As a rule of thumb, a data set with at least 25% labeled data is suitable for semi-supervised learning.
Facial recognition, for instance, is well suited to semi-supervised learning: a vast number of images of different people is first clustered by similarity, and a single labeled picture can then give an identity to each cluster of photos.
Example of a supervised learning project
Consider the news categorization problem from earlier. One approach is to determine which category each piece of news belongs to, such as business, finance, technology or sports. To solve this problem, a supervised model would be the best fit.
Humans would present the model with various news articles and their categories and have the model learn what kind of news belongs to each category. This way, the model becomes capable of recognizing the news category of any article it looks at based on its previous training experience.
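A toy version of such a news classifier is sketched below using multinomial naive Bayes with add-one smoothing, one of several algorithms that could fill this role. The six training headlines and their categories are invented, and the whitespace tokenization is a simplifying assumption.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(articles):
    """Count words per category from (text, category) training pairs."""
    word_counts = defaultdict(Counter)
    category_counts = Counter()
    vocabulary = set()
    for text, category in articles:
        words = text.lower().split()
        word_counts[category].update(words)
        category_counts[category] += 1
        vocabulary.update(words)
    return word_counts, category_counts, vocabulary

def categorize(model, text):
    """Return the category with the highest log-probability for the text."""
    word_counts, category_counts, vocabulary = model
    total_articles = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_articles in category_counts.items():
        score = math.log(n_articles / total_articles)  # prior
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical labeled headlines supplied by humans.
training_articles = [
    ("stocks rally as markets close higher", "finance"),
    ("central bank raises interest rates", "finance"),
    ("team wins championship final match", "sports"),
    ("striker scores twice in cup match", "sports"),
    ("new smartphone chip boosts battery life", "technology"),
    ("software update fixes security bug", "technology"),
]

model = train_naive_bayes(training_articles)
print(categorize(model, "bank cuts interest rates"))
print(categorize(model, "striker wins the match"))
```

A real system would need far more training articles per category and better text preprocessing, but the structure is the same: labeled examples in, a category prediction out.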
However, humans might also come to the conclusion that classifying news based on the predetermined categories is not sufficiently informative or flexible, as some news may talk about climate change technologies or the workforce problems in an industry. There are billions of news articles out there, and separating them into 40 or 50 categories may be an oversimplification. Instead, a better approach would be to find the similarities between the news articles and group the news accordingly. That would be looking at news clusters instead, where similar articles would be grouped together. There are no specific categories anymore.
This is what unsupervised learning achieves: It determines the patterns and similarities within the data, as opposed to relating it to some external measurement.