Unsupervised learning refers to the use of artificial intelligence (AI) algorithms to identify patterns in data sets containing data points that are neither classified nor labeled.
The algorithms are thus allowed to classify, label and/or group the data points contained within the data sets without having any external guidance in performing that task.
In other words, unsupervised learning allows the system to identify patterns within data sets on its own.
In unsupervised learning, an AI system will group unsorted information according to similarities and differences even though there are no categories provided.
Unsupervised learning algorithms can take on processing tasks that are difficult to frame for supervised learning systems, such as finding structure in data for which no labels exist. Additionally, subjecting a system to unsupervised learning is one way of testing AI.
However, unsupervised learning can be more unpredictable than a supervised learning model. While an unsupervised learning AI system might, for example, figure out on its own how to sort cats from dogs, it might also add unforeseen and undesired categories to deal with unusual breeds, creating clutter instead of order.
AI systems capable of unsupervised learning are often associated with generative learning models, although they may also use a retrieval-based approach (which is most often associated with supervised learning). Chatbots, self-driving cars, facial recognition programs, expert systems and robots are among the systems that may use either supervised or unsupervised learning approaches, or both.
Unsupervised learning is sometimes also called unsupervised machine learning.
Unsupervised learning starts when machine learning engineers or data scientists pass data sets through algorithms to train them.
As previously stated, there are no labels or categories contained within the data sets being used to train such systems; each piece of data that's being passed through the algorithms during training is an unlabeled input object or sample.
The objective with unsupervised learning is to have the algorithms identify patterns within the training data sets and categorize the input objects based on the patterns that the system itself identifies. The algorithms analyze the underlying structure of the data sets by extracting useful information or features from them.
Thus, these algorithms are expected to develop specific outputs from the unstructured inputs by looking for relationships between each sample or input object.
Using animals again as an example, algorithms may be given data sets containing images of animals. The algorithms may then classify the animals into categories such as those with fur, those with scales and those with feathers. They may then group the images into increasingly specific subgroups as they learn to identify distinctions within each category.
The algorithms do this by uncovering and identifying patterns. In unsupervised learning, this pattern recognition happens without the system having been fed data that teaches it to distinguish, in this example, between mammals, fish and birds, or to further distinguish within the mammal category between dogs and cats.
Unsupervised vs. supervised learning
In contrast to unsupervised learning, supervised learning uses labeled data sets to train algorithms to identify and sort inputs based on the provided labels.
The input object, or sample, has a corresponding label so that the algorithms learn to identify and classify those input objects which match with the same label.
In other words, the algorithms create maps from given inputs to specific outcomes based on what they learn from training data that has been labeled by machine learning engineers or data scientists.
Moreover, supervised learning uses both labeled training data and labeled validation data. This allows the accuracy of supervised learning outputs to be checked against known labels, something that is not directly possible with unsupervised learning. Machine learning engineers or data scientists may also opt to use a combination of labeled and unlabeled data to train their algorithms. This in-between option is appropriately called semi-supervised learning.
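To make the supervised side of this comparison concrete, the following is a minimal Python sketch of a nearest-centroid classifier trained on labeled samples and then scored against labeled validation data. The technique, the function names and the tiny animal data set (weight in kg, height in cm) are all illustrative assumptions and are not taken from this article.

```python
from collections import defaultdict


def train_nearest_centroid(samples, labels):
    """Learn one centroid (mean point) per label from labeled training data."""
    grouped = defaultdict(list)
    for sample, label in zip(samples, labels):
        grouped[label].append(sample)
    return {
        label: tuple(sum(dim) / len(members) for dim in zip(*members))
        for label, members in grouped.items()
    }


def predict(centroids, sample):
    """Return the label whose centroid is closest to the sample."""
    return min(
        centroids,
        key=lambda label: sum(
            (x - c) ** 2 for x, c in zip(sample, centroids[label])
        ),
    )


# Hypothetical labeled data: (weight in kg, height in cm).
train_x = [(4.0, 30.0), (4.5, 28.0), (25.0, 60.0), (30.0, 65.0)]
train_y = ["cat", "cat", "dog", "dog"]
val_x = [(5.0, 29.0), (28.0, 62.0)]
val_y = ["cat", "dog"]

model = train_nearest_centroid(train_x, train_y)
predictions = [predict(model, x) for x in val_x]

# Because the validation data is labeled, accuracy can be computed directly.
accuracy = sum(p == y for p, y in zip(predictions, val_y)) / len(val_y)
print(predictions, accuracy)
```

The final accuracy calculation is the step that has no direct counterpart in unsupervised learning, because there the validation labels do not exist.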
Clustering and other types of unsupervised learning
Unsupervised learning is often focused on clustering.
Clustering is the grouping of objects or data points that are similar to each other and dissimilar to objects in other clusters.
Machine learning engineers and data scientists can use different algorithms for clustering, with the algorithms themselves falling into different categories based on how they work. The categories include the following:
- exclusive clustering
- overlapping clustering
- hierarchical clustering
- probabilistic clustering
Some of the more widely used algorithms include the k-means clustering algorithm and the fuzzy k-means algorithm, as well as the hierarchical clustering and the density-based clustering algorithms.
Latent Dirichlet Allocation (LDA), a probabilistic topic model, and Gaussian mixture models are also commonly used for clustering.
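As an illustration of exclusive clustering, here is a minimal Python sketch of the k-means algorithm mentioned above. It is a simplified teaching version under stated assumptions: the function names, the random initialization scheme, the fixed seed and the six sample points are all hypothetical, and production work would normally rely on an established library.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Group points into k clusters; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points picked at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # No point changed cluster, so the result is stable.
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return assignments, centroids


# Hypothetical unlabeled data with two visually obvious groups.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3), (8.2, 7.9)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Note that the cluster numbers carry no meaning of their own: the algorithm only reports which points belong together, and a person still has to interpret what each group represents.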
In addition to clustering, unsupervised learning may be used to determine how data is distributed in space (density estimation).
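Density estimation can be sketched with a Gaussian kernel density estimate, one common approach that this article does not specifically name. In the minimal Python example below, the function name, the bandwidth of 0.5 and the sample values are illustrative assumptions.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at any point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each sample contributes a small Gaussian bump centered on itself.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical one-dimensional measurements concentrated near 1 and near 5.
samples = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9]
density = gaussian_kde(samples, bandwidth=0.5)

for x in (1.0, 3.0, 5.0):
    print(x, round(density(x), 4))
```

The estimate is high where samples are concentrated and low in the gap between them, which is the sense in which it describes how the data is distributed.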
Unsupervised machine learning can identify previously unknown patterns in data. It can be easier, faster and less costly to use than supervised learning as unsupervised learning does not require the manual work associated with labeling data that supervised learning requires. And unsupervised learning can work with real-time data to identify patterns.
Although organizations value those features of unsupervised learning, there are some disadvantages, including the following:
- uncertainty about the accuracy of the unsupervised learning outputs;
- difficulty checking the accuracy of the unsupervised learning outputs, as there are no labeled data sets to verify the results;
- the need for engineers and data scientists to spend more time interpreting and labeling results with unsupervised learning than they would with supervised learning; and
- the lack of full insight into how or why an unsupervised system reaches its results.
There is an additional disadvantage with clustering as well, in that cluster analysis could overestimate the similarities in the input objects and thereby obscure individual data points that may be important for some use cases, such as customer segmentation where the objective is to understand individual customers and their unique buying habits.
Examples and use cases
Exploratory analysis and dimensionality reduction are two of the most common uses for unsupervised learning.
Exploratory analysis, in which the algorithms are used to detect patterns that were previously unknown, has a range of enterprise applications. For example, businesses can utilize exploratory analysis as a starting point for their customer segmentation efforts.
In dimensionality reduction, algorithms reduce the number of variables or features (i.e., dimensions) within the data sets so that analysis can focus on the features relevant to a given objective. Some experts explain this by saying that dimensionality reduction removes noisy data. (Machine learning engineers often use latent variable model-based algorithms to do this work.) For example, an organization can use dimensionality reduction to make blurry images easier to read by reducing the background.
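One widely used dimensionality reduction technique is principal component analysis (PCA), which this article does not name explicitly. The minimal Python sketch below finds the first principal component by power iteration and projects three-dimensional samples onto it. The function names, the iteration count and the sample rows are illustrative assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (means, unit vector) for the direction of greatest variance."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    # Power iteration: repeated multiplication converges to the dominant
    # eigenvector, assuming the start vector is not orthogonal to it.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return means, vec


def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [
        sum((value - mean) * weight
            for value, mean, weight in zip(row, means, component))
        for row in rows
    ]


# Hypothetical samples: the third feature is mostly noise, and the second
# is roughly twice the first.
rows = [
    (1.0, 2.0, 0.1),
    (2.0, 4.1, 0.0),
    (3.0, 5.9, 0.1),
    (4.0, 8.0, 0.0),
    (5.0, 10.1, 0.1),
]
means, component = first_principal_component(rows)
projected = project(rows, means, component)
print(component)
print(projected)
```

In this example the three original features collapse to one coordinate per sample, and the noisy third feature receives almost no weight in the component.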
Additionally, organizations can use unsupervised learning for the following applications:
- clustering anomaly detection, whereby algorithms can identify unusual data points in data sets, a capability particularly useful for identifying fraudulent activity, human errors or faulty products; and
- association mining, where algorithms find associations among data points, a capability that retailers, for example, can use to identify what products are often bought together.
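The association mining idea in the last item can be sketched by counting how often pairs of products appear in the same purchase. This minimal Python example is a simplification of full association rule mining: the function name, the shopping baskets and the 0.5 support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Return the fraction of transactions containing each product pair."""
    counts = Counter()
    for basket in transactions:
        # Sort and de-duplicate so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical retail transactions.
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]

support = pair_support(transactions)
# Assumption: a pair counts as "frequent" if it is in at least half the baskets.
frequent = {pair: s for pair, s in support.items() if s >= 0.5}
print(frequent)
```

Here only bread and butter clear the threshold, appearing together in three of the four baskets, which is the kind of result a retailer could use to decide which products to place or promote together.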