https://www.techtarget.com/searchenterpriseai/definition/inception-score-IS
The inception score (IS) is a metric used to measure the quality of images created by generative AI models, such as a generative adversarial network (GAN). The name comes from Google's "Inception" image classification network, which the metric uses to evaluate the generated images.
Without an inception score, humans are left to observe a generated image and evaluate it visually -- but such visual evaluations are highly subjective and can vary widely based on the preferences and biases of the human viewer. The inception score, and other metrics such as Fréchet inception distance (FID), offer objective and consistent measures of generated images and, by extension, of the quality and capability of the underlying generative model.
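The score produced by the IS algorithm can range from 1 (worst) to the number of classes the image classifier can recognize (best), which is 1,000 for the standard Inception v3 model trained on ImageNet. The inception score algorithm measures two factors: the quality of each generated image and the diversity of the generated images as a set.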
Generative AI developers use the inception score as a measure of image quality. The IS may be employed as a training mechanism by feeding the IS back to the AI model. This kind of training can provide more objective and explainable feedback than solely allowing human viewers to subjectively "score" generative images.
The inception score, first defined in a 2016 technical paper, is based on Google's "Inception" image classification network.
Calculating an inception score starts by using the image classification network to ingest a generated image and return a probability distribution for the image. The image classification network is a pre-trained Inception v3 model, which predicts class probabilities -- what something might be -- for each computer-generated image. A probability distribution is simply a list of labels for what the image classification network "thinks" the image might be, each with a fractional score; together, the scores add up to 1.0.
For example, the image classification network might see the generated image of a cat and return a series of potential results such as the following:
Label | Probability |
Cat | 0.5 |
Flower | 0.2 |
Car | 0.2 |
House | 0.1 |
Total | 1.0 |
The probability distribution helps to determine whether the generated image contains one well-defined thing, or a series of things that are harder (if not impossible) for the image classification network to identify. This is the foundation of the quality factor -- does the generated image look like something specific and identifiable?
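One common way to put a number on how well-defined a single prediction is uses Shannon entropy. The article does not name entropy, so treat the sketch below as an illustration under that assumption. It uses the values from the cat example, plus a hypothetical prediction for a clearer image as a comparison.

```python
import math

def entropy(probs):
    """Shannon entropy in nats. Lower values mean a more peaked (confident) distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Values from the cat example above.
cat_example = {"Cat": 0.5, "Flower": 0.2, "Car": 0.2, "House": 0.1}
# Hypothetical prediction for a much clearer image (illustrative values only).
clear_image = {"Cat": 0.97, "Flower": 0.01, "Car": 0.01, "House": 0.01}

cat_entropy = entropy(cat_example.values())
clear_entropy = entropy(clear_image.values())
print(f"cat example:   {cat_entropy:.3f} nats")
print(f"clearer image: {clear_entropy:.3f} nats")
```

The clearer image produces the lower entropy, which is the behavior the quality factor rewards.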
Next, the inception score process averages the probability distributions for all the generated images. There may be as many as 50,000 generated images in a sample. The result is a second distribution, called the marginal distribution, which indicates the amount of variety present in the generative AI's images.
For the cat example, the probabilities for each label are summed across every image in the sample and divided by the number of images. If most of the images are cats, the marginal distribution is focused on a single label. If the images are spread evenly across cats, flowers, cars, houses and so on, the marginal distribution approaches a uniform (flat) distribution. A flatter marginal distribution indicates more variety in the generative AI's output. This is the foundation of the diversity factor -- can the AI produce varied items and scenes?
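As a rough illustration of the diversity factor, the sketch below averages per-image distributions class by class. The function name, the class order and all probability values are hypothetical, chosen only to contrast a varied sample with a cats-only sample.

```python
def marginal_distribution(per_image_probs):
    """Average the per-image distributions class by class."""
    n_images = len(per_image_probs)
    n_classes = len(per_image_probs[0])
    return [
        sum(row[k] for row in per_image_probs) / n_images
        for k in range(n_classes)
    ]

# Hypothetical class order: [cat, flower, car, house].
varied_set = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]
cats_only = [[0.97, 0.01, 0.01, 0.01]] * 4

varied_marginal = marginal_distribution(varied_set)
cats_marginal = marginal_distribution(cats_only)
print(varied_marginal)  # close to flat: every class is about equally likely
print(cats_marginal)    # concentrated on the first class
```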
The last step is to combine each image's probability distribution and the marginal distribution into a single score, which can represent both the distinctiveness of the object as well as the diversity of the output. The more those two distributions differ, the higher the inception score. The difference is measured using a statistical measure called the Kullback-Leibler divergence, or KL divergence.
When the KL divergence is high, each image's probability distribution is sharply peaked and the marginal distribution is even (flat) -- each image has a distinct label (such as a cat), but the overall set of images has many different labels. This yields the highest inception score.
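The KL divergence between one image's distribution and the marginal distribution can be sketched in a few lines. This is a minimal illustration with hypothetical values, not a reference implementation; it uses the natural logarithm, so results are in nats.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms where p_i is 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical flat marginal distribution over four classes.
flat_marginal = [0.25, 0.25, 0.25, 0.25]
sharp_image = [0.97, 0.01, 0.01, 0.01]  # one clearly identifiable object
vague_image = [0.25, 0.25, 0.25, 0.25]  # classifier cannot decide

sharp_kl = kl_divergence(sharp_image, flat_marginal)
vague_kl = kl_divergence(vague_image, flat_marginal)
print(f"sharp image: {sharp_kl:.3f}")
print(f"vague image: {vague_kl:.3f}")
```

The sharply peaked prediction diverges strongly from the flat marginal distribution, while the undecided prediction does not diverge at all.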
Finally, the IS algorithm averages the KL divergence across every image in the sample set and takes the exponential of that average to produce the final score.
Although the inception score algorithm provides an objective means of measuring the quality and diversity of AI-generated images, the IS poses three principal limitations for AI developers:
Another metric used to evaluate the quality of AI-generated images is the Fréchet inception distance. FID was introduced in 2017 and has generally superseded inception score as the preferred measure of generative image model performance.
The principal difference between IS and FID is the comparative use and evaluation of real images, referred to as "ground truth." This allows FID to analyze real images alongside computer-generated images in a bid to better simulate human perception. By comparison, IS only evaluates computer-generated images.
Although FID has generally edged out IS as the preferred quality metric for GANs, FID has also been shown to demonstrate some statistical bias, and does not always accurately reflect human perception.
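For reference, FID is commonly written as the Fréchet distance between two Gaussian distributions fitted to Inception feature vectors, one for the real images and one for the generated images. The expression below is the standard form from the literature rather than something stated in this definition. Here, \mu_r and \Sigma_r denote the mean and covariance of the real-image features, and \mu_g and \Sigma_g denote those of the generated-image features. A lower FID indicates that the two sets of images are more alike.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```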
The actual formula to calculate inception score requires the use of calculus and is beyond the scope of this definition. For a more complete explanation, however, an abbreviated mathematical expression for inception score can be shown as the following:
$$\mathrm{IS}(G) = \exp\left( \mathbb{E}_{x \sim p_g} \left[ D_{KL}\left( p(y \mid x) \,\|\, p(y) \right) \right] \right)$$
The major components of the formula are as follows:
The common process for resolving this expression and determining a final inception score involves five basic steps:
This final result is the inception score for the given set of computer-generated images.
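The expression above can be sketched in plain Python using only the standard library. This is a minimal illustration, not a reference implementation. It assumes the per-image probability distributions have already been produced by a classifier, and the function name and toy inputs are hypothetical.

```python
import math

def inception_score(per_image_probs):
    """exp of the mean KL divergence between each p(y|x) and the marginal p(y)."""
    n_images = len(per_image_probs)
    n_classes = len(per_image_probs[0])

    # p(y): average the per-image distributions class by class.
    marginal = [
        sum(row[k] for row in per_image_probs) / n_images
        for k in range(n_classes)
    ]

    # KL(p(y|x) || p(y)) for each image; zero-probability terms contribute nothing.
    kl_values = [
        sum(p * math.log(p / q) for p, q in zip(row, marginal) if p > 0)
        for row in per_image_probs
    ]

    # Average over images first, then exponentiate.
    return math.exp(sum(kl_values) / n_images)

# Toy case 1: four images, each assigned with certainty to a different class.
best_case = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Toy case 2: every image receives the same uncertain prediction.
worst_case = [[0.5, 0.2, 0.2, 0.1]] * 4

best_score = inception_score(best_case)
worst_score = inception_score(worst_case)
print(best_score, worst_score)
```

With four classes, the toy score lands at about 4.0 for the confident and varied set and about 1.0 for the set of identical predictions. A real evaluation would run thousands of images through a pre-trained classifier rather than using hand-written lists.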
Although the mathematical formula for inception score can be resolved manually, the process of repeating advanced multi-step calculations across thousands of images can be a daunting and error-prone human challenge.
Instead of manual calculations, AI developers working with generative image models will typically implement a metric such as inception score using a mathematical software package. Common math processing alternatives include the following:
Implementing IS in a math package will require some amount of coding to derive probability distributions (or access to data where distributions are stored) and perform other required calculations. Coding may be performed by AI scientists already working on generative AI systems or supporting development staff.
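As one example of the coding involved, an image classifier typically outputs raw scores (logits) that must be converted into a probability distribution with a softmax function before any score can be calculated. The sketch below assumes that setup; the logit values are hypothetical, and a real pipeline would normally rely on a numerical library instead of plain Python lists.

```python
import math

def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1.0."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the classes [cat, flower, car, house].
raw_scores = [2.0, 1.0, 1.0, 0.3]
probs = softmax(raw_scores)
print([round(p, 3) for p in probs])
```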
02 May 2024