Embedding models for semantic search: A guide

Embedding models in semantic search are changing how we interact with information by going beyond keyword matching to capture meaning and relationships in text and other data.

Embedding models for semantic search transform data into more efficient formats for symbolic and statistical computer processing. A type of neural network, an embedding model takes advantage of innovations in generative AI, vector databases and knowledge graphs to better grasp the connections between words and ideas. This enables more precise concept matching, compared to traditional keyword-matching approaches. This semantic capability makes embedding models especially useful in search engines, data analytics, customer support chatbots, recommendation engines and business process analysis tools.

How are embedding models used?

Embedding models for semantic search are used by developers and data scientists tasked with building better apps and experiences. Deciding which embedding model is best for the project at hand is an important first step. Data scientists and developers might compare the speed, size and accuracy of various embedding models for a particular task. Benchmarks built on curated data sets of questions and relevant responses for common tasks not only test the performance of available embedding models, but can also provide feedback for improving a new model.

Embedding models are often hidden inside complex services and applications. For example, embedding models play a central role in Google Search, but their details are not published so that low-quality content mills cannot game search results. Also, many enterprise apps for knowledge graphs, active metadata management, graph databases and process intelligence use embeddings that are not publicized to maintain a competitive advantage.

How do embedding models work?

Embedding models use a statistical approach to model connections between similar types of content. For example, in a word-embedding model, the terms queen and king might sit statistically close to chief or president, while queen and king themselves differ mainly on the feature of gender. The approach is not limited to words. Innovations in generative AI can also create embedding models that describe similarity across sentences, paragraphs and larger documents. These help tease apart distinctions like the difference between "man bites dog" and "dog bites man."
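
Closeness between embeddings is often measured with cosine similarity. The following is a minimal sketch, assuming hand-made three-dimensional vectors. The numbers are invented for illustration and do not come from any real model, and the axis labels in the comment are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors with axes [leadership, gender, fruit].
# Real embeddings have hundreds or thousands of unlabeled dimensions.
toy_vectors = {
    "king":      [0.9,  0.3, 0.0],
    "queen":     [0.9, -0.3, 0.0],
    "president": [0.8,  0.1, 0.0],
    "banana":    [0.0,  0.0, 1.0],
}

def rank_neighbors(word, vectors):
    """Return the other words sorted from most to least similar."""
    target = vectors[word]
    others = [w for w in vectors if w != word]
    return sorted(others,
                  key=lambda w: cosine_similarity(target, vectors[w]),
                  reverse=True)

print(rank_neighbors("king", toy_vectors))
```

In this toy space, king and queen point in nearly the same direction on the leadership axis and differ only in the sign of the gender component, while banana is unrelated to both.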

Embedding models are trained on especially large data sets to learn the patterns and relationships in text. When a trained model processes new data, it analyzes the text and generates an embedding vector: a fixed-length list of numbers that places the text at a point in a multidimensional space.

Some of the best embedding models use thousands of dimensions, which seems like a lot. But this helps capture the complexity and subtleties found across common speech, abstract poetry and the descriptions of innovations in scientific research.

There are embedding models that let you specify the number of dimensions. For example, one of OpenAI's newest embedding models -- text-embedding-3-large -- lets you request shorter vectors, such as 256 or 1,024 dimensions, instead of the full 3,072. Using a high number of dimensions generally results in a better semantic search score, but the vectors are slower to search and require more memory. Low-dimensional embeddings are less computationally expensive.
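
A rough sketch of this trade-off follows, assuming 4-byte floats and a hypothetical corpus of one million vectors. It also shows one common way to shorten a vector after the fact: keep the first k components and rescale to unit length. Whether quality survives this depends on how the model was trained, so treat that as an assumption to verify for any particular model.

```python
import math

BYTES_PER_FLOAT32 = 4

def index_size_mb(num_vectors, dims):
    """Approximate raw storage for float32 vectors, ignoring index overhead."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1_000_000

def shorten(vector, k):
    """Keep the first k components and rescale the result to unit length."""
    head = vector[:k]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]

# Storage grows linearly with the number of dimensions.
for dims in (256, 1024, 3072):
    print(dims, "dimensions:", index_size_mb(1_000_000, dims), "MB")
```

Going from 256 to 3,072 dimensions multiplies raw storage by 12, and brute-force similarity search cost grows by roughly the same factor.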

Embedding vs. encoding models

Encoding models apply a structural, logical and mathematical approach to modeling connections. For example, database languages and schemas help structure business data for efficient processing or analysis. Knowledge graphs, ontologies, taxonomies and digital twins can also model the relationships between more complex types of information. An example is eXtensible Business Reporting Language (XBRL), which structures business reporting information for more efficient comparison, analysis and reuse.

Embedding models are slower and less accurate than encoding models, but they are more flexible and adaptable. Hence, they can structure unstructured text or translate between data stored using different encoding schemes. Aligning data silos into a relevant encoding scheme can require considerable effort and expense. The billions of dollars enterprises spend on data integration boil down to their need to synchronize data across different data schemas and formats.

Embedding models promise to automate the translation across encoding schemes through techniques like active metadata management and the semantic layers used in data fabrics and data mesh architectures. This can improve data reuse and connect information silos. Conversely, encoding models can bring structure when translating across different domain-specific embedding models to improve accuracy and performance through techniques like retrieval-augmented generation (RAG).

Connecting embeddings across modalities

Text embedding models have dominated R&D efforts because they can take advantage of the vast body of unstructured information on the internet, in business documents and in customer interaction data.

Other kinds of embedding models for semantic search are being developed to model patterns in images, audio, data schemas, math, science, graphs, robotics and other domains. These are sometimes used independently, such as in a music or product recommendation engine based on song play history or purchasing behavior.

One big challenge is that the same embedding models used for encoding data must also be used to decode or process it. This has driven research toward larger and multimodal embedding models that are useful across more domains. However, these are a bit like Swiss Army knives, which are useful for an occasional task in a pinch but not as efficient as a dedicated screwdriver when building a house. A code-specific embedding model is faster for code autocomplete. A general-purpose embedding model could help ask questions about a code snippet or generate new snippets using a natural language prompt, but it is larger and slower.

There are many ways to bridge the gaps across embedding silos. The previously cited RAG technique can use one embedding model optimized for a particular type of semantic search and then submit the retrieved text to an LLM, which applies its own embedding scheme, to produce more useful responses. Multimodal AI approaches use neuro-symbolic AI techniques to train an LLM across embeddings directly or to join a new embedding scheme with an existing LLM.
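
The retrieval half of RAG can be sketched as follows. This is a minimal illustration, not a production design: the embed function is a word-count stand-in for a real embedding model, the documents are invented and the final prompt would be passed to an LLM that is not shown here.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Assemble the text that would be submitted to the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical enterprise snippets.
documents = [
    "The pump must be primed before the first start",
    "Invoices are due within thirty days",
    "Replace the pump filter every six months",
]
query = "how often to replace the pump filter"
prompt = build_prompt(query, retrieve(query, documents, k=2))
print(prompt)
```

The search-side embedding only has to rank passages well. Because the LLM receives plain text, its internal representation never has to be compatible with the retriever's vectors.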

Collisions, similar to hash collisions, can be a problem when connecting too many types or modalities of data into one embedding model. This occurs when too many things get mapped to too small a space. As a result, the model represents things that should mean different things with the same or nearly identical vectors. This can reduce accuracy and increase the risk of AI hallucination.
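
The effect of squeezing too many items into too small a space can be illustrated with a deliberately crude stand-in for a learned embedding: hashing each item into a fixed number of slots. This sketch assumes 1,000 hypothetical items and uses MD5 only because it is deterministic across runs. A trained model fails more gradually, by producing near-identical vectors rather than exact duplicates.

```python
import hashlib
from collections import defaultdict

def bucket(token, num_buckets):
    """Deterministically map a token to one of num_buckets slots."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def colliding_tokens(tokens, num_buckets):
    """Count tokens that share a slot with at least one other token."""
    slots = defaultdict(list)
    for token in tokens:
        slots[bucket(token, num_buckets)].append(token)
    return sum(len(group) for group in slots.values() if len(group) > 1)

tokens = [f"concept_{i}" for i in range(1000)]  # hypothetical items
for size in (16, 1024, 1_000_000):
    count = colliding_tokens(tokens, size)
    print(size, "slots:", count, "of", len(tokens), "tokens collide")
```

With only 16 slots, nearly every item is indistinguishable from some other item. Enlarging the space makes collisions rare, which mirrors the motivation for higher-dimensional embeddings.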

Types of embedding models in semantic search across modalities

Here are some examples of how embedding models for semantic search can be applied across various modalities:

  1. Unstructured text. Text-embedding schemes transform unstructured data into vectors to improve search and summarization across business documents, standard operating procedures, repair manuals, customer interactions, codebases and other enterprise sources.
  2. Structured text. These embeddings model the relationships in a particular domain, such as the connections in structured XBRL documents for financial disclosure or among payers, items, prices and terms in an invoice.
  3. Product recommendation engines. Customer behavior embeddings correlate patterns in customer interactions and purchases to refine product recommendations.
  4. Code. The largest LLMs all support mechanisms for correlating code properties with questions people might ask about them. Code-specific LLMs trained on specific languages and enterprise repositories can speed code autocomplete tasks that align with enterprise and security best practices.
  5. Audio. Spotify and Pandora use models optimized to represent audio features to improve music recommendations based on listening history. Enterprise content creation tools can help shortlist audio clips for marketing and advertising campaigns.
  6. Images. Vision-specific embedding models can help search for or generate imagery based on particular artistic styles, objects, scenes or contexts.
  7. Business process models. Business process-specific embeddings can help to ask questions about business processes, identify exceptions and recommend opportunities for improvement.
  8. Graphs. Embeddings like node2vec generate vector representations of nodes on a graph to help search through graphs used in fraud detection, supply chain analytics, customer recommendations and scientific research.
  9. Data schema matching. Active metadata management tools create vector representations of data schemas and associated metadata to improve data integration, transformation and reuse.
  10. Science and medicine. Domain-specific embedding schemes for proteins, molecules and physics can accelerate scientific discovery, improve product development and identify quality control issues earlier in development.
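
As one illustration of the graph case, node2vec starts by sampling random walks over the graph and then trains a word2vec-style model on those walks as if they were sentences. The sketch below covers only the walk-sampling step and uses uniform steps, whereas node2vec biases each step with its return and in-out parameters. The graph and node names are hypothetical.

```python
import random

def random_walk(graph, start, length, rng):
    """Sample a uniform random walk of up to `length` nodes from `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

# Hypothetical payment graph as an adjacency list.
graph = {
    "alice": ["shop_a", "shop_b"],
    "bob": ["shop_b"],
    "shop_a": ["alice"],
    "shop_b": ["alice", "bob"],
}

rng = random.Random(0)  # fixed seed so runs are repeatable
walks = [random_walk(graph, node, 5, rng) for node in graph for _ in range(3)]
for walk in walks[:3]:
    print(" -> ".join(walk))
```

Nodes that frequently appear in the same walks end up with similar vectors once the walks are fed to the embedding step, which is what makes nearest-neighbor search over a graph possible.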

Metrics and leaderboards for embeddings

Numerous efforts have been made to compare the relative merits of various embedding schemes across semantic search-related tasks. In the early days, metrics focused on one specific task, such as labeling text, answering questions, summarizing documents or recommending products. Various research communities have proposed new metrics that compare the performance of various embedding schemes across multiple tasks, along with their relative sizes and speeds.

The most widespread and current efforts have focused on embeddings for unstructured text and speech to text. Less well-supported efforts measure embedding performance across audio- and visual-specific tasks. Here are some examples of these metrics:

  • Benchmarking Information Retrieval. BEIR supports metrics related to nine tasks: fact-checking, citation prediction, duplicate question retrieval, argument retrieval, news retrieval, question answering, tweet retrieval, biomedical information retrieval and entity retrieval.
  • Massive Text Embedding Benchmark. MTEB analyzes performance across eight tasks: bitext mining (finding translated sentence pairs), classification, clustering, pair classification, reranking, retrieval, semantic textual similarity and summarization. Hugging Face manages the MTEB leaderboard.
  • End-to-end Speech Benchmark. ESB compares embedding metrics used for matching speaking styles, reducing background noise and identifying punctuation across various speech data sets. These models improve transcription tools for interpreting speech across different dialects and languages and in the presence of common types of background noise. They are also useful in translating a speaker's voice into different languages and easing audio editing tasks in content production. Results are published on the ESB leaderboard.
  • Holistic Evaluation of Audio Representations. HEAR benchmarks audio classification and labeling tasks across speech, environmental sound and music. Results are published on the HEAR leaderboard.
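
Retrieval benchmarks of this kind commonly score a ranked result list with normalized discounted cumulative gain, usually cut off at the top 10 results (nDCG@10). The following is a minimal sketch of that metric, assuming one query with hypothetical document IDs and binary relevance labels. Real benchmark harnesses average the score over many queries.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: later ranks contribute less."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical labels: d1 and d4 are the relevant documents for this query.
relevance = {"d1": 1, "d4": 1}

model_a = ["d1", "d4", "d2"]  # both relevant documents ranked first
model_b = ["d1", "d2", "d4"]  # one relevant document pushed to rank 3

print("model A:", round(ndcg_at_k(model_a, relevance), 4))
print("model B:", round(ndcg_at_k(model_b, relevance), 4))
```

Comparing two embedding models on the same curated queries then reduces to comparing their average scores, alongside their size and latency.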

Embedding architectures

Researchers have explored a wide variety of techniques for creating embedding models. Earlier versions focused on convolutional and recurrent neural network architectures. Recent progress in generative AI has inspired all major LLM vendors to develop and share embedding models built on transformer architectures. Here are six examples of embedding architectures that consistently show up on the MTEB leaderboard:

  1. Sentence-BERT. Sentence transformers, such as SBERT, are based on the Bidirectional Encoder Representations from Transformers model that Google introduced in 2018, which proved better than existing approaches at capturing the context of words. These models tend to be faster and smaller than newer architectures but perform poorly by comparison.
  2. SGPT. Niklas Muennighoff at Peking University introduced a decoder-only transformer approach called SGPT that improves the fine-tuning of embeddings and the speed of processing embeddings.
  3. Generalizable T5-based Retrievers. The newer open source GTR model from Google uses a T5 model that is fine-tuned with a much larger data set.
  4. EmbEddings from bidirectional Encoder rEpresentations. E5 is a new family of embedding models from Microsoft that supports multilingual semantic search.
  5. Embed v3. Cohere's Embed v3 family of models was optimized to perform well on MTEB and BEIR but also excels in RAG and compressing raw embeddings to reduce memory and improve search quality.
  6. OpenAI text-embedding models. OpenAI calls its family of embedding models text-embedding followed by the version and size numbers. The most recent version is five times cheaper than its earlier model while achieving similar performance.

George Lawton is a journalist based in London. Over the last 30 years, he has written more than 3,000 stories about computers, communications, knowledge management, business, health and other areas that interest him.
