https://www.techtarget.com/searchenterpriseai/tip/Embedding-models-for-semantic-search-A-guide
Embedding models for semantic search transform data into more efficient formats for symbolic and statistical computer processing. A type of neural network, an embedding model takes advantage of innovations in generative AI, vector databases and knowledge graphs to better grasp the connections between words and ideas. This enables more precise concept matching, compared to traditional keyword-matching approaches. This semantic capability makes embedding models especially useful in search engines, data analytics, customer support chatbots, recommendation engines and business process analysis tools.
Embedding models for semantic search are used by developers and data scientists tasked with building better apps and experiences. Deciding which embedding model is best for the project at hand is an important first step. Data scientists and developers might explore the speed, size and accuracy of various embedding models for a particular task. These metrics not only test the performance of available embedding models, but can also give feedback to improve a new model's performance by using curated data sets of questions and relevant responses for common tasks.
Embedding models are often hidden inside complex services and applications. For example, embedding models play a central role in Google Search, but their details are not published, to keep low-quality content mills from gaming search results. Likewise, many enterprise apps for knowledge graphs, active metadata management, graph databases and process intelligence use embeddings that vendors do not publicize in order to maintain a competitive advantage.
Embedding models use a statistical approach to model connections between similar types of content. For example, in a word-embedding model, the terms queen and king might be statistically close to chief or president but further apart on the feature of gender. The approach is not limited to words. Innovations in generative AI can also create embedding models that describe similarity across sentences, paragraphs and larger documents. These help tease apart distinctions like the difference between "man bites dog" and "dog bites man."
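To make the idea of statistical closeness concrete, here is a minimal sketch that compares word vectors using cosine similarity, a common way to measure how closely two embeddings point in the same direction. The three-dimensional vectors and the meaning assigned to each axis are invented for illustration. A real model learns hundreds or thousands of dimensions, and its axes rarely map to a single human-readable feature.

```python
import math

# Hypothetical vectors, invented for illustration only.
# Assumed axes: [royalty, leadership, gender], with gender negative
# for male and positive for female.
TOY_VECTORS = {
    "king": [0.9, 0.8, -0.5],
    "queen": [0.9, 0.8, 0.5],
    "president": [0.2, 0.9, 0.0],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king, queen, president = (TOY_VECTORS[w] for w in ("king", "queen", "president"))

# Overall similarity: all three words share the leadership feature.
print("king vs. queen:", round(cosine_similarity(king, queen), 3))
print("king vs. president:", round(cosine_similarity(king, president), 3))

# A single feature can still separate words that are close overall.
print("gender gap, king vs. queen:", abs(king[2] - queen[2]))
```

In this toy setup, king and queen score as similar overall while still differing sharply on the one axis that stands in for gender.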
Embedding models are trained on especially large data sets to learn the patterns and relationships in text. When a trained model processes new data, it analyzes the text and generates an embedding vector: a list of numbers that places the text at a point in a multidimensional space. Most models produce vectors with a fixed number of dimensions.
Some of the best embedding models use thousands of dimensions, which seems like a lot. But this helps capture the complexity and subtleties found across common speech, abstract poetry and the descriptions of innovations in scientific research.
Some embedding models let you specify the number of dimensions. For example, one of OpenAI's newest embedding models -- text-embedding-3-large -- lets you request shorter vectors, such as 256 or 1,024 dimensions, instead of the full 3,072. A higher number of dimensions generally results in a better semantic search score, but the larger vectors require more memory and take longer to compare. Low-dimensional embeddings are less computationally expensive.
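The memory side of this tradeoff is simple arithmetic, sketched below. The sketch assumes each value is stored as a 4-byte float and ignores any overhead a vector database adds. The `shorten` helper is a hypothetical illustration of cutting a vector down by keeping its leading values and rescaling to unit length. Whether that preserves search quality depends on how the model was trained, so treat it as something to validate on your own data.

```python
import math

BYTES_PER_FLOAT32 = 4  # assumption: values stored as 32-bit floats


def index_size_bytes(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for the vectors alone, ignoring index overhead."""
    return num_vectors * dimensions * bytes_per_value


def shorten(vector, dimensions):
    """Keep the first `dimensions` values and rescale to unit length."""
    head = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


for dims in (256, 1024, 3072):
    gib = index_size_bytes(1_000_000, dims) / 2**30
    print(f"{dims:>5} dimensions: {gib:.2f} GiB for one million vectors")
```

Storage grows linearly with the dimension count, so moving from 256 to 3,072 dimensions multiplies the raw footprint by 12.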
Encoding models apply a structural, logical and mathematical approach to modeling connections. For example, database languages and schemas help structure business data for efficient processing or analysis. Knowledge graphs, ontologies, taxonomies and digital twins can also model the relationships between more complex types of information. An example is eXtensible Business Reporting Language (XBRL), which structures business reporting information for more efficient comparison, analysis and reuse.
Embedding models are generally slower and less precise than encoding models because they approximate relationships statistically rather than following explicit rules, but they are more flexible and adaptable. Hence, they can structure unstructured text or translate between data stored using different encoding schemes. Aligning data silos into a relevant encoding scheme can require considerable effort and expense. The billions of dollars enterprises spend on data integration boil down to the need to synchronize data across different schemas and formats.
Embedding models promise to automate the translation across encoding schemes through techniques like active metadata management and the semantic layers used in data fabrics and data mesh architectures. This can improve data reuse and connect information silos. Conversely, encoding models can bring structure when translating across different domain-specific embedding models to improve accuracy and performance through techniques like retrieval-augmented generation (RAG).
Text embedding models have dominated R&D efforts because they can take advantage of the vast body of unstructured information on the internet, in business documents and in customer interaction data.
Other kinds of embedding models for semantic search are being developed to model patterns in images, audio, data schemas, math, science, graphs, robotics and other domains. These are sometimes used independently, such as a music or product recommendation engine based on song play history or purchasing behavior.
One big challenge is that the same embedding models used for encoding data must also be used to decode or process it. This has driven research toward larger and multimodal embedding models that are useful across more domains. However, these are a bit like Swiss Army knives, which are useful for an occasional task in a pinch but not as efficient as a dedicated screwdriver when building a house. A code-specific embedding model is faster for code autocomplete. A general-purpose embedding model could help ask questions about a code snippet or generate new snippets using a natural language prompt, but it is larger and slower.
There are many ways to bridge the gaps across embedding silos. The previously cited RAG technique can use one embedding model optimized for a particular type of semantic search and then submit the resulting text to prime a different embedding scheme in an LLM for more useful responses. Multimodal AI approaches use neuro-symbolic techniques to train an LLM across embeddings directly or to join a new embedding scheme to an existing LLM.
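The RAG flow described here can be sketched in a few lines: embed the question, retrieve the closest passages and paste them into the prompt that goes to the LLM. Everything named below is an assumption made for illustration. In particular, `toy_embed` is a bag-of-words stand-in that only matches shared words. A real system would call an actual embedding model at that step and then send the finished prompt to an LLM, which this sketch does not do.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, toy_embed(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, passages):
    """Assemble the text that would be submitted to an LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base for a customer support chatbot.
DOCUMENTS = [
    "Refunds are issued within 14 days of a returned order.",
    "Our office is closed on public holidays.",
    "Shipping to Europe takes five business days.",
]

question = "How long do refunds take?"
prompt = build_prompt(question, retrieve(question, DOCUMENTS, k=1))
print(prompt)
```

The retrieval step and the generation step are decoupled, which is why the search side can use a small, specialized embedding model while the LLM keeps its own internal representation.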
Hash collisions can be a problem when connecting too many types or modalities of data into one embedding model. This occurs when too many things get mapped to too small a space. As a result, the model represents items that should have different meanings with the same or nearly identical vectors. This can reduce accuracy and increase the risk of AI hallucination.
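The "too many things, too small a space" effect can be illustrated with an analogy. The sketch below does not show how a neural embedding model assigns vectors. It simply hashes 1,000 made-up item names into a fixed number of buckets and counts how many land in a bucket that is already occupied. The item names and bucket counts are arbitrary choices for the demonstration.

```python
import hashlib


def bucket(item, num_buckets):
    """Map a string to a bucket using a stable hash."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def count_collisions(items, num_buckets):
    """Count items that land in a bucket some earlier item already used."""
    seen = set()
    collisions = 0
    for item in items:
        slot = bucket(item, num_buckets)
        if slot in seen:
            collisions += 1
        seen.add(slot)
    return collisions


# Hypothetical item names, invented for the demonstration.
items = [f"concept-{i}" for i in range(1000)]

for num_buckets in (100, 10_000, 1_000_000):
    print(f"{num_buckets:>9} buckets: {count_collisions(items, num_buckets)} collisions")
```

With only 100 buckets, at least 900 of the 1,000 items must share a slot no matter how good the hash is. Giving the items more room is the only way to reduce that number.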
Here are some examples of how embedding models for semantic search can be applied across various modalities:
Numerous efforts have been made to compare the relative merits of various embedding schemes across semantic search-related tasks. In the early days, metrics focused on one specific task, such as labeling text, answering questions, summarizing documents or recommending products. Various research communities have proposed new metrics that compare the performance of various embedding schemes across multiple tasks, along with their relative sizes and speeds.
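As a rough illustration of how a search-quality score is computed, the sketch below calculates two widely used retrieval measures, recall at k and mean reciprocal rank, over a tiny evaluation set. The query IDs, document IDs and relevance judgments are hypothetical, and these two measures are chosen for simplicity rather than because any particular benchmark uses them.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant result, or 0 if none is found."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


# Hypothetical evaluation set: for each query, the ranking a model
# returned and the documents a human judged relevant.
RESULTS = {
    "q1": (["d3", "d1", "d7"], {"d1"}),
    "q2": (["d2", "d9", "d4"], {"d2", "d4"}),
    "q3": (["d5", "d6", "d8"], {"d0"}),
}

mean_recall = sum(recall_at_k(r, rel, 2) for r, rel in RESULTS.values()) / len(RESULTS)
mrr = sum(reciprocal_rank(r, rel) for r, rel in RESULTS.values()) / len(RESULTS)

print(f"mean recall@2: {mean_recall:.2f}")
print(f"mean reciprocal rank: {mrr:.2f}")
```

Running the same evaluation set against several embedding models, and recording each model's size and latency alongside these scores, gives a like-for-like comparison for a specific task.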
The most widespread and current efforts have focused on embeddings for unstructured text and text to speech. Less well-supported efforts measure embedding performance across audio- and visual-specific tasks. Here are some examples of these metrics:
Researchers have explored a wide variety of techniques for creating embedding models. Earlier versions focused on convolutional and recurrent neural network architectures. Recent progress in generative AI has inspired all major LLM vendors to develop and share embedding models built on transformer architectures. Here are six examples of embedding architectures that consistently show up on the MTEB leaderboard:
George Lawton is a journalist based in London. Over the last 30 years, he has written more than 3,000 stories about computers, communications, knowledge management, business, health and other areas that interest him.
22 Nov 2024