LLMs and Embeddings 101: Unleash the Power of GPT-4 with Unbounded Long-term Memory

An introductory-level catch-up for anyone interested in going beyond ChatGPT

Javier Toledo
The Agile Monkeys’ Journey

--

Generated with Midjourney. Prompt: a galaxy of embeddings inside of a robot mind.

In the dynamic world of Artificial Intelligence, the tools and concepts we use are continually evolving. Language models, an integral part of this landscape, have grown in complexity and capability, resulting in the creation of Large Language Models (LLMs) such as GPT-4. Alongside LLMs, the concept of semantic embeddings has emerged, unlocking new ways to represent and utilize information. This article serves as a guide, detailing the essentials of LLMs and embeddings, and showcasing how they can be leveraged to extend the power of AI models, providing them with a form of unbounded long-term memory.

What is a Large Language Model (LLM)?

A Large Language Model is a type of AI model designed to generate human-like text. These models are trained on vast amounts of text data, enabling them to predict the next word in a sentence given the context of the previous words. The “largeness” in LLMs refers to both the size of the training data and the complexity of the model, which often involves billions, or even trillions, of parameters (the weights of the individual connections inside the model). GPT-4, for example, leverages transformer-based architectures to process and generate text in a contextually and semantically rich way.

Thanks to the expansive and high-quality training data they employ, these models have acquired a certain level of problem-solving and reasoning capabilities. These skills are inherited from the vast array of language and literature they’ve been trained on. Consequently, they offer innovative ways to process and comprehend documents and data. Moreover, they provide more user-friendly and intuitive interfaces for human interaction.

What does it mean to “train” a model?

Training a model, such as GPT-4, is an intricate and resource-intensive process. Drawing inspiration from the workings of biological brains, a large AI model is a network comprised of millions of nodes (or artificial neurons), connected by billions of weighted links. The training process necessitates the calculation of a specific set of weights to make the neural network function in the desired manner.

To illustrate, the fundamental objective of an LLM like GPT-4 is, quite simply, to predict the most sensible next word given a piece of text. To accomplish this, OpenAI engineers utilized a substantial corpus of text examples, presenting each document to the network and systematically adjusting the weights until the correct words began to surface.

Training a large language model is a remarkably complex undertaking, necessitating months of computation on specialized hardware and finely honed algorithms. Hence, it’s not a venture that just any organization can embark upon. To harness the power of LLMs using your own data — a prevalent use case — it’s more common not to train a custom model, but to utilize an existing model coupled with a database of embedding-indexed data.

What are Embeddings?

Embeddings are numerical representations that encapsulate the semantic meanings and relationships among words, paragraphs, or even entire documents. While that may sound complex, consider this analogy: The word “apple” is semantically closer to the word “fruit”, signifying that an apple is a type of fruit, a relationship our brains naturally discern. However, we view the word “apple” as semantically distant from the word “car”. Embeddings provide a mathematical depiction of these meanings, maintaining the relative semantic distances as we cognitively perceive them. But how? By representing them as vectors or points in high-dimensional space. In this context, “apple” would be positioned closer to “fruit” and further from “car”, thereby reflecting the semantic relationships among these words. Specialized models are used to generate these embeddings.

In mathematical terms, a vector can be understood as a collection of coordinates that represent a point in space, with as many coordinates as the vector has dimensions. For instance, a vector of three numbers could denote a specific point in space, such as latitude, longitude, and altitude. Embeddings function on a similar principle, but typically involve thousands of dimensions, thus creating a high-dimensional space. Even though we cannot visualize such a space in a way that’s comprehensible to humans, we can still manipulate it mathematically: for example, we can calculate the distance between two pieces of text, which gives a numerical measurement of how closely their meanings are related.
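As a minimal illustration (using made-up three-dimensional vectors rather than real embeddings), cosine similarity is one common way to measure how related two vectors are:

```python
# A minimal sketch of measuring "semantic distance" between vectors.
# The three toy vectors below are made up for illustration; real embeddings
# have thousands of dimensions and come from an embedding model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Values near 1.0 mean very related; values near 0 mean unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

apple = np.array([0.9, 0.1, 0.0])
fruit = np.array([0.8, 0.2, 0.1])
car   = np.array([0.0, 0.1, 0.95])

print(cosine_similarity(apple, fruit))  # high: semantically close
print(cosine_similarity(apple, car))    # low: semantically distant
```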

Generating Embeddings for custom data

Embeddings can be created with specialized AI models that are specifically designed for this purpose. OpenAI provides its own API endpoint to generate embeddings, but other models can be used as well; the important thing is to use the same model for all the documents you’re handling. The embeddings model processes a given text and outputs the corresponding high-dimensional vector.

This operation is not reversible: given an embedding, you can’t recover the original text, so the original text must be stored alongside it in order to retrieve it later. The value of generating embeddings is that once we have them for two documents, we can compare how semantically close or distant those documents are.
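As a rough sketch, this is what a call to OpenAI’s embeddings endpoint can look like with the official Python SDK; the model name here is just an example, and whichever model you choose should be used consistently across all your documents:

```python
# A hedged sketch of generating an embedding with OpenAI's embeddings endpoint,
# using the openai Python SDK (v1+). The model name is an example choice.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",  # example model; pick one and stick to it
    input="Paris is the capital of France",
)

vector = response.data[0].embedding  # a list of floats: the high-dimensional vector
print(len(vector))
```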

To enhance the efficiency of our embeddings and circumvent the size constraints inherent in embedding models, it might be necessary to partition large documents into smaller segments. The more self-contained each segment is, the better. For instance, if we were to generate embeddings for this article, we would likely create one for each individual section.
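A minimal chunking sketch might look like the following; it naively splits on blank lines and groups paragraphs up to a hypothetical character limit, whereas real projects often use token-aware splitters instead:

```python
# A minimal sketch of splitting a long document into smaller segments before
# embedding, so each piece stays within the embedding model's input limits.
def split_into_chunks(document: str, max_chars: int = 2000) -> list[str]:
    chunks, current = [], ""
    for paragraph in document.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

article = open("article.txt").read()  # hypothetical input file
for i, chunk in enumerate(split_into_chunks(article)):
    print(i, len(chunk))
```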

Specialized Databases for Embeddings

When using LLMs over extensive text collections, the embeddings generated for them must be retrieved with a special operation known as nearest-neighbour search. This operation is not supported efficiently by most conventional databases, necessitating the use of specialized databases adept at handling high-dimensional data.

There are many solutions designed for efficient storage, search, and management of vectors, such as Faiss by Meta, Milvus, Qdrant, Chroma DB, or Pinecone, among others.

For more traditional setups, it is possible to use PostgreSQL with the pgvector extension, which can be convenient if you’re planning to keep all your data in the same place, or to use the advanced vector search features of indexing engines such as Elasticsearch or OpenSearch.
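As an illustration, here is a hedged sketch of the pgvector route from Python; it assumes a PostgreSQL instance with the extension available, and the table name, column names, and vector size are illustrative choices rather than anything prescribed:

```python
# A hedged sketch of storing and searching embeddings in PostgreSQL with the
# pgvector extension. Assumes the extension is installed on the server and
# that the connection string, table, and 1536-dimension size are placeholders.
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
register_vector(conn)  # lets psycopg2 send/receive numpy arrays as vectors

cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(1536)
    )
""")

# Insert a document together with its (previously generated) embedding
cur.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
    ("Paris is the capital of France", np.random.rand(1536)),  # placeholder vector
)

# Nearest-neighbour search: the <-> operator orders rows by vector distance
query_embedding = np.random.rand(1536)  # in practice, the embedding of the user's query
cur.execute(
    "SELECT content FROM documents ORDER BY embedding <-> %s LIMIT 3",
    (query_embedding,),
)
print(cur.fetchall())
conn.commit()
```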

Using Embeddings for Semantic Search

Once we have generated embeddings for all our documents and inserted them in a database that supports vector searches, they can be used to perform semantic searches, a type of search that understands the searcher’s intent and the contextual meaning of terms to generate more relevant results.

Because embeddings capture the semantic meaning of words and sentences, they can be used to find the “closest” (in terms of vector distance) entries in a database to a given query or question.

For example, if we input a query like “juicy fruit”, a semantic search powered by embeddings might return entries containing “apple” or “orange”, even if the exact phrase “juicy fruit” is not present. This is because the embeddings for “juicy fruit” and “apple” are close in the high-dimensional space.
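Putting the pieces together, a toy semantic search could look like the sketch below; the documents, the query, and the model name are all just examples, and a real system would read the stored vectors from a vector database instead of recomputing them on every query:

```python
# A minimal end-to-end semantic search sketch: embed a handful of documents and
# a query with the same (example) model, then rank the documents by similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(response.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

documents = [
    "An apple a day keeps the doctor away",
    "Oranges are rich in vitamin C",
    "The new car has a powerful engine",
]

doc_vectors = [embed(doc) for doc in documents]
query_vector = embed("juicy fruit")

scores = [cosine(query_vector, v) for v in doc_vectors]
best_match = documents[int(np.argmax(scores))]
print(best_match)  # likely a fruit sentence, even though "juicy fruit" never appears
```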

In-Context Learning and Extending the LLM’s Long-Term Memory

A key aspect of LLMs like GPT-4 is their ability to generate text that is contextually relevant. However, these models traditionally do not have long-term memory, meaning they don’t remember information beyond the current context window (a few thousand tokens for GPT-4).

To expand this capacity and give the model a form of “long-term memory,” we can perform semantic searches for any user input, select the closest documents, and inject them as part of the model prompt to augment its knowledge of the matter before answering. This technique is known as in-context learning, because instead of training the model with new data, we inject relevant information as part of the prompt within the model’s context window.
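A simplified sketch of this flow is shown below; the retrieved passages are hard-coded here for brevity, but in practice they would come from the semantic search described above, and the model name is only an example:

```python
# A hedged sketch of in-context learning: passages retrieved by a semantic
# search (hard-coded here as `retrieved`) are injected into the prompt so the
# model can answer using knowledge that is not in its training data.
from openai import OpenAI

client = OpenAI()

retrieved = [
    "Paris is the capital of France.",
    "The Eiffel Tower was completed in 1889.",
]
question = "Tell me about famous landmarks in the French capital."

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join(f"- {passage}" for passage in retrieved) +
    f"\n\nQuestion: {question}"
)

response = client.chat.completions.create(
    model="gpt-4",  # example model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```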

For instance, if we’ve previously generated an embedding for a phrase like “Paris is the capital of France”, we can feed this embedding into the model when we ask it about French landmarks. Even though “Paris is the capital of France” is not in the immediate context, the model will still have this information available thanks to the injected embedding.

This method enables the model to recall and use information beyond its standard context window. This approach is particularly useful in applications where the model needs to maintain consistency or recall information over a large set of documents like a company’s Wiki or the documentation of a technical tool.

Conclusion

The development of LLMs like GPT-4 has been a major stride in AI and NLP with important implications at many levels, but when organizations want to use them with their own data, training a dedicated LLM from scratch is probably not the best solution. With the power of embeddings, we can store and use semantic information in previously unfeasible ways, allowing us to “teach” commercially available LLMs to work with data outside of their training corpus without having to re-train them.

This technique makes it possible for any organization to build specialized LLM-powered tools that take advantage of private or highly specific data and knowledge.

Do you want to know more? Visit our ‘Open Kitchen’ page to find everything we’ve learned on our AI journey. Don’t hesitate to contact us if you need help!

--

Cofounder and CTO at The Agile Monkeys. Co-creator of the Booster Framework. Breaking cutting-edge technology remotely from the beautiful Canary Islands.