It’s been almost a year since a new generation of artificial intelligence took the world by storm. The capabilities of these new generative AI tools, many of which are powered by large language models (LLMs), have forced every business and employee to rethink the way they work. Is this new technology a threat to their work, or a tool that could boost their productivity? If you don’t know how to get the most out of GenAI, will you be outclassed by your peers?
This paradigm shift has placed a double burden on engineering and technology leaders. First, there is the internal demand to understand how your organization will adopt these new tools and what you need to do to avoid falling behind your competitors. Second, if you sell software and services to other companies, you’ll find that many have paused spending on new tools while they figure out exactly what their approach should be in the GenAI era.
There’s a lot of hype, and it can be exhausting trying to figure out where to direct your resources. Before you can dive into the details of what to do with the text or images your GenAI creates, you need a solid foundation to ensure it works properly. To help you, we’ve identified four key areas you’ll need to understand to get the most out of the time and resources you invest.
- Vector databases
- Embedding models
- Retrieval-augmented generation (RAG)
- Knowledge bases
These are almost certainly foundational parts of your AI stack, so read below to learn about the four pillars needed to effectively add GenAI to your organization.
Vector databases
To use a large language model, you will need to vectorize your data. This means the text you feed into the model is reduced to arrays of numbers, and each array is a vector on a map, albeit one with thousands of dimensions. Finding similar text boils down to measuring the distance between two vectors. This lets you move from the old-fashioned approach of lexical keyword search (typing in a few terms and getting results that share those keywords) to semantic search: typing a query in natural language and getting an answer that accounts for context, so a coding question about Python is understood to refer to the programming language, not the giant snake.
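To make the idea concrete, here is a minimal sketch of semantic search in Python. The three-dimensional vectors and the two documents are made-up stand-ins; a real embedding model would produce vectors with hundreds or thousands of dimensions.

```python
# A minimal sketch of semantic search: embeddings are arrays of numbers,
# and "similar meaning" becomes "small distance between vectors".
# These vectors are invented for illustration; a real embedding model
# would produce ~768-1,536 dimensions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means similar direction/meaning."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.12, 0.87, 0.33])  # pretend embedding of the query
doc_vecs = {
    "How do I slice a list in Python?": np.array([0.10, 0.90, 0.30]),
    "Care and feeding of pythons":      np.array([0.80, 0.05, 0.40]),
}

# Rank documents by similarity to the query instead of keyword overlap.
for text, vec in sorted(doc_vecs.items(),
                        key=lambda kv: cosine_similarity(query_vec, kv[1]),
                        reverse=True):
    print(f"{cosine_similarity(query_vec, vec):.3f}  {text}")
```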
“Traditional data structures, typically organized in structured tables, often fail to capture the complexity of the real world,” explains Philip Vollet of Weaviate. “Enter vector embeddings. These embeddings capture features and representations of data, allowing machines to understand, abstract, and compute on this data in sophisticated ways.”
How do you choose the right vector database? In some cases, it may depend on the technology stack your team already uses. Stack Overflow opted for Weaviate in part because it allowed us to continue using PySpark, which was the original choice for our OverflowAI efforts. On the other hand, you may have a database provider, like MongoDB, that has served you well. Mongo now includes vector search as part of its OLTP database, making it easy to integrate with your existing deployments. Expect this to become standard for database vendors in the future. As Louis Brady, VP of Engineering at Rockset, explains, most businesses will find that a hybrid approach, combining a vector database with your existing system, gives you the greatest flexibility and the best results.
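As a rough illustration of what querying a vector database looks like, here is a sketch using the Weaviate v3 Python client (the v4 client has a different API). The `Question` class, its properties, and the local endpoint are hypothetical, and `with_near_text` assumes a vectorizer module is configured on the Weaviate instance.

```python
# A minimal sketch using the Weaviate v3 Python client (API differs in v4).
# The "Question" class and its properties are hypothetical examples.
import weaviate

client = weaviate.Client("http://localhost:8080")  # assumes a local instance

# Semantic search: let Weaviate embed the query text and return the
# nearest stored objects (requires a vectorizer module on the instance).
result = (
    client.query
    .get("Question", ["title", "acceptedAnswer"])
    .with_near_text({"concepts": ["how to merge two dicts in python"]})
    .with_limit(3)
    .do()
)
print(result["data"]["Get"]["Question"])
```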
Embedding models
How do you get your data into the vector database in a way that accurately organizes it by content? For this, you will need an embedding model: the software system that takes your text and converts it into the arrays of numbers you store in the vector database. There are many to choose from, and they vary greatly in cost and complexity. For this article, we’ll focus on embedding models that work with text, although embedding models can also be used to organize information in other types of media, like images or songs.
As Dale Markowitz wrote on the Google Cloud blog, “If you want to embed text, that is, perform a text search or similarity search on text, you’re in luck. There are tons and tons of pre-trained text embedding models that are free and readily available.” One example is the Universal Sentence Encoder, which “encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.” With just a few lines of Python code, you can prepare your data for a GenAI chatbot-style interface. If you want to go further, Dale also has a great tutorial on how to prototype a language-based application using nothing more than Google Sheets and a plugin called Semantic Reactor.
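Here is what those “few lines of Python” might look like with the Universal Sentence Encoder loaded from TensorFlow Hub. This sketch assumes `tensorflow` and `tensorflow_hub` are installed; the example sentences are our own.

```python
# Embed a few sentences with the Universal Sentence Encoder from
# TensorFlow Hub. Each sentence becomes a 512-dimensional vector.
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "How do I reverse a string in Python?",
    "What is the best way to invert a string in Python?",
    "My cat knocked the router off the shelf.",
]
vectors = embed(sentences)  # tensor of shape (3, 512)
print(vectors.shape)        # the first two sentences will embed close together
```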
You will need to weigh the trade-off between the time and cost of running huge amounts of text through your embedding model and how finely you slice that text, which is typically divided into chunks such as chapters, pages, paragraphs, sentences, or even individual words. The other trade-off is the precision of the embedding model: how many decimal places each number in a vector carries, since storage grows with every additional decimal place. Over thousands of vectors covering millions of tokens, this adds up. You can use techniques like quantization to shrink the vectors, but it’s best to consider the amount of data and the level of detail you need before choosing the embedding approach that’s right for you.
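To illustrate both trade-offs, here is a toy sketch: naive fixed-size chunking, plus scalar int8 quantization that cuts vector storage by 4x at some cost in precision. Neither function is a production-grade implementation.

```python
# Toy illustrations of the two trade-offs above: chunk size and
# vector precision. Simplified sketches, not a production pipeline.
import numpy as np

def chunk_text(text: str, max_words: int = 100) -> list[str]:
    """Naive fixed-size chunking; real pipelines often split on
    paragraph or sentence boundaries to keep chunks coherent."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    """Scalar quantization: store int8 instead of float32 (4x smaller),
    at the cost of some precision in distance calculations."""
    scale = np.abs(vec).max() / 127.0
    return (vec / scale).astype(np.int8), scale

vec = np.random.default_rng(0).normal(size=768).astype(np.float32)
quantized, scale = quantize_int8(vec)
print(vec.nbytes, "bytes ->", quantized.nbytes, "bytes")  # 3072 -> 768
```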
Retrieval-augmented generation (RAG)
Large AI models read the Internet to acquire knowledge. This means that they know the Earth is round… and they also know that it is flat.
One of the main problems with large language models like the ones behind ChatGPT is that they were trained on a massive corpus of text from the Internet. This means they have read a lot about how the Earth is round, but also a lot about how the Earth is flat. The model is not trained to understand which of these assertions is correct, only to determine the probability that a certain answer matches the query entered by the user. It also blends these inputs into new, statistically probable output, which is where hallucinations can occur. The model rarely declines to answer, even when it has nothing reliable to draw on, so it’s good to check its sources.
With RAG, you can limit the data set the model draws from, meaning the model will hopefully not rely on inaccurate data. Second, you can ask the model to cite its sources, allowing you to check its answer against the ground truth. At Stack Overflow, this might mean constraining queries to questions on our site with an accepted answer. When a user asks a question, the system first searches for Q&A posts that are a good match. This is the retrieval part of the equation. A hidden prompt then asks the model to do the following: synthesize a short answer for the user based on the answers found and validated by our community, then provide that brief summary along with links to the three posts that best match the user’s search.
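A sketch of that flow might look like the following. Both `search_accepted_answers` and `llm` are hypothetical stand-ins, the first for a vector-database query restricted to questions with accepted answers, the second for any chat-completion API; this is not Stack Overflow’s actual implementation.

```python
# A sketch of the RAG flow described above. search_accepted_answers() and
# llm() are hypothetical stand-ins, not real library calls.

def search_accepted_answers(question: str, top_k: int = 3) -> list[dict]:
    """Stand-in for the retrieval step (e.g., a vector database query
    limited to questions with accepted answers)."""
    raise NotImplementedError("wire this to your vector database")

def llm(prompt: str) -> str:
    """Stand-in for a call to your chat-completion API of choice."""
    raise NotImplementedError("wire this to your LLM provider")

def answer_with_rag(user_question: str) -> str:
    # 1. Retrieval: find community-validated Q&A posts that match.
    posts = search_accepted_answers(user_question, top_k=3)

    # 2. Hidden prompt: constrain the model to the retrieved material
    #    and require links back to the source posts.
    context = "\n\n".join(
        f"[{i + 1}] {p['title']} ({p['url']})\n{p['accepted_answer']}"
        for i, p in enumerate(posts)
    )
    prompt = (
        "Synthesize a short answer to the user's question using ONLY the "
        "community-validated posts below, then list links to the three "
        "best-matching posts.\n\n"
        f"Posts:\n{context}\n\nQuestion: {user_question}"
    )
    # 3. Generation: the LLM summarizes; the links provide ground truth.
    return llm(prompt)
```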
A third advantage of RAG is that it lets you keep the data the model uses up to date. Training a large model is expensive, and most of the popular models available today were trained on data with a cutoff months or even years in the past. Ask one a question about something more recent and it will happily hallucinate a convincing answer, even though it has no real information to work with. RAG allows you to point the model at a specific data set, which you can keep current without having to retrain the entire model.
RAG means that the user still gets the benefits of working with an LLM: they can ask questions in natural language and get a summary that synthesizes the most relevant information from a large data store. At the same time, relying on a predefined data set helps reduce hallucinations and gives the user links to the ground truth, so they can easily compare the model’s output with something written by humans.
Knowledge bases
As mentioned in the previous section, RAG can constrain the text your model relies on when generating its response. Ideally, this means you’re providing it with accurate data, not just a random sample of things it read on the internet. One of the most important laws of training an AI model is that data quality matters. “Garbage in, garbage out,” as the old adage goes, holds very true for your LLM. Give it poor-quality or poorly organized text, and the results will be just as uninspiring.
At Stack Overflow, we’ve been lucky when it comes to data quality. Q&A is the format most LLMs used within organizations adopt, and our dataset was already structured this way. Our question-and-answer pairs can show us which information is accurate and which doesn’t yet have a sufficient confidence score, by looking at vote counts or whether a question has an accepted answer. Votes can also indicate which of three similar answers might be most broadly applicable and therefore most valuable. Last but not least, tags allow the system to better understand how different pieces of information in your dataset relate to one another.
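As a toy example of using those community signals, here is a sketch that filters a Q&A dataset down to high-confidence pairs before handing it to a retrieval system. The field names and vote threshold are illustrative assumptions, not our actual schema.

```python
# A toy sketch of filtering a Q&A dataset on community signals before
# feeding it to a RAG system. Field names and threshold are assumptions.

qa_pairs = [
    {"question": "How do I parse JSON in Python?",
     "answer": "Use json.loads()...", "votes": 152, "accepted": True,
     "tags": ["python", "json"]},
    {"question": "How do I parse JSON in Python?",
     "answer": "Try eval()...", "votes": -4, "accepted": False,
     "tags": ["python", "json"]},
]

def high_confidence(pairs: list[dict], min_votes: int = 5) -> list[dict]:
    """Keep only answers the community has validated: accepted and above
    a vote threshold. Tags stay attached so the retrieval layer can
    relate entries in the dataset to one another."""
    return [p for p in pairs if p["accepted"] and p["votes"] >= min_votes]

for p in high_confidence(qa_pairs):
    print(p["votes"], p["question"])
```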
Learn more about how Stack Overflow for Teams helps the world’s largest companies share knowledge and lay the foundation for an AI future.