Building an AI-Powered Website Search Solution – Part Two

In the second part of this two-part series, we share insights on some of the technical approaches we used to build the AI search tool.

In part one of this series we explained what AI-powered search is and where it outperforms traditional search. In this article, we will pop the hood and explain how all the mechanics work together to get the best search results for our users.

One of the most powerful benefits of AI-powered search over traditional search is context. Large language models (LLMs) read content through a 'context window', the limited amount of text a model can take in at once, and use it to build a contextual understanding of that content. To make the process more efficient and cost-effective, it is best to provide the model with only the most relevant content. We make this possible by breaking down all website content into smaller, more manageable chunks, then retrieving only the chunks that are relevant to a query and passing those to the model. This approach is called Retrieval Augmented Generation, or RAG.

RAG

RAG is complex, but at its core it involves chopping content up into chunks and converting them into 'embeddings' - numerical representations of data that capture its meaning, turning human-readable words into machine-readable data. Content with similar meaning is placed close together, while content with different meaning is placed further apart. This enables the AI to understand and find data based on meaning and context, not just exact keyword matches. If you want to dive into this process there is a great article on our blog that explains it in more detail.
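To make "close together" concrete, here is a minimal sketch in Python. The three-dimensional vectors are hand-picked purely for illustration and are an assumption on our part; a real embedding model produces vectors with hundreds or thousands of dimensions, and the name nearest is hypothetical.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
# Real embedding models produce far longer vectors.
TOY_EMBEDDINGS = {
    "boots": [0.9, 0.8, 0.1],
    "footwear": [0.85, 0.75, 0.15],
    "invoice": [0.1, 0.2, 0.95],
}


def nearest(word, embeddings):
    """Return the other word whose vector lies closest to `word`."""
    target = embeddings[word]
    others = (w for w in embeddings if w != word)
    # math.dist is the straight-line (Euclidean) distance between two points.
    return min(others, key=lambda w: math.dist(target, embeddings[w]))


print(nearest("boots", TOY_EMBEDDINGS))  # prints "footwear"
```

Even though "boots" and "footwear" share no letters, their vectors sit next to each other, which is what lets a search match on meaning rather than on exact keywords.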

Because each webpage is broken down into smaller chunks (e.g. paragraphs or sections), each embedding is focused on a single topic. This makes it easier to retrieve the specific information related to a search query, and gives the user more efficient and accurate search results.
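As a rough sketch of the chunking step, the function below splits page text on blank lines and groups neighbouring paragraphs up to a size limit. The name chunk_page and the max_chars limit are assumptions for illustration, not a description of our production chunker.

```python
def chunk_page(text, max_chars=500):
    """Group a page's paragraphs into chunks of roughly `max_chars` characters.

    A single paragraph longer than `max_chars` is kept whole rather than cut
    mid-sentence, so a chunk can exceed the limit in that one case.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate  # Still fits: keep growing the current chunk.
        else:
            if current:
                chunks.append(current)
            current = paragraph  # Start a new chunk with this paragraph.
    if current:
        chunks.append(current)
    return chunks


page = "Intro about boots.\n\nSizing guide.\n\nReturns policy."
print(chunk_page(page, max_chars=20))
```

With a limit of 20 characters every paragraph becomes its own chunk. A larger limit lets short neighbouring paragraphs share a chunk, at the cost of each embedding covering more than one topic.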

If we stop here, we have a functional AI-powered search that is already better than a traditional search. However, we need to take this further to optimise as much as possible and increase accuracy even more. There are three more steps we can (and typically do) take:

Remove repetitive elements to increase the quality of our data
Remove any duplicate content to keep our database clean
Refine results by re-ranking them to increase accuracy even further

Removing Repetitive Elements

The quality of our data directly affects the accuracy of the search results we provide. By having cleaner data, we can achieve more accurate retrieval of relevant content when users search. Websites often contain a significant amount of repetitive content. The headers and footers found on each page are good examples of this.

This repetitive content appears on every page, which increases the data size and may mislead our embeddings. To clean up the data, it is necessary to implement a mechanism that identifies and stores these repetitive elements in the database only once. These elements can then be ignored on subsequent pages, leaving only unique body content. This approach ensures that we don’t get the same embeddings returned over and over again for a particular search term and instead we get a more diverse array of embedding data.
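One simple way to find repetitive elements is to count how many pages each block of text appears on. The sketch below assumes each page has already been split into text blocks; the function name split_boilerplate and the 80% cut-off are illustrative assumptions.

```python
from collections import Counter


def split_boilerplate(pages, min_share=0.8):
    """Separate blocks repeated across most pages from each page's unique content.

    `pages` maps a URL to its list of text blocks. A block counts as
    repetitive when it appears on at least `min_share` of all pages.
    """
    # Count each block once per page, however often it repeats within that page.
    page_counts = Counter(
        block for blocks in pages.values() for block in set(blocks)
    )
    threshold = min_share * len(pages)
    repeated = {block for block, count in page_counts.items() if count >= threshold}
    # Keep only the blocks that are not repetitive, preserving their order.
    unique = {
        url: [block for block in blocks if block not in repeated]
        for url, blocks in pages.items()
    }
    return repeated, unique


pages = {
    "/a": ["Header", "About boots", "Footer"],
    "/b": ["Header", "Sizing", "Footer"],
    "/c": ["Header", "Returns", "Footer"],
}
repeated, unique = split_boilerplate(pages)
print(sorted(repeated))  # ['Footer', 'Header']
print(unique["/a"])      # ['About boots']
```

The repeated set can be stored once, while only the unique blocks of each page go on to be chunked and embedded.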

Removing Duplicate Content

As with repetitive content, we also need to address duplicate content stored in the database. First, a quick explainer on how duplicate content gets into the database in the first place:

In order to store website data, we utilise a crawler that visits each web page. However, the crawler does not differentiate between different URLs that contain the same data. For instance, URLs like https://example.com and https://example.com/index most likely have identical data, resulting in the duplication of this data in our database.

One approach to overcome this is to compare the embeddings of two documents using a mathematical technique called cosine similarity. We won't go into the details of how this mathematical approach works here, but essentially we “score” how similar embeddings are to each other - if the similarity score is close to a perfect match, the content is likely identical, and therefore one of the embeddings can be removed. By using mathematics to perform this function, we save a huge amount of time and computing resources, making this a viable strategy for keeping our database clean and relatively noise-free.
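For the curious, a minimal sketch of this scoring step is below. The embeddings are made-up three-number vectors, the 0.99 threshold and the /contact URL are assumptions for illustration, and the code assumes no embedding is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Score two non-zero vectors from -1 to 1; 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def deduplicate(documents, threshold=0.99):
    """Keep the first document from each group of near-identical embeddings.

    `documents` is a list of (url, embedding) pairs.
    """
    kept = []
    for url, embedding in documents:
        # Keep a document only if it is not a near match of one already kept.
        if all(cosine_similarity(embedding, other) < threshold for _, other in kept):
            kept.append((url, embedding))
    return kept


documents = [
    ("https://example.com", [0.2, 0.9, 0.4]),
    ("https://example.com/index", [0.2, 0.9, 0.4]),    # same content, different URL
    ("https://example.com/contact", [0.9, 0.1, 0.3]),  # hypothetical distinct page
]
print([url for url, _ in deduplicate(documents)])
```

This version compares every document against every document already kept, which is fine for a sketch; a vector database would normally handle the nearest-match lookup on a larger site.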

Refining Results with Re-Ranking

This step is certainly optional, but if performed can have a powerful impact. We essentially cast a wider net to collect as many relevant embeddings as possible, then rank them all by relevance, resulting in a far more relevant list of embeddings to feed to the AI.

For this to work, we need more results than the user’s initial query alone would return. To get them, we use the AI to generate a few alternative queries, with the same meaning but using different words or structure. For example, let’s say a user’s query is "best hiking boots". Our AI might generate these three additional alternatives:

"Top-rated trekking footwear"
"Durable and waterproof hiking boots"
"Budget-friendly hiking boots with good performance"

We then retrieve results for all four of these queries - the user’s original query, and the three AI-generated queries. This wider net catches a much larger number of embeddings. We then compare the relevance of all of these back to the user’s original query, and rank them accordingly, only returning the top matches.
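The flow above can be sketched as follows. The query generator, search and scoring functions are passed in as parameters, and the toy versions here (fixed alternative queries, word-overlap matching) are stand-ins we made up for illustration; in practice these would be an LLM call, an embedding search and a relevance model.

```python
def retrieve_and_rerank(query, generate_alternatives, search, score, top_k=3):
    """Search with the original query plus alternatives, then rank by the original."""
    queries = [query] + list(generate_alternatives(query))
    candidates = []
    seen = set()
    for q in queries:
        for chunk in search(q):
            if chunk not in seen:  # Drop chunks returned by more than one query.
                seen.add(chunk)
                candidates.append(chunk)
    # Relevance is always measured against the user's original query.
    ranked = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_k]


# Toy stand-ins (assumptions for this sketch, not real services).
CHUNKS = [
    "Our best hiking boots for rocky trails",
    "Top-rated trekking footwear reviewed",
    "Budget-friendly hiking boots with good performance",
    "Office opening hours",
]


def words(text):
    return set(text.lower().split())


def toy_alternatives(query):
    return [
        "Top-rated trekking footwear",
        "Budget-friendly hiking boots with good performance",
    ]


def toy_search(query):
    return [chunk for chunk in CHUNKS if words(query) & words(chunk)]


def toy_score(query, chunk):
    return len(words(query) & words(chunk))


results = retrieve_and_rerank(
    "best hiking boots", toy_alternatives, toy_search, toy_score, top_k=2
)
print(results)
```

In this toy run the alternative queries pull in the trekking footwear chunk that the original query misses, and the final ranking then keeps only the two chunks that best match what the user actually typed.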

While this step improves the overall retrieval process and provides more accurate data, there are two considerations to be aware of:

This approach may have an impact on the overall performance, as the use of the AI to generate additional queries can potentially increase the processing time of the retrieval process
Using the AI to generate queries may also involve additional costs

Conclusion

All of the above involves some fairly complex logic and mathematics happening behind the scenes, and there is even more to this process that we won’t go into here. However, what makes this most exciting is the fluid nature of this emerging technology.

As our AI-powered search solution continues to evolve, we are constantly exploring new advancements in AI to further enhance our search capabilities. At Revium, we are committed to delivering better user experiences and meeting the evolving needs of our users through innovative and cutting-edge AI technologies, and our AI-powered search is no exception. Watch this space for future updates on how our approach has evolved. We hope this gives you a renewed appreciation for exactly what goes into getting you fast and accurate search results the next time you are shopping for a pair of hiking boots.

Have a project or idea you think could use some AI expertise? Contact us today.
