Improve RAG Output With Enhanced Retrieval Techniques

NBD Lite #49: Techniques to use for the RAG retrieval process

Jan 31, 2025

All the code used here is present in the RAG-To-Know repository.

Retrieval-augmented generation (RAG) is a system that combines dynamic data retrieval with LLM generation to produce accurate responses. This means the quality of RAG output depends on both the retrieval and the generation parts.

In the previous article, we discussed how to use reranking to select better documents to pass into the generation model.

Related post from Non-Brand Data: RAG Reranking To Elevate Retrieval Results, by Cornellius Yudha Wijaya

Reranking, however, is performed after the documents have been retrieved. It is useful, but we get better results if the retrieved documents are more relevant to begin with.

In this article, we will discuss how to enhance the query input itself, improving it before it is passed to retrieval, to get better retrieval results.

The diagram below summarizes what we will discuss today, so let’s get into it.

Introduction

We have learned how to develop a simple RAG implementation, especially how to set up retrieval with a vector database. However, a naive implementation might not be enough for a RAG system in production.

A few challenges with a simple retrieval process include:

Limited Recall: Important documents may be overlooked due to incomplete queries.
Irrelevant or Duplicate Context: Results can include repetitive or misaligned information that misleads the generation stage.
Poor Query Expansion: Ambiguous or sparse queries often return incorrect or overly broad search results.
Inadequate Handling of Complex Queries: Single-pass retrieval struggles with queries that require multiple steps or logical reasoning.

These challenges are why we need to improve the query and, with it, the retrieval result.

Enhanced retrieval techniques address these shortcomings by refining how documents are retrieved. Many techniques can be used, including:

Entity-Aware Retrieval: This technique leverages named entities (e.g., people, locations, organizations) to fine-tune search results, leading to more domain-relevant context.
Hybrid Sparse-Dense Retrieval: Combines keyword-based (sparse) and embedding-based (dense) searches, balancing precision and recall for improved relevance.
Multi-Step Document Retrieval: Conducts retrieval in stages, refining and filtering results iteratively for complex or ambiguous queries.
Hypothetical Document Embedding (HyDE): Generates a pseudo-document from the user’s query before actual retrieval, capturing more nuanced intent and enhancing overall accuracy.

Many other techniques can enhance the retrieval process, but let’s focus on these four.

Let’s implement each of them.

Enhanced Retrieval Techniques

In this tutorial, I will assume you already understand how to build a Simple RAG system. If you need a refresher, you can always refer to the article below.

Related post from Non-Brand Data: Simple RAG Implementation With Contextual Semantic Search, by Cornellius Yudha Wijaya
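The code in this article calls helpers from that earlier setup. One example is `semantic_search(query, top_k)`, whose result is read as `results['documents'][0]`, the nested layout a Chroma `collection.query` call returns. To experiment without a vector database, here is a hypothetical, dependency-free stand-in. It scores chunks with bag-of-words cosine similarity instead of real embeddings and returns the same nested shape. The `chunks` list and all function bodies are illustrative assumptions, not the original implementation.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for the PDF chunks from the earlier article
chunks = [
    "Car insurance covers damage to your vehicle and liability for accidents.",
    "Health insurance pays for medical expenses and hospital stays.",
    "A car loan lets you finance the purchase of a vehicle over several years.",
    "Home insurance protects your house against fire and theft.",
]

def embed(text):
    """Toy 'embedding': lowercase bag-of-words counts (not a neural embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_search(query, top_k=2):
    """Hypothetical stand-in for semantic_search, returning a Chroma-style nested result."""
    query_vec = embed(query)
    ranked = sorted(
        ((cosine(query_vec, embed(chunk)), i) for i, chunk in enumerate(chunks)),
        reverse=True,
    )[:top_k]
    return {
        "documents": [[chunks[i] for _, i in ranked]],
        "metadatas": [[{"source": "pdf", "chunk_id": i} for _, i in ranked]],
    }

results = semantic_search("What is the insurance for car?")
print(results["documents"][0])
```

Swapping this stand-in for the real embedding-based helper changes only the scoring. The result layout that the later functions index into stays the same.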

Let’s explore our first technique.

1. Entity-Aware Retrieval

Entity-aware retrieval is a retrieval strategy that focuses on leveraging recognized entities (such as names of people, organizations, products, or locations) within a query to guide and refine the search process.

This technique pinpoints specific identifiers in the text and uses them to produce more relevant and accurate search results, especially when the query contains ambiguous terms.

Some advantages of using this technique include:

Contextual Query Expansion: it boosts retrieval by adding words specific to the input.
Domain-Specific Applications: it captures the domain’s unique terminology and avoids confusion from generic words.

The technique is easy to implement, as the following code shows.

# semantic_search(query, top_k) is the helper built in the Simple RAG article
def entity_aware_retrieval(query, entities, top_k=2):
    # Append the entity names to the query text
    enriched_query = f"{query} Entities: {', '.join(entities)}"

    # Perform semantic search with the enriched query
    results = semantic_search(enriched_query, top_k)
    return results

query = "What is the insurance for car?"
entities = ["car", "insurance"]

# Entity-aware retrieval
entity_results = entity_aware_retrieval(query, entities)

The entity list is appended directly to the query. You can always expand the list with more entities to get more specific results.
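In the example above, the entities are written by hand. In practice they would come from a named-entity recognizer (spaCy's `doc.ents`, for instance) or from a domain dictionary. This sketch assumes the simpler option: a hypothetical hand-written gazetteer that maps surface forms, including synonyms, to canonical entity names. It then builds the same enriched query string as `entity_aware_retrieval`. The gazetteer contents are made up for illustration.

```python
import re

# Hypothetical domain gazetteer: surface form -> canonical entity name
GAZETTEER = {
    "car": "car",
    "vehicle": "car",
    "automobile": "car",
    "insurance": "insurance",
    "policy": "insurance",
}

def extract_entities(query, gazetteer=GAZETTEER):
    """Return canonical entities mentioned in the query, in order, without duplicates."""
    entities = []
    for token in re.findall(r"[a-z]+", query.lower()):
        entity = gazetteer.get(token)
        if entity and entity not in entities:
            entities.append(entity)
    return entities

def enrich_query(query, entities):
    """Append entities in the same format entity_aware_retrieval uses."""
    return f"{query} Entities: {', '.join(entities)}" if entities else query

query = "How much is insurance for my automobile?"
entities = extract_entities(query)
print(entities)                        # → ['insurance', 'car']
print(enrich_query(query, entities))
```

The resulting list can be passed straight to `entity_aware_retrieval(query, entities)`. A gazetteer only finds entities it already knows. An NER model generalizes better but adds a dependency.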

Try experimenting with this method to see if it improves your retrieval result.

2. Hybrid Sparse-Dense Retrieval

As the name implies, Hybrid Sparse-Dense Retrieval is a technique that combines the strengths of sparse (keyword-based) and dense (embedding-based) retrieval systems.

By merging these two approaches, hybrid retrieval methods aim to overcome the limitations of each individual approach, resulting in more accurate and comprehensive search outcomes.

Combining both approaches brings several benefits:

Reduced Blind Spots: each method complements the other. Sparse retrieval catches exact keyword matches, and dense retrieval catches paraphrases and semantically related text.
Enhanced Recall and Precision: combining the two typically increases recall (fewer relevant documents are missed) without sacrificing overall precision.
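To see what the sparse side contributes, here is a minimal Okapi BM25 scorer written from scratch. It sketches what `BM25Okapi.get_scores` computes later in this section; it is not a copy of the `rank_bm25` library. It uses the common values k1 = 1.5 and b = 0.75 and the non-negative IDF variant log(1 + (N - n + 0.5) / (n + 0.5)), so its numbers may differ from `rank_bm25`'s. The toy documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for a tokenized query.

    IDF uses the non-negative variant log(1 + (N - n + 0.5) / (n + 0.5)),
    where N is the number of documents and n the number containing the term.
    """
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: how many documents contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue  # exact-match model: absent terms contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation (k1) and document-length normalization (b)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

docs = ["car insurance policy", "health insurance", "car loan", "home loan"]
scores = bm25_scores(["car", "insurance"], [doc.split() for doc in docs])
print(scores)
```

The first document matches both query terms and scores highest. "home loan" scores exactly zero because sparse retrieval cannot match a word it never sees, and that gap is what the dense side fills.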

Typically, there are three methods we can use to integrate the sparse and dense results:

Score Merging: Compute separate relevance scores (one from the sparse retriever, one from the dense retriever) and combine them—often via weighted sum or rank fusion—to produce a final ranking.
Pipeline Approaches: Use one retriever as a first pass (e.g., BM25 for speed) to generate a candidate set and then re-rank using a dense model for better semantic precision.
Parallel Retrieval & Intersection: Retrieve in parallel from both systems, then combine (or intersect) the results to ensure coverage of both lexical and semantic matches.
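Score merging is often done with Reciprocal Rank Fusion (RRF), which needs only each retriever's ranking, not comparable scores. A document's fused score is the sum of 1 / (k + rank) over every list it appears in, with k = 60 a common choice. Below is a generic RRF function over ranked lists of chunk IDs; the two example rankings are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over every list it appears in,
    where rank starts at 1. Higher fused scores rank first.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by document ID for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical chunk IDs returned by each retriever, best match first
dense_ranking = [3, 7, 1]
sparse_ranking = [7, 2, 3]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print([doc_id for doc_id, _ in fused])   # → [7, 3, 2, 1]
```

Chunks 7 and 3 appear in both lists, so they outrank chunks that only one retriever found. A weighted sum of normalized scores is the other common merging option, but it requires calibrating the two score scales first.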

Let’s try parallel retrieval for the integration process with the following code:

import numpy as np
from rank_bm25 import BM25Okapi

# Assumes `chunks` (the list of text chunks) and semantic_search() from the Simple RAG setup
def hybrid_retrieval(query, top_k=2):
    # Dense retrieval using embeddings
    dense_results = semantic_search(query, top_k)

    # Sparse retrieval using BM25
    # Tokenize the chunks (lowercase, split on whitespace)
    tokenized_chunks = [chunk.lower().split() for chunk in chunks]

    # Initialize BM25 (for a large corpus, build this index once outside the function)
    bm25 = BM25Okapi(tokenized_chunks)

    # Tokenize the query the same way as the chunks
    tokenized_query = query.lower().split()

    # Get BM25 scores for the query
    bm25_scores = bm25.get_scores(tokenized_query)

    # Get the top_k indices based on BM25 scores, highest score first
    sparse_indices = np.argsort(bm25_scores)[-top_k:][::-1]

    # Combine results (dense first, then sparse; a chunk found by both appears twice)
    combined_results = {
        "documents": dense_results['documents'][0] + [chunks[i] for i in sparse_indices],
        "metadatas": dense_results['metadatas'][0] + [{"source": "pdf", "chunk_id": int(i)} for i in sparse_indices]
    }
    return combined_results

hybrid_results = hybrid_retrieval(query)

This returns both the dense and sparse retrieval results. We can process them further (for example, by removing duplicates or reranking) or pass them directly into the generation model. You can try out all three integration strategies to see which works best for you.
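`hybrid_retrieval` simply concatenates the two result lists, so a chunk that both retrievers find appears twice. That is the duplicate-context problem from the introduction. Here is a minimal post-processing sketch that uses the chunk text as the identity key. The union keeps the first occurrence of each chunk, and the intersection keeps only the chunks both retrievers returned. The sample chunks are made up.

```python
def union_results(dense_docs, sparse_docs):
    """Union: keep the first occurrence of each chunk, dense results first."""
    merged = []
    seen = set()
    for doc in dense_docs + sparse_docs:
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged

def intersect_results(dense_docs, sparse_docs):
    """Intersection: keep only chunks both retrievers returned, in dense order."""
    sparse_set = set(sparse_docs)
    return [doc for doc in dense_docs if doc in sparse_set]

# Hypothetical retrieved chunks
dense_docs = ["car insurance covers collision damage", "premiums depend on driving history"]
sparse_docs = ["car insurance covers collision damage", "comprehensive insurance covers theft"]

print(union_results(dense_docs, sparse_docs))
print(intersect_results(dense_docs, sparse_docs))
```

With the article's `hybrid_retrieval`, the two inputs correspond to `dense_results['documents'][0]` and `[chunks[i] for i in sparse_indices]`. The union favors recall; the intersection favors precision but can come back empty.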

3. Multi-Step Document Retrieval

Multi-Step Document Retrieval is a retrieval approach that proceeds through several iterative phases rather than a single pass.

Each phase refines the query and potentially filters or re-ranks the retrieved documents based on insights gained from the previous stage. This strategy is especially useful for complex or multi-faceted queries that require careful context and relationship analysis.

A few benefits of this technique include:

Increases relevance by narrowing search results iteratively.
Reduces noise through successive filtering of non-essential documents.
Adapts easily to complex, multi-hop question answering.

As I mentioned above, the strategy for Multi-Step Document Retrieval can include rewriting our query, filtering the documents, or reranking them, depending on what we want to focus on.

Let’s try the simplest one, where we add the retrieved context to the query for the next step.

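def multi_step_retrieval(query, steps=2):
    context_docs = []       # retrieved chunks across all steps, without duplicates
    current_query = query   # the query used for the current step
    for _ in range(steps):
        # Perform retrieval with the current (possibly refined) query
        results = semantic_search(current_query)
        step_docs = results['documents'][0]
        for doc in step_docs:
            if doc not in context_docs:
                context_docs.append(doc)

        # Refine the query for the next step with this step's documents only,
        # so the query does not accumulate all previous context
        current_query = f"{query} Specifically, {' '.join(step_docs)}"

    return "\n".join(context_docs)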

context_multi_step = multi_step_retrieval(query)
response_multi_step = generate_response(query, context_multi_step)

Of course, we can improve the result even further with methods that eliminate irrelevant retrieved documents, such as reranking. You can also add steps such as rewriting the query based on the retrieved context.
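Here is a sketch of the filtering idea under simplifying assumptions. At each step, candidates that were already collected or that score below a threshold are dropped. The next query is then built from the original question plus only the newly kept chunks. A hypothetical word-overlap scorer and a four-sentence corpus stand in for `semantic_search` and a real reranker. `min_score=0.15` is an arbitrary value for this toy data.

```python
import re

# Hypothetical in-memory corpus standing in for the vector database
CORPUS = [
    "Car insurance covers collision damage and liability.",
    "Liability coverage pays for injuries you cause to other drivers.",
    "Collision damage is repaired after you pay the deductible.",
    "Health plans pay for hospital stays.",
]

def words(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_score(query, doc):
    """Toy relevance score (stand-in for a reranker): share of query words found in doc."""
    query_words = words(query)
    return len(query_words & words(doc)) / len(query_words) if query_words else 0.0

def search(query, top_k=2):
    """Toy retriever (stand-in for semantic_search): rank the corpus by word overlap."""
    return sorted(CORPUS, key=lambda doc: overlap_score(query, doc), reverse=True)[:top_k]

def multi_step_retrieval_filtered(query, steps=2, top_k=2, min_score=0.15):
    """Iterative retrieval that filters each step's results before refining the query."""
    kept = []                 # relevant chunks collected across all steps
    current_query = query
    for _ in range(steps):
        candidates = search(current_query, top_k)
        # Filter: drop chunks already kept or scoring below the threshold
        new_docs = [doc for doc in candidates
                    if doc not in kept and overlap_score(current_query, doc) >= min_score]
        if not new_docs:
            break             # nothing new and relevant: stop early
        kept.extend(new_docs)
        # Refine: original question plus only the newly kept chunks,
        # so the query does not keep growing with old context
        current_query = f"{query} {' '.join(new_docs)}"
    return kept

print(multi_step_retrieval_filtered("What does car insurance cover?"))
```

On this toy corpus, the first step keeps only the car-insurance chunk and filters out the unrelated second candidate. The second step reaches the deductible chunk through the words "collision damage" picked up in step one, a chunk the original question alone shares no words with.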

4. Hypothetical Document Embedding (HyDE)

Hypothetical Document Embedding (HyDE) is a retrieval technique that begins by generating a “hypothetical” or pseudo-document based on the user’s query. Rather than relying on a short or vague query alone, HyDE uses a language model to expand the query into a more descriptive text that captures the user’s potential intent.

This pseudo-document then serves as the basis for embedding and subsequent retrieval, enabling the system to draw connections that a sparse or direct embedding of the original query might miss.

Usually, HyDE works by feeding the user’s query into a generative component—often the same LLM used elsewhere in the system or a smaller model specialized for text generation. The model produces a paragraph or short article hypothesizing what the user might seek.

A few benefits of using HyDE include:

Enhances retrieval from short or ambiguous queries by creating richer context.
Leverages generative capabilities to capture deeper semantic associations.
Improves the overall completeness of retrieved information for complex topics.
Provides an adaptable technique for various domain-specific queries.

To build the HyDE system, we can use the following code:

# Assumes GEMINI_API_KEY, `text_embedding_model` (e.g. a SentenceTransformer) and the
# Chroma `collection` are set up as in the Simple RAG article
from litellm import completion  # LiteLLM uses "provider/model" names such as "gemini/..."

def hyde_retrieval(query, top_k=2):
    # Generate a hypothetical document using a language model
    prompt = f"Generate a hypothetical document that answers the query: {query}"
    hypothetical_doc = completion(
        model="gemini/gemini-1.5-flash",
        messages=[{"content": prompt, "role": "user"}],
        api_key=GEMINI_API_KEY
    )['choices'][0]['message']['content']

    # Generate embedding for the hypothetical document
    hypothetical_embedding = text_embedding_model.encode(hypothetical_doc)

    # Query the collection with the hypothetical document's embedding instead of the query's
    results = collection.query(
        query_embeddings=[hypothetical_embedding.tolist()],
        n_results=top_k
    )
    return results

hyde_results = hyde_retrieval(query)

The code above generates a hypothetical document and uses its embedding to perform a semantic search. You can also combine the original query with the hypothetical document when retrieving the documents.
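One way to combine the query with the hypothetical document is to average their embeddings and search with the averaged vector, so the literal wording of the question still counts. This sketch assumes that averaging scheme. To stay runnable without an API key, it replaces the Gemini call with a hypothetical `generate` function that returns a canned answer. It also uses normalized bag-of-words vectors instead of `text_embedding_model`.

```python
import math
import re

# Hypothetical toy documents standing in for the vector database
DOCS = [
    "Comprehensive coverage pays for theft, fire, and storm damage to a vehicle.",
    "Our office is closed on public holidays.",
    "Health plans pay for hospital stays.",
]

def embed(text):
    """Toy embedding: L2-normalized bag-of-words vector (stand-in for a real model)."""
    counts = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        counts[word] = counts.get(word, 0) + 1
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def average(vectors):
    """Element-wise mean of sparse vectors."""
    total = {}
    for vec in vectors:
        for word, value in vec.items():
            total[word] = total.get(word, 0.0) + value
    return {word: value / len(vectors) for word, value in total.items()}

def dot(a, b):
    """Dot product of sparse vectors (cosine similarity when both have unit length)."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())

def generate(query):
    """Hypothetical stand-in for the LLM call: returns a canned hypothetical answer."""
    return "Comprehensive car insurance pays when a vehicle is damaged by theft, fire, or a storm."

def hyde_search(query, docs, top_k=1, include_query=True):
    """Rank docs by similarity to the hypothetical document's embedding,
    optionally averaged with the original query's embedding."""
    vectors = [embed(generate(query))]
    if include_query:
        vectors.append(embed(query))
    search_vec = average(vectors)  # not unit length, but scaling does not change the ranking
    return sorted(docs, key=lambda doc: dot(search_vec, embed(doc)), reverse=True)[:top_k]

question = "Am I covered if my car gets stolen?"
print(hyde_search(question, DOCS))
```

With the question alone, none of these toy documents shares a word with it, so every score is zero. The hypothetical answer supplies words such as "theft" and "vehicle", which is the gap HyDE is meant to bridge.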

I hope this has helped!

Is there anything else you’d like to discuss? Let’s dive into it together!