Hi folks - happy weekend! This post is the second in our new article series on AI technologies and agents. It is aimed primarily at software and AI architects and developers, and focuses on implementation in Amazon Bedrock on AWS. (If these more technical topics aren't "up your alley", feel free to unsubscribe from the Platforms section; see sixpeas.substack.com/account to manage your section subscriptions.)
Introduction
Retrieval-Augmented Generation (RAG) is a framework that combines information retrieval techniques with generative AI to improve the quality, relevance, and factual accuracy of generated responses. Instead of relying solely on the knowledge encoded in a language model during training, RAG improves accuracy by retrieving additional, contextually relevant documents or information to support and ground the generation process.
This makes it possible to build question-and-answer applications on top of open source foundation models for many use cases that depend on domain-specific knowledge.
In essence, a RAG pipeline:
Retrieves: Finds the most relevant documents or data points from an external knowledge base using a query.
Augments: Combines the retrieved content with the user’s query.
Generates: Uses a language model to produce responses that are informed by the retrieved context.
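The three stages above can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the retriever here is a simple keyword-overlap matcher standing in for a real vector search, and generate() is a placeholder for a foundation-model call.

```python
def retrieve(query, knowledge_base, top_k=2):
    """Score documents by word overlap with the query and return the best matches.
    A real system would use embeddings and a vector store instead."""
    q_words = set(query.lower().split())
    scored = sorted(knowledge_base,
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return scored[:top_k]

def augment(query, documents):
    """Combine the retrieved context with the user's query into one prompt."""
    context = "\n".join(documents)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def generate(prompt):
    """Placeholder for a foundation-model call (e.g., Bedrock's InvokeModel API)."""
    return f"<model response grounded in: {prompt[:40]}...>"

kb = ["RAG combines retrieval with generation.",
      "Bedrock hosts foundation models."]
query = "What does RAG combine?"
prompt = augment(query, retrieve(query, kb))
print(generate(prompt))
```

The key point is the shape of the data flow: the query selects context, the context is prepended to the prompt, and only then does the model generate.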
This approach is particularly useful for:
Tasks requiring domain-specific knowledge.
Fact-heavy queries that demand up-to-date or external information.
Applications where accuracy and grounding are critical, such as medical, legal, or technical assistance.
Origin and Reference
The concept of RAG emerged from the broader field of neural information retrieval and natural language generation. It was formally introduced in 2020 by researchers from Facebook AI Research (FAIR) in their paper titled:
"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela.
Key Contributions from the Paper:
Described the two-stage architecture:
A retriever model (e.g., Dense Passage Retrieval - DPR) retrieves relevant documents.
A generator model (e.g., BART or T5) produces responses based on retrieved content.
Demonstrated RAG's effectiveness on knowledge-intensive tasks like question answering and open-domain dialogue systems.
Highlighted the ability to answer out-of-scope queries better than closed-book models by leveraging external data.
Publication Details:
Published by: FAIR, 2020.
Conference: NeurIPS 2020.
RAG-based QA Functional Agent Block Diagram
I have given this block diagram in my Introduction to Agents. As this block diagram shows, RAG is basically implemented using two pipelines. In this shorter article, I explain the creation of the RAG pipeline (the "Offline Indexing Pipeline" on the left) using Amazon Bedrock.
I will cover the right-side runtime engine in my next articles, maybe one block at a time. :)
RAG - Data Ingestion Pipelines
The data ingestion pipeline for Retrieval-Augmented Generation (RAG) involves preparing and connecting a knowledge base (KB) to foundation models (FMs) in Amazon Bedrock. This process enables the models to retrieve and use external data dynamically, enhancing their accuracy and contextual relevance without requiring frequent retraining.
To create this data ingestion pipeline, perform the following steps.
Step 1: Create Access Control
Amazon S3
Assuming the user creates this pipeline using the Amazon Bedrock APIs, the APIs need access to the data source (S3) to read the documents.
In some use cases, we may also have to create S3 buckets and upload the documents, so S3 full access is required.
Amazon Bedrock
Amazon Bedrock provides the APIs required to create the knowledge base, which holds the knowledge that enables the RAG feature. Hence, Bedrock full access is needed.
Amazon OpenSearch Serverless
Full access is needed to set up the OpenSearch vector database to store the vectors.
Create S3 Bucket
Bucket names should follow your standard naming convention. It is recommended, though not mandatory, to start with bedrock-kb- so the data source is easily identified as connected to the Bedrock knowledge base. This can be automated as per standard organisation policy.
The bucket is created only the first time. On subsequent document uploads, it is enough to check whether the bucket exists.
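The check-then-create logic can be sketched as follows. The bucket name is hypothetical; the helper builds the create_bucket parameters (note that us-east-1 must not set a LocationConstraint), and the actual boto3 calls are shown in comments since they require live AWS credentials.

```python
def bucket_params(name, region="us-east-1"):
    """Build create_bucket parameters; the recommended (not mandatory)
    bedrock-kb- prefix identifies the bucket as a Bedrock KB data source."""
    if not name.startswith("bedrock-kb-"):
        raise ValueError("expected the recommended bedrock-kb- prefix")
    params = {"Bucket": name}
    if region != "us-east-1":
        # Only non-default regions take a LocationConstraint.
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return params

# With boto3 (not executed here):
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   try:
#       s3.head_bucket(Bucket="bedrock-kb-demo-docs")   # bucket already exists
#   except s3.exceptions.ClientError:
#       s3.create_bucket(**bucket_params("bedrock-kb-demo-docs"))
```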
Step 2: Create Amazon OpenSearch Serverless collection
The second step is to create an empty OpenSearch Serverless collection (the vector index itself is created in Step 4).
An Amazon OpenSearch Serverless collection is a serverless, logical grouping of indexed data in Amazon OpenSearch Serverless. It serves as a container with the capability of Semantic Search Storage.
OpenSearch Serverless collections play a critical role in storing and managing the knowledge base for RAG. The collection stores vectorized embeddings of your data, which are created by the Embedding LLMs.
These collections support three types: SEARCH, TIMESERIES, and VECTORSEARCH. The RAG feature uses VECTORSEARCH.
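A minimal sketch of the collection request, assuming a hypothetical collection name. The function builds the parameters for the opensearchserverless create_collection API; the live call is shown in a comment.

```python
def collection_request(name):
    """Parameters for opensearchserverless create_collection.
    VECTORSEARCH is the collection type RAG uses to store embeddings."""
    return {
        "name": name,
        "type": "VECTORSEARCH",   # other collection types: SEARCH, TIMESERIES
        "description": "Vector store for the Bedrock knowledge base",
    }

# With boto3 (not executed here):
#   aoss = boto3.client("opensearchserverless")
#   aoss.create_collection(**collection_request("bedrock-kb-collection"))
```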
Step 3: Create Required Policies
Create the encryption_policy in OpenSearch to govern how data in the collection is encrypted at rest.
Create the network_policy to control how clients can access your collections and ensure secure communication.
Create the access_policy that defines which IAM (Identity and Access Management) users and roles can interact with your collections.
Step 4: Create Vector Index
To create a vector index in Amazon OpenSearch Serverless, the index mapping must be defined.
Some parameters in the mapping JSON fields are:
the search algorithm type to use,
the dimension size, which must match the output dimension of the embedding model,
the retrieval method for the vector store, such as the Hierarchical Navigable Small World (HNSW) graph algorithm for efficient ANN (Approximate Nearest Neighbor) search, with the Facebook AI Similarity Search (FAISS) engine using the L2 (Euclidean distance) metric to compute similarity.
As shown above, the mapping provides multiple options to improve search accuracy in RAG applications, which in turn improves the relevance of the answers.
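A sketch of such an index mapping follows, assuming the Amazon Titan text embedding model, whose output dimension is 1536; the field names (vector, text, metadata) are illustrative choices that the knowledge base configuration must later match. The actual index creation via opensearch-py is shown in a comment.

```python
VECTOR_DIM = 1536  # must match the embedding model's output dimension

index_body = {
    "settings": {"index": {"knn": True}},   # enable k-NN search on this index
    "mappings": {
        "properties": {
            "vector": {
                "type": "knn_vector",
                "dimension": VECTOR_DIM,
                "method": {
                    "name": "hnsw",      # HNSW graph for approximate NN search
                    "engine": "faiss",   # FAISS engine
                    "space_type": "l2",  # L2 (Euclidean distance) metric
                },
            },
            "text": {"type": "text"},        # chunk text returned with results
            "metadata": {"type": "text"},    # source metadata for each chunk
        }
    },
}

# With opensearch-py (not executed here):
#   client.indices.create(index="bedrock-kb-index", body=index_body)
```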
Step 5: Create Knowledge Base
Initialize the OpenSearch Serverless configuration by providing the collection details created in Step 2 and the vector index details created in Step 4.
Provide the S3 bucket created to link as the data source with the knowledge base in the next step.
Provide the embedding model to be used to create embeddings of the data. Choose one of the embedding models provided by Amazon Bedrock; the options here are currently limited.
Define a unique name for the knowledge base.
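The request can be sketched as follows, wiring together the collection (Step 2), the vector index (Step 4), and a Bedrock embedding model. The field mapping must match the index mapping created earlier; the model ARN and index name are illustrative, and the live bedrock-agent call appears in a comment.

```python
def knowledge_base_request(name, role_arn, collection_arn):
    """Parameters for the bedrock-agent create_knowledge_base API."""
    return {
        "name": name,
        "roleArn": role_arn,
        "knowledgeBaseConfiguration": {
            "type": "VECTOR",
            "vectorKnowledgeBaseConfiguration": {
                # one of the Bedrock-provided embedding models
                "embeddingModelArn": ("arn:aws:bedrock:us-east-1::"
                                      "foundation-model/amazon.titan-embed-text-v1"),
            },
        },
        "storageConfiguration": {
            "type": "OPENSEARCH_SERVERLESS",
            "opensearchServerlessConfiguration": {
                "collectionArn": collection_arn,
                "vectorIndexName": "bedrock-kb-index",
                "fieldMapping": {   # must match the index mapping from Step 4
                    "vectorField": "vector",
                    "textField": "text",
                    "metadataField": "metadata",
                },
            },
        },
    }

# With boto3 (not executed here):
#   bedrock_agent = boto3.client("bedrock-agent")
#   kb = bedrock_agent.create_knowledge_base(
#       **knowledge_base_request("my-kb", role_arn, collection_arn))
```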
Step 6: Create Data Source
Creating a data source is a separate step from creating the S3 bucket.
Define the chunking strategy: the chunking type to be used and the chunk size details.
Attach the knowledge base created in the previous step to the S3 bucket.
These steps are essential for ingesting and managing the knowledge that the system will use to retrieve relevant information during RAG workflows. This ensures that the data is optimally prepared for generating embeddings and performing semantic search.
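A sketch of the data source request, assuming a fixed-size chunking strategy; the data source name and chunking values (300 tokens, 20% overlap) are illustrative choices, and the live bedrock-agent call would pass these parameters to create_data_source.

```python
def data_source_request(kb_id, bucket_arn):
    """Parameters for the bedrock-agent create_data_source API, linking the
    S3 bucket to the knowledge base with a fixed-size chunking strategy."""
    return {
        "knowledgeBaseId": kb_id,
        "name": "bedrock-kb-s3-source",   # hypothetical data source name
        "dataSourceConfiguration": {
            "type": "S3",
            "s3Configuration": {"bucketArn": bucket_arn},
        },
        "vectorIngestionConfiguration": {
            "chunkingConfiguration": {
                "chunkingStrategy": "FIXED_SIZE",
                "fixedSizeChunkingConfiguration": {
                    "maxTokens": 300,         # chunk size in tokens
                    "overlapPercentage": 20,  # overlap between adjacent chunks
                },
            },
        },
    }

# With boto3 (not executed here):
#   ds = bedrock_agent.create_data_source(
#       **data_source_request(kb_id, "arn:aws:s3:::bedrock-kb-demo-docs"))
```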
Final Step: Start Ingestion Job
During data ingestion, the knowledge base will fetch the documents from the data source, extract text, chunk it, create embeddings, and store them in the OpenSearch Serverless vector store.
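Starting the job and waiting for it to finish can be sketched as a small polling loop over the bedrock-agent start_ingestion_job and get_ingestion_job APIs. The client is passed in so the helper can be exercised without live AWS access.

```python
import time

def wait_for_ingestion(client, kb_id, ds_id, poll_seconds=10):
    """Start an ingestion job and poll until it completes or fails."""
    job = client.start_ingestion_job(knowledgeBaseId=kb_id, dataSourceId=ds_id)
    job_id = job["ingestionJob"]["ingestionJobId"]
    while True:
        status = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=ds_id,
            ingestionJobId=job_id,
        )["ingestionJob"]["status"]
        if status in ("COMPLETE", "FAILED"):
            return status
        time.sleep(poll_seconds)   # job is still STARTING or IN_PROGRESS
```

In practice, client would be boto3.client("bedrock-agent") and the IDs would come from the knowledge base and data source created in the previous steps.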
Conclusion
This article provides the complete set of steps to create a RAG data ingestion pipeline in Amazon Bedrock. Please refer to the latest AWS documentation for current information on APIs and fields (I have not given them here, since AWS updates them from time to time). This pipeline ensures that every new document of any supported format uploaded to the data source is automatically processed, embedded, and indexed in OpenSearch Serverless, maintaining an up-to-date vector index for RAG workflows.
In the next article, I will cover OpenSearch, since it is a relatively new and complex topic. Then I will cover the remaining important blocks in RAG and how to ingest data online.
References
AWS documentation:
Amazon OpenSearch Serverless: about, features, developer guide, vector search
Thank you for reading. If you enjoyed this article, we'd love to have your support via a heart, comment, share, restack, or Note!