# Knowledge base refresh pipeline with AWS Step Functions and Amazon S3 Vectors

This pattern deploys an AWS Step Functions workflow that automates the ingestion of new documents into an Amazon S3 Vectors knowledge base so AI agents always answer from up-to-date information. When new documents land in S3, the workflow fans out via Distributed Map to generate vector embeddings using Amazon Bedrock and store them with `PutVectors` in parallel. After ingestion, `QueryVectors` validates that the new content is searchable, and a Choice state either confirms success or rolls back by deleting the newly added vectors if validation fails.

Learn more about this pattern at Serverless Land Patterns: [https://serverlessland.com/patterns/sfn-s3vectors-rag-refresh-cdk](https://serverlessland.com/patterns/sfn-s3vectors-rag-refresh-cdk)

Important: this application uses various AWS services, and there are costs associated with these services after Free Tier usage. Please see the [AWS Pricing page](https://aws.amazon.com/pricing/) for details. You are responsible for any AWS costs incurred. No warranty is implied in this example.

## Requirements

* [Create an AWS account](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html) if you do not already have one and log in. The IAM user that you use must have sufficient permissions to make necessary AWS service calls and manage AWS resources.
* [AWS CLI v2](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) (latest available version) installed and configured
* [Git installed](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
* [AWS CDK](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html) (version 2.221.0 or later) installed and configured
* [Node.js 22.x](https://nodejs.org/) installed

## Deployment Instructions

1. Create a new directory, navigate to that directory in a terminal and clone the GitHub repository:

    ```bash
    git clone https://github.com/aws-samples/serverless-patterns
    ```

1. Change directory to the pattern directory inside the cloned repository:

    ```bash
    cd serverless-patterns/sfn-s3vectors-rag-refresh-cdk
    ```

1. Install the project dependencies:

    ```bash
    npm install
    ```

1. Install the Lambda dependencies:

    ```bash
    cd lambda && npm install && cd ..
    ```

1. Deploy the CDK stack:

    ```bash
    cdk deploy
    ```

    Note: Deploy to your default AWS region. Please refer to the [AWS capabilities explorer](https://builder.aws.com/build/capabilities/explore) for feature availability in your desired region.

1. Note the outputs from the CDK deployment process. These contain the resource names used for testing.

## How it works

This pattern creates a single stack with the following resources:

1. **S3 Document Bucket**: Stores the source documents to be ingested. Upload files to the `documents/` prefix.

2. **S3 Vectors Bucket & Index**: An S3 Vectors vector bucket with a `knowledge-base` index configured for 1024-dimensional cosine similarity (matching Amazon Titan Text Embeddings V2 output).

3. **Step Functions State Machine**: Orchestrates the full ingestion pipeline:
   - **Distributed Map** fans out over every object under `s3://<bucket>/documents/`, processing up to 40 documents concurrently
   - For each document, the **EmbedAndStore** Lambda reads the file, calls Amazon Bedrock Titan Text Embeddings V2 to generate a 1024-dimensional vector, and writes it to the S3 Vectors index via `PutVectors` with the source file path as metadata
   - **ValidateIngestion** Lambda fetches the Distributed Map result manifest from S3, collects all vector keys from the SUCCEEDED results, generates a probe embedding, and calls `QueryVectors` to confirm at least one newly ingested vector is returned
   - A **Choice** state checks the validation result: on success the workflow completes; on failure the **RollbackVectors** Lambda calls `DeleteVectors` to remove all newly added vectors, then the workflow fails

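The validation and rollback decision described in the list above can be sketched as plain TypeScript. This is a minimal illustration, not the pattern's Lambda code: the `MapItemResult`, `QueryHit`, and `Decision` shapes and all function names are assumptions made for the example, and the real handlers would obtain these values from the Distributed Map result manifest and the `QueryVectors` response.

```typescript
// Hypothetical shapes: the pattern's actual Lambda payloads may differ.
type MapItemResult = { status: "SUCCEEDED" | "FAILED"; vectorKey?: string };
type QueryHit = { key: string };

type Decision =
  | { outcome: "succeeded" }
  | { outcome: "rollback"; keysToDelete: string[] };

// Collect the keys of vectors written by successful Distributed Map items.
function collectIngestedKeys(results: MapItemResult[]): string[] {
  return results
    .filter((r) => r.status === "SUCCEEDED" && r.vectorKey !== undefined)
    .map((r) => r.vectorKey as string);
}

// Validation passes if at least one newly ingested key appears in the query hits.
function validateIngestion(ingestedKeys: string[], hits: QueryHit[]): boolean {
  const ingested = new Set(ingestedKeys);
  return hits.some((h) => ingested.has(h.key));
}

// Mirrors the Choice state: succeed, or list the keys a rollback would delete.
function decide(results: MapItemResult[], hits: QueryHit[]): Decision {
  const keys = collectIngestedKeys(results);
  return validateIngestion(keys, hits)
    ? { outcome: "succeeded" }
    : { outcome: "rollback", keysToDelete: keys };
}
```

Keeping the list of ingested keys alongside the decision means a rollback can target only the vectors added in this run and leave older entries in the index untouched.
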
## Architecture

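As a rough outline of the workflow described under "How it works", the state machine could be written in Amazon States Language along the following lines, shown here as a TypeScript object. This is a sketch under stated assumptions rather than the definition the CDK stack generates: every state name except `IngestionSucceeded`, the `<...>` placeholders, the `map-results/` prefix, and the `$.passed` field are hypothetical.

```typescript
type State = {
  Type: string;
  Next?: string;
  Default?: string;
  Choices?: Array<{ Next: string; [key: string]: unknown }>;
  [key: string]: unknown;
};
type StateMachine = { StartAt: string; States: Record<string, State> };

// Builds a Lambda invoke task; function ARNs are placeholders, not real values.
const lambdaTask = (fn: string, next?: string): State => ({
  Type: "Task",
  Resource: "arn:aws:states:::lambda:invoke",
  Parameters: { FunctionName: fn, "Payload.$": "$" },
  OutputPath: "$.Payload",
  ...(next !== undefined ? { Next: next } : { End: true }),
});

const definition: StateMachine = {
  StartAt: "IngestDocuments",
  States: {
    IngestDocuments: {
      Type: "Map",
      MaxConcurrency: 40, // up to 40 documents processed concurrently
      ItemReader: {
        Resource: "arn:aws:states:::s3:listObjectsV2",
        Parameters: { Bucket: "<document-bucket>", Prefix: "documents/" },
      },
      ItemProcessor: {
        ProcessorConfig: { Mode: "DISTRIBUTED", ExecutionType: "STANDARD" },
        StartAt: "EmbedAndStore",
        States: { EmbedAndStore: lambdaTask("<EmbedAndStore function ARN>") },
      },
      // Assumed location for the result manifest that validation reads.
      ResultWriter: {
        Resource: "arn:aws:states:::s3:putObject",
        Parameters: { Bucket: "<document-bucket>", Prefix: "map-results/" },
      },
      Next: "ValidateIngestion",
    },
    ValidateIngestion: lambdaTask("<ValidateIngestion function ARN>", "CheckValidation"),
    CheckValidation: {
      Type: "Choice",
      // Assumes the validation Lambda returns { passed: boolean, ... }.
      Choices: [{ Variable: "$.passed", BooleanEquals: true, Next: "IngestionSucceeded" }],
      Default: "RollbackVectors",
    },
    RollbackVectors: lambdaTask("<RollbackVectors function ARN>", "IngestionFailed"),
    IngestionFailed: { Type: "Fail", Error: "ValidationFailed" },
    IngestionSucceeded: { Type: "Succeed" },
  },
};

// Sanity check: top-level transition targets that name no existing state.
// Nested ItemProcessor states are not inspected.
function danglingTargets(def: StateMachine): string[] {
  const names = new Set(Object.keys(def.States));
  const targets = [def.StartAt];
  for (const state of Object.values(def.States)) {
    if (state.Next !== undefined) targets.push(state.Next);
    if (state.Default !== undefined) targets.push(state.Default);
    for (const choice of state.Choices ?? []) targets.push(choice.Next);
  }
  return targets.filter((t) => !names.has(t));
}
```

Calling `danglingTargets(definition)` returns an empty array when every transition points at a defined state, which is a quick way to catch a typo in a state name before deploying.
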

## Testing

After deployment, upload sample documents and start the workflow.

### Upload test documents

```bash
BUCKET=$(aws cloudformation describe-stacks \
  --stack-name RagRefreshStack \
  --query "Stacks[0].Outputs[?OutputKey=='DocumentBucketName'].OutputValue" \
  --output text)

echo "Amazon S3 Vectors is a new vector storage capability." > /tmp/doc1.txt
echo "Step Functions Distributed Map enables parallel processing at scale." > /tmp/doc2.txt

aws s3 cp /tmp/doc1.txt s3://$BUCKET/documents/doc1.txt
aws s3 cp /tmp/doc2.txt s3://$BUCKET/documents/doc2.txt
```

### Start the workflow

```bash
STATE_MACHINE_ARN=$(aws cloudformation describe-stacks \
  --stack-name RagRefreshStack \
  --query "Stacks[0].Outputs[?OutputKey=='StateMachineArn'].OutputValue" \
  --output text)

aws stepfunctions start-execution \
  --state-machine-arn $STATE_MACHINE_ARN
```

### Monitor execution

```bash
aws stepfunctions list-executions \
  --state-machine-arn $STATE_MACHINE_ARN \
  --max-results 1
```

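To interpret the JSON that `list-executions` prints, a small helper along these lines may be useful. It is a sketch only: the function and type names are hypothetical, and it reads just the `name` and `status` fields of the first entry, relying on the API returning the most recent execution first.

```typescript
// Subset of the fields returned by `aws stepfunctions list-executions`.
type ExecutionSummary = { name: string; status: string };
type ListExecutionsOutput = { executions: ExecutionSummary[] };

// Turn the most recent execution's status into a short human-readable hint.
function describeLatest(output: ListExecutionsOutput): string {
  const latest = output.executions[0];
  if (latest === undefined) {
    return "No executions found yet.";
  }
  switch (latest.status) {
    case "RUNNING":
      return `${latest.name} is still running; check again shortly.`;
    case "SUCCEEDED":
      return `${latest.name} succeeded; the new vectors passed validation.`;
    case "FAILED":
      // In this pattern a failure may mean validation failed and the
      // rollback ran, but other errors end in the same status.
      return `${latest.name} failed; inspect the execution history for the cause.`;
    default:
      return `${latest.name} ended with status ${latest.status}.`;
  }
}
```

Pass it the parsed output of the command above, for example `describeLatest(JSON.parse(rawJson))`, where `rawJson` holds the CLI output captured as a string.
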
### Expected result

The workflow should complete successfully. In the Step Functions console you'll see:
1. Distributed Map processed both documents in parallel
2. Each document was embedded and stored as a vector
3. Validation confirmed the vectors are queryable
4. The workflow reached the `IngestionSucceeded` state

## Cleanup

1. Delete the stack:

    ```bash
    cdk destroy
    ```

1. Confirm the stack has been deleted:

    ```bash
    aws cloudformation list-stacks --stack-status-filter DELETE_COMPLETE
    ```

----
Copyright 2026 Amazon.com, Inc. or its affiliates. All Rights Reserved.

SPDX-License-Identifier: MIT-0