Skip to content

fix: skip empty knowledge-base embedding batches#6106

Open
stablegenius49 wants to merge 1 commit intoAstrBotDevs:masterfrom
stablegenius49:pr-factory/issue-3709-empty-batch-upload
Open

fix: skip empty knowledge-base embedding batches#6106
stablegenius49 wants to merge 1 commit intoAstrBotDevs:masterfrom
stablegenius49:pr-factory/issue-3709-empty-batch-upload

Conversation

@stablegenius49
Copy link
Contributor

@stablegenius49 stablegenius49 commented Mar 12, 2026

Fixes #3709

Modifications / 改动点

  • short-circuit FaissVecDB.insert_batch() when the parser/chunker produces an empty batch instead of flowing into the embedding/document stores and crashing with tuple index out of range
  • add a focused regression test to ensure empty knowledge-base batches return cleanly and never touch the embedding/document storage layer

This keeps PDF uploads from blowing up when a file yields zero indexable chunks (for example, scanned/image-only PDFs or PDFs whose extracted text is empty) and preserves the existing upload flow instead of surfacing an opaque vector-db exception.

  • This is NOT a breaking change. / 这不是一个破坏性变更。

Screenshots or Test Results / 运行截图或测试结果

Verification Steps:

python3.11 -m pytest tests/unit/test_faiss_vec_db.py -q

Result:

1 passed in 1.01s
python3.11 -m ruff check astrbot/core/db/vec_db/faiss_impl/vec_db.py tests/unit/test_faiss_vec_db.py

Result:

All checks passed!

Checklist / 检查清单

  • 😊 如果 PR 中有新加入的功能,已经通过 Issue / 邮件等方式和作者讨论过。/ If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.
  • 👀 我的更改经过了良好的测试,并已在上方提供了“验证步骤”和“运行截图”。/ My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.
  • 🤓 我确保没有引入新依赖库,或者引入了新依赖库的同时将其添加到了 requirements.txtpyproject.toml 文件相应位置。/ I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.
  • 😮 我的更改没有引入恶意代码。/ My changes do not introduce malicious code.

Summary by Sourcery

在 FAISS 向量数据库的插入流程中安全地处理空知识库批次,当未生成可索引内容时避免出错。

Bug Fixes:

  • 当解析后的内容列表为空时,阻止批量插入尝试执行嵌入和存储操作,从而避免在空上传场景下的运行时错误。

Tests:

  • 添加一个异步回归测试,确保对空批次的插入返回空结果,并且从不调用嵌入层或文档存储层。
Original summary in English

Summary by Sourcery

Handle empty knowledge-base batches safely in the FAISS vector database insert path to avoid errors when no indexable content is produced.

Bug Fixes:

  • Prevent batch insert from attempting embedding and storage operations when the parsed contents list is empty, avoiding runtime errors on empty uploads.

Tests:

  • Add an async regression test ensuring empty batch inserts return an empty result and never call embedding or document storage layers.

@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Mar 12, 2026
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial fix to enhance the robustness of the knowledge-base ingestion process. It prevents system crashes that previously occurred when FaissVecDB.insert_batch() received empty content batches, a common scenario with certain PDF types. The change ensures a smoother and more resilient data upload experience by gracefully handling these edge cases.

Highlights

  • Empty Batch Handling: Implemented a short-circuit mechanism in FaissVecDB.insert_batch() to prevent crashes when processing empty batches of content, specifically addressing issues with knowledge-base uploads from files like scanned PDFs that yield no indexable chunks.
  • Regression Testing: Added a dedicated regression test to ensure that FaissVecDB.insert_batch() correctly handles empty content batches by returning an empty list and avoiding unnecessary calls to embedding and document storage.
Changelog
  • astrbot/core/db/vec_db/faiss_impl/vec_db.py
    • Added a conditional check at the beginning of insert_batch to immediately return an empty list if contents is empty, preventing further processing.
  • tests/unit/test_faiss_vec_db.py
    • Added a new test file containing test_insert_batch_skips_empty_contents to verify that FaissVecDB.insert_batch correctly handles an empty list of contents by returning an empty result and not invoking embedding or document storage methods.
Activity
  • The author provided verification steps including pytest and ruff checks, both showing successful results.
  • The author confirmed the change is not breaking.
  • The author confirmed the changes are well-tested and provided verification steps.
  • The author confirmed no new dependencies were introduced.
  • The author confirmed no malicious code was introduced.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a bug where empty batches passed to FaissVecDB.insert_batch() would cause a crash. The fix correctly adds a check to handle empty contents by returning early. A new unit test is also included to verify this behavior and prevent regressions. The changes are correct and well-tested. I have one minor suggestion in the new test file to improve code style for better maintainability.

vec_db.document_storage = AsyncMock()
vec_db.embedding_storage = AsyncMock()

result = await FaissVecDB.insert_batch(vec_db, [])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For improved readability and adherence to standard Python conventions, it's better to call instance methods on the object instance directly. Calling the method on the class and passing the instance explicitly is less common and can be confusing for other developers. Using a keyword argument for contents also improves clarity.

Suggested change
result = await FaissVecDB.insert_batch(vec_db, [])
result = await vec_db.insert_batch(contents=[])

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - 我在这里给出了一些高层次的反馈:

  • 可以考虑将 if not contents 的提前返回移动到构造 metadatasids 之前,这样在空批次的情况下就不会分配永远不会被使用的列表。
  • 在测试中,与其使用 FaissVecDB.__new__,不如通过它的公开构造函数(或一个专门的测试 helper/factory)来构造 FaissVecDB,这样测试设置能更好地反映真实的使用场景。
给 AI Agent 的提示
Please address the comments from this code review:

## Overall Comments
- Consider moving the `if not contents` early-return before constructing `metadatas` and `ids` so you don't allocate lists that will never be used in the empty-batch case.
- In the test, instead of using `FaissVecDB.__new__`, it may be clearer to construct `FaissVecDB` via its public initializer (or a dedicated test helper/factory) so the test setup better reflects real-world usage.

Sourcery 对开源项目是免费的——如果你觉得我们的评审有帮助,欢迎分享 ✨
帮我变得更有用!请在每条评论上点击 👍 或 👎,我会根据你的反馈来改进代码评审。
Original comment in English

Hey - I've left some high level feedback:

  • Consider moving the if not contents early-return before constructing metadatas and ids so you don't allocate lists that will never be used in the empty-batch case.
  • In the test, instead of using FaissVecDB.__new__, it may be clearer to construct FaissVecDB via its public initializer (or a dedicated test helper/factory) so the test setup better reflects real-world usage.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Consider moving the `if not contents` early-return before constructing `metadatas` and `ids` so you don't allocate lists that will never be used in the empty-batch case.
- In the test, instead of using `FaissVecDB.__new__`, it may be clearer to construct `FaissVecDB` via its public initializer (or a dedicated test helper/factory) so the test setup better reflects real-world usage.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S This PR changes 10-29 lines, ignoring generated files.

Projects

None yet

2 participants