From 3a533edac92d01a778aae2ef0499fa6f778d366b Mon Sep 17 00:00:00 2001
From: Leonid Borchuk
Date: Tue, 12 May 2026 14:29:53 +0300
Subject: [PATCH 1/9] Add AI_POLICY for clarification how to use AI agents

See detailed discussion in https://lists.apache.org/thread/3kq1391n3n0rzo0wchygmt0cyy59rzq9

As for the discussion results I've added:
1. AI_POLICY.md - note for the developer what using AI agents means
2. AGENTS.md - description for LLM models how to work with project code
3. .github/pull_request_template.md - new flag "This PR contains AI-assisted code generation"
4. README.md - link to policy from README file
---
 .github/pull_request_template.md |   1 +
 AGENTS.md                        | 175 +++++++++++++++++++++++++++++++
 AI_POLICY.md                     |  70 +++++++++++++
 README.md                        |   1 +
 4 files changed, 247 insertions(+)
 create mode 100644 AGENTS.md
 create mode 100644 AI_POLICY.md

diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
index 3b592500f9c..d2c001658f9 100644
--- a/.github/pull_request_template.md
+++ b/.github/pull_request_template.md
@@ -36,6 +36,7 @@ Fixes #ISSUE_Number
 - [ ] Followed [contribution guide](https://cloudberry.apache.org/contribute/code)
 - [ ] Added/updated documentation
 - [ ] Reviewed code for security implications
+- [ ] This PR contains AI-assisted code generation
 - [ ] Requested review from [cloudberry committers](https://github.com/orgs/apache/teams/cloudberry-committers)
 
 ### Additional Context
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 00000000000..25dec6cee29
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,175 @@
+
+
+# AGENTS.md
+
+Guidance for agent-style coding tools working in the Apache Cloudberry repository.
+
+## Project overview
+
+Apache Cloudberry is an Apache Incubator project and an open-source massively parallel processing database. It evolved from Greenplum Database and is built on a PostgreSQL kernel. It is used for data warehouse, large-scale analytics, and AI or ML workloads.
+
+Treat this repository as a database system, not as a typical application project. Small changes can affect SQL semantics, query planning, storage, distributed execution, management tooling, upgrade behavior, and user data safety.
+
+## Core principles for agents
+
+- Keep changes as small and direct as possible.
+- Do not perform broad code refactoring. Cloudberry's core is PostgreSQL-based, and unnecessary refactoring makes familiar code harder for maintainers to recognize and review.
+- Preserve PostgreSQL and Greenplum coding style in the area being edited.
+- Prefer localized fixes over architecture rewrites unless explicitly requested.
+- Read surrounding code before editing. Match existing naming, memory management, error handling, locking, and test patterns.
+- Do not generate or import code with incompatible licensing. The project is Apache License 2.0.
+- Never treat AI output as automatically correct. The contributor owns the final code.
+
+## Repository map
+
+- [README.md](README.md) — project introduction, community links, contribution overview, and license information.
+- [CONTRIBUTING.md](CONTRIBUTING.md) — contribution expectations and community guidance.
+- [AI_POLICY.md](AI_POLICY.md) — rules for AI-assisted development.
+- [SECURITY.md](SECURITY.md) — security reporting policy.
+- [.github/pull_request_template.md](.github/pull_request_template.md) — PR checklist, test plan, impact, and AI disclosure checkbox.
+- [src/](src/) — database source tree, including PostgreSQL-derived backend, frontend utilities, interfaces, tests, and build integration.
+- [src/backend/](src/backend/) — main database backend. Important areas include parser, optimizer, executor, storage, catalog, commands, postmaster, replication, and Cloudberry distributed components.
+- [src/backend/cdb/](src/backend/cdb/) — Cloudberry or Greenplum distributed database logic, including dispatch, gangs, motion, and MPP behavior.
+- [src/backend/gporca/](src/backend/gporca/) and [src/backend/gpopt/](src/backend/gpopt/) — ORCA optimizer integration and optimizer-related code.
+- [src/common/](src/common/) — code shared by backend and frontend utilities.
+- [src/interfaces/](src/interfaces/) — client interfaces such as libpq, ECPG, and GPPC.
+- [src/test/](src/test/) — regression, isolation, unit, and integration test infrastructure.
+- [gpMgmt/](gpMgmt/) — Python management utilities and cluster administration tooling.
+- [gpAux/](gpAux/) — auxiliary scripts, demo cluster support, packaging, and build helpers.
+- [gpcontrib/](gpcontrib/) — Cloudberry-related extensions and contributed modules.
+- [contrib/](contrib/) — PostgreSQL-style contributed modules and Cloudberry-specific extensions.
+- [doc/](doc/) — SGML documentation sources.
+- [devops/](devops/) — Docker, automation, sandbox, and build/deployment helper scripts.
+- [mcp-server/](mcp-server/) — MCP server for AI-ready Cloudberry database interaction.
+
+## Architecture notes
+
+Cloudberry follows a PostgreSQL-style source layout with additional MPP database components inherited from Greenplum. The coordinator receives SQL, plans or optimizes it, dispatches work to segments, and collects results. Segment processes execute distributed pieces of the plan and interact through the interconnect.
+
+Key concepts agents should recognize:
+
+- Coordinator and segments are separate roles in a distributed database cluster.
+- Query execution may involve dispatch, gangs, motion nodes, distributed transactions, snapshots, and interconnect behavior.
+- Storage and catalog changes can affect upgrade, recovery, visibility, and distributed consistency.
+- PostgreSQL compatibility matters. Avoid changing behavior that is inherited from PostgreSQL unless the task explicitly targets Cloudberry divergence.
+- Extensions under [gpcontrib/](gpcontrib/) and [contrib/](contrib/) may have independent build or test workflows.
+
+## Working rules
+
+1. Start by identifying the subsystem and reading nearby files, tests, and documentation.
+2. Prefer existing helpers, macros, memory contexts, error reporting conventions, and test infrastructure.
+3. Avoid unrelated formatting changes.
+4. Avoid renaming symbols or moving files unless explicitly required.
+5. Do not silently change SQL-visible behavior, catalog definitions, on-disk format, wire protocol, GUC behavior, or user-facing messages.
+6. If a change touches security-sensitive areas, call that out clearly in the PR description and request appropriate human review.
+7. If a change touches distributed execution, verify whether it affects both coordinator and segment behavior.
+8. If a change touches management scripts, check Python compatibility and existing unit or behave tests.
+9. If a change touches documentation, keep examples accurate and consistent with project terminology.
+10. If behavior is uncertain, add a small regression or unit test rather than relying on assumptions.
+
+## Build and test guidance
+
+Use the smallest relevant validation first, then broader validation when the change is ready.
+
+Common validation entry points mentioned by project docs and PR templates:
+
+- Configure and build through the repository's standard build flow or the automation in [devops/README.md](devops/README.md).
+- Use Docker-based development and sandbox workflows under [devops/](devops/) when local system dependencies are not available.
+- Run `make installcheck` for regression coverage when appropriate.
+- Run `make -C src/test installcheck-cbdb-parallel` for Cloudberry parallel regression coverage when appropriate.
+- For extension-specific changes, run the extension's local installcheck or documented test target.
+- For management tooling under [gpMgmt/](gpMgmt/), inspect the relevant README and test targets before selecting a test command.
+
+Do not invent successful test results. If tests are not run, state that clearly in the final response or PR notes.
+
+## AI-assisted contribution policy
+
+Follow [AI_POLICY.md](AI_POLICY.md):
+
+- AI-generated code has the same responsibility and quality bar as human-written code.
+- AI-assisted changes must pass normal review, testing, and CI standards.
+- The contributor must ensure license compatibility.
+- Significant AI-generated code should be disclosed using the PR template checkbox.
+- Do not use AI to auto-generate responses to maintainer review feedback.
+- Include or verify tests for AI-generated code.
+- Keep changes simple and avoid code refactoring.
+
+## Security policy
+
+Follow [SECURITY.md](SECURITY.md):
+
+- Do not report security vulnerabilities in public issues, public mailing lists, or public forums.
+- Send vulnerability reports to security@apache.org.
+- For normal non-security bugs, use GitHub Issues, Discussions, the dev mailing list, or Slack.
+
+When working as an agent, do not expose secrets, credentials, private keys, database dumps with sensitive data, or vulnerability details in public-facing output.
+
+## Pull request expectations
+
+Use [.github/pull_request_template.md](.github/pull_request_template.md) as the checklist for final change summaries:
+
+- Explain what the PR does.
+- Identify the type of change.
+- Document breaking changes if any.
+- Provide a test plan.
+- Describe performance, user-facing, and dependency impact when applicable.
+- Confirm documentation updates when needed.
+- Confirm security review consideration.
+- Disclose significant AI-assisted code generation.
+
+## Style expectations
+
+- C code should follow the surrounding PostgreSQL or Cloudberry style.
+- Python code in [gpMgmt/](gpMgmt/) should follow nearby management script patterns and existing test style.
+- SQL tests should include expected output files when required by the test framework.
+- Documentation uses Markdown in many repository files and SGML under [doc/src/sgml/](doc/src/sgml/).
+- Prefer project terminology: Apache Cloudberry, coordinator, segment, MPP, PostgreSQL kernel, Greenplum heritage.
+
+## High-risk areas
+
+Be especially conservative around:
+
+- Catalog definitions and upgrade-sensitive files.
+- Storage formats, WAL, recovery, transactions, snapshots, and visibility.
+- Planner, optimizer, executor, and motion/distributed execution logic.
+- Authentication, cryptography, TLS, network protocol, and libpq behavior.
+- Interconnect and dispatch paths.
+- Cluster management commands that start, stop, expand, recover, or reconfigure clusters.
+- Public SQL behavior, GUCs, system views, and extension APIs.
+
+## Recommended agent workflow
+
+1. Restate the requested change in concrete terms.
+2. Locate the smallest relevant subsystem.
+3. Read nearby implementation and tests.
+4. Plan a minimal change.
+5. Edit only files required for the task.
+6. Add or update tests when behavior changes.
+7. Run the narrowest relevant tests available.
+8. Summarize changed files, test results, and any risks or follow-ups.
+
+## What not to do
+
+- Do not perform drive-by cleanup.
+- Do not reformat unrelated code.
+- Do not replace established PostgreSQL-style patterns with modern alternatives just for preference.
+- Do not change public behavior without tests and documentation.
+- Do not assume single-node behavior is enough for distributed database changes.
+- Do not fabricate command output, test results, issue links, or reviewer decisions.
diff --git a/AI_POLICY.md b/AI_POLICY.md
new file mode 100644
index 00000000000..d1c2048b153
--- /dev/null
+++ b/AI_POLICY.md
@@ -0,0 +1,70 @@
+# AI Policy
+
+We welcome AI tools in Apache Cloudberry development — code assistants, LLMs, AI code review, and beyond. AI is a normal developer tool, like an IDE or a debugger. This document sets simple ground rules so everyone can use AI responsibly.
+
+## Guidelines
+
+### 1. You own the code
+
+AI-generated code carries the same responsibility as code you type yourself. Review it before submitting. If a bug ships, "the AI wrote it" is not a defense.
+
+**Example:** As an experiment, you used LLM to generate a new type of executor node. The results were impressive, and you wanted to share them with the community. Before opening PR, read every line, verify the logic, and make sure it fits with existing code patterns. Someone might use your code in production, not just for experiments.
+
+### 2. Same quality bar
+
+AI-assisted contributions must pass the same review, testing, and CI standards as any other code. No shortcuts. AI-generated code must come with corresponding tests, or be covered by existing ones. If the AI wrote the code, you should at least write or carefully verify the tests.
+
+**Example:** You use an LLM to implement a new aggregate function. The PR must include regression tests in `src/test` that exercise both normal and edge cases.
+
+### 3. Watch the license
+
+Don't let AI introduce code incompatible with the Apache License 2.0. You are responsible for ensuring all submitted code — AI-generated or not — has proper licensing.
+
+See [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
+
+**Example:** If an AI tool reproduces a snippet from a GPL-licensed project, you must not include it. When in doubt, rewrite from scratch.
+
+### 4. Flag it
+
+When your PR includes significant AI-generated code, check the AI disclosure box in the PR template. You don't have to disclose minor AI assistance (autocomplete, reformatting), but be transparent about substantial generation.
+
+**Example:** Using LLM to autocomplete a single function signature - no need for a flag. Using LLMs to generate an entire new GUC parameter with validation logic - flag it. The flag doesn't mean that the PR doesn't need to be reviewed or merged, but it will give reviewers more information about the code generation method and allow them to focus more on checking the architecture and logic, rather than specific operators.
+
+### 5. No meaningless code refactoring
+
+Our core is PostgreSQL, and refactoring work has already been done here. Rewriting code significantly complicates rebase. Also, refactoring changes the code in a way that forces people to relearn the code they already know. Keep changes as simple as possible.
+
+**Example:** The point of LLM is to spend your tokens. One day, you will be asked: "This code is not very good. Do you want to improve it?" Of course! It could happens several times. Tokens are spent, but what is the point of such refactoring? (Rhetorical question)
+
+### 6. LLM code review
+
+So far, it is not possible to use paid LLM models for code review in open source ASF projects. However, one could use personal licenses for LLMs to do the same.
+
+**Example:** We primarily use GitHub Copilot for automated AI code review on pull requests. Here are some important points:
+
+- Copilot suggestions are **non-binding hints**, not requirements.
+- If a suggestion is irrelevant or wrong, skip it — you know your code best.
+- If a suggestion catches a real issue, fix it like you would for any review comment.
+- Copilot does not replace human reviewers. All PRs still need approval from a committer.
+
+### 7. Talk to maintainers yourself
+
+Do not use AI to auto-generate responses to review feedback. Maintainers invest time reviewing your code; respond thoughtfully and personally.
+
+**Example:** A reviewer asks "why did you choose this approach over X?" — write your own answer explaining the tradeoff, don't paste an LLM-generated reply.
+
+## Good uses of AI
+
+- Bug fixing and root cause analysis
+- Code review
+- Writing and improving tests
+- Documentation and code comments
+- Build system and CI improvements
+- Security research and vulnerability scanning
+- Learning the codebase faster
+
+## Resources
+
+- [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) - Official Apache guidance on AI tool usage
+- [GitHub Copilot](https://github.com/features/copilot) - AI pair programmer and code reviewer we use
+- [LLM Leaderboard](https://llm-stats.com/) - LLM Stats Score, it's better to use high-ranked models
\ No newline at end of file
diff --git a/README.md b/README.md
index 2a4b7146efd..9e2a5686317 100644
--- a/README.md
+++ b/README.md
@@ -111,6 +111,7 @@ with the contribution.
 | Code contribution | Learn how to contribute code to the Cloudberry, including coding preparation, conventions, workflow, review, and checklist following the [code contribution guide](https://cloudberry.apache.org/contribute/code).|
 | Submit the proposal | Proposing major changes to Cloudberry through [proposal guide](https://cloudberry.apache.org/contribute/proposal).|
 | Doc contribution | We need you to join us to help us improve the documentation, see the [doc contribution guide](https://cloudberry.apache.org/contribute/doc).|
+| AI policy | For AI-assisted development, please review our [AI policy](AI_POLICY.md) for guidelines on responsible AI usage.|
 
 ## Roadmap
 

From 469e2e2c0fde58251f8065c7f156ddda15ff0375 Mon Sep 17 00:00:00 2001
From: Leonid Borchuk
Date: Tue, 12 May 2026 18:17:23 +0300
Subject: [PATCH 2/9] Remove AGENTS.md as totally unsuitable

---
 AGENTS.md    | 175 ---------------------------------------------------
 AI_POLICY.md |  27 ++++++--
 2 files changed, 23 insertions(+), 179 deletions(-)
 delete mode 100644 AGENTS.md

diff --git a/AGENTS.md b/AGENTS.md
deleted file mode 100644
index 25dec6cee29..00000000000
--- a/AGENTS.md
+++ /dev/null
@@ -1,175 +0,0 @@
-
-
-# AGENTS.md
-
-Guidance for agent-style coding tools working in the Apache Cloudberry repository.
-
-## Project overview
-
-Apache Cloudberry is an Apache Incubator project and an open-source massively parallel processing database. It evolved from Greenplum Database and is built on a PostgreSQL kernel. It is used for data warehouse, large-scale analytics, and AI or ML workloads.
-
-Treat this repository as a database system, not as a typical application project. Small changes can affect SQL semantics, query planning, storage, distributed execution, management tooling, upgrade behavior, and user data safety.
-
-## Core principles for agents
-
-- Keep changes as small and direct as possible.
-- Do not perform broad code refactoring. Cloudberry's core is PostgreSQL-based, and unnecessary refactoring makes familiar code harder for maintainers to recognize and review.
-- Preserve PostgreSQL and Greenplum coding style in the area being edited.
-- Prefer localized fixes over architecture rewrites unless explicitly requested.
-- Read surrounding code before editing. Match existing naming, memory management, error handling, locking, and test patterns.
-- Do not generate or import code with incompatible licensing. The project is Apache License 2.0.
-- Never treat AI output as automatically correct. The contributor owns the final code.
-
-## Repository map
-
-- [README.md](README.md) — project introduction, community links, contribution overview, and license information.
-- [CONTRIBUTING.md](CONTRIBUTING.md) — contribution expectations and community guidance.
-- [AI_POLICY.md](AI_POLICY.md) — rules for AI-assisted development.
-- [SECURITY.md](SECURITY.md) — security reporting policy.
-- [.github/pull_request_template.md](.github/pull_request_template.md) — PR checklist, test plan, impact, and AI disclosure checkbox.
-- [src/](src/) — database source tree, including PostgreSQL-derived backend, frontend utilities, interfaces, tests, and build integration.
-- [src/backend/](src/backend/) — main database backend. Important areas include parser, optimizer, executor, storage, catalog, commands, postmaster, replication, and Cloudberry distributed components.
-- [src/backend/cdb/](src/backend/cdb/) — Cloudberry or Greenplum distributed database logic, including dispatch, gangs, motion, and MPP behavior.
-- [src/backend/gporca/](src/backend/gporca/) and [src/backend/gpopt/](src/backend/gpopt/) — ORCA optimizer integration and optimizer-related code.
-- [src/common/](src/common/) — code shared by backend and frontend utilities.
-- [src/interfaces/](src/interfaces/) — client interfaces such as libpq, ECPG, and GPPC.
-- [src/test/](src/test/) — regression, isolation, unit, and integration test infrastructure.
-- [gpMgmt/](gpMgmt/) — Python management utilities and cluster administration tooling.
-- [gpAux/](gpAux/) — auxiliary scripts, demo cluster support, packaging, and build helpers.
-- [gpcontrib/](gpcontrib/) — Cloudberry-related extensions and contributed modules.
-- [contrib/](contrib/) — PostgreSQL-style contributed modules and Cloudberry-specific extensions.
-- [doc/](doc/) — SGML documentation sources.
-- [devops/](devops/) — Docker, automation, sandbox, and build/deployment helper scripts.
-- [mcp-server/](mcp-server/) — MCP server for AI-ready Cloudberry database interaction.
-
-## Architecture notes
-
-Cloudberry follows a PostgreSQL-style source layout with additional MPP database components inherited from Greenplum. The coordinator receives SQL, plans or optimizes it, dispatches work to segments, and collects results. Segment processes execute distributed pieces of the plan and interact through the interconnect.
-
-Key concepts agents should recognize:
-
-- Coordinator and segments are separate roles in a distributed database cluster.
-- Query execution may involve dispatch, gangs, motion nodes, distributed transactions, snapshots, and interconnect behavior.
-- Storage and catalog changes can affect upgrade, recovery, visibility, and distributed consistency.
-- PostgreSQL compatibility matters. Avoid changing behavior that is inherited from PostgreSQL unless the task explicitly targets Cloudberry divergence.
-- Extensions under [gpcontrib/](gpcontrib/) and [contrib/](contrib/) may have independent build or test workflows.
-
-## Working rules
-
-1. Start by identifying the subsystem and reading nearby files, tests, and documentation.
-2. Prefer existing helpers, macros, memory contexts, error reporting conventions, and test infrastructure.
-3. Avoid unrelated formatting changes.
-4. Avoid renaming symbols or moving files unless explicitly required.
-5. Do not silently change SQL-visible behavior, catalog definitions, on-disk format, wire protocol, GUC behavior, or user-facing messages.
-6. If a change touches security-sensitive areas, call that out clearly in the PR description and request appropriate human review.
-7. If a change touches distributed execution, verify whether it affects both coordinator and segment behavior.
-8. If a change touches management scripts, check Python compatibility and existing unit or behave tests.
-9. If a change touches documentation, keep examples accurate and consistent with project terminology.
-10. If behavior is uncertain, add a small regression or unit test rather than relying on assumptions.
-
-## Build and test guidance
-
-Use the smallest relevant validation first, then broader validation when the change is ready.
-
-Common validation entry points mentioned by project docs and PR templates:
-
-- Configure and build through the repository's standard build flow or the automation in [devops/README.md](devops/README.md).
-- Use Docker-based development and sandbox workflows under [devops/](devops/) when local system dependencies are not available.
-- Run `make installcheck` for regression coverage when appropriate.
-- Run `make -C src/test installcheck-cbdb-parallel` for Cloudberry parallel regression coverage when appropriate.
-- For extension-specific changes, run the extension's local installcheck or documented test target.
-- For management tooling under [gpMgmt/](gpMgmt/), inspect the relevant README and test targets before selecting a test command.
-
-Do not invent successful test results. If tests are not run, state that clearly in the final response or PR notes.
-
-## AI-assisted contribution policy
-
-Follow [AI_POLICY.md](AI_POLICY.md):
-
-- AI-generated code has the same responsibility and quality bar as human-written code.
-- AI-assisted changes must pass normal review, testing, and CI standards.
-- The contributor must ensure license compatibility.
-- Significant AI-generated code should be disclosed using the PR template checkbox.
-- Do not use AI to auto-generate responses to maintainer review feedback.
-- Include or verify tests for AI-generated code.
-- Keep changes simple and avoid code refactoring.
-
-## Security policy
-
-Follow [SECURITY.md](SECURITY.md):
-
-- Do not report security vulnerabilities in public issues, public mailing lists, or public forums.
-- Send vulnerability reports to security@apache.org.
-- For normal non-security bugs, use GitHub Issues, Discussions, the dev mailing list, or Slack.
-
-When working as an agent, do not expose secrets, credentials, private keys, database dumps with sensitive data, or vulnerability details in public-facing output.
-
-## Pull request expectations
-
-Use [.github/pull_request_template.md](.github/pull_request_template.md) as the checklist for final change summaries:
-
-- Explain what the PR does.
-- Identify the type of change.
-- Document breaking changes if any.
-- Provide a test plan.
-- Describe performance, user-facing, and dependency impact when applicable.
-- Confirm documentation updates when needed.
-- Confirm security review consideration.
-- Disclose significant AI-assisted code generation.
-
-## Style expectations
-
-- C code should follow the surrounding PostgreSQL or Cloudberry style.
-- Python code in [gpMgmt/](gpMgmt/) should follow nearby management script patterns and existing test style.
-- SQL tests should include expected output files when required by the test framework.
-- Documentation uses Markdown in many repository files and SGML under [doc/src/sgml/](doc/src/sgml/).
-- Prefer project terminology: Apache Cloudberry, coordinator, segment, MPP, PostgreSQL kernel, Greenplum heritage.
-
-## High-risk areas
-
-Be especially conservative around:
-
-- Catalog definitions and upgrade-sensitive files.
-- Storage formats, WAL, recovery, transactions, snapshots, and visibility.
-- Planner, optimizer, executor, and motion/distributed execution logic.
-- Authentication, cryptography, TLS, network protocol, and libpq behavior.
-- Interconnect and dispatch paths.
-- Cluster management commands that start, stop, expand, recover, or reconfigure clusters.
-- Public SQL behavior, GUCs, system views, and extension APIs.
-
-## Recommended agent workflow
-
-1. Restate the requested change in concrete terms.
-2. Locate the smallest relevant subsystem.
-3. Read nearby implementation and tests.
-4. Plan a minimal change.
-5. Edit only files required for the task.
-6. Add or update tests when behavior changes.
-7. Run the narrowest relevant tests available.
-8. Summarize changed files, test results, and any risks or follow-ups.
-
-## What not to do
-
-- Do not perform drive-by cleanup.
-- Do not reformat unrelated code.
-- Do not replace established PostgreSQL-style patterns with modern alternatives just for preference.
-- Do not change public behavior without tests and documentation.
-- Do not assume single-node behavior is enough for distributed database changes.
-- Do not fabricate command output, test results, issue links, or reviewer decisions.
diff --git a/AI_POLICY.md b/AI_POLICY.md
index d1c2048b153..889a37a90f8 100644
--- a/AI_POLICY.md
+++ b/AI_POLICY.md
@@ -1,3 +1,22 @@
+
+
 # AI Policy
 
 We welcome AI tools in Apache Cloudberry development — code assistants, LLMs, AI code review, and beyond. AI is a normal developer tool, like an IDE or a debugger. This document sets simple ground rules so everyone can use AI responsibly.
@@ -34,13 +53,13 @@ When your PR includes significant AI-generated code, check the AI disclosure box
 
 Our core is PostgreSQL, and refactoring work has already been done here. Rewriting code significantly complicates rebase. Also, refactoring changes the code in a way that forces people to relearn the code they already know. Keep changes as simple as possible.
 
-**Example:** The point of LLM is to spend your tokens. One day, you will be asked: "This code is not very good. Do you want to improve it?" Of course! It could happens several times. Tokens are spent, but what is the point of such refactoring? (Rhetorical question)
+**Example:** The point of LLM is to spend your tokens. One day, you will be asked: "This code is not very good. Do you want to improve it?" Of course! It could happen several times. Tokens are spent, but what is the point of such refactoring? (Rhetorical question)
 
 ### 6. LLM code review
 
 So far, it is not possible to use paid LLM models for code review in open source ASF projects. However, one could use personal licenses for LLMs to do the same.
 
-**Example:** We primarily use GitHub Copilot for automated AI code review on pull requests. Here are some important points:
+**Example:** One could use GitHub Copilot for automated AI code review on pull requests. Here are some important points:
 
 - Copilot suggestions are **non-binding hints**, not requirements.
 - If a suggestion is irrelevant or wrong, skip it — you know your code best.
@@ -66,5 +85,5 @@ Do not use AI to auto-generate responses to review feedback. Maintainers invest
 ## Resources
 
 - [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) - Official Apache guidance on AI tool usage
-- [GitHub Copilot](https://github.com/features/copilot) - AI pair programmer and code reviewer we use
-- [LLM Leaderboard](https://llm-stats.com/) - LLM Stats Score, it's better to use high-ranked models
\ No newline at end of file
+- [GitHub Copilot](https://github.com/features/copilot) - AI pair programmer and code reviewer
+- [LLM Leaderboard](https://llm-stats.com/) - LLM Stats Score, it's better to use high-ranked models

From cf3d7a8c133981a00557218f6fd36e63c5fcdfd5 Mon Sep 17 00:00:00 2001
From: Leonid Borchuk
Date: Wed, 13 May 2026 12:22:48 +0300
Subject: [PATCH 3/9] Rename AI_POLICY to AI_GUIDELINE and add AGENTS.md.template file

---
 .gitmessage        |   4 +
 AGENTS.md.template | 175 ++++++++++++++++++++++++++++++++++++++++++++
 AI_GUIDELINE.md    | 178 +++++++++++++++++++++++++++++++++++++++++++++
 AI_POLICY.md       |  89 -----------------------
 README.md          |   2 +-
 5 files changed, 358 insertions(+), 90 deletions(-)
 create mode 100644 AGENTS.md.template
 create mode 100644 AI_GUIDELINE.md
 delete mode 100644 AI_POLICY.md

diff --git a/.gitmessage b/.gitmessage
index 9852789f9a9..0e5b8bc31d9 100644
--- a/.gitmessage
+++ b/.gitmessage
@@ -35,6 +35,10 @@ Add your commit body here
 # Discussions, please list them as a reference:
 #See: Issue#id ?
 #See: Discussion#id ?
+# If AI tools substantially assisted in writing this commit, optionally
+# note which tool(s) were used (one line per tool):
+#Assisted-by: ChatGPT
+#Assisted-by: GitHub Copilot
 ########################################################################
 #
 #
diff --git a/AGENTS.md.template b/AGENTS.md.template
new file mode 100644
index 00000000000..25dec6cee29
--- /dev/null
+++ b/AGENTS.md.template
@@ -0,0 +1,175 @@
+
+
+# AGENTS.md
+
+Guidance for agent-style coding tools working in the Apache Cloudberry repository.
+
+## Project overview
+
+Apache Cloudberry is an Apache Incubator project and an open-source massively parallel processing database. It evolved from Greenplum Database and is built on a PostgreSQL kernel. It is used for data warehouse, large-scale analytics, and AI or ML workloads.
+
+Treat this repository as a database system, not as a typical application project. Small changes can affect SQL semantics, query planning, storage, distributed execution, management tooling, upgrade behavior, and user data safety.
+
+## Core principles for agents
+
+- Keep changes as small and direct as possible.
+- Do not perform broad code refactoring. Cloudberry's core is PostgreSQL-based, and unnecessary refactoring makes familiar code harder for maintainers to recognize and review.
+- Preserve PostgreSQL and Greenplum coding style in the area being edited.
+- Prefer localized fixes over architecture rewrites unless explicitly requested.
+- Read surrounding code before editing. Match existing naming, memory management, error handling, locking, and test patterns.
+- Do not generate or import code with incompatible licensing. The project is Apache License 2.0.
+- Never treat AI output as automatically correct. The contributor owns the final code.
+
+## Repository map
+
+- [README.md](README.md) — project introduction, community links, contribution overview, and license information.
+- [CONTRIBUTING.md](CONTRIBUTING.md) — contribution expectations and community guidance.
+- [AI_POLICY.md](AI_POLICY.md) — rules for AI-assisted development.
+- [SECURITY.md](SECURITY.md) — security reporting policy.
+- [.github/pull_request_template.md](.github/pull_request_template.md) — PR checklist, test plan, impact, and AI disclosure checkbox.
+- [src/](src/) — database source tree, including PostgreSQL-derived backend, frontend utilities, interfaces, tests, and build integration.
+- [src/backend/](src/backend/) — main database backend. Important areas include parser, optimizer, executor, storage, catalog, commands, postmaster, replication, and Cloudberry distributed components.
+- [src/backend/cdb/](src/backend/cdb/) — Cloudberry or Greenplum distributed database logic, including dispatch, gangs, motion, and MPP behavior.
+- [src/backend/gporca/](src/backend/gporca/) and [src/backend/gpopt/](src/backend/gpopt/) — ORCA optimizer integration and optimizer-related code.
+- [src/common/](src/common/) — code shared by backend and frontend utilities.
+- [src/interfaces/](src/interfaces/) — client interfaces such as libpq, ECPG, and GPPC.
+- [src/test/](src/test/) — regression, isolation, unit, and integration test infrastructure.
+- [gpMgmt/](gpMgmt/) — Python management utilities and cluster administration tooling.
+- [gpAux/](gpAux/) — auxiliary scripts, demo cluster support, packaging, and build helpers.
+- [gpcontrib/](gpcontrib/) — Cloudberry-related extensions and contributed modules.
+- [contrib/](contrib/) — PostgreSQL-style contributed modules and Cloudberry-specific extensions.
+- [doc/](doc/) — SGML documentation sources.
+- [devops/](devops/) — Docker, automation, sandbox, and build/deployment helper scripts.
+- [mcp-server/](mcp-server/) — MCP server for AI-ready Cloudberry database interaction.
+
+## Architecture notes
+
+Cloudberry follows a PostgreSQL-style source layout with additional MPP database components inherited from Greenplum. The coordinator receives SQL, plans or optimizes it, dispatches work to segments, and collects results. Segment processes execute distributed pieces of the plan and interact through the interconnect.
+
+Key concepts agents should recognize:
+
+- Coordinator and segments are separate roles in a distributed database cluster.
+- Query execution may involve dispatch, gangs, motion nodes, distributed transactions, snapshots, and interconnect behavior.
+- Storage and catalog changes can affect upgrade, recovery, visibility, and distributed consistency.
+- PostgreSQL compatibility matters. Avoid changing behavior that is inherited from PostgreSQL unless the task explicitly targets Cloudberry divergence. +- Extensions under [gpcontrib/](gpcontrib/) and [contrib/](contrib/) may have independent build or test workflows. + +## Working rules + +1. Start by identifying the subsystem and reading nearby files, tests, and documentation. +2. Prefer existing helpers, macros, memory contexts, error reporting conventions, and test infrastructure. +3. Avoid unrelated formatting changes. +4. Avoid renaming symbols or moving files unless explicitly required. +5. Do not silently change SQL-visible behavior, catalog definitions, on-disk format, wire protocol, GUC behavior, or user-facing messages. +6. If a change touches security-sensitive areas, call that out clearly in the PR description and request appropriate human review. +7. If a change touches distributed execution, verify whether it affects both coordinator and segment behavior. +8. If a change touches management scripts, check Python compatibility and existing unit or behave tests. +9. If a change touches documentation, keep examples accurate and consistent with project terminology. +10. If behavior is uncertain, add a small regression or unit test rather than relying on assumptions. + +## Build and test guidance + +Use the smallest relevant validation first, then broader validation when the change is ready. + +Common validation entry points mentioned by project docs and PR templates: + +- Configure and build through the repository's standard build flow or the automation in [devops/README.md](devops/README.md). +- Use Docker-based development and sandbox workflows under [devops/](devops/) when local system dependencies are not available. +- Run `make installcheck` for regression coverage when appropriate. +- Run `make -C src/test installcheck-cbdb-parallel` for Cloudberry parallel regression coverage when appropriate. 
+- For extension-specific changes, run the extension's local installcheck or documented test target. +- For management tooling under [gpMgmt/](gpMgmt/), inspect the relevant README and test targets before selecting a test command. + +Do not invent successful test results. If tests are not run, state that clearly in the final response or PR notes. + +## AI-assisted contribution policy + +Follow [AI_POLICY.md](AI_POLICY.md): + +- AI-generated code has the same responsibility and quality bar as human-written code. +- AI-assisted changes must pass normal review, testing, and CI standards. +- The contributor must ensure license compatibility. +- Significant AI-generated code should be disclosed using the PR template checkbox. +- Do not use AI to auto-generate responses to maintainer review feedback. +- Include or verify tests for AI-generated code. +- Keep changes simple and avoid code refactoring. + +## Security policy + +Follow [SECURITY.md](SECURITY.md): + +- Do not report security vulnerabilities in public issues, public mailing lists, or public forums. +- Send vulnerability reports to security@apache.org. +- For normal non-security bugs, use GitHub Issues, Discussions, the dev mailing list, or Slack. + +When working as an agent, do not expose secrets, credentials, private keys, database dumps with sensitive data, or vulnerability details in public-facing output. + +## Pull request expectations + +Use [.github/pull_request_template.md](.github/pull_request_template.md) as the checklist for final change summaries: + +- Explain what the PR does. +- Identify the type of change. +- Document breaking changes if any. +- Provide a test plan. +- Describe performance, user-facing, and dependency impact when applicable. +- Confirm documentation updates when needed. +- Confirm security review consideration. +- Disclose significant AI-assisted code generation. + +## Style expectations + +- C code should follow the surrounding PostgreSQL or Cloudberry style. 
+- Python code in [gpMgmt/](gpMgmt/) should follow nearby management script patterns and existing test style. +- SQL tests should include expected output files when required by the test framework. +- Documentation uses Markdown in many repository files and SGML under [doc/src/sgml/](doc/src/sgml/). +- Prefer project terminology: Apache Cloudberry, coordinator, segment, MPP, PostgreSQL kernel, Greenplum heritage. + +## High-risk areas + +Be especially conservative around: + +- Catalog definitions and upgrade-sensitive files. +- Storage formats, WAL, recovery, transactions, snapshots, and visibility. +- Planner, optimizer, executor, and motion/distributed execution logic. +- Authentication, cryptography, TLS, network protocol, and libpq behavior. +- Interconnect and dispatch paths. +- Cluster management commands that start, stop, expand, recover, or reconfigure clusters. +- Public SQL behavior, GUCs, system views, and extension APIs. + +## Recommended agent workflow + +1. Restate the requested change in concrete terms. +2. Locate the smallest relevant subsystem. +3. Read nearby implementation and tests. +4. Plan a minimal change. +5. Edit only files required for the task. +6. Add or update tests when behavior changes. +7. Run the narrowest relevant tests available. +8. Summarize changed files, test results, and any risks or follow-ups. + +## What not to do + +- Do not perform drive-by cleanup. +- Do not reformat unrelated code. +- Do not replace established PostgreSQL-style patterns with modern alternatives just for preference. +- Do not change public behavior without tests and documentation. +- Do not assume single-node behavior is enough for distributed database changes. +- Do not fabricate command output, test results, issue links, or reviewer decisions. 
diff --git a/AI_GUIDELINE.md b/AI_GUIDELINE.md new file mode 100644 index 00000000000..3b92f96a68c --- /dev/null +++ b/AI_GUIDELINE.md @@ -0,0 +1,178 @@ + + +# Guidelines for AI-assisted Contributions + +Apache Cloudberry follows the ASF Generative Tooling Guidance +for the use of AI-assisted development tools: + +https://www.apache.org/legal/generative-tooling.html + +This document provides additional project-specific guidance and +best practices for using AI tools in the Cloudberry community. +It is intended to supplement ASF guidance, not replace it. + +## Guidelines + +### 1. You own the code + +AI-generated code carries the same responsibility as code you +type yourself. Review it before submitting. If a bug ships, +"the AI wrote it" is not a defense. + +**Example:** As an experiment, you used an LLM to generate a +new type of executor node. The results were impressive, and you +wanted to share them with the community. Before opening a PR, +read every line, verify the logic, and make sure it fits with +existing code patterns. Someone might use your code in +production, not just for experiments. + +### 2. Same quality bar + +AI-assisted contributions must pass the same review, testing, +and CI standards as any other code. No shortcuts. AI-generated +code must come with corresponding tests, or be covered by +existing ones. If the AI wrote the code, you should at least +write or carefully verify the tests. + +**Example:** You use an LLM to implement a new aggregate +function. The PR must include regression tests in `src/test` +that exercise both normal and edge cases. + +### 3. Watch the license + +Don't let AI introduce code incompatible with the Apache +License 2.0. You are responsible for ensuring all submitted +code — AI-generated or not — has proper licensing. + +See [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) +for details. 
+ +**Example:** If an AI tool reproduces a snippet from a +GPL-licensed project, you must not include it. When in doubt, +rewrite from scratch. + +### 4. Flag it + +When your PR includes significant AI-generated code, check the +AI disclosure box in the PR template. You don't have to +disclose minor AI assistance (autocomplete, reformatting), but +be transparent about substantial generation. + +You can also record AI assistance directly in the commit +message using the optional `Assisted-by:` trailer (one line +per tool), following the same convention used by the Linux +kernel. See the +[Linux kernel coding assistants guidance](https://docs.kernel.org/process/coding-assistants.html) +for background. + +``` +Assisted-by: ChatGPT +Assisted-by: GitHub Copilot +``` + +This trailer is optional and non-binding — it provides +lightweight provenance information without making AI +disclosure a strict requirement. + +**Example:** Using an LLM to autocomplete a single function +signature — no need for a flag. Using an LLM to generate an +entire new GUC parameter with validation logic — flag it and +add an `Assisted-by:` trailer. The flag doesn't mean the PR +skips review or merge criteria, but it gives reviewers more +context about the generation method and lets them focus on +architecture and logic rather than specific operators. + +### 5. No meaningless code refactoring + +Our core is PostgreSQL, and refactoring work has already been +done here. Rewriting code significantly complicates rebase. +Also, refactoring changes the code in a way that forces people +to relearn code they already know. Keep changes as simple as +possible. + +**Example:** LLMs are eager to refactor. One day you may be +asked: "This code is not very good. Do you want to improve +it?" Of course! It could happen several times. Tokens are +spent, but what is the point of such refactoring? +(Rhetorical question) + +### 6. 
LLM code review + +Some AI review tools (such as GitHub Copilot Review or +CodeRabbit) may not currently be available for ASF-hosted +repositories due to operational, budgetary, or permission +reasons. Contributors can still use personal AI tools locally +but are responsible for ensuring code quality, compliance with +licensing terms, and reviewing outcomes. + +**Example:** One could use GitHub Copilot for automated AI +code review on pull requests. Here are some important points: + +- Copilot suggestions are **non-binding hints**, not + requirements. +- If a suggestion is irrelevant or wrong, skip it — you know + your code best. +- If a suggestion catches a real issue, fix it like you would + for any review comment. +- Copilot does not replace human reviewers. All PRs still need + approval from a committer. + +### 7. Talk to maintainers yourself + +Review discussions should reflect the contributor's own +understanding and technical judgment. AI tools may assist with +drafting responses, but contributors should engage +thoughtfully and personally with reviewers. Maintainers invest +time reviewing your code; respond in kind. + +**Example:** A reviewer asks "why did you choose this approach +over X?" — write your own answer explaining the tradeoff, +don't paste an LLM-generated reply. + +## AGENTS.md + +[AGENTS.md](https://agents.md/) is a README for agents: a +dedicated, predictable place to provide context and +instructions to help AI coding agents work on your project. +We do not ship a repository-level `AGENTS.md` because the +right content is platform- and user-specific. If you work with +AI coding agents locally, create your own `AGENTS.md`. You could +take the template from the `AGENTS.md.template` file. 
+ +## Good uses of AI + +- Bug fixing and root cause analysis +- Code review +- Writing and improving tests +- Documentation and code comments +- Build system and CI improvements +- Security research and vulnerability scanning +- Learning the codebase faster + +## Resources + +- [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) + — Official Apache guidance on AI tool usage +- [GitHub Copilot](https://github.com/features/copilot) + — AI pair programmer and code reviewer +- [CodeRabbit](https://www.coderabbit.ai/) + — Yet another AI pair programmer and code reviewer +- [AGENTS.md](https://agents.md/) + - README for agents \ No newline at end of file diff --git a/AI_POLICY.md b/AI_POLICY.md deleted file mode 100644 index 889a37a90f8..00000000000 --- a/AI_POLICY.md +++ /dev/null @@ -1,89 +0,0 @@ - - -# AI Policy - -We welcome AI tools in Apache Cloudberry development — code assistants, LLMs, AI code review, and beyond. AI is a normal developer tool, like an IDE or a debugger. This document sets simple ground rules so everyone can use AI responsibly. - -## Guidelines - -### 1. You own the code - -AI-generated code carries the same responsibility as code you type yourself. Review it before submitting. If a bug ships, "the AI wrote it" is not a defense. - -**Example:** As an experiment, you used LLM to generate a new type of executor node. The results were impressive, and you wanted to share them with the community. Before opening PR, read every line, verify the logic, and make sure it fits with existing code patterns. Someone might use your code in production, not just for experiments. - -### 2. Same quality bar - -AI-assisted contributions must pass the same review, testing, and CI standards as any other code. No shortcuts. AI-generated code must come with corresponding tests, or be covered by existing ones. If the AI wrote the code, you should at least write or carefully verify the tests. 
- -**Example:** You use an LLM to implement a new aggregate function. The PR must include regression tests in `src/test` that exercise both normal and edge cases. - -### 3. Watch the license - -Don't let AI introduce code incompatible with the Apache License 2.0. You are responsible for ensuring all submitted code — AI-generated or not — has proper licensing. - -See [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details. - -**Example:** If an AI tool reproduces a snippet from a GPL-licensed project, you must not include it. When in doubt, rewrite from scratch. - -### 4. Flag it - -When your PR includes significant AI-generated code, check the AI disclosure box in the PR template. You don't have to disclose minor AI assistance (autocomplete, reformatting), but be transparent about substantial generation. - -**Example:** Using LLM to autocomplete a single function signature - no need for a flag. Using LLMs to generate an entire new GUC parameter with validation logic - flag it. The flag doesn't mean that the PR doesn't need to be reviewed or merged, but it will give reviewers more information about the code generation method and allow them to focus more on checking the architecture and logic, rather than specific operators. - -### 5. No meaningless code refactoring - -Our core is PostgreSQL, and refactoring work has already been done here. Rewriting code significantly complicates rebase. Also, refactoring changes the code in a way that forces people to relearn the code they already know. Keep changes as simple as possible. - -**Example:** The point of LLM is to spend your tokens. One day, you will be asked: "This code is not very good. Do you want to improve it?" Of course! It could happen several times. Tokens are spent, but what is the point of such refactoring? (Rhetorical question) - -### 6. LLM code review - -So far, it is not possible to use paid LLM models for code review in open source ASF projects. 
However, one could use personal licenses for LLMs to do the same. - -**Example:** One could use GitHub Copilot for automated AI code review on pull requests. Here are some important points: - -- Copilot suggestions are **non-binding hints**, not requirements. -- If a suggestion is irrelevant or wrong, skip it — you know your code best. -- If a suggestion catches a real issue, fix it like you would for any review comment. -- Copilot does not replace human reviewers. All PRs still need approval from a committer. - -### 7. Talk to maintainers yourself - -Do not use AI to auto-generate responses to review feedback. Maintainers invest time reviewing your code; respond thoughtfully and personally. - -**Example:** A reviewer asks "why did you choose this approach over X?" — write your own answer explaining the tradeoff, don't paste an LLM-generated reply. - -## Good uses of AI - -- Bug fixing and root cause analysis -- Code review -- Writing and improving tests -- Documentation and code comments -- Build system and CI improvements -- Security research and vulnerability scanning -- Learning the codebase faster - -## Resources - -- [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) - Official Apache guidance on AI tool usage -- [GitHub Copilot](https://github.com/features/copilot) - AI pair programmer and code reviewer -- [LLM Leaderboard](https://llm-stats.com/) - LLM Stats Score, it's better to use high-ranked models diff --git a/README.md b/README.md index 9e2a5686317..e332b06b309 100644 --- a/README.md +++ b/README.md @@ -111,7 +111,7 @@ with the contribution. 
| Code contribution | Learn how to contribute code to the Cloudberry, including coding preparation, conventions, workflow, review, and checklist following the [code contribution guide](https://cloudberry.apache.org/contribute/code).| | Submit the proposal | Proposing major changes to Cloudberry through [proposal guide](https://cloudberry.apache.org/contribute/proposal).| | Doc contribution | We need you to join us to help us improve the documentation, see the [doc contribution guide](https://cloudberry.apache.org/contribute/doc).| -| AI policy | For AI-assisted development, please review our [AI policy](AI_POLICY.md) for guidelines on responsible AI usage.| +| AI guideline | For AI-assisted development, please review our [AI guideline](AI_GUIDELINE.md) for advice on responsible AI usage.| ## Roadmap From f022292cb3ba07974ac591b9ccf9be476bb4de87 Mon Sep 17 00:00:00 2001 From: Leonid Borchuk Date: Wed, 13 May 2026 12:25:33 +0300 Subject: [PATCH 4/9] Use long - --- AI_GUIDELINE.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/AI_GUIDELINE.md b/AI_GUIDELINE.md index 3b92f96a68c..6a8c684c04d 100644 --- a/AI_GUIDELINE.md +++ b/AI_GUIDELINE.md @@ -175,4 +175,4 @@ take the template from the `AGENTS.md.template` file. 
- [CodeRabbit](https://www.coderabbit.ai/) — Yet another AI pair programmer and code reviewer - [AGENTS.md](https://agents.md/) - - README for agents \ No newline at end of file + — README for agents \ No newline at end of file From d6bdeb77495e97bd94637db6065ff8fc8a276b32 Mon Sep 17 00:00:00 2001 From: Leonid <63977577+leborchuk@users.noreply.github.com> Date: Thu, 14 May 2026 10:23:34 +0300 Subject: [PATCH 5/9] Update AGENTS.md.template Co-authored-by: Dianjin Wang --- AGENTS.md.template | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/AGENTS.md.template b/AGENTS.md.template index 25dec6cee29..0daa0f1914a 100644 --- a/AGENTS.md.template +++ b/AGENTS.md.template @@ -23,7 +23,7 @@ Guidance for agent-style coding tools working in the Apache Cloudberry repositor ## Project overview -Apache Cloudberry is an Apache Incubator project and an open-source massively parallel processing database. It evolved from Greenplum Database and is built on a PostgreSQL kernel. It is used for data warehouse, large-scale analytics, and AI or ML workloads. +Apache Cloudberry is an Apache Incubator project and an open-source massively parallel processing database. It evolved from Greenplum Database and is built on a modern PostgreSQL kernel. It is used for data warehouse, large-scale analytics, and AI or ML workloads. Treat this repository as a database system, not as a typical application project. Small changes can affect SQL semantics, query planning, storage, distributed execution, management tooling, upgrade behavior, and user data safety. 
From 830bfbb1e3f0b474d6697cdf1cc6465b5020ccfd Mon Sep 17 00:00:00 2001 From: Leonid <63977577+leborchuk@users.noreply.github.com> Date: Thu, 14 May 2026 10:23:41 +0300 Subject: [PATCH 6/9] Update AGENTS.md.template Co-authored-by: Dianjin Wang --- AGENTS.md.template | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/AGENTS.md.template b/AGENTS.md.template index 0daa0f1914a..e3688ea014a 100644 --- a/AGENTS.md.template +++ b/AGENTS.md.template @@ -31,7 +31,7 @@ Treat this repository as a database system, not as a typical application project - Keep changes as small and direct as possible. - Do not perform broad code refactoring. Cloudberry's core is PostgreSQL-based, and unnecessary refactoring makes familiar code harder for maintainers to recognize and review. -- Preserve PostgreSQL and Greenplum coding style in the area being edited. +- Preserve PostgreSQL and Cloudberry coding style in the area being edited. - Prefer localized fixes over architecture rewrites unless explicitly requested. - Read surrounding code before editing. Match existing naming, memory management, error handling, locking, and test patterns. - Do not generate or import code with incompatible licensing. The project is Apache License 2.0. From 3fc2627debdece47cf4b26df56b3c577f968ca05 Mon Sep 17 00:00:00 2001 From: Leonid <63977577+leborchuk@users.noreply.github.com> Date: Thu, 14 May 2026 10:23:48 +0300 Subject: [PATCH 7/9] Update AGENTS.md.template Co-authored-by: Dianjin Wang --- AGENTS.md.template | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/AGENTS.md.template b/AGENTS.md.template index e3688ea014a..b307020ec77 100644 --- a/AGENTS.md.template +++ b/AGENTS.md.template @@ -41,7 +41,7 @@ Treat this repository as a database system, not as a typical application project - [README.md](README.md) — project introduction, community links, contribution overview, and license information. 
- [CONTRIBUTING.md](CONTRIBUTING.md) — contribution expectations and community guidance. -- [AI_POLICY.md](AI_POLICY.md) — rules for AI-assisted development. +- [AI_GUIDELINE.md](AI_GUIDELINE.md) — rules for AI-assisted development. - [SECURITY.md](SECURITY.md) — security reporting policy. - [.github/pull_request_template.md](.github/pull_request_template.md) — PR checklist, test plan, impact, and AI disclosure checkbox. - [src/](src/) — database source tree, including PostgreSQL-derived backend, frontend utilities, interfaces, tests, and build integration. From 4287f8e0daaa04bb97ba45e98d667b75507c0814 Mon Sep 17 00:00:00 2001 From: Leonid <63977577+leborchuk@users.noreply.github.com> Date: Thu, 14 May 2026 10:23:56 +0300 Subject: [PATCH 8/9] Update AGENTS.md.template Co-authored-by: Dianjin Wang --- AGENTS.md.template | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/AGENTS.md.template b/AGENTS.md.template index b307020ec77..da026b9fddb 100644 --- a/AGENTS.md.template +++ b/AGENTS.md.template @@ -101,7 +101,7 @@ Do not invent successful test results. If tests are not run, state that clearly ## AI-assisted contribution policy -Follow [AI_POLICY.md](AI_POLICY.md): +Follow [AI_GUIDELINE.md](AI_GUIDELINE.md): - AI-generated code has the same responsibility and quality bar as human-written code. - AI-assisted changes must pass normal review, testing, and CI standards. From f58921e2aa70b97b9fa3b6d90c475634ca4541cb Mon Sep 17 00:00:00 2001 From: Leonid Borchuk Date: Thu, 14 May 2026 14:31:52 +0300 Subject: [PATCH 9/9] Keep AGENTS.md.template in sync with AI_GUIDANCE.md --- AGENTS.md.template | 276 ++++++++++++++++++++++++++++++++------------- 1 file changed, 199 insertions(+), 77 deletions(-) diff --git a/AGENTS.md.template b/AGENTS.md.template index da026b9fddb..1e63ac398e0 100644 --- a/AGENTS.md.template +++ b/AGENTS.md.template @@ -19,139 +19,256 @@ # AGENTS.md -Guidance for agent-style coding tools working in the Apache Cloudberry repository. 
+Guidance for agent-style coding tools working in the Apache +Cloudberry repository. ## Project overview -Apache Cloudberry is an Apache Incubator project and an open-source massively parallel processing database. It evolved from Greenplum Database and is built on a modern PostgreSQL kernel. It is used for data warehouse, large-scale analytics, and AI or ML workloads. +Apache Cloudberry is an Apache Incubator project and an +open-source massively parallel processing database. It evolved +from Greenplum Database and is built on a modern PostgreSQL +kernel. It is used for data warehouse, large-scale analytics, +and AI or ML workloads. -Treat this repository as a database system, not as a typical application project. Small changes can affect SQL semantics, query planning, storage, distributed execution, management tooling, upgrade behavior, and user data safety. +Treat this repository as a database system, not as a typical +application project. Small changes can affect SQL semantics, +query planning, storage, distributed execution, management +tooling, upgrade behavior, and user data safety. ## Core principles for agents - Keep changes as small and direct as possible. -- Do not perform broad code refactoring. Cloudberry's core is PostgreSQL-based, and unnecessary refactoring makes familiar code harder for maintainers to recognize and review. -- Preserve PostgreSQL and Cloudberry coding style in the area being edited. -- Prefer localized fixes over architecture rewrites unless explicitly requested. -- Read surrounding code before editing. Match existing naming, memory management, error handling, locking, and test patterns. -- Do not generate or import code with incompatible licensing. The project is Apache License 2.0. -- Never treat AI output as automatically correct. The contributor owns the final code. +- Do not perform broad code refactoring. 
Cloudberry's core is + PostgreSQL-based, and unnecessary refactoring makes familiar + code harder for maintainers to recognize and review. +- Preserve PostgreSQL and Cloudberry coding style in the area + being edited. +- Prefer localized fixes over architecture rewrites unless + explicitly requested. +- Read surrounding code before editing. Match existing naming, + memory management, error handling, locking, and test + patterns. +- Do not generate or import code with incompatible licensing. + The project is Apache License 2.0. +- Never treat AI output as automatically correct. The + contributor owns the final code. ## Repository map -- [README.md](README.md) — project introduction, community links, contribution overview, and license information. -- [CONTRIBUTING.md](CONTRIBUTING.md) — contribution expectations and community guidance. -- [AI_GUIDELINE.md](AI_GUIDELINE.md) — rules for AI-assisted development. +- [README.md](README.md) — project introduction, community + links, contribution overview, and license information. +- [CONTRIBUTING.md](CONTRIBUTING.md) — contribution + expectations and community guidance. +- [AI_GUIDELINE.md](AI_GUIDELINE.md) — rules for AI-assisted + development. - [SECURITY.md](SECURITY.md) — security reporting policy. -- [.github/pull_request_template.md](.github/pull_request_template.md) — PR checklist, test plan, impact, and AI disclosure checkbox. -- [src/](src/) — database source tree, including PostgreSQL-derived backend, frontend utilities, interfaces, tests, and build integration. -- [src/backend/](src/backend/) — main database backend. Important areas include parser, optimizer, executor, storage, catalog, commands, postmaster, replication, and Cloudberry distributed components. -- [src/backend/cdb/](src/backend/cdb/) — Cloudberry or Greenplum distributed database logic, including dispatch, gangs, motion, and MPP behavior. 
-- [src/backend/gporca/](src/backend/gporca/) and [src/backend/gpopt/](src/backend/gpopt/) — ORCA optimizer integration and optimizer-related code. -- [src/common/](src/common/) — code shared by backend and frontend utilities. -- [src/interfaces/](src/interfaces/) — client interfaces such as libpq, ECPG, and GPPC. -- [src/test/](src/test/) — regression, isolation, unit, and integration test infrastructure. -- [gpMgmt/](gpMgmt/) — Python management utilities and cluster administration tooling. -- [gpAux/](gpAux/) — auxiliary scripts, demo cluster support, packaging, and build helpers. -- [gpcontrib/](gpcontrib/) — Cloudberry-related extensions and contributed modules. -- [contrib/](contrib/) — PostgreSQL-style contributed modules and Cloudberry-specific extensions. +- [.gitmessage](.gitmessage) — commit message template with + title, body, and trailer conventions. +- [.github/pull_request_template.md](.github/pull_request_template.md) + — PR checklist, test plan, impact, and AI disclosure + checkbox. +- [src/](src/) — database source tree, including + PostgreSQL-derived backend, frontend utilities, interfaces, + tests, and build integration. +- [src/backend/](src/backend/) — main database backend. + Important areas include parser, optimizer, executor, + storage, catalog, commands, postmaster, replication, and + Cloudberry distributed components. +- [src/backend/cdb/](src/backend/cdb/) — distributed database + logic, including dispatch, gangs, motion, and MPP behavior. +- [src/backend/gporca/](src/backend/gporca/) and + [src/backend/gpopt/](src/backend/gpopt/) — ORCA top-down optimizer + integration and optimizer-related code. +- [src/common/](src/common/) — code shared by backend and + frontend utilities. +- [src/interfaces/](src/interfaces/) — client interfaces such + as libpq, ECPG, and GPPC. +- [src/test/](src/test/) — regression, isolation, unit, and + integration test infrastructure. 
+- [gpMgmt/](gpMgmt/) — Python management utilities and + cluster administration tooling. +- [gpAux/](gpAux/) — auxiliary scripts, demo cluster support, + packaging, and build helpers. +- [gpcontrib/](gpcontrib/) — Cloudberry-related extensions and + contributed modules. +- [contrib/](contrib/) — PostgreSQL-style contributed modules + and Cloudberry-specific extensions. - [doc/](doc/) — SGML documentation sources. -- [devops/](devops/) — Docker, automation, sandbox, and build/deployment helper scripts. -- [mcp-server/](mcp-server/) — MCP server for AI-ready Cloudberry database interaction. +- [devops/](devops/) — Docker, automation, sandbox, and + build/deployment helper scripts. +- [mcp-server/](mcp-server/) — MCP server for AI-ready + Cloudberry database interaction. ## Architecture notes -Cloudberry follows a PostgreSQL-style source layout with additional MPP database components inherited from Greenplum. The coordinator receives SQL, plans or optimizes it, dispatches work to segments, and collects results. Segment processes execute distributed pieces of the plan and interact through the interconnect. +Cloudberry follows a PostgreSQL-style source layout with +additional MPP database components inherited from Greenplum. +The coordinator receives SQL, plans or optimizes it, dispatches +work to segments, and collects results. Segment processes +execute distributed pieces of the plan and interact through the +interconnect. Key concepts agents should recognize: -- Coordinator and segments are separate roles in a distributed database cluster. -- Query execution may involve dispatch, gangs, motion nodes, distributed transactions, snapshots, and interconnect behavior. -- Storage and catalog changes can affect upgrade, recovery, visibility, and distributed consistency. -- PostgreSQL compatibility matters. Avoid changing behavior that is inherited from PostgreSQL unless the task explicitly targets Cloudberry divergence. 
-- Extensions under [gpcontrib/](gpcontrib/) and [contrib/](contrib/) may have independent build or test workflows.
+- Coordinator and segments are separate roles in a distributed
+  database cluster.
+- Query execution may involve dispatch, gangs, motion nodes,
+  distributed transactions, snapshots, and interconnect
+  behavior.
+- Storage and catalog changes can affect upgrade, recovery,
+  visibility, and distributed consistency.
+- PostgreSQL compatibility matters. Avoid changing behavior
+  that is inherited from PostgreSQL unless the task explicitly
+  targets Cloudberry divergence.
+- Extensions under [gpcontrib/](gpcontrib/) and
+  [contrib/](contrib/) may have independent build or test
+  workflows.
 
 ## Working rules
 
-1. Start by identifying the subsystem and reading nearby files, tests, and documentation.
-2. Prefer existing helpers, macros, memory contexts, error reporting conventions, and test infrastructure.
+1. Start by identifying the subsystem and reading nearby
+   files, tests, and documentation.
+2. Prefer existing helpers, macros, memory contexts, error
+   reporting conventions, and test infrastructure.
 3. Avoid unrelated formatting changes.
-4. Avoid renaming symbols or moving files unless explicitly required.
-5. Do not silently change SQL-visible behavior, catalog definitions, on-disk format, wire protocol, GUC behavior, or user-facing messages.
-6. If a change touches security-sensitive areas, call that out clearly in the PR description and request appropriate human review.
-7. If a change touches distributed execution, verify whether it affects both coordinator and segment behavior.
-8. If a change touches management scripts, check Python compatibility and existing unit or behave tests.
-9. If a change touches documentation, keep examples accurate and consistent with project terminology.
-10. If behavior is uncertain, add a small regression or unit test rather than relying on assumptions.
+4. Avoid renaming symbols or moving files unless explicitly
+   required.
+5. Do not silently change SQL-visible behavior, catalog
+   definitions, on-disk format, wire protocol, GUC behavior,
+   or user-facing messages.
+6. If a change touches security-sensitive areas, call that out
+   clearly in the PR description and request appropriate human
+   review.
+7. If a change touches distributed execution, verify whether
+   it affects both coordinator and segment behavior.
+8. If a change touches management scripts, check Python
+   compatibility and existing unit or behave tests.
+9. If a change touches documentation, keep examples accurate
+   and consistent with project terminology.
+10. If behavior is uncertain, add a small regression or unit
+    test rather than relying on assumptions.
 
 ## Build and test guidance
 
-Use the smallest relevant validation first, then broader validation when the change is ready.
-
-Common validation entry points mentioned by project docs and PR templates:
-
-- Configure and build through the repository's standard build flow or the automation in [devops/README.md](devops/README.md).
-- Use Docker-based development and sandbox workflows under [devops/](devops/) when local system dependencies are not available.
-- Run `make installcheck` for regression coverage when appropriate.
-- Run `make -C src/test installcheck-cbdb-parallel` for Cloudberry parallel regression coverage when appropriate.
-- For extension-specific changes, run the extension's local installcheck or documented test target.
-- For management tooling under [gpMgmt/](gpMgmt/), inspect the relevant README and test targets before selecting a test command.
-
-Do not invent successful test results. If tests are not run, state that clearly in the final response or PR notes.
+Use the smallest relevant validation first, then broader
+validation when the change is ready.
+
+Common validation entry points mentioned by project docs and
+PR templates:
+
+- Configure and build through the repository's standard build
+  flow or the automation in
+  [devops/README.md](devops/README.md).
+- Use Docker-based development and sandbox workflows under
+  [devops/](devops/) when local system dependencies are not
+  available.
+- Run `make installcheck` for regression coverage when
+  appropriate.
+- Run `make -C src/test installcheck-cbdb-parallel` for
+  Cloudberry parallel regression coverage when appropriate.
+- For extension-specific changes, run the extension's local
+  installcheck or documented test target.
+- For management tooling under [gpMgmt/](gpMgmt/), inspect
+  the relevant README and test targets before selecting a test
+  command.
+
+Do not invent successful test results. If tests are not run,
+state that clearly in the final response or PR notes.
 
 ## AI-assisted contribution policy
 
 Follow [AI_GUIDELINE.md](AI_GUIDELINE.md):
 
-- AI-generated code has the same responsibility and quality bar as human-written code.
-- AI-assisted changes must pass normal review, testing, and CI standards.
+- AI-generated code has the same responsibility and quality
+  bar as human-written code.
+- AI-assisted changes must pass normal review, testing, and CI
+  standards.
 - The contributor must ensure license compatibility.
-- Significant AI-generated code should be disclosed using the PR template checkbox.
-- Do not use AI to auto-generate responses to maintainer review feedback.
+- Significant AI-generated code should be disclosed using the
+  PR template checkbox and optionally recorded with an
+  `Assisted-by:` trailer in the commit message.
+- AI tools may assist with drafting responses, but
+  contributors should engage thoughtfully and personally with
+  reviewers.
 - Include or verify tests for AI-generated code.
-- Keep changes simple and avoid code refactoring.
+- Keep changes simple and avoid meaningless code refactoring.
 
 ## Security policy
 
 Follow [SECURITY.md](SECURITY.md):
 
-- Do not report security vulnerabilities in public issues, public mailing lists, or public forums.
+- Do not report security vulnerabilities in public issues,
+  public mailing lists, or public forums.
 - Send vulnerability reports to security@apache.org.
-- For normal non-security bugs, use GitHub Issues, Discussions, the dev mailing list, or Slack.
+- For normal non-security bugs, use GitHub Issues,
+  Discussions, the dev mailing list, or Slack.
 
-When working as an agent, do not expose secrets, credentials, private keys, database dumps with sensitive data, or vulnerability details in public-facing output.
+When working as an agent, do not expose secrets, credentials,
+private keys, database dumps with sensitive data, or
+vulnerability details in public-facing output.
 
 ## Pull request expectations
 
-Use [.github/pull_request_template.md](.github/pull_request_template.md) as the checklist for final change summaries:
+Use [.github/pull_request_template.md](.github/pull_request_template.md)
+as the checklist for final change summaries:
 
 - Explain what the PR does.
 - Identify the type of change.
 - Document breaking changes if any.
 - Provide a test plan.
-- Describe performance, user-facing, and dependency impact when applicable.
+- Describe performance, user-facing, and dependency impact
+  when applicable.
 - Confirm documentation updates when needed.
 - Confirm security review consideration.
 - Disclose significant AI-assisted code generation.
 
+## Commit conventions
+
+- Add the standard Apache License header for newly created
+  files (not needed for third-party files).
+- When drafting the commit message, use the
+  [.gitmessage](.gitmessage) template as a reference.
+- Start the title with a prefix indicating the change type:
+  `Fix ...` for bug or typo fixes, `Feature: ...` for new
+  features, `Enhancement: ...` for code optimization,
+  `Doc: ...` for documentation changes. For other changes,
+  start with an imperative uppercase verb.
+- Keep the title line to 50 characters or fewer. Do not end
+  it with a period.
+- Leave a blank line between the title and the body.
+- In the body, explain *what*, *why*, and *how*. Note any
+  compatibility issues. Wrap lines at 72 characters.
+- Use optional trailers as needed: `Co-authored-by:`,
+  `Reported-by:`, `See:` (for GitHub Issues or Discussions
+  links), and `Assisted-by:` (for AI tool attribution).
+
 ## Style expectations
 
-- C code should follow the surrounding PostgreSQL or Cloudberry style.
-- Python code in [gpMgmt/](gpMgmt/) should follow nearby management script patterns and existing test style.
-- SQL tests should include expected output files when required by the test framework.
-- Documentation uses Markdown in many repository files and SGML under [doc/src/sgml/](doc/src/sgml/).
-- Prefer project terminology: Apache Cloudberry, coordinator, segment, MPP, PostgreSQL kernel, Greenplum heritage.
+- C code should follow the surrounding PostgreSQL or
+  Cloudberry style.
+- Python code in [gpMgmt/](gpMgmt/) should follow nearby
+  management script patterns and existing test style.
+- SQL tests should include expected output files when required
+  by the test framework.
+- Documentation uses Markdown in many repository files and
+  SGML under [doc/src/sgml/](doc/src/sgml/).
+- Prefer project terminology: Apache Cloudberry, coordinator,
+  segment, MPP, PostgreSQL kernel, Greenplum heritage.
 
 ## High-risk areas
 
 Be especially conservative around:
 
 - Catalog definitions and upgrade-sensitive files.
-- Storage formats, WAL, recovery, transactions, snapshots, and visibility.
-- Planner, optimizer, executor, and motion/distributed execution logic.
-- Authentication, cryptography, TLS, network protocol, and libpq behavior.
+- Storage formats, WAL, recovery, transactions, snapshots,
+  and visibility.
+- Planner, optimizer, executor, and motion/distributed
+  execution logic.
+- Authentication, cryptography, TLS, network protocol, and
+  libpq behavior.
 - Interconnect and dispatch paths.
-- Cluster management commands that start, stop, expand, recover, or reconfigure clusters.
+- Cluster management commands that start, stop, expand,
+  recover, or reconfigure clusters.
 - Public SQL behavior, GUCs, system views, and extension APIs.
 
 ## Recommended agent workflow
@@ -163,13 +280,18 @@ Be especially conservative around:
 5. Edit only files required for the task.
 6. Add or update tests when behavior changes.
 7. Run the narrowest relevant tests available.
-8. Summarize changed files, test results, and any risks or follow-ups.
+8. Summarize changed files, test results, and any risks or
+   follow-ups.
 
 ## What not to do
 
 - Do not perform drive-by cleanup.
 - Do not reformat unrelated code.
-- Do not replace established PostgreSQL-style patterns with modern alternatives just for preference.
-- Do not change public behavior without tests and documentation.
-- Do not assume single-node behavior is enough for distributed database changes.
-- Do not fabricate command output, test results, issue links, or reviewer decisions.
+- Do not replace established PostgreSQL-style patterns with
+  modern alternatives just for preference.
+- Do not change public behavior without tests and
+  documentation.
+- Do not assume single-node behavior is enough for distributed
+  database changes.
+- Do not fabricate command output, test results, issue links,
+  or reviewer decisions.