diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md index 3b592500f9c..d2c001658f9 100644 --- a/.github/pull_request_template.md +++ b/.github/pull_request_template.md @@ -36,6 +36,7 @@ Fixes #ISSUE_Number - [ ] Followed [contribution guide](https://cloudberry.apache.org/contribute/code) - [ ] Added/updated documentation - [ ] Reviewed code for security implications +- [ ] This PR contains AI-assisted code generation - [ ] Requested review from [cloudberry committers](https://github.com/orgs/apache/teams/cloudberry-committers) ### Additional Context diff --git a/.gitmessage b/.gitmessage index 9852789f9a9..0e5b8bc31d9 100644 --- a/.gitmessage +++ b/.gitmessage @@ -35,6 +35,10 @@ Add your commit body here # Discussions, please list them as a reference: #See: Issue#id ? #See: Discussion#id ? +# If AI tools substantially assisted in writing this commit, optionally +# note which tool(s) were used (one line per tool): +#Assisted-by: ChatGPT +#Assisted-by: GitHub Copilot ######################################################################## # # diff --git a/AGENTS.md.template b/AGENTS.md.template new file mode 100644 index 00000000000..1e63ac398e0 --- /dev/null +++ b/AGENTS.md.template @@ -0,0 +1,297 @@ + + +# AGENTS.md + +Guidance for agent-style coding tools working in the Apache +Cloudberry repository. + +## Project overview + +Apache Cloudberry is an Apache Incubator project and an +open-source massively parallel processing database. It evolved +from Greenplum Database and is built on a modern PostgreSQL +kernel. It is used for data warehouse, large-scale analytics, +and AI or ML workloads. + +Treat this repository as a database system, not as a typical +application project. Small changes can affect SQL semantics, +query planning, storage, distributed execution, management +tooling, upgrade behavior, and user data safety. + +## Core principles for agents + +- Keep changes as small and direct as possible. 
+- Do not perform broad code refactoring. Cloudberry's core is + PostgreSQL-based, and unnecessary refactoring makes familiar + code harder for maintainers to recognize and review. +- Preserve PostgreSQL and Cloudberry coding style in the area + being edited. +- Prefer localized fixes over architecture rewrites unless + explicitly requested. +- Read surrounding code before editing. Match existing naming, + memory management, error handling, locking, and test + patterns. +- Do not generate or import code with incompatible licensing. + The project is Apache License 2.0. +- Never treat AI output as automatically correct. The + contributor owns the final code. + +## Repository map + +- [README.md](README.md) — project introduction, community + links, contribution overview, and license information. +- [CONTRIBUTING.md](CONTRIBUTING.md) — contribution + expectations and community guidance. +- [AI_GUIDELINE.md](AI_GUIDELINE.md) — rules for AI-assisted + development. +- [SECURITY.md](SECURITY.md) — security reporting policy. +- [.gitmessage](.gitmessage) — commit message template with + title, body, and trailer conventions. +- [.github/pull_request_template.md](.github/pull_request_template.md) + — PR checklist, test plan, impact, and AI disclosure + checkbox. +- [src/](src/) — database source tree, including + PostgreSQL-derived backend, frontend utilities, interfaces, + tests, and build integration. +- [src/backend/](src/backend/) — main database backend. + Important areas include parser, optimizer, executor, + storage, catalog, commands, postmaster, replication, and + Cloudberry distributed components. +- [src/backend/cdb/](src/backend/cdb/) — distributed database + logic, including dispatch, gangs, motion, and MPP behavior. +- [src/backend/gporca/](src/backend/gporca/) and + [src/backend/gpopt/](src/backend/gpopt/) — ORCA top-down optimizer + integration and optimizer-related code. +- [src/common/](src/common/) — code shared by backend and + frontend utilities. 
+- [src/interfaces/](src/interfaces/) — client interfaces such + as libpq, ECPG, and GPPC. +- [src/test/](src/test/) — regression, isolation, unit, and + integration test infrastructure. +- [gpMgmt/](gpMgmt/) — Python management utilities and + cluster administration tooling. +- [gpAux/](gpAux/) — auxiliary scripts, demo cluster support, + packaging, and build helpers. +- [gpcontrib/](gpcontrib/) — Cloudberry-related extensions and + contributed modules. +- [contrib/](contrib/) — PostgreSQL-style contributed modules + and Cloudberry-specific extensions. +- [doc/](doc/) — SGML documentation sources. +- [devops/](devops/) — Docker, automation, sandbox, and + build/deployment helper scripts. +- [mcp-server/](mcp-server/) — MCP server for AI-ready + Cloudberry database interaction. + +## Architecture notes + +Cloudberry follows a PostgreSQL-style source layout with +additional MPP database components inherited from Greenplum. +The coordinator receives SQL, plans or optimizes it, dispatches +work to segments, and collects results. Segment processes +execute distributed pieces of the plan and interact through the +interconnect. + +Key concepts agents should recognize: + +- Coordinator and segments are separate roles in a distributed + database cluster. +- Query execution may involve dispatch, gangs, motion nodes, + distributed transactions, snapshots, and interconnect + behavior. +- Storage and catalog changes can affect upgrade, recovery, + visibility, and distributed consistency. +- PostgreSQL compatibility matters. Avoid changing behavior + that is inherited from PostgreSQL unless the task explicitly + targets Cloudberry divergence. +- Extensions under [gpcontrib/](gpcontrib/) and + [contrib/](contrib/) may have independent build or test + workflows. + +## Working rules + +1. Start by identifying the subsystem and reading nearby + files, tests, and documentation. +2. 
Prefer existing helpers, macros, memory contexts, error + reporting conventions, and test infrastructure. +3. Avoid unrelated formatting changes. +4. Avoid renaming symbols or moving files unless explicitly + required. +5. Do not silently change SQL-visible behavior, catalog + definitions, on-disk format, wire protocol, GUC behavior, + or user-facing messages. +6. If a change touches security-sensitive areas, call that out + clearly in the PR description and request appropriate human + review. +7. If a change touches distributed execution, verify whether + it affects both coordinator and segment behavior. +8. If a change touches management scripts, check Python + compatibility and existing unit or behave tests. +9. If a change touches documentation, keep examples accurate + and consistent with project terminology. +10. If behavior is uncertain, add a small regression or unit + test rather than relying on assumptions. + +## Build and test guidance + +Use the smallest relevant validation first, then broader +validation when the change is ready. + +Common validation entry points mentioned by project docs and +PR templates: + +- Configure and build through the repository's standard build + flow or the automation in + [devops/README.md](devops/README.md). +- Use Docker-based development and sandbox workflows under + [devops/](devops/) when local system dependencies are not + available. +- Run `make installcheck` for regression coverage when + appropriate. +- Run `make -C src/test installcheck-cbdb-parallel` for + Cloudberry parallel regression coverage when appropriate. +- For extension-specific changes, run the extension's local + installcheck or documented test target. +- For management tooling under [gpMgmt/](gpMgmt/), inspect + the relevant README and test targets before selecting a test + command. + +Do not invent successful test results. If tests are not run, +state that clearly in the final response or PR notes. 
+ +## AI-assisted contribution policy + +Follow [AI_GUIDELINE.md](AI_GUIDELINE.md): + +- AI-generated code has the same responsibility and quality + bar as human-written code. +- AI-assisted changes must pass normal review, testing, and CI + standards. +- The contributor must ensure license compatibility. +- Significant AI-generated code should be disclosed using the + PR template checkbox and optionally recorded with an + `Assisted-by:` trailer in the commit message. +- AI tools may assist with drafting responses, but + contributors should engage thoughtfully and personally with + reviewers. +- Include or verify tests for AI-generated code. +- Keep changes simple and avoid meaningless code refactoring. + +## Security policy + +Follow [SECURITY.md](SECURITY.md): + +- Do not report security vulnerabilities in public issues, + public mailing lists, or public forums. +- Send vulnerability reports to security@apache.org. +- For normal non-security bugs, use GitHub Issues, + Discussions, the dev mailing list, or Slack. + +When working as an agent, do not expose secrets, credentials, +private keys, database dumps with sensitive data, or +vulnerability details in public-facing output. + +## Pull request expectations + +Use [.github/pull_request_template.md](.github/pull_request_template.md) +as the checklist for final change summaries: + +- Explain what the PR does. +- Identify the type of change. +- Document breaking changes if any. +- Provide a test plan. +- Describe performance, user-facing, and dependency impact + when applicable. +- Confirm documentation updates when needed. +- Confirm security review consideration. +- Disclose significant AI-assisted code generation. + +## Commit conventions + +- Add the standard Apache License header for newly created + files (not needed for third-party files). +- When drafting the commit message, use the + [.gitmessage](.gitmessage) template as a reference. 
+- Start the title with a prefix indicating the change type: + `Fix ...` for bug or typo fixes, `Feature: ...` for new + features, `Enhancement: ...` for code optimization, + `Doc: ...` for documentation changes. For other changes, + start with an imperative uppercase verb. +- Keep the title line to 50 characters or fewer. Do not end + it with a period. +- Leave a blank line between the title and the body. +- In the body, explain *what*, *why*, and *how*. Note any + compatibility issues. Wrap lines at 72 characters. +- Use optional trailers as needed: `Co-authored-by:`, + `Reported-by:`, `See:` (for GitHub Issues or Discussions + links), and `Assisted-by:` (for AI tool attribution). + +## Style expectations + +- C code should follow the surrounding PostgreSQL or + Cloudberry style. +- Python code in [gpMgmt/](gpMgmt/) should follow nearby + management script patterns and existing test style. +- SQL tests should include expected output files when required + by the test framework. +- Documentation uses Markdown in many repository files and + SGML under [doc/src/sgml/](doc/src/sgml/). +- Prefer project terminology: Apache Cloudberry, coordinator, + segment, MPP, PostgreSQL kernel, Greenplum heritage. + +## High-risk areas + +Be especially conservative around: + +- Catalog definitions and upgrade-sensitive files. +- Storage formats, WAL, recovery, transactions, snapshots, + and visibility. +- Planner, optimizer, executor, and motion/distributed + execution logic. +- Authentication, cryptography, TLS, network protocol, and + libpq behavior. +- Interconnect and dispatch paths. +- Cluster management commands that start, stop, expand, + recover, or reconfigure clusters. +- Public SQL behavior, GUCs, system views, and extension APIs. + +## Recommended agent workflow + +1. Restate the requested change in concrete terms. +2. Locate the smallest relevant subsystem. +3. Read nearby implementation and tests. +4. Plan a minimal change. +5. Edit only files required for the task. 
+6. Add or update tests when behavior changes. +7. Run the narrowest relevant tests available. +8. Summarize changed files, test results, and any risks or + follow-ups. + +## What not to do + +- Do not perform drive-by cleanup. +- Do not reformat unrelated code. +- Do not replace established PostgreSQL-style patterns with + modern alternatives just for preference. +- Do not change public behavior without tests and + documentation. +- Do not assume single-node behavior is enough for distributed + database changes. +- Do not fabricate command output, test results, issue links, + or reviewer decisions. diff --git a/AI_GUIDELINE.md b/AI_GUIDELINE.md new file mode 100644 index 00000000000..6a8c684c04d --- /dev/null +++ b/AI_GUIDELINE.md @@ -0,0 +1,178 @@ + + +# Guidelines for AI-assisted Contributions + +Apache Cloudberry follows the ASF Generative Tooling Guidance +for the use of AI-assisted development tools: + +https://www.apache.org/legal/generative-tooling.html + +This document provides additional project-specific guidance and +best practices for using AI tools in the Cloudberry community. +It is intended to supplement ASF guidance, not replace it. + +## Guidelines + +### 1. You own the code + +AI-generated code carries the same responsibility as code you +type yourself. Review it before submitting. If a bug ships, +"the AI wrote it" is not a defense. + +**Example:** As an experiment, you used an LLM to generate a +new type of executor node. The results were impressive, and you +wanted to share them with the community. Before opening a PR, +read every line, verify the logic, and make sure it fits with +existing code patterns. Someone might use your code in +production, not just for experiments. + +### 2. Same quality bar + +AI-assisted contributions must pass the same review, testing, +and CI standards as any other code. No shortcuts. AI-generated +code must come with corresponding tests, or be covered by +existing ones. 
If the AI wrote the code, you should at least +write or carefully verify the tests. + +**Example:** You use an LLM to implement a new aggregate +function. The PR must include regression tests in `src/test` +that exercise both normal and edge cases. + +### 3. Watch the license + +Don't let AI introduce code incompatible with the Apache +License 2.0. You are responsible for ensuring all submitted +code — AI-generated or not — has proper licensing. + +See [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) +for details. + +**Example:** If an AI tool reproduces a snippet from a +GPL-licensed project, you must not include it. When in doubt, +rewrite from scratch. + +### 4. Flag it + +When your PR includes significant AI-generated code, check the +AI disclosure box in the PR template. You don't have to +disclose minor AI assistance (autocomplete, reformatting), but +be transparent about substantial generation. + +You can also record AI assistance directly in the commit +message using the optional `Assisted-by:` trailer (one line +per tool), following the same convention used by the Linux +kernel. See the +[Linux kernel coding assistants guidance](https://docs.kernel.org/process/coding-assistants.html) +for background. + +``` +Assisted-by: ChatGPT +Assisted-by: GitHub Copilot +``` + +This trailer is optional and non-binding — it provides +lightweight provenance information without making AI +disclosure a strict requirement. + +**Example:** Using an LLM to autocomplete a single function +signature — no need for a flag. Using an LLM to generate an +entire new GUC parameter with validation logic — flag it and +add an `Assisted-by:` trailer. The flag doesn't mean the PR +skips review or merge criteria, but it gives reviewers more +context about the generation method and lets them focus on +architecture and logic rather than specific operators. + +### 5. 
No meaningless code refactoring + +Our core is PostgreSQL, and refactoring work has already been +done here. Rewriting code significantly complicates rebase. +Also, refactoring changes the code in a way that forces people +to relearn code they already know. Keep changes as simple as +possible. + +**Example:** LLMs are eager to refactor. One day you may be +asked: "This code is not very good. Do you want to improve +it?" Of course! It could happen several times. Tokens are +spent, but what is the point of such refactoring? +(Rhetorical question) + +### 6. LLM code review + +Some AI review tools (such as GitHub Copilot Review or +CodeRabbit) may not currently be available for ASF-hosted +repositories due to operational, budgetary, or permission +reasons. Contributors can still use personal AI tools locally +but are responsible for ensuring code quality, compliance with +licensing terms, and reviewing outcomes. + +**Example:** One could use GitHub Copilot for automated AI +code review on pull requests. Here are some important points: + +- Copilot suggestions are **non-binding hints**, not + requirements. +- If a suggestion is irrelevant or wrong, skip it — you know + your code best. +- If a suggestion catches a real issue, fix it like you would + for any review comment. +- Copilot does not replace human reviewers. All PRs still need + approval from a committer. + +### 7. Talk to maintainers yourself + +Review discussions should reflect the contributor's own +understanding and technical judgment. AI tools may assist with +drafting responses, but contributors should engage +thoughtfully and personally with reviewers. Maintainers invest +time reviewing your code; respond in kind. + +**Example:** A reviewer asks "why did you choose this approach +over X?" — write your own answer explaining the tradeoff, +don't paste an LLM-generated reply. 
+ +## AGENTS.md + +[AGENTS.md](https://agents.md/) is a README for agents: a +dedicated, predictable place to provide context and +instructions to help AI coding agents work on your project. +We do not ship a repository-level `AGENTS.md` because the +right content is platform- and user-specific. If you work with +AI coding agents locally, create your own `AGENTS.md`. You could +take the template from the `AGENTS.md.template` file. + +## Good uses of AI + +- Bug fixing and root cause analysis +- Code review +- Writing and improving tests +- Documentation and code comments +- Build system and CI improvements +- Security research and vulnerability scanning +- Learning the codebase faster + +## Resources + +- [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) + — Official Apache guidance on AI tool usage +- [GitHub Copilot](https://github.com/features/copilot) + — AI pair programmer and code reviewer +- [CodeRabbit](https://www.coderabbit.ai/) + — Yet another AI pair programmer and code reviewer +- [AGENTS.md](https://agents.md/) + — README for agents \ No newline at end of file diff --git a/README.md b/README.md index 2a4b7146efd..e332b06b309 100644 --- a/README.md +++ b/README.md @@ -111,6 +111,7 @@ with the contribution. | Code contribution | Learn how to contribute code to the Cloudberry, including coding preparation, conventions, workflow, review, and checklist following the [code contribution guide](https://cloudberry.apache.org/contribute/code).| | Submit the proposal | Proposing major changes to Cloudberry through [proposal guide](https://cloudberry.apache.org/contribute/proposal).| | Doc contribution | We need you to join us to help us improve the documentation, see the [doc contribution guide](https://cloudberry.apache.org/contribute/doc).| +| AI guideline | For AI-assisted development, please review our [AI guideline](AI_GUIDELINE.md) for advice on responsible AI usage.| ## Roadmap