FEAT add CodeAttackConverter and CodeAttackAttack (closes #1945) #1960
Open
u7k4rs6 wants to merge 1 commit into
Implement CodeAttack (Ren et al., ACL 2024) as a standalone converter
and a PromptSendingAttack subclass following the FlipAttack pattern.
CodeAttackConverter encodes a natural-language prompt word-by-word into
a data-structure initialisation sequence (deque appends, list appends,
or a string assignment) and embeds it in a partial code template that
asks the model to complete the code. Five language variants are
supported: python_stack, python_list, python_string, cpp, go. The
verbose flag selects the _plus template (detailed paragraphs) for the
three Python variants; cpp and go have no plus variant upstream.
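For illustration, the three Python encodings named above might render a word list as shown below. This is a hypothetical sketch, not the converter's code: render_initialization and the variable names my_stack, my_list and my_string are assumptions, the string variant is assumed to join the words with single spaces, and cpp/go are omitted because their encodings are not described here.

```python
from collections import deque


def render_initialization(words: list[str], language: str) -> str:
    """Render words as a Python data-structure initialization sequence.

    Hypothetical sketch: variable names and statement shapes are assumptions,
    not the upstream CodeAttack templates.
    """
    if language == "python_stack":
        # One deque append per word, in prompt order.
        return "\n".join(["my_stack = deque()"] + [f"my_stack.append({w!r})" for w in words])
    if language == "python_list":
        # One list append per word, in prompt order.
        return "\n".join(["my_list = []"] + [f"my_list.append({w!r})" for w in words])
    if language == "python_string":
        # A single string assignment holding the space-joined words (assumed).
        return f"my_string = {' '.join(words)!r}"
    raise ValueError(f"unsupported language: {language!r}")


snippet = render_initialization(["bake", "a", "loaf"], "python_stack")
print(snippet)

# The rendered snippet is itself valid Python; executing it rebuilds the deque.
namespace = {"deque": deque}
exec(snippet, namespace)
print(list(namespace["my_stack"]))
```

Using repr for each word guarantees a valid Python string literal even when a word contains quotes or backslashes.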
CodeAttackAttack wraps the converter in a PromptSendingAttack, prepends
a system prompt that frames the session as code completion, and forwards
language and verbose to the converter. Callers supply a scorer via
AttackScoringConfig as usual.
Files added:
pyrit/prompt_converter/code_attack_converter.py
pyrit/executor/attack/single_turn/code_attack.py
pyrit/datasets/executors/code_attack.yaml
pyrit/datasets/prompt_converters/code_attack_python_stack{,_plus}.yaml
pyrit/datasets/prompt_converters/code_attack_python_list{,_plus}.yaml
pyrit/datasets/prompt_converters/code_attack_python_string{,_plus}.yaml
pyrit/datasets/prompt_converters/code_attack_cpp.yaml
pyrit/datasets/prompt_converters/code_attack_go.yaml
tests/unit/prompt_converter/test_code_attack_converter.py (23 tests)
tests/unit/executor/attack/single_turn/test_code_attack.py (16 tests)
doc/code/executor/attack/code_attack.py
doc/code/executor/attack/code_attack.ipynb
Files modified:
pyrit/prompt_converter/__init__.py
pyrit/executor/attack/single_turn/__init__.py
pyrit/executor/attack/__init__.py
doc/myst.yml
Closes #1945.
Summary
Implements CodeAttack (Ren et al., ACL 2024, arXiv:2403.07865), which reformulates a harmful query as a code-completion task. The query is encoded into a data-structure initialization sequence inside a partial code template with a decode() stub, and the target is asked to complete the code. Because the intent is expressed as a programming task rather than a natural-language request, safety training keyed to natural language triggers less reliably. The attack is black-box and has no special compute requirements.
Two notes for review (deltas from the issue)
Encoding is word-by-word, not character-by-character. The issue described it as char-by-char (from the paper abstract), but the reference implementation (renqibing/CodeAttack) splits on whitespace and hyphens via regex, with character-level only as a fallback for single-token inputs. I matched the reference code. One consequence: separators are normalized on encode (hyphens and runs of whitespace are consumed as delimiters), which is documented in the converter docstring.
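The splitting rule described in this note can be sketched as follows. This is a minimal illustration assuming the delimiter pattern is "one or more whitespace or hyphen characters"; split_prompt is a hypothetical name, and the exact regex in renqibing/CodeAttack may differ in detail.

```python
import re

# Assumed delimiter pattern: runs of whitespace and/or hyphens.
SEPARATORS = re.compile(r"[\s-]+")


def split_prompt(prompt: str) -> list[str]:
    """Split a prompt into words; fall back to characters for a single token."""
    # Drop the empty strings produced by leading/trailing delimiters.
    words = [w for w in SEPARATORS.split(prompt) if w]
    if len(words) == 1:
        # Single-token input is encoded character by character.
        return list(words[0])
    return words


print(split_prompt("bake  a well-risen loaf"))  # hyphen and double space both act as delimiters
print(split_prompt("bread"))
```

Joining the result with single spaces does not reproduce "well-risen" or the double space, which is the separator normalization the converter docstring documents.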
Eight templates, not five. The issue scoped five (one per language variant), but the reference ships eight: the three Python variants each have a base and a _plus verbose variant, while cpp and go have no verbose variant upstream. I included all eight to match the reference. Happy to drop the four _plus files if you would rather keep it to five.
Design
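Per the second note above, template selection reduces to one rule: verbose adds a _plus suffix only for the Python variants. The helper below is hypothetical (template_filename is not the converter's API); only the file names come from this PR.

```python
PYTHON_VARIANTS = ("python_stack", "python_list", "python_string")
ALL_VARIANTS = PYTHON_VARIANTS + ("cpp", "go")


def template_filename(language: str, verbose: bool = True) -> str:
    """Map (language, verbose) to one of the eight template files in this PR."""
    if language not in ALL_VARIANTS:
        raise ValueError(f"unsupported language: {language!r}")
    # Only the Python variants ship a _plus template; verbose is a no-op for cpp and go.
    suffix = "_plus" if verbose and language in PYTHON_VARIANTS else ""
    return f"code_attack_{language}{suffix}.yaml"


for lang in ALL_VARIANTS:
    print(lang, template_filename(lang, verbose=True), template_filename(lang, verbose=False))
```

Treating verbose as a silent no-op for cpp and go, rather than raising, keeps a single default (verbose=True) valid for every language.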
Follows the FlipConverter + FlipAttack two-class template.
CodeAttackConverter (pyrit/prompt_converter/code_attack_converter.py): encodes the prompt into the chosen data-structure operations and renders the code template. Parameters: language (python_stack, python_list, python_string, cpp, go) and verbose (bool, default True; selects the _plus variant where one exists and is intentionally a no-op for cpp and go). It is a standalone PromptConverter and composes through the normal pipeline.
CodeAttackAttack (pyrit/executor/attack/single_turn/code_attack.py): subclasses PromptSendingAttack, instantiates the converter and prepends it to the request converters, and injects a code-completion system prompt via prepended_conversation. Reuses existing scorers; there is no attack-specific scoring.
The code templates live under pyrit/datasets/prompt_converters/ (matching the CodeChameleon convention) and the system prompt under pyrit/datasets/executors/ (matching flip_attack.yaml).
Tests
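A word-recovery check of the kind listed in this section might look like the pytest-style sketch below. It is an illustration under assumptions: ENCODED is a hand-written stand-in for python_list output (the variable name and quoting are not taken from the real template), and recover_words is a hypothetical helper. A real round-trip test would feed the converter's output in place of the fixture.

```python
import ast
import re

# Hand-written stand-in for converter output (python_list variant).
ENCODED = """\
my_list = []
my_list.append('bake')
my_list.append('a')
my_list.append('well')
my_list.append('risen')
my_list.append('loaf')
"""


def recover_words(snippet: str) -> list[str]:
    """Return the string argument of each top-level .append(...) statement, in order."""
    words = []
    for stmt in ast.parse(snippet).body:
        call = stmt.value if isinstance(stmt, ast.Expr) else None
        if (
            isinstance(call, ast.Call)
            and isinstance(call.func, ast.Attribute)
            and call.func.attr == "append"
            and len(call.args) == 1
            and isinstance(call.args[0], ast.Constant)
            and isinstance(call.args[0].value, str)
        ):
            words.append(call.args[0].value)
    return words


def test_word_recovery_round_trip() -> None:
    prompt = "bake a well-risen loaf"
    # Expected words after separator normalization: the hyphen is consumed.
    expected = [w for w in re.split(r"[\s-]+", prompt) if w]
    assert recover_words(ENCODED) == expected
```

Parsing with ast instead of a regex over the source keeps the check robust to quoting and escaping in the rendered literals.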
39 unit tests (23 converter, 16 attack), all passing, following the test_flip_converter.py / test_flip_attack.py patterns. Coverage includes per-language template rendering, verbose vs base variants, word-recovery round-trips, empty / special-character / long prompts, converter prepend ordering, system-prompt injection, and scorer invocation through the normal path. Pre-commit (ruff, ty, validate-docs) is green.
Files
New:
pyrit/prompt_converter/code_attack_converter.py
pyrit/executor/attack/single_turn/code_attack.py
pyrit/datasets/executors/code_attack.yaml
pyrit/datasets/prompt_converters/ (code_attack_python_{stack,list,string}{,_plus}.yaml, code_attack_cpp.yaml, code_attack_go.yaml)
tests/unit/prompt_converter/test_code_attack_converter.py
tests/unit/executor/attack/single_turn/test_code_attack.py
doc/code/executor/attack/code_attack.py (jupytext source, with generated .ipynb)
Modified (exports and docs registration):
pyrit/prompt_converter/__init__.py
pyrit/executor/attack/single_turn/__init__.py
pyrit/executor/attack/__init__.py
doc/myst.yml
Checklist