Skip to content

747: update identifier validation message now that leading colons aren't allowed#795

Merged
lognaturel merged 36 commits intoXLSForm:masterfrom
lindsay-stevens:pyxform-747
Jan 29, 2026
Merged

747: update identifier validation message now that leading colons aren't allowed#795
lognaturel merged 36 commits intoXLSForm:masterfrom
lindsay-stevens:pyxform-747

Conversation

@lindsay-stevens
Copy link
Contributor

@lindsay-stevens lindsay-stevens commented Dec 16, 2025

Closes #747

Why is this the best possible solution? Were any other approaches considered?

The majority of these commits are organising error/warning messages into the errors.py module, which is what I had in mind for the ErrorCode pattern which was added to pyxform a few months ago. After this PR about 30 error messages are organised there, but pyxform raises an error at about 170 locations and many of those have distinct messages.

The goal of this stage of organisation is to try and identify all the errors/warnings which deal with identifiers, so that the validation approach and message content can be more consistent with each other and with the documentation. Also to take the opportunity to move over error/warning messages that are already using the errors.Detail class or seemed easy to move. There are still some name-related errors that have not been moved (e.g. Audits must always be named 'audit.', etc), but they don't specifically deal with identifier grammar. There is also still a name-like regex check for the "app" parameter but that requires an android package name which has different rules to XML names.

What are the regression risks?

If library or script usages were looking for specific error strings then these changes would break them. But putting the errors into an enum/object should make it easier to avoid looking for specific strings - as is shown in the tests that use ErrorCode.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Maybe? #630 says ODK tools don't support having a colon as the first character in identifier names but that doesn't seem to be stated in the ODK XForms Specification. On one hand maybe it is not a hard requirement from a spec perspective, but on the other hand it is functionally required via pyxform validation and not being supported by ODK form clients.

Before submitting this PR, please make sure you have:

  • included test cases for core behavior and edge cases in tests
  • run python -m unittest and verified all tests pass
  • run ruff format pyxform tests and ruff check pyxform tests to lint code
  • verified that any code or assets from external sources are properly credited in comments

@lindsay-stevens lindsay-stevens marked this pull request as ready for review December 17, 2025 16:11
@lindsay-stevens
Copy link
Contributor Author

Realised there's a name setting so added check/tests for that. It's previously untested, only documented on xlsform.org, and kind of sneaks in there in via json_dict.update(settings) on xls2json.py L352.

- organising error messages
- add test to verify error raised, since no other test seems to check
  for this error case or message.
- the modified tests pass anyway but would have asserted that each
  letter in the parameter was in the error e.g. ["M", "i", "s", "s"]
- added type check and tests for error/warning contains/not-contains
- organising error messages
- reworded error message to be consistent with the underlying XML
  "Name" rule/token, except that ODK form clients generally don't
  support having a colon for the first character, and not mentioning
  non-ascii unicode letters since that's probably too technical.
- preceding code applies the additional entity dataset name rules for
  not allowing names starting with a period or two underscores.
- organising error messages
- preceding code applies additional entity save_to name rules for
  not allowing reserved words or names starting two underscores.
- organising error messages
- the replaced validation regex is pretty much the XML name rule and
  the error message sounds like that was the intent (although the value
  is put into an attribute like `<value ref="VAL"/>` not a tag).
- organising error messages
- in survey_element.py the code that shows the exact offending character
  seems to have been added for the purpose of checking the form_name,
  which corresponds to the Survey.name attribute, which is used as the
  tag name of the primary instance element. When uploading a XLSForm to
  an online converter this name defaults to "data", and it is only set
  to the basename the XLS/X file name when using the CLI, or some other
  value when using pyxform as a library. This code would not apply to
  most other names since workbook_to_json (or other) code checks names
  before survey_element.py sees them. Also the regex
  INVALID_XFORM_TAG_REGEXP is different to the rule in is_xml_tag. So
  the per-character check was removed since it seems unlikely to be
  useful and is inconsistent with other name validation.
- not caught by relevant test since it only requires the substring
  "must also specify an image".
- it's possible to set the primary instance root node name via the
  setting "name" (which overrides other ways to set the name), so check
  the settings code path as well.
@lindsay-stevens
Copy link
Contributor Author

Force pushed to rebase on master and resolve merge conflicts (errors.py and test_choices_sheet.py).

@lognaturel
Copy link
Contributor

that doesn't seem to be stated in the ODK XForms Specification

Not explicitly, you're right, that's deferred to the underlying specs, in this case going down to the XML spec. I think these two StackOverflow explanations make it clear that colons should not be allowed as leading characters: https://stackoverflow.com/questions/40445735/is-a-colon-a-legal-first-character-in-an-xml-tag-name/40447829

Copy link
Contributor

@lognaturel lognaturel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for breaking this down into easy commits -- the overall diff felt impossible to review meaningfully!

Overall having a central error message registry feels like a good idea in this context. I have a few notes inline you may or may not want to act on.

I have some uneasiness about the fact that tests compare against enum references, not the rendered user-facing strings. Things like formatting bugs could ship while all tests still pass and we could fail to notice that an error message became wrong in certain contexts used. Maybe for some really common or critical error messages we could assert on literal strings? I don't have any specific candidates in mind currently but wanted to share this thought in case it gives you ideas.

@lindsay-stevens
Copy link
Contributor Author

I have some uneasiness about the fact that tests compare against enum references, not the rendered user-facing strings.

Agree it has that potential weakness but each of the tests involves using the same context items, so the risks seem to be a) there is an unused token in the template which makes the result look weird, or b) checking the rendered string with a test case during development was overlooked. Maybe each message could have a "rendering check" test to at least casually confirm what it looks like with realistic context values. Also, the _ErrorFormatter will put "unknown" in place of any missing token values so for problem a) the user sees a somewhat cryptic but not totally broken looking message or a runtime error.

The current approach improves on the previous patterns of a) only checking part of the error/warning string and therefore facing the same rendering problems as above, b) only checking part of the message and matching some other message. By using assigned symbols we can more easily see exactly where an error/warning is used and tested, both for coverage and for not having to manually find and update any whole or partial error/warning messages that may be relevant to current work.

One of the next steps could be working on the language of the messages, and so far the format I've been trying to follow is roughly "This is where the problem is. This is why it's a problem. This is what you can do to fix it." Sort of like how a python traceback does context, details, tips in that order. Many messages have been collected in this PR but there are still tons still inline amongst the code, and perhaps they could be reviewed for consistency with Central/Collect/Webforms message phrasing, so maybe that all could be a follow up task? Also could be extended to localisation.

@lognaturel lognaturel merged commit 122548c into XLSForm:master Jan 29, 2026
14 checks passed
@lindsay-stevens lindsay-stevens deleted the pyxform-747 branch January 29, 2026 19:28
@lindsay-stevens
Copy link
Contributor Author

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Update identifier validation message now that leading colons aren't allowed

2 participants