We've been running this branch in production for the past few weeks via a fork (datacore-one/orgparse@pr-77) as the underlying parser for an org-mode AI workflow library. Wanted to share real-world feedback since you've had no external reviews yet.

What we're doing with it

We built org-workspace on top of this branch: a library for AI agents to read and mutate org files (task state transitions, property updates, LOGBOOK entries, CLOCK entries, refile). It runs overnight, processing hundreds of org tasks autonomously. The mutation capabilities in this PR are exactly what made that possible without falling back to fragile regex substitution.

What works well
One fragility we hit
try:
    node._mark_dirty()
except AttributeError:
    node._env.reload(node.path)  # fallback

It would be helpful to have a public

Minor notes
Overall

This PR fills a real gap. orgparse is widely used for reading org files, but there's no good Python solution for writing them without mangling the file. We'd love to see this merged. Happy to help review specific parts if that's useful.
Oh wow, nice! The goal is for the coordination of AI agents around Org-Mode files. It's not finished yet, but I'm aiming to make that orgparse-compatible eventually. The current state is that the tree-sitter parser handles a lot more Org-Mode syntax than orgparse and just needs a Python wrapper. I'd love some of your feedback on there, and perhaps we could cooperate, since we seem to have a similar project in mind?
This is in reference to #11 and #76
I wanted to be able to modify parsed nodes and dump them to a file for a project I'm building. This code was written with AI (Codex 5.2 thinking primarily) but I was in the loop for every change made.
Here's how this works:
- LineItem is added that represents a parsed line of an org node's text. There are subclasses for each supported Org function (dates, properties, headline etc).
- TextLines to preserve the 1-to-1 representation when using __str__.
- LineItem. If one wasn't present for a specific property (node.scheduled = date when there was no previous scheduled date parsed from file) one is inserted.
- _line_items instead. That's how the edits are made possible.

The _lines with the original file contents are still available, as a performance optimization, for the usual use-case of just parsing the nodes without making any modifications to them.

I also considered creating _line_items lazily, on first write, as another optimization, but didn't do that in the end. It seems to be performant enough: my 37k task archive loads in 2.32s on current master and in 3.71s on this branch. Let me know if this is acceptable.

I mostly added unit tests for the new functionality, as the current functionality is mostly left unchanged.

The body editing is pretty awkward as it requires a line index, since bodies can be non-contiguous and can contain timestamps etc.

Similarly, edge cases such as duplicated logbook drawers are equally awkward.

This PR also permits children modification, so essentially adding subtrees and moving them around the file (but only within the same OrgEnv).

I reorganized the code a little bit and reexported the relevant functions back from the node module, to keep the API stable.

Technically, these new LineItems could replace the old attributes like _heading etc. entirely, but I opted to keep the old attributes in, at least until I get some comments on the general approach here. With a richer representation it'll be easier to add support for links, etc., as part of the heading for instance.

I went ahead and made the "new" representation the only available one (without changing the public API). That uncovered a few issues, but these were fixed and now my archive loads and looks correct. It also sped things up a bit: the archive now loads in 2.68s.
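
For readers who want a feel for the line-item idea described above, here is a minimal, self-contained sketch. It is not the code from this PR: the names SketchNode, TextLine and ScheduledLine are hypothetical, dates are kept as plain strings, and the rule "insert the new line right after the headline" is an assumption made for illustration. It only shows the two properties discussed: untouched text round-trips exactly through `__str__`, and setting a property with no parsed line inserts one.

```python
from dataclasses import dataclass, field


@dataclass
class TextLine:
    """A line kept verbatim so that unmodified text round-trips exactly."""
    raw: str

    def __str__(self) -> str:
        return self.raw


@dataclass
class ScheduledLine:
    """A planning line holding a SCHEDULED timestamp (hypothetical class)."""
    date: str  # kept as a plain string in this sketch, e.g. "2024-05-01 Wed"

    def __str__(self) -> str:
        return f"SCHEDULED: <{self.date}>"


@dataclass
class SketchNode:
    """A node whose text is rendered from its line items, not from raw lines."""
    items: list = field(default_factory=list)

    @classmethod
    def parse(cls, text: str) -> "SketchNode":
        prefix = "SCHEDULED: <"
        items = []
        for line in text.split("\n"):
            if line.startswith(prefix) and line.endswith(">"):
                items.append(ScheduledLine(line[len(prefix):-1]))
            else:
                items.append(TextLine(line))
        return cls(items)

    @property
    def scheduled(self):
        for item in self.items:
            if isinstance(item, ScheduledLine):
                return item.date
        return None

    @scheduled.setter
    def scheduled(self, date: str) -> None:
        for item in self.items:
            if isinstance(item, ScheduledLine):
                item.date = date  # edit the existing line item in place
                return
        # No SCHEDULED line was parsed: insert one right after the headline
        # (an assumption of this sketch).
        self.items.insert(1, ScheduledLine(date))

    def __str__(self) -> str:
        return "\n".join(str(item) for item in self.items)


source = "* TODO Write report\nSome body text"
node = SketchNode.parse(source)
assert str(node) == source          # unmodified text is reproduced 1-to-1
node.scheduled = "2024-05-01 Wed"   # no SCHEDULED line existed, so one is inserted
print(node)
```

Because every line is either a typed item or a verbatim TextLine, serialization never has to guess at the original formatting; only the items that were actually edited change in the output.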