\documentclass{article}
% Required packages
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{mathtools}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{cleveref}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{natbib}
\usepackage[margin=1in]{geometry}
\usepackage{tikz}
\usetikzlibrary{arrows.meta, positioning, shapes.geometric, calc}
\usepackage{xcolor}
\usepackage{enumitem}
% Theorem environments
\newtheorem{definition}{Definition}
\newtheorem{axiom}{Axiom}
\newtheorem{proposition}{Proposition}
\newtheorem{theorem}{Theorem}
\title{\textbf{Morph: Version Control When Pipelines Are Probabilistic}}
\author{%
Raffi Krikorian\\
Mozilla\\
\texttt{raffi@mozilla.org}
}
\date{}
\begin{document}
\maketitle
\begin{abstract}
The tools we use to manage software were built for a world where code is
deterministic. Git tracks source files where identity is byte equality,
reproducibility means identical output, and merge is text reconciliation. None
of that holds when you are building with LLMs. Prompt outputs are stochastic.
Behavior depends on model version, retrieval corpus, and tool availability---not
just the files in the repo. And ``did it get better?'' is a statistical
question, not a diff.
\textsc{Morph} is a distributed version control system designed for this world.
It extends Git's content-addressed Merkle DAG with three additions:
pipelines (the sequences of prompt calls, tool invocations, retrieval steps,
and transforms that make up an LLM application), evaluation suites (versioned
definitions of what ``good'' means and how to measure it), and runs (permanent
execution receipts recording exactly what ran, in what environment, and what
it produced). A \textsc{Morph} commit bundles a pipeline with an evaluation
suite and scores. At merge time, \textsc{Morph} records the scores from both
parents and the scores the merged pipeline achieved---making it visible whether
the merge improved, held steady, or regressed on any metric.
The paper formalizes the pipeline model, defines what it means for one pipeline
version to improve on another, and works out the merge semantics that follow.
It describes a concrete v0 system, what the design looks like in practice for
agent-driven development and human-AI collaboration, and how the Actor
abstraction handles everything from a solo human to a swarm of agents using
the same DVCS machinery.
\end{abstract}
% ============================================================
\section{Introduction}
\label{sec:intro}
Software development is changing faster than the tools built to manage it.
Large language models are not autocomplete for code anymore---they are
components inside larger systems that retrieve context, call tools, generate
and modify code, run tests, and iterate~\citep{yao2023react,
jimenez2024swebench}. These pipelines---combining prompt calls, retrieval, tool
invocations, and deterministic transforms---are different from traditional source
code in ways that break the foundations of how we track changes.
Git~\citep{torvalds2005git}, the version control system that runs almost
everything, is built on three assumptions: (i)~identity is byte equality,
(ii)~reproducibility means identical output bytes, and (iii)~merge is syntactic
text reconciliation. All three break when the thing you are versioning is a
stochastic, environment-dependent transformation pipeline:
\begin{itemize}[nosep]
\item \textbf{Same code, different outputs.} Run the same prompt twice and
you get different answers. That is not a bug---it is how LLMs work. Git's
assumption that identity means byte equality does not hold here.
\item \textbf{Behavior depends on the environment.} Model version, sampling
settings, retrieval corpus, and tool availability all shape what comes out.
None of that is in the file tree.
\item \textbf{``Did it get better?'' is a statistical question.} You cannot
read a diff and know whether the pipeline improved. You have to run it,
score it, and compare.
\item \textbf{Agents produce patches, not just humans.} Which agent, which
model, which prompt generated a change---and whether the result actually
passes tests---matters for accountability. Git has no place to put any of
that.
\item \textbf{Merge can silently regress behavior.} Two branches can merge
cleanly at the text level while the resulting pipeline performs worse than
either parent.
\end{itemize}
\noindent This is not hypothetical. It is what happens every day in an
increasing share of software development, and it is why we need a different
kind of version control---one that records not just \emph{what the files are}
but \emph{who changed them}, \emph{what the pipeline did}, and \emph{whether it got better}.
\textsc{Morph} extends version control to \emph{recorded behavior}. It keeps
Git's core principles---every object identified by a hash of its contents,
organized in a Merkle DAG---while adding support for pipelines, evaluation
suites, and execution evidence. It is built for teams that have adopted the
improvement model: the shared assumption that each commit should leave the
pipeline at least as capable as before, the same way Git assumes you are
trying to write good code. Within that model, \textsc{Morph}'s job is to
record everything---every actor, every run, every score, every merge. A
\textsc{Morph} commit is not just a file snapshot; it is a record that a
specific pipeline, run against a specific evaluation suite, achieved specific
scores, with the execution receipts attached.
When a pipeline makes LLM calls, retrieves context, and runs tools, a merge
record cannot be purely syntactic. It has to include \emph{what the pipeline did}.
\textsc{Morph} records, at every merge, the scores both parents achieved and
the scores the merged pipeline achieved. What anyone does with that record is
up to them.
The contributions of this paper are:
\begin{enumerate}[nosep]
\item A formal model of pipelines as composable, effect-aware computations
supporting sequential chaining, parallel branching, and multi-actor
attribution (\Cref{sec:programs}).
\item A way to define what ``better'' means: evaluation suites as versioned
contracts, scores as the result of running them, and an ordering that lets
you compare any two pipeline versions (\Cref{sec:eval}).
\item A merge record: \textsc{Morph} records the scores both parents achieved
and the scores the merged pipeline achieved, with a formal property showing
what it means when those scores improve (\Cref{sec:merge}).
\item A concrete system design (\textsc{Morph} v0) built as a content-addressed
object store with Git-style CLI tooling (\Cref{sec:system}).
\end{enumerate}
% ============================================================
\section{Background and Related Work}
\label{sec:related}
\textsc{Morph} builds directly on Git's architecture~\citep{torvalds2005git}: a
chain of commits where each one is identified by a hash of its contents,
organized into a Merkle DAG~\citep{merkle1988digital}, so history is
tamper-evident and cheap to compare. That design worked well for decades.
When code is deterministic, text is all you need to track. Prior work on
formalizing version control~\citep{roundy2005darcs,angiuli2014homotopical,swierstra2014semantics}
built more rigorous foundations for patch theory and merge semantics---but all
of it operates at the level of file content. \textsc{Morph} takes the same
impulse and applies it one layer up: to what the actors \emph{did} and what
the pipeline \emph{produced}, not just what the files \emph{say}.
ML tooling has partially recognized the gap. DVC~\citep{dvc2020} versions
datasets and pipelines alongside code; MLflow~\citep{zaharia2018mlflow} tracks
experiments and metrics across training runs. Evaluation frameworks like
HELM~\citep{liang2023helm} and SWE-bench~\citep{jimenez2024swebench} take
behavioral measurement seriously. But all of these treat evaluation as something
that happens \emph{around} version control, not inside it. Metrics get logged;
they do not become part of the commit record.
The closest existing tools are Braintrust~\citep{braintrust2024},
LangSmith, and Humanloop---prompt versioning, evaluation pipelines, and CI hooks
that block a deploy when scores drop. That reflects a real need. But these are
centralized SaaS platforms that version individual prompts, not pipeline DAGs;
their merge blocking is a CI side-effect, not a native VCS operation; and they
produce no decentralized, verifiable evidence graph. They version prompts;
\textsc{Morph} records \emph{what pipelines did}---every actor, every run,
every score---and keeps the execution receipts to back it up.
LLM application frameworks like LangChain~\citep{chase2022langchain} and agent
paradigms like ReAct~\citep{yao2023react} define what the pipelines look like
and how agents operate~\citep{wang2024survey}. \textsc{Morph} is not a
competitor to any of them---it is the version control layer underneath them.
The pipeline formalism uses monadic effect theory~\citep{moggi1991notions,
wadler1995monads}---the same pattern behind \texttt{Promise} in JavaScript
and \texttt{Result} in Rust---to compose steps that carry side effects
without losing track of them. The reproducibility framing follows
Peng~\citep{peng2011reproducible}, updated for a setting where byte-identical
output is often impossible and re-running the same evaluation checks is the
achievable substitute.
% ============================================================
\section{Pipelines and How They Compose}
\label{sec:programs}
\subsection{What Gets Checked Out}
\textsc{Morph} models a workspace as three things held together:
\begin{equation}
S = D \times C \times M,
\end{equation}
where $D$ is the document tree (your code, datasets, artifacts), $C$ is
execution context (intermediate results, caches), and $M$ is metadata (who
produced what, run identifiers). Think of it as the full snapshot of everything
in your working directory, plus the context needed to understand it. This is
the thing that gets checked out, modified, and committed.
\subsection{Actors}
Every worker---human, agent, or a human and agent working together---is an
\emph{Actor}:
\begin{equation}
\text{Actor} = (\text{id},\; \text{type} \in \{\texttt{human},
\texttt{agent}\},\; \text{env\_config} \in \text{EnvConfig} \cup \{\bot\}),
\end{equation}
where \texttt{env\_config} records the model, sampling settings, toolchain,
and runner for agent actors, and is $\bot$ for humans. The model does not
otherwise distinguish between human and agent workers. A human on a laptop, an
agent in a CI environment, and a human+agent pair in a Cursor session are all
just Actors with different types and environment configurations.
This matters: two agents working simultaneously in different environments is
exactly the same situation as two humans working on two laptops. Both check out
the tree, both work independently, both commit and push. That is the standard
distributed version control problem---and \textsc{Morph} inherits the standard
DVCS solution for it. No new theory needed. The \texttt{env\_config} on each
actor's nodes is just provenance---a record of what was running when---not a
coordination mechanism.
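As a rough sketch, the actor record could look like the following TypeScript
(a hedged illustration; the field names are ours, not a normative v0 schema):
\begin{verbatim}
// Illustrative sketch of the Actor record (names are ours, not v0).
interface EnvConfig {
  model: string;                 // e.g. "gpt-4o-2024-08-06"
  sampling: { temperature: number; topP?: number };
  toolchain: string[];           // tools available in this environment
  runner: string;                // e.g. "laptop", "ci"
}

interface Actor {
  id: string;
  type: "human" | "agent";
  envConfig: EnvConfig | null;   // null stands in for the absent value \bot
}
\end{verbatim}
The point is the symmetry: nothing but \texttt{type} and \texttt{envConfig}
distinguishes a human from an agent.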
\subsection{What a Pipeline Is}
A pipeline takes a workspace state and produces a new one, plus a full record
of everything that happened along the way: which actor ran which step, which
model was called, what it returned, what tools ran, what failed and was retried.
Formally:
\begin{equation}
P : S \to F(S),
\end{equation}
where $F$ is a wrapper that carries side effects (randomness, I/O, traces,
failures, retries) alongside the result, not just the final output.
Concretely, a pipeline is a DAG where each \emph{node} is a single
computational step---one prompt call, one tool invocation, one retrieval
lookup, one deterministic transform, or one review decision. Each edge is
either a data dependency (output of one step feeds into the next) or a control
dependency (ordering without data flow). The graph structure captures what runs,
in what order, and what depends on what.
\begin{definition}[Pipeline]
A \emph{pipeline} $P = (V, \mathcal{E}, \kappa, \rho, \alpha, \varepsilon)$
consists of:
\begin{itemize}[nosep]
\item $V$ --- a set of nodes (each a single computational step, as above)
\item $\mathcal{E} \subseteq V \times V$ --- directed edges forming a DAG
\item $\kappa : V \to \{\texttt{prompt\_call}, \texttt{tool\_call},
\texttt{retrieval}, \texttt{transform}, \texttt{identity},
\texttt{review}\}$ --- node type, assigning each node its operator kind
\item $\rho : V \to \mathcal{H} \cup \{\bot\}$ --- payload reference,
pointing each node to its stored content (prompt template, tool schema,
etc.) by hash, or $\bot$ if the node needs none
\item $\alpha : V \to 2^{\mathcal{A}}$ --- attribution set, the set of Actor
IDs that contributed to this node
\item $\varepsilon : V \to \text{EnvConfig} \cup \{\bot\}$ --- per-node
environment, recording which model and toolchain ran this node, or $\bot$
for human-only nodes
\end{itemize}
\end{definition}
\noindent Throughout, $\bot$ denotes an absent or unattributed value. The same
symbol is reused deliberately wherever something is optional.
A word on \texttt{review} nodes. They represent an explicit acceptance or
modification decision---a human approving a diff in Cursor, or an agent
evaluating a candidate and choosing to accept, reject, or modify it. The
actor on a \texttt{review} node can be human, agent, or both. The node type
records the \emph{kind} of operation; the attribution set records \emph{who}
did it.
Attribution as a set handles the real-world mess. A human who accepts 80\% of
an agent's suggested diff and rewrites the other 20\% inline gets attribution
$\{\texttt{agent-1}, \texttt{human-1}\}$. One actor working alone gets a set
with one entry. A legacy node with no recorded attribution gets an empty
set---which is what $\bot$ reduces to.
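To make the definition concrete, here is one possible encoding of the pipeline
DAG, reusing \texttt{EnvConfig} from the actor sketch (illustrative only, not
the v0 object format):
\begin{verbatim}
// Illustrative encoding of a pipeline DAG (EnvConfig as above).
type NodeKind = "prompt_call" | "tool_call" | "retrieval"
              | "transform" | "identity" | "review";

interface PipelineNode {
  id: string;
  kind: NodeKind;              // kappa: the operator kind
  payload: string | null;      // rho: content hash of prompt/schema, or null
  attribution: string[];       // alpha: contributing actor IDs (may be empty)
  env: EnvConfig | null;       // epsilon: per-node environment
}

interface PipelineDag {
  nodes: PipelineNode[];
  edges: { from: string; to: string; kind: "data" | "control" }[];
}
\end{verbatim}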
\subsection{Composing Pipelines}
The tricky thing about chaining LLM steps is that each step does not just
return a value---it also produces side effects. A prompt call might fail and
retry. A tool invocation logs what it did. A retrieval step records latency.
Pipe outputs to inputs naively and you lose all of that bookkeeping.
A \emph{monad} solves this. You have almost certainly used one without calling
it that: \texttt{Promise} in JavaScript, \texttt{Optional} in Java,
\texttt{Result} in Rust. The pattern is the same everywhere---a wrapper that
carries extra context (errors, traces, retries) alongside the result, so that
context does not get dropped when you chain steps together.
For \textsc{Morph}, the monad gives us two operations:
\begin{align}
\text{pure} &: A \to F(A), \\
\text{bind} &: F(A) \to (A \to F(B)) \to F(B).
\end{align}
\noindent \texttt{pure} lifts a plain value into the pipeline so it can be
passed along without side effects. \texttt{bind} is how you chain steps: run
the first, take its output, pass it to the next. If you have used
\texttt{Promise.then()} in JavaScript or \texttt{flatMap} in Scala or Java,
that is bind---\textsc{Morph} just formalizes the same idea at the level of the
whole pipeline.
Sequential composition follows directly:
\begin{equation}
(Q \circ P)(s) = \text{bind}(P(s),\; Q).
\end{equation}
\noindent The monad laws give you two things:
\begin{itemize}[nosep]
\item \textbf{Associativity:} $(R \circ Q) \circ P = R \circ (Q \circ P)$
--- you can refactor a pipeline without changing what it does.
\item \textbf{Identity:} There exists a pipeline $I$ such that $I
\circ P = P \circ I = P$ --- there is a meaningful no-op, which matters
for defaults and incremental adoption.
\end{itemize}
For parallel branches---two actors working on the same input state, generating
candidates to be compared or merged---we fan out, run both, and reconcile:
\begin{equation}
\text{branch}(P, Q) = J \circ (P \otimes Q) \circ \Delta,
\end{equation}
where $\Delta(s) = (s, s)$ duplicates the starting state, $P \otimes Q$ runs
both branches independently (this is \texttt{Promise.all()}), and $J$ is an
explicit join node that decides what to do with both results---pick the one with
better test coverage, merge the best parts, or hand off to a \texttt{review}
node. The join step is itself a node in the pipeline with its own attribution
and record of who ran it.
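The analogy to \texttt{Promise} can be made literal. The sketch below uses
\texttt{Promise} as the wrapper $F$; this models only asynchrony and failure
(a real implementation would also thread traces and retries through the
wrapper), but the composition structure is the same:
\begin{verbatim}
// Sketch: Promise as the wrapper F. A step maps a state to a wrapped state.
type Step<S> = (s: S) => Promise<S>;

// pure: lift a plain value into the pipeline with no side effects.
const pure = <S>(s: S): Promise<S> => Promise.resolve(s);

// Sequential composition (Q . P): bind is Promise.then.
const compose = <S>(p: Step<S>, q: Step<S>): Step<S> =>
  (s) => p(s).then(q);

// branch(P, Q) = J . (P tensor Q) . Delta: duplicate the state, run
// both branches concurrently, reconcile with an explicit join node.
const branch = <S>(p: Step<S>, q: Step<S>,
                   join: (a: S, b: S) => Promise<S>): Step<S> =>
  async (s) => {
    const [a, b] = await Promise.all([p(s), q(s)]); // (P tensor Q)(Delta(s))
    return join(a, b);                              // the J node
  };
\end{verbatim}
Here \texttt{pure} is itself the identity pipeline $I$: composing with it on
either side changes nothing.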
\subsection{Multiple Actors, Concurrent Work}
The concurrent case---two humans on two laptops, two agents in two cloud
environments, a human and an agent each on their own branch---is just DVCS.
Each actor checks out the tree, works independently, commits, and pushes.
\textsc{Morph} records the scores from each branch and what the merge achieved.
The standard distributed version control model handles everything else.
The interesting cases are within a single session, where multiple actors
contribute to one pipeline:
\begin{itemize}[nosep]
\item \textbf{Sequential handoff:} Actor A does retrieval, actor B writes the
code, actor C runs verification. This is just sequential composition
($C \circ B \circ A$) with different actors on different nodes.
Attribution records who did each step; \texttt{env\_config} records what
environment each ran in.
\item \textbf{Parallel candidates:} Two actors generate independent candidate
patches from the same starting state. A \texttt{review} node picks or
merges them. This is the branch composition above---the standard Cursor
multi-composer workflow.
\item \textbf{Human + agent on one node:} A human accepts and edits an
agent's diff. This is a \texttt{review} node with attribution
$\{\texttt{agent}, \texttt{human}\}$. No special case needed.
\end{itemize}
\noindent Same record in all three cases: who contributed what, in what
environment, producing what output. The scores are a property of the composed
result, not any individual step.
One thing \textsc{Morph} cannot do from the record alone is decompose
blame to individual actors. If the pipeline scores poorly,
\textsc{Morph} knows the score and knows who touched which nodes, but it cannot
automatically tell you which actor caused it. Contributions are
entangled---agent B's code change may only make sense given what
agent A retrieved. Figuring this out means running the pipeline again
with one actor's contribution removed. That is future work.
% ============================================================
\section{Defining ``Better'': Evaluation and Scores}
\label{sec:eval}
\subsection{Evaluation Suites}
In practice, an evaluation suite is the answer to: what does ``good'' mean for
this pipeline? In a Claude Code workflow, it might be: does the generated patch
pass the test suite? Does it stay under 500ms p95 latency? Does it call any
forbidden APIs? In a Cursor session building a RAG pipeline, it might be:
does retrieval hit the right documents in the top-3? Does the generated answer
score above 0.85 on faithfulness? Does the whole thing cost under \$0.02 per
query?
In code, an evaluation suite $T$, executed in an environment $E$, takes a
workspace state and produces observations:
\begin{equation}
\text{Eval}_{T,E} : S \to F(\text{Obs}_T),
\end{equation}
where $\text{Obs}_T$ includes raw metric samples, test execution logs, timing
data, and anything else needed to produce scores. Running an evaluation has
side effects too---compiling code, calling a judge model, collecting human
ratings---so it lives in the same $F$ wrapper as pipelines. Each metric
$m \in T$ specifies a scoring function, a direction (higher is better, or
lower is better), and how to aggregate samples into a final score (mean, lower
confidence bound, percentile threshold, pass/fail).
\textsc{Morph} supports two kinds of evaluation:
\begin{itemize}[nosep]
\item \textbf{Artifact evaluation:} Does the output do what it should?
Unit tests pass, integration tests pass, static analysis clean, benchmarks
within bounds. This is the \texttt{pytest} or \texttt{cargo test} you are
already running---\textsc{Morph} just makes the results a versioned
object you can compare across commits.
\item \textbf{Process evaluation:} Did the pipeline behave acceptably while
getting there? Cost under budget, latency within bounds, no forbidden tool
calls, no policy violations. Things that do not show up in the output but
matter for deployment.
\end{itemize}
\noindent Both matter. A patch that passes tests but costs \$10 per request
is not a good merge, even if the text diff is clean.
There is one trap worth calling out. An agent that controls its own test files
could modify them to manufacture a perfect score. Each evaluation suite
declares a \emph{fixture source} to prevent this: where the test data comes
from. Options are \texttt{candidate-sourced} (the produced tree),
\texttt{base-sourced} (a specified baseline), \texttt{pinned} (locked to a
specific hash), or \texttt{external}. Without this, a capable enough agent
could just rewrite the tests.
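One plausible shape for the versioned suite definition, with direction,
aggregation, and fixture source all explicit (illustrative field names, not a
fixed schema):
\begin{verbatim}
// Illustrative shape of a versioned evaluation suite.
type Direction     = "maximize" | "minimize";
type Aggregation   = "mean" | "lower_confidence_bound"
                   | "percentile" | "pass_fail";
type FixtureSource = "candidate-sourced" | "base-sourced"
                   | "pinned" | "external";

interface Metric {
  id: string;               // e.g. "faithfulness", "p95_latency_ms"
  direction: Direction;     // which way is better
  aggregate: Aggregation;   // how samples become one score
}

interface EvalSuite {
  metrics: Metric[];
  fixtureSource: FixtureSource;   // where the test data comes from
}
// The suite object is hashed; commits reference it by that hash.
\end{verbatim}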
\subsection{Scores}
For an evaluation suite $T$ with metrics $\{m_1, \ldots, m_k\}$, a scorecard
is one number per metric---accuracy, latency, cost, whatever $T$ defines. We
write $V_T$ for the set of all possible scorecards under $T$:
\begin{equation}
V_T = \prod_{m \in T} V_m.
\end{equation}
One scorecard is at least as good as another if it is at least as good on
\emph{every} metric, where each metric's direction decides which way is better
(higher for accuracy, lower for latency):
\begin{equation}
x \leq_T y \iff \forall m \in T,\; x_m \leq_m y_m.
\end{equation}
Running a pipeline $P$ from starting state $s_0$ against suite $T$ produces a
scorecard. Because a pipeline's outputs vary across runs, a single execution
gives you a sample, not a ground truth. The aggregation method---mean over $n$
runs, lower confidence bound at level $\alpha$, p95 latency, pass/fail---is
part of the evaluation suite definition, fixed and hashed like everything else.
$\text{cert}_T(P, s_0)$ denotes the resulting aggregate scorecard:
\begin{equation}
\text{cert}_T(P, s_0) \in V_T.
\end{equation}
A score difference between two pipelines might be a real improvement or
sampling noise. The formalism treats $\text{cert}_T$ as the declared aggregate
and leaves the question of significance to the evaluation suite designer; the
discussion covers what that means in practice.
\begin{definition}[Improvement Ordering]
Pipeline $Q$ \emph{improves on} pipeline $P$ under suite $T$ and starting
state $s_0$, written $P \preceq_{T,s_0} Q$, if its scorecard is at least as
good on every metric:
\begin{equation}
\text{cert}_T(P, s_0) \leq_T \text{cert}_T(Q, s_0).
\end{equation}
\end{definition}
\begin{proposition}
The improvement ordering $\preceq_{T,s_0}$ is a preorder: a pipeline always
improves on itself (reflexivity), and if $Q$ improves on $P$ and $R$ improves
on $Q$, then $R$ improves on $P$ (transitivity).
\end{proposition}
\begin{proof}
Reflexivity follows from reflexivity of $\leq_m$ for each metric. Transitivity
follows from transitivity of $\leq_m$ applied componentwise.
\end{proof}
\noindent This turns ``did it get better?'' from a judgment call into a
comparison with a well-defined answer.
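The improvement check itself is a few lines. The sketch below, reusing the
suite types from the earlier illustration, tests $x \leq_T y$ by walking the
metrics and respecting each one's direction; incomparable scorecards simply
fail the check in both directions:
\begin{verbatim}
// Sketch: does y improve on x under suite T (x <=_T y)?
type Scorecard = Record<string, number>;   // one number per metric id

function atLeastAsGood(x: Scorecard, y: Scorecard,
                       suite: EvalSuite): boolean {
  return suite.metrics.every((m) =>
    m.direction === "maximize" ? y[m.id] >= x[m.id]
                               : y[m.id] <= x[m.id]);
}
\end{verbatim}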
% ============================================================
\section{Commits and Merge}
\label{sec:merge}
\subsection{What's in a Commit}
A Git commit says: here is what the files looked like. A \textsc{Morph} commit
says: here is a pipeline, the evaluation suite it was run against, and the
scores it got. The execution receipts are attached.
\begin{definition}[Commit]
A \textsc{Morph} commit is a tuple
\begin{equation}
c = (\text{tree\_hash},\; \text{program\_id},\; T,\; v,\; \text{parents},\;
\text{env\_constraints},\; \text{evidence\_refs}),
\end{equation}
where $\text{tree\_hash}$ is the root hash of the working directory tree (same
role as Git's tree object), $\text{program\_id}$ is the content hash of the
pipeline DAG, $T$ is an evaluation suite (by hash), $v \in V_T$ is the scores
the pipeline achieved, $\text{parents}$ are parent commit hashes forming the
Merkle DAG, $\text{env\_constraints}$ records the environment in which the
scores were captured, and $\text{evidence\_refs}$ are hashes of the
supporting run and trace objects.
\end{definition}
The scores $v$ are not decorative. They are what the pipeline produced when it
was committed, and the reference point for any future merge. When the pipeline
and evaluation suite are unspecified, they default to the identity pipeline and
empty suite---so \textsc{Morph} degrades gracefully to a plain VCS. Same
tamper-evident history as Git. Just more in each commit.
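Before hashing, a commit is then just a small record. The sketch below reuses
\texttt{Scorecard} and \texttt{EnvConfig} from the earlier illustrations (our
naming, not the v0 wire format):
\begin{verbatim}
// Illustrative shape of a commit object, before hashing.
interface Commit {
  treeHash: string;            // root hash of the working tree
  programId: string;           // content hash of the pipeline DAG
  evalSuite: string;           // hash of the evaluation suite T
  scores: Scorecard;           // v in V_T, as achieved at commit time
  parents: string[];           // parent commit hashes (Merkle DAG)
  envConstraints: EnvConfig | null;  // environment the scores came from
  evidenceRefs: string[];      // hashes of supporting Run/Trace objects
}
\end{verbatim}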
\subsection{How Merge Works}
Git merge records a text reconciliation. \textsc{Morph} is built on the
assumption that you are merging because you believe the result is better than
either parent. The merge record captures everything needed to verify that
assumption---or to see exactly where it fell short:
\begin{enumerate}[nosep]
\item \textbf{Structural merge} of pipeline DAGs (analogous to three-way
text merge).
\item \textbf{Combine evaluation suites.} If parents have suites $T_1$
and $T_2$, the merge suite is $T = T_1 \uplus T_2$---the full set of
metrics from both, kept separate by metric ID so definitions cannot
silently collide.
\item \textbf{Record the bar.} Embed each parent's scores into $V_T$
and record the best from either parent on every metric:
\begin{equation}
v_{\text{ref}} = \text{embed}(v_1) \sqcup \text{embed}(v_2).
\end{equation}
The embedding $\text{embed} : V_{T_i} \to V_T$ maps each metric's value
from the parent suite to the corresponding component in the merged suite,
and assigns $\bot_m$ (the worst possible value for metric $m$) to any
metric in $T \setminus T_i$. This requires each $V_m$ to have a least
element $\bot_m$, which we take as an axiom: for a minimize-direction
metric, $\bot_m = +\infty$; for a maximize-direction metric, $\bot_m = -\infty$.
The operator $\sqcup$ then takes the componentwise maximum under each
metric's direction.
\item \textbf{Record improvement.} Run the merged pipeline $R$ and record
its scores relative to $v_{\text{ref}}$:
\begin{equation}
\text{cert}_T(R, s_0) \text{ vs. } v_{\text{ref}}.
\label{eq:merge-dom}
\end{equation}
\end{enumerate}
\begin{theorem}[What a Good Merge Record Shows]
\label{thm:monotone}
If a merge commit records $\text{cert}_T(R, s_0) \geq_T v_{\text{ref}}$,
then the merged pipeline $R$ scores at least as well as both parents on every
metric each parent measured. For parent commits $c_1 = (\ldots, T_1, v_1, \ldots)$ and $c_2 =
(\ldots, T_2, v_2, \ldots)$:
\begin{equation}
\text{cert}_{T_1}(R, s_0) \geq_{T_1} v_1 \quad \text{and} \quad
\text{cert}_{T_2}(R, s_0) \geq_{T_2} v_2.
\end{equation}
\end{theorem}
\begin{proof}
By construction, $\text{embed}(v_i)$ agrees with $v_i$ on every metric in
$T_i$ and assigns $\bot_m$ elsewhere. The join $v_{\text{ref}} =
\text{embed}(v_1) \sqcup \text{embed}(v_2)$ therefore satisfies
$v_{\text{ref}} \geq_{T_i} \text{embed}(v_i) \geq_{T_i} v_i$ for $i \in
\{1, 2\}$ (where $\geq_{T_i}$ denotes restriction to the metrics of $T_i$).
If $\text{cert}_T(R, s_0) \geq_T v_{\text{ref}}$, then in particular for
every metric $m \in T_1$:
\[
\text{cert}_T(R, s_0)_m \geq_m (v_{\text{ref}})_m \geq_m (v_1)_m,
\]
which gives $\text{cert}_{T_1}(R, s_0) \geq_{T_1} v_1$, and symmetrically
for $T_2$.
\end{proof}
If you look at a merge record and the merged pipeline scored at or above both
parents, nothing was lost. If the record shows a drop on any metric, you know
exactly where and by how much. Either way, the record is there.
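The reference-score computation is mechanical. A sketch, reusing the suite and
scorecard types from the earlier illustrations:
\begin{verbatim}
// Sketch of the reference-score computation at merge time.
// bottom(m): the worst value for metric m under its direction.
const bottom = (m: Metric): number =>
  m.direction === "maximize" ? -Infinity : Infinity;

// embed: lift a parent scorecard into the merged suite, filling
// metrics the parent never measured with the bottom value.
function embed(v: Scorecard, merged: EvalSuite): Scorecard {
  const out: Scorecard = {};
  for (const m of merged.metrics) out[m.id] = v[m.id] ?? bottom(m);
  return out;
}

// join: componentwise best-of under each metric's direction.
function join(a: Scorecard, b: Scorecard, merged: EvalSuite): Scorecard {
  const out: Scorecard = {};
  for (const m of merged.metrics)
    out[m.id] = m.direction === "maximize"
      ? Math.max(a[m.id], b[m.id]) : Math.min(a[m.id], b[m.id]);
  return out;
}
// v_ref = join(embed(v1, T), embed(v2, T), T)
\end{verbatim}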
\subsection{Metric Retirement}
The merge record compares against the parent suites. That works when the
parent metrics still make sense. Usually they do. But sometimes a branch
changes what the pipeline fundamentally does---switching retrieval strategy,
replacing a model, restructuring the task entirely. The old metrics may no
longer be the right thing to measure. Comparing the new pipeline against them
would be like measuring a search engine's latency with a benchmark designed for
a chatbot: the number is real, the comparison is meaningless.
\textsc{Morph} handles this with \emph{metric retirement}. A commit can declare that one or more metrics from a parent's evaluation suite
no longer apply, with a stated reason. A retired set $D \subseteq T_1 \uplus
T_2$ is removed from the merge suite before recording the reference scores:
\begin{equation}
T = (T_1 \uplus T_2) \setminus D, \quad
v_{\text{ref}} = \text{embed}(v_1) \sqcup \text{embed}(v_2)
\text{ projected onto } T.
\end{equation}
The retirement is explicit and attributed: any merge that retires metrics must
include a \texttt{review} node. The review actor---human, agent, or both---is
recorded as the one who made the call. This is not a human requirement; it is
an attribution requirement. The \texttt{review} node records who decided, in
what environment, and why. That record persists like everything else.
The merge record still shows improvement on every metric that survived.
\textsc{Morph} does not prevent someone from retiring every metric---that is a
policy decision, not a recording one. What it does is make every retirement
visible, attributed, and permanent in the object graph.
% ============================================================
\section{How \textsc{Morph} Works}
\label{sec:system}
\subsection{How Objects Are Stored}
Like Git, \textsc{Morph} stores everything as content-addressed, permanent
nodes in a Merkle DAG using SHA-256---meaning every object is identified by
a hash of its contents, so the history cannot be tampered with. The object
types extend Git's \texttt{blob}/\texttt{tree}/\texttt{commit} with new
additions:
\begin{table}[t]
\centering
\caption{Object types in \textsc{Morph} compared to Git. Items marked with
$\star$ are new to \textsc{Morph}; others generalize Git equivalents.}
\label{tab:objects}
\begin{tabular}{@{}llp{6.5cm}@{}}
\toprule
\textbf{Object} & \textbf{Git Analog} & \textbf{Description} \\
\midrule
Blob & blob & Atomic content: prompt templates, tool schemas, configs. Open
\texttt{kind} field. \\
Tree & tree & Structured grouping of objects (directory analog). \\
Pipeline$^\star$ & --- & DAG of typed operators with data/control edges. \\
EvalSuite$^\star$ & --- & Versioned evaluation definition: cases, metrics,
thresholds, fixture sources. \\
Commit & commit & Tree hash $+$ pipeline hash $+$ eval suite $+$ scores
$+$ evidence refs. \\
Run$^\star$ & --- & Execution receipt: environment, inputs, outputs, metrics,
trace ref, actor identity. \\
Trace$^\star$ & --- & Sequence of typed, addressable events within a run. \\
Annotation$^\star$ & --- & Metadata on any object without changing its hash. \\
\bottomrule
\end{tabular}
\end{table}
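Content addressing itself is a one-liner over a canonical encoding. A sketch
using Node's built-in \texttt{crypto} module (the canonical serialization is
hand-waved here; \texttt{JSON.stringify} is only a stand-in):
\begin{verbatim}
// Sketch of content addressing with SHA-256 (Node.js built-in crypto).
import { createHash } from "node:crypto";

// JSON.stringify is a stand-in; a real store needs a canonical,
// deterministic encoding before hashing.
function objectId(obj: unknown): string {
  return createHash("sha256")
    .update(Buffer.from(JSON.stringify(obj)))
    .digest("hex");
}
\end{verbatim}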
\subsection{Runs and Traces: Keeping the Receipt}
When Claude Code applies a patch, something happened. A specific model ran, with
specific sampling settings, on a specific input state, and produced a specific
output. Tests either passed or failed. Latency was recorded. Maybe a tool call
was made mid-generation. Right now, most of that disappears the moment the
session ends. \textsc{Morph} keeps it.
A \textbf{Run} is a permanent receipt of a single pipeline execution. It
records: the pipeline hash (so you know exactly what ran), the full environment
(model identifier and version, sampling settings, toolchain), the input state,
the output artifacts, the observed scores, a trace reference, and the actor's
identity. Everything is there. If you need to re-run the evaluation or dig
into why a merge looked the way it did, the Run has it.
A \textbf{Trace} is the detailed record inside a Run---a sequence of typed
events, each with a unique identifier. A prompt call event records the rendered
prompt and the raw completion. A tool call event records the invocation and
result. A retry records what failed and why. Individual events within a trace
can be annotated independently---so a reviewer can flag a specific tool call or
a specific model output without touching anything else.
Runs never modify commits. They are evidence that accumulates over time. This
means you can run a pipeline ten times and have ten Runs attached to the same
commit---each one adding to the picture of how it behaves, without rewriting
history.
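One plausible shape for the receipt, reusing \texttt{EnvConfig} and
\texttt{Scorecard} from the earlier sketches (illustrative only):
\begin{verbatim}
// Illustrative shape of a Run receipt (EnvConfig, Scorecard as above).
interface Run {
  pipelineHash: string;    // exactly what ran
  env: EnvConfig;          // model version, sampling settings, toolchain
  inputState: string;      // hash of the input state
  outputs: string[];       // hashes of produced artifacts
  scores: Scorecard;       // observed metric values for this execution
  traceRef: string;        // hash of the Trace object
  actorId: string;         // who ran it
}
\end{verbatim}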
\subsection{Annotations}
\textbf{Annotations} attach metadata to any object (or a specific event within
a trace) without altering its hash. They enable human ratings, bookmarks, tags,
cross-references, and attribution overlays. Annotations are the extensibility
mechanism: higher-level tools can layer rich metadata onto the permanent object
graph without requiring changes to core object types.
\subsection{Separation of Concerns: \textsc{Morph} Does Not Execute}
\textsc{Morph} does not execute pipelines or run evaluations. All LLM calls,
tool execution, and test runs happen in external tools---Cursor, Claude Code,
your CI system. \textsc{Morph} just stores what they report: permanent
content-addressed objects, execution receipts, scores, and the record of what
happened at every merge. Commands like \texttt{morph run record} and
\texttt{morph eval record} \emph{ingest} results; they do not run anything.
The VCS layer stays thin and out of the way.
\subsection{Integration: Editors, Agents, and the Filesystem}
\textsc{Morph} is designed to be integrated into the tools where AI-assisted
development actually happens---editors like Cursor and agentic coding tools like
Claude Code---not run alongside them as a separate process.
\paragraph{The editor path.}
In a VS Code-based environment like Cursor, the integration is a VS Code
extension. VS Code's document model already serializes all writes to open
files---when a human is typing and an agent is applying a diff simultaneously,
the editor queues them. There is no raw filesystem collision at the content
level. \textsc{Morph} hooks into VS Code's document change events, which fire
on every modification with the diff and a flag indicating whether the change
came from a user action or a programmatic edit (agents use
\texttt{workspace.applyEdit()}). The extension reads that flag, tags attribution
accordingly, and records each change as a trace event. The pipeline DAG builds
incrementally as events arrive. A commit is just: snapshot the current tree,
hash it, seal the DAG assembled so far.
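A minimal sketch of the subscription point follows. The event subscription is
real VS Code extension API; \texttt{recordTraceEvent} is a hypothetical
\textsc{Morph} ingestion call, and attribution tagging is elided:
\begin{verbatim}
// Sketch of the editor-path hook. The subscription is real VS Code
// API; recordTraceEvent is a hypothetical Morph ingestion call.
import * as vscode from "vscode";

declare function recordTraceEvent(ev: {
  file: string; range: vscode.Range; text: string;
}): void;

export function activate(ctx: vscode.ExtensionContext) {
  ctx.subscriptions.push(
    vscode.workspace.onDidChangeTextDocument((e) => {
      // One trace event per content change; attribution tagging
      // (user vs. programmatic edit) is elided in this sketch.
      for (const change of e.contentChanges) {
        recordTraceEvent({
          file: e.document.uri.toString(),
          range: change.range,
          text: change.text,
        });
      }
    })
  );
}
\end{verbatim}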
\paragraph{The filesystem path.}
When agents write directly to the filesystem---headless Claude Code sessions,
subprocess invocations, scripts running outside the editor---the document
model's serialization is gone. Two agents could write to the same file
simultaneously.
Well-behaved agents already handle this. Claude Code writes a
file, reads it back, compiles, runs tests, and checks whether the world is in
the state it expected. If another agent wrote something conflicting in between,
the compile or test step catches it. The agent does not need to know about the
concurrent write---it just sees a broken state and responds to that. The
write-read-verify loop is the conflict detection mechanism.
\textsc{Morph}'s role is not to intercept or serialize filesystem access.
It is to record what the actor declares when it commits---which pipeline ran,
what the tree hash is, and what the evaluation results were. If two agents
collided and left the tree in a broken state, the evaluation scores reflect
that. The record shows exactly what happened and when.
Agents are responsible for verifying their own writes before committing. That
is already the convention. \textsc{Morph} records whether they did.
\subsection{CLI Design}
The CLI mirrors Git where possible:
\smallskip
\begin{center}
\begin{tabular}{@{}ll@{}}
\texttt{morph init} & Initialize repository \\
\texttt{morph add .} & Stage workspace changes \\
\texttt{morph commit -m "msg"} & Commit with eval contract \\
\texttt{morph merge <branch>} & Merge with score recording \\
\texttt{morph run record <file>} & Ingest execution receipt \\
\texttt{morph eval record <file>} & Ingest evaluation results \\
\texttt{morph annotate <hash>} & Attach metadata \\
\texttt{morph log} / \texttt{status} & Inspect history and state \\
\end{tabular}
\end{center}
% ============================================================
\section{Axioms}
\label{sec:axioms}
\textsc{Morph} rests on a small set of axioms. These are not philosophical
claims. They are the design constraints the rest of the system depends on.
Violate any of them in an implementation and something breaks.
\begin{axiom}[Content-Addressed, Immutable Objects]
Every object is identified by a cryptographic hash of its contents, the same
way Git works. If the hash matches, the content is what you think it is.
History cannot be tampered with.
\end{axiom}
\begin{axiom}[Evidence Does Not Rewrite History]
Runs and evaluation results never mutate prior commits. New evidence produces
new objects; old commits stay exactly as they were. You can always go back and
see what a commit recorded at the time it was made.
\end{axiom}
\begin{axiom}[Pipeline Steps Compose Cleanly]
There is a well-defined way to chain pipeline steps sequentially and run them
in parallel---analogous to \texttt{Promise.then()} for chaining and
\texttt{Promise.all()} for concurrency---such that the results stay consistent
when you refactor or restructure. You can reorganize a pipeline without
changing what it does, and there is a meaningful no-op step.
\end{axiom}
\begin{axiom}[Evaluation Suites are Explicit Contracts]
``Better'' is not implicit. Every evaluation suite declares its metrics,
their directions, their fixture sources, and how samples get aggregated into
scores. All of it is versioned and hashed like everything else.
\end{axiom}
\begin{axiom}[Scores are Partially Ordered]
A scorecard is one number per metric. One scorecard dominates another only if
it wins on every metric simultaneously. If pipeline A is more accurate but
slower than pipeline B, they are incomparable---neither dominates. You only
get a clean winner when one pipeline is at least as good on everything.
\end{axiom}
\begin{axiom}[Merge Records Scores From Both Parents]
A merge commit records what both parents achieved and what the merged pipeline
achieved. This is the record that lets anyone verify, for any merge in history,
whether the shared assumption held.
\end{axiom}
\begin{axiom}[Environment is Part of the Record]
Every run records the environment that produced its scores---model version,
sampling settings, toolchain. Without this, scores from different environments
are not comparable. A pipeline running against \texttt{gpt-4o} and the same
pipeline running against a fine-tuned local model are different things, even
if the code is identical.
\end{axiom}
\begin{axiom}[Reproducibility Means Re-Running the Checks]
You cannot get byte-identical outputs from an LLM pipeline. Reproducibility
means something weaker: re-running the same evaluation suite in the same
declared environment produces aggregate scores consistent with the recorded
value, within the expected variance of the aggregation method chosen.
\end{axiom}
% ============================================================
\section{Discussion}
\label{sec:discussion}
\paragraph{This is what CI is already trying to do.}
CI pipelines already run tests on proposed changes. GitHub Actions blocks a
merge if tests fail. Braintrust blocks a deploy if eval scores drop. These
tools share the same underlying assumption \textsc{Morph} is built on: that
you are trying to get better. The difference is that \textsc{Morph} records
the contract, the evidence, and the outcome as permanent versioned objects,
not ephemeral pipeline logs. Failed CI runs evaporate; \textsc{Morph} runs
persist as evidence you can audit months later. Any enforcement sits on top.
\textsc{Morph} just makes sure the record is always there.
\paragraph{Agent accountability, finally.}
When Claude Code applies a patch today, there is no clean record of which
model generated it, what prompt produced it, what it cost, or whether it
passed any checks. Something breaks in production. You go digging through
session logs hoping something was captured.
\textsc{Morph} changes that. Every committed patch has a Run attached: the
agent identity, the model version, the full execution trace, and the scores.
That chain---from agent action to recorded outcome---does not currently exist
anywhere in agent-driven development.
\paragraph{The stats are minimal in v0.}
The current implementation uses mean aggregation, worst-case scores, percentile
thresholds, and lower confidence bounds. That is enough to be useful. Richer
methods---Bayesian comparison, two-sample equivalence testing, human-in-the-loop
evaluation---all fit the evaluation interface and are deferred to future work.
The interface is clean: swapping in a better scoring method does not require
changing the commit format or the merge procedure.
\paragraph{Built to distribute.}
Because all objects are content-addressed and permanent, distributed settings
are just Git's push/pull model applied to a richer object store. The v0
implementation is local-only; a remote protocol and federated verification
are planned.
\paragraph{On Git compatibility.}
An implementation of \textsc{Morph} can use Git's object store directly.
The content-addressed Merkle DAG that Git uses is the right substrate for
\textsc{Morph}'s objects too---blobs, trees, and commits map cleanly, and
the new object types (pipelines, eval suites, runs, traces) are just additional
objects in the same store. \textsc{Morph} and Git can share a repository:
Git tracks source files, \textsc{Morph} tracks pipelines and score history,
same underlying object store. Adoption is incremental. Leave the pipeline and
evaluation suite fields empty in a \textsc{Morph} commit and it degenerates
into a standard tree-hash commit.
\paragraph{Scores are statistics, not facts.}
Every score is an aggregate over runs that vary---a mean, a confidence bound,
a percentile. An observed improvement of 0.02 accuracy might be real or it
might be noise, depending on sample count and pipeline variance. \textsc{Morph}
records the aggregate and the method that produced it. Teams that need
stronger guarantees should use lower confidence bounds or equivalence tests
rather than raw means; the record makes the choice visible either way.
\paragraph{The credit assignment problem.}
When multiple actors contribute to a single pipeline, \textsc{Morph} can tell
you who touched which nodes, but it cannot automatically tell you which actor
caused a score change. The scores are a property of the composed pipeline, not
individual contributions. Agent A's retrieval step shapes what agent B can
generate; agent B's output shapes what agent C verifies. These contributions
are entangled, not independent.
The standard approach---Shapley-value decomposition---requires running the
pipeline counterfactually with each actor's contribution removed. For LLM
pipelines, that is ill-defined: removing agent A's retrieval does not produce
a meaningful baseline because agent B's generation was conditioned on that
retrieval. You cannot swap out one actor and expect the rest of the pipeline
to behave as if it had not happened. For actors on separate branches this is
not a limitation---each branch has its own scores. For tightly coupled
cooperative work, the evaluation suite has to test the composed result as a
whole, and credit decomposition remains an open research problem.
\paragraph{Known limitations.}
Three open problems worth naming.
First, \textsc{Morph} assumes evaluation suites are honest---that the metrics
actually measure what you think they measure. A bad eval produces meaningless
scores. Garbage in, garbage out; \textsc{Morph} makes it auditable garbage,
but it is still garbage.
Second, structural merge of pipeline DAGs (step 1 of the merge procedure) is
under-specified. Three-way text merge has decades of tooling behind it.
Pipeline-graph merge does not. Even defining what a correct pipeline merge
means---when two branches both modified the same prompt node in different
directions---is an open problem. This is the hardest gap in \textsc{Morph} v0.
Third, a recorded score difference may be real improvement or sampling noise.
The formalism treats $\text{cert}_T$ as the declared aggregate and leaves
significance to the evaluation suite designer. A future version could expose
a significance threshold as a first-class field in the suite, making the
required confidence level part of the versioned contract rather than buried
in the aggregation method.
\paragraph{Distribution shift.}
Evaluation suites are pinned by hash, which is the right design for
auditability. But the world changes: retrieval corpora age, model providers
update APIs silently, user query distributions drift. A high score recorded
six months ago against a pinned suite may say nothing about current
performance. \textsc{Morph} does not solve this---no version control system
can. What it does is make the evaluation suite explicit and versioned, so
when a team updates their suite to reflect a shifted distribution, the change
is visible in the history with full attribution. The gap between a pinned
historical eval and a current one is a deliberate design choice that teams
can see, not a silent assumption they cannot.
\paragraph{Metric retirement and policy.}
Metric retirement is a necessary escape valve but also a potential hole.
Nothing in the formal model prevents an actor from retiring every metric.
That is intentional. \textsc{Morph}'s job is not to block things---it is to
record everything. Every retirement is a \texttt{review} node in the history:
who made the call, in what environment, and why. Organizations that want to
restrict who can retire which metrics can build that on top, the same way
code review policies sit on top of Git without being part of Git.
% ============================================================
\section{Conclusion}
\label{sec:conclusion}
\textsc{Morph} is built on a simple assumption: that teams using it are trying
to improve. Each commit should leave the pipeline at least as capable as before.
Each merge should produce something better than either parent. That is not a
constraint \textsc{Morph} enforces---it is the model actors buy into when they
use it, the same way Git assumes you are trying to write good code.
Within that model, \textsc{Morph}'s job is to record everything. It extends
Git's content-addressed Merkle DAG to capture what actors actually do: the
pipelines they run, the evaluation suites they run them against, and the scores
that come back. Every merge records what both parents achieved and what the
merged pipeline achieved. The Actor abstraction puts humans, agents, and
human-agent pairs on equal footing---same DVCS model, same commit mechanics,
same merge record. In a Cursor session, the pipeline DAG builds incrementally
from every change event. A commit is just a named snapshot of that stream.
When the software you are building makes LLM calls, retrieves context, and
runs tools---and the same code produces different outputs depending on which
model is running---``what changed'' is not just bytes. It is who did what,
under what conditions, and what the pipeline produced. \textsc{Morph} records
all of it.
\bibliographystyle{plainnat}
\begin{thebibliography}{20}
\bibitem[Angiuli et~al.(2014)]{angiuli2014homotopical}
Carlo Angiuli, Edward Morehouse, Daniel~R. Licata, and Robert Harper.
\newblock Homotopical patch theory.
\newblock In \emph{Proceedings of the 19th ACM SIGPLAN International Conference
on Functional Programming (ICFP)}, pages 243--256, 2014.
\bibitem[Braintrust(2024)]{braintrust2024}
Braintrust.
\newblock {Braintrust: The AI observability platform}.
\newblock \url{https://www.braintrust.dev}, 2024.
\bibitem[Chase(2022)]{chase2022langchain}
Harrison Chase.
\newblock {LangChain}.
\newblock \url{https://github.com/langchain-ai/langchain}, 2022.
\bibitem[Chiang et~al.(2024)]{chiang2024chatbotarena}
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios~Nikolas Angelopoulos,
Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan,
Joseph~E. Gonzalez, and Ion Stoica.
\newblock Chatbot {Arena}: An open platform for evaluating {LLMs} by human
preference.
\newblock In \emph{Proceedings of the 41st International Conference on Machine
Learning (ICML)}, 2024.
\bibitem[DVC(2020)]{dvc2020}
{DVC Team}.
\newblock {DVC}: Data version control.
\newblock \url{https://dvc.org}, 2020.
\bibitem[Jimenez et~al.(2024)]{jimenez2024swebench}
Carlos~E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir
Press, and Karthik Narasimhan.
\newblock {SWE-bench}: Can language models resolve real-world {GitHub} issues?
\newblock In \emph{The Twelfth International Conference on Learning
Representations (ICLR)}, 2024.
\bibitem[Liang et~al.(2023)]{liang2023helm}
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu,
Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar,
et~al.
\newblock Holistic evaluation of language models.
\newblock \emph{Transactions on Machine Learning Research}, 2023.