From 36049b235ff2418f6dee52fd4848e80d9796d6c8 Mon Sep 17 00:00:00 2001
From: Maxim Katcharov
Date: Wed, 8 Apr 2026 15:39:42 -0600
Subject: [PATCH] Add API Design document

---
 design/api_design.md           | 392 +++++++++++++++++++++++++++
 design/api_eval.md             | 472 +++++++++++++++++++++++++++++++++
 design/api_redesign_densify.md | 196 ++++++++++++++
 3 files changed, 1060 insertions(+)
 create mode 100644 design/api_design.md
 create mode 100644 design/api_eval.md
 create mode 100644 design/api_redesign_densify.md

diff --git a/design/api_design.md b/design/api_design.md
new file mode 100644
index 0000000000..2918d28bad
--- /dev/null
+++ b/design/api_design.md
@@ -0,0 +1,392 @@
+

# Introduction

Users experience a product through its API. The API informs a user’s mental model, and determines both the legibility of code and the ease of writing it. It serves as an external contract, so it is difficult to change.

# Principles

An API should be:

* User-centred. Consider novice and experienced users. The API should be used, in tests and elsewhere, in the manner we expect the user to use it.
* Consistent. At a macro level, represent the same things in the same way across the API. At a micro level, syntax should match semantics: use the same constructs to represent the same concepts. At a meta level, apply a small and clear set of rules confidently and consistently, even if this leads to suboptimal results in some cases.
* Boring. Idiomatic. Avoid novelty; focus on established structures and patterns, common to the language, and to languages in general. Tricks and complexity must be well-justified.
* Extensible. It must allow for future changes and additions.
* Safe. It should prevent user error, through its wording, structure, or the use of types.

We spend more time reading than writing, but we spend more time thinking than reading. Problems in a conceptual model are substantially worse than a readability issue.

* Comprehensibility above readability.
Consider the number and complexity of distinct concepts, or types of thing, and how the API can be used to distinguish them. Choose structural or conceptual clarity over an API that reads like natural language.
* Readability over writability. Keep it terse, and avoid boilerplate. This is increasingly important as our need to review AI-generated code increases.
* Do not neglect writability, including use through tools such as autocomplete and AI assistants.

Specific guidance:

* Show sequenced operations in a natural reading order. Prefer chaining to nesting.
* Show sub-structured operations or declarations through nesting (including lambda-initiated chains).
* Avoid state. Prefer immutability. Make procedural code look data-driven and declarative, without destroying procedural reading order where present.
* Be explicit. Avoid implicit operations or defaults.
* Do not rely on documentation. Ensure that the API is clear and usable, as much as possible, without docs. Then write great docs.
* When tracking another API, adhere to its functionality and concepts, but do not repeat its defects and inconsistencies.
* Prefer simplicity in the API. Move complexity, including configuration, to a layer encapsulated by the API.

# Terms and Distinctions

The following sections introduce concepts and terminology.

## Syntactic Flow

Syntactic flow refers to the syntactic structures of a language or an API that express operations or relations in a way that matches how we see or think about things. Examples:

```
(a + 1) * 2         -- infix operator
* + a 1 2           -- prefix operator
(mult (add a 1) 2)  -- lisp-like functional
mult(add(a, 1), 2)  -- nested functional
a.plus(1).times(2)  -- chaining (fluent, DSL)
var n = a.plus(1);
var r = n.times(2); -- imperative
a.pipe([
  add(1),
  mult(2) ]);       -- list-like
```

Some of these styles are preferable for expressing particular constructs, since they better match how we think about them.
This is a matter of:

* Brevity. Some things are much easier to understand at a glance. Infix operators demonstrate this, versus approaches that require multiple lines.
* Consistency and focus. If `a` and `1` are equally important and functionally identical, there is no reason to express one differently from the other, as is done in the chaining and especially pipeline examples. However, if adding and multiplying are something "done to" `a`, then we should prefer to highlight this by using chaining.
* Visual closeness. We prefer to have the `*` close to the `2` that it applies to. This is part of why the prefix style is less readable.
* Sequential order. Nested approaches often violate this: `add` is a verb, so there is an implied order of action. However, we must deduce this order by glancing from place to place rather than having it naturally presented to us.
* Structural coherence. The simplest structure is a sequence, and it can be used to express the ordering of action just mentioned. It can express other things, like importance (in fields or parameters), data flow, and so on. We can also express tree-like structures: simple trees using infix operators, and complex trees using multiple lines and indentation. Directed acyclic graphs tend to require the imperative style for expressing data and procedural flow. Unconstrained graphs are more difficult to think about and express, and we tend to avoid them.

The above example code is simple. Some constructs that apply in simple cases break down in complex cases, while others improve. Note, for example, that chaining is not limited to linear sequences, and works well for structured DSLs.
+ +``` +form.validator() + .section("reservation", s -> s + .stringField("name", f -> f.length(2, 20)) + .dateField("date", f -> f.future()) + .stringField("comment", f -> f.optional().length(0, 100))); +``` + +In cases where there might be excessive or careful boilerplate in a nested functional or imperative approach, chaining is often a better fit. + +In practice, we focus on the **imperative** and **chaining** styles. Operators are terse, but very limited. Functional styles violate order and introduce substantial nesting. The imperative style matches the sequential way we tend to think about things, and chaining narrows the imperative style, often making its structure more clear. + +### Initiating and terminating flow + +There are patterns for initiating flow: + +``` +new WriteConcern(2) -- constructor +ReadPreference.secondary() -- static factory method +MongoClientSettings.builder() -- static factory method +myList.stream()... -- factory method +WriteConcern.MAJORITY -- immutable static constant +myEntity.operation(b -> b...) -- lambda (provides b, from myEntiy) +``` + +Patterns for terminating flow: + +``` +MongoClients.create(settings) -- create method taking a settings object +settingsBuilder.build() -- builder build method; terminal operations +``` + +In many cases, flow does not explicitly terminate. + +``` +a.plus(1) +collection.map(a -> a + 1) +something.withValue(2) +something.value(2) +``` + +Configuration examples: + +``` +Client c = createClient().withTimeout(x); +Client c = createClient( ClientSettings.create().withTimeout(x) ); +Client c = createClient( ClientSettings.builder().timeout(x).build() ); +``` + +The first has no builder or settings, the second has ClientSettings, while the third has a ClientSettingsBuilder. The builder pattern is not strictly necessary (even validation can be on-use, rather than on-build). + +## Operations + +An **entity** is a thing, and an **operation** is something that can be done with a thing. 
A value is an entity that never changes. Some operations, like addition, are simple. They require only two values as inputs, and yield a third value. Other operations require many more inputs.

Operations are considered generally, and do not correspond to methods. An object could represent an operation, and, less frequently, multiple objects could represent a single conceptual entity.

### Subjects

The **subject** of an operation is the entity we do something to.

```
substring("hence", 0, 3)
"hence".substring(0, 3)
```

The string is the subject. The numbers are what we will call **directives**: they direct how the operation is to be performed. The object-oriented approach adheres to the above principle of "consistency and focus" by using syntax to distinguish the subject. This syntactic style strongly encourages a single subject, which is the object that a method is called on. However, there can be multiple subjects. For example:

```
"hen".concat("ce")
```

We concatenate two items, and there is no particular reason (except English reading order) to treat one as the proper subject - both strings are the subjects of the operation. Most operations that have more than two subjects can be reduced to binary operations (add, multiply, and, or, and other monoids), or equivalently to an operation taking a list. But not all. For example:

```
cond(test, thenValue, elseValue)
blendColors(r, g, b)
```

Such operations are rare, but are difficult to model consistently in an object-oriented style, and tend to appear unintuitive. In some cases, it may be possible to re-formulate the entities to allow for a binary operation - for example, ColorChannel instances might blend via binary operations, rather than the three-subject operation above. Some form of symmetry is usually a good indication that this is possible.

### Domain values

Some values form natural groupings, and can be represented together as a **domain value**.
For example, if an operation specifies a first and last position, these might form a Range. Three integers representing red, green, and blue values could form a Color.

Nominal wrappers can be used to wrap even a single primitive value. This practice makes code safer, but it comes with overhead, and is typically reserved for types that cross boundaries. For example, if a particular concept is used only within a simple method, it is obviously not worthwhile to create a wrapper for that type.

While such types are useful internally, they can complicate an API. If the nominal type or domain value is not used extensively outside the API, then we must create those values at the API layer. Compare:

```
"hence".substring(0, 3)
"hence".substring(range(0, 3))
```

Although `range` is a coherent domain value, it is not used outside the API, so we should prefer to represent it using non-domain parameters, often primitives. If a type is expected to already be created before it is provided as an argument, then we should directly use that type in the API. For example:

```
diagram.shift(x, y); // inconvenient, if using Vec2 outside of the API
diagram.shift(delta); // delta is Vec2
```


### Configuration

When discussing the subjects of operations, we have considered simple operations on values. As we turn to complex operations, we find that inputs become more numerous and complex. Suppose that we need to send a message. This operation may need to specify the message itself, the details of the destination, the format, how it should be sent, timeouts, and so on. The inputs are numerous.

There are three sources for these inputs: locations in code, application settings, and user data. These correspond to ownership and control: code is owned by developers, settings by the system operator, and user data by users.
This control flows down: the developer can exercise control over the scope of user data, but the opposite must not be the case. There are also hard boundaries on making changes to those scopes: recompilation and application restarts.

Inputs tend to form natural groupings. These groupings might suggest certain entities or composite domain values. Nested entities might have different values and overrides for particular inputs.

**Configuration** encompasses those inputs that govern how an operation is done. Configuration is often intentionally concealed in various entities supplied to an operation. These inputs are, in a sense, often irrelevant to the semantics of the operation itself. For example, whether a list has been configured to be an array or a linked list is often irrelevant.

### Summary

* **Operation** - what is done
* **Subject** - what an operation is done to
* **Directive** - what is done, more exactly
* **Configuration** - how it is done

The subject and directives are the **parameterization** of an operation. For example, in `"hence".substring(0, 3)`, `"hence"` is the subject and part of what we are calling the parameterization.

In an object-oriented language, in the simple ideal case, configuration is provided through the constructor, the subject is the object entity, the operation is the method, and the parameters/arguments (depending on whether you are declaring or calling) are the directives.

## Guidance

1. Strongly prefer chaining to nested static functions. Nesting is difficult to read, and static functions can have conflicts and often introduce substantial boilerplate.
2. Classify the inputs as Subject, Directive, or Configuration. In some cases, this is obvious, but in others, alternatives should be considered or left open.
3. Group inputs together. This might suggest composite entities, new subjects, composite directives. For example: range, coordinate, name, address.
However, prefer primitives over domain values unless the domain value is already used extensively outside the API (see Domain values above).
4. Consider moving values into configuration, and carefully consider the structure and hierarchy of the entities involved.
5. Choose a good default, instead of letting the user configure it. (For example, the Java Streams API uses ArrayList by default.)
6. Eliminate optional parameters.

### Eliminating Optional Parameters

Optional parameters can often be removed, and very few are truly optional. They often arise when defaults or implicit values are used, when different operations are conflated into a single construct, or when higher-level configuration "leaks" into the operation.

* **Set defaults explicitly**. Implicit behaviour should be avoided, since it leads to confusion and errors. Consider: any required parameter can be made optional by giving it a convenient default. Defaults should be used rarely, such as when they save substantial boilerplate in very common operations.
* **Beware of booleans**. Booleans are sometimes used to represent alternate behaviours of a method. It may be better to split the method into two methods, especially when the alternate behaviour is rare. For example, `substr(useBytes, ...)` can be split into `substr` and `substrBytes`. The same applies to enums. Be especially wary when different values imply different sets of required parameters.
* **Omit flags**. If an optional is intended to enable functionality, consider removing the choice and always defaulting to some value. If it is truly unavoidable, consider placing this as a configuration flag on some higher-level entity, or even at the level of application configuration.
* **Split out convenience methods and composite operations**. Optional parameters sometimes represent a secondary operation.
Represent these instead using the same programming construct as an ensuing or preceding operation; if an existing operation already performs the same task, simply use it. An example might be a pipeline stage that has a "thenFilter" option, which should be removed if an ensuing "filter" stage can be used instead.
* **Extract configuration**. Some entities need to be configured in some way. Entity configuration often requires sane defaults, flags, and so on. Certain operation-level options can be moved into entity configuration. If an optional does not naturally move into a higher-level entity, consider whether it belongs in a different entity. Remove optional parameters by having operations accept other entities as parameters.

The above suggestions might not eliminate all optional parameters, or there might be other obstacles to removal.

### Representing Optional Parameters

Directive parameters might be optional, sometimes based on the values of other parameters. There are various ways to deal with optional inputs, when they must be included.

Below, the items prefixed 0 are basic, generally built into the language. The items prefixed 1 offer some form of builder for all parameters. The items prefixed 2 place required fields at the method level, and use some form of builder for optional parameters.

**0.0 Flat methods and overloads**: overloads for all variations. While viable in simple cases, this is generally unsustainable as it results in an overload explosion when new options are added.
```java
densify("field", 10)
densify("field", 0, 3, 1, "p")
```

**0.1 Nested values (domain values)**: some coherent grouping moves to a parameter, which can be a polymorphic domain value. It is important to ensure that the domain value is itself a coherent type of thing, and independent of the particular operation. It should not be used as a mere label.
Prefer primitives unless the domain value is already used extensively outside the API (see Domain values above).
```java
densify("field", step(10))
densify("field", rangeStep(0,3,1), "p")
move(xyDelta(1,20))
```

**1.0 Chained Options, initiated on required parameters**:

All required parameters are included in the initiation:
```java
densify( densifySpec("field", ...) ); // no optionals
densify( densifySpec("field", ...).partitionByFields("q") );
```

**1.1 Chained, initiated on first required**:

First parameter starts the chain, required may be chained:
```java
densify(densifyField("field").step(10));
densify(densifyField("field").rangeStep(0,3,1).partitionByFields("q"));
```

Note that nested values can be combined with this approach:
```java
densify(densifyField("field").range(step(10)));
densify(densifyField("field").range(rangeStep(0,3,1)).partitionByFields("q"));
```

**1.2 Fluent Operation**: the operation starts the chain
```java
densify().field("field").step(10);
densify().field("field").rangeStep(0,3,1).partitionByFields("q");
```
In the above example, the operation either cannot be chained, or must have a mandatory terminal method.

**1.3 Builders**: All parameters are chained off of a provider method:
```java
densify(densifySpec().field("field").range(step(10)));
densify(densifySpec().field("field").range(rangeStep(0,3,1)).partitionByFields("q"));
```
The builder may use an explicit build method.

**2.0 Fluent Optionals**: the chain begins at optional parameters:
```java
densify("field", step(10))
densify("field", rangeStep(0,3,1), Densify.options().partitionByFields("q") )
densify("field", rangeStep(0,3,1), opts -> opts.partitionByFields("q") )
densify("field", rangeStep(0,3,1), Densify.OPTIONS.partitionByFields("q") )
densify("field", rangeStep(0,3,1), ... .partitionByFields("q").build() )
```
There are multiple variations.
There could be overloads on just the options, resulting in a more localized "explosion".

**2.1 Varargs**: use varargs to handle the optional (variable) arguments:
```java
densify("field", step(10))
densify("field", rangeStep(0,3,1), DensifyOption.partitionByFields("q"))
Signature:
densify(..., DensifyOption... options)
```

## Guidance for Representing Optional Parameters

1. Use 1.3 Builders for entity configuration, where optionals are unavoidable.
2. If optionals cannot be eliminated at the operation level, use 2.1 Varargs.
3. Options and builders should never be shared across operations. Either the operations are not distinct, or they are, and their options will diverge.
4. Lambdas are a familiar and powerful pattern. A lambda will provide the type for auto-complete. In some cases, a lambda is required for type safety. However, lambdas tend to increase complexity.
5. Place the subject on the left whenever possible, with directive parameters to its right; among directives, place required parameters to the left and optional ones to the right.

# Practical examples

We consider and evaluate concrete examples.

## Java SE: Streams

Java's Stream API is a sound and canonical design example. It uses initiating and terminal operations. However, consider the following:

```java
var out = items.stream()
    .filter(this::keep)
    .parallel() // can apply to whole stream, earlier and later
    .map(this::stage2)
    .toList();
```

Concerns:
1. Configuration is mixed into the stream of operations. A user might expect that only ensuing operations are parallel.

As a positive, it nicely uses `stream` and `parallelStream` for initiation, rather than a boolean configuration argument `parallel` (eliminates boolean options).
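The initiation point can be sketched in plain JDK code: parallelism is chosen when the flow is initiated (`stream()` vs `parallelStream()`), rather than passed as a boolean parameter or inserted mid-chain. The class and method names below are illustrative only, not part of any existing API.

```java
import java.util.List;

// Sketch: configuration (parallelism) is fixed when the flow is initiated,
// so it cannot be mistaken for an operation that applies only to later stages.
class StreamInitiationDemo {

    // The style argued against: alternate behaviour selected by a boolean flag.
    static List<Integer> doubled(List<Integer> items, boolean parallel) {
        return (parallel ? items.parallelStream() : items.stream())
                .map(n -> n * 2)
                .toList();
    }

    // The flag split into two named initiations, mirroring stream()/parallelStream().
    static List<Integer> doubledSequential(List<Integer> items) {
        return items.stream().map(n -> n * 2).toList();
    }

    static List<Integer> doubledParallel(List<Integer> items) {
        // toList() preserves encounter order even for parallel streams.
        return items.parallelStream().map(n -> n * 2).toList();
    }
}
```

Both named variants are identical from the caller's perspective; the execution mode is visible at the initiation site rather than threaded through the chain as data.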
## Driver: find

```java
FindIterable<Document> results = collection
    .find(Filters.eq("status", "active"))
    .projection(Projections.include("name"))
    .sort(Sorts.descending("age"))
    .limit(10)
    .collation(Collation.builder().locale("en").build()) // configuration — how strings are compared
    .maxTime(5, SECONDS); // configuration — execution time limit
```

Concerns:
1. Configuration is mixed into the pipeline.
2. The initiating "find" takes in the filter parameter, rather than configuration. Consider why Java SE does not accept a filter in its initiating method `stream(a -> a.isEven())`.
3. The `limit` and `sort` methods ignore call order. No distinction is made between configuration and the declaration of operations (and their parameterizations).

This could be improved by introducing a find that returns an iterable, placing configuration into an immutable findSettings accepted by this creation method (or elsewhere), introducing a filter stage, and using types or exceptions to prevent the ordering issue. This would make the find method look much more like an aggregation pipeline.

Worth doing? No.

## Java SE: HttpClient

```java
HttpClient client = HttpClient.newBuilder()
    .version(HTTP_2)
    .connectTimeout(Duration.ofSeconds(5))
    .executor(exec)
    .followRedirects(Redirect.NORMAL)
    .build();

var req1 = HttpRequest.newBuilder()
    .uri(URI.create("https://api.example.com/users"))
    .header("Authorization", "Bearer " + token)
    .header("Accept", "application/json")
    .GET() // directive
    .build();

var req2 = HttpRequest.newBuilder()
    .uri(URI.create("https://api.example.com/users"))
    .header("Authorization", "Bearer " + token)
    .POST(BodyPublishers.ofString(json)) // directive
    .build();

client.sendAsync(req1, BodyHandlers.ofString());
client.sendAsync(req2, BodyHandlers.ofString());
```
The client is created via a builder, and so are the requests. The client's `sendAsync` method uses the client, and the request entity.
The request entity mixes configuration with directives, but is an example of moving options that could have existed in the `sendAsync` method into an entity. The `BodyHandler` was not moved.

Concerns:
1. The request builders repeat boilerplate, with only the "directives" (`GET()`, `POST(...)`) varying.
2. This API closely tracks the REST API, without adjustment. (It may be a design goal to exactly track the original API.)

Note the similarity to `MongoClient`, though here, the "Collection" (target resource) is specified as a URI rather than as a separate entity. Perhaps the host and authentication could be specified at the client level. Specific requests could be initiated from resources.


## Driver: Mql Expressions API

Compare with Stream:
```java
numList.stream()
    .filter(n -> n.mod(valueOf(2)).equals(ZERO))
    .map(n -> n.multiply(TEN))
    .reduce(ZERO, (a, b) -> b.add(a))
```

Equivalent code in the MQL Expressions API:

```java
fieldArrInt("numList")
    .filter(n -> n.mod(2).eq(0))
    .map(n -> n.multiply(10))
    .reduce(of(0), (a, b) -> b.add(a))
```

The MQL expressions API design uses chaining, including chaining of subjects. Reduce operations are carried out on typed lists, for the same reason, and each one is a named method (sum, concat, and so on). Only initial values use static methods, and there, they are limited to `of`. It uses lambdas for certain stages, like the Stream API.

Concerns:
1. We could add overloads for methods that take `of`, in certain locations where this has not already been done - though not all.
2. The API is not properly integrated into the driver's aggregate API.



diff --git a/design/api_eval.md b/design/api_eval.md
new file mode 100644
index 0000000000..57ba261fb3
--- /dev/null
+++ b/design/api_eval.md
@@ -0,0 +1,472 @@
This report is based on concepts discussed in api_design.md

# Research: Java Driver

This section organizes the public API by the distinction drawn above: **creation/configuration** (building entities) vs.
**parameterization** (specifying operations), with code snippets illustrating distinctions. + +## A. Creation / Configuration + +Everything used to build a long-lived entity: clients, connections, credentials, buckets. + +### A1. The Builder Pattern + +The standard pattern for entity configuration. A `Builder` inner class accumulates settings; `.build()` produces an immutable result. + +```java +// The full pattern in one example: builder, nested builders, +// domain value inputs, factory class consuming the result, listener registration. +MongoClient client = MongoClients.create( // factory class consumes the built settings + MongoClientSettings.builder() // .builder() starts the builder + .applyConnectionString(new ConnectionString("mongodb://localhost")) + .readPreference(ReadPreference.secondary()) // domain value as builder input + .writeConcern(WriteConcern.MAJORITY) // domain value (static constant) + .credential(MongoCredential.createCredential("user", "admin", pw)) // domain value (static factory) + .applyToConnectionPoolSettings(b -> b // nested builder via callback + .maxSize(50) + .maxWaitTime(10, SECONDS)) + .applyToSocketSettings(b -> b // another nested builder + .connectTimeout(5, SECONDS)) + .addCommandListener(commandListener) // listener registration — just another builder method + .applyToConnectionPoolSettings(b -> b // listener on nested builder + .addConnectionPoolListener(poolListener)) + .build()); // .build() produces immutable MongoClientSettings +``` + +**Who uses this:** MongoClientSettings, ClusterSettings, ConnectionPoolSettings, SocketSettings, SslSettings, ServerSettings, AutoEncryptionSettings, ClientSessionOptions, TransactionOptions, Collation, ServerApi. + +Structure: +- `X.builder()` → `X.Builder`; `X.builder(existingX)` → copy-constructor builder. +- Builder methods: bare names returning `this` (`.maxSize()`, `.readPreference()`). +- `.build()` → immutable result (`@Immutable`), validated in private constructor. 
+- Time-related parameters use `method(long value, TimeUnit unit)` signature (`.maxWaitTime(10, SECONDS)`). + +Sub-patterns that appear within or alongside this builder pattern: + +**Domain value objects as builder inputs.** ReadPreference, WriteConcern, ReadConcern, MongoCredential, and ConnectionString are immutable values created independently, then passed to builders. Their construction idioms vary: + +```java +ReadPreference.secondary() // static factory +ReadPreference.secondary(tagSet, 5000, MILLISECONDS) // overloaded static factory (~28 overloads total) +WriteConcern.MAJORITY // static constant +new WriteConcern(2) // constructor +MongoCredential.createCredential("user", "admin", pw) // static factory (createXxx naming) +new ConnectionString("mongodb://localhost") // constructor (parses elements from connection URI) +``` + +Some domain values support `with*` for copy-on-write modification before passing to a builder: +```java +ReadPreference.secondary().withMaxStalenessMS(5000, MILLISECONDS) +WriteConcern.MAJORITY.withWTimeout(5, SECONDS).withJournal(true) +MongoCredential.createCredential(...).withMechanismProperty(KEY, value) +``` + +**Factory classes** (`MongoClients.create()`, `GridFSBuckets.create()`, `ClientEncryptions.create()`) consume the built settings to produce the entity. This is just the final step of the builder pattern. + +**Listener registration** (`addCommandListener()`, `addConnectionPoolListener()`) is just another builder method. Listener interfaces use `default` (empty) methods so users override only the callbacks they need. Event objects are immutable constructor-based data holders. + +Note: domain values and `with*` are also used outside the builder context — see A2 below. + +Concerns: +1. There are too many variations in the construction of builder inputs. 
WriteConcern and ReadConcern have public constructors; ReadPreference and MongoCredential use only static factories; ReadConcern is mostly used via constants but also exposes `ReadConcern(ReadConcernLevel)`.
2. ReadPreference has ~28 static factory overloads for (mode × tagSet × maxStaleness).
3. The settings could have been immutable, with no builder required, using wither-style setters instead.
4. Four immutability strategies coexist across the driver: (a) sealed immutable fluent (Search API, MQL); (b) mutable `return this` (iterables, Options); (c) builder → freeze (Settings); (d) `with*` copy-on-write (domain objects, client hierarchy).

### A2. Copy-on-Write on the Client Hierarchy

MongoCluster, MongoDatabase, MongoCollection, and GridFSBucket use `with*` methods to create new instances with adjusted configuration. This is not the builder pattern — these are runtime configuration adjustments on already-created entities:

```java
MongoCollection<Document> coll = db.getCollection("orders")
    .withReadPreference(ReadPreference.secondary()) // returns new MongoCollection with this ReadPreference
    .withWriteConcern(WriteConcern.MAJORITY)        // returns another new MongoCollection
    .withReadConcern(ReadConcern.MAJORITY)
    .withTimeout(10, SECONDS);
```

This reuses the same domain value objects from A1 (ReadPreference, WriteConcern, etc.), but in a different context. Consistent `with*` naming across all four types. GridFSBucket adds `withChunkSizeBytes()`.

Concerns:

1. This could have been done using existing concepts: `MongoClient.create(oldClient, settingsOverrides)`, which would have been more explicit, or possibly by moving certain settings into operation-level entities: `collection.find(settingsOverrides)`.

## B. Parameterization

Everything used to specify *how an operation will be carried out*. Two major operational flow patterns exist (see Flow above): **chaining** and **list-like** (aggregation pipelines).

### B1.
Deferred Operations (fluent iterables / publishers)

For cursor-returning operations (find, distinct, aggregate, watch, listX), `.find()` or similar returns an iterable/publisher object. Chaining configures it. Nothing executes until a terminal method.

The fluent chain mixes two kinds of method:
- **Directive** (parameterization) — define WHAT the operation returns: `filter`, `projection`, `sort`, `limit`, `skip`, `returnKey`, `showRecordId`, `min`, `max`.
- **Configuration** — define HOW the operation executes: `maxTime`, `maxAwaitTime`, `batchSize`, `cursorType`, `noCursorTimeout`, `partial`, `collation`, `allowDiskUse`, `timeoutMode`, `comment`.

Some configuration also exists at the entity level (A2): `withTimeout` on MongoCollection supersedes `maxTime` on FindIterable when set. This is because `withTimeout` uses the new Client Side Operation Timeout (CSOT) system, which replaces the deprecated `maxTimeMS` approach used by `maxTime()`. When CSOT is enabled, legacy `maxTimeMS` settings are ignored.

Both kinds appear on the same chain with no syntactic distinction:

```java
FindIterable<Document> results = collection      // .find() initiates; returns FindIterable
    .find(Filters.eq("status", "active"))        // all chained methods below also return FindIterable
    .projection(Projections.include("name"))     // operational — what fields to return
    .sort(Sorts.descending("age"))               // operational — what order
    .limit(10)                                   // operational — how many
    .collation(Collation.builder()               // configuration — how strings are compared
        .locale("en").build())
    .maxTime(5, SECONDS);                        // configuration — execution time limit

// Terminal: execution happens here.
results.forEach(doc -> process(doc));                 // sync — iterates the cursor
Document first = results.first();                     // sync — returns first match or null
List<Document> all = results.into(new ArrayList<>()); // sync — collects all into a list
```

```java
// Reactive equivalent: .find() returns a FindPublisher (implements Publisher).
+FindPublisher results = collection + .find(Filters.eq("status", "active")) + .projection(Projections.include("name")) + .sort(Sorts.descending("age")) + .limit(10) + .maxTime(5, SECONDS); + +// Terminal: subscription triggers execution. +results.subscribe(subscriber); // reactive — pushes documents to subscriber +Mono first = Mono.from(results.first()); // reactive — wraps in Project Reactor Mono +``` + +Concerns: +1. Although the chaining syntax implies a sequence of transformations, the order of Java method calls on FindIterable does not matter — they all set fields on an underlying command. `.sort(x).limit(10)` is identical to `.limit(10).sort(x)`. The semantic execution order is fixed by the server (filter → sort → skip → limit), regardless of the order you call the Java methods. This contrasts with aggregate pipelines, where the list order of stages determines the execution order. + +**AggregateIterable** — operational (what) is the pipeline list; fluent methods are configuration (how): +```java +collection.aggregate(List.of( // operational — what stages + Aggregates.match(filter), + Aggregates.group("$category"), + Aggregates.sort(sort))) + .maxTime(5, SECONDS) // configuration — execution limit + .allowDiskUse(true); // configuration — resource usage +``` + +Concerns: +1. This uses a list-like API, when it would be possible to initiate (and configure) an Aggregate operation off of the `aggregate()` method, and specify ensuing chained operations, as with the Java SE Stream API. +2. Configuration is specified in the ensuing chain. This reverses the idiomatic pattern of using the chain for operations, and specifying configuration in some other way. 
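
For contrast, the Java SE Stream idiom referenced above can be shown with a minimal, self-contained sketch (plain `java.util.stream`, no driver code, illustrative names): reordering two chained stages changes the result, which is exactly the order-sensitivity the FindIterable chain lacks.

```java
import java.util.List;

public class StreamOrderDemo {
    // In the Stream API the chain IS the pipeline:
    // swapping sorted() and limit() changes the outcome.
    static List<Integer> sortedThenLimited(List<Integer> xs) {
        return xs.stream().sorted().limit(2).toList(); // the two smallest values
    }

    static List<Integer> limitedThenSorted(List<Integer> xs) {
        return xs.stream().limit(2).sorted().toList(); // the first two values, sorted
    }

    public static void main(String[] args) {
        List<Integer> xs = List.of(5, 1, 4, 2);
        System.out.println(sortedThenLimited(xs)); // [1, 2]
        System.out.println(limitedThenSorted(xs)); // [1, 5]
    }
}
```

An aggregate operation initiated from `aggregate()` and extended by chaining would give pipeline stages this same order-sensitive reading, while leaving configuration to be expressed some other way.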
+
+**DistinctIterable** — operational is field + `.filter()`; rest is configuration:
+```java
+collection.distinct("status", String.class)           // operational — what field, what type
+    .filter(Filters.gt("count", 10))                  // operational — what to match
+    .maxTime(5, SECONDS)                              // configuration — execution limit
+    .collation(collation);                            // configuration — string comparison
+```
+
+**Immediate writes (B2)** — operational is Bson args; configuration is Options object:
+```java
+collection.updateOne(
+    Filters.eq("_id", id),                            // operational — what to match
+    Updates.set("status", "done"),                    // operational — what to change
+    new UpdateOptions()                               // configuration — how to execute
+        .upsert(true)
+        .bypassDocumentValidation(true));
+```
+
+Borderline cases: `hint`/`hintString` (optimization directive), `let` (variable bindings for expressions), `comment` (logging metadata). The `explain()` overloads on FindIterable and AggregateIterable are alternative terminals, not configuration.
+
+Sync types: FindIterable, AggregateIterable, DistinctIterable, ChangeStreamIterable, ListCollectionsIterable, ListDatabasesIterable, ListIndexesIterable, ListSearchIndexesIterable, ListCollectionNamesIterable, GridFSFindIterable.
+Reactive mirrors: FindPublisher, AggregatePublisher, DistinctPublisher, ChangeStreamPublisher, etc. (identical fluent methods, `Publisher` return types).
+
+Concerns:
+1. [AI] Configuration methods (`collation`, `allowDiskUse`, `explain`, `timeoutMode`) are inconsistently available across the iterable family, though this is largely attributable to differences in server-side operations.
+
+### B2. Immediate Operations (static helpers + options)
+
+Write operations (insert, update, delete, findOneAndX) execute immediately and return a result. They take Bson parameters from static factories, and an optional options object:
+
+```java
+// Immediate: executes now, returns a result.
+
+UpdateResult result = collection.updateOne( + Filters.eq("_id", id), // static factory produces filter Bson + Updates.combine( // static factory combines two update Bsons into one + Updates.set("status", "archived"), + Updates.currentDate("updatedAt")), + new UpdateOptions() // options object for optional settings + .upsert(true) + .hint(indexKeys)); +``` + +Concerns: +1. Tracks the server API too closely, using static methods where chaining might have been possible. +2. The upsert boolean might be moved into a new method name: `upsertOne`. +3. [AI] The combine/merge operation has five different names across the static helper classes (see below). +4. [AI] Options handling differs across helpers: Aggregates uses Options classes; Filters uses overloads and `@Nullable` parameters; Updates uses PushOptions only for `pushEach`. + +**Static factory helper classes** (shared by both B1 and B2): Filters, Projections, Sorts, Updates, Indexes, Accumulators, Windows. Methods named after the operation: `Filters.eq()`, `Sorts.descending()`, `Updates.set()`. + +Each helper class has a "combine" method, but they all use different names for it: +```java +Filters.and(filter1, filter2) // "and" — semantic name (also: Filters.or) +Updates.combine(update1, update2) // "combine" +Sorts.orderBy(sort1, sort2) // "orderBy" +Projections.fields(proj1, proj2) // "fields" +Indexes.compoundIndex(idx1, idx2) // "compoundIndex" +// All five do the same thing: merge multiple items of the same type into one Bson. 
+``` + +**Options classes** (`new XOptions()`) — mutable self-returning setters, no `.build()`: +```java +new UpdateOptions().upsert(true).hint(indexKeys).collation(collation) +new FindOneAndUpdateOptions().returnDocument(ReturnDocument.AFTER).sort(sortBson) +new CreateCollectionOptions().capped(true).sizeInBytes(1048576) +new IndexOptions().unique(true).expireAfter(60L, SECONDS) +``` +Used by: UpdateOptions, ReplaceOptions, DeleteOptions, InsertOneOptions, InsertManyOptions, BulkWriteOptions, FindOneAndDeleteOptions, FindOneAndReplaceOptions, FindOneAndUpdateOptions, CountOptions, EstimatedDocumentCountOptions, CreateCollectionOptions, CreateIndexOptions, IndexOptions, DropIndexOptions, DropCollectionOptions, RenameCollectionOptions, GridFSUploadOptions, GridFSDownloadOptions. + +### B3. Aggregate Pipeline Operations (list-like) + +Aggregation uses a fundamentally different operational flow from find: instead of find's pseudo-chaining, it uses the **list-like** pattern (see Flow above) — a list of stages, each produced by the `Aggregates` static factory class. + +```java +// Aggregation: a list of stages, each built with Aggregates.xxx(...) +AggregateIterable aggResults = collection + .aggregate(List.of( + Aggregates.match( // stage 1: $match — uses Filters (B2 pattern) + Filters.eq("status", "active")), + Aggregates.group("$category", // stage 2: $group — field expressions + accumulators + Accumulators.sum("count", 1), + Accumulators.avg("avgPrice", "$price")), + Aggregates.sort(Sorts.descending("count")), // stage 3: $sort — uses Sorts (B2 pattern) + Aggregates.limit(10))); // stage 4: $limit — plain value + +// Terminal: execution happens here (same as B1 find-style). 
+aggResults.forEach(doc -> process(doc));
+
+// Contrast — find-style chaining for a simpler query:
+FindIterable<Document> findResults = collection
+    .find(Filters.eq("status", "active"))             // chained, not a list of stages
+    .sort(Sorts.descending("count"))
+    .limit(10);
+
+// Terminal:
+findResults.forEach(doc -> process(doc));
+```
+
+The `Aggregates` class is the largest static factory in the API. Its methods fall into several sub-areas, each handling parameters differently:
+
+**Simple stages** — thin wrappers around Bson values from other helpers:
+```java
+Aggregates.match(filter)           // delegates to Filters
+Aggregates.sort(sort)              // delegates to Sorts
+Aggregates.project(projection)     // delegates to Projections
+Aggregates.limit(n)                // plain int
+Aggregates.skip(n)                 // plain int
+Aggregates.count("fieldName")      // plain string
+Aggregates.sample(n)               // plain int
+Aggregates.unwind("$arrayField")   // plain string
+Aggregates.out("outputCollection") // plain string
+Aggregates.replaceRoot(expression) // Bson expression
+```
+
+**Stages with structured parameters** — flat positional args, no options:
+```java
+Aggregates.group(                         // groupBy + accumulators
+    "$category",                          // the group key
+    Accumulators.sum("total", "$amount"), // accumulators from Accumulators helper
+    Accumulators.avg("avg", "$price"))
+
+Aggregates.lookup(                        // simple lookup: all positional
+    "inventory",                          // from collection
+    "item",                               // local field
+    "sku",                                // foreign field
+    "inventoryDocs")                      // as field
+
+Aggregates.lookup(                        // pipeline lookup: different set of positional params
+    "inventory",                          // from collection
+    let,                                  // let variables
+    pipeline,                             // sub-pipeline
+    "inventoryDocs")                      // as field
+```
+
+**Stages with options objects** — mandatory positional args + optional Options:
+```java
+Aggregates.bucket(groupBy, boundaries)    // without options
+Aggregates.bucket(groupBy, boundaries, new BucketOptions() // with options (mutable setter)
+    .defaultBucket("other")
+    .output(Accumulators.sum("count", 1)))
+
+Aggregates.densify("timestamp", // mandatory params + DensifyRange.partitionRangeWithStep(1, MongoTimeUnit.HOUR),// DensifyRange uses static factories + DensifyOptions.densifyOptions().partitionByFields("region"))// options (named factory, immutable) + +Aggregates.geoNear(point, "distance", // mandatory params + GeoNearOptions.geoNearOptions().key("location")) // GeoNearOptions (named factory, immutable) + +Aggregates.merge("outputCollection", new MergeOptions() // mandatory + options + .uniqueIdentifier("_id") + .whenMatched(MergeOptions.WhenMatched.MERGE)) +``` + +**Stages that delegate to sealed APIs** — see B4 (Search) and B5 (MQL): +```java +Aggregates.search(SearchOperator.text(...)) // delegates to Search API (B4) +Aggregates.search(SearchOperator.text(...), searchOptions) // with SearchOptions +Aggregates.vectorSearch(path, vector, index, candidates, limit)// many positionals + VectorSearchOptions + +Aggregates.addFields(new Field<>("computed", // delegates to MQL Expressions (B5) + current().getString("name").toLower())) +``` + +Concerns: +1. [AI] Different stages use different strategies for optional parameters — some use flat positionals only (lookup), some use mutable Options classes (bucket, merge), some use immutable interface-based options with named factories (densify, geoNear), and some delegate to sealed APIs (search). Meanwhile, Filters (used in match stages) never uses options objects. Within a single pipeline, parameter representation varies by stage. + +### B4. Search API (Sealed Interface + Chained) + +The Search API (`@Beta`, 4.7+) uses a distinct parameterization style within aggregation: sealed interfaces, typed field references, and immutable fluent modification (each call returns a new instance). 
Contrast with how similar operations are done in other stages: + +```java +// A $match stage using the Filters API (B2 pattern): +// — static factory, String field name, positional params, returns Bson +Aggregates.match( + Filters.and( + Filters.text("mongodb"), // text search via Filters: just a string + Filters.eq("status", "published"))) // equality: fieldName + value + +// A $search stage using the Search API (sealed+immutable pattern): +// — static factory on sealed interface, SearchPath field ref, immutable fluent modification +Aggregates.search( + SearchOperator.compound() // static factory returns sealed interface + .must(List.of( + SearchOperator.text( // text search via SearchOperator + fieldPath("title"), // SearchPath instead of String fieldName + "mongodb") + .fuzzy() // immutable fluent (returns new instance) + .score(SearchScore.boost(1.5f)))) // score modification (also immutable fluent) + .filter(List.of( + SearchOperator.equals( // equality via SearchOperator + fieldPath("status"), // SearchPath + "published"))), + SearchOptions.searchOptions() // options via named factory (not constructor) + .index("default")) // immutable fluent chaining +``` + +Key differences from the B2 (Filters) pattern: +- **Field references**: `SearchPath.fieldPath("title")` vs. plain `String "title"`. +- **Immutability**: `.fuzzy()` returns a new instance; Filters methods return a final Bson. +- **Options**: `SearchOptions.searchOptions()` (named factory, immutable fluent) vs. `new XOptions()` (constructor, mutable). +- **Sealed interfaces**: cannot be implemented externally; `SearchOperator.of(Bson)` escape hatch for unsupported operators. + +### B5. MQL Expressions API (Type-Safe Lazy Expressions) + +The MQL Expressions API (`@Beta`, 4.9+) is an alternative to raw Bson for representing aggregation expressions. 
Contrast: + +```java +// Without MQL — raw BSON document for a $addFields expression: +Aggregates.addFields(new Field<>("discountedTotal", + new Document("$multiply", List.of( // manual $multiply operator + new Document("$sum", "$items.price"), // manual $sum + 0.9)))) + +// With MQL — type-safe fluent expression for the same computation: +Aggregates.addFields(new Field<>("discountedTotal", + current() // MqlDocument: the current pipeline document + .getArray("items") // → MqlArray + .sum(item -> item.getInteger("price")) // → MqlInteger (type guides available methods) + .multiply(of(0.9)))) // → MqlNumber +``` + +- **Sealed interface hierarchy**: `MqlValue` → `MqlBoolean`, `MqlString`, `MqlNumber` (→ `MqlInteger`), `MqlDate`, `MqlDocument`, `MqlArray`, `MqlMap`, `MqlEntry`. +- **Created via `MqlValues`**: `of(42)` → `MqlInteger`, `of("text")` → `MqlString`, `current()` → `MqlDocument`. +- **Immutable fluent chaining**: each method returns a new instance. Return types guide available methods (e.g., `MqlString.length()` → `MqlInteger`, not `MqlNumber`). +- **Lazy**: expressions build an AST, evaluated only at BSON serialization time. +- This is the most Java-idiomatic API in the driver (closest to Streams). It can be used inside any aggregation stage that accepts an expression, but it is a fundamentally different parameterization style from the Bson-based static factories. + +### B6. 
Operation Models (Bulk Write specifications) + +Two ideas coexist for specifying individual write operations within a bulk: + +```java +// Collection-level bulkWrite (older) — constructor-based WriteModel: +collection.bulkWrite(List.of( + new InsertOneModel<>(new Document("x", 1)), // constructor + new UpdateOneModel<>(eq("_id", id), set("x", 2)), // constructor with Filters/Updates + new DeleteOneModel<>(eq("_id", id)))); // constructor + +// Client-level bulkWrite (newer) — static factories on sealed interface: +client.bulkWrite(List.of( + ClientNamespacedWriteModel.insertOne( // static factory + ns1, new Document("x", 1)), + ClientNamespacedWriteModel.updateOne( // static factory + ns2, eq("_id", id), set("x", 2), + ClientUpdateOneOptions.clientUpdateOneOptions() // named factory for options (not constructor) + .upsert(true)), + ClientNamespacedWriteModel.deleteOne(ns1, eq("_id", id)))); // static factory +``` + +The evolution from constructors to static factories on sealed interfaces is visible here. Per-operation options in the newer API use named factories (`ClientUpdateOneOptions.clientUpdateOneOptions()`), unlike the older options which use constructors (`new UpdateOptions()`). + +## C. Operation Flow and Execution + +### C1. Overload Matrix on Core Interfaces + +MongoCollection, MongoDatabase, MongoCluster, and GridFSBucket expose operations as method overloads along systematic axes: + +```java +// The four overloads of insertOne (×session ×options): +collection.insertOne(document); +collection.insertOne(document, new InsertOneOptions()); +collection.insertOne(session, document); +collection.insertOne(session, document, new InsertOneOptions()); +``` + +Axes: **ClientSession** (always first if present) × **Options** (always last if present). Some operations add **result class**, **update type** (Bson vs pipeline), or **key type** (String vs Bson). This yields ~120+ overloads on MongoCollection. + +Concerns: +1. 
[AI] Search index methods (createSearchIndex, updateSearchIndex, dropSearchIndex) lack ClientSession overloads.
+2. [AI] `estimatedDocumentCount` has no ClientSession overloads (server constraint).
+3. [AI] `distinct` requires `resultClass` — no convenience overload using the collection's document type (unlike `find()`).
+4. [AI] `listCollectionNames` returns a dedicated `ListCollectionNamesIterable`; `listDatabaseNames` returns generic `MongoIterable`.
+5. [AI] GridFS uses type-based overloading for downloads (`openDownloadStream(String)` vs `openDownloadStream(ObjectId)` vs `openDownloadStream(BsonValue)`), consistent with the rest of the API. However, it has redundant `ObjectId`/`BsonValue` overloads across all id-accepting methods (download, delete, rename).
+
+### C2. Reactive Streams / Async
+
+The reactive module (`com.mongodb.reactivestreams.client`) is a near-exact structural mirror of the sync API:
+
+```java
+// Sync
+InsertOneResult result = collection.insertOne(doc);
+for (Document d : collection.find(filter)) { process(d); }
+
+// Reactive — same structure, Publisher wrapping
+Publisher<InsertOneResult> pub = collection.insertOne(doc);
+collection.find(filter).subscribe(subscriber);
+```
+
+- Same interface names (MongoClient, MongoDatabase, MongoCollection), different package.
+- Same overloads, same options objects, same model classes.
+- Specialized publishers (FindPublisher, AggregatePublisher, etc.) mirror the fluent iterables exactly.
+- Internally backed by callback-based `SingleResultCallback` → bridged via Project Reactor `Mono`/`Flux` → exposed as `Publisher`. The callback layer is not public API.
+- GridFS also has a reactive variant (GridFSBucket, GridFSUploadPublisher, GridFSDownloadPublisher).
+
+### C3. 
Results
+
+**Older result classes** — abstract class + static factory:
+```java
+InsertOneResult.acknowledged(insertedId) // or .unacknowledged()
+UpdateResult.acknowledged(matched, modified, upsertedId)
+DeleteResult.acknowledged(deletedCount)
+```
+
+**Newer result classes** (client bulk write) — sealed interfaces:
+```java
+ClientBulkWriteResult result = client.bulkWrite(writes);
+result.getInsertedCount();
+result.getInsertResults(); // Map
+```
+
+Both are immutable. The evolution from abstract classes to sealed interfaces parallels WriteModel → ClientNamespacedWriteModel.
+
+
+### AI Research Notes
+
+Overloads were counted from the sync MongoCollection interface. The reactive-streams MongoCollection mirrors this exactly. Deprecated methods (mapReduce) are excluded. All public API files in driver-sync, driver-reactive-streams, and driver-core were examined, including: MongoCluster, MongoDatabase, MongoCollection; all iterables and publishers; Aggregates, Filters, Projections, Sorts, Updates, Accumulators, Indexes, Windows; the search and bulk packages; MongoClientSettings and all connection settings; all Options classes; ReadPreference, WriteConcern, ReadConcern, MongoCredential; GridFSBucket; WriteModel hierarchy; result classes; event/listener interfaces.
+
diff --git a/design/api_redesign_densify.md b/design/api_redesign_densify.md
new file mode 100644
index 0000000000..aa5c4fe001
--- /dev/null
+++ b/design/api_redesign_densify.md
@@ -0,0 +1,196 @@
+
+This report is based on concepts discussed in api_design.md. Part 1 is AI-generated; Part 2 is human-authored.
+
+# Part 1
+
+## Densify: Redesigning an Existing Stage
+
+The `$densify` aggregation stage fills gaps in a sequence of documents. Its MQL parameters are:
+
+- **field** (required): the field to densify.
+- **range.step** + optional **range.unit** (required): the interval between inserted values. Step is always required; unit is required for dates and absent for numbers. 
+- **range.bounds** (required): determines the extent of densification. Three variants: `"full"` (global min to max), `"partition"` (per-partition min to max), or `[lower, upper]` (explicit). +- **partitionByFields** (optional): field names to group by before densifying. + +The current Java API: + +```java +// Minimal: +Aggregates.densify("hour", + DensifyRange.fullRangeWithStep(1)) + +// With partition: +Aggregates.densify("hour", + DensifyRange.partitionRangeWithStep(1), + DensifyOptions.densifyOptions().partitionByFields("city")) + +// With explicit bounds and partition: +Aggregates.densify("hour", + DensifyRange.rangeWithStep(0, 10, 1), + DensifyOptions.densifyOptions().partitionByFields("city", "sensor")) + +// Date variant: +Aggregates.densify("timestamp", + DensifyRange.fullRangeWithStep(1, MongoTimeUnit.HOUR)) +``` + +### Analysis + +**Step 1: Identify the primary parameter.** + +The field is required but is a simple string, so it remains a positional parameter on the operation. Step (+ unit) is always required, and is a coherent item: a number, or a number plus a time unit. It cannot be defaulted (the doc prohibits defaults as a source of optionals). It is the most fundamental parameter — without a step, densify has no meaning. + +**Step 2: Can we use overloads (pattern 0.0)?** + +We have: field × step (number or date) × bounds (3 variants) × partition (present or not). This is at minimum 2 × 3 × 2 = 12 overloads. The doc prohibits this explosion, so we must introduce at least one entity. + +**Step 3: Identify which parameters form entities.** + +Per the Domain values guidance, prefer primitives unless the domain value is already used extensively outside the API. Step is a number (or number + time unit) — these are primitives or existing Java types, so step does not need a custom domain value and can use overloads. Bounds has three structurally distinct variants, which cannot be represented as primitives and require some form of entity. 
Partition fields are already just a list of strings.
+
+Step is primary and always required — it anchors whatever entity we create. The remaining parameters (bounds, partition fields) have variants, so they are not suitable as positional parameters at the operation level. They are secondary to step.
+
+Bounds has three variants: full, partition-scoped, and explicit. The "full" variant is the simplest default case, and "partition" reduces to "full" when no partition fields are supplied (the whole collection is one partition). Explicit bounds `[low, high]` is structurally different. A bounds strategy is always required (one must exist, even if it defaults to "full"). The three variants are distinct enough to warrant distinct construction methods rather than an enum + nullable values.
+
+Partition fields are the only true optional. They are a list of field names, present or absent. When absent, "full" and "partition" bounds behave identically.
+
+**Step 4: Consider entity coherence.**
+
+An entity should be coherent independently of the operation it parameterizes, and per Domain values, should be preferred only when already used outside the API. Step (a number, or number + time unit) is independently coherent — it represents an interval — but is not used outside this API, so primitives or overloads are preferred over a custom type. However, once we add bounds, the "full" and especially "partition" variants couple the entity to the concept of partitions within a dataset. This means:
+
+- Step alone is coherent.
+- Step + bounds begins to encode dataset-level semantics (full vs. per-partition).
+- Partition fields are a dataset-level grouping concern.
+
+If bounds and partition are both dataset-level concerns, they arguably belong together (or at least, bounds should not be separated from partition if the "partition" variant exists). 
This means either: (a) put all three together in one entity, or (b) keep step + bounds together and partition separate, but accept that the "partition" bounds variant references a concept managed elsewhere. + +**Step 5: Survey other stages for shared concepts.** + +Partition, bounds, and step appear across multiple aggregation stages: + +| Stage | Partition | Bounds/Range | Step | +|---|---|---|---| +| `$densify` | `partitionByFields` (optional, list of strings) | `bounds`: full / partition / [low, high] | `step` + optional `unit` | +| `$setWindowFields` | `partitionBy` (optional, expression) | `Window`: documents(low, high) or range(low, high) or timeRange(low, high, unit) | implicit in bounds | +| `$fill` | `partitionBy` or `partitionByFields` (optional) | n/a | n/a | +| `$bucket` | n/a (groupBy is required) | `boundaries` (explicit list) | n/a | + +Partition is a recurring concept, but each stage represents it differently (`partitionByFields` as string list vs. `partitionBy` as expression). Bounds/range is also recurring, but with different semantics (densify: how far to fill; window: how wide the window is). Step is unique to densify. + +Per the doc's guidance ("options and builders should never be shared across operations"), these should NOT be unified into shared Partition or Bounds types, even though they appear similar. Each stage's parameterization is its own entity. + +**Step 6: Determine entity groupings.** + +Three options emerge for how to group these into entities: + +**Option A: One entity (all together)** + +Step, bounds, and partition fields are all part of one range/spec entity: + +```java +densify("hour", step(1).fullRange()) +densify("hour", step(1).partitionRange().partitionBy("city")) +densify("hour", step(1).range(0, 10)) +densify("hour", step(1, HOUR).fullRange()) +densify("hour", step(1).range(0, 10).partitionBy("city")) +``` + +Advantages: single entity, no optionals at the operation level. 
Discoverable — start from step, autocomplete shows bounds options, then partition. Keeps partition and bounds in the same entity, which is consistent with their semantic coupling (Step 4). +Concerns: the entity mixes interval specification with dataset-level grouping. The chain ordering implies a sequence but there is none. `partitionRange()` without `partitionBy(...)` is expressible but merely equivalent to `fullRange()` — not invalid, but potentially confusing. + +**Option B: Two entities (step + bounds, then partition)** + +Step and bounds form one entity. Partition fields are a separate varargs optional (pattern 2.1) at the operation level: + +```java +densify("hour", step(1).fullRange()) +densify("hour", step(1).fullRange(), DensifyOption.partitionBy("city")) +densify("hour", step(1).range(0, 10)) +densify("hour", step(1, HOUR).fullRange()) +densify("hour", step(1).range(0, 10), DensifyOption.partitionBy("city")) +``` + +Advantages: step and bounds are tightly related (bounds determine where steps are placed). Partition is a genuinely separate concern — whether to group — and the varargs pattern keeps it extensible. The "partition" bounds variant's reference to partition fields is addressed by the option being visible at the same call site. +Concerns: introduces a `DensifyOption` type for a single current optional. The "partition" bounds variant semantically depends on the partition option, creating a cross-entity dependency. + +**Option C: Three entities (all separate)** + +Step, bounds, and partition are each separate parameters: + +```java +densify("hour", step(1), fullRange()) +densify("hour", step(1), partitionRange(), DensifyOption.partitionBy("city")) +densify("hour", step(1), range(0, 10)) +densify("hour", step(1, HOUR), fullRange()) +densify("hour", step(1), range(0, 10), DensifyOption.partitionBy("city")) +``` + +Advantages: each parameter is independent and single-purpose. Easy to understand in isolation. Step is independently coherent (Step 4). 
+Concerns: more positional parameters. The relationship between step and bounds is implicit rather than structural. The operation signature is wider: `densify(String, Step, Bounds, DensifyOption...)`. Bounds defaulting to "full" when absent would reintroduce a default. + +### Decision + +**Option B** is recommended: step + bounds form one entity; partition fields use varargs at the operation level. + +```java +densify("hour", step(1).fullRange()) +densify("hour", step(1).partitionRange(), DensifyOption.partitionBy("city")) +densify("hour", step(1).range(0, 10), DensifyOption.partitionBy("city", "sensor")) +densify("hour", step(1, HOUR).fullRange()) +``` + +Rationale: + +1. **Cohesion**. Step and bounds are meaningless without each other: step is the interval, bounds determine where it applies. They belong in the same entity. Partition fields are genuinely optional and separable — they govern grouping, not the fill strategy itself. +2. **Follows stated guidance**. The document recommends "use 2.1 Varargs" for operation-level optionals. Partition fields are the only optional, and varargs handles them cleanly. +3. **Cross-entity dependency is manageable**. The "partition range" bounds variant references the concept of partitions, but it is coherent on its own: it means "fill from per-group min to per-group max", which is meaningful even before you specify which fields define the groups. If no partition fields are supplied, partition range degrades to full range — unusual, but not invalid. +4. **Extensibility**. If new options emerge (e.g., a fill strategy, a null-handling flag), they absorb naturally as additional `DensifyOption` variants. Adding methods to the step+bounds entity would be appropriate only for options tightly related to filling, which is unlikely. +5. **Consistency with the current API**. The existing driver already separates partition into `DensifyOptions`, suggesting the original designers had the same instinct. 
Option B refines this by making the step+bounds entity more fluent and discoverable (chain from step, choose bounds) while keeping the operation-level separation.
+
+Option A was considered but rejected: folding partition into the range entity is more concise, but it mixes the "how to fill" concern (step + bounds) with the "how to group" concern (partition), violating the principle that entities should represent a single coherent concept. It also moves an optional into the entity where it becomes an ignorable method, rather than using the varargs pattern the document recommends.
+
+Option C was rejected because it separates step from bounds, despite their tight interdependence. This widens the operation signature and makes the relationship between parameters implicit rather than structural.
+
+# Part 2 (Human eval)
+
+None of the above seem adequate. We have: field, step, bounds, partitionByFields.
+
+1. Step is coupled to the idea of "densify", and distinct from the concept of a range, which is common to other operations. These should be separate.
+2. The time unit applies to all provided numbers.
+3. The presence of the time unit is akin to a flag, suggesting an overload.
+4. Representing the time unit is a matter of API consistency. We have MongoTimeUnit (vs ChronoUnit), which further precludes `Duration`.
+5. Partitioning is closely coupled to the range (that is, the shape of the area that we are densifying).
+6. The specifics of the range are not shared with other operations, and might change independently of them.
+
+Before and after:
+
+```java
+// Minimal:
+Aggregates.densify("test", DensifyRange.fullRangeWithStep(1))
+// after:
+Aggregates.densify("test", 1, DensifyRange.fullRange())
+
+// With partition:
+Aggregates.densify("test",
+    DensifyRange.partitionRangeWithStep(1),
+    DensifyOptions.densifyOptions().partitionByFields("city"))
+// after:
+Aggregates.densify("test", 1,
+    DensifyRange.partitionRange().partitionByFields("city"))
+
+// With explicit bounds and partition:
+Aggregates.densify("test",
+    DensifyRange.rangeWithStep(0, 10, 1),
+    DensifyOptions.densifyOptions().partitionByFields("city", "sensor"))
+// after:
+Aggregates.densify("test", 1,
+    DensifyRange.range(0, 10).partitionByFields("city", "sensor"))
+
+// Date variant:
+Aggregates.densify("timestamp",
+    DensifyRange.fullRangeWithStep(1, MongoTimeUnit.HOUR))
+// after:
+Aggregates.densify("timestamp", 1, MongoTimeUnit.HOUR, DensifyRange.fullRange())
+```
+
+The original has the benefit of clearly labelling step, but it folds step into the concept of a "DensifyRange". Conversely, the partition fields, which seem related to the range, are placed in an option, even though they are not strictly optional but rather a list that might be empty; in any case, they can be moved to the DensifyRange.
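
To make the "after" shape concrete, here is a toy, self-contained sketch of the proposed entity. Everything in it (`DensifyRangeSketch`, `fullRange`, `partitionRange`, `range`, `partitionByFields`, `render`) is hypothetical illustration code, not the driver API: named factories for the three bounds variants, and an immutable wither-style `partitionByFields` that returns a new instance, with the result rendered as a `$densify`-like string.

```java
import java.util.List;

// Hypothetical sketch of the proposed DensifyRange shape (not driver code).
public final class DensifyRangeSketch {
    private final String bounds;                  // rendered bounds value
    private final List<String> partitionByFields;

    private DensifyRangeSketch(String bounds, List<String> partitionByFields) {
        this.bounds = bounds;
        this.partitionByFields = partitionByFields;
    }

    // Named factories for the three bounds variants.
    public static DensifyRangeSketch fullRange() {
        return new DensifyRangeSketch("\"full\"", List.of());
    }

    public static DensifyRangeSketch partitionRange() {
        return new DensifyRangeSketch("\"partition\"", List.of());
    }

    public static DensifyRangeSketch range(int lower, int upper) {
        return new DensifyRangeSketch("[" + lower + ", " + upper + "]", List.of());
    }

    // Immutable wither: returns a new instance, as in the Search API style.
    public DensifyRangeSketch partitionByFields(String... fields) {
        return new DensifyRangeSketch(bounds, List.of(fields));
    }

    // Step is positional on the operation (Part 2, point 1), so it is
    // supplied here by the caller rather than stored in the range.
    public String render(String field, int step) {
        String out = "{$densify: {field: \"" + field + "\", range: {step: " + step
                + ", bounds: " + bounds + "}";
        if (!partitionByFields.isEmpty()) {
            out += ", partitionByFields: " + partitionByFields;
        }
        return out + "}}";
    }
}
```

Under this shape, a call like `densify("test", 1, DensifyRangeSketch.range(0, 10).partitionByFields("city"))` keeps step positional, bounds as the entity, and partition fields on the range, where Part 2 argues they belong.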