Skip to content

SOLR-17550: preferred Overseer might not work#4175

Open
r4mercur wants to merge 5 commits intoapache:mainfrom
r4mercur:SOLR-17550
Open

SOLR-17550: preferred Overseer might not work#4175
r4mercur wants to merge 5 commits intoapache:mainfrom
r4mercur:SOLR-17550

Conversation

@r4mercur
Copy link

@r4mercur r4mercur commented Mar 1, 2026

add testcase to identify wrong method behaviour of setPreferredOverseer() in ZkController

https://issues.apache.org/jira/browse/SOLR-17550

Description

Like described in the ticket from the reporter: "By code inspection, ZkController.setPreferredOverseer creates a command message improperly. The key "node" should exist with the node name as value but instead the key is itself the node name – a bug introduced by a refactoring in 2022.
Ideally we understand the ramifications; like shouldn't this be found by a test?"

I only added the test case currently, not sure if the bug should also be fixed by using "node" as static string as key ?

Solution

My approach is a new unittest-case to identify the wrong behaviour of the setter method in the ZkController. I used the LLM "Claude Sonet 4.5" for problem description and solution proposals.

Tests

The new unittest-case written should identify problems, when using "getNodeName()" to set the key of the properties.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide
  • I have added a changelog entry for my change

add testcase to identify wrong method behaviour of setPreferredOverseer() in ZkController
@r4mercur r4mercur marked this pull request as draft March 1, 2026 18:19
r4mercur added 3 commits March 1, 2026 19:21
fix import after ./gradlew check & ./gradlew tidy
fix import after ./gradlew check & ./gradlew tidy & change test
@r4mercur r4mercur marked this pull request as ready for review March 1, 2026 19:37
@epugh
Copy link
Contributor

epugh commented Mar 3, 2026

I triggered the tests. I wish I knew more about the Overseer to review this pr..

Copy link
Author

@r4mercur r4mercur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How i understand the bug and the description of the ticket given from the creator, i would assume the behaviour currently is like when setPreferredOverseer() is called in the ZkController the values are set wrong. (happens when "ADDROLE" message is built)

For example with the buggy behaviour the values which are set like this:
{
"operation": "addrole",
"localhost:8983_solr": "localhost:8983_solr",
"role": "overseer",
"persist": "false"
}
but the key and values which are expected should look like this with the right key for the "node" part:
{
"operation": "addrole",
"node": "localhost:8983_solr",
"role": "overseer",
"persist": "false"
}

And since i'am also new in this topic the Overseer class is the main class which manages cluster state changes like collection creation, shards, roles etc. . This seems to be done with commands from Zookeeper-backed task queues. The class it self acts as the cluster coordinator as i understood it, but i could be wrong here.

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this.

The new test is a unit test. It's something.
I'm not sure what tests exists for the feature. I don't put much faith in unit tests to tell me much; I put faith in integration tests (aka behavior / functional tests).

@r4mercur
Copy link
Author

r4mercur commented Mar 8, 2026

I added a new teststep with some behaviour test in the OverseerRolesTest class. Would you consider this combination of the two testcases now (unittestcase + integration test) sufficient for this feature to be tested right or would you prefer a stricter integration assertion arround the overseer election behaviour ?

Then i would delete the unittest part and would try to rewrite the integration test with more stricter "feature" based assertion.

@dsmiley
Copy link
Contributor

dsmiley commented Mar 9, 2026

Before going further on tests... do you understand the bug enough to know what it's impact was? We'd use that to fill out the changelog for one thing but also I think it's kind of important to just know anyways. I see OverseerRolesTest; perhaps the feature wasn't totally broken; maybe it didn't work right in certain circumstances?

@r4mercur
Copy link
Author

r4mercur commented Mar 9, 2026

You are right and i tried now to understand the real impact of this bug.

With looking at the code:
The method seems to be deprecated anyway -> usage should be done via node roles like the log says at line 2616 ZkController.class.

Users probably are using this rarely (or never) and it seems like even when using it the buggy behaviour would only lead to silent failure of the preferred overseer role assignment.

So the question would be:
Wouldn't it make more sense to remove this method and the usage of it completely ? (I didn't check, if its used anywhere internally)

Then it would be wise to close this PR and create a new ticket for removing the existance of this method?

@dsmiley
Copy link
Contributor

dsmiley commented Mar 12, 2026

My reading of the code (aided by a few runs of the debugger in NodeRolesTest and OverseerRolesTest) is that if a node is started with system properties reflecting that it's the preferred Overseer (this is the new/recommended way), then, it will call org.apache.solr.cloud.ZkController#setPreferredOverseer to actively take over. But persists=false... on the receiving end, OverseerRoleCmd ignores whatever node is set to and triggers the overseer prioritization thread. I haven't read how that in turn knows who should be the overseer. Meaning, I'm not yet sure if this is an oversight/gap, or the caller should not pass "node" in the first place, as it's misleading (when persist=true).

@dsmiley
Copy link
Contributor

dsmiley commented Mar 12, 2026

BTW, sorry if my choice of newdev label on JIRA triggered your interest. This isn't what it seemed, in the end.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants