
Conversation

@sylwiaszunejko
Collaborator

@sylwiaszunejko sylwiaszunejko commented Dec 18, 2025

This PR fixes inefficiencies in the host initialization mechanism when bootstrapping a cluster.

Previously, the driver created Host instances with random host IDs and opened connections to them based on the contact points provided in the cluster configuration. After establishing the control connection and reading system.peers, these initial Host instances were discarded and replaced with new ones built from the correct host metadata. This approach resulted in unnecessary creation and teardown of multiple connections.

Changes

  • The control connection is now initialized only using the endpoints specified in the cluster configuration.
  • After a successful control connection is established, the driver reads from system.local and system.peers.
  • Based on this metadata, Host instances are created with the correct host_id values.
  • Connections are then initialized directly on these properly constructed Host instances (see the sketch below).
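
Roughly, the resulting bootstrap flow is sketched below. This is only an illustration of the steps listed above, not the driver's actual code: connection_factory, query, Host(...) and open_pool are placeholder names (endpoints_resolved is the attribute the driver already uses for resolved contact points).

def bootstrap(cluster):
    # 1. Open the control connection using only the configured contact points.
    control_conn = None
    for endpoint in cluster.endpoints_resolved:
        try:
            control_conn = cluster.connection_factory(endpoint)  # placeholder helper
            break
        except Exception:
            continue
    if control_conn is None:
        raise Exception("Unable to connect to any contact point")

    # 2. Read topology metadata over the control connection.
    local_row = control_conn.query("SELECT * FROM system.local")
    peer_rows = control_conn.query("SELECT * FROM system.peers")

    # 3. Build Host instances with the correct host_id from the start,
    #    instead of placeholder hosts created with random IDs.
    hosts = [Host(endpoint=row["rpc_address"], host_id=row["host_id"])
             for row in [local_row] + list(peer_rows)]

    # 4. Only now open connection pools, directly on the correctly built hosts.
    for host in hosts:
        cluster.open_pool(host)  # placeholder helper
    return hosts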

@sylwiaszunejko
Collaborator Author

Some tests are still failing, but I wanted to ask if the direction is good @dkropachev

@sylwiaszunejko
Collaborator Author

@Lorak-mmk maybe you know why this test assumes that new_host should be different?

def test_get_control_connection_host(self):
        """
        Test to validate Cluster.get_control_connection_host() metadata

        @since 3.5.0
        @jira_ticket PYTHON-583
        @expected_result the control connection metadata should accurately reflect cluster state.

        @test_category metadata
        """

        host = self.cluster.get_control_connection_host()
        assert host == None

        self.session = self.cluster.connect()
        cc_host = self.cluster.control_connection._connection.host

        host = self.cluster.get_control_connection_host()
        assert host.address == cc_host
        assert host.is_up == True

        # reconnect and make sure that the new host is reflected correctly
        self.cluster.control_connection._reconnect()
        new_host = self.cluster.get_control_connection_host()
        assert host != new_host

@Lorak-mmk

Lorak-mmk commented Dec 18, 2025

I have no idea.
In the Rust driver we have logic where, if the CC breaks, we try to connect to all other hosts (because the one it was connected to is presumed non-working for now).
I see no such logic in the Python driver. This part was added in commit 2796ee5:
[screenshot: the code added in commit 2796ee5]

Was this test passing until now and non-flaky? If so, then perhaps there is such logic somewhere.

@Lorak-mmk

Now that I think of it: I see that the driver uses the LBP to decide the order of hosts to connect to. See _connect_host_in_lbp and _reconnect_internal.
The default LBP is round-robin, so on reconnect it will start from a different host than at the beginning, right? That would explain why each CC reconnect should land on a different host in a healthy cluster.

@sylwiaszunejko
Collaborator Author

> Now that I think of it: I see that the driver uses the LBP to decide the order of hosts to connect to. See _connect_host_in_lbp and _reconnect_internal. The default LBP is round-robin, so on reconnect it will start from a different host than at the beginning, right? That would explain why each CC reconnect should land on a different host in a healthy cluster.

Makes sense, second question: in this test:

def test_profile_lb_swap(self):
        """
        Tests that profile load balancing policies are not shared

        Creates two LBPs, runs a few queries, and validates that each LBP is exercised
        separately between EPs

        @since 3.5
        @jira_ticket PYTHON-569
        @expected_result LBP should not be shared.

        @test_category config_profiles
        """
        query = "select release_version from system.local where key='local'"
        rr1 = ExecutionProfile(load_balancing_policy=RoundRobinPolicy())
        rr2 = ExecutionProfile(load_balancing_policy=RoundRobinPolicy())
        exec_profiles = {'rr1': rr1, 'rr2': rr2}
        with TestCluster(execution_profiles=exec_profiles) as cluster:
            session = cluster.connect(wait_for_all_pools=True)

            # default is DCA RR for all hosts
            expected_hosts = set(cluster.metadata.all_hosts())
            rr1_queried_hosts = set()
            rr2_queried_hosts = set()

            rs = session.execute(query, execution_profile='rr1')
            rr1_queried_hosts.add(rs.response_future._current_host)
            rs = session.execute(query, execution_profile='rr2')
            rr2_queried_hosts.add(rs.response_future._current_host)

            assert rr2_queried_hosts == rr1_queried_hosts

in this test it is assumed that both queries should use the same host, since they use different instances of RoundRobinPolicy that both start from the same host? But how can this be true if the starting position is randomized here: https://github.com/scylladb/python-driver/blob/master/cassandra/policies.py#L182

@Lorak-mmk

No idea. Perhaps populate is not called for those policies for some reason, and they are populated using on_up/down etc?
Try to print a log / stacktrace in populate and run this test.

Comment on lines +267 to +269
if not self.local_dc:
    self.local_dc = dc
    return HostDistance.LOCAL
Collaborator


Should not be in this PR

Collaborator


@sylwiaszunejko, what is the reason for having it here?

@sylwiaszunejko
Collaborator Author

> in this test it is assumed that both queries should use the same host [...] But how can this be true if the starting position is randomized here: https://github.com/scylladb/python-driver/blob/master/cassandra/policies.py#L182

This test was working because populate was called before the CC was created, so we only knew about the contact points provided in the cluster config (so only one host). I believe the current approach (calling populate on the LBP after creating the CC, so that we can update the LBP with all known hosts) is much better, so we should remove this test. @Lorak-mmk WDYT?

@Lorak-mmk

In the previous approach (calling populate with one host), were the on_add calls correct (so one call for each host besides the CC host)?
If so, then both versions are correct. I think we could then switch to the proposed version.

@Lorak-mmk

You could then adjust the test, not remove it.

@sylwiaszunejko
Collaborator Author

sylwiaszunejko commented Dec 19, 2025

> In the previous approach (calling populate with one host), were the on_add calls correct (so one call for each host besides the CC host)? If so, then both versions are correct. I think we could then switch to the proposed version.

on_add is called properly, but if there is only one host during populate, the starting position for RoundRobinPolicy is always the same even if some hosts are added later:

if len(hosts) > 1:
    self._position = randint(0, len(hosts) - 1)
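
For context, here is a small self-contained toy that mimics the relevant rotation logic (MiniRoundRobin and its methods are illustrative, not the driver's RoundRobinPolicy itself):

from itertools import cycle, islice
from random import randint

class MiniRoundRobin:
    """Toy reimplementation of the relevant round-robin logic."""
    def __init__(self):
        self._hosts = []
        self._position = 0

    def populate(self, hosts):
        self._hosts = list(hosts)
        # The starting offset is randomized only when more than one host is known.
        if len(self._hosts) > 1:
            self._position = randint(0, len(self._hosts) - 1)

    def on_add(self, host):
        self._hosts.append(host)

    def make_query_plan(self):
        pos = self._position % len(self._hosts)
        self._position += 1
        return list(islice(cycle(self._hosts), pos, pos + len(self._hosts)))

# Populated with a single contact point, _position stays 0, so the first
# query plan always starts at that contact point even after more hosts are added:
lbp = MiniRoundRobin()
lbp.populate(["127.0.0.1"])
lbp.on_add("127.0.0.2")
lbp.on_add("127.0.0.3")
print(lbp.make_query_plan())   # always starts at 127.0.0.1

# Populated with all hosts, the starting host is randomized:
lbp2 = MiniRoundRobin()
lbp2.populate(["127.0.0.1", "127.0.0.2", "127.0.0.3"])
print(lbp2.make_query_plan())  # may start at any of the three hosts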

@sylwiaszunejko sylwiaszunejko force-pushed the remove_random_ids branch 2 times, most recently from adddec1 to 3e864fc on December 20, 2025 at 12:58
Let control connection use resolved contact points from
cluster config if lbp is not yet initialized.
@sylwiaszunejko sylwiaszunejko self-assigned this Dec 22, 2025
@sylwiaszunejko sylwiaszunejko marked this pull request as ready for review December 22, 2025 13:21
Comment on lines 3524 to 3562
    def _connect_host(self):
        errors = {}

        lbp = (
            self._cluster.load_balancing_policy
            if self._cluster._config_mode == _ConfigMode.LEGACY else
            self._cluster._default_load_balancing_policy
        )

        # use endpoints from the default LBP if it is already initialized
        for host in lbp.make_query_plan():
            try:
-               return (self._try_connect(host), None)
+               return (self._try_connect(host.endpoint), None)
            except ConnectionException as exc:
                errors[str(host.endpoint)] = exc
                log.warning("[control connection] Error connecting to %s:", host, exc_info=True)
                self._cluster.signal_connection_failure(host, exc, is_host_addition=False)
            except Exception as exc:
                errors[str(host.endpoint)] = exc
                log.warning("[control connection] Error connecting to %s:", host, exc_info=True)
            if self._is_shutdown:
                raise DriverException("[control connection] Reconnection in progress during shutdown")

        # if lbp not initialized use contact points provided to the cluster
        if len(errors) == 0:
            for endpoint in self._cluster.endpoints_resolved:
                try:
                    return (self._try_connect(endpoint), None)
                except ConnectionException as exc:
                    errors[str(endpoint)] = exc
                    log.warning("[control connection] Error connecting to %s:", endpoint, exc_info=True)
                    self._cluster.signal_connection_failure(endpoint, exc, is_host_addition=False)
                except Exception as exc:
                    errors[str(endpoint)] = exc
                    log.warning("[control connection] Error connecting to %s:", endpoint, exc_info=True)
                if self._is_shutdown:
                    raise DriverException("[control connection] Reconnection in progress during shutdown")

        return (None, errors)
Collaborator


Let's make it simple, solving #622 on the way:

    def _connect_host(self):
        errors = {}

        lbp = self._cluster.load_balancing_policy \
            if self._cluster._config_mode == _ConfigMode.LEGACY else self._cluster._default_load_balancing_policy

        # use endpoints from the default LBP if it is already initialized
        for endpoint in itertools.chain((host.endpoint for host in lbp.make_query_plan()), self._cluster.endpoints_resolved):
            try:
                return (self._try_connect(endpoint), None)
            except Exception as exc:
                errors[str(endpoint)] = exc
                log.warning("[control connection] Error connecting to %s:", endpoint, exc_info=True)
            if self._is_shutdown:
                raise DriverException("[control connection] Reconnection in progress during shutdown")

        return (None, errors)
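
For reference, the chained iteration simply appends the resolved contact points after whatever the LBP yields, so the contact points still act as a fallback when the plan is empty (for example before the LBP is populated). A tiny standalone illustration with made-up addresses:

import itertools

def candidate_endpoints(lbp_plan, contact_points):
    # Same shape as the suggested loop: LBP-provided endpoints first,
    # configured contact points afterwards.
    return itertools.chain(lbp_plan, contact_points)

# Before the LBP is populated its query plan is empty, so only the
# configured contact points are tried:
print(list(candidate_endpoints([], ["10.0.0.1", "10.0.0.2"])))
# -> ['10.0.0.1', '10.0.0.2']

# Once the LBP knows the cluster, its plan comes first and the contact
# points are still appended at the end:
print(list(candidate_endpoints(["10.0.0.3", "10.0.0.1"], ["10.0.0.1", "10.0.0.2"])))
# -> ['10.0.0.3', '10.0.0.1', '10.0.0.1', '10.0.0.2']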

Collaborator


Can you please also find a better name for it?
