Proto commits in yugabyte/yugabyte-db

These are the commits in which Protocol Buffers files changed (only the last 100 relevant commits are shown):

Commit:4786ff9
Author:Atharva Kurhade
Committer:Atharva Kurhade

[#23198] DocDB, ASH: Instrument remote and local bootstrap with wait events

Summary: Added the following wait events for ASH instrumentation of remote and local bootstraps:
1. RemoteBootstrap_StartRemoteSession: the remote bootstrap client is waiting for a remote session to be started on the remote bootstrap server.
2. RemoteBootstrap_FetchData: the remote bootstrap client is waiting for data to be fetched from the remote bootstrap server.
3. RemoteBootstrap_ReadDataFromFile: the remote bootstrap server is reading data from a file.
4. RemoteBootstrap_RateLimiter: the remote bootstrap client is slowing down because the rate limiter is throttling network access to the remote bootstrap server.

We also add a new query id for RemoteBootstrap wait events (QueryIdForRemoteBootstrap).

Wait events captured on the remote bootstrap client:

 query_id | wait_event | wait_event_aux | wait_event_type | wait_event_code | wait_event_class
----------+------------------------------------+-----------------+-----------------+-----------------+------------------
 8 | RemoteBootstrap_FetchData | 2619de885f0b4ec | Network | 83886091 | TabletWait
 8 | OnCpu_Active | 2619de885f0b4ec | Cpu | 117440512 | Common
 8 | RemoteBootstrap_StartRemoteSession | 4243c0c668ba44f | Network | 83886092 | TabletWait
 8 | RemoteBootstrap_StartRemoteSession | 4243c0c668ba44f | Network | 83886092 | TabletWait
 8 | RemoteBootstrap_FetchData | 4243c0c668ba44f | Network | 83886091 | TabletWait
 8 | RemoteBootstrap_FetchData | 4243c0c668ba44f | Network | 83886091 | TabletWait
 8 | RemoteBootstrap_FetchData | 4ab2f826b3aa421 | Network | 83886091 | TabletWait
 8 | RemoteBootstrap_FetchData | 4ab2f826b3aa421 | Network | 83886091 | TabletWait

Wait events captured on the remote bootstrap server:

 query_id | wait_event | wait_event_aux | wait_event_type | wait_event_code | wait_event_class
----------+------------------------------------+-----------------+-----------------+-----------------+------------------
 8 | OnCpu_Active | 4243c0c668ba44f | Cpu | 117440512 | Common
 8 | OnCpu_Active | 4243c0c668ba44f | Cpu | 117440512 | Common
 8 | OnCpu_Active | 4243c0c668ba44f | Cpu | 117440512 | Common
 8 | OnCpu_Active | 4243c0c668ba44f | Cpu | 117440512 | Common
 8 | OnCpu_Active | 4243c0c668ba44f | Cpu | 117440512 | Common
 8 | RemoteBootstrap_ReadDataFromFile | 4243c0c668ba44f | DiskIO | 83886093 | TabletWait
 8 | RemoteBootstrap_ReadDataFromFile | 4243c0c668ba44f | DiskIO | 83886093 | TabletWait
 8 | RemoteBootstrap_ReadDataFromFile | 4243c0c668ba44f | DiskIO | 83886093 | TabletWait
 8 | RemoteBootstrap_ReadDataFromFile | 4243c0c668ba44f | DiskIO | 83886093 | TabletWait
 8 | RemoteBootstrap_ReadDataFromFile | 4243c0c668ba44f | DiskIO | 83886093 | TabletWait
 8 | RemoteBootstrap_ReadDataFromFile | 4243c0c668ba44f | DiskIO | 83886093 | TabletWait

**Upgrade/Downgrade safety:** The added field is optional; if it is not present, ASH will simply be missing some metadata, which is fine during upgrades/rollbacks.

Jira: DB-12141

Test Plan: Manual test: create a cluster, add new nodes, and check that the wait states are present in yb_active_session_history_view.
C++ tests:
./yb_build.sh --cxx-test wait_states-itest --gtest_filter WaitStateITest/AshTestVerifyOccurrence.VerifyWaitStateEntered/kRemoteBootstrap_StartRemoteSession
./yb_build.sh --cxx-test wait_states-itest --gtest_filter WaitStateITest/AshTestVerifyOccurrence.VerifyWaitStateEntered/kRemoteBootstrap_FetchData
./yb_build.sh --cxx-test wait_states-itest --gtest_filter WaitStateITest/AshTestVerifyOccurrence.VerifyWaitStateEntered/kRemoteBootstrap_RateLimiter
./yb_build.sh --cxx-test wait_states-itest --gtest_filter WaitStateITest/AshTestVerifyOccurrence.VerifyWaitStateEntered/kRemoteBootstrap_ReadDataFromFile

Reviewers: amitanand, asaha, hbhanawat Reviewed By: amitanand, asaha Subscribers: ybase Differential Revision: https://phorge.dev.yugabyte.com/D42984
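As a quick way to see the new wait events on a running cluster, a query along the following lines can be used; this is a sketch, assuming the yb_active_session_history view that other commits in this list query (the commit above refers to it as yb_active_session_history_view).

```
-- Count the RemoteBootstrap wait events captured by ASH, grouped by event and class.
SELECT wait_event, wait_event_class, query_id, count(*)
FROM yb_active_session_history
WHERE wait_event LIKE 'RemoteBootstrap%'
GROUP BY wait_event, wait_event_class, query_id
ORDER BY count(*) DESC;
```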

The documentation is generated from this commit.

Commit:d4cc2b7
Author:Naorem Khogendro Singh
Committer:Naorem Khogendro Singh

[PLAT-17217] Implement server gflags method in node agent Summary: Implement gflags update in node agent. Test Plan: Manually tested after enabling the feature flag. Will be doing more test runs before the feature flag is enabled. We want to have all the changes in before we can test the full workflow. Also added the streaming of info output from remote node. ``` 2025-05-05T01:04:14.460Z [info] db680383-fea4-423d-bb3b-f4e815804065 NodeAgentClient.java:535 [NodeAgentGrpcPool-4] com.yugabyte.yw.common.NodeAgentClient Download command curl -L -o /tmp/yugabyte-2024.1.0.0-b47-centos-x86_64.tar.gz https://s3.us-west-2.amazonaws.com/uploads.dev.yugabyte.com/nkhogen/yugabyte-2024.1.0.0-b47-centos-x86_64.tar.gz Dowloading software Running install software phase: download-software 2025-05-05T01:04:17.013Z [info] db680383-fea4-423d-bb3b-f4e815804065 NodeAgentClient.java:535 [NodeAgentGrpcPool-6] com.yugabyte.yw.common.NodeAgentClient Running install software phase: make-yb-software-dir 2025-05-05T01:04:17.214Z [info] db680383-fea4-423d-bb3b-f4e815804065 NodeAgentClient.java:535 [NodeAgentGrpcPool-7] com.yugabyte.yw.common.NodeAgentClient Running install software phase: untar-software 2025-05-05T01:04:17.461Z [debug] e0503aa4-bba6-4a04-a1e2-ac47a5a357d1 AuthorizationHandler.java:99 [application-pekko.actor.default-dispatcher-25194] com.yugabyte.yw.rbac.handlers.AuthorizationHandler User not present in the system 2025-05-05T01:04:18.532Z [info] db680383-fea4-423d-bb3b-f4e815804065 NodeAgentClient.java:535 [NodeAgentGrpcPool-5] com.yugabyte.yw.common.NodeAgentClient Running install software phase: make-yb-software-dir Running install software phase: untar-software 2025-05-05T01:04:25.235Z [info] db680383-fea4-423d-bb3b-f4e815804065 NodeAgentClient.java:535 [NodeAgentGrpcPool-9] com.yugabyte.yw.common.NodeAgentClient Running install software phase: make-release-dir Running install software phase: copy-package-to-release Running install software phase: remove-temp-package Running install software phase: post-install 2025-05-05T01:04:25.435Z [info] db680383-fea4-423d-bb3b-f4e815804065 NodeAgentClient.java:535 [NodeAgentGrpcPool-8] com.yugabyte.yw.common.NodeAgentClient Running install software phase: remove-older-release Running install software phase: symlink-/home/yugabyte/yb-software/yugabyte-2024.1.0.0-b47-centos-x86_64/auto_flags.json-to-/home/yugabyte/master/auto_flags.json Running install software phase: symlink-/home/yugabyte/yb-software/yugabyte-2024.1.0.0-b47-centos-x86_64/bin-to-/home/yugabyte/master/bin Running install software phase: symlink-/home/yugabyte/yb-software/yugabyte-2024.1.0.0-b47-centos-x86_64/lib-to-/home/yugabyte/master/lib Running install software phase: symlink-/home/yugabyte/yb-software/yugabyte-2024.1.0.0-b47-centos-x86_64/master_flags.xml-to-/home/yugabyte/master/master_flags.xml Running install software phase: symlink-/home/yugabyte/yb-software/yugabyte-2024.1.0.0-b47-centos-x86_64/openssl-config-to-/home/yugabyte/master/openssl-config Running install software phase: symlink-/home/yugabyte/yb-software/yugabyte-2024.1.0.0-b47-centos-x86_64/postgres-to-/home/yugabyte/master/postgres Running install software phase: symlink-/home/yugabyte/yb-software/yugabyte-2024.1.0.0-b47-centos-x86_64/pylib-to-/home/yugabyte/master/pylib Running install software phase: symlink-/home/yugabyte/yb-software/yugabyte-2024.1.0.0-b47-centos-x86_64/share-to-/home/yugabyte/master/share ``` Reviewers: svarshney, yshchetinin, nbhatia Reviewed By: svarshney Subscribers: yugaware Differential Revision: 
https://phorge.dev.yugabyte.com/D43687

Commit:4cdd02d
Author:Minghui Yang
Committer:Minghui Yang

[BACKPORT 2025.1][#27028] YSQL: Let DDL statement update shared catalog version

Summary: Commit 43032537707c82693281c33574e21d1b35f2da2d made a change so that when doing an incremental catalog cache refresh on detecting a newer shared catalog version, an additional RPC to master is made to retrieve the latest catalog version. This additional RPC to master allows handling the case where a parent ysqlsh forks out a child ysqlsh that does a number of DDLs and then exits to return control back to the parent ysqlsh. The parent ysqlsh's PG backend detects a newer shared catalog version caused by the catalog version increments made by the DDLs executed in the child ysqlsh. Without retrieving the latest master catalog version, it is possible that, due to heartbeat delay, the parent ysqlsh's PG backend operates on the newer shared catalog version, which is already stale relative to the latest master catalog version. So its next DML fails with the error:
```
The catalog snapshot used for this transaction has been invalidated:
```
Although this additional master RPC call avoids the error, it hurts performance. In particular, in the TPCC benchmark with auto-analyze enabled, the `Connection Acq Latency` increased more than 300% compared with the baseline, which has auto-analyze disabled. After debugging I found that this additional master RPC call is the culprit. Instead of making an additional RPC call to master, if we let the DDL statement update the shared catalog version, then after the child ysqlsh exits the parent ysqlsh's PG backend will see the latest catalog version in shared memory and there is no need to send an RPC to master for that. This diff adds `YbCheckNewSharedCatalogVersionOptimization`, which is called after a DDL increments the catalog version successfully. It makes a local RPC call to the local tserver to set up the new catalog version together with the invalidation messages. All the backends on the same node will be able to see the latest catalog version earlier, before the next heartbeat response from master. A few unit tests are updated because they used to make two connections to the same node and rely on heartbeat delay to pass. I changed the tests to make two connections to different nodes so that heartbeat delay continues to work as expected by these tests. TestPgDdlConcurrency.java is also updated because the gflag name was wrong.

**Upgrade/Rollback safety:** The src/yb/tserver/pg_client.proto change is only used in PG -> tserver communication, which is upgrade safe.

Jira: DB-16501

Original commit: a260932fe7aba6653b17812c7a8f8b6b6b2e4633 / D43651

Test Plan: (1) ./yb_build.sh --cxx-test pg_catalog_version-test --gtest_filter PgCatalogVersionTest.WaitForSharedCatalogVersionToCatchup (2) Running the tpcc benchmark with 10 warehouses on my local dev vm RF-3 cluster:
```
incremental refresh/auto-analyze off (without diff)
Transaction | Count | Avg. Latency | P99 Latency | Connection Acq Latency
NewOrder | 3809 | 27.52 | 45.94 | 0.47

incremental refresh/auto-analyze on (without diff)
Transaction | Count | Avg. Latency | P99 Latency | Connection Acq Latency
NewOrder | 3836 | 30.37 | 57.84 | 1.18

incremental refresh/auto-analyze off (with diff)
Transaction | Count | Avg. Latency | P99 Latency | Connection Acq Latency
NewOrder | 3890 | 27.99 | 48.79 | 0.46

incremental refresh/auto-analyze on (with diff)
Transaction | Count | Avg. Latency | P99 Latency | Connection Acq Latency
NewOrder | 3900 | 29.91 | 51.47 | 0.56
```
Consider `Connection Acq Latency`: without the diff, we see 1.18/0.47 = 251%; with the diff, we see 0.56/0.46 = 122%.

Reviewers: kfranz, sanketh, mihnea Reviewed By: kfranz Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D43798
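To make the scenario above concrete, a minimal illustration of the parent/child ysqlsh interaction is sketched below; the table name is hypothetical, and whether the error actually fires depends on timing (heartbeat delay between the child's DDL and the parent's next DML).

```
-- Parent ysqlsh session: establish the connection and touch the table.
CREATE TABLE demo_t (k int PRIMARY KEY);
SELECT count(*) FROM demo_t;

-- Child ysqlsh session (forked from the parent): run a DDL and exit.
ALTER TABLE demo_t ADD COLUMN v int;
\q

-- Back in the parent session: before this change, the next statement could fail
-- with "The catalog snapshot used for this transaction has been invalidated"
-- when the backend saw the newer shared catalog version before the master
-- catalog version arrived via heartbeat.
INSERT INTO demo_t VALUES (1);
```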

Commit:a260932
Author:Minghui Yang
Committer:Minghui Yang

[#27028] YSQL: Let DDL statement update shared catalog version

Summary: Commit 43032537707c82693281c33574e21d1b35f2da2d made a change so that when doing an incremental catalog cache refresh on detecting a newer shared catalog version, an additional RPC to master is made to retrieve the latest catalog version. This additional RPC to master allows handling the case where a parent ysqlsh forks out a child ysqlsh that does a number of DDLs and then exits to return control back to the parent ysqlsh. The parent ysqlsh's PG backend detects a newer shared catalog version caused by the catalog version increments made by the DDLs executed in the child ysqlsh. Without retrieving the latest master catalog version, it is possible that, due to heartbeat delay, the parent ysqlsh's PG backend operates on the newer shared catalog version, which is already stale relative to the latest master catalog version. So its next DML fails with the error:
```
The catalog snapshot used for this transaction has been invalidated:
```
Although this additional master RPC call avoids the error, it hurts performance. In particular, in the TPCC benchmark with auto-analyze enabled, the `Connection Acq Latency` increased more than 300% compared with the baseline, which has auto-analyze disabled. After debugging I found that this additional master RPC call is the culprit. Instead of making an additional RPC call to master, if we let the DDL statement update the shared catalog version, then after the child ysqlsh exits the parent ysqlsh's PG backend will see the latest catalog version in shared memory and there is no need to send an RPC to master for that. This diff adds `YbCheckNewSharedCatalogVersionOptimization`, which is called after a DDL increments the catalog version successfully. It makes a local RPC call to the local tserver to set up the new catalog version together with the invalidation messages. All the backends on the same node will be able to see the latest catalog version earlier, before the next heartbeat response from master. A few unit tests are updated because they used to make two connections to the same node and rely on heartbeat delay to pass. I changed the tests to make two connections to different nodes so that heartbeat delay continues to work as expected by these tests. TestPgDdlConcurrency.java is also updated because the gflag name was wrong.

**Upgrade/Rollback safety:** The src/yb/tserver/pg_client.proto change is only used in PG -> tserver communication, which is upgrade safe.

Jira: DB-16501

Test Plan: (1) ./yb_build.sh --cxx-test pg_catalog_version-test --gtest_filter PgCatalogVersionTest.WaitForSharedCatalogVersionToCatchup (2) Running the tpcc benchmark with 10 warehouses on my local dev vm RF-3 cluster:
```
incremental refresh/auto-analyze off (without diff)
Transaction | Count | Avg. Latency | P99 Latency | Connection Acq Latency
NewOrder | 3809 | 27.52 | 45.94 | 0.47

incremental refresh/auto-analyze on (without diff)
Transaction | Count | Avg. Latency | P99 Latency | Connection Acq Latency
NewOrder | 3836 | 30.37 | 57.84 | 1.18

incremental refresh/auto-analyze off (with diff)
Transaction | Count | Avg. Latency | P99 Latency | Connection Acq Latency
NewOrder | 3890 | 27.99 | 48.79 | 0.46

incremental refresh/auto-analyze on (with diff)
Transaction | Count | Avg. Latency | P99 Latency | Connection Acq Latency
NewOrder | 3900 | 29.91 | 51.47 | 0.56
```
Consider `Connection Acq Latency`: without the diff, we see 1.18/0.47 = 251%; with the diff, we see 0.56/0.46 = 122%.

Reviewers: kfranz, sanketh, mihnea Reviewed By: kfranz Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D43651

Commit:fd744ca
Author:Amitanand Aiyer
Committer:Amitanand Aiyer

[BACKPORT 2025.1][#26526] Docdb: Table locks. Allow releasing exclusive/global locks through yb-admin Summary: Original commit: ceefc8e064585b93d4b6a31c212e3ad55caa2d1f / D43615 Add a yb-admin command to release global/exclusive table locks through the master. This is primarily a debug tool to clean up persisted locks at the master -- in case our existing mechanisms for cleaning up the locks are not enough and there is an unexpected leak. For releasing shared/DML locks on TServer, a simpler approach may be to restart the TServer (if possible). However, for exclusive/global locks restart does not help as they are persisted. The expected workflow would be for the user/engineer to curl the master's web-ui to identify the leaked locks and use the tool to release them -- ** after ** ensuring that the transaction is aborted (using `yb_cancel_transaction` as necessary). To prevent accidental misuse, this command is hidden from the displayed options in yb-admin. other minor changes: - do not pass server_address for uuid in yb-ts-cli - remove acquire codepath from yb-ts-cli **Upgrade/Rollback safety:** This feature is guarded by the test flag `TEST_enable_object_locking_for_table_locks` . Unless the flag is enabled, these changes do not have any effect. Jira: DB-15892 Test Plan: yb_build.sh fastdebug --cxx-test yb-admin-test --gtest_filter AdminCliTestForTableLocks.TestYsqlTableLock Reviewers: bkolagani, zdrudi, rthallam Reviewed By: zdrudi Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D43741

Commit:9b99e99
Author:Amitanand Aiyer
Committer:Amitanand Aiyer

[BACKPORT 2025.1][#25686] Docdb: Table locks. Update wait states appropriately and integrate with ASH Summary: Original commit: 258bdc1375fc404e18f76918105483c08bf161b3 / D43393 Integrate table locks with ASH. 1. use kConflictResolution_WaitOnConflictingTxns instead of LockedBatchEntry_Lock while waiting for a lock 2. Plumb through ash metadata 3. use SET_WAIT_STATUS(YBClient_WaitingOnMaster); **Upgrade/Rollback safety:** The table locks feature is guarded by the test flag `TEST_enable_object_locking_for_table_locks` . Unless the flag is enabled, these changes do not have any effect. Jira: DB-14944 Test Plan: Session 1: ``` yugabyte=# begin transaction; BEGIN yugabyte=# LOCK TABLE demo IN ACCESS SHARE MODE; LOCK TABLE yugabyte=# ``` Session 2: ``` yugabyte=# begin transaction; BEGIN yugabyte=# LOCK TABLE demo IN ACCESS EXCLUSIVE MODE; --- waits --- ``` Session 3: Query ASH: ``` yugabyte=# SELECT substring(stmts.query, 1, 50) AS query, ash.wait_event_component, ash.c AS count, ash.wait_event, ash.wait_event_class, ash.wait_event_aux FROM ( SELECT query_id, wait_event_component, wait_event_class, wait_event, wait_event_aux, count(*) AS c FROM yb_active_session_history WHERE sample_time >= current_timestamp - interval '1 minutes' GROUP BY query_id, wait_event_component, wait_event_class, wait_event, wait_event_aux ) ash LEFT JOIN pg_stat_statements stmts ON stmts.queryid = ash.query_id ORDER BY ash.query_id, ash.wait_event_component; ``` Before the change: ``` query | wait_event_component | count | wait_event | wait_event_class | wait_event_aux ------------------------------------------+----------------------+-------+-----------------------+------------------+------------------- LOCK TABLE demo IN ACCESS EXCLUSIVE MODE | TServer | 54 | OnCpu_Active | Common | LOCK TABLE demo IN ACCESS EXCLUSIVE MODE | YSQL | 54 | WaitingOnTServer | TServerWait | AcquireObjectLock | TServer | 54 | LockedBatchEntry_Lock | TabletWait | (3 rows) ``` After the change: ``` query | wait_event_component | count | wait_event | wait_event_class | wait_event_aux ------------------------------------------+----------------------+-------+------------------------------------------+------------------+------------------- LOCK TABLE demo IN ACCESS EXCLUSIVE MODE | TServer | 14 | ConflictResolution_WaitOnConflictingTxns | TabletWait | LOCK TABLE demo IN ACCESS EXCLUSIVE MODE | TServer | 14 | YBClient_WaitingOnMaster | Client | LOCK TABLE demo IN ACCESS EXCLUSIVE MODE | YSQL | 14 | WaitingOnTServer | TServerWait | AcquireObjectLock (3 rows) ``` With the diff: running ` alter table demo add column bar3 integer ` while a another session was doing DML/Insert into the same table `demo` ``` query | wait_event_component | count | wait_event | wait_event_class | wait_event_aux ------------------------------------------+----------------------+-------+------------------------------------------+------------------+------------------- alter table demo add column bar3 integer | TServer | 18 | ConflictResolution_WaitOnConflictingTxns | TabletWait | alter table demo add column bar3 integer | TServer | 18 | YBClient_WaitingOnMaster | Client | alter table demo add column bar3 integer | YSQL | 18 | WaitingOnTServer | TServerWait | AcquireObjectLock | TServer | 1 | Raft_WaitingForReplication | Consensus | ded652464dda431 | TServer | 1 | WAL_Append | Consensus | ded652464dda431 (5 rows) ``` Reviewers: asaha, bkolagani, rthallam Reviewed By: asaha Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D43700

Commit:3d0302d
Author:Abhinab Saha
Committer:Abhinab Saha

[BACKPORT 2024.2][#26808] DocDB, ASH: Fix some "query id 0" samples Summary: Original commit: 77a723d3e4d / D42993 We observe the query id as 0 in some cases, mostly this is due to these reasons - 1. We sample before updating the metadata 2. The RPC is sent by the master (we don't have metadata passing through master yet) 3. The RPC is handed off to another thread and we didn't inherit the wait state in the other thread This diff attempts to solve 3 in the read - write paths by adopting the wait state pointer when we are changing threads. Summary of changes - - Store the current thread local wait state pointer in the YBTransaction::Impl object, we use this to update the wait state pointer when changing threads in the transaction layer - Pass ASH metadata with UpdateTransaction and AbortTransaction RPCs - Add YBClient_LookingUpTablet wait states when we wait for tablet information in meta_cache.cc Jira: DB-16198 **Upgrade/Downgrade safety:** The field added is optional, if the field is not present, then ASH will not have some metadata, which is fine during upgrades/rollbacks Test Plan: New unit test added ./yb_build.sh --cxx-test wait_states-itest --gtest_filter WaitStateITest/AshTestVerifyOccurrence.VerifyWaitStateEntered/kYBClient_WaitingOnMaster Manual testing Ran the yb-sample-apps workload ``` java -jar yb-sample-apps.jar --workload SqlInserts --nodes 127.0.0.1:5433 java -jar yb-sample-apps.jar --workload SqlSecondaryIndex --nodes 127.0.0.1:5433 java -jar yb-sample-apps.jar --workload SqlSnapshotTxns --nodes 127.0.0.1:5433 ``` No "0 query id" samples with wait states other than OnCpu_Active and OnCpu_Passive were observed Reviewers: amitanand, hsunder, mlillibridge Reviewed By: amitanand Subscribers: yql, hbhanawat Differential Revision: https://phorge.dev.yugabyte.com/D43594
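A simple way to check for the remaining "query id 0" samples described above is a query along these lines (a sketch, assuming the yb_active_session_history view used elsewhere in this list):

```
-- Count samples that have no query id attached, grouped by wait event.
SELECT wait_event, wait_event_component, count(*)
FROM yb_active_session_history
WHERE query_id = 0
GROUP BY wait_event, wait_event_component
ORDER BY count(*) DESC;
```

Per the test plan above, after this change only OnCpu_Active and OnCpu_Passive are expected to show up here for the tested workloads.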

Commit:ceefc8e
Author:Amitanand Aiyer
Committer:Amitanand Aiyer

[#26526] Docdb: Table locks. Allow releasing exclusive/global locks through yb-admin Summary: Add a yb-admin command to release global/exclusive table locks through the master. This is primarily a debug tool to clean up persisted locks at the master -- in case our existing mechanisms for cleaning up the locks are not enough and there is an unexpected leak. For releasing shared/DML locks on TServer, a simpler approach may be to restart the TServer (if possible). However, for exclusive/global locks restart does not help as they are persisted. The expected workflow would be for the user/engineer to curl the master's web-ui to identify the leaked locks and use the tool to release them -- ** after ** ensuring that the transaction is aborted (using `yb_cancel_transaction` as necessary). To prevent accidental misuse, this command is hidden from the displayed options in yb-admin. other minor changes: - do not pass server_address for uuid in yb-ts-cli - remove acquire codepath from yb-ts-cli **Upgrade/Rollback safety:** This feature is guarded by the test flag `TEST_enable_object_locking_for_table_locks` . Unless the flag is enabled, these changes do not have any effect. Jira: DB-15892 Test Plan: yb_build.sh fastdebug --cxx-test yb-admin-test --gtest_filter AdminCliTestForTableLocks.TestYsqlTableLock Reviewers: bkolagani, zdrudi, rthallam Reviewed By: zdrudi Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D43615

Commit:258bdc1
Author:Amitanand Aiyer
Committer:Amitanand Aiyer

[#25686] Docdb: Table locks. Update wait states appropriately and integrate with ASH Summary: Integrate table locks with ASH. 1. use kConflictResolution_WaitOnConflictingTxns instead of LockedBatchEntry_Lock while waiting for a lock 2. Plumb through ash metadata 3. use SET_WAIT_STATUS(YBClient_WaitingOnMaster); **Upgrade/Rollback safety:** The table locks feature is guarded by the test flag `TEST_enable_object_locking_for_table_locks` . Unless the flag is enabled, these changes do not have any effect. Jira: DB-14944 Test Plan: Session 1: ``` yugabyte=# begin transaction; BEGIN yugabyte=# LOCK TABLE demo IN ACCESS SHARE MODE; LOCK TABLE yugabyte=# ``` Session 2: ``` yugabyte=# begin transaction; BEGIN yugabyte=# LOCK TABLE demo IN ACCESS EXCLUSIVE MODE; --- waits --- ``` Session 3: Query ASH: ``` yugabyte=# SELECT substring(stmts.query, 1, 50) AS query, ash.wait_event_component, ash.c AS count, ash.wait_event, ash.wait_event_class, ash.wait_event_aux FROM ( SELECT query_id, wait_event_component, wait_event_class, wait_event, wait_event_aux, count(*) AS c FROM yb_active_session_history WHERE sample_time >= current_timestamp - interval '1 minutes' GROUP BY query_id, wait_event_component, wait_event_class, wait_event, wait_event_aux ) ash LEFT JOIN pg_stat_statements stmts ON stmts.queryid = ash.query_id ORDER BY ash.query_id, ash.wait_event_component; ``` Before the change: ``` query | wait_event_component | count | wait_event | wait_event_class | wait_event_aux ------------------------------------------+----------------------+-------+-----------------------+------------------+------------------- LOCK TABLE demo IN ACCESS EXCLUSIVE MODE | TServer | 54 | OnCpu_Active | Common | LOCK TABLE demo IN ACCESS EXCLUSIVE MODE | YSQL | 54 | WaitingOnTServer | TServerWait | AcquireObjectLock | TServer | 54 | LockedBatchEntry_Lock | TabletWait | (3 rows) ``` After the change: ``` query | wait_event_component | count | wait_event | wait_event_class | wait_event_aux ------------------------------------------+----------------------+-------+------------------------------------------+------------------+------------------- LOCK TABLE demo IN ACCESS EXCLUSIVE MODE | TServer | 14 | ConflictResolution_WaitOnConflictingTxns | TabletWait | LOCK TABLE demo IN ACCESS EXCLUSIVE MODE | TServer | 14 | YBClient_WaitingOnMaster | Client | LOCK TABLE demo IN ACCESS EXCLUSIVE MODE | YSQL | 14 | WaitingOnTServer | TServerWait | AcquireObjectLock (3 rows) ``` With the diff: running ` alter table demo add column bar3 integer ` while a another session was doing DML/Insert into the same table `demo` ``` query | wait_event_component | count | wait_event | wait_event_class | wait_event_aux ------------------------------------------+----------------------+-------+------------------------------------------+------------------+------------------- alter table demo add column bar3 integer | TServer | 18 | ConflictResolution_WaitOnConflictingTxns | TabletWait | alter table demo add column bar3 integer | TServer | 18 | YBClient_WaitingOnMaster | Client | alter table demo add column bar3 integer | YSQL | 18 | WaitingOnTServer | TServerWait | AcquireObjectLock | TServer | 1 | Raft_WaitingForReplication | Consensus | ded652464dda431 | TServer | 1 | WAL_Append | Consensus | ded652464dda431 (5 rows) ``` Reviewers: asaha, bkolagani, rthallam Reviewed By: asaha Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D43393

Commit:cd7c2d5
Author:Abhinab Saha
Committer:Abhinab Saha

[BACKPORT 2025.1][#26808] DocDB, ASH: Fix some "query id 0" samples Summary: Original commit: 77a723d3e4d / D42993 We observe the query id as 0 in some cases, mostly this is due to these reasons - 1. We sample before updating the metadata 2. The RPC is sent by the master (we don't have metadata passing through master yet) 3. The RPC is handed off to another thread and we didn't inherit the wait state in the other thread This diff attempts to solve 3 in the read - write paths by adopting the wait state pointer when we are changing threads. Summary of changes - - Store the current thread local wait state pointer in the YBTransaction::Impl object, we use this to update the wait state pointer when changing threads in the transaction layer - Pass ASH metadata with UpdateTransaction and AbortTransaction RPCs - Add YBClient_LookingUpTablet wait states when we wait for tablet information in meta_cache.cc Jira: DB-16198 **Upgrade/Downgrade safety:** The field added is optional, if the field is not present, then ASH will not have some metadata, which is fine during upgrades/rollbacks Test Plan: New unit test added ./yb_build.sh --cxx-test wait_states-itest --gtest_filter WaitStateITest/AshTestVerifyOccurrence.VerifyWaitStateEntered/kYBClient_WaitingOnMaster Manual testing Ran the yb-sample-apps workload ``` java -jar yb-sample-apps.jar --workload SqlInserts --nodes 127.0.0.1:5433 java -jar yb-sample-apps.jar --workload SqlSecondaryIndex --nodes 127.0.0.1:5433 java -jar yb-sample-apps.jar --workload SqlSnapshotTxns --nodes 127.0.0.1:5433 ``` No "0 query id" samples with wait states other than OnCpu_Active and OnCpu_Passive were observed Reviewers: amitanand, hsunder, mlillibridge Reviewed By: amitanand Subscribers: yql, hbhanawat Differential Revision: https://phorge.dev.yugabyte.com/D43593

Commit:80bef2a
Author:svarshney
Committer:svarshney

[PLAT-17221] Implement setup software in node-agent Summary: [PLAT-17221] Implement setup software in node-agent Test Plan: Configured runtime_config for enabling node_agent based configure. Verified universe creation. Reviewers: nsingh, anijhawan, nbhatia Reviewed By: nsingh Differential Revision: https://phorge.dev.yugabyte.com/D43561

Commit:d91d06b
Author:Anubhav Srivastava
Committer:Anubhav Srivastava

[BACKPORT 2024.2][#25202] YSQL: Clean up duplicated tablespace validation logic Summary: The tablespace validation logic is currently split and duplicated across multiple files. This diff unifies the logic into `TablespaceParser`, which replaces the vaguely named `PlacementInfoConverter`. It also replaces the custom `PgPlacementInfoPB` message type with the more common `ReplicationInfoPB`, which we convert to later in the call chain anyways. This diff should make two future changes easier: 1. Adding a check to prevent accidental duplicates in the placement blocks 2. Adding read replica support to tablespaces Because we unify the parse-time and creation-time checks, this diff could cause tablespace parsing to fail if a cluster has incorrect placement info. The only place where I think this is possible is if the already has extra fields in the replica_placement (i.e. the placement was created incorrectly, and before https://phorge.dev.yugabyte.com/D10277). To mitigate this, if a tablespace fails validation, we now only skip over that tablespace (before this diff, we skipped loading all tablespaces if one failed validation). **Upgrade/Rollback safety:** The `PgPlacementInfoPB` message type is replaced by `ReplicationInfoPB` in `PgValidatePlacementRequestPB`. That RPC is only sent from PG to the local tserver, so the code to process `PgValidatePlacementRequestPB` should upgrade and rollback at the same time as the calling code. Jira: DB-14387 Original commit: 77bee7035e4a66b06818f5ba2d0a00ff44dbcd87 / D40520 Test Plan: existing tests Reviewers: kfranz, hsunder Reviewed By: kfranz Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D43603
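For context, the placement information that the unified `TablespaceParser` validates is supplied when a tablespace is created. The example below is a sketch using YugabyteDB's documented `replica_placement` option, with hypothetical cloud/region/zone values.

```
-- A tablespace with an explicit 3-replica placement spread across three zones.
CREATE TABLESPACE us_west_ts WITH (replica_placement='{
  "num_replicas": 3,
  "placement_blocks": [
    {"cloud": "aws", "region": "us-west-2", "zone": "us-west-2a", "min_num_replicas": 1},
    {"cloud": "aws", "region": "us-west-2", "zone": "us-west-2b", "min_num_replicas": 1},
    {"cloud": "aws", "region": "us-west-2", "zone": "us-west-2c", "min_num_replicas": 1}
  ]
}');
```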

Commit:68f4089
Author:Anubhav Srivastava
Committer:Anubhav Srivastava

[BACKPORT 2024.2][#25202] docdb: Move ReplicationInfoPB to common_net.proto Summary: Backport changes: ``` - src/yb/integration-tests/upgrade-tests/replication_info_upgrade-test.cc >+ ReplicationInfoUpgradeTest() : UpgradeTestBase(kBuild_2_25_0_0) {} <+ ReplicationInfoUpgradeTest() : UpgradeTestBase(kBuild_2024_1_0_1) {} >+ ASSERT_OK(PerformYsqlMajorCatalogUpgrade()); <+ ASSERT_OK(PerformYsqlUpgrade()); Update to work with older function names ``` Move ReplicationInfoPB to `common_net.proto` because it is used in more than master, and future diffs will add even more logic accessing it from PG. **Upgrade/Rollback safety:** Moving a proto across namespaces is safe, as long as it is not used in any `Any` type messages (which ReplicationInfoPB is not). Jira: DB-14387 (cherry picked from commit b4184dbb7266538bfdbfe3d6bf4163a723bdb324) Original commit: b4184dbb7266538bfdbfe3d6bf4163a723bdb324 / D40826 Test Plan: `./yb_build.sh release --cxx-test integration-tests_replication_info_upgrade-test` Reviewers: hsunder Reviewed By: hsunder Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D43606

Commit:eeb2665
Author:yusong-yan
Committer:yusong-yan

[BACKPORT 2025.1][#26956] docdb: Ensure GetTabletPeers returns only user tablets in test mode

Summary: Original commit: 70ce65890148eb1ab7f7d5004e64e2303f636d6b / D43528

**Issue:** Many tests nowadays use `GetTabletPeers` to retrieve all tablet peers on a tserver, assuming that the returned list includes only user table peers. However, when new system tablets are added to the tserver, those tests can fail. Explanation: these tests typically run a workload, then call `GetTabletPeers` to verify things such as: each tablet has at least X RocksDB files, the total tablet count, or the number of transactions per tablet. These assumptions can break when system tablets are unexpectedly added to the tserver. This issue arises when features like:
* `ysql_yb_enable_advisory_locks` (adds an advisory lock system table), or
* `ysql_enable_auto_analyze_infra` (adds a stateful service table)
are enabled, as they create system tablets on the tserver, which are not impacted by the test workload.

**Fix:** Make `GetTabletPeers` return only user-table tablets in test environments (MiniCluster and ExternalMiniCluster).

**Upgrade/Rollback safety:** `include_user_tablets_only` is optional and defaults to false. Existing clients or older nodes that do not populate this field will behave exactly as before.

Jira: DB-16409

Test Plan: Jenkins

Reviewers: bkolagani, rthallam Reviewed By: rthallam Subscribers: yql, rthallam, ybase Differential Revision: https://phorge.dev.yugabyte.com/D43588

Commit:70bfe28
Author:Sergei Politov
Committer:Sergei Politov

[BACKPORT 2025.1] [#26872] DocDB: Add clone support for vector indexes

Summary: Postgres allows cloning databases, where the cloned database contains all the data from the original database at the time of cloning. However, since vector indexes are colocated with their main tables, cloning vector indexes doesn't work out of the box. This diff implements specific logic to handle vector index cloning. Also added index_id to vector index options; it is unique for the tablet and remains the same during clone.

Also fixed the following issues:
1) Subprocess sequentially reads stdout of the child process, then reads stderr of the child process, and then waits for the process to finish. So if the child process writes a lot of data to stderr, it could hang because of buffer overflow. Fixed by using Poll to query whether we are ready to read stdout or stderr.
2) Index info for colocated tables was not sent from master to tserver during the clone operation. Added logic to send the index info, and also added logic to update the table ids referenced by the index info.

Original commit: 01f5ccabf3e297ca93815986f458a254a678d420 / D43277

**Upgrade/Rollback safety:** Since vector indexes are not released yet, it is safe to change the related protobufs.

Jira: DB-16287

Test Plan: PgCloneTest.CloneVectorIndex

Reviewers: zdrudi, asrivastava, mhaddad Reviewed By: zdrudi Subscribers: ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D43596

Commit:1da8ba8
Author:Sergei Politov
Committer:Sergei Politov

[BACKPORT 2025.1] [#26884] DocDB: Add paging support to vector indexes Summary: When executing a vector index query, the operation may include complex filtering conditions that cannot be pushed down to the index level. A common example of this would be queries containing inner joins or other non-trivial relational operations. This presents a challenge: when such filters are applied after the vector search, we cannot determine in advance how many candidate vectors from the index will ultimately satisfy the complete query conditions. Consequently, we may need to retrieve significantly more vectors than the user-specified limit to ensure we return the correct number of valid results after filtering. To address this issue robustly, the system should implement paginated querying capability for vector index operations. This approach would allow us to: 1. Fetch vectors in manageable batches 2. Apply the complex filters incrementally 3. Continue retrieving additional pages until either: - We satisfy the user's requested limit - We exhaust the relevant portion of the index This pagination mechanism would provide both correctness (ensuring we respect all query conditions) and efficiency (avoiding loading excessive unnecessary vectors into memory at once). **Upgrade / Downgrade Safety** Safe since PgVector feature not released yet. Original commit: 94ac3c01a41ea70efdfeef24c19e3fe371d025f2 / D43353 Jira: DB-16298 Test Plan: PgVectorIndexTest.Paging/* Reviewers: arybochkin Reviewed By: arybochkin Subscribers: yql, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D43572
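A sketch of the kind of query this paging targets is shown below: the join filter cannot be pushed down into the vector index, so the index may have to be read in batches until enough rows survive the filter to satisfy the LIMIT. Table and column names are hypothetical; the `<->` distance operator follows pgvector-style syntax.

```
-- Nearest-neighbour search whose results are further restricted by a join;
-- the index cannot know in advance how many candidates the join will discard.
SELECT d.id, d.title
FROM documents d
JOIN permissions p ON p.doc_id = d.id AND p.user_id = 42
ORDER BY d.embedding <-> '[0.1, 0.2, 0.3]'
LIMIT 10;
```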

Commit:77a723d
Author:Abhinab Saha
Committer:Abhinab Saha

[#26808] DocDB, ASH: Fix some "query id 0" samples Summary: We observe the query id as 0 in some cases, mostly this is due to these reasons - 1. We sample before updating the metadata 2. The RPC is sent by the master (we don't have metadata passing through master yet) 3. The RPC is handed off to another thread and we didn't inherit the wait state in the other thread This diff attempts to solve 3 in the read - write paths by adopting the wait state pointer when we are changing threads. Summary of changes - - Store the current thread local wait state pointer in the YBTransaction::Impl object, we use this to update the wait state pointer when changing threads in the transaction layer - Pass ASH metadata with UpdateTransaction and AbortTransaction RPCs - Add YBClient_LookingUpTablet wait states when we wait for tablet information in meta_cache.cc Jira: DB-16198 **Upgrade/Downgrade safety:** The field added is optional, if the field is not present, then ASH will not have some metadata, which is fine during upgrades/rollbacks Test Plan: New unit test added ./yb_build.sh --cxx-test wait_states-itest --gtest_filter WaitStateITest/AshTestVerifyOccurrence.VerifyWaitStateEntered/kYBClient_WaitingOnMaster Manual testing Ran the yb-sample-apps workload ``` java -jar yb-sample-apps.jar --workload SqlInserts --nodes 127.0.0.1:5433 java -jar yb-sample-apps.jar --workload SqlSecondaryIndex --nodes 127.0.0.1:5433 java -jar yb-sample-apps.jar --workload SqlSnapshotTxns --nodes 127.0.0.1:5433 ``` No "0 query id" samples with wait states other than OnCpu_Active and OnCpu_Passive were observed Reviewers: amitanand, hsunder, mlillibridge Reviewed By: amitanand, mlillibridge Subscribers: hbhanawat, yql Differential Revision: https://phorge.dev.yugabyte.com/D42993

Commit:70ce658
Author:yusong-yan
Committer:yusong-yan

[#26956] docdb: Ensure GetTabletPeers returns only user tablets in test mode

Summary: **Issue:** Many tests nowadays use `GetTabletPeers` to retrieve all tablet peers on a tserver, assuming that the returned list includes only user table peers. However, when new system tablets are added to the tserver, those tests can fail. Explanation: these tests typically run a workload, then call `GetTabletPeers` to verify things such as: each tablet has at least X RocksDB files, the total tablet count, or the number of transactions per tablet. These assumptions can break when system tablets are unexpectedly added to the tserver. This issue arises when features like:
* `ysql_yb_enable_advisory_locks` (adds an advisory lock system table), or
* `ysql_enable_auto_analyze_infra` (adds a stateful service table)
are enabled, as they create system tablets on the tserver, which are not impacted by the test workload.

**Fix:** Make `GetTabletPeers` return only user-table tablets in test environments (MiniCluster and ExternalMiniCluster).

**Upgrade/Rollback safety:** `include_user_tablets_only` is optional and defaults to false. Existing clients or older nodes that do not populate this field will behave exactly as before.

Jira: DB-16409

Test Plan: Jenkins

Reviewers: bkolagani, rthallam Reviewed By: bkolagani Subscribers: ybase, rthallam, yql Differential Revision: https://phorge.dev.yugabyte.com/D43528

Commit:0a63949
Author:Amitanand Aiyer
Committer:Amitanand Aiyer

[BACKPORT 2025.1][#25680, #26498] DocDB: Table locks : Make DDL Release operations be resilient towards master restarts/failovers. Summary: Original commit: 0e1ffe091269e19e1347c302e8c4c573bc3c4194 / D43061 1) Make DDL Release operations be resilient towards master restarts/failovers. Release operations in progress, will first be added to a persistent "in_progress_release_operations" which will track the release operations that have been launched. Loader should relaunch all the release requests after becoming leader. may be skip adding to in-progress requests, if it is already present/is relaunched 2) Modify the release rpc codepath to succeed as soon as the request has been persisted and launched. If the caller desires to wait until the release is done, we allow specifying a `wait_for_release` [default false] which ensures that the rpc will only be responded after success. Don't see a need for this outside of testing. 3) Also, allow for release requests to retry at the LocalTServer until a) either it succeeds, or b) the lease epoch for the tserver is changed. **Upgrade/Rollback safety:** This feature is guarded by the test flag `TEST_enable_object_locking_for_table_locks` . Unless the flag is enabled, these changes do not have any effect. Jira: DB-14938, DB-15865 Test Plan: ybd --cxx-test object_lock-tests Reviewers: zdrudi, bkolagani, rthallam Reviewed By: rthallam Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D43542

Commit:0e1ffe0
Author:Amitanand Aiyer
Committer:Amitanand Aiyer

[#25680, #26498] DocDB: Table locks : Make DDL Release operations be resilient towards master restarts/failovers. Summary: 1) Make DDL Release operations be resilient towards master restarts/failovers. Release operations in progress, will first be added to a persistent "in_progress_release_operations" which will track the release operations that have been launched. Loader should relaunch all the release requests after becoming leader. may be skip adding to in-progress requests, if it is already present/is relaunched 2) Modify the release rpc codepath to succeed as soon as the request has been persisted and launched. If the caller desires to wait until the release is done, we allow specifying a `wait_for_release` [default false] which ensures that the rpc will only be responded after success. Don't see a need for this outside of testing. 3) Also, allow for release requests to retry at the LocalTServer until a) either it succeeds, or b) the lease epoch for the tserver is changed. **Upgrade/Rollback safety:** This feature is guarded by the test flag `TEST_enable_object_locking_for_table_locks` . Unless the flag is enabled, these changes do not have any effect. Jira: DB-14938, DB-15865 Test Plan: ybd --cxx-test object_lock-tests Reviewers: zdrudi, bkolagani, rthallam Reviewed By: zdrudi Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D43061

Commit:d44486d
Author:Mark Lillibridge
Committer:Mark Lillibridge

[BACKPORT 2025.1][#25853] xCluster: move ground truth of replication role to system catalog

Summary: Original commit: 85bf8afeee0bf671c1b450da2250dbaa3d829fec / D43178

Clean backport with no conflicts.

Currently, for xCluster automatic mode, we install a DDL replication extension on both universes and change an extension-specific per-database variable to control what role the extension believes it is playing for that database. (E.g., if it's supposed to be the source then it should be capturing DDL's.) Unfortunately, this variable is only read by Postgres backends on startup, so changing roles (e.g., during a switchover operation) does not affect existing Postgres connections. :-( This means among other things that if you start up replication while you have an existing Postgres connection, any DDL performed on that connection after the replication is set up will not be replicated.

To solve this problem, this diff switches to using a new mechanism: the replication role for each database is now kept in the system catalog in the xCluster config system catalog entry. This information is distributed to the TServers via the master heartbeat response and stored in the TServer xcluster_context. The extension, before each DDL, calls the local TServer via pggate to get the current role information for the current database of that Postgres connection. Note that there is lag here -- the extension will not see the new information until after the heartbeat has occurred. We are planning an additional mechanism to "wait until every TServer has received a heartbeat response" to handle this lag. For now, the tests have sleeps and the xCluster replication cannot handle DDL's concurrently with set up and switch over. (The latter is a known to-do task and needs other work.) In general, this diff is not intended to make other functional changes besides this mechanism switch; some small amount of per-file tidying has been included on the "leave the code better than you found it" principle.

Other details:
* a new extension function, get_replication_role, is provided to return the current role the extension believes it is playing for the current connection's database
  * this is provided both for testing and for future debugging
* a new extension procedure, TEST_override_replication_role, is provided to override the current role for the current connection
  * again this allows testing but also may help with future investigations
  * note this is only for the current connection, unlike the previous mechanism
* usefully, if the extension is created on a database not under xCluster replication, it does nothing
  * this can occur if you restore a backup taken while xCluster replication was running

Note that there is a brief window after a TServer starts where it has not yet received a heartbeat response from master; if a Postgres connection is made and a DDL attempts to run, the DDL might fail due to role information being unavailable. It is possible we should add some invisible retries here to make that less likely.

Upgrade/rollback safety: We are assuming (via documentation warnings for unstable releases) that no legal upgrade across this diff starts with automatic mode replication running. (No previous stable release has automatic mode available; it is only available in 2.25.1, which can only legally be upgraded to 2.25.2.) Accordingly, we do not have to worry about upgrade safety in the presence of automatic mode already being used/started during the upgrade.

This diff adds a new pggate RPC call:
```
~/code/yugabyte-db/src/yb/tserver/pg_client.proto:52:
rpc GetXClusterRole(PgGetXClusterRoleRequestPB) returns (PgGetXClusterRoleResponsePB);
```
Pggate RPC calls don't matter for upgrades because they only go to the local TServer and are thus upgraded in lockstep.

This diff also adds a new field to the catalog namespace entries:
```
~/code/yugabyte-db/src/yb/common/common_types.proto:219:
// Namespaces (previously) under DB-scoped xCluster replication contain this information.
message XClusterNamespaceInfoPB {
  enum XClusterRole {
    UNSPECIFIED = 0;        // This should never occur; not stored.
    UNAVAILABLE = 1;        // Used to denote that we cannot determine the role; not stored.
    NOT_AUTOMATIC_MODE = 2; // No automatic mode replication is occurring for this namespace.
    AUTOMATIC_SOURCE = 3;
    AUTOMATIC_TARGET = 4;
  }
  required XClusterRole role = 1 [default = NOT_AUTOMATIC_MODE];
}

~/code/yugabyte-db/src/yb/master/catalog_entity_info.proto:442:
message SysXClusterConfigEntryPB {
  ...
  // Local NamespaceId -> xCluster info for that namespace.
  map<string, XClusterNamespaceInfoPB> xcluster_info_per_namespace = 3;
}
```
Given the reasonable default (empty map, a.k.a. no namespaces are under automatic replication), which matches the expectation that no automatic mode replication will be running as the upgrade starts, and the inability to use automatic mode during the upgrade, this cannot cause problems.

Additional information is added to the master heartbeat response:
```
~/code/yugabyte-db/src/yb/master/master_heartbeat.proto:266:
message TSHeartbeatResponsePB {
  ...
  // Local NamespaceId -> xCluster info for that namespace.
  map<string, XClusterNamespaceInfoPB> xcluster_info_per_namespace = 30;
}
```
Given the protobuf compatibility rules, this causes no problems. Note that in the absence of the DDL replication extension, nothing looks at this information.

We are also changing the SQL objects created by the extension; we are not providing a migration from the previous version of the extension because again it should not be possible to upgrade from a version with the old extension. What about upgrading 2024.2.2 to 2025.1.0 with Semi-Automatic xCluster running? That causes no issues because there is no extension, and in the absence of an extension none of these changes make a difference.

Fixes #25853
Jira: DB-15147

Test Plan: New end-to-end test that existing connections have the role they use updated:
```
time ybd release --cxx-test xcluster_ddl_replication-test --gtest_filter '*.ExtensionRoleUpdating' --test-args '' >& /tmp/generic.mdl.log
```
New test to verify that we capture DDL's on existing connections after replication is set up:
```
time ybd release --cxx-test xcluster_ddl_replication-test --gtest_filter '*.CreateTableInExistingConnection' --test-args '' >& /tmp/generic.mdl.log
```
New regression test suite to verify that the new extension routines work correctly: postgres/yb-extensions/yb_xcluster_ddl_replication/sql/routines.sql
```
time ybd release --cxx-test xcluster_ddl_replication-test --test-args '' |& tee /tmp/generic.mdl.log | grep -P '^\[ '
```
Reviewers: xCluster, jhe Reviewed By: jhe Subscribers: yql, ybase, hsunder, svc_phabricator Differential Revision: https://phorge.dev.yugabyte.com/D43459
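The new extension routines named above can be exercised from SQL; the call below is a sketch, and the schema qualification is an assumption (the extension's own schema) rather than something stated in the commit.

```
-- Ask the extension which replication role it believes this database is playing.
SELECT yb_xcluster_ddl_replication.get_replication_role();
```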

Commit:85bf8af
Author:Mark Lillibridge
Committer:Mark Lillibridge

[#25853] xCluster: move ground truth of replication role to system catalog

Summary: Currently, for xCluster automatic mode, we install a DDL replication extension on both universes and change an extension-specific per-database variable to control what role the extension believes it is playing for that database. (E.g., if it's supposed to be the source then it should be capturing DDL's.) Unfortunately, this variable is only read by Postgres backends on startup, so changing roles (e.g., during a switchover operation) does not affect existing Postgres connections. :-( This means among other things that if you start up replication while you have an existing Postgres connection, any DDL performed on that connection after the replication is set up will not be replicated.

To solve this problem, this diff switches to using a new mechanism: the replication role for each database is now kept in the system catalog in the xCluster config system catalog entry. This information is distributed to the TServers via the master heartbeat response and stored in the TServer xcluster_context. The extension, before each DDL, calls the local TServer via pggate to get the current role information for the current database of that Postgres connection. Note that there is lag here -- the extension will not see the new information until after the heartbeat has occurred. We are planning an additional mechanism to "wait until every TServer has received a heartbeat response" to handle this lag. For now, the tests have sleeps and the xCluster replication cannot handle DDL's concurrently with set up and switch over. (The latter is a known to-do task and needs other work.) In general, this diff is not intended to make other functional changes besides this mechanism switch; some small amount of per-file tidying has been included on the "leave the code better than you found it" principle.

Other details:
* a new extension function, get_replication_role, is provided to return the current role the extension believes it is playing for the current connection's database
  * this is provided both for testing and for future debugging
* a new extension procedure, TEST_override_replication_role, is provided to override the current role for the current connection
  * again this allows testing but also may help with future investigations
  * note this is only for the current connection, unlike the previous mechanism
* usefully, if the extension is created on a database not under xCluster replication, it does nothing
  * this can occur if you restore a backup taken while xCluster replication was running

Note that there is a brief window after a TServer starts where it has not yet received a heartbeat response from master; if a Postgres connection is made and a DDL attempts to run, the DDL might fail due to role information being unavailable. It is possible we should add some invisible retries here to make that less likely.

Upgrade/rollback safety: We are assuming (via documentation warnings for unstable releases) that no legal upgrade across this diff starts with automatic mode replication running. (No previous stable release has automatic mode available; it is only available in 2.25.1, which can only legally be upgraded to 2.25.2.) Accordingly, we do not have to worry about upgrade safety in the presence of automatic mode already being used/started during the upgrade.

This diff adds a new pggate RPC call:
```
~/code/yugabyte-db/src/yb/tserver/pg_client.proto:52:
rpc GetXClusterRole(PgGetXClusterRoleRequestPB) returns (PgGetXClusterRoleResponsePB);
```
Pggate RPC calls don't matter for upgrades because they only go to the local TServer and are thus upgraded in lockstep.

This diff also adds a new field to the catalog namespace entries:
```
~/code/yugabyte-db/src/yb/common/common_types.proto:219:
// Namespaces (previously) under DB-scoped xCluster replication contain this information.
message XClusterNamespaceInfoPB {
  enum XClusterRole {
    UNSPECIFIED = 0;        // This should never occur; not stored.
    UNAVAILABLE = 1;        // Used to denote that we cannot determine the role; not stored.
    NOT_AUTOMATIC_MODE = 2; // No automatic mode replication is occurring for this namespace.
    AUTOMATIC_SOURCE = 3;
    AUTOMATIC_TARGET = 4;
  }
  required XClusterRole role = 1 [default = NOT_AUTOMATIC_MODE];
}

~/code/yugabyte-db/src/yb/master/catalog_entity_info.proto:442:
message SysXClusterConfigEntryPB {
  ...
  // Local NamespaceId -> xCluster info for that namespace.
  map<string, XClusterNamespaceInfoPB> xcluster_info_per_namespace = 3;
}
```
Given the reasonable default (empty map, a.k.a. no namespaces are under automatic replication), which matches the expectation that no automatic mode replication will be running as the upgrade starts, and the inability to use automatic mode during the upgrade, this cannot cause problems.

Additional information is added to the master heartbeat response:
```
~/code/yugabyte-db/src/yb/master/master_heartbeat.proto:266:
message TSHeartbeatResponsePB {
  ...
  // Local NamespaceId -> xCluster info for that namespace.
  map<string, XClusterNamespaceInfoPB> xcluster_info_per_namespace = 30;
}
```
Given the protobuf compatibility rules, this causes no problems. Note that in the absence of the DDL replication extension, nothing looks at this information.

We are also changing the SQL objects created by the extension; we are not providing a migration from the previous version of the extension because again it should not be possible to upgrade from a version with the old extension. What about upgrading 2024.2.2 to 2025.1.0 with Semi-Automatic xCluster running? That causes no issues because there is no extension, and in the absence of an extension none of these changes make a difference.

Fixes #25853
Jira: DB-15147

Test Plan: New end-to-end test that existing connections have the role they use updated:
```
time ybd release --cxx-test xcluster_ddl_replication-test --gtest_filter '*.ExtensionRoleUpdating' --test-args '' >& /tmp/generic.mdl.log
```
New test to verify that we capture DDL's on existing connections after replication is set up:
```
time ybd release --cxx-test xcluster_ddl_replication-test --gtest_filter '*.CreateTableInExistingConnection' --test-args '' >& /tmp/generic.mdl.log
```
New regression test suite to verify that the new extension routines work correctly: postgres/yb-extensions/yb_xcluster_ddl_replication/sql/routines.sql
```
time ybd release --cxx-test xcluster_ddl_replication-test --test-args '' |& tee /tmp/generic.mdl.log | grep -P '^\[ '
```
Reviewers: xCluster, jhe Reviewed By: jhe Subscribers: svc_phabricator, hsunder, ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D43178

Commit:cad0d0a
Author:Bvsk Patnaik
Committer:Bvsk Patnaik

[BACKPORT 2025.1][#24431] YSQL: Supplement read restart error with more info Summary: Original commit: 49f1b8cfc4aa718d84252b28b8fbb1714d0d3de5 / D42883 ### Motivation Read restart errors are a common occurence in YugabyteDB. Mitigation strategy varies case-by-case. Well known approaches: 1. Use READ COMMITTED isolation level. A popular mitigation strategy. 2. Increase ysql_output_buffer_size. This is applicable for large SELECTs that have a low read restart rate. 3. Use the deferrable mode. This is for background tasks that scan a lot of rows. 4. Leverage time sync service. Best option but still early. However, these mitigitation strategies do not always work. Such cases require addressing the underlying problem. Currently, the logs lack sufficient detail, leading to speculation rather than concrete diagnosis. ### Examples #### Running CREATE/DROP TEMP TABLE concurrently results in a read restart error ``` 2025-04-04 01:57:32.353 UTC [587443] ERROR: Restart read required at: read: { days: 20182 time: 01:57:32.350502 } local_limit: { days: 20182 time: 01:57:32.350502 } global_limit: <min> in_txn_limit: <max> serial_no: 0 } (yb_max_query_layer_retries set to 0 are exhausted) ``` Users do not expect this error because TEMP TABLEs are local to each pg backend. Moreover, increasing ysql_output_buffer_size does not mitigate the error here because read restart errors on DDLs cannot be retried yet. That said, users find it helpful to understand the root cause of the error. After this revision, the log becomes ``` 2025-04-04 01:57:32.353 UTC [587443] DETAIL: restart_read_time: { read: { days: 20182 time: 01:57:32.350502 } local_limit: { days: 20182 time: 01:57:32.350502 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 01:57:32.096749 } local_limit: { days: 20182 time: 01:57:32.098414 } global_limit: { days: 20182 time: 01:57:32.596749 } in_txn_limit: <max> serial_no: 0 }, table: template1.pg_yb_invalidation_messages [ysql_schema=pg_catalog] [00000001000030008001000000001f90], tablet: 00000000000000000000000000000000, leader_uuid: f1ac558309244cbab385761b531980f0, key: SubDocKey(DocKey(CoTableId=901f0000-0000-0180-0030-000001000000, [], [13515, 10]), [SystemColumnId(0)]), start_time: 2025-04-04 01:57:32.094965+00 ``` **Cause:** The read restart error is returned when accessing the `pg_yb_invalidation_messages` table in the above example. **Mitigation:** Avoid creating TEMP TABLEs from mutliple sessions concurrently. Note: The issue is fixed by table-level locking. #### CREATE INDEX NONCONCURRENTLY CREATE INDEX NONCONCURRENTLY accesses both the user table as well as syscatalog. How can the user identify whether the read restart error occurred due to a concurrent DDL or DML? ``` [ts-1] 2025-04-04 04:25:47.694 UTC [665450] DETAIL: restart_read_time: { read: { days: 20182 time: 04:25:47.512991 } local_limit: { days: 20182 time: 04:25:47.512991 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 04:25:47.012991 } local_limit: { days: 20182 time: 04:25:47.512991 } global_limit: { days: 20182 time: 04:25:47.512991 } in_txn_limit: <max> serial_no: 0 }, table: yugabyte.test_table [ysql_schema=public] [000034cb000030008000000000004000], tablet: bf9d2bf8908b43d4b8b5ff75b169bf11, leader_uuid: 6d60cea932ca41498fee5ad07c375d18, key: SubDocKey(DocKey(0x0a73, [5], []), [ColumnId(1)]) ``` **Cause:** In the above log, the error occurs on the user table. 
**Mitigation:** Stop concurrent writes when creating the INDEX. The read restart can also be on a catalog table. Example: ``` [ts-1] 2025-04-04 02:17:48.392 UTC [592811] DETAIL: restart_read_time: { read: { days: 20182 time: 02:17:48.326938 } local_limit: { days: 20182 time: 02:17:48.326938 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 02:17:48.023542 } local_limit: { days: 20182 time: 02:17:48.024082 } global_limit: { days: 20182 time: 02:17:48.523542 } in_txn_limit: <max> serial_no: 0 }, table: yugabyte.pg_attribute [ysql_schema=pg_catalog] [000034cb0000300080010000000004e1], tablet: 00000000000000000000000000000000, leader_uuid: e941d4ee43bc40a8b4303eab317f52fd, key: SubDocKey(DocKey(CoTableId=e1040000-0000-0180-0030-0000cb340000, [], [16384, 5]), []), start_time: 2025-04-04 02:17:48.022154+00 [ts-1] 2025-04-04 02:17:48.392 UTC [592811] STATEMENT: CREATE INDEX NONCONCURRENTLY non_concurrent_idx_0 ON test_table(key) ``` #### Read restarts on absent keys Users sometimes report that they notice read restart errors even though they are no conflicting keys. To better observe such scenarios, print a key on which read restart error occurred. There can be multiple such keys, so we print the first one we encounter. Here's an example fixed by #25214. Setup ``` CREATE TABLE keys(k1 INT, k2 INT, PRIMARY KEY(k1 ASC, k2 ASC)); INSERT INTO keys(k1, k2) VALUES (1, 1); ``` The following query ``` INSERT INTO keys(k1, k2) VALUES (0, 0), (9, 9) ON CONFLICT DO NOTHING ``` results in a read restart error even though none of 0 or 9 conflict with 1. ``` 2025-04-04 17:55:03.013 UTC [955800] DETAIL: restart_read_time: { read: { days: 20182 time: 17:55:03.012147 } local_limit: { days: 20182 time: 17:55:03.012147 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 17:55:02.983152 } local_limit: { days: 20182 time: 17:55:03.483152 } global_limit: { days: 20182 time: 17:55:03.483152 } in_txn_limit: { days: 20182 time: 17:55:03.011777 } serial_no: 0 }, table: yugabyte.keys [ysql_schema=public] [000034cb000030008000000000004000], tablet: 2a19ede71f7e4389b1ca682e87318b32, leader_uuid: 7010bbec8ec94110b5e2316bca5319f4, key: SubDocKey(DocKey([], [1, 1]), [SystemColumnId(0)]), start_time: 2025-04-04 17:55:03.010743+00 ``` This helps us understand where the read restart error occurred. It occurs on key = (1, 1), not relevant to the INSERT ON CONFLICT statement. The key field helps diagnose such scenarios. **Mitigation:** See 1bcb09cf49ab for the fix. #### Local Limit Fixes Commits d67ba1234996 and 02ced43d3b4e fix some bugs with the local llimit mechanism. Identifying the value of local limit when the issue occurs is fairly straightforward. The error message has local limit. 
``` [ts-1] 2025-04-04 02:17:48.392 UTC [592811] DETAIL: restart_read_time: { read: { days: 20182 time: 02:17:48.326938 } local_limit: { days: 20182 time: 02:17:48.326938 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 02:17:48.023542 } local_limit: { days: 20182 time: 02:17:48.024082 } global_limit: { days: 20182 time: 02:17:48.523542 } in_txn_limit: <max> serial_no: 0 }, table: yugabyte.pg_attribute [ysql_schema=pg_catalog] [000034cb0000300080010000000004e1], tablet: 00000000000000000000000000000000, leader_uuid: e941d4ee43bc40a8b4303eab317f52fd, key: SubDocKey(DocKey(CoTableId=e1040000-0000-0180-0030-0000cb340000, [], [16384, 5]), []), start_time: 2025-04-04 02:17:48.022154+00 [ts-1] 2025-04-04 02:17:48.392 UTC [592811] STATEMENT: CREATE INDEX NONCONCURRENTLY non_concurrent_idx_0 ON test_table(key) ``` #### pg_partman faces read restart errors pg_partman faces a read restart error on max query ``` ERROR: Restart read required at: { read: { physical: 1730622229379200 } local_limit: { physical: 1730622229379200 } global_limit: <min> in_txn_limit: <max> serial_no: 0 } CONTEXT: SQL statement "SELECT max(id)::text FROM public.t1_p100000" ``` **Cause:** This is because there is no range index on the tablet 1_p100000 to retrieve the max value efficiently using the index. A log such as ``` 2025-04-04 18:55:31.936 UTC [963409] DETAIL: restart_read_time: { read: { days: 20182 time: 18:55:31.934321 } local_limit: { days: 20182 time: 18:55:31.934321 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { days: 20182 time: 18:55:31.738805 }, table: yugabyte.tokens_idx [ysql_schema=public] [000034cb000030008000000000004003], tablet: {tokens_idx_tablet_id1}, leader_uuid: {ts2_peer_id}, key: SubDocKey(DocKey([], [300, "G%\xf8S\xde\xb0h\xe6)\xa3G0\x96\xae\x9e9\xf7r\xbe\xc3\x00\x00!!"]), [SystemColumnId(0)]), start_time: 2025-04-04 18:55:31.933012+00 ``` informs us the table on which the read restart error occurred. If the table is an index such as tokens_idx, the query is using an index scan. **Upgrade/Rollback safety:** Adds a new key restart_read_key to tserver.proto. The key is marked optional. When absent an empty value is printed in its stead. Jira: DB-13338 Test Plan: Jenkins Backport-through: 2024.2 Reviewers: smishra, pjain Reviewed By: pjain Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D43417

Commit:b043c83
Author:Bvsk Patnaik
Committer:Bvsk Patnaik

[BACKPORT 2024.2][#24431] YSQL: Supplement read restart error with more info Summary: Original commit: 49f1b8cfc4aa718d84252b28b8fbb1714d0d3de5 / D42883 ### Motivation Read restart errors are a common occurence in YugabyteDB. Mitigation strategy varies case-by-case. Well known approaches: 1. Use READ COMMITTED isolation level. A popular mitigation strategy. 2. Increase ysql_output_buffer_size. This is applicable for large SELECTs that have a low read restart rate. 3. Use the deferrable mode. This is for background tasks that scan a lot of rows. 4. Leverage time sync service. Best option but still early. However, these mitigitation strategies do not always work. Such cases require addressing the underlying problem. Currently, the logs lack sufficient detail, leading to speculation rather than concrete diagnosis. ### Examples #### Running CREATE/DROP TEMP TABLE concurrently results in a read restart error ``` 2025-04-04 01:57:32.353 UTC [587443] ERROR: Restart read required at: read: { days: 20182 time: 01:57:32.350502 } local_limit: { days: 20182 time: 01:57:32.350502 } global_limit: <min> in_txn_limit: <max> serial_no: 0 } (yb_max_query_layer_retries set to 0 are exhausted) ``` Users do not expect this error because TEMP TABLEs are local to each pg backend. Moreover, increasing ysql_output_buffer_size does not mitigate the error here because read restart errors on DDLs cannot be retried yet. That said, users find it helpful to understand the root cause of the error. After this revision, the log becomes ``` 2025-04-04 01:57:32.353 UTC [587443] DETAIL: restart_read_time: { read: { days: 20182 time: 01:57:32.350502 } local_limit: { days: 20182 time: 01:57:32.350502 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 01:57:32.096749 } local_limit: { days: 20182 time: 01:57:32.098414 } global_limit: { days: 20182 time: 01:57:32.596749 } in_txn_limit: <max> serial_no: 0 }, table: template1.pg_yb_invalidation_messages [ysql_schema=pg_catalog] [00000001000030008001000000001f90], tablet: 00000000000000000000000000000000, leader_uuid: f1ac558309244cbab385761b531980f0, key: SubDocKey(DocKey(CoTableId=901f0000-0000-0180-0030-000001000000, [], [13515, 10]), [SystemColumnId(0)]), start_time: 2025-04-04 01:57:32.094965+00 ``` **Cause:** The read restart error is returned when accessing the `pg_yb_invalidation_messages` table in the above example. **Mitigation:** Avoid creating TEMP TABLEs from mutliple sessions concurrently. Note: The issue is fixed by table-level locking. #### CREATE INDEX NONCONCURRENTLY CREATE INDEX NONCONCURRENTLY accesses both the user table as well as syscatalog. How can the user identify whether the read restart error occurred due to a concurrent DDL or DML? ``` [ts-1] 2025-04-04 04:25:47.694 UTC [665450] DETAIL: restart_read_time: { read: { days: 20182 time: 04:25:47.512991 } local_limit: { days: 20182 time: 04:25:47.512991 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 04:25:47.012991 } local_limit: { days: 20182 time: 04:25:47.512991 } global_limit: { days: 20182 time: 04:25:47.512991 } in_txn_limit: <max> serial_no: 0 }, table: yugabyte.test_table [ysql_schema=public] [000034cb000030008000000000004000], tablet: bf9d2bf8908b43d4b8b5ff75b169bf11, leader_uuid: 6d60cea932ca41498fee5ad07c375d18, key: SubDocKey(DocKey(0x0a73, [5], []), [ColumnId(1)]) ``` **Cause:** In the above log, the error occurs on the user table. 
**Mitigation:** Stop concurrent writes when creating the INDEX. The read restart can also be on a catalog table. Example: ``` [ts-1] 2025-04-04 02:17:48.392 UTC [592811] DETAIL: restart_read_time: { read: { days: 20182 time: 02:17:48.326938 } local_limit: { days: 20182 time: 02:17:48.326938 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 02:17:48.023542 } local_limit: { days: 20182 time: 02:17:48.024082 } global_limit: { days: 20182 time: 02:17:48.523542 } in_txn_limit: <max> serial_no: 0 }, table: yugabyte.pg_attribute [ysql_schema=pg_catalog] [000034cb0000300080010000000004e1], tablet: 00000000000000000000000000000000, leader_uuid: e941d4ee43bc40a8b4303eab317f52fd, key: SubDocKey(DocKey(CoTableId=e1040000-0000-0180-0030-0000cb340000, [], [16384, 5]), []), start_time: 2025-04-04 02:17:48.022154+00 [ts-1] 2025-04-04 02:17:48.392 UTC [592811] STATEMENT: CREATE INDEX NONCONCURRENTLY non_concurrent_idx_0 ON test_table(key) ``` #### Read restarts on absent keys Users sometimes report that they notice read restart errors even though they are no conflicting keys. To better observe such scenarios, print a key on which read restart error occurred. There can be multiple such keys, so we print the first one we encounter. Here's an example fixed by #25214. Setup ``` CREATE TABLE keys(k1 INT, k2 INT, PRIMARY KEY(k1 ASC, k2 ASC)); INSERT INTO keys(k1, k2) VALUES (1, 1); ``` The following query ``` INSERT INTO keys(k1, k2) VALUES (0, 0), (9, 9) ON CONFLICT DO NOTHING ``` results in a read restart error even though none of 0 or 9 conflict with 1. ``` 2025-04-04 17:55:03.013 UTC [955800] DETAIL: restart_read_time: { read: { days: 20182 time: 17:55:03.012147 } local_limit: { days: 20182 time: 17:55:03.012147 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 17:55:02.983152 } local_limit: { days: 20182 time: 17:55:03.483152 } global_limit: { days: 20182 time: 17:55:03.483152 } in_txn_limit: { days: 20182 time: 17:55:03.011777 } serial_no: 0 }, table: yugabyte.keys [ysql_schema=public] [000034cb000030008000000000004000], tablet: 2a19ede71f7e4389b1ca682e87318b32, leader_uuid: 7010bbec8ec94110b5e2316bca5319f4, key: SubDocKey(DocKey([], [1, 1]), [SystemColumnId(0)]), start_time: 2025-04-04 17:55:03.010743+00 ``` This helps us understand where the read restart error occurred. It occurs on key = (1, 1), not relevant to the INSERT ON CONFLICT statement. The key field helps diagnose such scenarios. **Mitigation:** See 1bcb09cf49ab for the fix. #### Local Limit Fixes Commits d67ba1234996 and 02ced43d3b4e fix some bugs with the local llimit mechanism. Identifying the value of local limit when the issue occurs is fairly straightforward. The error message has local limit. 
``` [ts-1] 2025-04-04 02:17:48.392 UTC [592811] DETAIL: restart_read_time: { read: { days: 20182 time: 02:17:48.326938 } local_limit: { days: 20182 time: 02:17:48.326938 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 02:17:48.023542 } local_limit: { days: 20182 time: 02:17:48.024082 } global_limit: { days: 20182 time: 02:17:48.523542 } in_txn_limit: <max> serial_no: 0 }, table: yugabyte.pg_attribute [ysql_schema=pg_catalog] [000034cb0000300080010000000004e1], tablet: 00000000000000000000000000000000, leader_uuid: e941d4ee43bc40a8b4303eab317f52fd, key: SubDocKey(DocKey(CoTableId=e1040000-0000-0180-0030-0000cb340000, [], [16384, 5]), []), start_time: 2025-04-04 02:17:48.022154+00 [ts-1] 2025-04-04 02:17:48.392 UTC [592811] STATEMENT: CREATE INDEX NONCONCURRENTLY non_concurrent_idx_0 ON test_table(key) ``` #### pg_partman faces read restart errors pg_partman faces a read restart error on max query ``` ERROR: Restart read required at: { read: { physical: 1730622229379200 } local_limit: { physical: 1730622229379200 } global_limit: <min> in_txn_limit: <max> serial_no: 0 } CONTEXT: SQL statement "SELECT max(id)::text FROM public.t1_p100000" ``` **Cause:** This is because there is no range index on the tablet 1_p100000 to retrieve the max value efficiently using the index. A log such as ``` 2025-04-04 18:55:31.936 UTC [963409] DETAIL: restart_read_time: { read: { days: 20182 time: 18:55:31.934321 } local_limit: { days: 20182 time: 18:55:31.934321 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { days: 20182 time: 18:55:31.738805 }, table: yugabyte.tokens_idx [ysql_schema=public] [000034cb000030008000000000004003], tablet: {tokens_idx_tablet_id1}, leader_uuid: {ts2_peer_id}, key: SubDocKey(DocKey([], [300, "G%\xf8S\xde\xb0h\xe6)\xa3G0\x96\xae\x9e9\xf7r\xbe\xc3\x00\x00!!"]), [SystemColumnId(0)]), start_time: 2025-04-04 18:55:31.933012+00 ``` informs us the table on which the read restart error occurred. If the table is an index such as tokens_idx, the query is using an index scan. **Upgrade/Rollback safety:** Adds a new key restart_read_key to tserver.proto. The key is marked optional. When absent an empty value is printed in its stead. Jira: DB-13338 Test Plan: Jenkins Backport-through: 2024.2 Reviewers: smishra, pjain Reviewed By: pjain Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D43419

Commit:01f5cca
Author:Sergei Politov
Committer:Sergei Politov

[#26872] DocDB: Add clone support for vector indexes Summary: Postgres allows cloning databases, where the cloned database contains all the data from the original database at the time of cloning. However, since vector indexes are colocated with their main tables, cloning vector indexes doesn't work out of the box. This diff implements specific logic to handle vector index cloning. Also added index_id to vector index options; it is unique within the tablet and remains the same during a clone. Also fixed the following issues: 1) Subprocess sequentially reads stdout of the child process, then reads stderr of the child process, and then waits for the process to finish. So if the child process writes a lot of data to stderr, it could hang because of buffer overflow. Fixed by using Poll to query whether we are ready to read stdout or stderr. 2) Index info for colocated tables was not sent from master to tserver during the clone operation. Added logic to send the index info, and also added logic to update the table ids referenced by the index info. **Upgrade/Rollback safety** Since vector indexes are not released yet, it is safe to change related protobufs. Jira: DB-16287 Test Plan: PgCloneTest.CloneVectorIndex Reviewers: zdrudi, asrivastava, mhaddad Reviewed By: zdrudi, asrivastava Subscribers: ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D43277

Commit:49f1b8c
Author:Bvsk Patnaik
Committer:Bvsk Patnaik

[#24431] YSQL: Supplement read restart error with more info Summary: ### Motivation Read restart errors are a common occurrence in YugabyteDB. Mitigation strategy varies case-by-case. Well-known approaches: 1. Use READ COMMITTED isolation level. A popular mitigation strategy. 2. Increase ysql_output_buffer_size. This is applicable for large SELECTs that have a low read restart rate. 3. Use the deferrable mode. This is for background tasks that scan a lot of rows. 4. Leverage time sync service. Best option but still early. However, these mitigation strategies do not always work. Such cases require addressing the underlying problem. Currently, the logs lack sufficient detail, leading to speculation rather than concrete diagnosis. ### Examples #### Running CREATE/DROP TEMP TABLE concurrently results in a read restart error ``` 2025-04-04 01:57:32.353 UTC [587443] ERROR: Restart read required at: read: { days: 20182 time: 01:57:32.350502 } local_limit: { days: 20182 time: 01:57:32.350502 } global_limit: <min> in_txn_limit: <max> serial_no: 0 } (yb_max_query_layer_retries set to 0 are exhausted) ``` Users do not expect this error because TEMP TABLEs are local to each pg backend. Moreover, increasing ysql_output_buffer_size does not mitigate the error here because read restart errors on DDLs cannot be retried yet. That said, users find it helpful to understand the root cause of the error. After this revision, the log becomes ``` 2025-04-04 01:57:32.353 UTC [587443] DETAIL: restart_read_time: { read: { days: 20182 time: 01:57:32.350502 } local_limit: { days: 20182 time: 01:57:32.350502 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 01:57:32.096749 } local_limit: { days: 20182 time: 01:57:32.098414 } global_limit: { days: 20182 time: 01:57:32.596749 } in_txn_limit: <max> serial_no: 0 }, table: template1.pg_yb_invalidation_messages [ysql_schema=pg_catalog] [00000001000030008001000000001f90], tablet: 00000000000000000000000000000000, leader_uuid: f1ac558309244cbab385761b531980f0, key: SubDocKey(DocKey(CoTableId=901f0000-0000-0180-0030-000001000000, [], [13515, 10]), [SystemColumnId(0)]), start_time: 2025-04-04 01:57:32.094965+00 ``` **Cause:** The read restart error is returned when accessing the `pg_yb_invalidation_messages` table in the above example. **Mitigation:** Avoid creating TEMP TABLEs from multiple sessions concurrently. Note: The issue is fixed by table-level locking. #### CREATE INDEX NONCONCURRENTLY CREATE INDEX NONCONCURRENTLY accesses both the user table and the syscatalog. How can the user identify whether the read restart error occurred due to a concurrent DDL or DML? ``` [ts-1] 2025-04-04 04:25:47.694 UTC [665450] DETAIL: restart_read_time: { read: { days: 20182 time: 04:25:47.512991 } local_limit: { days: 20182 time: 04:25:47.512991 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 04:25:47.012991 } local_limit: { days: 20182 time: 04:25:47.512991 } global_limit: { days: 20182 time: 04:25:47.512991 } in_txn_limit: <max> serial_no: 0 }, table: yugabyte.test_table [ysql_schema=public] [000034cb000030008000000000004000], tablet: bf9d2bf8908b43d4b8b5ff75b169bf11, leader_uuid: 6d60cea932ca41498fee5ad07c375d18, key: SubDocKey(DocKey(0x0a73, [5], []), [ColumnId(1)]) ``` **Cause:** In the above log, the error occurs on the user table. **Mitigation:** Stop concurrent writes when creating the INDEX. The read restart can also be on a catalog table. 
Example: ``` [ts-1] 2025-04-04 02:17:48.392 UTC [592811] DETAIL: restart_read_time: { read: { days: 20182 time: 02:17:48.326938 } local_limit: { days: 20182 time: 02:17:48.326938 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 02:17:48.023542 } local_limit: { days: 20182 time: 02:17:48.024082 } global_limit: { days: 20182 time: 02:17:48.523542 } in_txn_limit: <max> serial_no: 0 }, table: yugabyte.pg_attribute [ysql_schema=pg_catalog] [000034cb0000300080010000000004e1], tablet: 00000000000000000000000000000000, leader_uuid: e941d4ee43bc40a8b4303eab317f52fd, key: SubDocKey(DocKey(CoTableId=e1040000-0000-0180-0030-0000cb340000, [], [16384, 5]), []), start_time: 2025-04-04 02:17:48.022154+00 [ts-1] 2025-04-04 02:17:48.392 UTC [592811] STATEMENT: CREATE INDEX NONCONCURRENTLY non_concurrent_idx_0 ON test_table(key) ``` #### Read restarts on absent keys Users sometimes report that they notice read restart errors even though there are no conflicting keys. To better observe such scenarios, print a key on which the read restart error occurred. There can be multiple such keys, so we print the first one we encounter. Here's an example fixed by #25214. Setup ``` CREATE TABLE keys(k1 INT, k2 INT, PRIMARY KEY(k1 ASC, k2 ASC)); INSERT INTO keys(k1, k2) VALUES (1, 1); ``` The following query ``` INSERT INTO keys(k1, k2) VALUES (0, 0), (9, 9) ON CONFLICT DO NOTHING ``` results in a read restart error even though none of 0 or 9 conflict with 1. ``` 2025-04-04 17:55:03.013 UTC [955800] DETAIL: restart_read_time: { read: { days: 20182 time: 17:55:03.012147 } local_limit: { days: 20182 time: 17:55:03.012147 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 17:55:02.983152 } local_limit: { days: 20182 time: 17:55:03.483152 } global_limit: { days: 20182 time: 17:55:03.483152 } in_txn_limit: { days: 20182 time: 17:55:03.011777 } serial_no: 0 }, table: yugabyte.keys [ysql_schema=public] [000034cb000030008000000000004000], tablet: 2a19ede71f7e4389b1ca682e87318b32, leader_uuid: 7010bbec8ec94110b5e2316bca5319f4, key: SubDocKey(DocKey([], [1, 1]), [SystemColumnId(0)]), start_time: 2025-04-04 17:55:03.010743+00 ``` This helps us understand where the read restart error occurred. It occurs on key = (1, 1), which is not relevant to the INSERT ON CONFLICT statement. The key field helps diagnose such scenarios. **Mitigation:** See 1bcb09cf49ab for the fix. #### Local Limit Fixes Commits d67ba1234996 and 02ced43d3b4e fix some bugs with the local limit mechanism. Identifying the value of the local limit when the issue occurs is fairly straightforward: the error message includes the local limit. 
``` [ts-1] 2025-04-04 02:17:48.392 UTC [592811] DETAIL: restart_read_time: { read: { days: 20182 time: 02:17:48.326938 } local_limit: { days: 20182 time: 02:17:48.326938 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { read: { days: 20182 time: 02:17:48.023542 } local_limit: { days: 20182 time: 02:17:48.024082 } global_limit: { days: 20182 time: 02:17:48.523542 } in_txn_limit: <max> serial_no: 0 }, table: yugabyte.pg_attribute [ysql_schema=pg_catalog] [000034cb0000300080010000000004e1], tablet: 00000000000000000000000000000000, leader_uuid: e941d4ee43bc40a8b4303eab317f52fd, key: SubDocKey(DocKey(CoTableId=e1040000-0000-0180-0030-0000cb340000, [], [16384, 5]), []), start_time: 2025-04-04 02:17:48.022154+00 [ts-1] 2025-04-04 02:17:48.392 UTC [592811] STATEMENT: CREATE INDEX NONCONCURRENTLY non_concurrent_idx_0 ON test_table(key) ``` #### pg_partman faces read restart errors pg_partman faces a read restart error on a max query ``` ERROR: Restart read required at: { read: { physical: 1730622229379200 } local_limit: { physical: 1730622229379200 } global_limit: <min> in_txn_limit: <max> serial_no: 0 } CONTEXT: SQL statement "SELECT max(id)::text FROM public.t1_p100000" ``` **Cause:** This is because there is no range index on the table t1_p100000 to retrieve the max value efficiently using the index. A log such as ``` 2025-04-04 18:55:31.936 UTC [963409] DETAIL: restart_read_time: { read: { days: 20182 time: 18:55:31.934321 } local_limit: { days: 20182 time: 18:55:31.934321 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }, original_read_time: { days: 20182 time: 18:55:31.738805 }, table: yugabyte.tokens_idx [ysql_schema=public] [000034cb000030008000000000004003], tablet: {tokens_idx_tablet_id1}, leader_uuid: {ts2_peer_id}, key: SubDocKey(DocKey([], [300, "G%\xf8S\xde\xb0h\xe6)\xa3G0\x96\xae\x9e9\xf7r\xbe\xc3\x00\x00!!"]), [SystemColumnId(0)]), start_time: 2025-04-04 18:55:31.933012+00 ``` informs us of the table on which the read restart error occurred. If the table is an index such as tokens_idx, the query is using an index scan. **Upgrade/Rollback safety:** Adds a new key restart_read_key to tserver.proto. The key is marked optional. When absent, an empty value is printed in its stead. Jira: DB-13338 Test Plan: Jenkins Backport-through: 2024.2 Reviewers: smishra, pjain Reviewed By: pjain Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D42883
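As a quick illustration of mitigations 1 and 3 from the list above, in standard PostgreSQL syntax (the table name is a placeholder):

```sql
-- Mitigation 1: run the statement under READ COMMITTED isolation.
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT count(*) FROM some_table;
COMMIT;

-- Mitigation 3: deferrable mode for long read-only background scans.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE;
SELECT count(*) FROM some_table;
COMMIT;
```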

Commit:94ac3c0
Author:Sergei Politov
Committer:Sergei Politov

[#26884] DocDB: Add paging support to vector indexes Summary: When executing a vector index query, the operation may include complex filtering conditions that cannot be pushed down to the index level. A common example of this would be queries containing inner joins or other non-trivial relational operations. This presents a challenge: when such filters are applied after the vector search, we cannot determine in advance how many candidate vectors from the index will ultimately satisfy the complete query conditions. Consequently, we may need to retrieve significantly more vectors than the user-specified limit to ensure we return the correct number of valid results after filtering. To address this issue robustly, the system should implement paginated querying capability for vector index operations. This approach would allow us to: 1. Fetch vectors in manageable batches 2. Apply the complex filters incrementally 3. Continue retrieving additional pages until either: - We satisfy the user's requested limit - We exhaust the relevant portion of the index This pagination mechanism would provide both correctness (ensuring we respect all query conditions) and efficiency (avoiding loading excessive unnecessary vectors into memory at once). **Upgrade / Downgrade Safety** Safe since PgVector feature not released yet. Jira: DB-16298 Test Plan: PgVectorIndexTest.Paging/* Reviewers: arybochkin Reviewed By: arybochkin Subscribers: ybase, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D43353
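To make the scenario concrete, a hypothetical query of the kind that motivates paging: the join filter cannot be pushed into the vector index, so the index may need to be read in batches until enough rows survive the filter to satisfy the LIMIT. The schema and the pgvector-style distance operator are assumptions for illustration, not taken from the diff:

```sql
-- documents has a vector column; visible_docs restricts which rows ultimately qualify.
SELECT d.id
FROM documents d
JOIN visible_docs v ON v.doc_id = d.id
ORDER BY d.embedding <-> '[1, 0, 0]'
LIMIT 10;
```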

Commit:e9aeaec
Author:Yamen Haddad
Committer:Yamen Haddad

[#26527] Docdb: Extend import_snapshot to import tables based on relfilenode Summary: As part of supporting backups during DDLs, we want to extend `import_snapshot` so that it can use the relfilenode on both the `SnapshotInfoPB` from the backup side and the target side to build the mappings between tables from the backup side and tables on the restore side. This is based on the new assumption (introduced at D42396) that the relfilenode for the table will be the same on the backup and restore sides. This is accomplished as follows: - `SnapshotInfoPB` objects created starting from this fix will have a new format version = 3. This new format indicates that the backup taken has the directive `binary_upgrade_set_next_heap_relfilenode` in its `ysql_dump` and that it will be respected on the restore side. - At the `import_snapshot` phase, if the `SnapshotInfoPB` object has the format version = 3, then we use the relfilenode to build the mappings between tables from the backup side and the restored tables on the target side. Otherwise, we fall back to using the table name to build the mappings. - Added a runtime gflag `import_snapshot_using_table_name` which reverts to the previous import snapshot workflow even if format version = 3. This works as a safety button to fall back to the old behavior of using table names to build the mappings in case any issue is encountered with the new flow at restore time. Users can set this flag to true and retry restore. **Upgrade/Downgrade safety** A new auto flag `enable_export_snapshot_using_relfilenode` is added, which switches the format version of the exported snapshot to 3 (indicating the usage of relfilenode) once all the nodes are on the new code version. This avoids a case where we might take an inconsistent backup while we are in the middle of an upgrade. For example, the master might be on the new version and thus set the format_version to 3, while ysql_dump runs on a tserver that runs an old version that doesn't have the relfilenode preserved. This would lead to a backup that might fail at restore. With the auto flag, we only set the format_version to 3 once all the nodes have the current diff. This also guarantees that all ysql_dump scripts created will have the directives that preserve the relfilenode (as the ysql_dump diff has already landed). Another case to consider is if the user created the snapshot and got the ysql_dump script when the cluster was on the old version, and then issues `export_snapshot` after the cluster has upgraded to the new version; the SnapshotInfo will then have format_version=3. In this case, restore using relfilenode might fail. The solution would be to use the gflag `import_snapshot_using_table_name` at restore time. Otherwise, we would need to persist in the `SysSnapshotEntryPB` that the snapshot was created using the new code, but I think this case is rare enough that there is no need to persist this info at snapshot creation time. Jira: DB-15893 Test Plan: ./yb_build.sh --cxx_test yb-backup-during-ddl-test --gtest-filter YBBackupTest.TestRestorePreserveRelfilenode ./yb_build.sh --cxx-test yb-backup-during-ddl-test --gtest_filter Colocation/YBBackupTestWithColocationParam.TestRestorePreserveRelfilenode/* Additionally, this diff is changing the way we are restoring backups. Therefore, all backup-related tests (such as B/R, DB clone, xCluster) are testing the new workflow. To give an idea, at the beginning of developing this diff, 100+ tests were failing due to various reasons that have been addressed. 
Therefore, we have enough test coverage for this high-impact change. Additionally, in xcluster failover tests such as `--cxx-test integration-tests_xcluster_dr-itest --gtest_filter XClusterDRTest.Failover`, import_snapshot was failing as these tests were not using a pristine database to restore to (we simulate a restore by dropping and creating the table in the existing database). Therefore `import_snapshot` was failing as we were not respecting the constraint that the relfilenode is preserved on the backup/restore sides. Solved the issue by reverting to the old behavior of using the table name to build the mapping in `import_snapshot`. This was a good occasion to test the gflag `import_snapshot_using_table_name`. Reviewers: asrivastava, zdrudi Reviewed By: asrivastava Subscribers: jhe, hsunder, ybase, slingam Differential Revision: https://phorge.dev.yugabyte.com/D43159

Commit:6fe0575
Author:Naorem Khogendro Singh
Committer:Naorem Khogendro Singh

[PLAT-17216] Implement server control method in node agent Summary: This enables running server control as an RPC to the node agent. There are some features + refactoring changes. The feature is gated by a global runtime feature flag. 1. Implement the java client + node agent server method for server control. 2. The file shell_task.go is split into the actual command creator and the task runner. The reason is that we will have modules running commands, and the dependency is cleaner (no circular dependency). 3. Node agent dead detector reduced to about 20 mins. 4. The proto file is split into 2: 1. the actual core service; 2. the YB server related ones. Test Plan: 1. UT added for a basic command generator. 2. Created a universe with this change, scaled out, stopped/started a tserver (after enabling this feature). The ServerControl RPC is very fast. Stopping the tserver (no configure call) ran very quickly. Reviewers: svarshney, yshchetinin, anijhawan, nbhatia Reviewed By: yshchetinin Subscribers: yugaware Differential Revision: https://phorge.dev.yugabyte.com/D43202

Commit:faad488
Author:Yury Shchetinin
Committer:Yury Shchetinin

[PLAT-16585] Earlyoom integration with node agent Summary: Added rpc to node agent to turn on/off already installed earlyoom service Test Plan: this diff is a part of a big diff where the whole cycle of installing + turning on/off via rpc was tested Reviewers: nsingh, anijhawan Reviewed By: nsingh, anijhawan Subscribers: yugaware Differential Revision: https://phorge.dev.yugabyte.com/D42209

Commit:4b44b76
Author:Hari Krishna Sunder
Committer:Hari Krishna Sunder

[#26702] DocDB: Add BSON type to yb Summary: Adding BSON as a dedicated data type. This is required for the Microsoft DocumentDB extension integration. The extension was imported as part of b01e29bafa6fb8bfb899f0b3ac6e98363340c8e7/D43110. The type currently maps to the binary data type since we don't have any index support for this type yet. Mapping it to the bson library and handling conversion to a comparable binary format for the RocksDB key will be done in a later diff. **Upgrade/Rollback safety** The new type is not used in any existing code. The feature will be guarded by an AutoFlag when the DocumentDB extension is made available. Fixes #26702 Jira: DB-16078 Test Plan: build --clang17 --cxx-test pggate_test_type Regress tests will be added after the extension is merged with the new OID from this change. Reviewers: slingam, sergei, tnayak Reviewed By: sergei, tnayak Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D43048

Commit:e374708
Author:Zachary Drudi
Committer:Zachary Drudi

[#26739] docdb: Add test method to wait until tservers processed the ysql refresh lease response Summary: Adds a method to ExternalMiniCluster that waits until tservers have processed the refresh lease response from the master leader. **Upgrade/Rollback safety:** The proto changes are only for a new RPC. This new RPC is exclusively called by test infrastructure code. Jira: DB-16115 Test Plan: Existing tests. Reviewers: bkolagani Reviewed By: bkolagani Subscribers: rthallam, slingam, ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D43107

Commit:807d224
Author:Huapeng Yuan
Committer:Huapeng Yuan

[#26204] YSQL: Fix read time of fast-path txn with buffered writes Summary: Fix read time of fast-path txn with buffered writes. Problem: For fast-path transactions with buffer write (like COPY), the read timestamp is picked on PgClientSession because the manipulator is set to ENSURE_READ_TIME_IS_SET during FlushOperations in PG. But the conflict resolution on RegularDB is skipped for fast-path transactions, so it may not detect a concurrent write with higher timestamp. Thus, when a fast-path transaction runs concurrently with other INSERTs (by a distributed txn or fast-path txn), we might miss the duplicate INSERT, leading to incorrect behavior. This issue is especially problematic on single shard tables (e.g., colocated tables), where a COPY with fast-path transaction can unintentionally overwrite existing data. One example (similar to the issue fixed in [[ https://github.com/yugabyte/yugabyte-db/commit/fc210685aec45933a9e8dc3e8e2a2e3b02a7fda7 | fc21068 ]]): --------------------------------------------------- Assume two concurrent sessions: one is running insert by COPY with fast-path transaction and the other is trying to insert the same row via a distributed transaction. (1) Insert 1 (distributed): pick a read time (rt1=5) on PgClientSession (2) Insert 2 (fast-path): pick a read time (rt2=6) on PgClientSession (3) Insert 1: acquire in-memory locks and do conflict checking (check intents db & regular db). (4) Insert 1: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt1, write row in intents db. (5) Insert 2: acquire in-memory locks, find conflicting intent, release in-memory locks and enter wait queue (6) Insert 1: Apply distributed txn writes data to regular db with commit timestamp 10. (7) Insert 2: wake up from wait queue, acquire in-memory locks again, check for conflicting intents again, none will be found. (8) Insert 2: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt2, write row in regular db with commit timestamp 12. In the step 8, the fast-path transaction with buffer write should fail with a "duplicate key value violates" error, but it succeeds. Fix: For fast-path transactions with buffer writes, we will not set EnsureReadTimeIsSet to true in pg_session.cc. As a result, the pg_client_session will not assign a read time for the write request (unless the write operation requires fanout to multiple tservers). Instead, a read time will be selected from docdb. This ensures that for single-shard tables (e.g., colocated tables), fast-path transactions will always pick the read timestamp from docdb. With this fix, even if regular DB is skipped during conflict resolution, the fast-path transactions can detect concurrent INSERTs during the read, preventing inconsistency issues. Upgrade/Rollback safety: The field 'non_transactional_buffered_write' added to pg_client.proto doesn't affect upgrade/rollback. The PgPerformOptionsPB is used only in the communication between PG and local proxy, the field is never be persisted so it is safe during upgrade/rollback. Jira: DB-15550 Test Plan: ./yb_build.sh release --cxx-test pg_read_time-test --gtest_filter PgReadTimeTest.CheckReadTimePickingLocation Reviewers: pjain, rthallam, patnaik.balivada Reviewed By: pjain, patnaik.balivada Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D42271
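A sketch of the race described above, using a hypothetical colocated table t with a primary key on k; the CSV path is a placeholder:

```sql
-- Session A: distributed transaction inserting a row.
BEGIN;
INSERT INTO t VALUES (1);
COMMIT;

-- Session B: concurrent COPY, which runs as a fast-path transaction with buffered writes.
-- If the file also contains row 1, this should fail with "duplicate key value violates
-- unique constraint"; before this fix it could instead silently overwrite the existing row.
COPY t FROM '/tmp/rows.csv' WITH (FORMAT csv);
```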

Commit:4ca08c4
Author:Dmitry Uspenskiy
Committer:Dmitry Uspenskiy

[#26303] YSQL: Get rid of redundant yb_cdc_snapshot_read_time field in SnapshotData structure Summary: The `yb_cdc_snapshot_read_time` field was added to the `SnapshotData` structure in the context of the https://phorge.dev.yugabyte.com/D39355 diff. The purpose of this field is to store the snapshot's read time that came from CDC. But the `SnapshotData` structure already has the `yb_read_time_point_handle` field to store the read time associated with a snapshot. Having the `yb_cdc_snapshot_read_time` field is redundant in this case. This diff removes the `yb_cdc_snapshot_read_time` field and uses `yb_read_time_point_handle` instead. When a CDC snapshot is used, its read time is associated with the current read point in `PgTxnManager`, and this read time will be explicitly sent with the next `Perform` request. The `SetTxnSnapshot` RPC to the local t-server will not be called anymore. Also, this RPC is renamed to `ImportTxnSnapshot`, just like it was prior to the https://phorge.dev.yugabyte.com/D39355 diff. Note: The diff contains several code cleanup/simplification changes: - Unit tests in the `TestPgReplicationSlotSnapshotAction` file are simplified by introducing several helper functions. - The `PgTxnManager::CheckTxnSnapshotOptions` method is removed. Required checks from it are moved into the `PgTxnManager::CheckSnapshotTimeConflict` method. - The `YbcReadPointHandle` typedef is introduced to distinguish a read point handle from a read time value. **Upgrade/Rollback safety:** Changes affect pggate and local t-server communication only. Jira: DB-15648 Test Plan: Jenkins Reviewers: utkarsh.munjal, stiwary, pjain, patnaik.balivada Reviewed By: pjain Subscribers: jason, ybase, yql, ycdcxcluster Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D41802

Commit:60b44b7
Author:Sergei Politov
Committer:Sergei Politov

[BACKPORT 2024.2] [#19447] DocDB: Fix revoking multiple leader leases Summary: We could get into a situation where the leader changed twice in a short period of time. Consider there are 3 leaders in chronological order - L1, L2, L3. L1 could have a leader lease before L2 became leader. And there are scenarios where L2 did not revoke L1's lease before L3 became the leader. Since we keep only a single old leader lease, L3 could decide that the old leader does not have a lease once L2's lease was revoked. Fixed by keeping multiple old leader leases and revoking them individually. **Upgrade/Rollback safety:** Optional fields were replaced with repeated fields. Old software versions will see only the latest element of those fields. So we put the most advanced lease in the last position. Jira: DB-8238 Original commit: 42b8ce8fe1b166948177fa075f04c040f6317b30 / D42229 Test Plan: QLTabletTest.LeaderLeaseRevocation Reviewers: hsunder, patnaik.balivada Reviewed By: hsunder, patnaik.balivada Subscribers: ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D42986

Commit:b7a0458
Author:Tanuj Nayak
Committer:Tanuj Nayak

[#26629] YSQL: Push down ef_search parameter to ybhnsw Summary: The GUC `hnsw.ef_search` in pgvector controls the expansion factor for HNSW searches, where higher values improve accuracy at the cost of increased latency. This change introduces a corresponding GUC, `ybhnsw.ef_search`, and propagates it to DocDB. Additional work is required in DocDB to utilize the propagated value. **Upgrade/Rollback safety:** This change adds new fields to `PgVectorReadOptionsPB` in common.proto that are unused in DocDB code as of this change. This does not suggest any upgrade/rollback issues. Test Plan: Jenkins Manual testing by observing the docdb requests of ybhnsw queries to make sure the correct ef_search parameter is passed down Reviewers: kramanathan Reviewed By: kramanathan Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D42934
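For example, assuming a pgvector-style table with a ybhnsw index (table, column, and values are placeholders), the new GUC can be tuned per session before a similarity search:

```sql
-- Higher ef_search widens the HNSW candidate set: better recall at the cost of latency.
SET ybhnsw.ef_search = 200;
SELECT id FROM items ORDER BY embedding <-> '[0.1, 0.2, 0.3]' LIMIT 5;
```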

Commit:69b4384
Author:Vaibhav Kushwaha
Committer:Vaibhav Kushwaha

[#26182] CDC: Add syntax support for replication slot ordering mode Summary: We need to provide the ability to specify the ordering mode being used by the replication slot i.e. `ROW` and `TRANSACTION`. This would essentially indicate how a replication slot is ordering change events. To support the above requirement, the following new syntax will be introduced and supported, with the addition of the options to specify `yb_ordering_mode` as `ROW` or `TRANSACTION`: ``` CREATE_REPLICATION_SLOT slot_name LOGICAL [yboutput|pgoutput|test_decoding|wal2json] [TEMPORARY] [HYBRID_TIME|SEQUENCE] [USE_SNAPSHOT|EXPORT_SNAPSHOT] [ROW|TRANSACTION] ``` * The optional value `[ROW|TRANSACTION]` denotes the ordering mode. The default value will be `TRANSACTION` and it will be valid in the context of a slot. **Upgrade / rollback safety:** This diff adds a new field for `ordering_mode` to the following proto files: 1. `pg_client.proto` 2. `common.proto` 3. `master_replication.proto` A new flag is introduced i.e. `ysql_yb_allow_replication_slot_ordering_modes` and the value of this flag will be checked whenever the fields for `ordering_mode` will be modified. Additionally, for the method `pg_create_logical_replication_slot`, we will be hardcoding the value for `yb_ordering_mode` to `TRANSACTION`. Jira: DB-15518 Test Plan: The diff introduced some tests and they can be run using: ``` ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestPgCreateReplicationSlotDefaultOrderingMode ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestCreateReplicationSlotWithOrderingModeRow ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestCreateReplicationSlotWithOrderingModeTransaction ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestReplicationSlotOrderingModePresentAfterRestart ``` Reviewers: skumar, stiwary, sumukh.phalgaonkar, utkarsh.munjal, aagrawal, #db-approvers Reviewed By: stiwary Subscribers: svc_phabricator, ycdcxcluster, jason, yql Differential Revision: https://phorge.dev.yugabyte.com/D42203
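For illustration, the new syntax from the summary as it might be issued over a replication connection, alongside the SQL-level function that the summary says is hardcoded to TRANSACTION ordering; slot names and the plugin choice are placeholders:

```sql
-- Replication-protocol command using the new optional ordering-mode clause.
CREATE_REPLICATION_SLOT row_slot LOGICAL yboutput ROW;

-- SQL-level creation; per the summary, yb_ordering_mode is hardcoded to TRANSACTION here.
SELECT * FROM pg_create_logical_replication_slot('txn_slot', 'yboutput');
```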

Commit:87cbe1b
Author:Zachary Drudi
Committer:Zachary Drudi

[#25832] docdb: Add cleanup loop to object info lock manager. Summary: Adds a poller to the object lock info manager. This poller does two things: - It marks a lease as expired if the lease's last refreshed time is too far in the past. - It kicks off subtasks to release locks held by expired lease epochs There is logic to skip kicking off a cleanup task for a lease epoch if it already has a cleanup task pending. This diff also does some refactors. All lease related state has been moved from the `TSDescriptor` into the `ObjectLockInfo` to eliminate duplicated state managed by the `ObjectLockInfoManager`. **Upgrade/Rollback safety:** The proto changes are guarded behind test only flags. In particular if `TEST_enable_ysql_operation_lease` is false the proto fields this diff removes are not populated. Jira: DB-15129 Test Plan: ``` % ./yb_build.sh fastdebug --with-tests --cxx-test object_lock-test ``` Reviewers: amitanand, bkolagani Reviewed By: amitanand Subscribers: rthallam, ybase, slingam Differential Revision: https://phorge.dev.yugabyte.com/D42697

Commit:934bd9f
Author:Hari Krishna Sunder
Committer:Hari Krishna Sunder

[BACKPORT 2024.2][#26605] YSQL: Add RPC to get tserver pg socket dir Summary: Original commit: db60566a0b746a60c44f23de9fcd96141980fe3f / D42871 During YSQL Major upgrade test, the pg_upgrade process needs to connect to the v11 Postgres hosted by the TServer. When the TServer is running on the same node as the Master, we use the socket dir to connect to the Postgres instance. The socket dir can be different from the reported proxy host port in certain scenarios. This change adds a new RPC to get the socket dir directly from the TServer. **Upgrade/Rollback safety:** New RPC is optional. YB-Master uses the `ERROR_NO_SUCH_METHOD` to determine if the Tserver it connects to does not support this RPC, and falls back to the existing mechanism of using Master gFlag overrides. Jira: DB-15976 Test Plan: PgWrapperTest.GetPgSocketDir The following tests are added, but will be disabled until the upgrade from version is bumped to include this change: YsqlMajorUpgradeTestWithConnMgr.SimpleTableUpgrade YsqlMajorUpgradeTestWithConnMgr.SimpleTableRollback Reviewers: telgersma, fizaa Reviewed By: telgersma Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D42893

Commit:db60566
Author:Hari Krishna Sunder
Committer:Hari Krishna Sunder

[#26605] YSQL: Add RPC to get tserver pg socket dir Summary: During YSQL Major upgrade test, the pg_upgrade process needs to connect to the v11 Postgres hosted by the TServer. When the TServer is running on the same node as the Master, we use the socket dir to connect to the Postgres instance. The socket dir can be different from the reported proxy host port in certain scenarios. This change adds a new RPC to get the socket dir directly from the TServer. **Upgrade/Rollback safety:** New RPC is optional. YB-Master uses the `ERROR_NO_SUCH_METHOD` to determine if the Tserver it connects to does not support this RPC, and falls back to the existing mechanism of using Master gFlag overrides. Jira: DB-15976 Test Plan: PgWrapperTest.GetPgSocketDir The following tests are added, but will be disabled until the upgrade from version is bumped to include this change: YsqlMajorUpgradeTestWithConnMgr.SimpleTableUpgrade YsqlMajorUpgradeTestWithConnMgr.SimpleTableRollback Reviewers: telgersma, fizaa Reviewed By: telgersma Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D42871

Commit:22708c2
Author:Sergei Politov
Committer:Sergei Politov

[#26554] DocDB: Master leader could miss location data if full report was sent in multiple batches during restart Summary: **Tablet Server Heartbeat Mechanism & Full Tablet Reports** 1. **Heartbeat Communication** - The **tablet server (TS)** uses a **heartbeat mechanism** to periodically update the **master leader** about its state. - To optimize performance, the TS **only sends changed data** in heartbeats (avoiding redundant information). 2. **Master Leader Election & Data Loss Risk** - When a **new master leader is elected** (or restarted), it **reloads the system catalog** but may **miss non-persistent data** (e.g., tablet replica locations). - To recover this data, the master **requests a full tablet report** from the TS via the next heartbeat. 3. **Chunked Full Tablet Reports** - Since the full tablet report can be **large**, the TS may send it in **multiple chunks** over several heartbeats. - **Problem:** If the **master leader changes again** while receiving chunks: - The new master may **mistake partial chunks** for a complete report. - As a result, it **won’t request the full report again**, leading to **missing data** (from the first chunks). **Key Issue Summary** - **Race condition**: A master leader change **midway through chunked reporting** can cause data loss because the new master assumes the received chunks are a complete report. - **Impact**: The master may **lack critical replica location data**, leading to potential consistency or availability issues. Fixed by adding logic to track chunks in the full tablet report. **Upgrade/Rollback safety:** 1) If a TServer with the new version sends a full tablet report to a master with an old version, the new field will just be ignored. 2) If a TServer with an old version sends a full tablet report to a master with a new version, there is logic that will be activated only if this field was specified. So this combination could still suffer the described issue, since the master's behaviour will match the non-updated one. Jira: DB-15922 Test Plan: ./yb_build.sh release --gtest_filter CreateSmallHBTableStressTest.TestRestartMasterDuringFullHeartbeat --cxx-test create-table-stress-test -n 120 Reviewers: zdrudi, asrivastava Reviewed By: zdrudi Subscribers: ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D42785

Commit:0fee13f
Author:Anubhav Srivastava
Committer:Anubhav Srivastava

[#26543] docdb: Remove master_replication.pb.h from master_fwd.h Summary: master_fwd.h should only contain forwards, but it includes master_replication.pb.h, which means every file that includes master_fwd.h has to parse master_replication.pb.h too, which takes significant time (according to ClangBuildAnalyzer). This diff also changes the forward generator in gen_yrpc (our protobuf plugin) to forward declare enums in `proto_foo.fwd.h`. This diff also removes some unused proto includes that were generating compiler warnings. **Upgrade / Downgrade Safety** The proto changes are to unused includes. Jira: DB-15911 Test Plan: existing tests Reviewers: hsunder Reviewed By: hsunder Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D42766

Commit:686da49
Author:Sfurti Sarah
Committer:Sfurti Sarah

[BACKPORT 2024.1][#25818] YSQL: Modify yb_servers() function to contain universe_uuid value Summary: Original commit: d8a8553fadc1dbd20f9809fa45016c99a0f83865 / D41512 Modify yb_servers() function to contain universe_uuid value. This is done to identify different clusters from the smart driver. Sample output : ``` yugabyte=# select * from yb_servers(); host | port | num_connections | node_type | cloud | region | zone | public_ip | uuid | universe_uuid -----------+------+-----------------+-----------+--------+-------------+-------+-----------+----------------------------------+----------------------------- --------- 127.0.0.1 | 5433 | 0 | primary | cloud1 | datacenter1 | rack1 | | 15f0bdffcbc2414f9f290b6568da5263 | a3a1bbd8-15f5-4ec5-a6cf-8891 47c68fd9 127.0.0.3 | 5433 | 0 | primary | cloud1 | datacenter1 | rack1 | | 8bc22e6aeb3744ca95318d0477b9beb8 | a3a1bbd8-15f5-4ec5-a6cf-8891 47c68fd9 127.0.0.2 | 5433 | 0 | primary | cloud1 | datacenter1 | rack1 | | 9e4775bddd5742dd9becb10cc930d76c | a3a1bbd8-15f5-4ec5-a6cf-8891 47c68fd9 (3 rows) ``` **Upgrade/Rollback safety:** Since only the pg_client.proto is only changed which is primarily meant for pg to local tserver RPCs, so there shouldn't be any upgrade/rollback issues Jira: DB-15115 Test Plan: Test PR up for review - https://github.com/yugabyte/driver-examples/pull/49 Reviewers: asaha Reviewed By: asaha Subscribers: hdaryani, hbhanawat, ashetkar Differential Revision: https://phorge.dev.yugabyte.com/D42646

Commit:09c05e2
Author:Amitanand Aiyer
Committer:Amitanand Aiyer

[#26281] Docdb: Table locks -- make local-lock-manager bootstrap logic resilient to network errors Summary: Prior to this diff, if somehow (due to a network partition/event) the response to the initial ysql-lease request is lost on the TServer side, the master does not send the ddl locks required to bootstrap in the subsequent requests (because the master believes that the TServer is already registered). This diff remediates this situation by ensuring that the TServer is the source of truth, and ensures that the TServer communicates to the master whether or not it needs a bootstrap **Upgrade/Rollback safety:** Code that uses the changed protos is gated behind the test flags TEST_enable_ysql_operation_lease and TEST_tserver_enable_ysql_lease_refresh which default to false. Jira: DB-15625 Test Plan: yb_build.sh --cxx-test object_lock-test --gtest_filter *ObjectLockTestWithMissingResponses* Reviewers: zdrudi, bkolagani Reviewed By: bkolagani Subscribers: ybase Differential Revision: https://phorge.dev.yugabyte.com/D42648

Commit:cc6c9aa
Author:timothy-e
Committer:timothy-e

[BACKPORT 2024.2][#25976] YSQL: Send serialized expression version in DocDB request Summary: Original commit: 5275733aebd4162214513632c92fc112c7b43433 / D42054 Backport changes: * PgGate's AddQual clause was refactored after 2024.2. Move the expression version setting to the read op initialization. * `PG_MAJORVERSION_NUM` does not exist in PG11. Replace the constant with a hardcoded `11`. This change only exists in 2024.2 and there's no risk that 2024.2 will change major versions. Send `expression_serialization_version` in requests to DocDB, equal to either the current version or the major version compatibility mode. Docdb passes down the version number until the spot where the expression is deserialized, so that we can properly handle each possible case. Additionally, before this diff, `GetCurrentPgVersion` used substrings + stroi to get the current version from the version string. Using PG_MAJORVERSION_NUM gives just the number we are interested in, allowing some code to be removed and eliminating the need for some of the error handling. **Upgrade/Rollback safety:** DocDB doesn't require this field, so it is upgrade safe. Jira: DB-15310 Test Plan: Apply the patch ```lang=diff diff --git src/postgres/src/backend/executor/ybExpr.c src/postgres/src/backend/executor/ybExpr.c index b220d83d20..1f79062c9e 100644 --- src/postgres/src/backend/executor/ybExpr.c +++ src/postgres/src/backend/executor/ybExpr.c @@ -499,8 +499,7 @@ YbCanPushdownExpr(Expr *pg_expr, List **colrefs) * Respond with false if pushdown disabled in GUC, or during a YSQL major * upgrade. */ - if (!yb_enable_expression_pushdown || - yb_major_version_upgrade_compatibility > 0) + if (!yb_enable_expression_pushdown) return false; return !yb_pushdown_walker((Node *) pg_expr, colrefs); diff --git src/yb/yql/pgwrapper/pg_wrapper.cc src/yb/yql/pgwrapper/pg_wrapper.cc index 4da11001b6..8dd5417e60 100644 --- src/yb/yql/pgwrapper/pg_wrapper.cc +++ src/yb/yql/pgwrapper/pg_wrapper.cc @@ -334,7 +334,6 @@ DEFINE_NON_RUNTIME_PG_PREVIEW_FLAG(bool, yb_enable_query_diagnostics, false, DEFINE_RUNTIME_PG_FLAG(int32, yb_major_version_upgrade_compatibility, 0, "The compatibility level to use during a YSQL Major version upgrade. Allowed values are 0 and " "11."); -DEFINE_validator(ysql_yb_major_version_upgrade_compatibility, FLAG_IN_SET_VALIDATOR(0, 11)); DECLARE_bool(enable_pg_cron); ``` This removes validation from `ysql_yb_major_version_upgrade_compatibility` and allows expression pushdown to still occur when compatibility mode is enabled. This will be possible when some pushdowns are allowed (#25973). ```lang=sql CREATE TABLE t (a INT); INSERT INTO t VALUES (1); ``` **Basic case (PG11 version)** ```lang=sql SET yb_debug_log_docdb_requests = true; SELECT * FROM t WHERE a = 1; ``` ``` Applying operation: { READ active: 1 read_time: ... expression_serialization_version: 11 } ``` **Invalid Version Case** ./build/latest/bin/yb-ts-cli set_flag --server_address 127.0.0.1:9100 --force ysql_yb_major_version_upgrade_compatibility 10 ```lang=sql SET yb_debug_log_docdb_requests = true; SELECT * FROM t WHERE a = 1; ERROR: Unsupported expression version: 10 ``` ``` Applying operation: { READ active: 1 read_time: ... expression_serialization_version: 10 } ``` Reviewers: hsunder, fizaa Reviewed By: hsunder Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D42499

Commit:52086de
Author:Aashir Tyagi
Committer:Aashir Tyagi

[#26272] CDC: Populate wal_status in pg_replication_slots Summary: Currently the `wal_status` field is hardcoded to `lost` in the `pg_replication_slots` view. With this revision we set `wal_status` to `lost`, i.e., declare the stream expired, if the CDC stream has not been consumed (via the GetChanges RPC) within the past `cdc_intent_retention_ms` duration (a TServer GFlag); otherwise we set it to `reserved`. The stream's last_active_time is determined the same way it is determined for checking whether a replication slot is active or inactive. **Upgrade/Rollback safety:** This diff adds a field `expired` to the message PgReplicationSlotInfoPB. The new field is added to the tserver response (to pggate) proto. This is upgrade and rollback safe. No new flags are added to guard the feature. Jira: DB-15617 Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testActivePidAndWalStatusPopulationOnStreamRestart' ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testWalStateLost' Reviewers: skumar, vkushwaha, sumukh.phalgaonkar, utkarsh.munjal Reviewed By: sumukh.phalgaonkar Subscribers: sumukh.phalgaonkar, yql Differential Revision: https://phorge.dev.yugabyte.com/D42239
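A minimal sketch of the expiry rule described above, assuming the decision reduces to comparing the time since the last GetChanges call against the retention window; the function name and types are illustrative, not the actual CDC code.
```lang=cpp
#include <chrono>
#include <string>

// Illustrative sketch: a slot is reported as "lost" (expired) if GetChanges has not
// been called within the intent retention window, otherwise as "reserved".
std::string ComputeWalStatus(std::chrono::steady_clock::time_point last_active_time,
                             std::chrono::milliseconds cdc_intent_retention,
                             std::chrono::steady_clock::time_point now) {
  return (now - last_active_time) > cdc_intent_retention ? "lost" : "reserved";
}
```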

Commit:1322965
Author:Sfurti Sarah
Committer:Sfurti Sarah

[BACKPORT 2024.2][#25818] YSQL: Modify yb_servers() function to contain universe_uuid value Summary: Original commit: d8a8553fadc1dbd20f9809fa45016c99a0f83865 / D41512 Modify yb_servers() function to contain universe_uuid value. This is done to identify different clusters from the smart driver. Sample output : ``` yugabyte=# select * from yb_servers(); host | port | num_connections | node_type | cloud | region | zone | public_ip | uuid | universe_uuid -----------+------+-----------------+-----------+--------+-------------+-------+-----------+----------------------------------+----------------------------- --------- 127.0.0.1 | 5433 | 0 | primary | cloud1 | datacenter1 | rack1 | | 15f0bdffcbc2414f9f290b6568da5263 | a3a1bbd8-15f5-4ec5-a6cf-8891 47c68fd9 127.0.0.3 | 5433 | 0 | primary | cloud1 | datacenter1 | rack1 | | 8bc22e6aeb3744ca95318d0477b9beb8 | a3a1bbd8-15f5-4ec5-a6cf-8891 47c68fd9 127.0.0.2 | 5433 | 0 | primary | cloud1 | datacenter1 | rack1 | | 9e4775bddd5742dd9becb10cc930d76c | a3a1bbd8-15f5-4ec5-a6cf-8891 47c68fd9 (3 rows) ``` **Upgrade/Rollback safety:** Since only pg_client.proto is changed, which is primarily meant for pg to local tserver RPCs, there shouldn't be any upgrade/rollback issues. Jira: DB-15115 Test Plan: Test PR up for review - https://github.com/yugabyte/driver-examples/pull/49 Reviewers: asaha Reviewed By: asaha Subscribers: hdaryani, hbhanawat, ashetkar Differential Revision: https://phorge.dev.yugabyte.com/D42572

Commit:c970d75
Author:Basava
Committer:Basava

[#26127] DocDB: Invalidate db table cache on release of exclusive object locks Summary: Commit https://github.com/yugabyte/yugabyte-db/commit/47c5cf2fc56ebd0d7deb8b8a06731fd63c07adbc introduced a mechanism for the release of exclusive object locks that are taken as part of DDL execution. This revision builds on top of the above path and adds a mechanism to invalidate the db table cache on all tservers as part of the exclusive locks release, essentially invalidating the cache early as opposed to waiting for the subsequent tserver-master heartbeat to do it. Note - This invalidation uses the existing tserver cache invalidation mechanism, and so the master ends up sending a full update for all the dbs. This can be optimized to send the version of the db touched by the ddl alone. This work is tracked by https://github.com/yugabyte/yugabyte-db/issues/26451. **Upgrade/Downgrade safety** 1. `ReleaseObjectLockRequestPB` & `SysObjectLockEntryPB` are protected by a test flag which is disabled by default. 2. `YsqlCatalogVersionPB` & `DBCatalogVersionDataPB` were moved to a different file but were not modified. Jira: DB-15454 Test Plan: Jenkins: urgent ./yb_build.sh --cxx-test='TEST_F(ExternalObjectLockTest, ExclusiveLockReleaseInvalidatesCatalogCache) {' ./yb_build.sh --cxx-test='TEST_F(ExternalObjectLockTest, ConsecutiveAltersSucceedWithoutCatalogVersionIssues) {' 1. The first test fails when the table locking feature is disabled. The takeaway is that with the table locking feature, statements wouldn't hit the retry path due to a schema/catalog version mismatch, because we end up invalidating the catalog cache early (exclusive object lock release path as opposed to tserver-master heartbeat path). 2. Without the feature, the second test occasionally fails with the following error when a higher value for the tserver-master heartbeat interval is used. With changes in this revision, the test always passes. ``` yugabyte=# alter table t1 add column c4 int; ERROR: duplicate key value violates unique constraint "pg_attribute_relid_attnum_index" CONTEXT: Catalog Version Mismatch: A DDL occurred while processing this query. Try again. ``` Reviewers: amitanand, zdrudi, rthallam, myang Reviewed By: zdrudi, myang Subscribers: sanketh, ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D42539

Commit:b9bb8cd
Author:Timur Yusupov
Committer:Timur Yusupov

[#24113] YSQL, DocDB: Block-based ANALYZE sampling support for non-colocated tables Summary: Implemented a block-based sampling method for ANALYZE statement support for non-colocated tables. This work is built on top of D38082 / 441795614c197ba9ed3e0baab625b055e6ddff7d and enhances the same approach. Added `SampleRowsPickerIf` which is implemented by the following classes: - `SampleRowsPicker` - requests the tserver to pick and return rows for the sample. Based on the algorithm selected (YsqlSamplingAlgorithm), the tserver may use block-based sampling for colocated tables internally and transparently to the YSQL side. - `TwoStageSampleRowsPicker` - implements the two-stage process described below. During the 1st stage we collect sample blocks sequentially across all tablets using a reservoir algorithm. This stage is performed by the `SampleBlocksPicker` class. It runs `PgsqlReadOperation::ExecuteSampleBlockBased` sequentially on all tablets of the table being sampled. PgsqlSamplingStatePB is passed from the operation performed at the previous tablet to the operation for the next tablet, and the new fields `PgsqlSamplingStatePB::num_blocks_collected` and `PgsqlSamplingStatePB::num_blocks_processed` are updated. Also, boundaries of the picked sample blocks are returned in the response and collected by `PgDocResult::ProcessSampleBlockBoundaries`. The new field `PgsqlReadRequestPB::sample_blocks` is kept empty during the 1st stage. After that we run the 2nd stage where we collect rows for the final sample by sampling the blocks collected during the 1st stage. This stage is performed by the `SampleRowsPicker` class. It also runs sequentially on all table tablets which intersect with the sample blocks picked at the 1st stage. `PgsqlReadRequestPB::sample_blocks` is populated with the sample block boundaries selected at the 1st stage which relate to the tablet this operation is intended for. `PgsqlSamplingStatePB::is_blocks_sampling_stage` has been added to define the stage for the current operation. As a part of this revision the following changes have also been made: - Extracted logic related to sampling from `PgDocReadOp` into `PgDocSampleOp`. The sample picker classes now use `PgDocSampleOp`. - Updated QLRocksDBStorage to store encoded_partition_bounds_ of the respective tablet. - Updated `QLRocksDBStorage::GetSampleBlocks` to support passing the blocks sampling state from tablet to tablet. **Upgrade/Rollback safety:** - Added the `yb_allow_separate_requests_for_sampling_stages` GUC and PG flag which allows using the new API (disabled by default, see https://github.com/yugabyte/yugabyte-db/issues/26366). - Kept the old version of `PgsqlReadOperation::DEPRECATED_ExecuteSampleBlockBasedColocated` for the colocated use case, which executes both stages in one run, as opposed to the new unified `PgsqlReadOperation::ExecuteSampleBlockBased` function which is used for both colocated and non-colocated use cases when `yb_allow_separate_requests_for_sampling_stages` is true. 
Preliminary perf testing results showed the following improvements in ANALYZE query latency: - 50M rows (30GB) table: ~4x (23.6 vs 5.9 seconds) for range sharding and ~6x (22.4 vs 3.7 seconds) for hash sharding - 100M rows (60GB) table: ~3.3x (83 vs 25 seconds) for range sharding and ~4x (96 vs 23.5 seconds) for hash sharding - 200M rows (120GB) table: ~6x (288 vs 47 seconds) for range sharding and ~8x (333 vs 40 seconds) for hash sharding Jira: DB-13002 Test Plan: - PgAnalyzeTest.AnalyzeSamplingNonColocated Reviewers: amartsinchyk, arybochkin Reviewed By: amartsinchyk, arybochkin Subscribers: yql, ybase Tags: #jenkins-ready, #jenkins-trigger Differential Revision: https://phorge.dev.yugabyte.com/D41764
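The 1st stage described above is essentially reservoir sampling over the stream of data blocks seen while walking the tablets, with the counters carried forward in the sampling state from tablet to tablet. A minimal, self-contained sketch of that reservoir step follows; the struct and function names are illustrative stand-ins, not the actual PgsqlSamplingStatePB handling.
```lang=cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-in for the per-request sampling state that is carried from one
// tablet's operation to the next during the block-collection stage.
struct BlockSamplingState {
  uint64_t num_blocks_processed = 0;  // Blocks seen so far across all tablets.
  std::vector<std::pair<std::string, std::string>> sample_blocks;  // Picked block boundaries.
};

// Classic reservoir sampling over a stream of blocks: keep the first `target` blocks,
// then replace a random slot with probability target / num_blocks_processed.
void OfferBlock(BlockSamplingState* state, std::size_t target,
                const std::string& lower_bound, const std::string& upper_bound,
                std::mt19937_64* rng) {
  ++state->num_blocks_processed;
  if (state->sample_blocks.size() < target) {
    state->sample_blocks.emplace_back(lower_bound, upper_bound);
    return;
  }
  std::uniform_int_distribution<uint64_t> dist(0, state->num_blocks_processed - 1);
  uint64_t slot = dist(*rng);
  if (slot < target) {
    state->sample_blocks[slot] = {lower_bound, upper_bound};
  }
}
```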

Commit:47c5cf2
Author:Basava
Committer:Basava

[#26026] DocDB: Handle Acquire/Release of DDL/exclusive object lock requests Summary: For DDLs, the object locks are routed to the master leader, which then forwards the lock request to all the tservers. These locks should be released once the DDL finishes and the corresponding schema changes have been applied at the docdb layer. TODO: Enforce catalog cache refresh on this release path, so that queued new DMLs blocked behind an in-progress DDL operate on the new schema once the DDL finishes. **Upgrade/Downgrade safety** No changes to proto messages/fields; only a comment was edited. No upgrade/downgrade implications. Jira: DB-15356 Test Plan: Jenkins ./yb_build.sh --cxx-test pgwrapper_pg_object_locks-test Reviewers: amitanand, zdrudi, rthallam, myang Reviewed By: amitanand, zdrudi Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D42173

Commit:707211c
Author:Aashir Tyagi
Committer:Aashir Tyagi

[#25414] CDC: Populate fields in various replication slot views Summary: The `active_pid` field in the `pg_replication_slots` view was always null. This was due to the fact that we were never storing or updating the pid for any slot. Now with this revision we are storing `active_pid` for each slot in the `cdc_state` table. The new field introduced is present under the data column: active_pid (uint64). We have used the RPCs for initialising the virtual wal (`YBCInitVirtualWal`) and destroying it (`YBCDestroyVirtualWal`) to update the value of `active_pid` in the `cdc_state` table. When the virtual wal is initialised we set the `active_pid` to `MyProcPid`, and whenever it is destroyed we set it to 0. The `backend_xmin` field in the `pg_stat_replication` view was not being populated. With this revision we are using the `xmin` value stored in the `cdc_state` table to fill it. The `state` field in the `pg_stat_replication` view was always set to `catchup`. With this revision we always set the value of `state` to `streaming`. The reason is that the other two walsender states, `catchup` and `stopping`, will never be reached: when the walsender process goes down we just remove it, so the `stopping` state is not possible, and since this entry is only populated when a walsender is active and we use the walsender for streaming only, the `catchup` state is not possible either. The three lag metrics (`flush_lag`, `write_lag` and `replay_lag`) in the `pg_stat_replication` view were not being populated. With this revision we are setting the value of these metrics to the value of `cdcsdk_flush_lag`. For this, a new RPC `YBCGetLagMetrics` has been introduced which, based on the `stream_id`, gets the value of the lag metric. **Upgrade / rollback safety:** This diff introduces a new RPC `GetLagMetrics`. This rpc will always be sent from PG to the local tserver, hence an auto flag is not needed. This diff adds a field `active_pid` to the message `PgReplicationSlotInfoPB`. The new field is added to the tserver response (to pggate) proto. When the value is absent, master (which is upgraded first) is expected to fill in the appropriate default value. This is upgrade and rollback safe. No new flags are added to guard the feature. Jira: DB-14647 Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testActivePidNull' ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testActivePidPopulationOnStreamRestart' ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testActivePidPopulationFromDifferentTServers' ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testBackendXminAndStatePopulation' Manual testing was performed to ensure correct population of lag metrics in the pg_stat_replication view by comparing the output with the value of cdcsdk_flush_lag. Reviewers: skumar, vkushwaha, utkarsh.munjal, sumukh.phalgaonkar, stiwary Reviewed By: utkarsh.munjal, sumukh.phalgaonkar Subscribers: yql, ybase, ycdcxcluster Differential Revision: https://phorge.dev.yugabyte.com/D41819
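A tiny sketch of the active_pid lifecycle described above, with an in-memory map standing in for the cdc_state table; names are illustrative, and the real code updates the table from the InitVirtualWal/DestroyVirtualWal paths.
```lang=cpp
#include <cstdint>
#include <map>
#include <string>

// Illustrative in-memory stand-in for the per-slot metadata kept in the cdc_state table.
struct SlotState {
  uint64_t active_pid = 0;  // 0 means no walsender currently owns the slot.
};

std::map<std::string, SlotState> slots;  // keyed by stream/slot id

void OnInitVirtualWal(const std::string& slot, uint64_t my_proc_pid) {
  slots[slot].active_pid = my_proc_pid;  // Mirrors setting active_pid to MyProcPid.
}

void OnDestroyVirtualWal(const std::string& slot) {
  slots[slot].active_pid = 0;  // Slot is no longer active.
}
```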

Commit:46599b1
Author:Mark Lillibridge
Committer:Mark Lillibridge

[BACKPORT 2.25.1][#26190] xCluster: Add ability to flush TServer OID caches, bump up source normal-space OID counter on switchover Summary: This is a squash of two commits that are being backported together. Original commit: f10f9854d1d4d6c2b5a171291e522b33c740c4a8 / D42225 Other commit: rYBDBef44cc08c522979acc97f0c0330ab8bafb86508f / D42185 No conflicts when backporting. Info for D42225: Here we add a boolean to the master RPC ReservePgsqlOids that allows requesting all TServer OID caches be invalidated before new OIDs are reserved. This forces TServers to start using OIDs after the next_oid parameter of the RPC call. We need this for correctly handing OID preservation with xCluster automatic mode -- we need to bump up OID counters in some cases and ensure that no future use of earlier OIDs occurs. Note that this mechanism does not invalidate OID caches on masters (we are currently using pg_client_service, which is where this OID cache lives, on Masters only for major YSQL upgrades); this limitation is fine for what we need it for for xCluster. Implementation: * we add new persistent counter to SysClusterConfigEntryPB: ``` // This field is bumped to invalidate all the TServer OID caches. optional uint32 oid_cache_invalidations_count = 9 [default = 0]; ``` * bumping this causes all the pg_class_service's running on TServers to discard their previous cache contents * the ReservePgsqlOids RPC increments this counter if invalidation is requested * it returns the current value of the counter afterwards, which pg_client_service stores next to the OIDs returned in its cache * the heartbeat response from master includes a new field with current value of this counter: ``` optional uint32 oid_cache_invalidations_count = 28; ``` * the value in the heartbeat response is saved in a atomic variable in the tablet server * when it comes time to allocate a OID, pg_client_service compares the invalidation count of its cache with the tablet server; if it's count is lower than it discards its cache Technicalities: * the tablet server actually stores the maximum count received * it initially starts at zero and if for whatever reason the counter is not available on the master then it heartbeats 0, which does not disturb things Upgrade/rollback safety: * this adds a optional persistent field to SysClusterConfigEntryPB as well as optional fields to the request and response for the ReservePgsqlOids RPC * regardless, we are not using an auto flag for protection * RPC: old->new * request defaults to not asking for invalidating, returned count is ignored * RPC: new->old * new will never request invalidation here because this feature will first be used with xCluster automatic mode replication which will not be available until after finalizing upgrading to new * the absence of a returned count will be interpreted as 0, which cannot cause invalidation * heartbeat: old master, new TServer * the absence of a returned count will be interpreted as 0, which cannot cause invalidation * heartbeat: new master, old TServer * returned count is ignored * (after rollback) old master code will ignore the new SysClusterConfigEntryPB field Jira: DB-15535 Info for D42185: With automatic mode xCluster replication, we need to bump up the (original) source database normal space OID counter during switchover. In this diff we do that when we drop the replication. 
Why we need to do this is explained in the new test's comment: ``` TEST_F(XClusterDDLReplicationSwitchoverTest, SwitchoverBumpsAboveUsedOids) { // To understand this test, it helps to picture the result of A->B replication before we do a // switchover. The following is an example of the OID spaces of A and B for one database after A // has allocated three OIDs we don't care about preserving (the Ns) and one OID we do care about // preserving (P). The [OID ptr]'s indicate where the next OID would be allocated in each space // modulo we skip OIDs already in use in that space on that universe. // // In particular, the next normal space OID that will be allocated on B is the one marked (*), // which conflicts with an OID already in use in cluster A. While not a problem while B is a // target (targets only allocate in the secondary space), this will be a problem if we switch the // replication direction so B is now the source. // // Accordingly, xCluster is designed to bump up B's normal space [OID ptr] to after A's normal // space [OID ptr] as part of doing switchover; this test attempts to verify that that successfully // avoids the OID conflict problem described above. // // A: B: // Normal: // N [OID ptr] (*) // N // P P // N // [OID ptr] // // Secondary: // [OID ptr] N // N // N // [OID ptr] ``` Implementation: * dropping the original direction replication is done by switchover by calling DeleteOutboundReplicationGroup on the source universe * we modify this to get the current normal space OID counters for each namespace in the replication group * we do this by simply allocating new OIDs * this information is then passed to the target universe in the DeleteUniverseReplication RPC using a new field: ``` ~/code/yugabyte-db/src/yb/master/master_replication.proto: // producer_namespace_id -> oid_to_bump_above map<string,uint32> producer_namespace_oids = 5; } ``` * on the target, the RPC handling code then does the bumping of the replication group's namespaces before actually proceeding * in the process, it needs to translate between source and target namespace IDs, which can differ across universes Other: * the number of OIDs prefetched from master at a time is now exposed as a new gflag * this allows changing it in the test Upgrade/Rollback safety: * in this diff, we add an optional field to an RPC * because automatic mode is first becoming available in this release (2.25.1), it is impossible to upgrade to this code while an automatic xCluster replication is running * YBA does not allow setting up automatic replication while doing this upgrade * the use of the RPC field is gated on the replication being dropped being in automatic mode; thus the RPC field will not be used before the code is available * the absence of the RPC field on the target causes no behavior changes * summing up, there is no need for an auto flag here and we do not provide one Jira: DB-15535 Test Plan: Plan for D42225: ``` ybd --cxx-test xcluster_secondary_oid_space-test --gtest_filter '*.CacheInvalidation' ``` Plan for D42185: A new test, XClusterDDLReplicationSwitchoverTest.SwitchoverBumpsAboveUsedOids, verifies that these changes solves the problem in question. It has been verified to fail if the bumping is not done. ``` ybd --cxx-test xcluster_ddl_replication-test --gtest_filter '*.SwitchoverBumpsAboveUsedOids' ``` Reviewers: xCluster, jhe Reviewed By: jhe Subscribers: ybase, yql, jhe Differential Revision: https://phorge.dev.yugabyte.com/D42461

Commit:83d3cca
Author:Saurav Tiwary
Committer:Saurav Tiwary

[#3109] YSQL: Introduce MVP support for running DDLs within regular txn block Summary: Before this revision, DDLs run within their own autonomous transaction and autocommit at the end of the statement. This revision introduces support to run DDL statements within the regular transaction block i.e. without creating autonomous transactions. This support builds upon the existing rollback support added as part of the DDL atomicity project. Earlier, the autonomous transactions were started upon receiving a DDL statement and committed right after the execution. With the unification, there won't be any separate transaction for DDLs. Instead, they will execute as part of the regular transaction (aka plain) which are created lazily upon the first write. A notable exception to this support is the `CREATE INDEX` command which is by default executed as an online schema migration i.e. `CREATE INDEX CONCURRENTLY`. This statement still creates its own autonomous transaction because online schema migration divides the DDL into several steps which have to be committed separately and the changes made by a single statement should be visible to all other backends. Left for follow up revisions: 1. Handling savepoints in transactions containing DDLs 2. Table rewrites such as REINDEX, TRUNCATE are supported but need to be verified via additional unit tests in a follow-up. Note that `REINDEX CONCURRENTLY` isn't supported in YSQL currently. So only the non-concurrent version of `REINDEX` is being considered here. 3. More tests - DDLs from functions & procedures. DDL vs DDL/DML concurrency etc. 4. Promotion of transaction to global if a DDL is encountered. Need to validate that the auto promotion of transaction from local to global works with DDLs and if it doesn't, then will add code to do that. Left for follow up. **Upgrade/Downgrade safety:** The changes are protected via test flag `TEST_ysql_yb_ddl_transaction_block_enabled`. Jira: DB-1839 Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestDdlTransactionBlocks' Reviewers: myang, pjain, aagrawal, skumar, hsunder Reviewed By: myang, pjain Subscribers: sanketh, jason, yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D41998

Commit:d8a8553
Author:Sfurti Sarah
Committer:Sfurti Sarah

[#25818] YSQL: Modify yb_servers() function to contain universe_uuid value Summary: Modify yb_servers() function to contain universe_uuid value. This is done to identify different clusters from the smart driver. Sample output : ``` yugabyte=# select * from yb_servers(); host | port | num_connections | node_type | cloud | region | zone | public_ip | uuid | universe_uuid -----------+------+-----------------+-----------+--------+-------------+-------+-----------+----------------------------------+----------------------------- --------- 127.0.0.1 | 5433 | 0 | primary | cloud1 | datacenter1 | rack1 | | 15f0bdffcbc2414f9f290b6568da5263 | a3a1bbd8-15f5-4ec5-a6cf-8891 47c68fd9 127.0.0.3 | 5433 | 0 | primary | cloud1 | datacenter1 | rack1 | | 8bc22e6aeb3744ca95318d0477b9beb8 | a3a1bbd8-15f5-4ec5-a6cf-8891 47c68fd9 127.0.0.2 | 5433 | 0 | primary | cloud1 | datacenter1 | rack1 | | 9e4775bddd5742dd9becb10cc930d76c | a3a1bbd8-15f5-4ec5-a6cf-8891 47c68fd9 (3 rows) ``` **Upgrade/Rollback safety:** Since only pg_client.proto is changed, which is primarily meant for pg to local tserver RPCs, there shouldn't be any upgrade/rollback issues. Jira: DB-15115 Test Plan: Test PR up for review - https://github.com/yugabyte/driver-examples/pull/49 Reviewers: asaha, ashetkar Reviewed By: asaha Subscribers: svc_phabricator, yql, hdaryani, hbhanawat, ybase Differential Revision: https://phorge.dev.yugabyte.com/D41512

Commit:3544ddc
Author:Minghui Yang
Committer:Minghui Yang

[#23785] YSQL: Incrementally refresh PG backend catcaches (part 5) Summary: In this diff, I made changes so that PG backend can do incremental catalog cache refresh using invalidation messages. When a PG backend detects that a newer catalog version has arrived in shared memory, currently it does a full catalog cache refresh. I inserted steps before it invokes YBRefreshCache: (1) PG calls new function `YBCGetTserverCatalogMessageLists` that performs an RPC to the local tserver to retrieve the message lists that reflect the delta between the PG backend's local catalog version and the shared memory catalog version. For example, if the local catalog version is x, and the shared memory catalog version is y, then in the happy case `YBCGetTserverCatalogMessageLists` will return invalidation messages associated with catalog version `x + 1, x + 2, ..., x + k` (where k = y - x). (2) PG calls `YbApplyInvalidationMessages` which attempts to apply the invalidation messages retrieved above. These messages will be applied transactionally (all or none). If all the messages can be successfully applied, then we can skip `YBRefreshCache`. If any of the messages cannot be applied, then the incremental catalog cache refresh optimization has failed and `YBRefreshCache` is invoked just as before as a fall back. Two new metrics are added to count the number of full catalog cache refreshes and incremental catalog cache refreshes. A new unit test is added to verify that the incremental catalog cache refresh does happen. I moved some existing data structure `YsqlMetric` and two functions `ParsePrometheusMetrics` and `ParseJsonMetrics` to `LibPqTestBase` so that they can also be used in `PgCatalogVersionTest`. **Upgrade/Rollback safety:** The src/yb/tserver/pg_client.proto change is only used in PG -> tserver communication which is upgrade safe. The src/yb/common/common.proto does not change any existing proto message. New message and RPC should not be used when the upgrade has completed. They are added to support the PG -> tserver communication. The existing infrastructure (e.g., `get_ysql_db_oid_to_cat_version_info_map`) needs to have a RPC API on `TabletServerIf` and master tablet server is also a subclass of `TabletServerIf` and that's why the API is needed for both tserver and master. In the worst case the RPC will fail with "RPC Not implemented" error if the RPC is made to an old master leader, or fail with "Unexpected call" to a new master leader. This is ok and does not cause any correctness issues. Test Plan: ./yb_build.sh --cxx-test pg_catalog_version-test --gtest_filter PgCatalogVersionTest.InvalMessageCatCacheRefreshTest ./yb_build.sh --cxx-test pg_catalog_version-test --gtest_filter PgCatalogVersionTest.InvalMessageQueueOverflowTest Reviewers: kfranz, sanketh, mihnea Reviewed By: kfranz Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D42274
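A minimal sketch of the apply-or-fall-back step described above, assuming the backend has the message lists keyed by catalog version: if any version between the local and shared versions is missing, or any message cannot be applied, the full refresh is used instead. The callback-based shape here is an illustrative assumption, not the actual pggate API.
```lang=cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Illustrative: invalidation messages keyed by the catalog version they belong to.
using MessagesByVersion = std::map<uint64_t, std::vector<std::string>>;

// Returns the new local catalog version after either an incremental or a full refresh.
uint64_t RefreshCatalogCache(
    uint64_t local_version, uint64_t shared_version, const MessagesByVersion& messages,
    const std::function<bool(const std::string&)>& apply_message,  // false => cannot apply
    const std::function<void()>& full_refresh) {
  // The incremental path only works with a contiguous message history for every
  // version between local_version + 1 and shared_version.
  for (uint64_t v = local_version + 1; v <= shared_version; ++v) {
    auto it = messages.find(v);
    if (it == messages.end()) {
      full_refresh();                 // Gap in the history: fall back.
      return shared_version;
    }
    for (const auto& msg : it->second) {
      if (!apply_message(msg)) {
        full_refresh();               // Any unusable message: fall back.
        return shared_version;
      }
    }
  }
  return shared_version;              // Incremental refresh succeeded; full refresh skipped.
}
```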

Commit:5275733
Author:timothy-e
Committer:timothy-e

[#25976] YSQL: Send serialized expression version in DocDB request Summary: Send `expression_serialization_version` in requests to DocDB, equal to either the current version or the major version compatibility mode. Docdb passes down the version number until the spot where the expression is deserialized, so that we can properly handle each possible case. Additionally, before this diff, `GetCurrentPgVersion` used substrings + stroi to get the current version from the version string. Using PG_MAJORVERSION_NUM gives just the number we are interested in, allowing some code to be removed and eliminating the need for some of the error handling. **Upgrade/Rollback safety:** DocDB doesn't require this field, so it is upgrade safe. Jira: DB-15310 Test Plan: Apply the patch ```lang=diff diff --git src/postgres/src/backend/executor/ybExpr.c src/postgres/src/backend/executor/ybExpr.c index b220d83d20..1f79062c9e 100644 --- src/postgres/src/backend/executor/ybExpr.c +++ src/postgres/src/backend/executor/ybExpr.c @@ -499,8 +499,7 @@ YbCanPushdownExpr(Expr *pg_expr, List **colrefs) * Respond with false if pushdown disabled in GUC, or during a YSQL major * upgrade. */ - if (!yb_enable_expression_pushdown || - yb_major_version_upgrade_compatibility > 0) + if (!yb_enable_expression_pushdown) return false; return !yb_pushdown_walker((Node *) pg_expr, colrefs); diff --git src/yb/yql/pgwrapper/pg_wrapper.cc src/yb/yql/pgwrapper/pg_wrapper.cc index 4da11001b6..8dd5417e60 100644 --- src/yb/yql/pgwrapper/pg_wrapper.cc +++ src/yb/yql/pgwrapper/pg_wrapper.cc @@ -334,7 +334,6 @@ DEFINE_NON_RUNTIME_PG_PREVIEW_FLAG(bool, yb_enable_query_diagnostics, false, DEFINE_RUNTIME_PG_FLAG(int32, yb_major_version_upgrade_compatibility, 0, "The compatibility level to use during a YSQL Major version upgrade. Allowed values are 0 and " "11."); -DEFINE_validator(ysql_yb_major_version_upgrade_compatibility, FLAG_IN_SET_VALIDATOR(0, 11)); DECLARE_bool(enable_pg_cron); ``` This removes validation from `ysql_yb_major_version_upgrade_compatibility` and allows expression pushdown to still occur when compatibility mode is enabled. This will be possible when some pushdowns are allowed (#25973). ```lang=sql CREATE TABLE t (a INT); INSERT INTO t VALUES (1); ``` **Basic case (PG15 version)** ```lang=sql SET yb_debug_log_docdb_requests = true; SELECT * FROM t WHERE a = 1; ``` ``` Applying operation: { READ active: 1 read_time: ... expression_serialization_version: 15 } ``` **Compatilbility Case** ./build/latest/bin/yb-ts-cli set_flag --server_address 127.0.0.1:9100 --force ysql_yb_major_version_upgrade_compatibility 11 ```lang=sql SET yb_debug_log_docdb_requests = true; SELECT * FROM t WHERE a = 1; ``` ``` Applying operation: { READ active: 1 read_time: ... expression_serialization_version: 11 } ``` **Invalid Version Case** ./build/latest/bin/yb-ts-cli set_flag --server_address 127.0.0.1:9100 --force ysql_yb_major_version_upgrade_compatibility 10 ```lang=sql SET yb_debug_log_docdb_requests = true; SELECT * FROM t WHERE a = 1; ERROR: Unsupported expression version: 10 ``` ``` Applying operation: { READ active: 1 read_time: ... expression_serialization_version: 10 } ``` Reviewers: hsunder, fizaa Reviewed By: fizaa Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D42054

Commit:88e84d7
Author:Anton Rybochkin
Committer:Anton Rybochkin

[#26338] docdb: Vector index compaction support in yb-ts-cli Summary: The change implements a new command `yb-ts-cli compact_vector_index` to allow compaction of a particular tablet's vector indexes. The format: ``` yb-ts-cli compact_vector_index <tablet_id> [<vector_index_id1> <vector_index_id2> ...] ``` **Upgrade/Rollback safety:** It is safe to upgrade and rollback as the feature is not released. Also, the values, if present, are ignored by old releases, which is acceptable behaviour (same as without the change). Jira: DB-15687 Test Plan: Jenkins Reviewers: sergei, slingam Reviewed By: sergei Subscribers: svc_phabricator, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D42413

Commit:42b8ce8
Author:Sergei Politov
Committer:Sergei Politov

[#19447] DocDB: Fix revoking multiple leader leases Summary: We could get into a situation where the leader changed twice in a short period of time. Consider 3 leaders in chronological order - L1, L2, L3. L1 could still have a leader lease before L2 became leader. And there are scenarios where L2 did not revoke L1's lease before L3 became the leader. Since we kept only a single old leader lease, L3 could decide that the old leader does not have a lease once L2's lease was revoked. Fixed by keeping multiple old leader leases and revoking them individually. **Upgrade/Rollback safety:** Optional fields were replaced with repeated fields. Old software versions will see only the latest element of those fields, so we put the most advanced lease in the last position. Jira: DB-8238 Test Plan: QLTabletTest.LeaderLeaseRevocation Reviewers: hsunder, patnaik.balivada Reviewed By: hsunder, patnaik.balivada Subscribers: ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D42229
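A minimal sketch of the bookkeeping the fix implies: keep a list of old leader leases, revoke them individually, and keep the most advanced lease last so an old binary that reads only the last repeated element still sees the strongest outstanding lease. Types and function names are illustrative, not the actual consensus code.
```lang=cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct OldLeaderLease {
  std::string holder_uuid;   // Previous leader that may still hold a lease.
  uint64_t expiration_us;    // When the lease expires if never explicitly revoked.
};

// Keep the list sorted so the most advanced (latest-expiring) lease is last; old
// software versions that look only at the last repeated element then see the
// strongest outstanding lease.
void AddOldLeaderLease(std::vector<OldLeaderLease>* leases, OldLeaderLease lease) {
  leases->push_back(std::move(lease));
  std::sort(leases->begin(), leases->end(),
            [](const OldLeaderLease& a, const OldLeaderLease& b) {
              return a.expiration_us < b.expiration_us;
            });
}

// Revoke a single previous leader's lease without dropping the others.
void RevokeOldLeaderLease(std::vector<OldLeaderLease>* leases, const std::string& holder) {
  leases->erase(std::remove_if(leases->begin(), leases->end(),
                               [&](const OldLeaderLease& l) { return l.holder_uuid == holder; }),
                leases->end());
}

// The new leader may act as if no old lease exists only once every old lease is
// revoked or expired.
bool AllOldLeasesCleared(const std::vector<OldLeaderLease>& leases, uint64_t now_us) {
  return std::all_of(leases.begin(), leases.end(),
                     [&](const OldLeaderLease& l) { return l.expiration_us <= now_us; });
}
```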

Commit:12d3543
Author:Anton Rybochkin
Committer:Anton Rybochkin

[#26332] docdb: Vector index compaction support in yb-admin Summary: The code base is updated to allow vector index compaction to be triggered by the following commands: ``` yb-admin compact_table <vector index name> yb-admin compact_table_by_id <vector index id> ``` These commands trigger only vector index compaction; no compaction is triggered for the indexable table. On the other hand, triggering compaction on the indexable table will not trigger compaction on vector indexes -- this approach is selected to prevent undesirable effects of long vector index compactions (because they are not yet optimized). Additionally, `yb-admin flush_table <vector index>` is changed to trigger a flush only for the vector index; no flushes are triggered for the indexable table. Previously, the given command acted as if the flush was triggered for the indexable table. **Upgrade/Rollback safety:** It is safe to upgrade and rollback as the feature is not released. Also, the values, if present, are ignored by old releases, which is acceptable behaviour (same as without the change). Jira: DB-15678 Test Plan: Jenkins Reviewers: sergei, slingam Reviewed By: sergei Subscribers: ybase Differential Revision: https://phorge.dev.yugabyte.com/D42404

Commit:ef44cc0
Author:Mark Lillibridge
Committer:Mark Lillibridge

[#26190] xCluster: bump up source normal-space OID counter on switchover Summary: With automatic mode xCluster replication, we need to bump up the (original) source database normal space OID counter during switchover. In this diff we do that when we drop the replication. Why we need to do this is explained in the new test's comment: ``` TEST_F(XClusterDDLReplicationSwitchoverTest, SwitchoverBumpsAboveUsedOids) { // To understand this test, it helps to picture the result of A->B replication before we do a // switchover. The following is an example of the OID spaces of A and B for one database after A // has allocated three OIDs we don't care about preserving (the Ns) and one OID we do care about // preserving (P). The [OID ptr]'s indicate where the next OID would be allocated in each space // modulo we skip OIDs already in use in that space on that universe. // // In particular, the next normal space OID that will be allocated on B is the one marked (*), // which conflicts with an OID already in use in cluster A. While not a problem while B is a // target (targets only allocate in the secondary space), this will be a problem if we switch the // replication direction so B is now the source. // // Accordingly, xCluster is designed to bump up B's normal space [OID ptr] to after A's normal // space [OID ptr] as part of doing switchover; this test attempts to verify that that successfully // avoids the OID conflict problem described above. // // A: B: // Normal: // N [OID ptr] (*) // N // P P // N // [OID ptr] // // Secondary: // [OID ptr] N // N // N // [OID ptr] ``` Implementation: * dropping the original direction replication is done by switchover by calling DeleteOutboundReplicationGroup on the source universe * we modify this to get the current normal space OID counters for each namespace in the replication group * we do this by simply allocating new OIDs * this information is then passed to the target universe in the DeleteUniverseReplication RPC using a new field: ``` ~/code/yugabyte-db/src/yb/master/master_replication.proto: // producer_namespace_id -> oid_to_bump_above map<string,uint32> producer_namespace_oids = 5; } ``` * on the target, the RPC handling code then does the bumping of the replication group's namespaces before actually proceeding * in the process, it needs to translate between source and target namespace IDs, which can differ across universes Other: * the number of OIDs prefetched from master at a time is now exposed as a new gflag * this allows changing it in the test Upgrade/Rollback safety: * in this diff, we add an optional field to an RPC * because automatic mode is first becoming available in this release (2.25.1), it is impossible to upgrade to this code while an automatic xCluster replication is running * YBA does not allow setting up automatic replication while doing this upgrade * the use of the RPC field is gated on the replication being dropped being in automatic mode; thus the RPC field will not be used before the code is available * the absence of the RPC field on the target causes no behavior changes * summing up, there is no need for an auto flag here and we do not provide one Jira: DB-15535 Test Plan: A new test, XClusterDDLReplicationSwitchoverTest.SwitchoverBumpsAboveUsedOids, verifies that these changes solves the problem in question. It has been verified to fail if the bumping is not done. 
``` ybd --cxx-test xcluster_ddl_replication-test --gtest_filter '*.SwitchoverBumpsAboveUsedOids' ``` Reviewers: hsunder, xCluster Reviewed By: hsunder Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D42185

Commit:d07d441
Author:Zachary Drudi
Committer:Zachary Drudi

[#25643] docdb: Move ysql lease refresh out of the TS heartbeat into its own RPC Summary: We've seen delays when processing `TSHeartbeat` requests on the master. The ysql lease needs to be refreshed promptly or tservers will stop serving user YSQL requests. To avoid `TSHeartbeat` delays causing availability issues, this diff moves the ysql lease refresh logic into a new RPC. New gflags: - `ysql_lease_refresher_rpc_timeout_ms` - timeout used for RPCs from the tserver to the master made by the ysql lease refresh poller - `ysql_lease_refresher_interval_ms` - the frequency with which the ysql lease refresh poller runs - `tserver_enable_ysql_lease_refresh` - whether to enable the ysql lease refresh poller. If this is false, the poller does nothing. **Upgrade/Rollback safety:** Code that uses the changed protos is gated behind the test flag `TEST_enable_ysql_operation_lease` which defaults to false. Jira: DB-14894 Test Plan: Existing tests in `object_lock-test` and the new master tests. ``` ./yb_build.sh release --with-tests --cxx-test object_lock-test --test-timeout-sec 120 ./yb_build.sh release --with-tests --cxx-test master-test --gtest_filter 'MasterTest.RefreshYsqlLeaseWithoutRegistration' ./yb_build.sh release --with-tests --cxx-test master-test --gtest_filter 'MasterTest.RefreshYsqlLease' ``` Reviewers: amitanand, bkolagani, esheng Reviewed By: amitanand Subscribers: rthallam, slingam, ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D42044
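A minimal sketch of a dedicated lease-refresh poller decoupled from heartbeats: it sleeps for the configured interval, issues a refresh RPC with its own timeout, and does nothing when disabled. The class and callback are illustrative assumptions; the real poller is driven by the gflags listed above.
```lang=cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Illustrative poller shaped by ysql_lease_refresher_interval_ms,
// ysql_lease_refresher_rpc_timeout_ms and tserver_enable_ysql_lease_refresh.
class YsqlLeaseRefresher {
 public:
  YsqlLeaseRefresher(std::chrono::milliseconds interval, std::chrono::milliseconds rpc_timeout,
                     bool enabled, std::function<bool(std::chrono::milliseconds)> refresh_rpc)
      : interval_(interval), rpc_timeout_(rpc_timeout), enabled_(enabled),
        refresh_rpc_(std::move(refresh_rpc)) {}

  ~YsqlLeaseRefresher() { Stop(); }

  void Start() {
    thread_ = std::thread([this] {
      while (!stop_.load()) {
        if (enabled_) {
          // Failures are tolerated; the next iteration simply retries. The master
          // decides whether the lease has actually expired in the meantime.
          refresh_rpc_(rpc_timeout_);
        }
        std::this_thread::sleep_for(interval_);
      }
    });
  }

  void Stop() {
    stop_.store(true);
    if (thread_.joinable()) thread_.join();
  }

 private:
  std::chrono::milliseconds interval_;
  std::chrono::milliseconds rpc_timeout_;
  bool enabled_;
  std::function<bool(std::chrono::milliseconds)> refresh_rpc_;
  std::atomic<bool> stop_{false};
  std::thread thread_;
};
```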

Commit:f10f985
Author:Mark Lillibridge
Committer:Mark Lillibridge

[#26190] xCluster: Add ability to flush TServer OID caches Summary: Here we add a boolean to the master RPC ReservePgsqlOids that allows requesting all TServer OID caches be invalidated before new OIDs are reserved. This forces TServers to start using OIDs after the next_oid parameter of the RPC call. We need this for correctly handling OID preservation with xCluster automatic mode -- we need to bump up OID counters in some cases and ensure that no future use of earlier OIDs occurs. Note that this mechanism does not invalidate OID caches on masters (we are currently using pg_client_service, which is where this OID cache lives, on Masters only for major YSQL upgrades); this limitation is fine for what we need for xCluster. Implementation: * we add a new persistent counter to SysClusterConfigEntryPB: ``` // This field is bumped to invalidate all the TServer OID caches. optional uint32 oid_cache_invalidations_count = 9 [default = 0]; ``` * bumping this causes all the pg_client_service's running on TServers to discard their previous cache contents * the ReservePgsqlOids RPC increments this counter if invalidation is requested * it returns the current value of the counter afterwards, which pg_client_service stores next to the OIDs returned in its cache * the heartbeat response from master includes a new field with the current value of this counter: ``` optional uint32 oid_cache_invalidations_count = 28; ``` * the value in the heartbeat response is saved in an atomic variable in the tablet server * when it comes time to allocate an OID, pg_client_service compares the invalidation count of its cache with the tablet server; if its count is lower, it discards its cache Technicalities: * the tablet server actually stores the maximum count received * it initially starts at zero and if for whatever reason the counter is not available on the master then it heartbeats 0, which does not disturb things Upgrade/rollback safety: * this adds an optional persistent field to SysClusterConfigEntryPB as well as optional fields to the request and response for the ReservePgsqlOids RPC * regardless, we are not using an auto flag for protection * RPC: old->new * request defaults to not asking for invalidation, returned count is ignored * RPC: new->old * new will never request invalidation here because this feature will first be used with xCluster automatic mode replication which will not be available until after finalizing upgrading to new * the absence of a returned count will be interpreted as 0, which cannot cause invalidation * heartbeat: old master, new TServer * the absence of a returned count will be interpreted as 0, which cannot cause invalidation * heartbeat: new master, old TServer * returned count is ignored * (after rollback) old master code will ignore the new SysClusterConfigEntryPB field Jira: DB-15535 Test Plan: ``` ybd --cxx-test xcluster_secondary_oid_space-test --gtest_filter '*.CacheInvalidation' ``` Reviewers: xCluster, jhe Reviewed By: jhe Subscribers: jhe, yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D42225
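The cache-discard rule above boils down to a monotonic counter comparison: the tablet server remembers the maximum invalidation count it has heard, the OID cache tags its contents with the count current at reservation time, and a cached OID is handed out only if that tag is still up to date. A minimal sketch, with illustrative names rather than the real pg_client_service code:
```lang=cpp
#include <atomic>
#include <cstdint>
#include <deque>
#include <optional>

// Latest oid_cache_invalidations_count heard from the master (via heartbeats or the
// ReservePgsqlOids response). The tablet server keeps the maximum value it has seen.
std::atomic<uint32_t> latest_invalidation_count{0};

void OnHeartbeatInvalidationCount(uint32_t count_from_master) {
  uint32_t cur = latest_invalidation_count.load();
  while (count_from_master > cur &&
         !latest_invalidation_count.compare_exchange_weak(cur, count_from_master)) {
    // compare_exchange_weak reloads cur on failure; retry until stored or not larger.
  }
}

// Illustrative OID cache: the real one lives in pg_client_service.
class OidCache {
 public:
  std::optional<uint32_t> NextOid() {
    // If the cluster-wide counter moved past the value this cache was filled under,
    // the cached OIDs may be below the new floor: throw them away.
    if (fill_count_ < latest_invalidation_count.load()) {
      cached_oids_.clear();
    }
    if (cached_oids_.empty()) return std::nullopt;  // Caller must reserve a new range.
    uint32_t oid = cached_oids_.front();
    cached_oids_.pop_front();
    return oid;
  }

  void Refill(uint32_t begin_oid, uint32_t count, uint32_t count_at_reservation) {
    cached_oids_.clear();
    for (uint32_t i = 0; i < count; ++i) cached_oids_.push_back(begin_oid + i);
    fill_count_ = count_at_reservation;  // Returned alongside the reserved OID range.
  }

 private:
  std::deque<uint32_t> cached_oids_;
  uint32_t fill_count_ = 0;
};
```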

Commit:ed352ba
Author:Abhinab Saha
Committer:Abhinab Saha

[#26268] YSQL: Add check to have ash_metadata in PgClient RPC Request PBs Summary: 9a11dd27422e531b6e1a7bb85d08b31764a44069 / D41326 removed this check to have ash_metadata in PgClient RPC Request PBs. This diff reverts this change and also adds more explanation in the code on how to use this. Jira: DB-15612 **Upgrade/Downgrade safety:** This only adds a field to a pg - local tserver RPC, so this should be safe to upgrade/downgrade Test Plan: Jenkins Reviewers: sergei, dmitry Reviewed By: sergei, dmitry Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D42281

Commit:83b8ea1
Author:Sergei Politov
Committer:Sergei Politov

[#25859] DocDB: Pass ybhnsw options to docdb layer Summary: Support specifying index creation options m, m0, ef_construction for ybhnsw. Standard Postgres pgvector hnsw supports the following options: index creation options: m, ef_construction; query options: ef_search. Reference: https://github.com/pgvector/pgvector?tab=readme-ov-file#index-options For ybhnsw, this diff: adds plumbing for the index creation options m and ef_construction to be passed down; adds a new index creation option m0 (used by usearch) and adds plumbing for it. It does not yet handle the query option ef_search. **Upgrade/Rollback safety:** Safe, since the modified field is used only by code that is not yet released. Jira: DB-15153 Test Plan: PgVectorIndexTest.Options/* Reviewers: tnayak, arybochkin, jason Reviewed By: tnayak Subscribers: mihnea, smishra, jason, slingam, aleksandr.ponomarenko, ybase, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D41785
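A small illustrative carrier for the three ybhnsw creation options being plumbed down; this is an assumption-level sketch, not the actual index options proto, and no default values are implied.
```lang=cpp
#include <cstdint>
#include <sstream>
#include <string>

// Illustrative holder for the ybhnsw index creation options described above; the real
// plumbing carries these through the index options down to the DocDB layer.
struct YbHnswOptions {
  uint32_t m = 0;                // Max connections per node (pgvector-style "m").
  uint32_t m0 = 0;               // Max connections at the base layer (usearch-specific).
  uint32_t ef_construction = 0;  // Candidate list size during index build.
};

// Debug rendering, e.g. for logging which options reached the storage layer.
std::string ToDebugString(const YbHnswOptions& opts) {
  std::ostringstream out;
  out << "m=" << opts.m << " m0=" << opts.m0 << " ef_construction=" << opts.ef_construction;
  return out.str();
}
```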

Commit:4789974
Author:Sergei Politov
Committer:Sergei Politov

[#24064] DocDB: Write tombstone to reverse vector index when vector is updated or removed Summary: When a vector column is updated, we generate a new vector id and add a reverse index entry to the regular DB. Then during vector index search we query the reverse index for the found vector id and check whether the existing column has the same id. It is used to understand whether this vector is valid for a particular read time. Similar logic is applied when a row is deleted: the referenced row should exist. So to check a single vector we need 2 queries to RocksDB: one to fetch the reverse index record, then a query to the main table to check whether the particular row was updated or deleted. This diff adds logic to write tombstone entries for obsolete vector ids, so we don't have to query the main table to check whether a vector is valid. **Upgrade/Rollback safety:** New fields are used only by not yet released code. Jira: DB-12955 Test Plan: PgVectorIndexTest.DeleteAndUpdate/* Reviewers: arybochkin Reviewed By: arybochkin Subscribers: ybase, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D42176
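A minimal sketch of the tombstone idea described above, using an in-memory map in place of the reverse-index RocksDB entries and ignoring MVCC/read-time handling; all names are illustrative.
```lang=cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Illustrative reverse-index entry: maps a vector id to the row (ybctid) that owns it,
// or to a tombstone meaning "this vector id is obsolete".
struct ReverseIndexEntry {
  std::optional<std::string> ybctid;  // nullopt => tombstone.
};

std::map<uint64_t, ReverseIndexEntry> reverse_index;

// Row update: the new value gets a fresh vector id; the old id is tombstoned so reads
// no longer need to consult the main table to discover that it is stale.
void OnVectorColumnUpdate(uint64_t old_vector_id, uint64_t new_vector_id,
                          const std::string& ybctid) {
  reverse_index[old_vector_id] = ReverseIndexEntry{std::nullopt};  // tombstone
  reverse_index[new_vector_id] = ReverseIndexEntry{ybctid};
}

// Row delete: tombstone the vector id that referenced the removed row.
void OnRowDelete(uint64_t vector_id) {
  reverse_index[vector_id] = ReverseIndexEntry{std::nullopt};
}

// Search side: a candidate vector id is usable only if its reverse-index entry still
// points at a live row.
std::optional<std::string> ResolveCandidate(uint64_t vector_id) {
  auto it = reverse_index.find(vector_id);
  if (it == reverse_index.end() || !it->second.ybctid) return std::nullopt;
  return it->second.ybctid;
}
```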

Commit:f85bbca
Author:Eric Sheng
Committer:Eric Sheng

[#25700] docdb: Initialize shared memory allocators during postmaster/server startup Summary: This diff adds the virtual address negotiation added in D41083 / f17f92e30b7fb8c8663173843a2a28143a4cb7f1 and initialization of the shared memory allocator added in D40272 / de280e748d005dcd5c252c89e29072160a9b409f to the postmaster/server startup paths. The basic setup logic is as follows: - TServer creates the shared memory allocators with `TServerSharedData` as the prepare user data (`InitializeTServer`). This is mapped at a temporary location. At this point, fields that do not require pointers within shared memory can be used, but those that require pointer support cannot be used. - TServer sets up the negotiation process (`PrepareNegotiationTServer`), which involves creating a temporary anonymous shared memory object. The file descriptor for this is passed to Postmaster via the `YB_PG_ADDRESS_NEGOTIATOR_FD` environment variable. - Postmaster is started, and loads the temporary shared memory object. GFlags are not set up at this point, so we have to use environment variables directly to load the file descriptor. - TServer and Postmaster perform address negotiation to agree on an address segment to map the shared memory to (`InitializePostmaster` and `ParentNegotiatorThread`). - TServer remaps the shared memory to the agreed upon address segment. Pointers are now supported. Objects that require pointer support are initialized. - Postmaster does `fork()` calls to start up backend processes. These backend processes open the shared memory at the start of `PgApiImpl` (via the `pggate_tserver_shared_memory_uuid` flag and `InitializePgBackend`), and map it to the agreed upon address segment. - Postmaster waits until objects that require pointer support are initialized, then proceeds as normal. Shared memory for the case of master and initdb processes does not have pointer support, since there is no "parent" process like postmaster from which all PG processes are forked to do address negotiation with. initdb calls `fork()` and `exec()` to start up multiple PG processes, and the `exec()` call causes us to lose the reservation of the negotiated address segment. `TServerSharedData` currently contains no objects that require pointer support. One such object (table locks lock manager) will be added in a future change. **Upgrade/Rollback safety:** This diff deletes the `GetSharedData` rpc. This was only used for testing and was never upgrade safe (it just did a raw memory copy of TServerSharedData). Jira: DB-14954 Test Plan: Jenkins Reviewers: sergei, dmitry, jason Reviewed By: sergei, jason Subscribers: myang, rthallam, bkolagani, amitanand, zdrudi, yql, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D41362
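One building block of the address negotiation described above is checking whether a candidate address range is free in a process. Below is a minimal, Linux-specific sketch using MAP_FIXED_NOREPLACE (which fails instead of replacing an existing mapping); the actual TServer/Postmaster protocol, the fd passing via YB_PG_ADDRESS_NEGOTIATOR_FD, and the final remap of the shared segment are not shown.
```lang=cpp
#include <sys/mman.h>

#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <vector>

// Try to reserve `size` bytes at each candidate address in turn. MAP_FIXED_NOREPLACE
// (Linux >= 4.17) fails with EEXIST instead of replacing an existing mapping, which
// makes it safe to probe. Returns the first address that could be reserved, or nullptr.
void* ProbeForFreeAddress(const std::vector<uintptr_t>& candidates, size_t size) {
  for (uintptr_t candidate : candidates) {
    void* addr = mmap(reinterpret_cast<void*>(candidate), size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, /*fd=*/-1, /*off=*/0);
    if (addr != MAP_FAILED) {
      return addr;  // Reserved; real code would later remap the shared segment here.
    }
    if (errno != EEXIST) {
      perror("mmap");  // Unexpected failure; in a sketch we just report it.
    }
  }
  return nullptr;
}
```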

Commit:98137e8
Author:Minghui Yang
Committer:Minghui Yang

[#23785] YSQL: Incrementally refresh PG backend catcaches (part 3) Summary: In part 1 and part 2, I have made changes to write PG-generated invalidation messages of a DDL statement to the new table `pg_yb_invalidation_messages` at the same time when the DDL statement increments the catalog version. For each database, the table `pg_yb_invalidation_messages` maintains a small history of invalidation messages, one per catalog version with a expiration time (10 second by default). The idea is that as long as there is one successful heartbeat service done in 10 second window, a new catalog version and its associated invalidation messages will be propagated to tservers where a longer history of catalog version and invalidation messages can be maintained in memory. This diff adds changes to add the contents of `pg_yb_invalidation_messages` table to the heartbeat response, along side with the existing contents of `pg_yb_catalog_version` table. To reduce the number of reading of `pg_yb_invalidation_messages` which may become bulky when there are many databases and/or large invalidation messages, the table is only read when we detect a change in `pg_yb_catalog_version` table (via the existing fingerprint mechanism). So even though the writing to `pg_yb_catalog_version` and `pg_yb_invalidation_messages` are atomic/transactional, it is possible that in a heartbeat response, we only send back the new contents of `pg_yb_catalog_version` but not `pg_yb_invalidation_messages` if the reading of `pg_yb_catalog_version` succeeds but the reading of `pg_yb_invalidation_messages` fails because we only make a one time reading of `pg_yb_invalidation_messages` after `pg_yb_catalog_version` is read successfully. For example, if we execute 1 DDL every 11 seconds, then because the expiration time by default is 10 seconds, each reading of the pg_yb_invalidation_messages table is going to return 1 row for the DB where the DDL is executed because the next DDL's invocation of yb_increment_db_catalog_version_with_inval_messages will delete the previous row which is now expired. On the other hand, if we execute 1 DDL every second, then each reading of pg_yb_invalidation_messages is going to return a history of 10 rows for the given DB because when the 11th DDL is executed the 1st row is now expired and will be deleted. Sample annotated session showing how pg_yb_invalidation_messages is updated along with pg_yb_catalog_version: ``` yb1$ ./bin/yb-ctl create --rf 1 --tserver_flags TEST_yb_enable_invalidation_messages=true Creating cluster. Waiting for cluster to be ready. ---------------------------------------------------------------------------------------------------- | Node Count: 1 | Replication Factor: 1 | ---------------------------------------------------------------------------------------------------- | JDBC : jdbc:postgresql://127.0.0.1:5433/yugabyte | | YSQL Shell : bin/ysqlsh | | YCQL Shell : bin/ycqlsh | | YEDIS Shell : bin/redis-cli | | Web UI : http://127.0.0.1:7000/ | | Cluster Data : /net/dev-server-myang/share/yugabyte-data | ---------------------------------------------------------------------------------------------------- For more info, please use: yb-ctl status yb1$ ysqlsh ysqlsh (15.2-YB-2.25.2.0-b0) Type "help" for help. 
-- initial state yugabyte=# select * from pg_yb_catalog_version; db_oid | current_version | last_breaking_version --------+-----------------+----------------------- 1 | 1 | 1 4 | 1 | 1 5 | 1 | 1 13515 | 1 | 1 13516 | 1 | 1 (5 rows) yugabyte=# select * from pg_yb_invalidation_messages; db_oid | current_version | message_time | messages --------+-----------------+--------------+---------- (0 rows) -- create table does not increment catalog version so its associated invalidation messages -- are not inserted into pg_yb_invalidation_messages. yugabyte=# create table foo(id int); CREATE TABLE yugabyte=# select * from pg_yb_catalog_version; db_oid | current_version | last_breaking_version --------+-----------------+----------------------- 1 | 1 | 1 4 | 1 | 1 5 | 1 | 1 13515 | 1 | 1 13516 | 1 | 1 (5 rows) yugabyte=# select * from pg_yb_invalidation_messages; db_oid | current_version | message_time | messages --------+-----------------+--------------+---------- (0 rows) -- each alter table increments catalog version so its associated invalidation messages -- are inserted into pg_yb_invalidation_messages. yugabyte=# alter table foo add column id1 int; alter table foo add column id2 int; alter table foo add column id3 int; ALTER TABLE ALTER TABLE ALTER TABLE yugabyte=# select * from pg_yb_catalog_version; db_oid | current_version | last_breaking_version --------+-----------------+----------------------- 1 | 1 | 1 4 | 1 | 1 5 | 1 | 1 13515 | 4 | 1 13516 | 1 | 1 (5 rows) -- each new catalog version has a row with its associated invalidation messsages yugabyte=# select * from pg_yb_invalidation_messages; db_oid | current_version | message_time | messages --------+-----------------+--------------+-------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------- ------------------ 13515 | 2 | 1739985961 | \x07700000733e2c00cb3400007651cba7000000000000000006700000733e2c00cb340000f7d687bd000000000 000000037700000733e2c00cb340000465708532d7000000000000036700000733e2c00cb34000021e2d2ca2d70000000000000fe000000733e2c00cb3400000040000 07882f33e01000000 13515 | 3 | 1739985961 | \x07700000733e2c00cb340000c370bca3000000000000000006700000733e2c00cb3400006c96b878000000000 000000037700000733e2c00cb340000465708532d7000000000000036700000733e2c00cb34000021e2d2ca2d70000000000000fe000000733e2c00cb3400000040000 07882f33e01000000 13515 | 4 | 1739985961 | \x07700000733e2c00cb34000062d2bacf000000000000000006700000733e2c00cb34000089da71bf000000000 000000037700000733e2c00cb340000465708532d7000000000000036700000733e2c00cb34000021e2d2ca2d70000000000000fe000000733e2c00cb3400000040000 07882f33e01000000 (3 rows) yugabyte=# ``` **Upgrade/Rollback safety:** The new field `db_catalog_inval_messages_data` in `TSHeartbeatResponsePB` is optional and will be ignored by an old tserver that does not have it. Its purpose is for incremental catalog cache refresh optimization so if ignored we just won't do the optimization in the PG backends spawned by an old tserver. Likewise, if a new tserver receives a `TSHeartbeatResponsePB` from a new master, the new field `db_catalog_inval_messages_data` will not be set and the new tserver just won't do the optimization of incremental catalog cache refresh. 
Test Plan: (1) the default YB_EXTRA_MASTER_FLAGS="--TEST_yb_enable_invalidation_messages=true --log_ysql_catalog_versions=true --vmodule=catalog_manager=2,heartbeater=2,master_heartbeat_service=2,pg_catversions=2" YB_EXTRA_TSERVER_FLAGS="--TEST_yb_enable_invalidation_messages=true --log_ysql_catalog_versions=true --vmodule=heartbeater=2,tablet_server=2,pg_catversions=2" ./yb_build.sh --cxx-test pg_catalog_version-test (2) with --enable_heartbeat_pg_catalog_versions_cache=true YB_EXTRA_MASTER_FLAGS="--TEST_yb_enable_invalidation_messages=true --log_ysql_catalog_versions=true --enable_heartbeat_pg_catalog_versions_cache=true --vmodule=catalog_manager=2,heartbeater=2,master_heartbeat_service=2,pg_catversions=2" YB_EXTRA_TSERVER_FLAGS="--TEST_yb_enable_invalidation_messages=true --log_ysql_catalog_versions=true --vmodule=heartbeater=2,tablet_server=2,pg_catversions=2" ./yb_build.sh --cxx-test pg_catalog_version-test In both cases, look at the test logs that indicate invalidation messages are read and set in the heartbeat response messages, and they are received at the tserver side: ``` [m-1] I0219 02:08:38.033061 2652477 master_heartbeat_service.cc:460] vlog2: TSHeartbeat: responding (to ts 8e40c9c3f02943e3a1111264bf74f9f7) db catalog versions: db_catalog_versions { db_oid: 1 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 4 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 5 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 13515 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 13516 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 16384 current_version: 2 last_breaking_version: 1 }) db inval messages: db_catalog_inval_messages { db_oid: 16384 current_version: 2 message_list: "Pp\000\000qy(\000\000@\000\000@\301\353\n\000\000\000\000\000\000\000\000Op\000\000qy(\000\000@\000\000\305\366\351\242\000\000\000\000\000\000\000\000Pp\000\000qy(\000\000@\000\000G\242S{rp\000\000\000\000\000\000Op\000\000qy(\000\000@\000\000\360\230\2021rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000J4\027\233rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\00029X\237rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000\204\237\234\023rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000\002Q}$rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000\325\031I,rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000S+\326Orp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000\014\350L\363rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000\361\367\247\350rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000\354\273\251erp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000\204\240\0360rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000?S\313\306rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000\023\020\336\274rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000\307ng\242rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000\363\317\236\214rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000\027\340 
\035rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000K\243-\036rp\000\000\000\000\000\0007p\000\000qy(\000\000@\000\000FW\010Srp\000\000\000\000\000\0006p\000\000qy(\000\000@\000\000\360\230\2021rp\000\000\000\000\000\000\373\000\000\000qy(\000\000@\000\0000\n\000\000\362I\221\365lU\000\000\373\000\000\000qy(\000\000@\000\0000\n\000\000\362I\221\365lU\000\000\376\000\000\000qy(\000\000@\000\000\000@\000\000\362I\221\365\001\000\000\000\373\000\000\000qy(\000\000@\000\0000\n\000\000\362I\221\365lU\000\000" } [ts-1] I0219 02:08:38.034925 2652211 heartbeater.cc:561] vlog1: TryHeartbeat: got master db catalog version data: db_catalog_versions { db_oid: 1 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 4 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 5 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 13515 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 13516 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 16384 current_version: 2 last_breaking_version: 1 } db inval messages: db_catalog_inval_messages { db_oid: 16384 current_version: 2 message_list: "Pp\000\000qy(\000\000@\000\000@\301\353\n\000\000\000\000\000\000\000\000Op\000\000qy(\000\000@\000\000\305\366\351\242\000\000\000\000\000\000\000\000Pp\000\000qy(\000\000@\000\000G\242S{rp\000\000\000\000\000\000Op\000\000qy(\000\000@\000\000\360\230\2021rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000J4\027\233rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\00029X\237rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000\204\237\234\023rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000\002Q}$rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000\325\031I,rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000S+\326Orp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000\014\350L\363rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000\361\367\247\350rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000\354\273\251erp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000\204\240\0360rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000?S\313\306rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000\023\020\336\274rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000\307ng\242rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000\363\317\236\214rp\000\000\000\000\000\000\007p\000\000qy(\000\000@\000\000\027\340 \035rp\000\000\000\000\000\000\006p\000\000qy(\000\000@\000\000K\243-\036rp\000\000\000\000\000\0007p\000\000qy(\000\000@\000\000FW\010Srp\000\000\000\000\000\0006p\000\000qy(\000\000@\000\000\360\230\2021rp\000\000\000\000\000\000\373\000\000\000qy(\000\000@\000\0000\n\000\000\362I\221\365lU\000\000\373\000\000\000qy(\000\000@\000\0000\n\000\000\362I\221\365lU\000\000\376\000\000\000qy(\000\000@\000\000\000@\000\000\362I\221\365\001\000\000\000\373\000\000\000qy(\000\000@\000\0000\n\000\000\362I\221\365lU\000\000" } ``` Reviewers: hsunder, kfranz, sanketh, mihnea Reviewed By: hsunder, kfranz Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D42023
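For orientation, here is a minimal proto sketch of the shape of the data described in the commit above. The field name `db_catalog_inval_messages_data` on `TSHeartbeatResponsePB` and the per-database entries (`db_oid`, `current_version`, `message_list`) come from the commit message and its logs; the message names, field numbers, and exact nesting below are illustrative assumptions, not the actual YugabyteDB proto definitions.
```
syntax = "proto2";
package example;  // illustrative only; real definitions live under src/yb/

// One entry per (database, catalog version): the serialized PG invalidation
// messages generated by the DDL that produced that version.
message DBCatalogInvalMessagesPB {           // hypothetical name
  optional uint32 db_oid = 1;
  optional uint64 current_version = 2;
  optional bytes message_list = 3;           // opaque, PG-encoded messages
}

// Piggy-backed on the master -> tserver heartbeat response, e.g. as the
// db_catalog_inval_messages_data field named in the commit message above
// (field number and wrapper shape are assumptions).
message DBCatalogInvalMessagesDataPB {       // hypothetical name
  repeated DBCatalogInvalMessagesPB db_catalog_inval_messages = 1;
}
```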

Commit:f3d2ce6
Author:Amitanand Aiyer
Committer:Amitanand Aiyer

[#25679] Table locks: serialize DDL acquire/release requests from master -> Tserver Summary: - Use a deadline for acquire requests to handle out of order acquire/release requests. The deadline is to be assigned by the pg_client/pg_client_session and is part of the acquire request. If there are any failures, the release request should be delayed until the previous acquire request's deadline to ensure that the Acquire request cannot apply after the release. The release request includes an apply_after field which ensures that the Tservers wait until the Acquire request's deadline is passed. Update ts_local_lock_manager to check for deadline/apply_after before handling the Acquire/Release requests accordingly. - Update ts_local_lock_manager to track the max seen lease_epoch for each tserver-uuid, and use it to reject old/out-of-order acquire messages that may be from an older epoch. - Update ts_local_lock_manager to ensure that it handles only one in-flight request per transaction. If there are multiple in-flight requests at the ts_local_lock_manager, only one will be allowed to process, and the rest of them should wait. - Use hybrid time to pass deadline/apply_after as part of the Acquire/Release ObjectLocks requests. Update the rpcs to also update hybrid-clock as required. - Use the deadline specified by the Acquire/Release ObjectLocksGlobal rpc as an upper bound for how long master -> tserver acquire/release rpcs are run. Designs considered: https://docs.google.com/document/d/1DtwMOoHC3bXmdyQjicRl5Di_29aiEYxBPgfUA2yg1ok/edit?usp=sharing **Upgrade/Rollback safety:** This feature is guarded by the test flag `TEST_enable_object_locking_for_table_locks` . Unless the flag is enabled, these changes do not have any effect. Jira: DB-14937 Test Plan: ybd --cxx-test object_lock-test --gtest_filter ObjectLockTest.IgnoreDDLAcquireAfterRelease.* Reviewers: bkolagani, rthallam, zdrudi Reviewed By: zdrudi Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D41501
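A minimal sketch of the request shape this description implies. The deadline, apply_after, and lease-epoch concepts are taken from the summary above; the message names, field names, and field numbers are hypothetical and do not claim to match the actual tserver protos.
```
syntax = "proto2";
package example;  // illustrative only

// Acquire carries a deadline (a hybrid time assigned by pg_client_session):
// the request must not be applied after this point, so a later release can
// safely wait it out.
message AcquireObjectLockRequestPB {         // hypothetical name
  optional bytes txn_id = 1;
  optional uint64 lease_epoch = 2;           // used to reject stale/out-of-order acquires
  optional fixed64 deadline_ht = 3;          // hybrid time upper bound for applying
}

// Release carries apply_after: tservers wait until the matching acquire's
// deadline has passed before applying the release.
message ReleaseObjectLockRequestPB {         // hypothetical name
  optional bytes txn_id = 1;
  optional fixed64 apply_after_ht = 2;       // hybrid time the release waits for
}
```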

Commit:bb3d381
Author:jhe
Committer:jhe

[BACKPORT 2.25.1][#25888] xClusterDDLRepl: Support colocated index backfill + table rewrites Summary: Original commit: a1686e152232fd1d68999c65781e60d11ce4c37f / D41922 For colocated index backfills in db-scoped replication, we simply pick now as the backfill time on the target (since all we need is to pick a time that is after the source's backfill time). For DDL replication, this causes a deadlock, as ddl queue doesn't update the safe time until the backfill completes - and the backfill can't start until the safe time advances. This diff changes backfill for automatic mode to use tablet level consistency for reads. This is fine since we are reading at a set read time, so the reads will be consistent. This also solves the above deadlock since we remove the wait that backfill performs. Doing this read time change is not sufficient though for this to work completely. Consider now this example: ``` 1. ddl_queue is trying to run index backfill so it can bump safe time to 10 2. it waits for other safetimes to get to 10 3. other safe times jump up to 15 4. ddl_queue now runs the backfill 5. if we pick (safetime w/o ddlqueue) as the backfill time, we get 15 - this is also the case if we pick any backfill time > 10 6. after index backfill, backfilled rows are written at 15 7. but safe time is only bumped up to 10 8. so any reads at that point will not see the backfill rows! ``` In order to remove this window of inconsistency, we pass in the safe time that ddl_queue handler is trying to bump up to and use that as the backfill time on the target. Note that this is only done for colocated indexes, and is passed through via the xClusterContext -> XClusterTableInfo. **Other fixes:** - add support to table rewrites to properly add colocation_id for any indexes that it rewrites - add is_index field to yb_data json for indexes **Upgrade/Rollback safety:** - adding new field xcluster_backfill_hybrid_time - changing field in CreateTableRequestPB to XClusterTableInfo struct, fine as we have not released any releases with the prior field Jira: DB-15194 Test Plan: Added indexes to existing tests, and colocation to some of the relevant regress tests. ``` ybd --cxx-test xcluster_ddl_replication-test --gtest_filter "XClusterDDLReplicationTest.CreateColocatedIndexes" ybd --cxx-test xcluster_ddl_replication_pgregress-test --gtest_filter "UseColocated/XClusterPgRegressDDLReplicationParamTest.PgRegressCreateDropTable/0" ybd --cxx-test xcluster_ddl_replication_pgregress-test --gtest_filter "UseColocated/XClusterPgRegressDDLReplicationParamTest.PgRegressCreateDropTable2/0" ybd --cxx-test xcluster_ddl_replication_pgregress-test --gtest_filter "UseColocated/XClusterPgRegressDDLReplicationParamTest.PgRegressAlterTable/0" ybd --cxx-test xcluster_ddl_replication_pgregress-test --gtest_filter "UseColocated/XClusterPgRegressDDLReplicationParamTest.PgRegressTableRewrite/0" ``` Reviewers: hsunder, xCluster Reviewed By: hsunder Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D42193
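As a rough sketch of the proto shape implied by the upgrade note above (the original commit a1686e1 appears again later in this list): `xcluster_backfill_hybrid_time` and the `XClusterTableInfo` grouping are named in the summary, but the field number and the exact placement inside `CreateTableRequestPB` are assumptions.
```
syntax = "proto2";
package example;  // illustrative only

// Per-table xCluster info attached to the create-table request (the summary
// says the previous standalone field was folded into this struct).
message XClusterTableInfoPB {                // hypothetical shape
  // Safe time the ddl_queue handler is bumping to; used as the backfill read
  // time for colocated indexes on the target.
  optional fixed64 xcluster_backfill_hybrid_time = 1;
}
```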

Commit:499c1b4
Author:Sergei Politov
Committer:Sergei Politov

[#396] DocDB: Don't write null frontiers Summary: Consider the following scenario. A large transaction is committed and starts its apply procedure, but the tserver is restarted in the middle of the apply. After the restart the tserver continues to apply this transaction, but each write batch has a null frontier, since the apply operation id is not known. If a flush is triggered on this tablet while no other writes have happened (for instance because of intents SST cleanup), we would crash, since our memtable flush filter expects the memtable to always have a frontier. Fixed by storing the apply operation id in the transaction apply state and recovering it after restart, so writes to RocksDB always have non-null frontiers. **Upgrade/Downgrade safety:** The apply op id is added to the apply state protobuf. If the restart happens on an old version while a transaction is in the middle of applying, the issue can still appear after that restart. Note that this is not a regression; it just means the issue is considered fixed only when the restart happens on a new version. Jira: DB-6277 Test Plan: Jenkins Reviewers: rthallam, esheng Reviewed By: rthallam, esheng Subscribers: yql, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D42097
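A hedged sketch of the fix described above: the apply state a tablet persists for a large transaction now also records the apply operation id, so write batches produced after a restart can carry a non-null frontier. Message and field names and numbers below are illustrative assumptions, not the actual DocDB proto.
```
syntax = "proto2";
package example;  // illustrative only

message OpIdPB {                             // Raft operation id (term, index)
  optional int64 term = 1;
  optional int64 index = 2;
}

// Persisted per-transaction apply progress. Recording apply_op_id lets the
// tablet reconstruct the consensus frontier for write batches produced while
// resuming the apply after a restart.
message ApplyTransactionStatePB {            // hypothetical shape
  optional bytes key = 1;                    // resume point within the intents
  optional OpIdPB apply_op_id = 2;           // new: op id of the APPLY record
}
```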

Commit:a1686e1
Author:jhe
Committer:jhe

[#25888] xClusterDDLRepl: Support colocated index backfill + table rewrites Summary: For colocated index backfills in db-scoped replication, we simply pick now as the backfill time on the target (since all we need is to pick a time that is after the source's backfill time). For DDL replication, this causes a deadlock, as ddl queue doesn't update the safe time until the backfill completes - and the backfill can't start until the safe time advances. This diff changes backfill for automatic mode to use tablet level consistency for reads. This is fine since we are reading at a set read time, so the reads will be consistent. This also solves the above deadlock since we remove the wait that backfill performs. Doing this read time change is not sufficient though for this to work completely. Consider now this example: ``` 1. ddl_queue is trying to run index backfill so it can bump safe time to 10 2. it waits for other safetimes to get to 10 3. other safe times jump up to 15 4. ddl_queue now runs the backfill 5. if we pick (safetime w/o ddlqueue) as the backfill time, we get 15 - this is also the case if we pick any backfill time > 10 6. after index backfill, backfilled rows are written at 15 7. but safe time is only bumped up to 10 8. so any reads at that point will not see the backfill rows! ``` In order to remove this window of inconsistency, we pass in the safe time that ddl_queue handler is trying to bump up to and use that as the backfill time on the target. Note that this is only done for colocated indexes, and is passed through via the xClusterContext -> XClusterTableInfo. **Other fixes:** - add support to table rewrites to properly add colocation_id for any indexes that it rewrites - add is_index field to yb_data json for indexes **Upgrade/Rollback safety:** - adding new field xcluster_backfill_hybrid_time - changing field in CreateTableRequestPB to XClusterTableInfo struct, fine as we have not released any releases with the prior field Jira: DB-15194 Test Plan: Added indexes to existing tests, and colocation to some of the relevant regress tests. ``` ybd --cxx-test xcluster_ddl_replication-test --gtest_filter "XClusterDDLReplicationTest.CreateColocatedIndexes" ybd --cxx-test xcluster_ddl_replication_pgregress-test --gtest_filter "UseColocated/XClusterPgRegressDDLReplicationParamTest.PgRegressCreateDropTable/0" ybd --cxx-test xcluster_ddl_replication_pgregress-test --gtest_filter "UseColocated/XClusterPgRegressDDLReplicationParamTest.PgRegressCreateDropTable2/0" ybd --cxx-test xcluster_ddl_replication_pgregress-test --gtest_filter "UseColocated/XClusterPgRegressDDLReplicationParamTest.PgRegressAlterTable/0" ybd --cxx-test xcluster_ddl_replication_pgregress-test --gtest_filter "UseColocated/XClusterPgRegressDDLReplicationParamTest.PgRegressTableRewrite/0" ``` Reviewers: hsunder, xCluster Reviewed By: hsunder Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D41922

Commit:082600a
Author:Basava
Committer:Basava

[#25905] YSQL: Forward object lock calls to the TServer Summary: **Background** Distributed transactions In YB, we don't always create an explicit docdb distributed transaction unless required. For instance, 1. RC/RR txns in explicit transactional block with no reads with row locks or updates/writes 2. fast-path/single-shard writes etc. Since the above don't write/request intents, a docdb txn wasn't required. **Problem** We want to introduce support for table locks which would mean that we need some identifier to tag the locks to. Also, there could be deadlocks spanning row-level locks & object/table locks & advisory locks. So we would need to tie the object locks to the same docdb transaction. **Idea** We could start a distributed docdb transaction early (even for read-only txns), and keep re-using the transaction until either of the following happen: - a transaction with intents consumes it - a transaction that uses savepoints consumes it (excluding read committed isolation txns for now) - if the transaction ends up blocking a DDL (this would help prevent false deadlock issue since we would be reusing read only txns) The next subsequent statement would start another docdb transaction. That way, we would have an identifier to tag object/table locks to. For instance, consider the below example ``` ysqlsh> // create TXN1 begin; select * from t ...; // create TXN1. use <TXN1, subtxn_id> for object locks commit; // was a read only txn, reuse TXN1 if it satisfies all conditions update t set v=v+1 where k=1; // use <TXN1, subtxn_id> for object locks // fast path txn, so wouldn't have intents. can reuse TXN1 begin; select * from t ...; // use <TXN1, subtxn_id> for object locks update t set ...; // we use <TXN1> to write intents, so the txn cannot be reused commit; // TXN1 gets consumed select * from t ...; // create txn TXN2, and use <TXN2, subtxn_id> for object locks ``` The basic infrastructure piece for txn reuse was introduced in https://phorge.dev.yugabyte.com/D41282 This diff adds the following changes. When `TEST_enable_object_locking_for_table_locks` is enabled - object lock calls are forwarded from `pggate` to `pg_client_service` layer. Additionally, txn finish calls are forwarded from `pggate` for read-only statements/txns too. - shared locks are acquired on the local tserver, so essentially all DMLs would acquire object/table locks when the flag is set. - on statement/transaction finish, the local object locks are released in-line. The test flag is disabled by default, and hence the diff shouldn't cause any semantic change in behavior on both docdb and ysql ends. **DocDB changes that can go as subsequent diff** 1. Forward exclusive object locks to the master to acquire them globally. 2. In YB, READ_COMMITTED isolation internally leverages savepoints. Enabling the above would lead to burn of 1 docdb transaction for every RC read-only txn, which is not desirable. Hence for RC alone, this change ignores creating a doc dist txn on savepoint => object locks wouldn't be rolled back on rollback to savepoint if the txn hasn't written any intents until then (those object locks gets released only on finish). Fix behavior for RC. 3. Investigate failures with ParallelQuery and extended query mode execution when the test flag is enabled. **YSQL changes that might help test the diff better** 1. Need to expose something similar to `yb_get_current_transaction` that would help us add more tests for this change using `PGConn`. 
Modifying `yb_get_current_transaction` didn't seem like the correct approach as it could end up showing the same txn id across different YSQL read-only txns. **Upgrade/Downgrade safety** No upgrade/downgrade issues since the only proto changes are for local node communication between pggate and pg_client_service. Jira: DB-15217 Test Plan: Jenkins Reviewers: dmitry, pjain Reviewed By: dmitry Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D39894
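For illustration only, a sketch of the kind of local-node message the forwarding above implies (pggate to pg_client_service). Tagging locks with a <docdb txn, subtxn_id> pair is taken from the summary; the message name, fields, and numbers are assumptions rather than the real pg_client.proto contents.
```
syntax = "proto2";
package example;  // illustrative only

// Sent by pggate to the local pg_client_service when the test flag
// TEST_enable_object_locking_for_table_locks is on. The lock is tagged with
// the reusable docdb transaction plus the current subtransaction id.
message PgAcquireObjectLockRequestPB {       // hypothetical name
  optional uint64 session_id = 1;
  optional bytes txn_id = 2;                 // reusable docdb txn described above
  optional uint32 subtxn_id = 3;
  optional uint32 database_oid = 4;
  optional uint32 relation_oid = 5;
  optional uint32 lock_mode = 6;             // PG lock mode, encoded by the caller
}
```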

Commit:b3c0a21
Author:Sergei Politov
Committer:Sergei Politov

[#26150] DocDB: Add vector index verification facility Summary: This diff adds logic to check whether the vector index contains all entries from the indexed table. **Upgrade/Rollback safety:** Adds an RPC function that is used only from tests. Jira: DB-15484 Test Plan: PgBackfillIndexTest.VectorIndex Reviewers: arybochkin Reviewed By: arybochkin Subscribers: ybase, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D42077
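Since the summary only says a test-only RPC was added, the following is a hedged sketch of what such a verification call could look like; the message names, fields, and numbers are hypothetical.
```
syntax = "proto2";
package example;  // illustrative only

// Test-only check: does the vector index contain an entry for every row of
// the indexed table?
message VerifyVectorIndexesRequestPB {       // hypothetical name
}

message VerifyVectorIndexesResponsePB {      // hypothetical name
  optional bool consistent = 1;              // true if no missing entries were found
}
```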

Commit:f514c22
Author:Mark Lillibridge
Committer:Mark Lillibridge

[BACKPORT 2.25.1][#24038] xCluster: create separate secondary OID space for target usage Summary: Original commit: c2505c9061db9d79067c8bb2682f69517abf4834 / D41872 Clean cherry pick, no conflicts. In order to avoid OID collision, we are splitting the original Postgres OID space in half; the first part is the "normal" space, which continues to be used for most things and the second part is a new "secondary" space. The split is: ``` ~/code/yugabyte-db/src/yb/common/entity_ids.h:32: static const uint32_t kPgFirstNormalObjectId = 16384; // Hardcoded in transam.h // We include some OIDs that would be signed with int32_t to allow checking the Postgres logic that // handles those. See PgLibPqLargeOidTest.LargeOid test. static const uint32_t kPgUpperBoundNormalObjectId = 2'199'999'999;; // upper bound is exclusive // Secondary OID space is used by xCluster when a database is a target. static const uint32_t kPgFirstSecondarySpaceObjectId = 2'200'000'000; // = 0x83'21'56'00 static const uint32_t kPgUpperBoundSecondarySpaceObjectId = 0xff'ff'ff'ff; ``` where OIDs in range 1..16383 are still considered part of the normal space but reserved for initdb and the like. When automatic xCluster replication is running, the target universe will allocate OIDs from the secondary space instead of the normal space to avoid collisions with OIDs freshly allocated on the source universe. That is the only time we allocate OIDs from the secondary space. Implementation of the splitting: * we modify the existing per-database OID allocator * the system catalog entries for databases in addition to the OID pointer for the normal space have a new OID pointer for the secondary space * the RPC to the master to reserve a range of OIDs now takes a bool that determines which space we allocate from * the TServers cache the OIDs they get from the result of that RPC; each TServer caches OIDs from only one space, discarding their cache when it is necessary to switch spaces * we expect switching to only happen when xCluster replication is dropped so there is no need to maintain dual caches * Postgres backends do not cache OIDs unless --ysql_enable_pg_per_database_oid_allocator=false, which is not the default and has never been recommended * if for some reason that flag had been turned on and we need to allocate from the secondary space, an error is returned * we are considering a better guardrail for that in the future * when databases are copied from template databases, only their normal OID pointer is copied * this effectively gives a secondary space pointer pointing to the start of the secondary space * this is fine -- xCluster safety only requires the secondary space pointer not be reset while replication of a database is running * when a database is restored, both OID pointers are reset to the beginning of their respective spaces * as above, this is fine for the secondary space pointer * Postgres doesn't care about this because it expects and handles correctly OID counters that wrap; YugabyteDB expects the counter not to wrap during a Postgres session but we have not broken that expectation here * xCluster does care but we will handle the normal space pointer in a future diff by doing a scan of the OIDs in use when xCluster automatic mode replication is started up then setting the normal space OID pointer after that value **Upgrade/Rollback safety:** Renamed one proto-field, next_pg_oid->next_normal_pg_oid, and added one field, next_secondary_pg_oid to the system catalog entry for namespaces: ``` 
~/code/yugabyte-db/src/yb/master/catalog_entity_info.proto:277: // The data part of a SysRowEntry in the sys.catalog table for a namespace. message SysNamespaceEntryPB { ... optional uint32 next_normal_pg_oid = 3; // Next normal space oid to assign. ... optional uint32 next_secondary_pg_oid = 9; // Next secondary space oid to assign. } ``` * Note that the field rename is not persisted and serves only to verify that no one is incorrectly depending on this field * (the field ID is unchanged and that is what is persisted) * the code treats the absence of the second field as if it was there and containing the start of the secondary space Added a new boolean argument to the RPC for allocating a range of OIDs: ``` ~/code/yugabyte-db/src/yb/master/master_client.proto:185: // Reserve Postgres oid message ReservePgsqlOidsRequestPB { optional bytes namespace_id = 1; // The namespace id of the Postgres database. optional uint32 next_oid = 2; // The next oid to reserve. optional uint32 count = 3; // The number of oids to reserve. // use_secondary_space is used by xCluster when a database is a target. optional bool use_secondary_space = 4; } ``` When omitted, this field defaults to false, which means to use the normal space as usual. We are not planning to use any auto flags here in spite of these changes changing persistent data and RPCs * automatic mode will not be available in the wild until after this diff is landed * and the customer has upgraded to a version after that * YBA will not allow setting up automatic mode replication until after that version is finalized * this means that the RPC will never be called with the extra field until after the upgrade is finalized * this means that the new persisted field will likewise never be set until after the upgrade is finalized Although no new fields will be used until the upgrade is finalized, the logic that limits the OIDs to the normal space applies immediately once the upgrade has started: * in theory, an upgrade could cause problems for a customer whose normal space OID pointer was already into the secondary space * this is very unlikely in practice because the customer would've have to gone through over a billion OIDs * worst case, we can safely reset their OID counter back to the beginning of the normal space * Postgres code comments basically says they can not handle more than a few million active OIDs at once due to poor algorithm choice * so the problem is the OID counter could get too high not that there are too many OIDs * vanilla Postgres is already designed to handle OID wrap and YugabyteDB can handle counter resets when xCluster is not running so the resetting should not be an issue * a rollback in the meantime will restore functionality because the previous version of the code does not honor the new lower normal space upper bound Implementation for using the secondary space on xCluster automatic mode replication targets: * extended TserverXClusterContextIf with a function for determining if a namespace is the target of xCluster automatic replication * that returning true plus being in xCluster read-only mode determines that we should use the secondary space * we carefully do not use the secondary space if other modes of xCluster are being used * this required some plumbing changes where we now pass/store TserverXClusterContext instead of TserverXClusterContextIf in some places Jira: DB-12928 Test Plan: I added a new test suite to unit test the OID allocator: ``` ybd --cxx-test xcluster_secondary_oid_space-test ``` as well as a new test 
to verify that collisions have been avoided if there are extra OID allocations on the target: ``` ybd --cxx-test xcluster_ddl_replication-test --gtest_filter '*.ExtraOidAllocationsOnTarget' ``` This test fails if we do not use the secondary space on the target. Note that there is additional work to remove collisions in all cases, particularly in the case of switch overs. Future diffs will address that. ``` ybd --cxx-test pg_libpq-test --gtest_filter '*.LargeOid' ``` Reviewers: hsunder, xCluster Reviewed By: hsunder Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D42031

Commit:a13818a
Author:yusong-yan
Committer:yusong-yan

[BACKPORT 2.25.1][##26057] docdb: Make pg_locks to display active pg advisory locks Summary: Original commit: 645c897997de166ec8686b12f99f11f1cc89aadb / D41929 The code changes address the issue where `pg_locks` does not display active advisory locks due to its assumption that it only interacts with `PGSQL_TABLE_TYPE` tablets. The advisory lock table, however, is of type `YQL_TABLE_TYPE`. To fix this: At tablet service layer, allow lock status calls for advisory lock tablets. At pggate layer, updated PgLockStatusRequestor to correctly identify and handle locks. Sample Output: ``` yugabyte=# begin transaction;select pg_advisory_xact_lock(1); BEGIN pg_advisory_xact_lock ----------------------- (1 row) yugabyte=*# select pg_advisory_lock(10); pg_advisory_lock ------------------ (1 row) yugabyte=*# select * from pg_locks; locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath | waitstart | waitend | ybdetails ----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----+--------------------------+---------+----------+-----------+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- advisory | 13515 | | | | | | | | | | | STRONG_READ,STRONG_WRITE | t | f | | 2025-02-18 20:25:26.105692+00 | {"node": "826f1f5296a346d1b4bc52ab5d76cb28", "tablet_id": "f367a32ce04f46aebf5e74337f2d0911", "blocked_by": null, "is_explicit": true, "transactionid": "a8f15eb1-8d92-4126-88f1-f10b9d1767b1", "keyrangedetails": {"cols": ["13515", "0", "1", "1"], "attnum": null, "column_id": null, "multiple_rows_locked": false}, "subtransaction_id": 2} advisory | 13515 | | | | | | | | | | | STRONG_READ,STRONG_WRITE | t | f | | 2025-02-18 20:25:32.574545+00 | {"node": "826f1f5296a346d1b4bc52ab5d76cb28", "tablet_id": "f367a32ce04f46aebf5e74337f2d0911", "blocked_by": null, "is_explicit": true, "transactionid": "ecd54f35-9ed4-4933-a50d-5b9591916102", "keyrangedetails": {"cols": ["13515", "0", "10", "1"], "attnum": null, "column_id": null, "multiple_rows_locked": false}, "subtransaction_id": 1} (2 rows) ``` Note that currently, the values of `classid`, `objid`, and `objsubid` are located inside `ybdetails -> keyrangedetails -> cols`. A separate diff will be made to move these values to their correct locations and refine the mode field. Upgrade/Rollback safety: The advisory lock feature is guarded by ysql_yb_enable_advisory_lock (false by default). Jira: DB-15383 Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_advisory_lock-test --gtest_filter PgAdvisoryLockTest.PgLocksSanityTest Reviewers: rthallam, bkolagani, esheng Reviewed By: bkolagani Subscribers: slingam, yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D42030

Commit:c2505c9
Author:Mark Lillibridge
Committer:Mark Lillibridge

[#24038] xCluster: create separate secondary OID space for target usage Summary: In order to avoid OID collision, we are splitting the original Postgres OID space in half; the first part is the "normal" space, which continues to be used for most things and the second part is a new "secondary" space. The split is: ``` ~/code/yugabyte-db/src/yb/common/entity_ids.h:32: static const uint32_t kPgFirstNormalObjectId = 16384; // Hardcoded in transam.h // We include some OIDs that would be signed with int32_t to allow checking the Postgres logic that // handles those. See PgLibPqLargeOidTest.LargeOid test. static const uint32_t kPgUpperBoundNormalObjectId = 2'199'999'999;; // upper bound is exclusive // Secondary OID space is used by xCluster when a database is a target. static const uint32_t kPgFirstSecondarySpaceObjectId = 2'200'000'000; // = 0x83'21'56'00 static const uint32_t kPgUpperBoundSecondarySpaceObjectId = 0xff'ff'ff'ff; ``` where OIDs in range 1..16383 are still considered part of the normal space but reserved for initdb and the like. When automatic xCluster replication is running, the target universe will allocate OIDs from the secondary space instead of the normal space to avoid collisions with OIDs freshly allocated on the source universe. That is the only time we allocate OIDs from the secondary space. Implementation of the splitting: * we modify the existing per-database OID allocator * the system catalog entries for databases in addition to the OID pointer for the normal space have a new OID pointer for the secondary space * the RPC to the master to reserve a range of OIDs now takes a bool that determines which space we allocate from * the TServers cache the OIDs they get from the result of that RPC; each TServer caches OIDs from only one space, discarding their cache when it is necessary to switch spaces * we expect switching to only happen when xCluster replication is dropped so there is no need to maintain dual caches * Postgres backends do not cache OIDs unless --ysql_enable_pg_per_database_oid_allocator=false, which is not the default and has never been recommended * if for some reason that flag had been turned on and we need to allocate from the secondary space, an error is returned * we are considering a better guardrail for that in the future * when databases are copied from template databases, only their normal OID pointer is copied * this effectively gives a secondary space pointer pointing to the start of the secondary space * this is fine -- xCluster safety only requires the secondary space pointer not be reset while replication of a database is running * when a database is restored, both OID pointers are reset to the beginning of their respective spaces * as above, this is fine for the secondary space pointer * Postgres doesn't care about this because it expects and handles correctly OID counters that wrap; YugabyteDB expects the counter not to wrap during a Postgres session but we have not broken that expectation here * xCluster does care but we will handle the normal space pointer in a future diff by doing a scan of the OIDs in use when xCluster automatic mode replication is started up then setting the normal space OID pointer after that value **Upgrade/Rollback safety:** Renamed one proto-field, next_pg_oid->next_normal_pg_oid, and added one field, next_secondary_pg_oid to the system catalog entry for namespaces: ``` ~/code/yugabyte-db/src/yb/master/catalog_entity_info.proto:277: // The data part of a SysRowEntry in the sys.catalog table for a namespace. 
message SysNamespaceEntryPB { ... optional uint32 next_normal_pg_oid = 3; // Next normal space oid to assign. ... optional uint32 next_secondary_pg_oid = 9; // Next secondary space oid to assign. } ``` * Note that the field rename is not persisted and serves only to verify that no one is incorrectly depending on this field * (the field ID is unchanged and that is what is persisted) * the code treats the absence of the second field as if it was there and containing the start of the secondary space Added a new boolean argument to the RPC for allocating a range of OIDs: ``` ~/code/yugabyte-db/src/yb/master/master_client.proto:185: // Reserve Postgres oid message ReservePgsqlOidsRequestPB { optional bytes namespace_id = 1; // The namespace id of the Postgres database. optional uint32 next_oid = 2; // The next oid to reserve. optional uint32 count = 3; // The number of oids to reserve. // use_secondary_space is used by xCluster when a database is a target. optional bool use_secondary_space = 4; } ``` When omitted, this field defaults to false, which means to use the normal space as usual. We are not planning to use any auto flags here in spite of these changes changing persistent data and RPCs * automatic mode will not be available in the wild until after this diff is landed * and the customer has upgraded to a version after that * YBA will not allow setting up automatic mode replication until after that version is finalized * this means that the RPC will never be called with the extra field until after the upgrade is finalized * this means that the new persisted field will likewise never be set until after the upgrade is finalized Although no new fields will be used until the upgrade is finalized, the logic that limits the OIDs to the normal space applies immediately once the upgrade has started: * in theory, an upgrade could cause problems for a customer whose normal space OID pointer was already into the secondary space * this is very unlikely in practice because the customer would've have to gone through over a billion OIDs * worst case, we can safely reset their OID counter back to the beginning of the normal space * Postgres code comments basically says they can not handle more than a few million active OIDs at once due to poor algorithm choice * so the problem is the OID counter could get too high not that there are too many OIDs * vanilla Postgres is already designed to handle OID wrap and YugabyteDB can handle counter resets when xCluster is not running so the resetting should not be an issue * a rollback in the meantime will restore functionality because the previous version of the code does not honor the new lower normal space upper bound Implementation for using the secondary space on xCluster automatic mode replication targets: * extended TserverXClusterContextIf with a function for determining if a namespace is the target of xCluster automatic replication * that returning true plus being in xCluster read-only mode determines that we should use the secondary space * we carefully do not use the secondary space if other modes of xCluster are being used * this required some plumbing changes where we now pass/store TserverXClusterContext instead of TserverXClusterContextIf in some places Jira: DB-12928 Test Plan: I added a new test suite to unit test the OID allocator: ``` ybd --cxx-test xcluster_secondary_oid_space-test ``` as well as a new test to verify that collisions have been avoided if there are extra OID allocations on the target: ``` ybd --cxx-test 
xcluster_ddl_replication-test --gtest_filter '*.ExtraOidAllocationsOnTarget' ``` This test fails if we do not use the secondary space on the target. Note that there is additional work to remove collisions in all cases, particularly in the case of switch overs. Future diffs will address that. ``` ybd --cxx-test pg_libpq-test --gtest_filter '*.LargeOid' ``` Reviewers: hsunder, xCluster Reviewed By: hsunder Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D41872

Commit:645c897
Author:yusong-yan
Committer:yusong-yan

[##26057] docdb: Make pg_locks to display active pg advisory locks Summary: The code changes address the issue where `pg_locks` does not display active advisory locks due to its assumption that it only interacts with `PGSQL_TABLE_TYPE` tablets. The advisory lock table, however, is of type `YQL_TABLE_TYPE`. To fix this: At tablet service layer, allow lock status calls for advisory lock tablets. At pggate layer, updated PgLockStatusRequestor to correctly identify and handle locks. Sample Output: ``` yugabyte=# begin transaction;select pg_advisory_xact_lock(1); BEGIN pg_advisory_xact_lock ----------------------- (1 row) yugabyte=*# select pg_advisory_lock(10); pg_advisory_lock ------------------ (1 row) yugabyte=*# select * from pg_locks; locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath | waitstart | waitend | ybdetails ----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----+--------------------------+---------+----------+-----------+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- advisory | 13515 | | | | | | | | | | | STRONG_READ,STRONG_WRITE | t | f | | 2025-02-18 20:25:26.105692+00 | {"node": "826f1f5296a346d1b4bc52ab5d76cb28", "tablet_id": "f367a32ce04f46aebf5e74337f2d0911", "blocked_by": null, "is_explicit": true, "transactionid": "a8f15eb1-8d92-4126-88f1-f10b9d1767b1", "keyrangedetails": {"cols": ["13515", "0", "1", "1"], "attnum": null, "column_id": null, "multiple_rows_locked": false}, "subtransaction_id": 2} advisory | 13515 | | | | | | | | | | | STRONG_READ,STRONG_WRITE | t | f | | 2025-02-18 20:25:32.574545+00 | {"node": "826f1f5296a346d1b4bc52ab5d76cb28", "tablet_id": "f367a32ce04f46aebf5e74337f2d0911", "blocked_by": null, "is_explicit": true, "transactionid": "ecd54f35-9ed4-4933-a50d-5b9591916102", "keyrangedetails": {"cols": ["13515", "0", "10", "1"], "attnum": null, "column_id": null, "multiple_rows_locked": false}, "subtransaction_id": 1} (2 rows) ``` Note that currently, the values of `classid`, `objid`, and `objsubid` are located inside `ybdetails -> keyrangedetails -> cols`. A separate diff will be made to move these values to their correct locations and refine the mode field. Upgrade/Rollback safety: The advisory lock feature is guarded by ysql_yb_enable_advisory_lock (false by default). Jira: DB-15383 Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_advisory_lock-test --gtest_filter PgAdvisoryLockTest.PgLocksSanityTest Reviewers: rthallam, bkolagani, esheng Reviewed By: rthallam, bkolagani Subscribers: ybase, yql, slingam Differential Revision: https://phorge.dev.yugabyte.com/D41929

Commit:93e200e
Author:Zachary Drudi
Committer:Zachary Drudi

[#25641] docdb: TServers kill their existing PG sessions when they get a new ysql lease Summary: Add code in the tserver to kill all hosted pg sessions when the master tells the tserver it received a new lease. Also switch from using hybrid time to using mono time to track the ysql operation lease at both the master and the tserver. TServers use the time immediately before they sent an ACKed heartbeat as the lease refresh time; masters use the time they process the heartbeat. Nodes compute the lease deadline by adding a flag-controlled TTL to the last lease refresh time. Masters use `master_ysql_operation_lease_ttl_ms`; tservers will use a different flag to be added later. This behaviour is currently guarded behind `enable_ysql_operation_lease`, which is a test flag that defaults to `false`. **Upgrade/Rollback safety:** The new field `YSQLLeaseInfoPB lease_info` included in the `ListTabletServers` RPC is optional, only populated if a test flag defaulting to `false` is set, and has no non-test clients, so it is upgrade/rollback safe. Jira: DB-14893, DB-14892 Test Plan: ``` ./yb_build.sh release --cxx-test object_lock-test --gtest_filter 'ExternalObjectLockTest.TabletServerKillsSessionsWhenItAcquiresNewLease' ``` Reviewers: amitanand, bkolagani, kramanathan, sergei Reviewed By: amitanand, kramanathan, sergei Subscribers: svc_phabricator, sergei, sanketh, rthallam, ybase, yql, slingam Differential Revision: https://phorge.dev.yugabyte.com/D41449
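`YSQLLeaseInfoPB` and its use as `lease_info` in the `ListTabletServers` response are named in the summary above; the fields shown below (a liveness bit and a lease epoch) are only a plausible sketch, with assumed names and numbers.
```
syntax = "proto2";
package example;  // illustrative only

// Reported per tablet server in the ListTabletServers response when the
// test flag enable_ysql_operation_lease is set.
message YSQLLeaseInfoPB {                    // fields are assumptions
  optional bool is_live = 1;                 // lease currently held
  optional uint64 lease_epoch = 2;           // bumped on each new lease
}
```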

Commit:e2828c7
Author:Sanketh I
Committer:Sanketh I

[#25799] YSQL: Invalidate tserver PG table cache by catalog version instead of time. Summary: After a DDL change, the tserver-side table cache is invalidated by pggate from each backend, because the table cache invalidation is based on a timestamp. This means that each backend needs to reopen pg_ catalog tables from the master, which is expensive on a faraway node. This fix changes the tserver-side invalidation to happen by catalog version instead. Without this fix, we see the following pattern causing high latency on all backends in a faraway region after a DDL statement. 1. The tserver notices the DDL change and invalidates its PG table cache. 2. The first backend notices the DDL change and calls into the tserver to invalidate the tserver table cache. A subset of catalog tables is prefetched into the tserver. The latency seen is the sum of prefetching and refetching the docdb schema. 3. The second backend notices the DDL change. It benefits from the prefetched data but again invalidates the tserver table cache. The docdb schema has to be refetched (including for the prefetched tables), causing latency on the order of seconds (though less than in (2)). **Upgrade/Rollback safety:** * The proto fields are only used in tserver -> PG communication, which are upgraded at the same time, so no upgrade safety concerns are involved. Jira: DB-15099 Test Plan: Existing tests pass. We don't have a perf scenario covering this. Stress test runs on the following scenarios passed. ``` test_intensive_multi_tenancy_workload|test_connections_memory_consumptions|test_ysql_tablets_dml ``` Reviewers: myang, dmitry, sergei Reviewed By: myang, sergei Subscribers: svc_phabricator, yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D41346
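A hedged sketch of the direction described above: instead of a timestamp, the PG backend and the tserver compare catalog versions, so a backend that is not ahead of the cache does not force a full invalidation. Message and field names and numbers are assumptions; the real fields live in the pggate <-> tserver protos.
```
syntax = "proto2";
package example;  // illustrative only

// The backend tells the tserver which catalog version its DDL observation is
// based on; the tserver invalidates its PG table cache only if this version
// is newer than the one the cache was built at.
message PgInvalidateTableCacheRequestPB {    // hypothetical name
  optional uint32 db_oid = 1;
  optional uint64 ysql_catalog_version = 2;  // version observed by the backend
}
```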

Commit:99c0530
Author:jhe
Committer:jhe

[#22318] xClusterDDLRepl: Support create colocated table Summary: Support creation of colocated tables. Creation of colocated indexes will come in a follow-up diff (to handle index backfill). **Background:** XCluster requires a persisted mapping of equivalent source -> target packing schemas, which we use to rewrite replicate records. This mapping is based on actual docdb schema versions of these tables. With DDLRepl, we process `ChangeMetadataOps` (CMOPs) prior to running the equivalent DDL - this means that we can get new packing schemas + data for these packing schemas before we get the new corresponding schema on the target. To handle this, we insert the packing schema into old_packing_schemas so that we will have a valid schema version that we can point to. See D38293 for more info on this. For colocated tables, things work a bit differently. Create tables are essentially CMOPs on the parent table, which means that we would try to handle them as an alter operation. However, this table doesn't even exist yet on the target, so we have nowhere to store the old_packing_schemas... **Solution:** This diff solves this by storing the old_packing_schemas within the UniverseReplication object - called via InsertHistoricalColocatedSchemaPacking. We store the original packing schema for the new table, along with any new packing schemas that may come around before the table is created (eg if there are immediate alters following the create). InsertHistoricalColocatedSchemaPacking checks that the table has not yet been created (otherwise we go down the regular alter flow) and that we don't have an existing compatible schema. When the table does get created on the target, this information gets stored in the TableInfo and we bump up the table's initial schema version to (max_old_schema_version + 1). This information gets sent to the colocated tablet and inserted into the old_schema_packings on the tablet, thus persisting the xcluster mapping for the new table. **Other changes:** - The DDLRepl extension is changed to also capture colocation_ids and store those in the yb_data json. - The ddl_queue handler then stores these colocation_ids in the xcluster_context and forces that colocation_id for the table on the target as well. - GetCompatibleSchemaVersionRpc was changed to fail on NotFound errors for colocated tables (this avoids a 30s delay when creating colocated tables, since otherwise the target would keep retrying the search for this colocation id) - Cleanup of the xcluster tableinfo is now handled at the end of the ysql transaction, which will ensure that we only clean up these fields once the object has been committed. **Follow-ups:** - Index backfill of colocated indexes - Handle compactions of colocated tablet on the target **Upgrade/Rollback safety:** Modifying fields that are protected under the automatic xcluster gflags and that are not yet released. 
Jira: DB-11227 Test Plan: ``` ybd --cxx-test xcluster_ddl_replication-test --gtest_filter "XClusterDDLReplicationTest.CreateColocatedTables" ybd --cxx-test xcluster_ddl_replication-test --gtest_filter "XClusterDDLReplicationTest.CreateColocatedTableWithPause" ybd --cxx-test xcluster_ddl_replication-test --gtest_filter "XClusterDDLReplicationTest.CreateColocatedTableWithSourceFailures" ybd --cxx-test xcluster_ddl_replication-test --gtest_filter "XClusterDDLReplicationTest.CreateColocatedTableWithTargetFailures" ``` Reviewers: xCluster, hsunder, #db-approvers Reviewed By: hsunder, #db-approvers Subscribers: svc_phabricator, yql, ycdcxcluster, ybase Differential Revision: https://phorge.dev.yugabyte.com/D41367
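A sketch of the bookkeeping the summary describes: historical packing schemas for a not-yet-created colocated table are parked on the UniverseReplication entry, keyed by colocation id, until the target-side create lands. All names and field numbers here are illustrative assumptions.
```
syntax = "proto2";
package example;  // illustrative only

// One source packing schema observed before the colocated table exists on
// the target (the create itself, or alters that follow immediately).
message HistoricalColocatedSchemaPackingPB { // hypothetical name
  optional uint32 colocation_id = 1;
  optional uint32 schema_version = 2;
  optional bytes packed_schema = 3;          // serialized schema/packing info
}

// Stored on the UniverseReplication catalog entry; drained into the table's
// old_schema_packings when the table is finally created on the target.
message HistoricalSchemaPackingsPB {         // hypothetical name
  repeated HistoricalColocatedSchemaPackingPB entries = 1;
}
```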

Commit:2ab4fbe
Author:Siddharth Shah
Committer:Siddharth Shah

[BACKPORT 2024.2][#25897] CDC: Logical replication with consistency for subset of tablets Summary: **Backport description:** Minor conflicts in the following files: 1. guc.c - due to different positions of existing flags. 2. yb_guc.h - Changes from ybc_guc.h (only present in master) got mapped to this file. Conflict encountered as flag declarations from ybc_util.h have been moved to ybc_guc.h in master. 3. ybc_pggate.h & ybc_pggate.cc for some refactoring changes. Had to add the flag introduced in this diff in ybc_util.h because of the refactoring mentioned in point 2. **Original description:** Original commit: 2c0f31bf1967e7bdd1bcf2334f43362702bf0e3f / D41520 In logical replication, we are introducing a new mode that provides parallel consumption of changes from a table using multiple replication slots, where each slot polls changes from a subset of tablets. This mode is guarded under a tserver preview flag, `ysql_yb_enable_consistent_replication_from_hash_range`. To support this mode, we have introduced a new slot option "hash_range" that can be passed with the START_REPLICATION SLOT command. This option takes two comma-separated values, 'start hash range' and 'end hash range'. Tablets of the particular table whose start hash range falls within the slot's hash range will be considered for streaming data from that slot. Requirements for this mode: 1. The publication should only contain one table. 2. A slot, once started with a particular hash range, can only be restarted with the same hash range. When the replication slot is started with hash ranges for the first time, these values are extracted and the hash_range option is removed from the option list, as this list is passed to the output plugin for further validation. The extracted values are passed to the VirtualWAL, where we persist these hash ranges in the slot's cdc_state entry during initialisation of the VirtualWAL. These values are later checked when the slot is restarted. If the slot is started without hash ranges the first time and thereafter, nothing is persisted in the slot's state table entry and all tablets of the tables under the publication will be polled. **Upgrade/Rollback safety:** All changes for this mode are covered under a new tserver preview flag `ysql_yb_enable_consistent_replication_from_hash_range`. New proto fields introduced in `InitvirtualWALRequestPB` are only populated & accessed if the flag is set. A separate auto flag is not required for proto changes as the RPC is sent from pg to the local tserver. Jira: DB-15209 Test Plan: Jenkins: urgent ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testConsumptionOnSubsetOfTabletsFromMultipleSlots' ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testNonNumericHashRangeWithSlot' ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testOutOfBoundHashRangeWithSlot' ./yb_build.sh --cxx-test integration-tests_cdcsdk_consumption_consistent_changes-test --gtest_filter CDCSDKConsumptionConsistentChangesTest.TestReplicationWithHashRangeConstraintsAndTabletSplit Reviewers: skumar, sumukh.phalgaonkar, stiwary, utkarsh.munjal Reviewed By: sumukh.phalgaonkar Subscribers: ycdcxcluster, ybase, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D41790
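The new fields on the VirtualWAL init request are only described, not shown, in the summary above (the original commit 2c0f31b appears again below), so the following is a sketch under assumptions: the hash-range pair comes from the `hash_range` slot option, but the message and field names and numbers are hypothetical.
```
syntax = "proto2";
package example;  // illustrative only

// Optional hash-range restriction passed when the slot is started with the
// hash_range option; only tablets whose start hash falls in this range are
// polled by the slot. Populated only when
// ysql_yb_enable_consistent_replication_from_hash_range is set.
message SlotHashRangePB {                    // hypothetical name
  optional uint32 start_hash_range = 1;      // as given in START_REPLICATION
  optional uint32 end_hash_range = 2;
}
```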

Commit:2c0f31b
Author:Siddharth Shah
Committer:Siddharth Shah

[#25897] CDC: Logical replication with consistency for subset of tablets Summary: In logical replication, we are introducing a new mode that provides parallel consumption of changes from a table using multiple replication slots, where each slot polls changes from a subset of tablets. This mode is guarded under a tserver preview flag, `ysql_yb_enable_consistent_replication_from_hash_range`. To support this mode, we have introduced a new slot option "hash_range" that can be passed with the START_REPLICATION SLOT command. This option takes two comma-separated values, 'start hash range' and 'end hash range'. Tablets of the particular table whose start hash range falls within the slot's hash range will be considered for streaming data from that slot. Requirements for this mode: 1. The publication should only contain one table. 2. A slot, once started with a particular hash range, can only be restarted with the same hash range. When the replication slot is started with hash ranges for the first time, these values are extracted and the hash_range option is removed from the option list, as this list is passed to the output plugin for further validation. The extracted values are passed to the VirtualWAL, where we persist these hash ranges in the slot's cdc_state entry during initialisation of the VirtualWAL. These values are later checked when the slot is restarted. If the slot is started without hash ranges the first time and thereafter, nothing is persisted in the slot's state table entry and all tablets of the tables under the publication will be polled. **Upgrade/Rollback safety:** All changes for this mode are covered under a new tserver preview flag `ysql_yb_enable_consistent_replication_from_hash_range`. New proto fields introduced in `InitvirtualWALRequestPB` are only populated & accessed if the flag is set. A separate auto flag is not required for proto changes as the RPC is sent from pg to the local tserver. Jira: DB-15209 Test Plan: Jenkins: urgent ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testConsumptionOnSubsetOfTabletsFromMultipleSlots' ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testNonNumericHashRangeWithSlot' ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testOutOfBoundHashRangeWithSlot' ./yb_build.sh --cxx-test integration-tests_cdcsdk_consumption_consistent_changes-test --gtest_filter CDCSDKConsumptionConsistentChangesTest.TestReplicationWithHashRangeConstraintsAndTabletSplit Reviewers: skumar, sumukh.phalgaonkar, stiwary, utkarsh.munjal Reviewed By: sumukh.phalgaonkar, stiwary, utkarsh.munjal Subscribers: yql, ybase, ycdcxcluster Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D41520

Commit:d4fb8a5
Author:Hari Krishna Sunder
Committer:Hari Krishna Sunder

[#25899] DocDB: Remove NonRuntime AutoFlag Summary: Non-runtime AutoFlags have never been used. The macro was removed almost 2 years back in 7db84bce339fb664c79774495589ca1781ca069b/D25569. This change cleans up all the code related to non-runtime AutoFlags. **Upgrade/Downgrade safety:** The `promote_non_runtime_flags` field in `PromoteAutoFlagsRequestPB` has been removed. It was never used except in tests and yb-admin, which have also been cleaned up. Jira: DB-15211 Test Plan: Jenkins Reviewers: asrivastava, xCluster Reviewed By: asrivastava Subscribers: yugaware, ybase, ycdcxcluster Differential Revision: https://phorge.dev.yugabyte.com/D41745
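As a general illustration of this kind of cleanup (not the actual `PromoteAutoFlagsRequestPB` definition), removing a proto field is typically paired with reserving its number and name so they cannot be reused; the surrounding fields and all field numbers below are assumptions.
```
syntax = "proto2";
package example;  // illustrative only

message PromoteAutoFlagsRequestPB {          // shape and numbers are assumptions
  optional string max_flag_class = 1;
  // promote_non_runtime_flags was removed; reserving keeps future edits from
  // colliding with the retired field number and name.
  reserved 2;
  reserved "promote_non_runtime_flags";
  optional bool force = 3;
}
```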

Commit:3e16961
Author:Dmitry Uspenskiy
Committer:Dmitry Uspenskiy

[#25847] YSQL: Use pimpl design pattern for PgClientSession Summary: Nowadays PgClientSession contains lots of inner helper classes and private helper methods. It is reasonable to use the pimpl design pattern to move all of this into the cc file. **Note:** In the context of this diff, deprecated protobufs in `pg_client.proto` were removed as part of cleanup. Jira: DB-15142 Test Plan: Jenkins Reviewers: sergei, mlillibridge, yyan, tnayak, kramanathan Reviewed By: sergei Subscribers: myang, mlillibridge, ybase, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D41631

Commit:3eaa5e8
Author:Sergei Politov
Committer:Sergei Politov

[#25844] DocDB: Support vector index backfill in chunks and restart during backfill Summary: A vector index could be created on a table with a pretty large dataset. To avoid generating overly large vector index chunks and memtables, we should fill them in chunks. Also added support for continuing the backfill after a restart. **Upgrade/Rollback safety:** Safe to upgrade/rollback since the newly added field is used only with newly added tables. No extra flags are required. Jira: DB-15139 Test Plan: PgVectorIndexTest.ManyRowsWithBackfillAndRestart/Distributed Reviewers: arybochkin Reviewed By: arybochkin Subscribers: ybase, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D41639
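The summary does not show the new field, so this is only a guess at its shape: some per-vector-index marker of how far the backfill has progressed, so it can resume in chunks after a restart. Names and numbers are hypothetical.
```
syntax = "proto2";
package example;  // illustrative only

// Hypothetical backfill progress marker persisted with a vector index, so the
// tablet can continue filling the index in chunks after a restart instead of
// starting over.
message VectorIndexBackfillStatePB {
  optional bytes backfill_key = 1;           // last indexed-table key processed
  optional bool backfill_done = 2;
}
```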

Commit:20dd717
Author:Utkarsh Munjal
Committer:Utkarsh Munjal

[BACKPORT 2024.2][#24583] YSQL: Support Snapshot Options in CREATE_REPLICATION_SLOT Summary: Backport Summary: - src/postgres/src/backend/replication/logical/snapbuild.c: -- Master commit introduced an include, which caused conflict with change of order of includes from PG11 to PG15, resolved by introducing the master commit include and keeping the original order intact -- SnapBuildBuildSnapshot: Master commit introduced setting of `yb_cdc_snapshot_read_time` in the function which conflicted with PG15 introduced member `snapXactCompletionCount`. Resolved by introducing the master commit changes - src/postgres/src/backend/replication/walsender.c: -- Minor conflicts because of formatting. Resolved by using the right formatting. - src/postgres/src/backend/storage/ipc/procarray.c: -- GetSnapshotDataReuse: Master commit introduced setting of `yb_cdc_snapshot_read_time` in the function but this function does not exist in PG11 branches, resolved by accepting PG11 branch changes. - src/postgres/src/backend/utils/time/snapmgr.c -- ExportSnapshot: Master commit moved creation of dummy snapshot in YB case to `YbInitSnapshot` function the conflict is because of missing field of `SnapshotData` struct `snapshot_type` inb PG11 branch. Resolved introducing the `YbInitSnapshot` with proper fields. - src/yb/yql/pggate/ybc_pg_typedefs.h: -- Minor conflict because of formatting differences (`YbcPgReplicationSlotSnapshotAction` vs `YBCPgReplicationSlotSnapshotAction`) - src/yb/yql/pggate/ybc_pggate.cc: -- Minor conflict because of formatting differences (`YbcStatus` vs `YBCStatus`) - src/yb/yql/pggate/ybc_pggate.h: -- Minor conflict because of formatting differences (`YbcStatus` vs `YBCStatus`) Original Summary: This revision introduces support for snapshot options in replication slot creation with the following options: - **USE_SNAPSHOT** - **EXPORT_SNAPSHOT** **Overview** When a replication slot is created with `USE_SNAPSHOT`, `consistent_snapshot_time` is fetched from the CDC service and set for the current session using the same API calls used by `SET TRANSACTION SNAPSHOT`. For `EXPORT_SNAPSHOT`, this `consistent_snapshot_time` is used to export the snapshot, similar to `pg_export_snapshot()`. However, instead of the `read_time` typically picked in the tserver as with `pg_export_snapshot()`, this `consistent_snapshot_time` provided by the CDC service is directly set as the `read_time` associated and stored with the snapshot. Implementation Details Commit 17c2711109b63125d34caa4d781b33b7b4929167 / D38542 introduced support for `pg_export_snapshot` and `SET TRANSACTION SNAPSHOT` in YB. This revision builds on those APIs to support the above-mentioned snapshot options. For both options, a `consistent_snapshot_time` is fetched from the CDC service while creating the replication slot. This is a HybridTime (not ReadHybridTime). Both `USE_SNAPSHOT` and `EXPORT_SNAPSHOT` use this HybridTime for setting and exporting the snapshot, respectively. **USE_SNAPSHOT** - When this option is provided, a ReadHybridTime (constructed from the single HybridTime `consistent_snapshot_time`) is sent to the tserver via the `ImportTxnSnapshot` RPC. - A new field, `cdc_snapshot_read_time`, is introduced in `PgImportTxnSnapshotRequestPB` to carry this timestamp. - The rest of the flow remains the same as `SET TRANSACTION SNAPSHOT`, except that the read time is directly taken from the CDC-provided `consistent_snapshot_time` instead of stored `read_time` against specified snapshot ID. 
**EXPORT_SNAPSHOT** - A new field, `cdc_snapshot_read_time`, is added to `PgExportTxnSnapshotRequestPB` and sent to the tserver via the `ExportTxnSnapshot` RPC. - This value is then stored as the `read_time` for the snapshot, and a new snapshot ID is generated. **Upgrade/Rollback safety:** This revision is part of a new feature whose syntax is not currently used by any user and guarded by a preview flag. It introduces new fields in already existing messages, which will only be utilized when the new syntax is invoked. JIRA: DB-13622 Original commit: 4776752ad4f713a1701bdcbc45c32418983f3c8e / D39355 Test Plan: Jenkins: urgent ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlotSnapshotAction' ./yb_build.sh --cxx-test integration-tests_cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCreateReplicationSlotExportSnapshot ./yb_build.sh --cxx-test pgwrapper_pg_export_snapshot-test --gtest_filter "PgExportSnapshotTest.*" Reviewers: skumar, stiwary, xCluster, hsunder Reviewed By: stiwary Subscribers: svc_phabricator, ycdcxcluster, ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D41753
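
For orientation, a minimal proto sketch of the new request fields described above. Only the message names (`PgImportTxnSnapshotRequestPB`, `PgExportTxnSnapshotRequestPB`) and the field name `cdc_snapshot_read_time` come from the commit message; the syntax version, field numbers, types, and comments are assumptions.
```
// Sketch only -- illustrative field numbers and types, not the actual definitions.
syntax = "proto2";

message PgImportTxnSnapshotRequestPB {
  // Existing fields elided.
  // HybridTime supplied by the CDC service; used directly as the read time
  // instead of the read_time stored against the specified snapshot id.
  optional fixed64 cdc_snapshot_read_time = 100;
}

message PgExportTxnSnapshotRequestPB {
  // Existing fields elided.
  // Stored as the snapshot's read_time when EXPORT_SNAPSHOT is requested.
  optional fixed64 cdc_snapshot_read_time = 100;
}
```
Because the new fields are only populated when the new CREATE_REPLICATION_SLOT options are invoked, peers that do not know about them never see them set, which is the basis of the upgrade/rollback safety argument above.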

Commit:4776752
Author:Utkarsh Munjal
Committer:Utkarsh Munjal

[#24583] YSQL: Support Snapshot Options in CREATE_REPLICATION_SLOT Summary: This revision introduces support for snapshot options in replication slot creation with the following options: - **USE_SNAPSHOT** - **EXPORT_SNAPSHOT** **Overview** When a replication slot is created with `USE_SNAPSHOT`, `consistent_snapshot_time` is fetched from the CDC service and set for the current session using the same API calls used by `SET TRANSACTION SNAPSHOT`. For `EXPORT_SNAPSHOT`, this `consistent_snapshot_time` is used to export the snapshot, similar to `pg_export_snapshot()`. However, instead of the `read_time` typically picked in the tserver as with `pg_export_snapshot()`, this `consistent_snapshot_time` provided by the CDC service is directly set as the `read_time` associated and stored with the snapshot. Implementation Details Commit 17c2711109b63125d34caa4d781b33b7b4929167 / D38542 introduced support for `pg_export_snapshot` and `SET TRANSACTION SNAPSHOT` in YB. This revision builds on those APIs to support the above-mentioned snapshot options. For both options, a `consistent_snapshot_time` is fetched from the CDC service while creating the replication slot. This is a HybridTime (not ReadHybridTime). Both `USE_SNAPSHOT` and `EXPORT_SNAPSHOT` use this HybridTime for setting and exporting the snapshot, respectively. **USE_SNAPSHOT** - When this option is provided, a ReadHybridTime (constructed from the single HybridTime `consistent_snapshot_time`) is sent to the tserver via the `ImportTxnSnapshot` RPC. - A new field, `cdc_snapshot_read_time`, is introduced in `PgImportTxnSnapshotRequestPB` to carry this timestamp. - The rest of the flow remains the same as `SET TRANSACTION SNAPSHOT`, except that the read time is directly taken from the CDC-provided `consistent_snapshot_time` instead of stored `read_time` against specified snapshot ID. **EXPORT_SNAPSHOT** - A new field, `cdc_snapshot_read_time`, is added to `PgExportTxnSnapshotRequestPB` and sent to the tserver via the `ExportTxnSnapshot` RPC. - This value is then stored as the `read_time` for the snapshot, and a new snapshot ID is generated. **Upgrade/Rollback safety:** This revision is part of a new feature whose syntax is not currently used by any user and guarded by a preview flag. It introduces new fields in already existing messages, which will only be utilized when the new syntax is invoked. JIRA: DB-13622 Test Plan: Jenkins: urgent ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlotSnapshotAction' ./yb_build.sh --cxx-test integration-tests_cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestCreateReplicationSlotExportSnapshot ./yb_build.sh --cxx-test pgwrapper_pg_export_snapshot-test --gtest_filter "PgExportSnapshotTest.*" Reviewers: skumar, siddharth.shah, pjain, patnaik.balivada, stiwary, sumukh.phalgaonkar, hsunder, dmitry, xCluster Reviewed By: stiwary Subscribers: yql, ybase, ycdcxcluster, svc_phabricator Differential Revision: https://phorge.dev.yugabyte.com/D39355

Commit:7a93dcc
Author:Tanuj Nayak
Committer:Tanuj Nayak

[#25859] YSQL: Pass creation-time ybhnsw params to docdb Summary: ybhnsw is yb's access method implementation of hnsw (hierarchical navigable small world) indexes. These indexes have a few creation parameters, `ef_construction` and `m`, whose details can be found [[ https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md | here ]]. Before this change, users could create such ybhnsw indexes on vector columns as follows: `CREATE INDEX sample_vectors_ybhnsw_idx ON sample_vectors USING ybhnsw (vector_column);` This diff adds reloptions ef_construction and m to allow users to control these parameters during ybhnsw index creation as follows: `CREATE INDEX sample_vectors_ybhnsw_idx ON sample_vectors USING ybhnsw (vector_column) WITH (m = 32, ef_construction = 300);` This change pushes `ef_construction` and `m` to DocDB by adding corresponding fields to `YbcPgVectorIdxOptions` inside src/yb/yql/pggate/ybc_pg_typedefs.h and `PgVectorIdxOptionsPB` in common.proto. This diff simply passes down these parameters to DocDB but does not change any DocDB code to use these parameters. The DocDB changes required to actually use these parameters will be done in a followup. Note: - a new function parameter "reloptions" is added to amapi yb_ambindschema. - _PG_init has been moved from ivfflat.c to vector.c just like in upstream pgvector v0.8.0. **Upgrade/Rollback safety:** This change adds two fields to `PgVectorIdxOptionsPB` in common.proto, `hnsw_ef` and `hnsw_m`. These fields are part of an unreleased feature, so they should not be in use yet. New DocDB code that uses these fields should be tested to make sure it runs fine without valid values in these fields. Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressThirdPartyExtensionsPgVector' Reviewers: jason, aleksandr.ponomarenko Reviewed By: jason, aleksandr.ponomarenko Subscribers: kramanathan, aleksandr.ponomarenko, mlillibridge, yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D41658
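
A minimal sketch of what the `PgVectorIdxOptionsPB` additions might look like. Only the message name and the field names `hnsw_ef` and `hnsw_m` come from the commit message; the syntax version, field numbers, and types are assumptions.
```
// Sketch only -- illustrative field numbers and types.
syntax = "proto2";

message PgVectorIdxOptionsPB {
  // Existing vector index options elided.
  // ef_construction reloption: size of the candidate list used while building
  // the HNSW graph.
  optional uint32 hnsw_ef = 10;
  // m reloption: maximum number of neighbors kept per graph node.
  optional uint32 hnsw_m = 11;
}
```
As the commit notes, this change only plumbs the values through to DocDB; the DocDB code that actually consumes them lands in a follow-up.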

Commit:1d8e225
Author:Hari Krishna Sunder
Committer:Hari Krishna Sunder

[BACKPORT 2024.2][#25817] DocDB: Add capability to FATAL a tserver from the master Summary: When a yb-tserver starts up, it heartbeats with the master. At this time the master can determine that the tserver is not allowed to join the universe and should immediately crash. For example, if an older tserver tries to join a universe that has already been upgraded and finalized to a newer version. Check #25785/D41510 to see how this is used in the newer version. **Upgrade/Downgrade safety:** The new proto field is optional and only set after the upgrade to a version higher than this version. Fixes #25817 Jira: DB-15114 Test Plan: Manually tested with higher version Reviewers: zdrudi, #db-approvers Reviewed By: zdrudi, #db-approvers Subscribers: slingam, svc_phabricator, ybase Differential Revision: https://phorge.dev.yugabyte.com/D41601

Commit:8768b15
Author:Utkarsh Munjal
Committer:Utkarsh Munjal

[BACKPORT 2024.2][#24159] YSQL : Cleanup of PG Transaction Snapshot Metadata Summary: Backport Summary: No merge conflicts Original Summary: 17c2711109b63125d34caa4d781b33b7b4929167/D38542 introduced support for exporting and setting transaction snapshots in YSQL. The exported metadata is stored in the memory of the exporting tserver (inside `PgTxnSnapshotManager`) and needs to be cleaned up when the exporting transaction ends. Reasons for Cleanup: # Redundant Metadata Removal. # PostgreSQL Compatibility: In PostgreSQL, exported snapshots are only valid while the exporting transaction is alive. Once the exporting transaction ends, importing these snapshots is disallowed. YugabyteDB aims to enforce the same constraint by promptly cleaning up metadata after the exporting transaction terminates. A transaction can end in the following ways: # Graceful Termination ## Transaction Commits or Aborts ## Session ends gracefully # Ungraceful Termination ## Session ends ungracefully ## Exporting Tserver crashes ## Entire Cluster goes down When a transaction commits or aborts, PostgreSQL calls the `AtEOXact_Snapshot` function at the end of the transaction. In this function, a call is made to `PgTxnManager` to check for exported snapshots. If an exported snapshot exists, an RPC call is sent to the local tserver to erase the snapshots stored against this particular `session_id`, hence cleaning up all the stored snapshots of this session. When a session ends (gracefully or ungracefully), the corresponding `PgClientSession` object is shut down and cleaned up by the `PgClientService` in the method `CheckExpiredSessions`; for all the expired sessions we call the `UnregisterAll` method of `PgTxnSnapshotManager`, which cleans up the stored snapshots against this session (this was introduced in 17c2711109b63125d34caa4d781b33b7b4929167/D38542). When the exporting tserver crashes, snapshot data automatically becomes unavailable as it was stored in the tserver's memory only. **Upgrade/Rollback safety:** This revision is part of a new feature whose syntax is not currently used by any user. It introduces a new set of RPCs and messages, which will only be utilized when the new syntax is invoked. JIRA: DB-13046 Original commit: 125683b9377f901002e311b5382a4da02ce5ff87 / D39556 Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_export_snapshot-test --gtest_filter "PgExportSnapshotTest.*" Reviewers: skumar, dmitry, pjain, patnaik.balivada, hsunder Reviewed By: dmitry Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D41585
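
The commit does not quote the cleanup RPC itself, but the session-scoped erase it describes amounts to something like the following. The message and field names here are entirely hypothetical, shown only to illustrate the shape of the call; only the idea of cleaning up by `session_id` comes from the commit message.
```
// Hypothetical sketch -- names and field numbers are illustrative only.
syntax = "proto2";

// Sent by the PG backend to its local tserver when the exporting transaction
// commits or aborts; the tserver then drops every snapshot registered under
// this session.
message PgClearExportedTxnSnapshotsRequestPB {
  optional uint64 session_id = 1;
}

message PgClearExportedTxnSnapshotsResponsePB {
}
```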

Commit:158cb9b
Author:Hari Krishna Sunder
Committer:Hari Krishna Sunder

[#25785] YSQL: Check yb-tserver versions during upgrade Summary: Include the yb-tserver versions information in the registration message it sends to yb-master. When a yb-tserer restarts it always sends this registration message. yb-master now persists this version information in the tserver descriptor. This version information is now used to protect Upgrade, and Rollback RPCs of YSQL Major catalog and AutoFlag. It is also used to prevent a yb-tserver running on an incorrect version from running, ex PG11 tserver after the ysql major upgrade has finalized, or PS15 tserver before the PG15 catalog is ready. Heartbeat response contains `is_fatal_error` which when set indicates to the yb-tserver that it should immediately FATAL and exit the process. ** Upgrade/Downgrade safety:** - `is_fatal_error` is added to `TSHeartbeatResponsePB`. This is intended as a safety mechanism in newer versions and older versions will ignore this. - The `version_info` is added to `TSRegistrationPB`. Older tservers will not set this, and this has been tested in the upgrade tests. - The `version_info` is added to `SysTabletServerEntryPB`. This will get populated as new tservers get upgraded. This is also persisted to disk via sys_catalog `TSERVER_REGISTRATION`. On a Rollback of master, these values are ignored. On a rollback, the tservers are first rolled back and will not send this info, causing us to clear the persisted value as well. When yb-master finally rollsback it will not have this persisted on disk. Fixes #25785 Jira: DB-15074 Test Plan: Pg15UpgradeTest.CheckVersion AutoFlagUpgradeTest.TestUpgrade Manually test the FATALs ``` yugabyte-2024.2.1.0 $ ./bin/yb-ctl wipe_restart --rf 3 Destroying cluster. Creating cluster. Waiting for cluster to be ready. ---------------------------------------------------------------------------------------------------- | Node Count: 3 | Replication Factor: 3 | ---------------------------------------------------------------------------------------------------- | JDBC : jdbc:postgresql://127.0.0.1:5433/yugabyte | | YSQL Shell : bin/ysqlsh | | YCQL Shell : bin/ycqlsh | | YEDIS Shell : bin/redis-cli | | Web UI : http://127.0.0.1:7000/ | | Cluster Data : /Users/hsunder/yugabyte-data | ---------------------------------------------------------------------------------------------------- For more info, please use: yb-ctl status yugabyte-2024.2.1.0 $ ./bin/yb-ts-cli set_flag --server_address 127.0.0.1:9100 --force ysql_yb_enable_expression_pushdown false yugabyte-2024.2.1.0 $ ./bin/yb-ts-cli set_flag --server_address 127.0.0.2:9100 --force ysql_yb_enable_expression_pushdown false yugabyte-2024.2.1.0 $ ./bin/yb-ts-cli set_flag --server_address 127.0.0.3:9100 --force ysql_yb_enable_expression_pushdown false master $ ./bin/yb-ctl restart_node 1 --master ./bin/yb-ctl restart_node 3 --master Stopping node master-1. Starting node master-1. Waiting for cluster to be ready. master $ ./bin/yb-ctl restart_node 2 --master Stopping node master-2. Starting node master-2. Waiting for cluster to be ready. master $ ./bin/yb-ctl restart_node 3 --master Stopping node master-3. Starting node master-3. Waiting for cluster to be ready. # Upgrading tserver before ysql major catalog should caus it to fatal master $ ./bin/yb-ctl restart_node 1 Stopping node tserver-1. Starting node tserver-1. Waiting for cluster to be ready. F0128 19:22:41.052047 1819340800 heartbeater.cc:495] Illegal state (yb/master/ysql/ysql_initdb_major_upgrade_handler.cc:676): yb-tserver YSQL major version 15 is too high. 
Cluster is not in a state to accept yb-tservers on a higher version yet. Restart the yb-tserver in the YSQL major version 15 Fatal failure details written to /Users/hsunder/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.FATAL.details.2025-01-28T19_22_41.pid24678.txt F20250128 19:22:41 ../../src/yb/tserver/heartbeater.cc:495] Illegal state (yb/master/ysql/ysql_initdb_major_upgrade_handler.cc:676): yb-tserver YSQL major version 15 is too high. Cluster is not in a state to accept yb-tservers on a higher version yet. Restart the yb-tserver in the previous YSQL major version @ 0x109cb6510 google::LogDestination::LogToSinks() @ 0x109cb5654 google::LogMessage::SendToLog() @ 0x109cb6034 google::LogMessage::Flush() @ 0x109cb9f3c google::LogMessageFatal::~LogMessageFatal() @ 0x109cb6ee0 google::LogMessageFatal::~LogMessageFatal() @ 0x105399424 yb::tserver::Heartbeater::Thread::TryHeartbeat() @ 0x105399abc yb::tserver::Heartbeater::Thread::DoHeartbeat() @ 0x105399e68 yb::tserver::Heartbeater::Thread::RunThread() @ 0x10bd34a04 yb::Thread::SuperviseThread() @ 0x189fa82e4 _pthread_start @ 0x189fa30fc thread_start yugabyte-2024.2.1.0 $ ./bin/yb-ctl restart_node 1 --tserver_flags='ysql_yb_enable_expression_pushdown=false' Stopping node tserver-1. Starting node tserver-1. Waiting for cluster to be ready. master $ ./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 ysql_major_version_catalog_upgrade ysql major catalog upgrade started ysql major catalog upgrade completed successfully master $ ./bin/yb-ctl restart_node 2 --tserver_flags='ysql_yb_enable_expression_pushdown=false' Stopping node tserver-2. Starting node tserver-2. Waiting for cluster to be ready. master $ ./bin/yb-ctl restart_node 3 --tserver_flags='ysql_yb_enable_expression_pushdown=false' Stopping node tserver-3. Starting node tserver-3. Waiting for cluster to be ready. # Finalizing the upgrade before node 1 has upgraded should fail master $ ./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 finalize_ysql_major_version_catalog_upgrade Error running finalize_ysql_major_version_catalog_upgrade: Illegal state (yb/master/ts_manager.cc:585): Unable to finalize ysql major catalog upgrade: Cannot finalize YSQL major catalog upgrade before all yb-tservers have been upgraded to the current version: yb-tserver(s) not on the correct version: [0x000000010e88aa98 -> { permanent_uuid: 1e12109aa3d241848c382a9d7f4f68fd registration: private_rpc_addresses { host: "127.0.0.1" port: 9100 } http_addresses { host: "127.0.0.1" port: 9000 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } placement_uuid: "" pg_port: 5433 placement_id: cloud1:datacenter1:rack1 }] master $ ./bin/yb-ctl restart_node 1 --tserver_flags='ysql_yb_enable_expression_pushdown=false' Stopping node tserver-1. Starting node tserver-1. Waiting for cluster to be ready. 
master $ ./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 finalize_ysql_major_version_catalog_upgrade Finalize successful master $ ./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 promote_auto_flags PromoteAutoFlags completed successfully New AutoFlags were promoted New config version: 2 master $ ./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 upgrade_ysql YSQL successfully upgraded to the latest version ``` Reviewers: telgersma, zdrudi, asrivastava Reviewed By: telgersma, zdrudi Subscribers: svc_phabricator, ybase Differential Revision: https://phorge.dev.yugabyte.com/D41510
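
A rough sketch of the two proto additions called out in the Upgrade/Downgrade notes above. The message names `TSRegistrationPB` and `TSHeartbeatResponsePB` and the field names `version_info` and `is_fatal_error` come from the commit message; the import path, the `VersionInfoPB` type name, the field numbers, and the comments are assumptions.
```
// Sketch only -- not the actual master protos.
syntax = "proto2";

// Assumed location of the message generated from version_info.proto.
import "yb/common/version_info.proto";

message TSRegistrationPB {
  // Existing registration fields elided.
  // Build/version details the yb-tserver reports every time it registers;
  // persisted by yb-master in the tserver descriptor.
  optional VersionInfoPB version_info = 20;
}

message TSHeartbeatResponsePB {
  // Existing heartbeat response fields elided.
  // When set, the yb-tserver FATALs immediately instead of joining the universe.
  optional bool is_fatal_error = 30;
}
```
Older tservers never set `version_info` and ignore `is_fatal_error`, which is why the change is described as safe for mixed-version clusters.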

Commit:3146198
Author:Basava
Committer:Basava

[#25801] DocDB: Remove txn reuse version field introduced for object locking feature Summary: For supporting object locks, the earlier idea was to version docdb transactions and reuse the same transaction when the txn doesn't write any intents. The version felt necessary in order to prune obsolete wait-for edges at the deadlock detector, since the only way they get pruned is when the status tablet responds that the txn/subtxn is inactive; else, it could lead to false deadlocks. Instead, if we ensure that we don't reuse a docdb readonly transaction when it ends up blocking a DDL, we wouldn't need transaction versioning in the first place. The wait-for edges on the transaction would get pruned once the transaction finishes. This diff reverts the `txn_reuse_version` changes introduced as part of https://phorge.dev.yugabyte.com/D40202 **Upgrade/Downgrade safety** 1. Deprecating the field in productionized proto messages. 2. Deleting the field in object-locks-specific proto messages since the usage is guarded by the test flag `enable_object_locking_for_table_locks` Jira: DB-15101 Test Plan: Jenkins Reviewers: amitanand, rthallam Reviewed By: amitanand Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D41559
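
The safety note relies on standard protobuf field-evolution practice. Below is a generic sketch of the two approaches it mentions; the message names are placeholders and the field number is illustrative, and only the field name `txn_reuse_version` comes from the commit message.
```
// Generic sketch -- placeholder messages, illustrative field number.
syntax = "proto2";

message ProductionizedMessagePB {
  // Approach 1: keep the wire slot but mark it deprecated, so peers that still
  // read or write the field remain wire-compatible.
  optional uint32 txn_reuse_version = 7 [deprecated = true];
}

message ObjectLockOnlyMessagePB {
  // Approach 2: delete the field and reserve its number and name so they can
  // never be reused with a different meaning. Acceptable here because the
  // message is only exercised behind the test flag
  // enable_object_locking_for_table_locks.
  reserved 7;
  reserved "txn_reuse_version";
}
```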

Commit:aed8c61
Author:Utkarsh Munjal
Committer:Utkarsh Munjal

[BACKPORT 2024.2][#24155] YSQL: Introduce support for synchronizing snapshots Summary: **Backport Summary:** 1. src/postgres/src/backend/utils/time/snapmgr.c ImportSnapshot: PG15 merge commit added `snapshot.snapshot_type = SNAPSHOT_MVCC;` and the original commit changed the indentation, which led to conflict. Resolved it by using the right indentation and keeping the PG11 contents 2. src/yb/ash/wait_state.h YB_DEFINE_TYPED_ENUM(PggateRPC,...: Original commit added `kExportTxnSnapshot` and `kImportTxnSnapshot` but 2024.2 branch didn't have `kAcquireAdvisoryLock` and `kReleaseAdvisoryLock` which resulted in conflict. Resolved by adding `kExportTxnSnapshot` and `kImportTxnSnapshot`. 3. src/yb/common/pgsql_protocol.proto Original commit added `message PgTxnSnapshotPB` master had `message PgsqlAdvisoryLockPB` and `message PgsqlLockRequestPB`, which was not there in 2024.2 which resulted in conflict. Resolved by only adding `message PgTxnSnapshotPB`. 4. src/yb/master/master_tserver.h Original commit added `GetLocalPgTxnSnapshot` and functions `SkipCatalogVersionChecks` `permanent_uuid` were not present in 2024.2, resolved by only adding `GetLocalPgTxnSnapshot`. 5. src/yb/tserver/pg_client.proto Original commit added messages and rpcs for export and import of pg txn snapshots, but 2024.2 branch didn't have messages and rpcs for Advisory locks, which caused the conflict. Resolved by just importing rpcs and messages for pg txn snapshot export/import. 6. src/yb/tserver/pg_client_service.cc Impl Constructor: Original commit added instatiation of txn_snapshot_manager_, but 2024.2 didn't have some other instatiations that were in master. Resolved by only keeping the `txn_snapshot_manager_` Original commit added `PgTxnSnapshotManager txn_snapshot_manager_;`, conflict was because of absence of `std::optional<cdc::CDCStateTable> cdc_state_table_;` in 2024.2, resolved by only importing `PgTxnSnapshotManager txn_snapshot_manager_;` 7. src/yb/tserver/pg_client_service.h In declaration of methods in `YB_PG_CLIENT_METHODS`, presence of Advisory locks caused the conflict. Resolved by just keeping the relevant methods (export/import pg txn snapshot). 8. src/yb/tserver/pg_client_session.cc Conflict in includes (`#include "yb/tserver/pg_txn_snapshot_manager.h"`) because of other includes in master(not present in 2024.2), resolved by keeping the relevant includes only. 9. src/yb/tserver/tablet_server_interface.h Original commit introduced `GetLocalPgTxnSnapshot`, conflict was because of other functions introduced in master but not in 2024.2. Resolved by keeping the relevant function `GetLocalPgTxnSnapshot`. 10. src/yb/tserver/tserver_service.proto `GetMetrics` rpc was removed in master but not in 2024.2, and original commit introduced `GetLocalPgTxnSnapshot` rpc. Resolved the conflict by adding both the rpcs. 11. src/yb/yql/pggate/pg_session.cc: Original commit changed `pg_txn_manager_->SetupPerformOptions(&options)` to `RETURN_NOT_OK(pg_txn_manager_->SetupPerformOptions(&options));` in `AcquireAdvisoryLock`, this function was not present in 2024.2 this resulted into a conflict. Resolved by accepting the 2024.2 changes. 12. src/yb/yql/pggate/ybc_pg_typedefs.h Original Commit introduced `ysql_enable_pg_export_snapshot` and in master `YBCPgGFlagsAccessor` was changed to `YbcPgGFlagsAccessor`. Resolved the conflict by adding `ysql_enable_pg_export_snapshot` and keeping the original name `YBCPgGFlagsAccessor`. 13. 
src/yb/yql/pggate/ybc_pggate.cc Original commit defined gflag `ysql_enable_pg_export_snapshot`; the conflict was caused by another gflag definition in master (`ysql_block_dangerous_roles`). Resolved by only defining `ysql_enable_pg_export_snapshot`. Original commit introduced YBCPgExportSnapshot and YBCPgImportSnapshot, but master also had functions for advisory locks (not present in 2024.2). Resolved by only keeping the relevant functions. 14. src/yb/yql/pggate/ybc_pggate.h Original commit introduced YBCPgExportSnapshot and YBCPgImportSnapshot, but master also had functions for advisory locks (not present in 2024.2). Resolved by only keeping the relevant functions. **Original Summary:** This revision allows transactions to synchronize their snapshots. To that end, it supports the following PostgreSQL-compatible syntax: - `SELECT pg_export_snapshot()` - `SET TRANSACTION SNAPSHOT <snapshot_name>` To synchronize snapshots of two active transactions, first get the snapshot id from the output of `SELECT pg_export_snapshot()`. Then, `SET TRANSACTION SNAPSHOT <snapshot_name>` in the second transaction to use the same database snapshot as the first one. Now, both the transactions see identical content in the database, apart from their own writes. Semantics: 1. SELECT pg_export_snapshot() and SET TRANSACTION SNAPSHOT are only valid from within REPEATABLE READ transactions. Other isolation levels return an error. 2. SELECT pg_export_snapshot() returns an id which can then be used to set the transaction snapshot of a transaction on any node of the universe. 3. The importing transaction retains its transaction semantics. 1. Cannot import a snapshot into a transaction that already has a transaction snapshot. This is a fundamental property of a REPEATABLE READ transaction. 2. Avoids stale read anomalies as expected of a YugabyteDB transaction. 4. Currently, in YugabyteDB, the stored metadata is not cleaned up after a transaction ends. As a result, it is possible to import a snapshot even after the exporting transaction has concluded. However, this practice is not recommended and may lead to unpredictable behavior. Future revisions will introduce support for metadata cleanup, ensuring such imports are no longer allowed. **Design Overview**: The snapshot export and set functionality is enabled through the use of `read_time`. The `read_time` is associated with a snapshot ID and is stored in the exporting tserver's memory. When setting the transaction snapshot, this `read_time` is retrieved and set in the tserver. The subsequent statements operate based on that specific read point. During the export of a snapshot, certain metadata (same as PG) such as database OID, isolation level, and read-only status is gathered. Here, YugabyteDB and PG diverge in how snapshots are managed: PG maintains snapshot boundaries using metadata fields including `xmin` and `xmax`, which track the visibility of transactions for consistency. YugabyteDB simplifies this with a single field, `read_time`, which serves as an equivalent to PG's xmin and xmax, encapsulating the read point for the snapshot. **Implementation Details of `pg_export_snapshot`** On the tserver, the `read_time` is determined using the `PgClientSession::SetupSession` function, which is called with `read_time_manipulation` set to `ReadTimeManipulation::ENSURE_READ_TIME_IS_SET`, which ensures that a `read_time` is picked here; if it is a new transaction (`pg_export_snapshot` is the first statement), then the current time is picked as `read_time`. 
The following metadata is stored in the tserver's memory: - Database OID - Isolation level - Read-only status - `read_time` **Snapshot Id** The snapshot id returned by `pg_export_snapshot` has the following structure: `<exporting tserver UUID>-<random UUID>` e.g. ``` yugabyte=*# SELECT pg_export_snapshot(); pg_export_snapshot --------------------------------------------------------------------- 4b6c7bfb62c6405db65b311d6516543e-c56d045007440191d34d983c5dbe8ab6 (1 row) ``` **Storage and Retrieval of Metadata** Upon receiving metadata in `PgTxnSnapshotManager::Register`, a random, not-in-use UUID is generated and the metadata is stored, in the tserver's memory only, against this UUID. To retrieve the data, we obtain the UUID of the tserver on which this metadata is stored from the snapshot id. 1. If the importing tserver is the same as the exporting tserver, then it fetches the metadata stored against the snapshot id (the random UUID). 2. If the importing tserver is different from the exporting one, then an RPC call is made to the exporting tserver, which returns the requested metadata to this tserver for use. Metadata is not persisted to disk because if a tserver crashes, the associated exporting session ends, aligning with PG semantics that disallow importing snapshots from ended transactions. **Implementation Details of SET TRANSACTION SNAPSHOT** When the `SET TRANSACTION SNAPSHOT` statement is executed, the previously stored metadata is retrieved from the exporting remote tserver as described above and applied to set the `read time`. **Limitations**: Currently, this functionality is only supported in the `REPEATABLE READ` isolation level. The stored metadata is not yet deleted after the exporting transaction ends; this will be addressed in future commits. Additionally, importing a snapshot is not permitted if `yb_read_time`, `read_time_for_follower_reads_`, or `yb_read_after_commit_visibility` is set. **GFlag** This feature is guarded by a **PREVIEW** flag `ysql_enable_pg_export_snapshot`, with default value false. **Upgrade/Rollback safety:** No existing proto message has been modified. New messages and RPCs are only used when the user executes the new SQL syntax SELECT pg_export_snapshot() and SET TRANSACTION SNAPSHOT. The user is only allowed to use the new syntax after the upgrade has completed. Using the syntax in an older version will fail, and if incorrectly used in the middle of an upgrade, in the worst case it will fail with "RPC Not implemented" errors. This is ok and does not cause any correctness issues. No AutoFlag is used since brand new SQL syntax is required to trigger any of this code path. JIRA: DB-13042 Original commit: 17c2711109b63125d34caa4d781b33b7b4929167 / D38542 Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_export_snapshot-test --gtest_filter "PgExportSnapshotTest.*" ./yb_build.sh --java-test 'TestPgBatch#testImportTxnSnapshot' ./yb_build.sh --java-test 'TestPgBatch#testExportTxnSnapshot' Reviewers: skumar, dmitry, pjain, patnaik.balivada, hsunder Reviewed By: dmitry Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D41474
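
A minimal sketch of the snapshot metadata message this commit describes. The message name `PgTxnSnapshotPB` appears in the conflict notes above and the listed contents (database OID, isolation level, read-only status, `read_time`) come from the summary, but the field names, numbers, and types are assumptions.
```
// Sketch only -- illustrative field names, numbers, and types.
syntax = "proto2";

message PgTxnSnapshotPB {
  // Database the exporting transaction was connected to.
  optional uint32 db_oid = 1;
  // Isolation level and read-only status captured at export time.
  optional uint32 isolation_level = 2;
  optional bool read_only = 3;
  // The YB read point; stands in for PG's xmin/xmax snapshot bounds.
  optional fixed64 read_time = 4;
}
```
The `<exporting tserver UUID>-<random UUID>` snapshot id is what lets an importing tserver decide whether to read this metadata locally or fetch it over RPC from the exporting tserver.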

Commit:125683b
Author:Utkarsh Munjal
Committer:Utkarsh Munjal

[#24159] YSQL : Cleanup of PG Transaction Snapshot Metadata Summary: 17c2711109b63125d34caa4d781b33b7b4929167/D38542 introduced support for exporting and setting transaction snapshots in YSQL. The exported metadata is stored in the memory of the exporting tserver (inside `PgTxnSnapshotManager`) and needs to be cleaned up when the exporting transaction ends. Reasons for Cleanup: # Redundant Metadata Removal. # PostgreSQL Compatibility: In PostgreSQL, exported snapshots are only valid while the exporting transaction is alive. Once the exporting transaction ends, importing these snapshots is disallowed. YugabyteDB aims to enforce the same constraint by promptly cleaning up metadata after the exporting transaction terminates. A transaction can end in the following ways: # Graceful Termination ## Transaction Commits or Aborts ## Session ends gracefully # Ungraceful Termination ## Session ends ungracefully ## Exporting Tserver crashes ## Entire Cluster goes down When a transaction commits or aborts, PostgreSQL calls the `AtEOXact_Snapshot` function at the end of the transaction. In this function, a call is made to `PgTxnManager` to check for exported snapshots. If an exported snapshot exists, an RPC call is sent to the local tserver to erase the snapshots stored against this particular `session_id`, hence cleaning up all the stored snapshots of this session. When a session ends (gracefully or ungracefully), the corresponding `PgClientSession` object is shut down and cleaned up by the `PgClientService` in the method `CheckExpiredSessions`; for all the expired sessions we call the `UnregisterAll` method of `PgTxnSnapshotManager`, which cleans up the stored snapshots against this session (this was introduced in 17c2711109b63125d34caa4d781b33b7b4929167/D38542). When the exporting tserver crashes, snapshot data automatically becomes unavailable as it was stored in the tserver's memory only. **Upgrade/Rollback safety:** This revision is part of a new feature whose syntax is not currently used by any user. It introduces a new set of RPCs and messages, which will only be utilized when the new syntax is invoked. JIRA: DB-13046 Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_export_snapshot-test --gtest_filter "PgExportSnapshotTest.*" Reviewers: hsunder, skumar, pjain, patnaik.balivada, dmitry, stiwary Reviewed By: dmitry Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D39556

Commit:9a11dd2
Author:Sergei Politov
Committer:Sergei Politov

[#25676] DocDB: Vector index backfill phase 1 Summary: Currently we ignore the data that was present in the table before the vector index was created. This should be fixed by backfilling the data during index creation. This diff implements the first phase of the backfill implementation. The backfill process is started as soon as the vector index is added to the tablet. During nonconcurrent index creation we wait until the backfill process finishes. The things left to implement: 1) In this diff index backfill happens in a single write. There could be a lot of data in the indexed table, so we should split writes into multiple chunks. 2) The TServer could be restarted during the backfill procedure, so resuming index backfill should be implemented. 3) When checking whether the index is ready, only the leader state is checked. It is preferable to check for replica majority at least. 4) Concurrent index backfill. Concurrent index creation may also not work; this part was not checked. 5) Backfill is implemented and tested for the nonconcurrent case only. 6) The behaviour of the concurrent case is undefined. Upgrade/Rollback safety: Safe to upgrade and rollback. Jira: DB-14932 Test Plan: PgVectorIndexTest.ManyRowsWithBackfill Reviewers: arybochkin, tnayak, jason Reviewed By: arybochkin, jason Subscribers: jason, yql, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D41326

Commit:207beba
Author:Vaibhav Kushwaha
Committer:Vaibhav Kushwaha

[BACKPORT 2024.2][#24108] CDC: Add column yb_lsn_type to pg_replication_slots view Summary: Original commit: 8ff3ca3a9c77843872bae4b604f4c3bfd550e297 / D38846 This diff adds a column for `yb_lsn_type` to the `pg_replication_slots` view. Additionally, a migration script has also been added to facilitate upgrades. **Upgrade / rollback safety:** This diff adds a field `yb_lsn_type` to the message `PgReplicationSlotInfoPB`. This revision does not alter persisted data. New fields are added to the tserver response (to pggate) proto. When the value is absent, master (which is upgraded first) is expected to fill in the appropriate default value. This is upgrade and rollback safe. No new flags are added to guard the feature. **Backport description:** 1. `src/postgres/src/backend/catalog/system_views.sql` - In 2024.2, this file does not contain the views with YB modifications so the changes for the view had to be incorporated to `src/postgres/src/backend/catalog/yb_system_views.sql` 2. `src/postgres/src/include/catalog/pg_proc.dat` - File in 2024.2 does not contain the columns like `wal_status`, `safe_wal_size` and `two_phase` - they were manually removed. 3. `src/postgres/src/include/catalog/pg_yb_migration.dat` - The migration script version had to be changed to `V59.4` to follow backport standards. 4. `src/postgres/src/test/regress/expected/yb_pg_rules.out` - While cherry-picking, other columns like `wal_status`, `safe_wal_size` and `two_phase` were also picked along with `yb_lsn_type` - they were manually removed too. 5. The migration script was renamed to `V59.4__24108__yb_lsn_type_in_pg_replication_slots.sql` and the columns `wal_status`, `safe_wal_size` and `two_phase` were removed while recreating the view along with the removal of the recreation for view `pg_stat_replication_slots` as it’s a PG15 specific view. 6. Changes also include removal of a parameter for `two_phase` from the method `pg_create_logical_replication_slot` from the files `src/postgres/src/test/regress/expected/yb_replication_slot.out` and `src/postgres/src/test/regress/sql/yb_replication_slot.sql` Jira: DB-12997 Test Plan: Run existing tests. Reviewers: skumar, stiwary, sumukh.phalgaonkar, aagrawal, utkarsh.munjal, xCluster, hsunder Reviewed By: aagrawal Subscribers: ycdcxcluster, ybase, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D41253
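
A minimal sketch of the `PgReplicationSlotInfoPB` change described above. Only the message name and the field name `yb_lsn_type` come from the commit message; the syntax version, field number, and type are assumptions (the commit does not quote the definition).
```
// Sketch only -- illustrative field number and type.
syntax = "proto2";

message PgReplicationSlotInfoPB {
  // Existing replication slot metadata elided.
  // LSN type surfaced through the new yb_lsn_type column of
  // pg_replication_slots; the master fills in a default when an older peer
  // leaves it unset.
  optional string yb_lsn_type = 15;
}
```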

Commit:6281c17
Author:Hari Krishna Sunder
Committer:Hari Krishna Sunder

[#25746] DocDB: Support use of version_info.proto in RPC messages Summary: Switch `version_info.proto` to use `YRPC_GENERATE` so that it can be used in RPC messages. Moving `version_info.proto`, `version_info.cc` and `init.cc` to `yb_common` lib since `YRPC_GENERATE` depends on `yb_util` lib. Fixes #25746 Jira: DB-15027 Test Plan: Jenkins Reviewers: sergei, asrivastava Reviewed By: sergei Subscribers: esheng, ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D41470

Commit:17c2711
Author:Utkarsh Munjal
Committer:Utkarsh Munjal

[#24155] YSQL: Introduce support for synchronizing snapshots Summary: This revision allows transactions to synchronize their snapshots. To that end, support the following PostgreSQL compatible syntax: - `SELECT pg_export_snapshot()` - `SET TRANSACTION SNAPSHOT <snapshot_name>` To synchronize snapshots of two active transactions, first get the snapshot id from the output of `SELECT pg_export_snapshot()`. Then, `SET TRANSACTION SNAPSHOT <snapshot_name>` in the second transaction to use the same database snapshot as the first one. Now, both the transactions see identical content in the database, apart from their own writes. Semantics: 1. SELECT pg_export_snapshot() and SET TRANSACTION SNAPSHOT are only valid from within REPEATABLE READ transactions. Other isolation levels return an error. 2. SELECT pg_export_snapshot() returns a id which can then be used to set the transaction snapshot of a transaction on any node of the universe. 3. The importing transaction retains its transaction semantics. 1. Cannot import snapshot into a transaction that already has a transaction snapshot. This is a fundamental property of a REPEATABLE READ transaction. 2. Avoids stale read anomalies as expected of a YugabyteDB transaction. 4. Currently, in YugabyteDB, the stored metadata is not cleaned up after a transaction ends. As a result, it is possible to import a snapshot even after the exporting transaction has concluded. However, this practice is not recommended and may lead to unpredictable behavior. Future revisions will introduce support for metadata cleanup, ensuring such imports are no longer allowed. **Design Overview**: The snapshot export and set functionality is enabled through the use of `read_time`. The `read_time` is associated with a snapshot ID and is stored in the exporting tserver's memory. When setting the transaction snapshot, this `read_time` is retrieved and set in tserver. The subsequent statements operate based on that specific read point. During the export of a snapshot, certain metadata (same as PG) such as database OID, isolation level, and read-only status is gathered. Here, YugabyteDB and PG diverge in how snapshots are managed: PG maintains snapshot boundaries using metadata fields including `xmin`and `xmax`, which track the visibility of transactions for consistency. YugabyteDB simplifies this with a single field, `read_time`, which serves as an equivalent to PG's xmin and xmax, encapsulating the read point for the snapshot. **Implementation Details of `pg_export_snapshot`** On the tserver, the `read_time` is determined using the `PgClientSession:SetupSession` function which is called with `read_time_manipulation` set to `ReadTimeManipulation::ENSURE_READ_TIME_IS_SET`, which ensures that a `read_time` is picked here and if it is a new transaction (`pg_export_snapshot` is the first statement) then current time is picked as `read_time`. The following metadata is stored in the tserver's memory: - Database OID - Isolation level - Read-only status - `read_time` **Snapshot Id** Snapshot Id returned by `pg_export_snapshot` is of the following structure: `<exporting tserver UUID>-<random UUID>` e.g. 
``` yugabyte=*# SELECT pg_export_snapshot(); pg_export_snapshot --------------------------------------------------------------------- 4b6c7bfb62c6405db65b311d6516543e-c56d045007440191d34d983c5dbe8ab6 (1 row) ``` **Storage and Retrieval of Metadata** Upon receiving metadata in the `PgTxnSnapshotManager::Register`, a random not in use UUID is generated and, the metadata is stored in tserver's memory only against this UUID. To retrieve the data we obtain the UUID of the tserver on which this metadata is stored using the snapshot id. 1. If the importing tserver is same as the exporting tserver, then it fetches the metadata stored against the snapshot id(the random UUID). 2. If the importing tserver is different from exporting then an RPC call is made to exporting tserver's , which returns the requested metadata to this tserver for use. Metadata is not persisted to disk because if a tserver crashes, the associated exporting session ends, aligning with PG semantics that disallow importing snapshots from ended transactions. **Implementation Details of SET TRANSACTION SNAPSHOT** When the `SET TRANSACTION SNAPSHOT` statement is executed, the previously stored metadata is retrieved from the exporting remote tserver as described above and applied to set the `read time`. **Limitations**: Currently, this functionality is only supported in the `REPEATABLE READ` isolation level. The stored metadata is not yet deleted after the exporting transaction ends; this will be addressed in future commits. Additionally, importing a snapshot is not permitted if `yb_read_time`, `read_time_for_follower_reads_`, or `yb_read_after_commit_visibility` is set. **GFlag** This feature is guarded by a **PREVIEW** flag `ysql_enable_pg_export_snapshot`, with default value false. **Upgrade/Rollback safety:** No existing proto message has been modified. New messages and RPCs are only used when the user executed new SQL syntax SELECT pg_export_snapshot() and SET TRANSACTION SNAPSHOT . The user is only allowed to use the new syntax after the upgrade has completed. Using the syntax in older version will fail, and if incorrectly used in the middle of an upgrade in the worst case will fail with "RPC Not implemented" errors. This is ok and does not cause any correctness issues. No AutoFlag is used since a brand new SQL syntax is involved to trigger any of this code path. JIRA: DB-13042 Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_export_snapshot-test --gtest_filter "PgExportSnapshotTest.*" ./yb_build.sh --java-test 'TestPgBatch#testImportTxnSnapshot' ./yb_build.sh --java-test 'TestPgBatch#testExportTxnSnapshot' Reviewers: skumar, pjain, stiwary, patnaik.balivada, mhaddad, aagrawal, dmitry, hsunder Reviewed By: patnaik.balivada, dmitry, hsunder Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D38542

Commit:f976939
Author:Hari Krishna Sunder
Committer:Hari Krishna Sunder

[#25382] YSQL: PG15 Online Upgrade: Delete previous version of the catalog tables after upgrade Summary: After the ysql major version upgrade completes, delete the previous version of the catalog tables since they are no longer needed. The async task will keep retrying as long as the node is the master leader. We only expect to have transient failures, so this is safe. If a new master leader is elected and the cleanup is pending, it will kick off a new task. **Upgrade/Downgrade safety:** A new proto field, `previous_version_catalog_cleanup_required`, is added. This is set to a non-default value only after the upgrade is finalized, so it's safe. Jira: DB-14612 Test Plan: YsqlMajorUpgradeRpcsTest.SimultaneousUpgrades YsqlMajorUpgradeRpcsTest.CleanupPreviousCatalog Reviewers: telgersma, smishra Reviewed By: telgersma Subscribers: ybase Differential Revision: https://phorge.dev.yugabyte.com/D41420
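
A sketch of the flag described in the safety note. Only the field name `previous_version_catalog_cleanup_required` comes from the commit message; the containing message name, field number, and type are placeholders.
```
// Hypothetical sketch -- placeholder message, illustrative number and type.
syntax = "proto2";

message YsqlMajorUpgradeCleanupStatePB {
  // Set to a non-default value only after the ysql major upgrade is finalized.
  // A newly elected master leader that sees cleanup pending kicks off the
  // async task to drop the previous version of the catalog tables.
  optional bool previous_version_catalog_cleanup_required = 1;
}
```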

Commit:54145b9
Author:Basava
Committer:Basava

[#25565] DocDB: Detecting deadlocks spanning session advisory locks and row locks Summary: Revision https://phorge.dev.yugabyte.com/D41048 introduced wait-on-conflict and deadlock detection for session advisory locks, but it doesn't address deadlocks spanning session advisory locks and row-level locks. Consider the following ``` ysqlsh> select pg_advisory_lock(10); ysqlsh> begin; update test set v=v+1 where k=1; select pg_advisory_lock(10); // session txn B waits on session txn A begin; update test set v=v+1 where k=1; // regular txn A -> regular txn B ``` Though a deadlock exists, the existing algorithm is unable to detect it since there are multiple active transactions for each ysql session and the wait-for dependencies don't span a common transaction set. This diff addresses the above issue in the following manner: 1. Each session level transaction creates an internal dependency (wait-for probe) on its current active regular transaction (kPlain, kDdl, etc.) 2. When requesting a session level lock in the scope of a (DML/kPlain) transaction, if there is a conflict, the request enters the wait-queue as the active regular txn (and not the session transaction) The above helps us detect the deadlock. Working through the earlier example, 1. gives us `session txn A -> kPlain txn A` 2. gives us `kPlain txn B -> session txn A` 1 & 2 combined give us `kPlain txn B -> kPlain txn A` and we have `kPlain txn A -> kPlain txn B` from the last blocked statement in session A, hence detecting the deadlock. Note: When we introduce support for object locks in the future, similar logic would be required at the object lock manager to detect deadlocks spanning object locks and session adv locks, i.e. at the OLM's wait-queue, the request would need to pose as the active DDL/DML. And since the current diff takes care of introducing the internal wait-for edge from the session txn to the active DDL/DML, the deadlock would be detected. Jira: DB-14817 **Upgrade/Downgrade safety** 1. Added field `background_txn_status_tablet` in `KeyValueWriteBatchPB`, which is accessed only when the message has `background_transaction_id` set. It should be good since both these fields are guarded by the same preview flag `ysql_yb_enable_advisory_locks`. 2. Added another field `background_transaction_meta` to `TransactionStatePB`. The receiver checks for existence before accessing the field. Test Plan: Jenkins ./yb_build.sh --cxx-test='TEST_F(PgAdvisoryLockTest, SessionLockDeadlockWithRowLocks) {' Reviewers: esheng, rthallam, amitanand Reviewed By: esheng Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D41242
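
A sketch of the two proto additions listed in the Upgrade/Downgrade notes. The field names come from the commit message; the field numbers, the types, and the assumption that the metadata is carried as a `TransactionMetadataPB` are illustrative.
```
// Sketch only -- illustrative numbers and types; TransactionMetadataPB is
// assumed to be the existing transaction metadata message and is not
// reproduced here.
syntax = "proto2";

import "yb/common/transaction.proto";  // assumed location of TransactionMetadataPB

message KeyValueWriteBatchPB {
  // Existing fields, including background_transaction_id, elided.
  // Status tablet of the background (session advisory lock) transaction; only
  // read when background_transaction_id is set, both guarded by
  // ysql_yb_enable_advisory_locks.
  optional bytes background_txn_status_tablet = 40;
}

message TransactionStatePB {
  // Existing fields elided.
  // Metadata of the background transaction; receivers check for presence
  // before using it.
  optional TransactionMetadataPB background_transaction_meta = 41;
}
```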