Skip to content

Conversation

@dhartunian
Copy link
Collaborator

Backport 1/1 commits from #148542.

/cc @cockroachdb/release


Previously, tracing transactions in CRDB could be enabled using a single cluster setting: sql.trace.txn.enable_threshold, which, if it was non-zero, would enable tracing on all transactions and only keep and emit the traces for transactions which exceeded the execution time threshold. This has proven to be difficult to use as it reuquires paying tracing cost on the entire workload.

This change modifies the transaction tracing capabilities of CRDB to incorporate a sample rate. A new cluster setting is introduced, named sql.trace.txn.sample_rate which is zero by default. When set to a positive number between 0 and 1, this will set the probability of a transaction having tracing enabled.

For tracing to emit a trace for a transaction, both settings will now need to be set and will need to both trigger for successful output to the SQL_EXEC logging channel. Transactions first pass through the sample rate filter which governs whether a trace is captured, then they will be evaluated through the enable_threshold which governs whether the trace is emitted to the logs.

Epic: None
Resolves: CRDB-51662

Release note (ops change, sql change): In order to selectively capture traces for transactions running in an active workload without haing to capture them via statement diagnostic bundles, customers can now use the sql.trace.txn.sample_rate cluster setting to enable tracing for a fraction of their workload. The sql.trace.txn.enable_threshold will still need to be set to a positive value to provide a filter for how slow a transaction needs to be after being sampled to merit emitting a trace. Traces are emitted to the SQL_EXEC logging channel.

Previously, tracing transactions in CRDB could be enabled using a
single cluster setting: `sql.trace.txn.enable_threshold`, which, if
it was non-zero, would enable tracing on all transactions and only
keep and emit the traces for transactions which exceeded the execution
time threshold. This has proven to be difficult to use as it reuquires
paying tracing cost on the entire workload.

This change modifies the transaction tracing capabilities of CRDB to
incorporate a sample rate. A new cluster setting is introduced, named
`sql.trace.txn.sample_rate` which is zero by default. When set to a
positive number between 0 and 1, this will set the probability of a
transaction having tracing enabled.

For tracing to emit a trace for a transaction, both settings will now
need to be set and will need to both trigger for successful output to
the `SQL_EXEC` logging channel. Transactions first pass through the
sample rate filter which governs whether a trace is captured, then
they will be evaluated through the `enable_threshold` which governs
whether the trace is emitted to the logs.

Epic: None
Resolves: CRDB-51662

Release note (ops change, sql change): In order to selectively capture
traces for transactions running in an active workload without haing
to capture them via statement diagnostic bundles, customers can now
use the `sql.trace.txn.sample_rate` cluster setting to enable tracing
for a fraction of their workload. The `sql.trace.txn.enable_threshold`
will still need to be set to a positive value to provide a filter
for how slow a transaction needs to be after being sampled to merit
emitting a trace. Traces are emitted to the `SQL_EXEC` logging
channel.
@dhartunian dhartunian requested review from a team as code owners July 3, 2025 21:05
@dhartunian dhartunian requested review from alyshanjahani-crl and rytaft and removed request for a team July 3, 2025 21:05
@blathers-crl
Copy link

blathers-crl bot commented Jul 3, 2025

Thanks for opening a backport.

Before merging, please confirm that it falls into one of the following categories (select one):

  • Non-production code changes. Includes test-only changes, build system changes, etc.
  • Fixes for serious issues. Defined in the policy as correctness, stability, or security issues, data corruption/loss, significant performance regressions, breaking working and widely used functionality, or an inability to detect and debug production issues.
  • Other approved changes. These changes must be gated behind a disabled-by-default feature flag unless there is a strong justification not to.

Add a brief release justification to the PR description explaining your selection.

Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy.

All backports must be reviewed by the TL and EM for the owning area.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Jul 3, 2025
@dhartunian dhartunian added do-not-merge bors won't merge a PR with this label. and removed backport Label PR's that are backports to older release branches labels Jul 3, 2025
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@blathers-crl
Copy link

blathers-crl bot commented Jul 3, 2025

❌ PR #149497 does not comply with backport policy

Confidence: high
Explanation: The pull request does not meet the backport policy for CockroachDB regarding critical bug fixes or feature flags. The changes involve the introduction of a new cluster setting (sql.trace.txn.sample_rate) to control transaction tracing based on a sample rate, which is more akin to a new feature enhancement than a critical bug fix. Additionally, this new feature is not gated behind a feature flag that is disabled by default; the setting itself influences operational dynamics but does not strictly adhere to the 'feature flag' mechanism as required under the policy. There is no 'Release justification: .*' line in the PR body that might exempt it under the policy for exceptions. Furthermore, the changes affect production files like pkg/sql/conn_executor_exec.go, pkg/sql/conn_fsm.go, pkg/sql/exec_util.go, pkg/sql/txn_state.go, which are critical to the operation of the database.
Recommendation: Gate the PR behind a feature flag or reconsider the necessity of this backport to the release branch, as it introduces functionality adjustments rather than addressing a critical system flaw.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@dhartunian
Copy link
Collaborator Author

dhartunian commented Jul 11, 2025

These changes are included in #149607

@dhartunian dhartunian closed this Jul 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge bors won't merge a PR with this label.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants