Add new extension yagp_hooks_collector #1629
Conversation
Add yagp_hooks_collector, a shared-preload module that hooks into ExecutorStart and ExecutorFinish to capture query lifecycle events. Includes Makefile with protobuf code generation, GRPC-based delivery, QueryInfo generation (query text, plan text, query_id, plan_id, session metadata), and basic protobuf message filling.
Guard against NULL plan state when generating EXPLAIN output.
… normalized texts Collect spill info (file count, bytes written). Generate normalized query and plan texts using a pg_stat_statements-derived parser. Collect buffer I/O counters, tuple counts, timing, and /proc/self CPU/memory/IO statistics.
…ility Use query_info_collect_hook for finer-grained lifecycle tracking. Fix two segfaults in early init paths. Skip hooks in UTILITY mode. General robustness improvements.
…ion GUCs Add missing Greenplum node types to pg_stat_statements parser. Move stats reporting to ExecutorEnd hook. Improve GRPC failure handling. Track CDB-specific metrics and initial query nesting level. Add resource group collection. Add GUCs for controlling collection. Skip nested and utility statements by default.
…safety Capture /proc stats at query start and compute diff at end rather than reporting lifetime totals. Suppress error rethrows from the collector to avoid breaking other extensions. Add missing hooks deinitialization. Modernize ereport style.
…ocesses Delay initialization of static singletons and GRPC connections to actual query handling time rather than _PG_init, since both are incompatible with fork().
Mute PG-destined signals in GRPC reconnection thread. Move debian config to CI. Redirect debug output to log file. Harden memory handling. Remove thread-unsafe logging and dead code.
Add a comma-separated GUC to suppress metrics collection for specified roles. Parse using SplitIdentifierString and cache in an unordered_set.
Remove GRPC dependency. Serialize metrics as protobuf messages and deliver them over a Unix domain socket. Replace server-side message queue with incremental per-query message building. Add clang-format configuration. Use deprecated protobuf API for bionic compatibility.
Add SQL functions stat_messages() and stat_messages_reset() exposing per-segment UDS transport counters: total_messages, send_failures, connection_failures, other_errors, max_message_size.
Move query message cleanup to the correct lifecycle point. Finalize fields before sending DONE event. Fix protobuf message memory leaks.
Use core query_id from Query instead of a separate hash. Resolve resource group from the current session rather than the role default.
Track query nesting level using a per-query key (tmid, ssid, ccnt, nesting_level, query_desc_addr). Maintain a state machine per active query to correctly sequence submit→start→end→done across nesting boundaries.
Trim query text and plan text to max_text_size and max_plan_size limits.
Capture elog error message at the done event for ERROR and CANCELED statuses. Properly send accumulated runtime metrics before teardown. Drop the intermediate CANCELLING event.
Take an initial metrics snapshot at submit so incremental stats are computed as deltas. Required for correct per-query accounting with nested statements.
Don't normalize trimmed plans. Clean up stale text fields between events. Report nested queries only from dispatcher. Add slice_id. Aggregate inherited_calls and inherited_time on segments.
Factor out ProtoUtils, ProcStats, and PgUtils from EventSender.
Hook into ic_teardown to collect UDP-IFC packet-level counters. Compile-time gated behind IC_TEARDOWN_HOOK.
Fix UB in strcpy. General code refactoring.
Smyatkin-Maxim
left a comment
LGTM, though I'm kind of an interested party myself - so feel free to ignore my ship-it :)
It's been widely used in all our production gp6 clusters, and in terms of production-readiness we're quite confident in it; it's way better than gpperfmon. I guess as a contrib package it should be good enough for Cloudberry as well.
Great! I believe so too. I would like to test it on a local machine and give my +1 early next week. Does that sound good?
Hi, I did a test on a local machine. After building the extension successfully, it showed that some files are deleted when running. Do we need to remove them or keep them in the source? FYI.
Enable building gp_stats_collector by default in configure. Add missing check in verify_query() to ensure the extension does not execute main code while disabled. Always verify protobuf version once the shared library is preloaded.
Delete stale .gitignore. Add Apache headers to .proto files. Change #pragma once to #ifndef guards. Remove test result files from tree. Change ereport(FATAL) to ereport(ERROR). Remove internal naming suffixes. Apply clang-format from gporca.
Add PG_TRY/PG_CATCH around query_info_collect_hook in PortalCleanup error path to prevent exceptions from propagating during cleanup.
Rename ON MASTER to ON COORDINATOR in test SQL. Prefer pg_usleep() over std::this_thread::sleep_for(). Add pg_unreachable() after ereport(ERROR). Widen motion stats fields to uint64.
my-ship-it
left a comment
LGTM.
Great work on this extension — I can see a lot of effort went into it, and the earlier
review feedback has been well addressed.
Before merging, it might be nice to squash or tidy up the commit history a bit. The current log includes quite a few incremental fix-up commits from the review process.
Squashing these into a small number of logical commits — or even a single clean commit — would keep git log tidy and consistent with the project's commit conventions. Just a suggestion!
Other than these, everything looks good from my side. Thanks again for the great contribution!
devops/build/automation/cloudberry/scripts/configure-cloudberry.sh
@tuhaihe I restored the config file here (the one I was editing manually): f26c287. We'll regenerate it cleanly from configure.ac in the next PRs, I guess. I think we're now good to merge. The only question is: squash or merge? This PR has quite a history of commits, so it might be useful to preserve it. But throughout our work on this extension we've been making fix-up commits, and the history should be cleaned up and grouped before the merge. I don't think preserving the commit history is critically important; the cleanest and easiest approach would be to squash all the commits into one and merge. What do you think? *Also, give me some time to write a complete commit message if we squash it.
Got it, thanks. Yes, we can regenerate it later, and you're welcome to create a new PR to update it then. BTW, my PR on generating the new
I prefer to keep the historical commits, which will be important for potential contributors to learn more about the context. We just need some time to tidy them up (my usual approach is squashing simple, related consecutive commits into one and rewording the commit title/message body) to make the commits look more logical, as suggested by @my-ship-it. But it looks good to me for now. FYI. @NJrslv
I reset this commit (f26c287) in the force push above because autoconf is not run during the configure-cloudberry.sh script. As a result, the extension was not built, so the tests failed. The ideal approach would be to make the change in configure.ac, regenerate configure, and push both files. However, since we agreed to regenerate configure later, the only thing that matters for now is configure.ac. I left the configure file as it was to ensure the tests pass. Now I'm tidying up the commit history. *I will approve the review once I am ready.
+1. I grouped the commits. We'd like to merge or rebase-and-merge, with a plain merge preferred. However, merging main into the current branch might cause a criss-cross merge later.
Folks, thanks all for the review. We fixed a couple of problems during it that we hadn't discovered previously.
Merged! Thanks @NJrslv @Smyatkin-Maxim for your great work again! I'd also like to share an experience regarding integrating a long-standing project (let's call it Project A) into the current repository, for future reference.

Initially, Project A might have been maintained in a separate repository with a directory structure that doesn't align with the current project. During the rebase process, conflicts can arise between Project A's commit history and the current directory files, because rebase replays each of Project A's past commits one by one. Therefore, when introducing similar projects into the current repository in the future, it's best to first add a target directory prefix to Project A, and then import its commit history into this new target directory. This way, the replayed commit history will only touch the target directory, preventing conflicts with the main repository's top-level directories.

This is an experience I gained while merging diskquota, and I wanted to share it for your reference.
A diagnostic and monitoring extension for Cloudberry clusters
Briefly, the interaction scheme is:
Cloudberry (runs Executor hooks) -> hooks collector extension (creates protobuf messages and sends them over UDS) -> local yagpcc agent

This extension is the part that sends data from the PG instance to an external agent.
The external agent receives the telemetry, then stores, aggregates, and exposes it to consumers. The agent source code is at https://github.com/open-gpdb/yagpcc. We're going to propose adding it to the Apache Cloudberry infrastructure later, after fixing all the tests.
This is quite similar to the #1085 idea, but with a different implementation: it is completely separate from the database application and written in Go, so it can be maintained separately. It has no influence on the main database, and therefore different development requirements.
The extension has pg_regress tests; they were copied from pg_stat_statements (PG18) and adjusted to the extension's behaviour.
The original extension authors are Smyatkin-Maxim and NJrslv. I left their commits as-is to preserve blame history in the committed source.