
Add new extension yagp hooks collector #1629

Merged

tuhaihe merged 49 commits into apache:main from open-gpdb:mergeYagpHooks
Apr 2, 2026

Conversation

@leborchuk
Contributor

@leborchuk leborchuk commented Mar 18, 2026

A diagnostic and monitoring extension for Cloudberry clusters

Briefly, the interaction scheme is:

Cloudberry (Run Executor hooks) -> hooks collector extension (Create protobuf messages and send them using UDS) -> local yagpcc agent

This is the part that sends data from the PG instance to an external agent.
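
The transport step described above can be sketched roughly as follows: a minimal, self-contained C++ example that length-prefixes a serialized payload and writes it to an already-connected Unix domain socket descriptor. The 4-byte framing convention and the function names are assumptions for illustration only, not the extension's actual wire protocol.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>
#include <unistd.h>

// Frame a serialized protobuf payload with a 4-byte native-endian length
// prefix, a common convention for delimiting messages on a stream socket.
// (This framing format is an assumption, not the extension's real protocol.)
std::vector<uint8_t> frame_message(const std::string &payload) {
    uint32_t len = static_cast<uint32_t>(payload.size());
    std::vector<uint8_t> out(sizeof(len) + payload.size());
    std::memcpy(out.data(), &len, sizeof(len));
    std::memcpy(out.data() + sizeof(len), payload.data(), payload.size());
    return out;
}

// Send one framed message over a connected Unix domain socket fd.  Returns
// false on any failed write so a caller can bump a failure counter instead
// of raising an error inside the query path.
bool send_framed(int fd, const std::string &payload) {
    std::vector<uint8_t> buf = frame_message(payload);
    size_t sent = 0;
    while (sent < buf.size()) {
        ssize_t n = write(fd, buf.data() + sent, buf.size() - sent);
        if (n <= 0)
            return false;
        sent += static_cast<size_t>(n);
    }
    return true;
}
```

Failing softly here matters: a monitoring extension should degrade to counting errors rather than breaking the query it is observing.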

The external agent receives the telemetry, then stores, aggregates, and exposes it to consumers. The agent source code is at https://github.com/open-gpdb/yagpcc. We're going to propose adding it to the Apache Cloudberry infrastructure later, after fixing all the tests.

This is quite similar to the idea in #1085, but with a different implementation. The agent is completely separate from the database application and written in Go, so it can be maintained independently, has no impact on the main database, and is therefore subject to different development requirements.

The extension has pg_regress tests; they were copied from pg_stat_statements (PG18) and adjusted to the extension's behaviour.

The original extension authors are Smyatkin-Maxim and NJrslv. I left their commits as-is to preserve blame history for the committed source.

Smyatkin-Maxim and others added 30 commits March 23, 2023 15:15
Add yagp_hooks_collector, a shared-preload module that hooks into
ExecutorStart and ExecutorFinish to capture query lifecycle events.
Includes Makefile with protobuf code generation, GRPC-based delivery,
QueryInfo generation (query text, plan text, query_id, plan_id,
session metadata), and basic protobuf message filling.
Guard against NULL plan state when generating EXPLAIN output.
… normalized texts

Collect spill info (file count, bytes written).  Generate normalized
query and plan texts using a pg_stat_statements-derived parser.
Collect buffer I/O counters, tuple counts, timing, and /proc/self
CPU/memory/IO statistics.
…ility

Use query_info_collect_hook for finer-grained lifecycle tracking.
Fix two segfaults in early init paths.  Skip hooks in UTILITY mode.
General robustness improvements.
…ion GUCs

Add missing Greenplum node types to pg_stat_statements parser.  Move
stats reporting to ExecutorEnd hook.  Improve GRPC failure handling.
Track CDB-specific metrics and initial query nesting level.  Add
resource group collection.  Add GUCs for controlling collection.
Skip nested and utility statements by default.
…safety

Capture /proc stats at query start and compute diff at end rather than
reporting lifetime totals.  Suppress error rethrows from the collector
to avoid breaking other extensions.  Add missing hooks deinitialization.
Modernize ereport style.
…ocesses

Delay initialization of static singletons and GRPC connections to
actual query handling time rather than _PG_init, since both are
incompatible with fork().
Mute PG-destined signals in GRPC reconnection thread.  Move debian
config to CI.  Redirect debug output to log file.  Harden memory
handling.  Remove thread-unsafe logging and dead code.
Add a comma-separated GUC to suppress metrics collection for specified
roles.  Parse using SplitIdentifierString and cache in an unordered_set.
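The role-suppression commit above could look roughly like the following sketch: parsing a comma-separated GUC value into an unordered_set so the per-query check is a single hash lookup. The core code uses SplitIdentifierString(); std::getline is a stand-in here so the example stays self-contained, and the function names are hypothetical.

```cpp
#include <sstream>
#include <string>
#include <unordered_set>

// Parse a comma-separated role list (e.g. the value of a suppression GUC)
// into a set.  std::getline replaces SplitIdentifierString() for the sketch.
std::unordered_set<std::string> parse_role_list(const std::string &guc_value) {
    std::unordered_set<std::string> roles;
    std::stringstream ss(guc_value);
    std::string item;
    while (std::getline(ss, item, ',')) {
        // Trim whitespace left over from ", " separators.
        size_t b = item.find_first_not_of(" \t");
        size_t e = item.find_last_not_of(" \t");
        if (b != std::string::npos)
            roles.insert(item.substr(b, e - b + 1));
    }
    return roles;
}

// The per-query check then reduces to one O(1) lookup.
bool is_suppressed(const std::unordered_set<std::string> &roles,
                   const std::string &current_role) {
    return roles.count(current_role) != 0;
}
```

Caching the parsed set (rather than re-splitting the GUC string per query) keeps the hot path cheap.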
Remove GRPC dependency.  Serialize metrics as protobuf messages and
deliver them over a Unix domain socket.  Replace server-side message
queue with incremental per-query message building.  Add clang-format
configuration.  Use deprecated protobuf API for bionic compatibility.
Add SQL functions stat_messages() and stat_messages_reset()
exposing per-segment UDS transport counters: total_messages,
send_failures, connection_failures, other_errors, max_message_size.
Move query message cleanup to the correct lifecycle point.  Finalize
fields before sending DONE event.  Fix protobuf message memory leaks.
Use core query_id from Query instead of a separate hash.  Resolve
resource group from the current session rather than the role default.
Track query nesting level using a per-query key (tmid, ssid, ccnt,
nesting_level, query_desc_addr).  Maintain a state machine per active
query to correctly sequence submit→start→end→done across nesting
boundaries.
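The per-query state machine described in the commit above can be illustrated with a minimal sketch. The real extension keys one machine per query by (tmid, ssid, ccnt, nesting_level, query_desc_addr); here a single machine shows the transition rule that keeps events ordered. Names and structure are assumptions for illustration.

```cpp
// Per-query lifecycle states, sequenced submit -> start -> end -> done.
enum class QueryState { Idle, Submitted, Started, Ended, Done };

struct QueryStateMachine {
    QueryState state = QueryState::Idle;

    // Advance only along the legal sequence; out-of-order events are
    // rejected, which is how nested-query boundaries are kept from
    // corrupting the event stream.
    bool advance(QueryState next) {
        bool ok =
            (state == QueryState::Idle      && next == QueryState::Submitted) ||
            (state == QueryState::Submitted && next == QueryState::Started)   ||
            (state == QueryState::Started   && next == QueryState::Ended)     ||
            (state == QueryState::Ended     && next == QueryState::Done);
        if (ok)
            state = next;
        return ok;
    }
};
```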
Trim query text and plan text to max_text_size and
max_plan_size limits.
Capture elog error message at the done event for ERROR and CANCELED
statuses.  Properly send accumulated runtime metrics before teardown.
Drop the intermediate CANCELLING event.
Take an initial metrics snapshot at submit so incremental stats are
computed as deltas.  Required for correct per-query accounting with
nested statements.
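The snapshot-and-delta approach above amounts to subtracting cumulative counters read at submit from the same counters read at the done event. A minimal sketch, with hypothetical field names standing in for whatever /proc/self values the extension actually tracks:

```cpp
#include <cstdint>

// Cumulative counters sampled from /proc/self at one point in time.
// Field names here are illustrative assumptions.
struct ProcSnapshot {
    uint64_t cpu_ticks = 0;
    uint64_t read_bytes = 0;
    uint64_t write_bytes = 0;
};

// Per-query cost = counters at done minus counters at submit, so nested
// statements don't inherit the process's lifetime totals.
ProcSnapshot snapshot_diff(const ProcSnapshot &start, const ProcSnapshot &end) {
    ProcSnapshot d;
    d.cpu_ticks   = end.cpu_ticks   - start.cpu_ticks;
    d.read_bytes  = end.read_bytes  - start.read_bytes;
    d.write_bytes = end.write_bytes - start.write_bytes;
    return d;
}
```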
Don't normalize trimmed plans.  Clean up stale text fields between
events.  Report nested queries only from dispatcher.  Add slice_id.
Aggregate inherited_calls and inherited_time on segments.
Factor out ProtoUtils, ProcStats, and PgUtils from EventSender.
Hook into ic_teardown to collect UDP-IFC packet-level counters.
Compile-time gated behind IC_TEARDOWN_HOOK.
Fix UB in strcpy.  General code refactoring.
@NJrslv NJrslv force-pushed the mergeYagpHooks branch 2 times, most recently from 4d1ce48 to 105247a on March 26, 2026 07:51
Contributor

@Smyatkin-Maxim Smyatkin-Maxim left a comment

LGTM, though I'm kind of an interested party myself - so feel free to ignore my ship :)

It's been widely used in all our production gp6 clusters, and in terms of production-readiness we're quite confident in it; it's way better than gpperfmon. I guess as a contrib package it should be good enough for Cloudberry as well.

@tuhaihe
Member

tuhaihe commented Mar 28, 2026

> LGTM, though I'm kind of an interested party myself - so feel free to ignore my ship :)
>
> It's been widely used in all our production gp6 clusters, and in terms of production-readiness we're quite confident in it; it's way better than gpperfmon. I guess as a contrib package it should be good enough for Cloudberry as well.

Great! I believe so too. I would like to test on a local machine and give my +1 early next week. Does that sound good?

@tuhaihe
Member

tuhaihe commented Mar 30, 2026

Hi, I did a test on a local machine. After building the extension successfully, I noticed that some files were deleted when running make distclean:

[gpadmin@cdw cloudberry]$ git status
On branch mergeYagpHooks
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)

        deleted:    gpcontrib/gp_stats_collector/results/gpsc_cursors.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_dist.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_guc_cache.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_locale.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_select.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_uds.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_utf8_trim.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_utility.out

Do we need to remove or keep them in the source? FYI.

NJrslv added 2 commits March 31, 2026 08:34
Enable building gp_stats_collector by default in configure.  Add
missing check in verify_query() to ensure the extension does not
execute main code while disabled.  Always verify protobuf version
once the shared library is preloaded.
Delete stale .gitignore.  Add Apache headers to .proto files.
Change #pragma once to #ifndef guards.  Remove test result files
from tree.  Change ereport(FATAL) to ereport(ERROR).  Remove
internal naming suffixes.  Apply clang-format from gporca.
NJrslv added 2 commits March 31, 2026 14:15
Add PG_TRY/PG_CATCH around query_info_collect_hook in PortalCleanup
error path to prevent exceptions from propagating during cleanup.
Rename ON MASTER to ON COORDINATOR in test SQL.
Prefer pg_usleep() over std::this_thread::sleep_for().
Add pg_unreachable() after ereport(ERROR).
Widen motion stats fields to uint64.
Contributor

@my-ship-it my-ship-it left a comment

LGTM.

Great work on this extension — I can see a lot of effort went into it, and the earlier review feedback has been well addressed.

Before merging, it might be nice to squash or tidy up the commit history a bit. The current log includes quite a few incremental fix-up commits from the review process.

Squashing these into a small number of logical commits — or even a single clean commit — would keep git log tidy and consistent with the project's commit conventions. Just a suggestion!

Other than that, everything looks good from my side. Thanks again for the great contribution!

@NJrslv
Contributor

NJrslv commented Apr 1, 2026

@tuhaihe
Dianjin, Hi!

I restored the config file here (the one I was editing manually): f26c287. We'll regenerate it cleanly from configure.ac in the next PRs, I guess.

I think we're now good to merge.

The only question is: squash or merge? This PR has quite a long commit history, so it might be useful to preserve it. But throughout our work on this extension we've been making fix-up commits, and the history should be cleaned up and grouped before the merge.

I don't think preserving the commit history is extremely important. The cleanest and easiest approach would be to squash all the commits into one and merge. What do you think?

*Also, give me some time to write a complete commit message, if we squash it.

@tuhaihe
Member

tuhaihe commented Apr 1, 2026

> @tuhaihe Dianjin, Hi!
>
> I restored the config file here (the one I was editing manually): f26c287. We'll regenerate it cleanly from configure.ac in the next PRs, I guess.

Got it, thanks. Yes, we can regenerate it later, and welcome to create a new PR to update it then. BTW, my PR on generating the new configure file is still in review: #1651.

> The only question is: squash or merge? This PR has quite a long commit history, so it might be useful to preserve it. But throughout our work on this extension we've been making fix-up commits, and the history should be cleaned up and grouped before the merge.
>
> I don't think preserving the commit history is extremely important. The cleanest and easiest approach would be to squash all the commits into one and merge. What do you think?

I prefer to keep the history commits, which will be important for potential contributors to learn more about the context. We just need some time to tidy them up (my usual approach is to squash simple, related consecutive commits into one and reword the commit title/message body) and make the commits look more logical, as suggested by @my-ship-it. But it looks good to me for now. FYI @NJrslv.

@NJrslv
Contributor

NJrslv commented Apr 1, 2026

I reset this commit (f26c287) in the force push above because autoconf is not run during the configure-cloudberry.sh script. As a result, the extension was not built, so the tests failed.

The ideal case would be to make the change in configure.ac, regenerate configure, and push both files. However, since we agreed to regenerate configure later, the only thing that matters for now is configure.ac.

I left the configure file as it was to ensure the tests pass.

Now I'm tidying up the commit history.

*I will approve the review once I am ready.

Contributor

@NJrslv NJrslv left a comment

+1. I grouped the commits. We'd like either a merge or a rebase-and-merge, with a plain merge preferred. However, merging main into the current branch might cause a criss-cross merge later.

Folks, thanks all for the review. We fixed a couple of problems during it that we hadn't discovered previously.

@tuhaihe tuhaihe merged commit af533b2 into apache:main Apr 2, 2026
78 of 79 checks passed
@tuhaihe
Member

tuhaihe commented Apr 2, 2026

> +1. I grouped the commits. We'd like either a merge or a rebase-and-merge, with a plain merge preferred. However, merging main into the current branch might cause a criss-cross merge later.
>
> Folks, thanks all for the review. We fixed a couple of problems during it that we hadn't discovered previously.

Merged! Thanks @NJrslv @Smyatkin-Maxim for your great work again!

I'd also like to share an experience regarding integrating a long-standing project (let's call it Project A) into the current repository, for future reference. Initially, Project A might have been maintained in a separate repository with a directory structure that doesn't align with the current project. During the rebase process, conflicts can arise between Project A's commit history and the current directory files. This is because rebase replays each of Project A's past commits one by one.

Therefore, when introducing similar projects into the current repository in the future, it's best to first add a target directory prefix to Project A, and then import its commit history into this new target directory. This way, the replayed commit history will only touch the target directory, preventing conflicts with the main repository's top-level directories. This is an experience I gained while merging diskquota, and I wanted to share it for your reference.

@NJrslv NJrslv deleted the mergeYagpHooks branch April 2, 2026 10:32