Add new extension yagp_hooks_collector #1629
Conversation
Add yagp_hooks_collector, a shared-preload module that hooks into ExecutorStart and ExecutorFinish to capture query lifecycle events. Includes Makefile with protobuf code generation, GRPC-based delivery, QueryInfo generation (query text, plan text, query_id, plan_id, session metadata), and basic protobuf message filling.
Guard against NULL plan state when generating EXPLAIN output.
… normalized texts Collect spill info (file count, bytes written). Generate normalized query and plan texts using a pg_stat_statements-derived parser. Collect buffer I/O counters, tuple counts, timing, and /proc/self CPU/memory/IO statistics.
…ility Use query_info_collect_hook for finer-grained lifecycle tracking. Fix two segfaults in early init paths. Skip hooks in UTILITY mode. General robustness improvements.
…ion GUCs Add missing Greenplum node types to pg_stat_statements parser. Move stats reporting to ExecutorEnd hook. Improve GRPC failure handling. Track CDB-specific metrics and initial query nesting level. Add resource group collection. Add GUCs for controlling collection. Skip nested and utility statements by default.
…safety Capture /proc stats at query start and compute diff at end rather than reporting lifetime totals. Suppress error rethrows from the collector to avoid breaking other extensions. Add missing hooks deinitialization. Modernize ereport style.
…ocesses Delay initialization of static singletons and GRPC connections to actual query handling time rather than _PG_init, since both are incompatible with fork().
Mute PG-destined signals in GRPC reconnection thread. Move debian config to CI. Redirect debug output to log file. Harden memory handling. Remove thread-unsafe logging and dead code.
Add a comma-separated GUC to suppress metrics collection for specified roles. Parse using SplitIdentifierString and cache in an unordered_set.
Remove GRPC dependency. Serialize metrics as protobuf messages and deliver them over a Unix domain socket. Replace server-side message queue with incremental per-query message building. Add clang-format configuration. Use deprecated protobuf API for bionic compatibility.
Add SQL functions stat_messages() and stat_messages_reset() exposing per-segment UDS transport counters: total_messages, send_failures, connection_failures, other_errors, max_message_size.
Move query message cleanup to the correct lifecycle point. Finalize fields before sending DONE event. Fix protobuf message memory leaks.
Use core query_id from Query instead of a separate hash. Resolve resource group from the current session rather than the role default.
Track query nesting level using a per-query key (tmid, ssid, ccnt, nesting_level, query_desc_addr). Maintain a state machine per active query to correctly sequence submit→start→end→done across nesting boundaries.
Trim query text and plan text to max_text_size and max_plan_size limits.
Capture elog error message at the done event for ERROR and CANCELED statuses. Properly send accumulated runtime metrics before teardown. Drop the intermediate CANCELLING event.
Take an initial metrics snapshot at submit so incremental stats are computed as deltas. Required for correct per-query accounting with nested statements.
Don't normalize trimmed plans. Clean up stale text fields between events. Report nested queries only from dispatcher. Add slice_id. Aggregate inherited_calls and inherited_time on segments.
Factor out ProtoUtils, ProcStats, and PgUtils from EventSender.
Hook into ic_teardown to collect UDP-IFC packet-level counters. Compile-time gated behind IC_TEARDOWN_HOOK.
Fix UB in strcpy. General code refactoring.
Smyatkin-Maxim
left a comment
LGTM, though I'm kind of an interested party myself - so feel free to ignore my ship-it :)
It's been widely used in all our production gp6 clusters, and in terms of production-readiness we're quite confident in it; it's way better than gpperfmon. I guess as a contrib package it should be good enough for Cloudberry as well.
Great! I believe so too. I would like to test it on a local machine and give my +1 early next week. Does that sound good?
Hi, I did a test on a local machine. After building the extension successfully, it showed that some files are deleted when running. Do we need to remove them or keep them in the source? FYI.
Enable building gp_stats_collector by default in configure. Add missing check in verify_query() to ensure the extension does not execute main code while disabled. Always verify protobuf version once the shared library is preloaded.
Delete stale .gitignore. Add Apache headers to .proto files. Change #pragma once to #ifndef guards. Remove test result files from tree. Change ereport(FATAL) to ereport(ERROR). Remove internal naming suffixes. Apply clang-format from gporca.
Add PG_TRY/PG_CATCH around query_info_collect_hook in PortalCleanup error path to prevent exceptions from propagating during cleanup.
Rename ON MASTER to ON COORDINATOR in test SQL. Prefer pg_usleep() over std::this_thread::sleep_for(). Add pg_unreachable() after ereport(ERROR). Widen motion stats fields to uint64.
my-ship-it
left a comment
LGTM.
Great work on this extension — I can see a lot of effort went into it, and the earlier
review feedback has been well addressed.
Before merging, it might be nice to squash or tidy up the commit history a bit. The current log includes quite a few incremental fix-up commits from the review process.
Squashing these into a small number of logical commits — or even a single clean commit — would keep git log tidy and consistent with the project's commit conventions. Just a suggestion!
Other than these, everything looks good from my side. Thanks again for the great contribution!
devops/build/automation/cloudberry/scripts/configure-cloudberry.sh
@tuhaihe I restored the config file here (the one I was editing manually): f26c287. We'll regenerate it cleanly from configure.ac in the next PRs, I guess. I think we're now good to merge. The only question is: squash or merge? This PR has quite a history of commits, so it might be useful to preserve it. But throughout our work on this extension we've been making fix-up commits, and the history should be cleaned up and grouped before the merge. I don't think preserving the commit history is critically important; the cleanest and easiest approach would be to squash all the commits into one and merge. What do you think? *Also, give me some time to write a complete commit message if we squash it.
Got it, thanks. Yes, we can regenerate it later, and you're welcome to create a new PR to update it then. BTW, my PR on generating the new
I prefer to keep the historical commits, which will be important for potential contributors to learn more about the context. We just need some time to tidy them up (my usual approach is squashing simple, related consecutive commits into one and rewording the commit title/message body) to make the commits look more logical, as suggested by @my-ship-it. But it looks good to me for now. FYI. @NJrslv
I reset this commit (f26c287) in the force push above because autoconf is not run during the configure-cloudberry.sh script. As a result, the extension was not built, so the tests failed. The ideal approach would be to make the change in configure.ac, regenerate configure, and push both files. However, since we agreed to regenerate configure later, the only thing that matters for now is configure.ac. I left the configure file as it was to ensure the tests pass. Now I'm tidying up the commit history. *I will approve the review once I am ready.
+1. I grouped the commits. We'd like to merge or rebase-and-merge, with a plain merge preferred. However, merging main into the current branch might cause a criss-cross merge later.
Folks, thanks all for the review. We fixed a couple of problems during it that we hadn't discovered previously.
Merged! Thanks @NJrslv @Smyatkin-Maxim for your great work again! I'd also like to share an experience regarding integrating a long-standing project (let's call it Project A) into the current repository, for future reference.

Initially, Project A might have been maintained in a separate repository with a directory structure that doesn't align with the current project. During the rebase process, conflicts can arise between Project A's commit history and the current directory files, because rebase replays each of Project A's past commits one by one. Therefore, when introducing similar projects into the current repository in the future, it's best to first add a target directory prefix to Project A, and then import its commit history into this new target directory. This way, the replayed commit history will only touch the target directory, preventing conflicts with the main repository's top-level directories.

This is an experience I gained while merging diskquota, and I wanted to share it for your reference.
A diagnostic and monitoring extension for Cloudberry clusters
Briefly, the interaction scheme is:
Cloudberry (runs Executor hooks) -> hooks collector extension (creates protobuf messages and sends them over UDS) -> local yagpcc agent

This extension is the part that sends data from the PG instance to an external agent.
The external agent receives the telemetry, then stores, aggregates, and exposes it to consumers. The agent source code is at https://github.com/open-gpdb/yagpcc. We're going to propose adding it to the Apache Cloudberry infrastructure later, after fixing all the tests.
This is quite similar to the #1085 idea, but with a different implementation: it is completely separate from the database application and written in Go, so it can be maintained separately. It has no influence on the main database, and therefore different development requirements.
The extension has pg_regress tests; they were copied from pg_stat_statements (PG18) and adjusted to the extension's behaviour.
The original extension authors are Smyatkin-Maxim and NJrslv. I left their commits as-is to preserve blame history in the committed source.