build: kill PROXYSQLGENAI flag, PROXYSQL40=1 builds all plugins#5814
build: kill PROXYSQLGENAI flag, PROXYSQL40=1 builds all plugins#5814renecannao wants to merge 35 commits into
Conversation
PR #5708 added detection for MariaDB's `SET STATEMENT … FOR …` syntax to keep ProxySQL from invoking `unable_to_parse_set_statement()` and locking the session to a single hostgroup. The detection at `lib/MySQL_Session.cpp` was: if (strncasecmp(nq.c_str(), "SET STATEMENT ", 14) == 0 && strcasestr(nq.c_str(), " FOR ")) { return false; } `strcasestr(nq, " FOR ")` is a literal-substring search requiring a space BOTH before and after the FOR keyword. A customer report (see PR #5708 thread) reproduced the same error 9006 as before the fix by typing the statement across multiple lines at the mysql client prompt — the character after FOR is then a newline, not a space, so the detection misses, the code falls through to the regular SET parser, the parser doesn't know SET STATEMENT, and the hostgroup is locked to the user's default. The next non-default-hostgroup query fails with 9006. Verified with a standalone strcasestr probe: q1 (customer multi-line): strcasestr(" FOR ") matches ? NO q2 (existing TAP single-line): YES The fix uses `dig` (the digest text) for the detection instead of `nq` (the raw query). `dig` is whitespace-normalised by the digest builder, so " FOR " always has a space on both sides regardless of how the client formatted the query. This is consistent with the prefix check 35 lines above, which already uses `dig` for keyword detection. == Test coverage == `test/tap/tests/test_set_statement_for-t.cpp` previously had `plan(5)` and self-skipped on non-MariaDB backends via `skip_all()`. It was also registered ONLY in MySQL-family TAP groups (`legacy-g2`, `mysql84-g2`, etc.). The net effect was that the test self-skipped on every CI run since it was added — never actually executing an assertion in CI. The customer's bug regressed entirely unnoticed. This commit: 1. Removes the runtime MariaDB-detection helper and the `skip_all` call. A misregistered test should fail loudly, not silently skip; correct registration is enforced through the groups.json layer instead. 2. Moves the groups.json registration from the MySQL-only groups to `mariadb10-galera-g2`, which actually provides a MariaDB backend. 3. Adds four new test cases (assertions 6–13) covering the customer whitespace shapes that strcasestr (" FOR ") missed: newline, tab, CRLF, and multiple spaces between FOR and the inner statement. Each case verifies (a) the SET STATEMENT succeeds, and (b) a follow-up plain SELECT succeeds (i.e. no hostgroup lock persists — the customer's actual failure mode). Verified locally on TAP group mariadb10-galera-g2 against MariaDB 10.11.9: 13/13 ok, RC=0. A separate framework-cleanup follow-up should remove `skip_all()` from `test/tap/tap/tap.{h,cpp}` entirely and convert the other four callers (`test_mysqlx_e2e_*`, `test_mysqlx_route_drop_inflight-t`, `test_mysqlx_sigterm_inflight-t`) to either fail loudly or move to appropriately-gated TAP groups.
…ions Stage 2 follow-up to the PR #5708 reproduction debug. The framework helper test/tap/tap/tap.{h,cpp}::skip_all() prints "1..0 # skip ..." and exit(0), which the harness reports as "ignored" — zero assertions emitted, zero failures recorded, no notice that a test never ran. While investigating the customer regression on PR #5708 we discovered that test_set_statement_for-t had been silently skipping in CI since the day it was added: it self-skipped on non-MariaDB backends, but was registered only in MySQL-family TAP groups, so every run skipped. PR #5794 fixes that one test. The principled fix is to remove the mechanism that lets a test silently disappear. This commit converts the remaining four skip_all() callers and deletes the framework function: - test_mysqlx_e2e_routing-t BAIL_OUT if MYSQLX_E2E_PROXYSQL_PORT is not set. The group env file test/tap/groups/mysqlx-e2e/env.sh sets it; missing at runtime means the harness did not source it. - test_mysqlx_e2e_handshake-t BAIL_OUT if MYSQLX_E2E_HOST is not set (same env file, same rationale). - test_mysqlx_route_drop_inflight-t BAIL_OUT if the X-Protocol listener is not reachable. The mysqlx-soak-g1 harness is supposed to start ProxySQL with the mysqlx plugin and provision the route; a missing listener is a setup bug. - test_mysqlx_sigterm_inflight-t Deleted. The binary was a stub whose body was a TODO; the real sigterm scenario is implemented in test/scripts/mysqlx/behavioral_validation.py --scenario sigterm. With skip_all gone the stub had no remaining purpose. Also dropped from groups.json and the Makefile. After this commit, the test/tap codebase has zero call sites of skip_all(). The framework declaration and definition are deleted from tap.h and tap.cpp; the doc references at the bottom of tap.cpp are updated; a comment explains the rationale and points at BAIL_OUT() as the replacement. Verified locally: full TAP debug build succeeds, no undefined-reference errors, lint_groups_json + check_groups --source both green. Depends on PR #5794 (which removed the fifth caller in test_set_statement_for-t.cpp). This PR is based on that branch; merging in order keeps each merge a clean compile.
…verage is cancelled The previous pattern opened a LouisBrunner check_run with status=in_progress at job start and closed it via check_id at job end with if: always(). When the runner is killed mid-job (concurrency:cancel-in-progress, OOM, infra shutdown), the closing step never executes and the check_run is left in in_progress forever, blocking the PR rollup. Collapse to a single LouisBrunner call at the end: pass name: instead of check_id: so the action creates the check_run with a terminal conclusion in one shot. If the job is cancelled before the closing step runs, no check_run is ever created (PR rollup shows "missing" — recoverable via re-run) instead of "stuck" — unrecoverable without GitHub App auth. Refs: stuck check_run 77381366265 on PR #5752 (sha cf0b1d3) from 2026-05-22 12:39 UTC. Same pattern exists across ~163 other CI-* workflows; fixing only the workflow with the observed incident here. Sweep of the remaining files belongs in a separate PR.
test(framework): remove skip_all() — fail loudly on missing preconditions (Stage 2 follow-up to PR #5794)
…value list ParserSQL v1.0.2's PG SET parser had two bugs that were hidden in CI until PR #5760 fixed `test/infra/control/ensure-infras.bash` to actually dispatch `pre-proxysql.sql` hooks (previously only `.bash` hooks ran). With the harness fix in place, `set_parser_algorithm_3-g1` for the first time ran its tests under `pgsql-set_parser_algorithm=3`, surfacing: * `pgsql-set_parameter_validation_test-t` — `SET search_path TO "$user", public` parsed as `search_path = "$user"`, dropping every value after the first comma. * `pgsql-set_statement_test-t` — `SET TIME ZONE 'UTC'` parsed as `time = ZONE`, dropping the actual timezone value entirely. ## Upstream fix ParserSQL v1.0.3 (ProxySQL/ParserSQL PR #38) is the upstream parser fix: * PG `SET TIME ZONE <value>` is now recognised as the PG alias for `SET TimeZone = <value>`. The two-word keyword is detected via case-insensitive identifier lookahead since the tokenizer has no dedicated TK_TIME / TK_ZONE keyword. * PG `parse()` no longer shares MySQL's comma loop. In PG, the comma after the first value is value-list continuation — each subsequent expression is appended as another child of the same NODE_VAR_ASSIGNMENT node, alongside the first RHS. Full upstream test suite: 1223/1223 pass; 0 regressions. ## ProxySQL-side changes * `deps/parsersql/parsersql-1.0.3.tar.gz` — new tarball, replaces 1.0.2. * `deps/parsersql/parsersql` symlink retargeted to `parsersql-1.0.3`. * `lib/Query_Processor_ParserSQL.cpp::walk_set_stmt` — the PG multi-value fix means a single NODE_VAR_ASSIGNMENT can have multiple RHS siblings. The adapter now iterates `target->next_sibling` chain instead of reading a single sibling, so the `map<string, vector<string>>` for `SET search_path TO 'a', 'b', 'c'` carries `["a", "b", "c"]` rather than `["a"]`. Fully backward-compatible for MySQL (always one RHS). * `test/tap/tests/setparser_parsersql_test.cpp` — adds four permanent regression-test groups exercising the library + adapter end-to-end without a live backend: * `mysql_filtered_set` (7 cases) — verifies the filtered-SET variables parse cleanly so the downstream filter logic gets the input it expects. * `mysql_set_testing` (4 cases) — multi-variable comma-separated SET samples from `set_testing-240.csv`. * `pgsql_search_path` (8 cases) — single-value and multi-value shapes from `pgsql-set_parameter_validation_test-t`. * `pgsql_time_zone` (4 cases) — every `SET TIME ZONE ...` shape exercised by `pgsql-set_statement_test-t`. The 4 new assertions for the `INTERVAL '7' HOUR` case capture today's actual ParserSQL output (`timezone = [INTERVAL]`) — the expression parser does not yet consume the full interval modifier chain. The test reflects observable behaviour so it stays green; tightening to capture the full interval value is a future ExpressionParser improvement. ## Test plan - [x] Local rebuild of ProxySQL from the new 1.0.3 tarball: clean. - [x] `./setparser_parsersql_test`: 46 new assertions, all pass. - [x] No regressions in the 256 pre-existing assertions. The same 10 pre-existing failures (sql_mode 23-25, Set1_v2 0-1) — unrelated to this work, present before the upgrade. - [ ] CI on this branch: TAP groups `legacy-g2`, `mysql84-g2`, and `set_parser_algorithm_3-g1` are expected to flip green for the ParserSQL-related failures. Two non-ParserSQL bugs on the algorithm=3 ProxySQL session-dispatch path (`test_filtered_set_statements-t`, `set_testing-t`) remain to be investigated separately — the adapter now returns the correct maps, but downstream dispatch still misbehaves. ## Dependency Should not be merged before ProxySQL/ParserSQL PR #38 is merged and a v1.0.3 tag is cut. The vendored tarball in this commit was built from that PR's branch head; once the upstream tag exists the tarball SHA should be identical to a fresh `git archive --prefix=ParserSQL-1.0.3/ v1.0.3 | gzip` from the upstream tag.
…kends (#5790) Commit 7c6f42b ("fix: MySQL 9.x charset handling and log_last_insert_id test race") changed the version guard in lib/MySQL_Variables.cpp from testing the first character of server_version to using atoi(): - if (charset >= 255 && myconn->mysql->server_version[0] != '8') { + if (charset >= 255 && atoi(myconn->mysql->server_version) < 8) { The new form correctly extends the "supports collation id 255" treatment to MySQL 9.x and later, but it silently broke MariaDB: server_version old check new check ---------------------------- --------- --------- "5.7.x" replace replace "8.0.x" skip skip "9.x.y" replace[*] skip "10.11.13-MariaDB" replace skip <-- BUG "11.x.x-MariaDB" replace skip <-- BUG [*] wrong for MySQL 9.x; that is the case the original commit fixed. MariaDB does not implement utf8mb4_0900_ai_ci (collation id 255), regardless of its numeric major version. With the regression, ProxySQL no longer rewrites the unsupported collation before sending SET NAMES to the MariaDB backend, and the backend rejects it with: ERROR 1273 (HY000): Unknown collation: 'utf8mb4_0900_ai_ci' …breaking clients such as Dovecot that pick collation 255 from libmariadb's charset table at connect time. Fix: also treat MariaDB (detected by the "MariaDB" suffix in server_version) as a backend that cannot accept collation id >= 255, so handle_unknown_charset replacement applies there as well. The regression is present on v3.0 HEAD (and on the v3.1.8 / v4.0.8 tags), but NOT in the v3.0.7 / v3.0.8 release tags themselves; users hitting this in production are on post-v3.0.8 "v3.0-head" CI packages (`3.0.8-NN-gHASH`). Reported-by: @FoZo (#5790) Also-confirmed-by: @akrus Adds reg_test_5790-mariadb_collation_255-t, registered in mariadb10-galera-g1, which: 1. Skips when the backend is not MariaDB. 2. Sets mysql-handle_unknown_charset=1 and mysql-default_collation_connection=utf8mb4_general_ci. 3. Issues `SET NAMES utf8mb4 COLLATE utf8mb4_0900_ai_ci`. 4. Runs a SELECT that forces backend SET NAMES via handler_again___status_CHANGING_CHARSET. 5. Verifies the SELECT succeeds and that the backend's @@collation_connection is the configured default, not utf8mb4_0900_ai_ci. This test fails without the fix (errno 1273 from the backend) and passes with it.
…ehavior PR #5755 (commit e6bef5c) fixed `process_pg_typecast()` in `lib/pgsql_tokenizer.cpp` so that the trailing space after a PG typecast is preserved unless the next non-space character is a modifier `(` or array bracket `[`. The commit message documented this as an intentional behaviour change: "The new lookahead-conditional skip preserves the trailing space unless the very next non-space character is a modifier `(` or array bracket `[`." Three test cases in `tokenizer_payloads/pgsql_regular_tokenizer_digests.hjson` hardcoded expected digests against the OLD (buggy) tokenizer output, where the space-eating bug glued the preceding literal to the following operator. With the fix in place those expectations mismatch, making `pgsql-query_digests_stages_test-t` fail deterministically in the legacy-g4 TAP group on v3.0 HEAD: * `::json -> 'key'` STAGE 1: actual `select ? -> ?` vs old expected `select ?-> ?` * `::jsonb @> ...` STAGE 1-4: actual `select ? @> ?` vs old expected `select ?@> ?` * `::money + ...` STAGE 1: actual `select ? + ?` vs old expected `select ?+ ?` For `->` and `+`, only STAGE 1 changes — STAGE 2-4's existing whitespace-collapse step still glues `?->` and `?+` for those operators, so `s2`/`s3`/`s4` expectations stay as-is. For `@>`, none of the stages collapse the space, so all four expectations move. The new actual digest output is the more correct one — it mirrors the existing MySQL tokenizer's behaviour for the same shape, where `'literal' + 'literal'` already produces `? + ?` at STAGE 1. This commit only updates the fixture; no tokenizer or test source changes. Verified locally against the existing test binary (built from v3.0 HEAD, includes the new tokenizer behaviour): ./pgsql-query_digests_stages_test-t → 734/734 ok, RC=0 The 9 previously-failing assertions (#381, #385, #389-#392, #592, #596, #600) all flip to ok with the updated fixture.
When handler_WCD_SS_MCQ_qpo_LargePacket() is reached via the COM_STMT_EXECUTE path, MySQL_Protocol::get_binds_from_pkt() has already aliased stmt_meta->pkt to pkt->ptr (see the FIXME at MySQL_Protocol.cpp:2873). The subsequent RequestEnd() -> Query_Info::end() free()s stmt_meta->pkt (the fix introduced for bug #796 at MySQL_Session.cpp:455), and then LargePacket's own l_free(pkt->size, pkt->ptr) frees the same buffer a second time. With jemalloc this silently corrupts the arena and surfaces later as the SIGSEGV reported in #5639, with the crashing frame landing on the l_free call site (+0x10e into LargePacket). Under ASAN the double-free aborts deterministically with a backtrace whose two free() chains match the report exactly. Fix: capture whether CurrentQuery.stmt_meta aliases pkt before calling RequestEnd() (which clears stmt_meta), and skip the second free when the alias was present. For the COM_QUERY path through LargePacket, stmt_meta is always NULL, so the previous behaviour is preserved there. Also add reg_test_5639_stmt_execute_max_allowed_packet-t which exercises the STMT_EXECUTE > mysql-max_allowed_packet path and asserts the session and process survive.
DNS_Cache, DNS_Cache_Record, DNS_Resolve_Data, the resolver worker, and
the validate_ip / get_connected_peer_ip_from_socket / debug_iplisttostring
helpers used to live inside MySQL_Monitor.{hpp,cpp}. Move them to a new
DNS_Cache.{hpp,cpp} so the same plumbing can back an independent
PgSQL_Monitor cache.
DNS_Cache no longer references GloMyMon directly; counters are wired in
per-instance via set_counters(). monitor_dns_resolver_thread no longer
reads mysql_thread___resolution_family; the desired ai_family is now a
field on DNS_Resolve_Data set by the caller. MySQL_Monitor wires both up
in its constructor and resolver loop respectively; behavior is identical.
No change to the MySQL DNS cache semantics.
Fixes the watchdog asserts under DNS degradation reported in #5768. Symptom: with a PgSQL backend addressed by hostname, an unhealthy resolver made every PQconnectStart() block ~5s on getaddrinfo inside libpq. Under load, all PgSQL_Thread workers spent the bulk of each loop iteration in that synchronous DNS call, atomic_curtime fell behind the watchdog threshold (poll_timeout + 1s), and after restart_on_missing_heartbeats misses the watchdog asserts. MySQL avoids this because mysql-monitor's DNS cache lets MySQL_Connection::connect_start pass an IP directly to libmariadb; PgSQL had no equivalent. This change adds the equivalent for PgSQL, with state fully independent of the MySQL cache: * PgSQL_Monitor gains its own DNS_Cache, force_dns_cache_update flag, and dns_cache_{queried,lookup_success,record_updated} counters. PgSQL_Monitor::dns_lookup / update_dns_cache_from_pgsql_conn / trigger_dns_cache_update mirror the MySQL_Monitor surface. * A new background thread (PgSQL_monitor_dns_cache_pthread) drives PgSQL_Monitor::monitor_dns_cache(), which walks PgHGM->pgsql_servers_to_monitor and feeds the shared DNSResolverWorker pool. Launched alongside PgSQL_monitor_scheduler and lives independently of pgsql-monitor_enabled, matching MySQL. * PgSQL_Connection::connect_start now consults the cache. On a hit it appends `hostaddr=<ip>` to the libpq conninfo while keeping `host=<hostname>`; libpq documents this combo specifically to skip name resolution while preserving TLS hostname verification. On a miss / IP literal / disabled cache we fall back to the previous behavior (libpq does getaddrinfo). * Successful PgSQL connects seed the cache via update_dns_cache_from_pgsql_conn, so the second connect avoids getaddrinfo even before the resolver loop has visited the host. * The pgsql-monitor_local_dns_cache_ttl / _refresh_interval / _resolver_queue_maxsize variables (already accepted by Admin via PgSQL_Thread's variable registry) now actually propagate to pgsql_thread___monitor_local_dns_* globals. The propagation block in PgSQL_Thread::refresh_variables was previously commented out, so the values never reached anywhere. * The runtime stats output (SQL3_GlobalStatus) now exposes the PgSQL_Monitor_dns_cache_* counters alongside the existing PgSQL_Monitor_* counters. DNSResolverWorker is shared infrastructure added to DNS_Cache.hpp/cpp, used here by PgSQL_Monitor. MySQL_Monitor still uses its existing ConsumerThread-based pool; nothing about the MySQL cache changes.
The names were defined in PgSQL_Thread::variables but sat inside a commented-out block in pgsql_thread_variables_names, so Admin's has_variable() returned false and SET pgsql-monitor_local_dns_* was rejected with "Unknown global variable". Move them out of the /* ... */ so the variables actually show up in global_variables and runtime_global_variables and can be tuned at runtime. Verified end-to-end on a smoke run: SET ... LOAD PGSQL VARIABLES TO RUNTIME succeeds, the resolver loop populates the cache, and a client connect through the proxy increments dns_cache_queried / dns_cache_lookup_success (cache hit, libpq skipped getaddrinfo).
Adds pgsql-test_dns_cache-t mirroring the existing MySQL test_dns_cache
test, exercising 16 scenarios:
* pgsql-monitor_local_dns_cache_* variables propagate to runtime
* IP literal does not touch the cache
* Resolvable hostname populates dns_cache_record_updated
* Client connect through the proxy increments dns_cache_queried and
dns_cache_lookup_success on a cache hit
* Hostname with leading/trailing whitespace is trimmed before
resolution
* Unresolvable hostname is queried but produces no record (cache miss
bumps dns_cache_queried only)
* Removing pgsql_servers rows drops the orphaned cache record
* Cache disabled via refresh_interval=0 flatlines all counters
* MySQL DNS cache counters do not move in response to pgsql-side
activity (independence between the two caches)
While writing the test, step 4 (whitespace trim) caught a real bug:
PgSQL_Monitor::monitor_dns_cache was inserting hostnames verbatim from
the SQLite3_result into the resolver queue, so ' example.org ' was
handed to getaddrinfo() unchanged and failed. MySQL_Monitor sidesteps
this by reading via SELECT trim(hostname) FROM monitor_internal.*; the
pgsql variant reads directly from pgsql_servers_to_monitor so it needs
to call trim() in C++. Fix: trim before validate_ip / insert into
the hostnames set. The lookup path already trimmed, so the bookkeeper
key now matches.
Registered the test in groups.json under legacy-g4 and the mysql-*
variant g4 groups, matching how the other pgsql-* tests are grouped.
Fixes for actionable findings on PR #5806: DNS_Cache.cpp / hpp: - lookup(): initialize *ip_count = 0 on the disabled-cache fast path so callers don't observe a stale count from a previous lookup (CodeRabbit). - add_if_not_exist(): only bump counter_record_updated_ when an insert actually happened, and return that signal to callers. Previously a no-op call inflated the cache-update stats (CodeRabbit). - get_connected_peer_ip_from_socket(): drop the malloc, switch to a stack sockaddr_storage, and only assign result on AF_INET / AF_INET6 with a successful inet_ntop. Previously, on an unknown address family, we'd copy the uninitialized ip_addr buffer into result (Gemini + CodeRabbit). - monitor_dns_resolver_thread(): cache_ttl is signed int and at the max configured TTL (7*24*3600*1000 ms) `1000 * cache_ttl` overflows before being added to monotonic_time(). Force the multiply into unsigned long long space with 1000ULL (Gemini). - DNS_Cache::IP_ADDR::counter is now mutable so get_next_ip() (const) can __sync_fetch_and_add it without a const_cast (Gemini). PgSQL_Monitor.cpp: - monitor_dns_cache(): refresh pgsql_thread___monitor_local_dns_* before the loop, not just on the first version-bump. Otherwise the resolver runs its first pass against zero-initialized TTL / refresh interval and starts in the wrong enabled/disabled state until a later config bump (CodeRabbit). test/tap/tests/pgsql-test_dns_cache-t.cpp: - Step 7 ("Cache disabled by refresh_interval=0") was inserting an IP literal (0.0.0.0), which bypasses DNS regardless of the cache state. Switched to a resolvable hostname (example.com) so a regression in the cache-disable path would actually move counters and fail the assertion. Added a third assertion on dns_cache_lookup_success for symmetry. Plan adjusted to 17. Deliberately skipped: - CR comment about mysql_thread___resolution_family change requiring a cache flush: pre-existing MySQL behavior, out of scope for this PR. - CR comment about pthread_attr_init / setstacksize return checks in main.cpp: matches the existing un-checked pattern in MySQL_Monitor::run(). No need to diverge here. - Gemini comment about persistent thread pool vs. per-iteration spawn: MySQL_Monitor::monitor_dns_cache() also recreates the resolver threads each loop iteration; the PgSQL variant follows the same pattern. - Gemini comment about 2048-byte stack size: Thread::start takes ss in KB (lib/thread.cpp:46 `pthread_attr_setstacksize(&attr, ss*1024)`), so 2048 here means 2 MB, matching the MySQL resolver pool.
Address @gemini-code-assist review on PR #5807: the previous patch unconditionally evaluated `atoi(sv)` in the disjunction `(is_mariadb || atoi(sv) < 8)`. If `sv` was NULL, the short-circuit on `is_mariadb` (false) would still reach `atoi(NULL)`, which is undefined behavior per the C standard. In practice `myconn->mysql->server_version` is non-NULL by the time validate_charset() runs (CHANGING_CHARSET is only reachable after the backend handshake has populated server_version), so this is defensive, not a real crash path. Still essentially free, and keeps the guard consistent with the existing NULL check on `strstr(sv, "MariaDB")`. No behavior change for any reachable input.
The same mental defect fixed in lib/MySQL_Variables.cpp:validate_charset()
(commit before this one) exists at four other call sites in core. All
four use a first-character or 3-byte string-prefix comparison against
server_version and consequently misclassify either MariaDB or MySQL 9.x
(or both). Two are latent (MariaDB happens to land on the still-working
branch); the two in MySQL_Monitor.cpp are an active break on MySQL 9.x
Galera/PXC, because `INFORMATION_SCHEMA.GLOBAL_STATUS` was removed in
MySQL 8.4 and MySQL 9.x falls into the else branch that targets it.
Audited sites and fixes:
lib/MySQL_Session.cpp:2607-2615 -- tx_isolation / tx_read_only
Pre-fix: `strncmp(sv,"8",1)==0` chose `transaction_isolation` /
`transaction_read_only`; everything else got `tx_isolation` /
`tx_read_only`. MySQL 9.x fell to the else branch, but `tx_isolation`
was removed in MySQL 8.4; on MySQL 9.x the SET fails. MariaDB also
fell to the else branch -- coincidentally correct, since MariaDB
keeps `tx_isolation` across all versions.
Post-fix: use the modern name only for non-MariaDB MySQL >= 8.0
(any major), so MySQL 8.0/8.4/9.x all get `transaction_isolation`
and MariaDB 10/11 keeps `tx_isolation`.
lib/MySQL_Monitor.cpp:749 (sync MON_GALERA setup)
lib/MySQL_Monitor.cpp:2297 (async MON_GALERA dispatch)
Pre-fix: `strncmp(sv,"5.7",3)==0 || strncmp(sv,"8",1)==0` picked
performance_schema.global_status; everything else used
INFORMATION_SCHEMA.GLOBAL_STATUS. MySQL 9.x fell to the else branch
and queries INFORMATION_SCHEMA.GLOBAL_STATUS, which was removed in
MySQL 8.4 -- monitor query fails on MySQL 9.x Galera/PXC. MariaDB
Galera correctly fell to the else branch and uses
information_schema, which is where MariaDB keeps GLOBAL_STATUS.
Post-fix: use performance_schema for non-MariaDB MySQL >= 5.7 (so
MySQL 5.7, 8.0, 8.4, 9.x all share that path); MariaDB Galera and
legacy MySQL/PXC < 5.7 retain the information_schema path.
Pattern used at all sites (matching the existing fix in
validate_charset):
const char *sv = ...->mysql->server_version;
bool is_mariadb = (sv != NULL && strstr(sv, "MariaDB") != NULL);
int sv_major = (sv != NULL) ? atoi(sv) : 0;
No dedicated test added for the Monitor / SET TRANSACTION paths in this
commit: the Monitor change is exercised by any mariadb10-galera or
mysql{8,9}x-* CI group with Galera/PXC backends, and the
transaction_isolation change by SET-isolation-level coverage in
existing TAP groups. The validate_charset regression test added in
the prior commit on this branch still covers the original #5790 path.
A subagent code review on top of the CodeRabbit + Gemini bot reviews found five additional issues that the bots missed. All five are fixed here. 1. LOAD PGSQL SERVERS TO RUNTIME was waking MySQL's resolver, not PgSQL's. PgSQL_HostGroups_Manager::commit() inherited a MySQL_Monitor::trigger_dns_cache_update() call from a copy of the MySQL HGM, so after a LOAD pgsql hostnames had to wait for the refresh interval (default 60 s) before becoming resolvable — defeating the warm-cache-before-traffic intent that motivated issue #5768. Route to PgSQL_Monitor::trigger_dns_cache_update(). Verified directly: with refresh_interval=60000 ms, inserting a pgsql_servers row + LOAD now produces record_updated=1 within ~4 s instead of waiting a full minute. 2. DNSResolverWorker could deadlock shutdown. monitor_dns_resolver_thread isn't noexcept (vector growth, set_thread_name, etc. can throw); if it threw, the std::promise on DNS_Resolve_Data was never satisfied, the producer's future::get() blocked forever, the resolver loop never reached the shutdown check, and pthread_join in main hung. Wrap the call in try/catch and force-satisfy the promise with a failure result on any uncaught exit. 3. Test step 3 wasn't actually exercising the connect-path lookup it claimed. hammer_proxy() routes through the proxy front-end, which in standard test infra resolves through the default hostgroup — typically configured with an IP literal (see e.g. test/infra/docker-pgsql16-single/conf/proxysql/config.sql). IP literals bypass dns_lookup via validate_ip(), so the dns_cache_queried bumps observed came from the resolver loop alone, not from the connect path. Step 3 would have passed even if PgSQL_Connection::connect_start_DNS_lookup() was wired up wrong. Install a pgsql_query_rules row from inside the test that routes testuser to hostgroup 999, so the proxy actually calls into the cache on client connects. Clean up the rule in the test exit path. 4. Use-after-move in the _dns_cache_update debug log. Both PgSQL_Monitor::_dns_cache_update and MySQL_Monitor::_dns_cache_update logged debug_iplisttostring(ip_address) after std::move'ing ip_address into add_if_not_exist(). The moved-from vector is in a "valid but unspecified" state — typically empty — so the log printed "IP:[]" instead of the IPs. Capture the string before the move. Same bug also fixed in the bookkeeper expired-record requeue path inside PgSQL_Monitor::monitor_dns_cache. 5. monitor_dns_cache had inconsistent indentation inside the `if (hostnames.empty()) { ... } else { ... }` block introduced when I wrapped the original goto-based control flow. The braces were correct but the body sat at the wrong tab depth, making the scope visually misleading and prone to mis-edits. Re-indented. Findings deliberately skipped from the review (rationale in commit body): - "thread_local shared_ptr<DNS_Cache> keeps stale cache alive across GloPgMon recreate" — pre-existing MySQL pattern; GloPgMon is never destroyed mid-process today. - "PgSQL_monitor_dns_cache_pthread can deref GloPgMon during a tight shutdown" — shutdown ordering in main.cpp sets GloPgMon->shutdown before joining the thread and deletes GloPgMon only afterwards, matching the MySQL pattern. - "cache only seeded on ASYNC_CONNECT_SUCCESSFUL" — matches MySQL; pre-existing scope choice. - "validate_ip rejects bracketed IPv6" — pre-existing in MySQL; no current configuration path produces bracketed IPv6 in pgsql_servers. - "Test relies on outbound DNS for example.com" — matches the upstream MySQL test_dns_cache test which uses google.com / yahoo.com / amazon.com. Established convention. - "ai_family hard-coded to AF_UNSPEC for PgSQL" — already commented in code; follow-up if/when pgsql-resolution_family is added.
CI-lint-groups-json requires keys in groups.json to be sorted.
Follow-up to the cluster-2 ParserSQL 1.0.3 bump (this same PR),
addressing review feedback from coderabbit + gemini.
## 1. PgSQL_Session: accept multi-value SET vectors (coderabbit, Major)
The previous commit `5a48004af` taught the adapter to walk every RHS
sibling so PG multi-value lists (e.g. `SET search_path TO "$user",
public`) come through as a `std::vector<std::string>` of every value.
But `lib/PgSQL_Session.cpp:4516` still had `size() > 2` →
`unable_to_parse_set_statement()`, which would have caused 3+ value
PG SETs to lock the hostgroup instead of being processed. Worse,
the downstream value-consumer at the same site used only
`it->second.front()`, silently dropping every value past the first
for valid 2-value SETs too.
Fix:
* Drop the upper bound from the cardinality check (keep `< 1` so
empty value lists still fail loudly).
* Join every value in the vector into `value1` with the
comma-separated form PG itself uses for these parameters
(`a, b, c`). Single-value SETs hit the loop with `size()==1` and
take just `it->second.front()`, preserving the previous behaviour.
The validator / hash-compare / tracker / param_status code below the
join is unchanged — it now receives the full joined value instead of
the truncated first element.
## 2. setparser_parsersql_test: drop INTERVAL test that locked in
incomplete behaviour (subagent correctness review, Major)
`parsersql_pgsql_time_zone` previously asserted that
`SET TIME ZONE INTERVAL '7' HOUR` produces `timezone = "INTERVAL"`,
which is the *current* observable output of ParserSQL — the
expression parser does not yet consume the `'7' HOUR` modifier
chain. Encoding that as expected output would flip the test red
when ParserSQL is later improved to capture the full interval
expression, conflating an improvement with a regression.
Removed the case; left a comment in the fixture explaining the gap
so a future contributor adds the case back together with the
ExpressionParser fix.
## 3. Diag-output cleanup (gemini, cosmetic)
* `%lu` → `%zu` for `size_t` (portability — `size_t` is not always
`unsigned long` on 32-bit / Windows).
* Extract the per-vector " | "-separated diag-string builder into
a single `join_values_for_diag()` helper used by both
`TestParse` (now also dumps full vector on mismatch, where it
previously printed just `vals[0]`) and `TestParsePgsql`.
## Verification
* `./setparser_parsersql_test`: 260/270 ok; the 10 remaining
not-ok are unchanged pre-existing failures (`sql_mode 23-25`,
`Set1_v2 0-1`), nothing new from this commit.
* `pgsql_time_zone` group: 6 assertions (3 cases × 2) — confirms
the INTERVAL case is removed cleanly.
* `pgsql_search_path` group: 16 assertions still all pass — the
multi-value join in PgSQL_Session does not affect the parser-
side test (which goes through the adapter, not the session).
The previous commit introduced local variable declarations with
initializers (`const char *sv = …`, `bool is_mariadb = …`, etc.)
directly inside `case MON_GALERA:` of the `switch` in
`init_async()` at lib/MySQL_Monitor.cpp:742. The sibling
`case MON_CLOSE_CONNECTION:` / `MON_CONNECT:` / `MON_AWS_AURORA:`
labels then jump past these initializations, which GCC rejects:
MySQL_Monitor.cpp:779:14: error: jump to case label
MySQL_Monitor.cpp:781:14: error: jump to case label
MySQL_Monitor.cpp:783:14: error: jump to case label
Wrap the `#else` body of the MON_GALERA case in `{ ... }` so the
declarations live in their own scope and no sibling case can jump
over them. No behavior change.
The async path at lib/MySQL_Monitor.cpp:2299 was already inside its
own `{ ... }` block (added with the original code, not by this
patch), so it compiled fine and needs no change. The
`MySQL_Session.cpp` edits live inside `if / else if` chains with
their own braces, also unaffected.
fix(ci): prevent zombie IN_PROGRESS checks in CI-unit-tests-asan-coverage
fix(session): SET STATEMENT ... FOR detection tolerates any whitespace (PR #5708 follow-up)
fix: pgsql digest fixtures (typecast spacing) + ParserSQL 1.0.3 bump (PG SET TIME ZONE + multi-value lists)
Independent DNS cache for PgSQL (fixes #5768)
fix(core): validate_charset must replace collation 255 on MariaDB backends (#5790)
fix(session): avoid double-free of stmt_meta pkt in LargePacket (#5639)
validate_ip() treated inet_pton(...) != 0 as success, but inet_pton returns 1 / 0 / -1; a -1 (e.g. unsupported family) would be reported as a valid IP. Compare against == 1 instead. The DNS cache counters (dns_cache_queried / dns_cache_lookup_success / dns_cache_record_updated) were declared as plain unsigned long long on MySQL_Monitor. Resolver workers update them atomically via __sync_fetch_and_add, but p_update_metrics() reads them from a different thread without a barrier, racing the writers. Promote to std::atomic<unsigned long long> and switch DNS_Cache to fetch_add / the atomic pointer type so both ends are consistent. Addresses CodeRabbit review feedback on PR #5806.
GloPgMon->shutdown was a plain bool, written by main.cpp's teardown and read by the monitor / DNS resolver loops on worker threads — a data race that can let the loops spin past shutdown or observe a stale value indefinitely. Promote to std::atomic<bool> and switch the readers/writer to acquire/release ordering. PgSQL_Monitor::dns_lookup() was also returning an empty string when GloPgMon / dns_cache_thread weren't initialized yet, ignoring the return_hostname_if_lookup_fails contract. Move the fallback out of the dns_cache_thread branch so startup / shutdown behave like a normal cache miss for callers that asked for the hostname fallback. Addresses CodeRabbit review feedback on PR #5806.
Two regressions surfaced by CodeRabbit on PR #5806: * admin_exec() return values for the DNS-cache runtime config ('SET pgsql-monitor_local_dns_cache_ttl' / 'LOAD PGSQL VARIABLES TO RUNTIME') were silently dropped, so a failed setup cascaded into misleading downstream assertions. BAIL_OUT immediately. * The pgsql_query_rules INSERT hardcoded username='testuser', but hammer_proxy() connects as cl.pgsql_username. When the infra's test user is anything else the rule never matches, hostgroup 999 is bypassed, and the DNS-cache assertions exercise the wrong path. Build the INSERT with cl.pgsql_username instead.
The admin connection used cl.host, which is the frontend endpoint. In infrastructures where the admin interface is on a different host the test would false-fail on the admin connect, unrelated to collation behavior. Use cl.admin_host. Addresses CodeRabbit review feedback on PR #5807.
The previous commit promoted MySQL_Monitor::dns_cache_{queried,
lookup_success,record_updated} to std::atomic<unsigned long long>
but missed several call sites that pass them through implicit copy:
* lib/MySQL_Thread.cpp sprintf("%llu", GloMyMon->dns_cache_*) — varargs
cannot copy std::atomic.
* lib/PgSQL_Thread.cpp sprintf("%llu", GloPgMon->dns_cache_*) — same,
for the PgSQL counters.
* lib/MySQL_Monitor.cpp p_update_counter(..., GloMyMon->dns_cache_*) —
works via implicit operator T(), but explicit .load() is clearer
and matches the rest of the file.
Promote PgSQL_Monitor::dns_cache_* to std::atomic for the same race
that motivated the MySQL side (resolver workers increment, stats
readers and p_update_metrics read from another thread), and add
.load() at every read site so we don't depend on implicit conversion.
Verified with a local 'make -j32 debug' build in the integration
worktree.
PROXYSQLGENAI was a leftover from before the genai plugin carve-out. All genai/MCP/AI/RAG code now lives in plugins/genai/ as a runtime loadable .so. Since the plugin chassis (PROXYSQL40=1) already builds and loads the genai .so, a separate flag serves no purpose. Entrypoint changes (all three -- rhel, suse, deb): - Build: pass PROXYSQL40=1 (not PROXYSQLGENAI=1) to make. - Packaging: copy genai .so alongside mysqlx when PROXYSQL40=1. - protobuf-devel install: gate on PROXYSQL40=1 (needed for mysqlx). Automated packaging verification added to all entrypoints: rpm -qpl / dpkg -c confirms both .so files are in the package. ELF smoke test: validates .so is a valid shared object exporting proxysql_plugin_descriptor_v1 symbol. Fails build on mismatch. Test Makefiles: PROXYSQLGENAI -> PROXYSQL40 for the vec.o link gate (test/tap/tests/Makefile and test/deps/cluster_simulator/Makefile). Updated remaining PROXYSQLGENAI references to PROXYSQL40 in: CLAUDE.md, include/cpp.h, lib/proxy_sqlite3_symbols.cpp, src/main.cpp (3x comments), proxysql.spec, test/infra/control/start-proxysql-isolated.bash, test/infra/control/run-unit-tests-asan-coverage.bash, and 14 genai test file comments.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (2)
📒 Files selected for processing (83)
📝 WalkthroughWalkthroughThis PR: (1) consolidates v4.0 plugin gating under PROXYSQL40 and adds package-install verification to many CI package jobs (reusable workflow + verifier), (2) introduces a shared DNS cache and resolver workers plus PgSQL monitor integration and a dedicated DNS-cache thread, (3) extends ParserSQL/PG SET parsing to support multi-value RHS, (4) fixes session/collation/version detection and other robustness issues, and (5) updates tests and TAP framework (remove skip_all → BAIL_OUT). ChangesPlugin Chassis Tier Unification & CI package verification
Docker entrypoints & packaging smoke tests
Shared DNS cache and resolver infrastructure
PgSQL monitor, connection integration & thread lifecycle
ParserSQL Multi-Value SET Statement Support
MySQL session robustness & collation handling
TAP test framework & test-suite updates
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Suggested labels
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Code Review
This pull request refactors the DNS cache infrastructure into a shared class, enabling independent DNS caching for both MySQL and PostgreSQL monitors. It introduces several critical fixes for MySQL 8.4+ and MariaDB compatibility, specifically regarding Galera healthcheck queries, collation ID handling, and whitespace-tolerant SET STATEMENT parsing. Additionally, it resolves a double-free vulnerability in large packet handling and updates the test framework to favor explicit failures over silent skips. Review feedback identifies opportunities to improve the accuracy of build-time error logging, ensure robust return-value checking for network address conversions, and refine the staggering logic used in the resolver threads.
| if nm -D "${plugin_path}" 2>/dev/null | grep -q 'proxysql_plugin_descriptor_v1'; then | ||
| echo " OK ${plugin} (exports proxysql_plugin_descriptor_v1)" | ||
| else | ||
| echo " WARN ${plugin} (no proxysql_plugin_descriptor_v1 symbol)" >&2 |
There was a problem hiding this comment.
The log message uses WARN but the logic sets ALL_OK=1, which causes the build to fail later at line 201. This should be labeled as FAIL to accurately reflect that it is a fatal error for the packaging process.
| echo " WARN ${plugin} (no proxysql_plugin_descriptor_v1 symbol)" >&2 | |
| echo " FAIL ${plugin} (no proxysql_plugin_descriptor_v1 symbol)" >&2 |
| if nm -D "${plugin}" 2>/dev/null | grep -q 'proxysql_plugin_descriptor_v1'; then | ||
| echo " OK ${plugin} (exports proxysql_plugin_descriptor_v1)" | ||
| else | ||
| echo " WARN ${plugin} (no proxysql_plugin_descriptor_v1 symbol)" >&2 |
There was a problem hiding this comment.
The log message uses WARN but the logic sets ALL_OK=1, which causes the build to fail later at line 179. This should be labeled as FAIL to accurately reflect that it is a fatal error for the packaging process.
| echo " WARN ${plugin} (no proxysql_plugin_descriptor_v1 symbol)" >&2 | |
| echo " FAIL ${plugin} (no proxysql_plugin_descriptor_v1 symbol)" >&2 |
| if nm -D "${plugin}" 2>/dev/null | grep -q 'proxysql_plugin_descriptor_v1'; then | ||
| echo " OK ${plugin} (exports proxysql_plugin_descriptor_v1)" | ||
| else | ||
| echo " WARN ${plugin} (no proxysql_plugin_descriptor_v1 symbol)" >&2 |
There was a problem hiding this comment.
The log message uses WARN but the logic sets ALL_OK=1, which causes the build to fail later at line 173. This should be labeled as FAIL to accurately reflect that it is a fatal error for the packaging process.
| echo " WARN ${plugin} (no proxysql_plugin_descriptor_v1 symbol)" >&2 | |
| echo " FAIL ${plugin} (no proxysql_plugin_descriptor_v1 symbol)" >&2 |
| if (p->ai_family == AF_INET) { | ||
| struct sockaddr_in* ipv4 = (struct sockaddr_in*)p->ai_addr; | ||
| inet_ntop(p->ai_addr->sa_family, &ipv4->sin_addr, ip_addr, INET_ADDRSTRLEN); | ||
| ips.push_back(ip_addr); | ||
| } | ||
| else { | ||
| struct sockaddr_in6* ipv6 = (struct sockaddr_in6*)p->ai_addr; | ||
| inet_ntop(p->ai_addr->sa_family, &ipv6->sin6_addr, ip_addr, INET6_ADDRSTRLEN); | ||
| ips.push_back(ip_addr); | ||
| } |
There was a problem hiding this comment.
The return value of inet_ntop is not checked. If it fails, an uninitialized or stale ip_addr buffer will be pushed into the ips vector. Additionally, the else block should explicitly check for AF_INET6 and use sizeof(ip_addr) for robustness.
for (auto p = res; p != NULL; p = p->ai_next) {
if (p->ai_family == AF_INET) {
struct sockaddr_in* ipv4 = (struct sockaddr_in*)p->ai_addr;
if (inet_ntop(p->ai_family, &ipv4->sin_addr, ip_addr, sizeof(ip_addr)))
ips.push_back(ip_addr);
}
else if (p->ai_family == AF_INET6) {
struct sockaddr_in6* ipv6 = (struct sockaddr_in6*)p->ai_addr;
if (inet_ntop(p->ai_family, &ipv6->sin6_addr, ip_addr, sizeof(ip_addr)))
ips.push_back(ip_addr);
}
}| rc = pthread_rwlock_unlock(&rwlock_); | ||
|
|
||
| if (item_removed && counter_record_updated_) | ||
| counter_record_updated_->fetch_add(1, std::memory_order_relaxed); | ||
|
|
||
| assert(rc == 0); |
There was a problem hiding this comment.
The assertion for the return code of pthread_rwlock_unlock should be placed immediately after the call to ensure it refers to the correct operation.
| rc = pthread_rwlock_unlock(&rwlock_); | |
| if (item_removed && counter_record_updated_) | |
| counter_record_updated_->fetch_add(1, std::memory_order_relaxed); | |
| assert(rc == 0); | |
| rc = pthread_rwlock_unlock(&rwlock_); | |
| assert(rc == 0); | |
| if (item_removed && counter_record_updated_) | |
| counter_record_updated_->fetch_add(1, std::memory_order_relaxed); |
| if (delay_us > 1000000 || delay_us <= 0) | ||
| delay_us = 10000; |
There was a problem hiding this comment.
|
Wrong base — will recreate from clean v3.0 |
Add verify-package-install.bash that installs the built .deb/.rpm on a clean Docker image of the target distro and runs smoke tests: - proxysql --version catches missing runtime shared library deps - Plugin .so file presence at /usr/lib/proxysql/ (auto-detected) - ELF header validation on installed binaries The script auto-detects whether the package ships plugins by inspecting dpkg -c / rpm -qpl, so it works for both v3.0 (no plugins) and v4.0 genai/mysqlx packages. Wire it into all 28 CI-package-arm64-* workflows (14 genai + 14 base) between the Build and Upload steps. If verification fails, the build job fails and the package never reaches the release draft.
|
Superseded by #5815 — closing, all changes on fix/kill-proxysqlgenai-build-flag-v2 |
|
❌ The last analysis has failed. |
PROXYSQLGENAI was a leftover from before the genai plugin carve-out. All genai/MCP/AI/RAG code now lives in
plugins/genai/as a runtime-loadable.so. Since the plugin chassis (PROXYSQL40=1) already builds and loads the genai.so, a separate flag serves no purpose.What changed
Entrypoints (rhel-compliant, suse-compliant, deb-compliant)
PROXYSQL40=1(notPROXYSQLGENAI=1) to make. All three now use the same EXTRA pattern..soalongside mysqlx whenPROXYSQL40=1— no separatePROXYSQLGENAIgate.PROXYSQL40=1(needed for mysqlx plugin build).Automated package verification (added to all 3 entrypoints)
rpm -qpl/dpkg -cconfirms both.sofiles are present. Fails build if missing..sofiles, verifies they're valid ELF 64-bit shared objects, checks they exportproxysql_plugin_descriptor_v1. Fails build on mismatch.Build/test infrastructure
test/tap/tests/Makefile: changed 2 vec.o link gates fromPROXYSQLGENAItoPROXYSQL40.test/deps/cluster_simulator/Makefile: same change.Documentation cleanup
CLAUDE.md, include/cpp.h, lib/proxy_sqlite3_symbols.cpp, src/main.cpp, proxysql.spec, and test infra scripts updated.
Verification
PROXYSQL40=1 make packagesnow builds and packages both mysqlx and genai.sofiles.PROXYSQLGENAIenv var is no longer consulted anywhere in the build or packaging pipeline.Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests / CI