Steps to reproduce:
```sql
CREATE TABLE ref_table (key int);
SELECT create_reference_table('ref_table');

SET citus.enable_ddl_propagation TO OFF;
DROP TABLE ref_table CASCADE; -- citus_drop_all_shards doesn't drop shards and metadata

SELECT 1 FROM master_add_node('localhost', 9700, groupId => 0); -- a new node without the ref table

CREATE TABLE citus_local_table(id int, other_column int);
SELECT citus_add_local_table_to_metadata('citus_local_table'); -- triggers EnsureReferenceTablesExistOnAllNodes

-- ref_table has placements on both workers, but not on the coordinator
-- EnsureReferenceTablesExistOnAllNodes tries to replicate it to the coordinator
```
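After the `DROP TABLE` above, the stale state can be observed directly from the coordinator. This is a sketch against Citus's metadata tables (`pg_dist_shard`, `pg_dist_placement`); it only illustrates that the shard and placement rows for the dropped table survive:

```sql
-- The shell table is gone, but shard and placement rows for ref_table remain.
-- groupid identifies the worker group each placement supposedly lives on.
SELECT shardid, groupid
FROM pg_dist_shard
JOIN pg_dist_placement USING (shardid);
```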
`DROP TABLE` triggers some other drop functions:
- `master_remove_distributed_table_metadata_from_workers`
- `citus_drop_all_shards`
- `master_remove_partition_metadata`

but these functions return early without dropping anything, because `enable_ddl_propagation` is set to off.

The actual table is dropped on the coordinator, but the shards remain on the workers. `pg_dist_partition`, `pg_dist_shard` and `pg_dist_placement` are not touched either. So the metadata still says that we have a reference table, with shard placements on both workers.
When `master_add_node` is called, it introduces a new node which doesn't have the reference table on it.

`citus_add_local_table_to_metadata` calls `EnsureReferenceTablesExistOnAllNodes`, which fetches all reference table records from `pg_dist_partition`, including the dropped one. Then, because `pg_dist_placement` shows no placement for the reference table on the new node, it tries to replicate the reference table there by calling `citus_copy_shard_placement`.

`citus_copy_shard_placement` crashes while recreating the table on the target node, because it can't find the name/schema for the given relationId: that table doesn't exist anymore.
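The dangling entry is visible from psql as well. This sketch relies on the PostgreSQL behavior that a `regclass` value whose OID no longer resolves to a relation is printed as the bare OID, which is essentially the same name/schema lookup that fails inside `citus_copy_shard_placement`:

```sql
-- pg_dist_partition.logicalrelid is of type regclass. For the dropped table
-- the OID no longer has a pg_class entry, so it prints as a raw number
-- instead of schema.name.
SELECT logicalrelid, partmethod, repmodel
FROM pg_dist_partition;
```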
In short, I think we have two separate issues here:

- `DROP TABLE` doesn't drop the shards and the metadata when DDL propagation is off. We can maybe leave the shards on the workers, but we should delete the `pg_dist_` entries for the table, as we actually don't have that reference/distributed table anymore. This would prevent the crash, since there would be no entry in `pg_dist_partition` for the table. The shards left on the workers can either be cleaned up right away, ignoring the GUC, or we can create cleanup records for them so that they are dropped once the GUC is set back to on. We can also consider removing the GUC in the future.
- We should add some checks & errors to the `citus_copy_shard_placement` logic so that it never crashes for cases like this but gracefully errors out.
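Until the first issue is fixed, a manual workaround for the stale metadata might look like the following. This is a hedged sketch only: it deletes rows directly from Citus catalog tables, assumes superuser access, and ignores locking and metadata syncing that a proper fix inside Citus would need:

```sql
-- Hypothetical manual cleanup: remove pg_dist_* rows whose logicalrelid no
-- longer resolves to an existing relation (i.e. the table was dropped but the
-- metadata survived because enable_ddl_propagation was off).
BEGIN;
DELETE FROM pg_dist_placement
 WHERE shardid IN (SELECT shardid FROM pg_dist_shard
                   WHERE logicalrelid NOT IN (SELECT oid FROM pg_class));
DELETE FROM pg_dist_shard
 WHERE logicalrelid NOT IN (SELECT oid FROM pg_class);
DELETE FROM pg_dist_partition
 WHERE logicalrelid NOT IN (SELECT oid FROM pg_class);
COMMIT;
```

The orphaned shards on the workers would still need to be dropped there separately.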