Optimize slow query that uses a high amount of temporary disk space to find relations

Resolves dbt-labs#189

The macro postgres_get_relations in relations.sql was extremely slow and used
an extremely high amount of temporary disk space on a system with high numbers
of schemas, tables, and dependencies between database objects (rows in
pg_depend). It was slow to the point of not completing in 50 minutes, and it
used more than 160GB of disk space (at which point PostgreSQL ran out of disk
space and aborted the query).

The solution here optimises the query so that it runs in just under 1 second
on my system. It does this by:

- Stripping out CTEs, which can be optimisation blockers, often because they
  are materialised to disk (especially in older PostgreSQL, but I suspect in
  recent versions too in some cases).
- Removing unnecessary filtering on relkind: I think the join on pg_rewrite is
  equivalent to that filter.
- Not having `select distinct ... from pg_depend` in the innards of the
  query, and instead having a top level `select distinct` - on my system this
  saved over 45 seconds.
- Excluding relations that depend on themselves by comparing oids rather than
  comparing the names of tables and schemas.
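
The points above can be tried out on toy data. The following is a minimal sketch only: it uses SQLite (via Python's standard library) as a stand-in for PostgreSQL, with hypothetical tables that borrow the catalog names and hold only the columns the new query touches. The sample rows are made up for illustration. It shows the single top-level `select distinct`, the self-dependency exclusion by oid, and the system-schema filter; it says nothing about PostgreSQL's query planner or performance.

```python
import sqlite3

# Toy stand-ins for the PostgreSQL catalogs (assumption: only the columns
# used by the query are modelled; SQLite is used purely to make this runnable).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table pg_namespace (oid integer primary key, nspname text);
    create table pg_class (oid integer primary key, relname text, relnamespace integer);
    create table pg_rewrite (oid integer primary key, ev_class integer);
    create table pg_depend (objid integer, refobjid integer);

    insert into pg_namespace values (1, 'analytics'), (2, 'pg_catalog');
    insert into pg_class values
        (10, 'orders', 1),        -- a table
        (11, 'orders_view', 1),   -- a view over orders
        (12, 'pg_type', 2);       -- a relation in a system schema
    -- orders_view has one rewrite rule (oid 100)
    insert into pg_rewrite values (100, 11);
    -- Made-up dependencies of that rule: orders (twice), the view itself,
    -- and a relation in a system schema.
    insert into pg_depend values (100, 10), (100, 10), (100, 11), (100, 12);
""")

# Same shape as the new query; "escape" is spelled out because SQLite's LIKE
# has no default escape character.
QUERY = r"""
    select distinct
        dependent_namespace.nspname as dependent_schema,
        dependent_class.relname as dependent_name,
        referenced_namespace.nspname as referenced_schema,
        referenced_class.relname as referenced_name
    from pg_class as dependent_class
    join pg_namespace as dependent_namespace
        on dependent_namespace.oid = dependent_class.relnamespace
    join pg_rewrite as dependent_rewrite
        on dependent_rewrite.ev_class = dependent_class.oid
    join pg_depend on pg_depend.objid = dependent_rewrite.oid
    join pg_class as referenced_class
        on referenced_class.oid = pg_depend.refobjid
    join pg_namespace as referenced_namespace
        on referenced_namespace.oid = referenced_class.relnamespace
    where
        dependent_class.oid != referenced_class.oid
        and dependent_namespace.nspname != 'information_schema'
        and dependent_namespace.nspname not like 'pg\_%' escape '\'
        and referenced_namespace.nspname != 'information_schema'
        and referenced_namespace.nspname not like 'pg\_%' escape '\'
    order by
        dependent_schema, dependent_name, referenced_schema, referenced_name
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

In this toy data set the duplicate dependency collapses under the top-level `distinct`, the view's dependency on itself is dropped by the oid comparison, and the system-schema row is filtered out, leaving a single (dependent, referenced) pair.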

It also has some style/naming changes:

- Flips the definition of "referenced" and "dependent" in the query to match
  both the definitions in pg_depend, and the code at
  https://github.com/dbt-labs/dbt-postgres/blob/05f0337d6b05c9c68617e41c0b5bca9c2a733783/dbt/adapters/postgres/impl.py#L113
- Re-orders the joins into what I think is a slightly clearer order that
  "flows" from views, via the linking table (pg_depend), to the tables
  referenced in the views.
- Lowers the abstraction/indirection levels in naming/aliases, using names
  closer to the PostgreSQL catalog tables. This made the query easier to write
  and understand, and so I suspect easier to change in future (I found I had
  to keep in mind the PostgreSQL definitions more than the output of the query
  when making changes).
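
To make the "dependent" and "referenced" direction concrete, here is a small hypothetical sketch of consuming the query's output. It assumes rows arrive as tuples in the column order of the new select (dependent_schema, dependent_name, referenced_schema, referenced_name). The function name `group_dependents` and the sample rows are invented for illustration, and this is not the adapter's actual code.

```python
from collections import defaultdict


def group_dependents(rows):
    """Map each referenced relation to the set of relations that depend on it.

    Each row is assumed to be a 4-tuple:
    (dependent_schema, dependent_name, referenced_schema, referenced_name).
    """
    dependents_of = defaultdict(set)
    for dep_schema, dep_name, ref_schema, ref_name in rows:
        dependents_of[(ref_schema, ref_name)].add((dep_schema, dep_name))
    return dict(dependents_of)


# Made-up sample: orders_view selects from orders, and orders_summary
# selects from orders_view.
rows = [
    ("analytics", "orders_view", "analytics", "orders"),
    ("analytics", "orders_summary", "analytics", "orders_view"),
]
links = group_dependents(rows)
print(links)
```

Keying the mapping by the referenced relation mirrors the pg_depend convention, where objid identifies the dependent object and refobjid the object it refers to.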
michalc committed Jan 18, 2025
1 parent 05f0337 commit c099b27
Showing 2 changed files with 31 additions and 61 deletions.
6 changes: 6 additions & 0 deletions .changes/unreleased/Fixes-20250118-084103.yaml
@@ -0,0 +1,6 @@
+kind: Fixes
+body: Optimize slow query that uses a high amount of temporary disk space to find relations
+time: 2025-01-18T08:41:03.022013Z
+custom:
+  Author: michalc
+  Issue: "189"
86 changes: 25 additions & 61 deletions dbt/include/postgres/macros/relations.sql
@@ -7,68 +7,32 @@
 #}

 {%- call statement('relations', fetch_result=True) -%}
-with relation as (
-select
-pg_rewrite.ev_class as class,
-pg_rewrite.oid as id
-from pg_rewrite
-),
-class as (
-select
-oid as id,
-relname as name,
-relnamespace as schema,
-relkind as kind
-from pg_class
-),
-dependency as (
-select distinct
-pg_depend.objid as id,
-pg_depend.refobjid as ref
-from pg_depend
-),
-schema as (
-select
-pg_namespace.oid as id,
-pg_namespace.nspname as name
-from pg_namespace
-where nspname != 'information_schema' and nspname not like 'pg\_%'
-),
-referenced as (
-select
-relation.id AS id,
-referenced_class.name ,
-referenced_class.schema ,
-referenced_class.kind
-from relation
-join class as referenced_class on relation.class=referenced_class.id
-where referenced_class.kind in ('r', 'v', 'm')
-),
-relationships as (
-select
-referenced.name as referenced_name,
-referenced.schema as referenced_schema_id,
-dependent_class.name as dependent_name,
-dependent_class.schema as dependent_schema_id,
-referenced.kind as kind
-from referenced
-join dependency on referenced.id=dependency.id
-join class as dependent_class on dependency.ref=dependent_class.id
-where
-(referenced.name != dependent_class.name or
-referenced.schema != dependent_class.schema)
-)
+select distinct
+dependent_namespace.nspname as dependent_schema,
+dependent_class.relname as dependent_name,
+referenced_namespace.nspname as referenced_schema,
+referenced_class.relname as referenced_name

-select
-referenced_schema.name as referenced_schema,
-relationships.referenced_name as referenced_name,
-dependent_schema.name as dependent_schema,
-relationships.dependent_name as dependent_name
-from relationships
-join schema as dependent_schema on relationships.dependent_schema_id=dependent_schema.id
-join schema as referenced_schema on relationships.referenced_schema_id=referenced_schema.id
-group by referenced_schema, referenced_name, dependent_schema, dependent_name
-order by referenced_schema, referenced_name, dependent_schema, dependent_name;
+-- Query for views: views are entries in pg_class with an entry in pg_rewrite
+from pg_class as dependent_class
+join pg_namespace as dependent_namespace on dependent_namespace.oid = dependent_class.relnamespace
+join pg_rewrite as dependent_rewrite on dependent_rewrite.ev_class = dependent_class.oid
+
+-- ... and via pg_depend
+join pg_depend on pg_depend.objid = dependent_rewrite.oid
+
+-- ... we can find the tables they query from in pg_class
+join pg_class as referenced_class on referenced_class.oid = pg_depend.refobjid
+join pg_namespace as referenced_namespace on referenced_namespace.oid = referenced_class.relnamespace
+
+-- ... and we exclude system catalogs, and exclude views depending on themselves
+where
+dependent_class.oid != referenced_class.oid
+and dependent_namespace.nspname != 'information_schema' and dependent_namespace.nspname not like 'pg\_%'
+and referenced_namespace.nspname != 'information_schema' and referenced_namespace.nspname not like 'pg\_%'
+
+order by
+dependent_schema, dependent_name, referenced_schema, referenced_name;

 {%- endcall -%}
