[Bug] Extremely slow query to determine relations that uses all disk space (160GB) #189
Comments
So I think this is equivalent, and runs in just under 1 second on my system:

select distinct
dependent_namespace.nspname as dependent_schema,
dependent_class.relname as dependent_name,
referenced_namespace.nspname as referenced_schema,
referenced_class.relname as referenced_name
-- Query for views: views are entries in pg_class with an entry in pg_rewrite
from pg_class as dependent_class
join pg_namespace as dependent_namespace on dependent_namespace.oid = dependent_class.relnamespace
join pg_rewrite as dependent_rewrite on dependent_rewrite.ev_class = dependent_class.oid
-- ... and via pg_depend
join pg_depend on pg_depend.objid = dependent_rewrite.oid
-- ... we can find the tables they query from in pg_class
join pg_class as referenced_class on referenced_class.oid = pg_depend.refobjid
join pg_namespace as referenced_namespace on referenced_namespace.oid = referenced_class.relnamespace
-- ... and we exclude system catalogs, and exclude views depending on themselves
where
dependent_class.oid != referenced_class.oid
and dependent_namespace.nspname != 'information_schema' and dependent_namespace.nspname not like 'pg\_%'
and referenced_namespace.nspname != 'information_schema' and referenced_namespace.nspname not like 'pg\_%'
order by
dependent_schema, dependent_name, referenced_schema, referenced_name;

The main difference that gives the performance benefit is the stripping out of CTEs: although not as bad as they once were I think, they can still be optimisation blockers I find (both for the query planner, and when manually reasoning about the query). Also not having a `select distinct ... from pg_depend` in the innards of the query, and instead a single top-level `select distinct`. Also removed the filtering on relkind, which I think is unnecessary because of the join on pg_rewrite. There are a few style/naming differences as well, which include swapping the names referenced and dependent to bring them in line with the definition in pg_depend, and with the code that has dependent first:

dbt-postgres/dbt/adapters/postgres/impl.py Line 113 in 05f0337
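On the CTE point: from PostgreSQL 12 the planner only materialises a non-recursive, side-effect-free CTE when it is referenced more than once (or when told to), and it can be steered explicitly. A minimal sketch of that syntax, using an illustrative CTE rather than anything from the original macro:

-- Sketch only: PostgreSQL 12+ lets you ask the planner not to materialise a CTE,
-- so predicates and join conditions can be pushed into it.
-- "view_dependencies" is an illustrative name, not a CTE from relations.sql,
-- and for brevity it omits the classid filter on pg_depend.
with view_dependencies as not materialized (
    select pg_rewrite.ev_class, pg_depend.refobjid
    from pg_rewrite
    join pg_depend on pg_depend.objid = pg_rewrite.oid
)
select count(*)
from view_dependencies
join pg_class on pg_class.oid = view_dependencies.refobjid;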
The explain for the above is:
Resolves dbt-labs#189

The macro postgres_get_relations in relations.sql was extremely slow and used an extremely high amount of temporary disk space on a system with high numbers of schemas, tables, and dependencies between database objects (rows in pg_depend). Slow to the point of not completing in 50 minutes and using more than 160GB disk space (at which point PostgreSQL ran out of disk space and aborted the query).

The solution here optimises the query so it runs in just under 1 second on my system. It does this by:

- Stripping out CTEs that can be optimisation blockers, often by causing CTEs to be materialised to disk (especially in older PostgreSQL, but I suspect in recent versions too in some cases).
- Removing unnecessary filtering on relkind: the join on pg_rewrite I think is equivalent to that.
- Not having `select distinct ... from pg_depend` in the innards of the query, and instead having a top-level `select distinct` - on my system this saved over 45 seconds.
- Excluding relations that depend on themselves by using oid rather than the names of tables and schemas.

It also has some style/naming changes:

- Flips the definition of "referenced" and "dependent" in the query to match both the definitions in pg_depend, and the code at https://github.com/dbt-labs/dbt-postgres/blob/05f0337d6b05c9c68617e41c0b5bca9c2a733783/dbt/adapters/postgres/impl.py#L113
- Re-orders the joins into what I think is a slightly clearer order that "flows" from views -> the linking table (pg_depend) -> the tables referenced in the views.
- Lowers the abstraction/indirection levels in naming/aliases, using names closer to the PostgreSQL catalog tables - this made it easier to write and understand, and so I suspect easier to change in future (I found I had to keep in mind the PostgreSQL definitions more than the output of the query when making changes).
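To check the timing and temporary-disk behaviour of either version of the query on a given system, it can be run under EXPLAIN with the ANALYZE and BUFFERS options, which execute the query and report per-node timings plus any temp blocks read or written when work spills to disk. A minimal sketch, abbreviated to the first few joins rather than the full query from this change:

-- Sketch: wrap the relations query in EXPLAIN (ANALYZE, BUFFERS) to see
-- execution time and temp block I/O. Abbreviated for brevity.
explain (analyze, buffers)
select distinct
dependent_namespace.nspname as dependent_schema,
dependent_class.relname as dependent_name
from pg_class as dependent_class
join pg_namespace as dependent_namespace on dependent_namespace.oid = dependent_class.relnamespace
join pg_rewrite as dependent_rewrite on dependent_rewrite.ev_class = dependent_class.oid;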
Is this a new bug?
Current Behavior
When running
dbt --log-level debug run
(after following https://docs.getdbt.com/guides/manual-install?step=1, but for PostgreSQL), I see it gets stuck on this query:

After ~50 minutes, it then errors:
And I see in our database metrics that all the local disk space on the instance was used (I have now done this multiple times) - in AWS terms the FreeLocalStorage drops to zero.
We do have I guess quite a... full... database - terabytes of data, thousands of tables, some of them with a few thousand partitions, a few thousand roles, and I suspect millions of rows in pg_auth_members if that makes a difference.
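For a rough sense of whether another database is in similar territory, the catalog tables that drive the cost of this query can be counted directly. A minimal sketch (the exact sizes at which the query becomes slow are not known precisely; these are just the catalogs involved or mentioned above):

-- Sketch: rough sizes of the system catalogs joined by the relations query.
select
(select count(*) from pg_class) as pg_class_rows,
(select count(*) from pg_depend) as pg_depend_rows,
(select count(*) from pg_rewrite) as pg_rewrite_rows,
(select count(*) from pg_auth_members) as pg_auth_members_rows;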
Running explain on the query results in this:
Expected Behavior
No error and
dbt run
completes.

Steps To Reproduce
dbt --log-level debug run
(I suspect this won't reproduce in most setups that have far less data in the database.)
Relevant log output
Environment
Additional Context
No response