[CT-1415] [Feature] Add adapter-specific aliasing for Google pseudo columns #365
@ashleemtib Thanks for identifying the problem, offering a clear write-up, and even opening a PR with some proposed code to get to a solution! I have a few questions about the user experience we should be aiming for. I'm also not sure that we'll be able to handle this pseudo-column aliasing behind-the-scenes in some cases. As a general principle, I'd rather opt for more-explicit, less-ergonomic code, over an abstraction that risks leaking. As far as the second alternative you've proposed:
Does this work if the base model is materialized as a …
Sorry I'm late to the party here ;) I need to make sure I understand how these columns are used. Can you please confirm I got this right? Let's say I want to materialize this model:

```sql
SELECT
    Column1,
    Column2,
    _PARTITIONDATE,
    _PARTITIONTIME
FROM {{ ref('upstream_model_A') }}
```

The thing is, if I define tests on these columns, some tests will break. The reasons are:
My naive take would be to manually update the model to:

```sql
SELECT
    Column1,
    Column2,
    _PARTITIONDATE AS my_model__PARTITIONDATE,
    _PARTITIONTIME AS my_model__PARTITIONTIME
FROM {{ ref('upstream_model_A') }}
```

This is your second alternative:
Which would work, but is not possible for sources (since there is no SQL file). Then what I would really want here is a `sources:` entry like:

```yml
sources:
  - name: <source_name>
    tables:
      - name: <table_name>
        columns:
          - name: <column_name>
            description: <markdown_string>
            data_type: <string>
            alias/identifier: <string>
          - name: <another_column>
            ...
```

That's straightforward at the source level, but what does that mean if the model is a table? Are we overriding the name in the SQL query there? The workaround here would be to create a model on top of it, like a view, to host the tests. But that's more code and more objects, so not the best.
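A concrete instance of that sketch might look like the following. Note this is purely illustrative: the `identifier` key on a source column does not exist yet (it is the feature being proposed), and the source and table names are hypothetical:

```yml
sources:
  - name: gcs_external          # hypothetical source name
    tables:
      - name: raw_events        # hypothetical external table
        columns:
          - name: _PARTITIONTIME
            description: "BigQuery ingestion-time pseudo column"
            data_type: timestamp
            identifier: partition_time   # proposed alias for tests to use
```

The idea being that tests defined on `_PARTITIONTIME` would select it aliased as `partition_time`, giving the audit table a legal BigQuery column name.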
@Fleid your understanding is correct! It makes a lot of sense to use your "naive" take for aliasing the column in the model itself so this error does not occur, and that is what we are doing in our dbt project currently; however, this issue pertains specifically to testing on dbt sources, not models. I don't think I explicitly stated that, so my bad! That being said, since we can't alias the column names in the sources, we'd need a way to do that. I like the idea of adding `identifier` to the yml to be used by the test, and I think it's the only solution we need in this case unless you all think it would be better to expand this to models as well. Thank you for your response!
I like that idea. I'm not sure how feasible it is, but I like it. After all, "sources make it possible to name and describe the data loaded into your warehouse by your Extract and Load tools" (from here). Let's name and describe columns and tables. I'll create a new issue over to …
There: dbt-labs/dbt-core#6929
Is this your first time submitting a feature request?
Describe the feature
Background:
Google has a concept of metadata "pseudo columns" which are columns that can be queried in BigQuery, but do not materialize in the table itself.
Some examples of these columns are `_PARTITIONDATE`, `_PARTITIONTIME`, `_FILE_NAME`, `_TABLE_SUFFIX`, etc., all of which start with an underscore. `_FILE_NAME`, for example, will be the name of the file being read in from Google Cloud Storage when an external data source is configured.
Issue:
Because these pseudo columns start with an underscore, applying dbt tests to them is difficult.
For example, `dbt_utils.unique_combination_of_columns` and the default dbt uniqueness test select columns as-is and create the `dbt__test_audit` table schema based on the selected columns.
Since column names with leading underscores are not allowed in BigQuery, applying uniqueness tests to these columns as-is throws an error. The columns need to be aliased in order for test results to be stored in `dbt__test_audit` tables.
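As a rough sketch of the failure mode (the table name is hypothetical, and this is a simplification of what a uniqueness test compiles to, not the exact compiled SQL):

```sql
-- Selecting the pseudo column as-is means the persisted audit table
-- would need a column literally named _PARTITIONTIME, which BigQuery
-- rejects because column names cannot begin with an underscore.
SELECT
    _PARTITIONTIME,                        -- fails when persisted
    COUNT(*) AS n_records
FROM `project.dataset.my_source_table`     -- hypothetical source table
GROUP BY _PARTITIONTIME
HAVING COUNT(*) > 1;

-- Aliasing the pseudo column gives the audit table a legal column name:
SELECT
    _PARTITIONTIME AS partition_time,      -- legal column name
    COUNT(*) AS n_records
FROM `project.dataset.my_source_table`
GROUP BY partition_time
HAVING COUNT(*) > 1;
```

The second query is what the proposed aliasing would effectively produce when a test runs against the source.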
Describe alternatives you've considered
I've considered:
- Writing a macro within my organization's dbt project to perform this aliasing. However, because the aliasing would be used in tests defined in dbt-core and external packages, this doesn't feel like the best solution. In addition, being able to test on pseudo columns will likely benefit other dbt-bigquery users.
- Aliasing pseudo columns in base models and materializing them there. Tests can then be run on the base model instead of on the source layer. The issue with that solution is we'd essentially be replicating data simply to add a uniqueness test.
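The second alternative, sketched as a hypothetical base model (source and column names are illustrative only):

```sql
-- models/staging/stg_raw_events.sql (hypothetical base model)
-- Materializing this as a table copies the source data just so the
-- pseudo columns exist under testable, legally-named aliases.
SELECT
    *,
    _PARTITIONTIME AS partition_time,   -- alias the pseudo columns
    _FILE_NAME     AS source_file_name
FROM {{ source('gcs_external', 'raw_events') }}
```

This works, but as noted above it duplicates data purely to make the columns testable.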
Who will this benefit?
All dbt-bigquery users that use external data sources and want uniqueness testing on their source layer.
Are you interested in contributing this feature?
Yes. I will be submitting a PR for a potential solution!
Anything else?
No response