Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1415] [Feature] Add adapter-specific aliasing for Google pseudo columns #365

Closed
3 tasks done
ashleemtib opened this issue Oct 26, 2022 · 5 comments
Closed
3 tasks done
Labels
help_wanted Extra attention is needed triage:awaiting-response type:enhancement New feature or request

Comments

@ashleemtib
Copy link

ashleemtib commented Oct 26, 2022

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Background:
Google has a concept of metadata "pseudo columns" which are columns that can be queried in BigQuery, but do not materialize in the table itself.

Some examples of these columns are _PARTITIONDATE, _PARTITIONTIME, _FILE_NAME, _TABLE_SUFFIX...etc, which always start with an underscore. _FILE_NAME for example, will be the name of the file being read in from Google Cloud Storage when an external data source is configured.

Issue:
Because these pseudo columns start with underscore, applying dbt tests to them is difficult.

For example, dbt_utils.unique_combination_of_columns and the default dbt uniqueness test select columns as is and create the dbt__test_audit table schema based on the selected columns.

Since columns with leading underscores are not allowed in BigQuery, applying uniqueness tests to these columns as is throws an error. The columns need to be aliased in order for test results to be stored in dbt__test_audit tables.

Describe alternatives you've considered

I've considered:

  • Writing a macro within my organizations dbt project to perform this aliasing, however because the aliasing would be used in tests defined in dbt-core and external packages, this doesn't feel like the best solution. In addition, being able to test on pseudo columns will likely benefit other dbt-bigquery users.

  • Aliasing pseudo columns in base models and materializing them there. Tests can then be run on the base model instead of on the source layer. The issue with that solution is we'd essentially be replicating data simply to add a uniqueness test.

Who will this benefit?

All dbt-bigquery users that use external data sources and want uniqueness testing on their source layer.

Are you interested in contributing this feature?

Yes. I will be submitting a PR for a potential solution!

Anything else?

No response

@github-actions github-actions bot changed the title [Feature] Add adapter-specific aliasing for Google pseudo columns [CT-1415] [Feature] Add adapter-specific aliasing for Google pseudo columns Oct 26, 2022
@jtcohen6
Copy link
Contributor

@ashleemtib Thanks for identifying the problem, offering a clear write-up, and even opening a PR with some proposed code to get to a solution!

I have a few questions about the user experience we should be aiming for. I'm also not sure that we'll be able to handle this pseudo-column aliasing behind-the-scenes in some cases. As a general principle, I'd rather opt for more-explicit, less-ergonomic code, over an abstraction that risks leaking.

As far as the second alternative you've proposed:

Aliasing pseudo columns in base models and materializing them there. Tests can then be run on the base model instead of on the source layer. The issue with that solution is we'd essentially be replicating data simply to add a uniqueness test.

Does this work if the base model is materialized as a view (or ephemeral)? I get that it adds an extra step into the modeling layer, but it at least avoids the need to store the data twice.

@jtcohen6 jtcohen6 added help_wanted Extra attention is needed and removed triage:product labels Oct 31, 2022
@Fleid
Copy link
Contributor

Fleid commented Jan 28, 2023

Sorry I'm late to the party here ;)

I need to make sure I understand how these columns are used, can you please confirm I got this right?

Let's say I want to materialize this model:

SELECT
  Column1,
  Column2,
  _PARTITIONDATE,
  _PARTITIONTIME
FROM {{ref('upstream_model_A')}}

The thing is if I define tests on these columns, some tests will break. The reasons are:

  • these tests will generate audit tables, re-using the column names of the model (Column1, Column2, _PARTITIONDATE and _PARTITIONTIME)
  • _PARTITIONDATE and _PARTITIONTIME are reserved names that can't be used as user defined column names

My naive take would be to manually update the model to:

SELECT
  Column1,
  Column2,
  _PARTITIONDATE AS my_model_PARTITIONDATE,
  _PARTITIONTIME AS my_model__PARTITIONTIME
FROM {{ref('upstream_model_A')}}

This is your second alternative:

Aliasing pseudo columns in base models and materializing them there. Tests can then be run on the base model instead of on the source layer. The issue with that solution is we'd essentially be replicating data simply to add a uniqueness test.

Which would work, but is not possible for sources (since there is no SQL file). Then what I would really want here is an identifier property at the column level, to alias the column at the YAML level. @jtcohen6 is that something that was ever considered?

sources:
  - name: <source_name>
    tables:
    - name: <table_name>
      columns:
        - name: <column_name>
          description: <markdown_string>
          data_type: <string>
          alias/identifier: <string>
        - name: <another_column>
          ...

That's straightforward at the source level, but what does that mean if the model is a table. Are we overriding the name in the SQL query there?

The workaround here being to create a model on top of it, like a view, to hosts the tests. But that's more code and more objects, so not the best.

@ashleemtib
Copy link
Author

@Fleid your understanding is correct! It makes a lot of sense to use your "naive" take for aliasing the column in the model itself so this error does not occur, and that is what we are doing in our dbt project currently; however, this issue pertains specifically to testing on dbt sources, not models. I don't think I explicitly stated that, so my bad!

That being said, since we can't alias the column names in the sources, we'd need a way to do that. I like the idea of adding identifier to the yml to be used by the test, and I think it's the only solution we need in this case unless you all think it would be better to expand this to models as well. Thank you for your response!

@Fleid
Copy link
Contributor

Fleid commented Feb 9, 2023

I like that idea, I'm not sure how feasible it is but I like it.

After all, "sources make it possible to name and describe the data loaded into your warehouse by your Extract and Load tools" (from here). Let's name and describe columns and tables.

I'll create a new issue over to dbt-core, so @jtcohen6 can't ignore it, and close this one :)

@Fleid
Copy link
Contributor

Fleid commented Feb 9, 2023

There : dbt-labs/dbt-core#6929

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help_wanted Extra attention is needed triage:awaiting-response type:enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants