[CT-1415] [Feature] Add adapter-specific aliasing for Google pseudo columns #365

ashleemtib · 2022-10-26T20:31:33Z

Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Background:
Google has a concept of metadata "pseudo columns" which are columns that can be queried in BigQuery, but do not materialize in the table itself.

Some examples of these columns are _PARTITIONDATE, _PARTITIONTIME, _FILE_NAME, _TABLE_SUFFIX...etc, which always start with an underscore. _FILE_NAME for example, will be the name of the file being read in from Google Cloud Storage when an external data source is configured.

Issue:
Because these pseudo columns start with underscore, applying dbt tests to them is difficult.

For example, dbt_utils.unique_combination_of_columns and the default dbt uniqueness test select columns as is and create the dbt__test_audit table schema based on the selected columns.

Since columns with leading underscores are not allowed in BigQuery, applying uniqueness tests to these columns as is throws an error. The columns need to be aliased in order for test results to be stored in dbt__test_audit tables.

Describe alternatives you've considered

I've considered:

Writing a macro within my organizations dbt project to perform this aliasing, however because the aliasing would be used in tests defined in dbt-core and external packages, this doesn't feel like the best solution. In addition, being able to test on pseudo columns will likely benefit other dbt-bigquery users.
Aliasing pseudo columns in base models and materializing them there. Tests can then be run on the base model instead of on the source layer. The issue with that solution is we'd essentially be replicating data simply to add a uniqueness test.

Who will this benefit?

All dbt-bigquery users that use external data sources and want uniqueness testing on their source layer.

Are you interested in contributing this feature?

Yes. I will be submitting a PR for a potential solution!

Anything else?

No response

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2022-10-31T03:09:46Z

@ashleemtib Thanks for identifying the problem, offering a clear write-up, and even opening a PR with some proposed code to get to a solution!

I have a few questions about the user experience we should be aiming for. I'm also not sure that we'll be able to handle this pseudo-column aliasing behind-the-scenes in some cases. As a general principle, I'd rather opt for more-explicit, less-ergonomic code, over an abstraction that risks leaking.

As far as the second alternative you've proposed:

Aliasing pseudo columns in base models and materializing them there. Tests can then be run on the base model instead of on the source layer. The issue with that solution is we'd essentially be replicating data simply to add a uniqueness test.

Does this work if the base model is materialized as a view (or ephemeral)? I get that it adds an extra step into the modeling layer, but it at least avoids the need to store the data twice.

Fleid · 2023-01-28T00:35:43Z

Sorry I'm late to the party here ;)

I need to make sure I understand how these columns are used, can you please confirm I got this right?

Let's say I want to materialize this model:

SELECT
  Column1,
  Column2,
  _PARTITIONDATE,
  _PARTITIONTIME
FROM {{ref('upstream_model_A')}}

The thing is if I define tests on these columns, some tests will break. The reasons are:

these tests will generate audit tables, re-using the column names of the model (Column1, Column2, _PARTITIONDATE and _PARTITIONTIME)
_PARTITIONDATE and _PARTITIONTIME are reserved names that can't be used as user defined column names

My naive take would be to manually update the model to:

SELECT
  Column1,
  Column2,
  _PARTITIONDATE AS my_model_PARTITIONDATE,
  _PARTITIONTIME AS my_model__PARTITIONTIME
FROM {{ref('upstream_model_A')}}

This is your second alternative:

Aliasing pseudo columns in base models and materializing them there. Tests can then be run on the base model instead of on the source layer. The issue with that solution is we'd essentially be replicating data simply to add a uniqueness test.

Which would work, but is not possible for sources (since there is no SQL file). Then what I would really want here is an identifier property at the column level, to alias the column at the YAML level. @jtcohen6 is that something that was ever considered?

sources:
  - name: <source_name>
    tables:
    - name: <table_name>
      columns:
        - name: <column_name>
          description: <markdown_string>
          data_type: <string>
          alias/identifier: <string>
        - name: <another_column>
          ...

That's straightforward at the source level, but what does that mean if the model is a table. Are we overriding the name in the SQL query there?

The workaround here being to create a model on top of it, like a view, to hosts the tests. But that's more code and more objects, so not the best.

ashleemtib · 2023-02-07T14:06:04Z

@Fleid your understanding is correct! It makes a lot of sense to use your "naive" take for aliasing the column in the model itself so this error does not occur, and that is what we are doing in our dbt project currently; however, this issue pertains specifically to testing on dbt sources, not models. I don't think I explicitly stated that, so my bad!

That being said, since we can't alias the column names in the sources, we'd need a way to do that. I like the idea of adding identifier to the yml to be used by the test, and I think it's the only solution we need in this case unless you all think it would be better to expand this to models as well. Thank you for your response!

Fleid · 2023-02-09T21:37:51Z

I like that idea, I'm not sure how feasible it is but I like it.

After all, "sources make it possible to name and describe the data loaded into your warehouse by your Extract and Load tools" (from here). Let's name and describe columns and tables.

I'll create a new issue over to dbt-core, so @jtcohen6 can't ignore it, and close this one :)

Fleid · 2023-02-09T21:56:00Z

There : dbt-labs/dbt-core#6929

ashleemtib added type:enhancement New feature or request triage:product labels Oct 26, 2022

github-actions bot changed the title ~~[Feature] Add adapter-specific aliasing for Google pseudo columns~~ [CT-1415] [Feature] Add adapter-specific aliasing for Google pseudo columns Oct 26, 2022

ashleemtib mentioned this issue Oct 26, 2022

Add aliasing method for pseudo columns #366

Closed

6 tasks

jtcohen6 added help_wanted Extra attention is needed and removed triage:product labels Oct 31, 2022

Fleid added the triage:awaiting-response label Jan 30, 2023

Fleid closed this as completed Feb 9, 2023

Fleid mentioned this issue Feb 9, 2023

[CT-2075] [Feature] Add column aliasing (identifiers) for source tables dbt-labs/dbt-core#6929

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-1415] [Feature] Add adapter-specific aliasing for Google pseudo columns #365

[CT-1415] [Feature] Add adapter-specific aliasing for Google pseudo columns #365

ashleemtib commented Oct 26, 2022 •

edited

Loading

jtcohen6 commented Oct 31, 2022

Fleid commented Jan 28, 2023

ashleemtib commented Feb 7, 2023

Fleid commented Feb 9, 2023

Fleid commented Feb 9, 2023

[CT-1415] [Feature] Add adapter-specific aliasing for Google pseudo columns #365

[CT-1415] [Feature] Add adapter-specific aliasing for Google pseudo columns #365

Comments

ashleemtib commented Oct 26, 2022 • edited Loading

Is this your first time submitting a feature request?

Describe the feature

Describe alternatives you've considered

Who will this benefit?

Are you interested in contributing this feature?

Anything else?

jtcohen6 commented Oct 31, 2022

Fleid commented Jan 28, 2023

ashleemtib commented Feb 7, 2023

Fleid commented Feb 9, 2023

Fleid commented Feb 9, 2023

ashleemtib commented Oct 26, 2022 •

edited

Loading