Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Allow snowflake__get_relation_last_modified to references models or sources #1054

Closed
3 tasks done
sp-tkerlavage opened this issue May 19, 2024 · 6 comments
Closed
3 tasks done
Labels

Comments

@sp-tkerlavage
Copy link

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-snowflake functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Not sure if this is a bug or expected behavior, so suggesting this as a feature.

Not so important context: I have an edge case where I need to track the freshness of external tables. I think the modified_at in the snowflake information_schema for external tables is not reliable. It tells you when the external table itself was last modified by a refresh of DDL. This is often not the same as when the last file actually came into the external source. So I have a custom model that I use to track the freshness of the external table based on the actual files in the S3 bucket.

Currently if you override the snowflake__get_relation_last_modified macro with a custom implementation that relies on a {{source()}} you get the following warning:

02:36:09  Pulling freshness from warehouse metadata tables for 395 sources
02:36:10  Metadata freshness could not be computed in batch: Runtime Error
  'MacroManifest' object has no attribute 'resolve_source'

You can easily reproduce this by just tweaking the current implementation of the macro to use a source

sources:
  - name: information_schema
    database: "{{ target.database.lower() }}"
    schema: information_schema
    tables:
      - name: columns
      - name: tables
{% macro snowflake__get_relation_last_modified(information_schema, relations) -%}

  {%- call statement('last_modified', fetch_result=True) -%}
        select table_schema as schema,
               table_name as identifier,
               last_altered as last_modified,
               {{ current_timestamp() }} as snapshotted_at
        from {{ source('information_schema', 'tables') }}
        where (
          {%- for relation in relations -%}
            (upper(table_schema) = upper('{{ relation.schema }}') and
             upper(table_name) = upper('{{ relation.identifier }}')){%- if not loop.last %} or {% endif -%}
          {%- endfor -%}
        )
  {%- endcall -%}

  {{ return(load_result('last_modified')) }}

{% endmacro %}

You will get a similar error if you use a ref as well:

'MacroManifest' object has no attribute 'resolve_ref'

Describe alternatives you've considered

A simple workaround is to just hardcode the name of the table that I want to use as the base of my freshness test. This works fine.

I only discovered this issue when I checked if batch freshness would work and found that it did not because the snowflake__get_relation_last_modified macro I had was referencing a model using ref()

Who will this benefit?

Anyone that overrides the snowflake__get_relation_last_modified macro to drive a custom implementation of freshness tests.

Are you interested in contributing this feature?

No response

Anything else?

No response

@sp-tkerlavage
Copy link
Author

sp-tkerlavage commented May 19, 2024

Just to add context to this--in the case where we might want to ref() a model for freshness results, I dont really expect dbt to know it should run the model that I'm using to track freshness before the freshness tests. We just have a step in our process that executes the model first, and then our freshness tests are executed in a subsequent step.

@dataders
Copy link
Contributor

Not so important context: I have an edge case where I need to track the freshness of external tables. I think the modified_at in the snowflake information_schema for external tables is not reliable. It tells you when the external table itself was last modified by a refresh of DDL. This is often not the same as when the last file actually came into the external source. So I have a custom model that I use to track the freshness of the external table based on the actual files in the S3 bucket.

I'd say this is quite important context! The reason being that, today, we don't support freshness checks on External Tables. This might be something we support in the future, especially given that we're beginning the design work to make External Tables first-class citizens in dbt (dbt-labs/dbt-adapters#92).

If you have a way to implement freshness checks on External Tables, I'd love to see it! Perhaps it could be written up as an enhancement issue for the dbt-external-tables package.

I have to say, we did not anticipate users wanting to override the get_relation_last_modified macro. The error message you're seeing is a result of that. The TL;DR is that the source freshness command doesn't have the ability to resolve ref and source macros like a dbt run would. So you were right to open this up as an enhancement.

@dataders
Copy link
Contributor

dataders commented May 23, 2024

@sp-tkerlavage I'm also now seeing your comments on #785: #785 (comment) & #785 (comment)

Given that context, I think it's worth saying that your end-goal is that you'd like the following:

dbt-snowflake should support metadata freshness of Snowflake external tables. The implementation would involve parsing the external tables's stage and location fields to SELECTing the corresponding last_modified metadata from the DIRECTORY({ stage })

I see two ways that dbt could this scenario:

  1. [Feature] metadata-based freshness checks for external tables #1061 dbt-snowflake supports this out-of-the box
  2. [CT-3194] [Feature] add another way to calculate source freshness - via a SQL query dbt-core#8797: i.e. dbt supports the ability for users to define custom freshness via a provided SQL query

The issue that you've highlighted here is that you encountered an error while attempting to implement (1) yourself by overriding the snowflake__get_relation_last_modified(). I commend you @sp-tkerlavage for trying to hack out support for yourself and I'm sorry you encountered an error.

I think the real next step is that we have the discussion about (1) in #1061 (that I just created). Accordingly, I'm going to close this issue in favor of that one.

In that issue, we can spell out the exact use case and gauge the priority of supporting the feature request. Perhaps during this discovery work we discover an alternative workaround to get you what you need.

@sp-tkerlavage
Copy link
Author

sp-tkerlavage commented May 23, 2024

Edit looks like we commented right around the same time! Can most likely ignore the below as its essentially just restated.

Yes I was able to get freshness tests working for external tables in my project. I use Snowflake's DIRECTORY tables which is an optional object on a Snowflake External Stage. Looks like I outlined the details here:
dbt-labs/dbt-core#8797 (comment)

The query to determine freshness would likely be different for everyones implementation. In my case for simplicity I have a single main external stage that has the directory enabled. The bucket itself is structured like: source/table/yyyy/mm/dd/hh. So I'm parsing the path to determine what the freshness for each table is (not just the time of the last file across all tables). But it would require a different query depending on the layout of the bucket, and/or if you utilize multiple external stages/directories. Can give more details about my implementation if needed but I think thats the basic gist of it.

@dataders
Copy link
Contributor

@sp-tkerlavage thanks for all the context!

I'm going to close this issue in support of #1061, which is the actual desired end state. But everything you've shared here will be helpful if we decide to implement.

@dataders dataders closed this as not planned Won't fix, can't repro, duplicate, stale May 28, 2024
@sp-tkerlavage
Copy link
Author

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants