Incremental update #798

Open
dmpetrov opened this issue Jan 7, 2025 · 0 comments
Labels: enhancement (New feature or request)

dmpetrov commented Jan 7, 2025

Description

Users can already perform an incremental update manually, for example:

from datachain import DataChain, File

def my_embedding(file: File) -> list[float]:
    return [...]

dc = DataChain.from_storage("s3://bkt/dir1/*.jpg")
# Create the 1st version
dc = dc.map(emd=my_embedding).save("image_emb")

...
# Update: re-list the storage and compute embeddings only for the files that differ

new = DataChain.from_storage("s3://bkt/dir1/*.jpg")
old = DataChain.from_dataset("image_emb")
diff = new.diff(old).map(emd=my_embedding)

# Create the 2nd version
res = old.union(diff).save("image_emb")
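
To make the diff-then-union pattern above concrete, here is a minimal sketch in plain Python that does not use DataChain at all. The `Listing` and `Dataset` types, the `incremental_update` helper, and the use of an etag-like marker to detect changed files are all assumptions made for illustration. The sketch also recomputes changed files and drops deleted ones, which goes beyond what the snippet above specifies.

```python
from typing import Callable

# Hypothetical stand-ins for illustration only:
# a listing maps file path -> version marker (for example an etag),
# a dataset version maps file path -> (version marker, embedding).
Listing = dict[str, str]
Dataset = dict[str, tuple[str, list[float]]]


def incremental_update(
    listing: Listing,
    old: Dataset,
    embed: Callable[[str], list[float]],
) -> Dataset:
    """Build a new dataset version, embedding only new or changed files."""
    # "diff": files missing from the old version or whose marker changed
    changed = [
        path for path, marker in listing.items()
        if path not in old or old[path][0] != marker
    ]
    # Carry over rows that are still present and unchanged (assumption:
    # files deleted from storage are dropped from the new version).
    new = {
        path: row for path, row in old.items()
        if path in listing and row[0] == listing[path]
    }
    # "union": add the freshly computed rows
    for path in changed:
        new[path] = (listing[path], embed(path))
    return new


# Usage: record which files the (fake) embedding function is called on.
calls: list[str] = []

def fake_embedding(path: str) -> list[float]:
    calls.append(path)
    return [float(len(path))]

v1 = incremental_update({"a.jpg": "e1", "b.jpg": "e1"}, {}, fake_embedding)
v2 = incremental_update(
    {"a.jpg": "e1", "b.jpg": "e2", "c.jpg": "e1"}, v1, fake_embedding
)
```

In the second call only `b.jpg` (changed marker) and `c.jpg` (new file) are embedded again; `a.jpg` is carried over from the first version.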

It would be great if this were supported out of the box. Users could then update datasets directly from the UI. A possible API:

def my_embedding(file: File) -> list[float]:
    return [...]

# Create the 1st version (proposed API, not implemented yet)
dc = DataChain.incremental_dataset("s3://bkt/dir1/*.jpg", my_embedding, "image_emb")
...
# Update to the 2nd version
dc = dc.update()
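
One way to read the proposal is that the returned object has to remember the source, the function, and the dataset name so that `update()` needs no arguments. The sketch below illustrates that idea with an in-memory class. `IncrementalDataset`, its fields, and the `list_source` callable are hypothetical names and are not part of DataChain. For brevity it detects only newly added paths.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IncrementalDataset:
    """Hypothetical sketch: bundles everything update() needs to re-run."""
    list_source: Callable[[], list[str]]    # returns the current file paths
    func: Callable[[str], list[float]]      # user function, e.g. an embedding
    name: str                               # dataset name the versions belong to
    versions: list[dict[str, list[float]]] = field(default_factory=list)

    def update(self) -> "IncrementalDataset":
        # Start from the latest saved version, or from nothing on the first run.
        previous = self.versions[-1] if self.versions else {}
        current: dict[str, list[float]] = {}
        for path in self.list_source():
            # Reuse stored results; call the user function only for new paths.
            current[path] = previous[path] if path in previous else self.func(path)
        self.versions.append(current)
        return self


# Usage: the first update() creates version 1, the next one creates version 2.
files = ["a.jpg"]
ds = IncrementalDataset(
    lambda: list(files), lambda p: [float(len(p))], "image_emb"
).update()
files.append("bb.jpg")
ds = ds.update()
```

Because the object holds a reference to the user function, a persisted version of it would also need to store that code and its environment.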

Challenges:

  • We need to preserve the environment because the update re-runs custom Python code (my_embedding()). So "Inline project meta" #776 might be a prerequisite.
  • Not every dataset is updatable. Do we need a special dataset type for updatable ones (hopefully not)?
  • It might be better to generalize this story to updatable pipelines and solve that more general case.
dmpetrov added the enhancement (New feature or request) label on Jan 7, 2025
ilongin self-assigned this on Jan 7, 2025