Tracking file provenance #3712

Open
wants to merge 68 commits into
base: master
Changes from 1 commit
Commits
30f54ca
initial version of dynamic file list classes
astro-friedel May 13, 2024
69d8f02
integrated dynamic file into output file handling
astro-friedel May 21, 2024
882e3ba
data flow kernel changes to accommodate dynamic file lists
astro-friedel Jun 7, 2024
ce369aa
Merge remote-tracking branch 'upstream/master' into fixing_dynamic_fi…
astro-friedel Jun 7, 2024
7138adc
Auto stash before checking out "HEAD"
astro-friedel Jun 7, 2024
5bff70f
creation of file tale in the monitoring
astro-friedel Jun 7, 2024
6025691
added initial file provenance data in database
astro-friedel Jun 14, 2024
efc3b14
fixed error where uuid's were not strings
astro-friedel Jun 17, 2024
222166a
fixed typos in names
astro-friedel Jun 17, 2024
92597f6
initial working version
astro-friedel Jun 18, 2024
8b922d9
Merge branch 'fixing_dynamic_file_inputs_and_outputs' into trackingFi…
astro-friedel Jun 27, 2024
632890b
added flask-wtf to monitoring requirements for form processing
astro-friedel Jun 27, 2024
17e5c43
added file size and md5sum tracking for files
astro-friedel Jun 27, 2024
d8df5fe
fixed issue with clean_copy in dynamic files
astro-friedel Jun 27, 2024
b16cad6
added initial provenance interface to flask pages
astro-friedel Jun 27, 2024
0275b28
indentation fix
astro-friedel Jul 1, 2024
3a1238b
fixed database code for provenance tracking
astro-friedel Jul 1, 2024
bb013fe
added environment tracking to monitoring
astro-friedel Jul 9, 2024
bc8247a
Merge remote-tracking branch 'upstream/master' into trackingFileProve…
astro-friedel Jul 31, 2024
45af5f9
added file provenance tracking as an option to monitoring framework
astro-friedel Jul 31, 2024
cd99828
better reporting on environment
astro-friedel Jul 31, 2024
558d170
ensure that files are tagged with the task id that generated them, no…
astro-friedel Jul 31, 2024
05caec8
get the task reporting the environment correctly
astro-friedel Jul 31, 2024
8f212ba
only provide file link if files were actually used in the workflow
astro-friedel Jul 31, 2024
3ade95a
only provide file link if there were files
astro-friedel Jul 31, 2024
7501cc3
properly report environment with file details
astro-friedel Jul 31, 2024
66238e5
properly format and report files
astro-friedel Jul 31, 2024
00ffa6f
make header responsive to url
astro-friedel Jul 31, 2024
da73f91
fix bug in file size reporting
astro-friedel Jul 31, 2024
76b8008
documentation on file provenance
astro-friedel Jul 31, 2024
93b17b0
fix bug in format
astro-friedel Jul 31, 2024
1e004a6
get the correct timestamp for the file
astro-friedel Sep 17, 2024
8dde82c
remove unneeded prints
astro-friedel Sep 17, 2024
cb550ee
auto determine file size, md5sum, timestamp if possible
astro-friedel Sep 17, 2024
5ebd009
refactor variable
astro-friedel Sep 17, 2024
baf2332
make sure dfk is propagated from dynamic file list to children
astro-friedel Sep 17, 2024
79211bc
documentation and annotation cleanup
astro-friedel Sep 17, 2024
825842f
cleanup
astro-friedel Sep 17, 2024
117e66d
Merge remote-tracking branch 'upstream/master' into trackingFileProve…
astro-friedel Sep 17, 2024
8c9a2a0
backed out DynamicFile stuff so that this branch is pure file tracking
astro-friedel Nov 12, 2024
eff8ab6
Merge branch 'master' into trackingFileProvenance
astro-friedel Nov 12, 2024
5ca48cf
Merge branch 'master' into trackingFileProvenance
astro-friedel Nov 27, 2024
9a05b2c
reorganized to group similar codes together
astro-friedel Nov 27, 2024
14aac2b
fixed message format
astro-friedel Nov 27, 2024
585fd03
fixed some typos
astro-friedel Nov 27, 2024
19f7747
updates to include misc info table
astro-friedel Nov 27, 2024
27f6391
updated docs
astro-friedel Nov 27, 2024
97ade30
fixed bug for remote files
astro-friedel Nov 27, 2024
33be080
test for provenance framework
astro-friedel Nov 27, 2024
07c2e45
flake8 fixes
astro-friedel Nov 27, 2024
97108e1
fixed missing line in docs
astro-friedel Nov 27, 2024
a837f08
removed extraneous ignores
astro-friedel Dec 3, 2024
6bef04f
reverted removal of trailing white spaces
astro-friedel Dec 3, 2024
5057d19
fixes per review comments
astro-friedel Dec 3, 2024
89d5e0a
ensure that md5sum is only calculated when file provenance tracking i…
astro-friedel Dec 3, 2024
c653cbc
fixes based on review comments
astro-friedel Dec 3, 2024
7efebad
added dfk as a required parameter to DataFuture
astro-friedel Dec 3, 2024
d6e7e5b
make sure file md5sum is only calculated
astro-friedel Dec 3, 2024
1fcdbc6
added full path and parsing for path for file database entries
astro-friedel Dec 3, 2024
b443cbb
fixed typos and tests
astro-friedel Dec 3, 2024
69cfc7b
put back required SECRET_KEY so that the file search form works
astro-friedel Dec 3, 2024
0316cf9
isort fixes
astro-friedel Dec 3, 2024
af51f0e
Merge branch 'Parsl:master' into trackingFileProvenance
astro-friedel Dec 3, 2024
9ed699d
removed unneeded import
astro-friedel Dec 3, 2024
ce609cc
mypy fixes
astro-friedel Dec 3, 2024
d646aaa
Merge remote-tracking branch 'upstream/master'
astro-friedel Dec 10, 2024
53f323d
fixed incorrect variable name
astro-friedel Dec 10, 2024
9444f42
Merge branch 'master' into trackingFileProvenance
astro-friedel Dec 10, 2024
added dfk as a required parameter to DataFuture
astro-friedel committed Dec 3, 2024
commit 7efebaddd4b85285cfadc3d52721c1e58af17bd7
8 changes: 5 additions & 3 deletions parsl/app/futures.py
@@ -5,12 +5,15 @@
 import logging
 from os import stat
 from concurrent.futures import Future
-from typing import Optional, Any
+from typing import TYPE_CHECKING, Optional
 from datetime import datetime, timezone
 import typeguard

 from parsl.data_provider.files import File

+if TYPE_CHECKING:
+    from parsl.dataflow.dflow import DataFlowKernel
+
 logger = logging.getLogger(__name__)


@@ -50,8 +53,7 @@ def parent_callback(self, parent_fu):
             self.data_flow_kernel.register_as_output(self.file_obj, self.app_fut.task_record)

     @typeguard.typechecked
-    def __init__(self, fut: Future, file_obj: File, tid: Optional[int] = None, app_fut: Optional[Future] = None,
-                 dfk: Optional[Any] = None) -> None:
+    def __init__(self, fut: Future, file_obj: File, dfk: "DataFlowKernel", tid: Optional[int] = None, app_fut: Optional[Future] = None) -> None:
         """Construct the DataFuture object.

         If the file_obj is a string convert to a File.
2 changes: 1 addition & 1 deletion parsl/data_provider/data_manager.py
@@ -63,7 +63,7 @@ def optionally_stage_in(self, input, func, executor):
             # replace the input DataFuture with a new DataFuture which will complete at
             # the same time as the original one, but will contain the newly
             # copied file
-            input = DataFuture(input, file, tid=input.tid)
+            input = DataFuture(input, file, dfk=self.dfk, tid=input.tid)
         elif isinstance(input, File):
             file = input.cleancopy()
             input = file
6 changes: 3 additions & 3 deletions parsl/dataflow/dflow.py
@@ -956,10 +956,10 @@ def stageout_one_file(file: File, rewritable_func: Callable):
                 stageout_fut = self.data_manager.stage_out(f_copy, executor, app_fut)
                 if stageout_fut:
                     logger.debug("Adding a dependency on stageout future for {}".format(repr(file)))
-                    df = DataFuture(stageout_fut, file, tid=app_fut.tid, app_fut=app_fut, dfk=self)
+                    df = DataFuture(stageout_fut, file, dfk=self, tid=app_fut.tid, app_fut=app_fut)
                 else:
                     logger.debug("No stageout dependency for {}".format(repr(file)))
-                    df = DataFuture(app_fut, file, tid=app_fut.tid, app_fut=app_fut, dfk=self)
+                    df = DataFuture(app_fut, file, dfk=self, tid=app_fut.tid, app_fut=app_fut)

                 # this is a hook for post-task stageout
                 # note that nothing depends on the output - which is maybe a bug
@@ -968,7 +968,7 @@ def stageout_one_file(file: File, rewritable_func: Callable):
                 return rewritable_func, f_copy, df
             else:
                 logger.debug("Not performing output staging for: {}".format(repr(file)))
-                return rewritable_func, file, DataFuture(app_fut, file, tid=app_fut.tid, app_fut=app_fut, dfk=self)
+                return rewritable_func, file, DataFuture(app_fut, file, dfk=self, tid=app_fut.tid, app_fut=app_fut)

         for idx, file in enumerate(outputs):
             func, outputs[idx], o = stageout_one_file(file, func)
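
For readers following the signature change in parsl/app/futures.py, here is a minimal standalone sketch of the pattern the commit appears to use: a `TYPE_CHECKING`-guarded import combined with a quoted annotation, and `dfk` moved ahead of the optional parameters so that it has no default. It does not import Parsl. `SketchDataFuture`, `SketchKernel`, `hypothetical_kernel_module`, and `missing_dfk_is_rejected` are illustrative names, not part of this PR, and the `typeguard` decorator from the real code is left out.

```python
from concurrent.futures import Future
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Hypothetical module path standing in for parsl.dataflow.dflow.
    # This import is only seen by static type checkers; it never runs,
    # so it cannot create an import cycle at runtime.
    from hypothetical_kernel_module import SketchKernel


class SketchDataFuture(Future):
    """Stand-in for DataFuture: `dfk` has no default, so callers must pass it."""

    def __init__(self, fut: Future, file_obj: str, dfk: "SketchKernel",
                 tid: Optional[int] = None,
                 app_fut: Optional[Future] = None) -> None:
        super().__init__()
        self.parent = fut
        self.file_obj = file_obj
        self.data_flow_kernel = dfk  # always set, no None check needed later
        self.tid = tid
        self.app_fut = app_fut if app_fut is not None else fut


class SketchKernel:
    """Hypothetical minimal kernel; the quoted annotation above lets it be
    defined (or imported) after the class that refers to it."""


def missing_dfk_is_rejected() -> bool:
    """Return True if omitting `dfk` fails at call time with TypeError."""
    try:
        SketchDataFuture(Future(), "out.txt", tid=1)  # dfk left out on purpose
    except TypeError:
        return True
    return False


if __name__ == "__main__":
    df = SketchDataFuture(Future(), "out.txt", dfk=SketchKernel(), tid=3)
    print(type(df.data_flow_kernel).__name__, df.tid)
    print(missing_dfk_is_rejected())
```

With this shape, a call site that forgets the kernel fails immediately at construction rather than later when the callback dereferences a `None` kernel, which is presumably why every `DataFuture(...)` call in data_manager.py and dflow.py is updated in the same commit.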