
Add Falcon data provider #2955

Open: wants to merge 33 commits into master

Conversation

KiwiFac3 (Contributor)

Description

Integrates Falcon as a Parsl data provider.

Changed Behaviour

None of the current behaviours should change

Type of change


  • New feature

try:
    subprocess.run(sender_command, check=True)
except subprocess.CalledProcessError as e:
    logger.debug(f"Command failed with exit code {e.returncode}: {e.stderr}")
Collaborator

probably should re-raise this exception so that it propagates into the rest of parsl? (so that a failed transfer then causes a task failure)
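The fix the reviewer suggests could look something like this sketch (`sender_command`, `logger`, and the surrounding function are stand-ins for the PR's actual code, not taken from it):

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

def run_transfer(command):
    """Run an external transfer command; log any failure, then re-raise it.

    Re-raising lets the exception propagate into Parsl so a failed
    transfer surfaces as a task failure instead of being silently logged.
    """
    try:
        subprocess.run(command, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        logger.debug(f"Command failed with exit code {e.returncode}: {e.stderr}")
        raise  # propagate so Parsl marks the dependent task as failed
```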

Contributor Author

This has been addressed now.

"host": self.host_ip,
"port": directory.query,
"directory_path": directory.path
}
Collaborator

I'm a bit confused about this block and the below sender block and how it is using the user supplied file URL, especially to do with hostnames and port numbers - mostly because I don't know much about Falcon.

I think this sends a command to remote host directory.netloc port 5555, telling it to deliver the directory directory.path that exists on that remote host; and then in a few lines time, it starts a receiver listening on local IP address host_ip port directory.query to receive the data and put it in working_dir.

Is that right?

Contributor Author

The following code needs to be running on the source host (https://github.com/KiwiFac3/Datamover2Parsl/blob/main/Sender/sender.py), so the communication is between the part you highlighted and that code, in order to send the correct files to the destination host. If it is still unclear, please let me know.

Collaborator

ok, so what I'm a bit confused about, specifically:

if you are specifying the local host port as part of the provider initializer, why doesn't the local port number get specified in the same way as part of the constructor? A URL should name where the data is remotely but a falcon: URL seems to name a local port too (in directory.query) which seems out of place.

There's a URL mechanism for specifying a port, as part of directory.netloc, parsed out as:

>>> o = urlparse('http://docs.python.org:80/3/library/urllib.parse.html')
>>> o.netloc
'docs.python.org:80'
>>> o.hostname
'docs.python.org'
>>> o.port
80

https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse

Naively I'd expect a port specified that way to override the hard coded 5555 port number of the remote end - (although right now it would break things, I think)
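For illustration, here is how a hypothetical falcon: URL carrying both ports would parse with urllib (the URL and port values below are made up for this sketch):

```python
from urllib.parse import urlparse

# Hypothetical URL: remote control port in netloc, local receiver port in query.
u = urlparse("falcon://example.org:6666/data/dir?9000")
assert u.hostname == "example.org"  # remote host holding the data
assert u.port == 6666               # could override the hard-coded 5555
assert u.path == "/data/dir"        # directory to transfer
assert u.query == "9000"            # local receiver port, per this PR's convention
```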

Contributor Author

So the source host doesn't have the information you specified, including the local host port that is part of the provider initializer. To make it known to the source host, we communicate the information needed for the transfer over zmq using a predefined port (in my case 5555); that then starts the transfer (which takes considerably more time than the communication) on the local host port from the provider initializer.

Collaborator

but it could get the local host port from a variable like self.local_port specified in the initializer, right? same as the hostname? I don't understand why the local hostname and local port need to be treated differently in how they are supplied.

Contributor Author

Ok, so you can't have the same port for the transfer for multiple hosts; that's why we supply the port through the query. However, the zmq port has to be fixed and pre-determined in order to actually receive the information. I cannot convey the port number supplied through the provider initializer without that connection.

Collaborator

ok, so I think that means it is up to the end user to manage the selection of local ports on their submitting machine (for example in their head or in their own workflow code) and to make sure the right ports get added onto the end of the right falcon: URLs?

Contributor Author

Yes

Collaborator

It would be good to describe that in the docstring (https://github.com/Parsl/parsl/pull/2955/files#r1392354365) so that users can understand how they are meant to use this.

Contributor Author

I believe this should be addressed in my latest commit


class FalconStaging(Staging, RepresentationMixin):
"""
In this class, the File object is used to represent the directory being transferred
Collaborator

You can make the docstring documentation for this class appear by adding the class name into docs/reference.rst. There is a data management / file staging section in there where you will find the other staging providers such as parsl.data_provider.globus.GlobusStaging

I think this docstring should then describe:

  • what this provider is from the perspective of a Parsl user who does not know what Falcon is: one sentence describing what it is for, and a URL to learn more (for example, a Falcon getting started guide or something else hands-on)

  • what falcon: URLs should look like and what they mean.
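A sketch of the shape such a docstring might take (the wording, the URL placeholder, and the base-class stubs here are illustrative, not the PR's actual text):

```python
class Staging:               # stub stand-ins so this sketch runs on its own
    pass

class RepresentationMixin:
    pass

class FalconStaging(Staging, RepresentationMixin):
    """Stage files using the Falcon high-speed data transfer tool.

    Falcon transfers directories between a source and a destination host;
    see the Falcon project documentation for installation and a
    getting-started guide.

    URLs take the form ``falcon://remotehost/path/to/dir?LOCALPORT`` where
    ``remotehost`` holds the data, ``path/to/dir`` is the directory to
    transfer, and ``LOCALPORT`` is the local port the receiver listens on.
    Users are responsible for choosing distinct local ports themselves.
    """
```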

@benclifford (Collaborator)

I would be interested in a simple test in the CI that this code basically works - for example can you start a receiver and sender on the localhost and run a small transfer?

@KiwiFac3 (Contributor Author)

> I would be interested in a simple test in the CI that this code basically works - for example can you start a receiver and sender on the localhost and run a small transfer?

Tests have been conducted on both local hosts and on different hosts; I am happy to share the performance improvements.

@benclifford (Collaborator)

> I would be interested in a simple test in the CI that this code basically works - for example can you start a receiver and sender on the localhost and run a small transfer?

> Tests have been conducted on both local hosts and on different hosts, I am happy to share the performance improvements.

I'm interested in tests that tell me that this code works and continues to work for (eg.) the next year as people make changes to the Parsl codebase and as Falcon itself changes: so that a user coming to use this provider can have some basic expectation that it won't be broken.

We don't really do any automated performance tests (although I'd like that one day...)

@KiwiFac3 (Contributor Author)

> I would be interested in a simple test in the CI that this code basically works - for example can you start a receiver and sender on the localhost and run a small transfer?

> Tests have been conducted on both local hosts and on different hosts, I am happy to share the performance improvements.

> I'm interested in tests that tell me that this code works and continues to work for (eg.) the next year as people make changes to the Parsl codebase and as Falcon itself changes: so that a user coming to use this provider can have some basic expectation that it won't be broken.

> We don't really do any automated performance tests (although I'd like that one day...)

https://github.com/KiwiFac3/Datamover2Parsl/blob/main/Test/Falcon.py is the file I used for testing; it works with no issues, and I tried to keep my code as close to the other data providers as possible, to ensure that changes to the Parsl codebase won't affect my integration. I presented the results of the tests at ParslFest; I can re-share them if you would like.

@benclifford (Collaborator)

I still really would like the documentation to refer to somewhere where people can get the Falcon software, read more about it, etc - imagine yourself as a person browsing through the Parsl documentation.

@KiwiFac3 (Contributor Author)

> I still really would like the documentation to refer to somewhere where people can get the Falcon software, read more about it, etc - imagine yourself as a person browsing through the Parsl documentation.

I just added more detailed documentation; however, I would like your feedback if there is any aspect I need to touch on.

@benclifford benclifford changed the title Falcon integration Add Falcon data provider Nov 27, 2023
'directory_path' --method probe"
- sleep for 0.1 seconds (so that the receiver instance runs before the sender instance)
e.g.
zmq_context = zmq.Context()
Collaborator

import zmq first

- Setup your virtual environments on both source and destination server
and install required python packages.
- On the destination server, you can stage data files using FalconStaging.
The following example shows creation of a Flacon-accessible file
Collaborator

typo: Flacon

@@ -65,7 +66,8 @@ def __repr__(self) -> str:
f"scheme={self.scheme} " \
f"netloc={self.netloc} " \
f"path={self.path} " \
f"filename={self.filename}"
f"filename={self.filename}" \
Collaborator

this needs a following space to separate it from query:

<File at 0x7f442e7bf690 url=falcon://localhost/tmp/file.txt?8000 scheme=falcon netloc=localhost path=/tmp/file.txt filename=file.txtquery=8000>

print("Received JSON data:", json_data)

# Parse the JSON data
data = json.loads(json_data)
Collaborator

import json first

zmq_context.term()

sender_command = ["falcon", "receiver", "--host", self.host_ip, "--port", directory.query, "--data_dir",
working_dir]
Collaborator

this gives a pretty ugly exception if the user hasn't specified working_dir on the relevant executor.
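One way to fail more clearly, sketched with assumed names (`build_receiver_command` is a hypothetical helper, not code from this PR):

```python
def build_receiver_command(host_ip, port, working_dir):
    """Build the falcon receiver command line, failing early and clearly
    if the executor's working_dir was never configured."""
    if not working_dir:
        raise ValueError(
            "FalconStaging requires working_dir to be set on the executor "
            "so the receiver knows where to place incoming data"
        )
    return ["falcon", "receiver", "--host", host_ip,
            "--port", port, "--data_dir", working_dir]
```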

zmq_socket.close()
zmq_context.term()

sender_command = ["falcon", "receiver", "--host", self.host_ip, "--port", directory.query, "--data_dir",
Collaborator

this is a receiver command, not a sender command, so you might want to rename this variable for people who want to read this code

"""
To run Falcon data_provider
- Setup your virtual environments on both source and destination server
and install required python packages.
Collaborator

"install required python packages" might look better as an explicit pip command to type

Contributor Author

I have the following requirements.txt for this; should I just link it as a reference? https://github.com/KiwiFac3/Datamover2Parsl/blob/main/requirments.txt

@benclifford (Collaborator)

@KiwiFac3 I added a few comments here - you could probably easily address them before merge.

Right now I'm trying to run this myself to see how it works (which is how I ended up with most of my recent comments) -- the dependency stack underneath is a bit awkward on top of my current environment, to do with scipy/numpy/pandas versions, but let me see how that goes.

@benclifford (Collaborator)

I get this:

$ falcon 
RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf
Traceback (most recent call last):
  File "/home/benc/parsl/virtualenv-3.11/bin/falcon", line 5, in <module>
    from falcon.__main__ import main
  File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/falcon/__main__.py", line 17, in <module>
    from falcon.search import Optimizer
  File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/falcon/search.py", line 1, in <module>
    from skopt.space import Integer
  File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/skopt/__init__.py", line 42, in <module>
    from . import acquisition
  File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/skopt/acquisition.py", line 4, in <module>
    from scipy.stats import norm
  File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/scipy/stats/__init__.py", line 608, in <module>
    from ._stats_py import *
  File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/scipy/stats/_stats_py.py", line 39, in <module>
    from scipy.spatial.distance import cdist
  File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/scipy/spatial/__init__.py", line 110, in <module>
    from ._kdtree import *
  File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/scipy/spatial/_kdtree.py", line 4, in <module>
    from ._ckdtree import cKDTree, cKDTreeNode
  File "_ckdtree.pyx", line 12, in init scipy.spatial._ckdtree
  File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/scipy/sparse/__init__.py", line 274, in <module>
    from ._csr import *
  File "/home/benc/parsl/virtualenv-3.11/lib/python3.11/site-packages/scipy/sparse/_csr.py", line 11, in <module>
    from ._sparsetools import (csr_tocsc, csr_tobsr, csr_count_blocks,
ImportError: numpy.core.multiarray failed to import

which feels like some kind of package versioning problem between scipy and numpy and skopt -- can you send me the exact versions you have installed in your environment? @KiwiFac3

@KiwiFac3 (Contributor Author)

> I get this: (numpy/scipy traceback quoted above) which feels like some kind of package versioning problem between scipy and numpy and skopt -- can you send me the exact versions you have installed in your environment? @KiwiFac3

I am using the following versions https://github.com/KiwiFac3/Datamover2Parsl/blob/main/requirments.txt

@KiwiFac3 (Contributor Author)

> @KiwiFac3 I added a few comments here - you could probably easily address them before merge.

> Right now I'm trying to run this myself to see how it works (which is how I ended up with most of my recent comments) -- the dependency stack underneath is a bit awkward on top of my current environment, to do with scipy/numpy/pandas versions, but let me see how that goes.

I believe I addressed all the comments; let me know if there is anything else I should address.

@benclifford (Collaborator)

> I am using the following versions https://github.com/KiwiFac3/Datamover2Parsl/blob/main/requirments.txt

ok I'll try to work out what's happening in my environment compared to that and let you know how it goes.

@KiwiFac3 (Contributor Author)

@benclifford I made a slight change to the code so it wouldn't have issues in certain corner cases, please let me know if any changes are required.

@khk-globus (Collaborator)

khk-globus commented Jan 26, 2024

> I made a slight change to the code so it wouldn't have issues in certain corner cases, please let me know if any changes are required.

I'm sure the change is fine, but this PR still has some general work to go before merging. The most important obstacle in my mind is a missing set of automated tests in CI. That is, responding to:

> Tests have been conducted on both local hosts and on different hosts, I am happy to share the performance improvements.

I'm happy to see the performance improvements, but that's not the point of CI. CI ("Continuous Integration") is about robustness, and robustness for the next dev who isn't aware of the intricacies of your code. CI (a set of individual test cases) runs for every push to a PR, and also periodically against the master branch. When the tests fail, we can get insight into what broke and when.

Without a set of test cases that verify this PR's additions, we won't discover a misassumption by a future dev until possibly months later (perhaps only when a user complains). Thus, before this goes in, I want to see a few unit tests that verify different important aspects of Falcon in Parsl. For test examples, please see parsl/tests/ and look to the Makefile for how to run them (e.g., make test).
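As a concrete example of the kind of small, dependency-free unit test being asked for, one could exercise the falcon: URL conventions in isolation (parse_falcon_url is a hypothetical helper written for this sketch, not code from this PR):

```python
from urllib.parse import urlparse

def parse_falcon_url(url):
    """Split a falcon: URL into (remote host, directory path, local receiver port)."""
    u = urlparse(url)
    if u.scheme != "falcon":
        raise ValueError(f"not a falcon URL: {url}")
    if not u.query:
        raise ValueError("falcon URLs must carry the local receiver port in the query")
    return u.hostname, u.path, int(u.query)

def test_parse_falcon_url():
    assert parse_falcon_url("falcon://localhost/tmp/data?8000") == \
        ("localhost", "/tmp/data", 8000)

def test_parse_falcon_url_rejects_missing_port():
    try:
        parse_falcon_url("falcon://localhost/tmp/data")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a URL with no port query")
```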

@KiwiFac3 (Contributor Author)

> I'm sure the change is fine, but this PR still has some general work to go before merging. The most important obstacle in my mind is a missing set of automated tests in CI. (...) Thus, before this goes in, I want to see a few unit tests that verify different important aspects of Falcon in Parsl.

Should I create those CI tests? If so, is there an example in any previous ones?

@khk-globus (Collaborator)

> Should I create those CI tests,

Yes. That is typically under the purview of the PR author.

> For test examples, please see parsl/tests/ and look to the Makefile for how to run them (e.g., make test).

> if so is there an example on any previous ones?

See the aforementioned.

Given your question, let me point out that CI ("continuous integration") is "merely" automated testing. GitHub has this builtin via the .github/workflows/ directory. For Parsl, the current setup is that tests are run against all pull requests:

# .github/workflows/ci.yaml
name: Parsl

on:
  pull_request:
    types:
      - opened
      - synchronize

[...rest of ci.yaml file...]

I'll not delve into those details here, but instead observe that CI creates a consistent environment and runs tests in a consistent way every time (e.g., on every push to a pull request). In particular, one can run the tests locally as well, which is much more efficient while developing.

If the above is not enough of a pointer, please reach out for help. (Slack might be a more expedient avenue.)
