Merge pull request #1 from ssl-hep/async_implementation

First Implementation
ssl-hep · Jun 12, 2021 · e5021a4 · e5021a4
2 parents aa5e8d2 + ee5bee1
commit e5021a4
Show file tree

Hide file tree

Showing 18 changed files with 957 additions and 8 deletions.
diff --git a/.flake8 b/.flake8
@@ -0,0 +1,5 @@
+[flake8]
+max-line-length = 99
+exclude =
+    __pycache__
+
diff --git a/.github/ISSUE_TEMPLATE/bug.md b/.github/ISSUE_TEMPLATE/bug.md
@@ -0,0 +1,16 @@
+---
+name: Bug Report
+about: Report a bug
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+## Description
+
+- What you see, what you expect to see
+
+## Reproduction
+
+- How to reproduce the result.
diff --git a/.github/ISSUE_TEMPLATE/user_story.md b/.github/ISSUE_TEMPLATE/user_story.md
@@ -0,0 +1,20 @@
+---
+name: User Story
+about: A feature from an end-user perspective
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+## Story
+
+_As a <blah> I want to <blah> so I can <blah>_
+
+## Acceptance Criteria
+
+1. How will we know when this story is complete
+
+## Assumptions
+
+1. List assumptions behind this story
diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
@@ -0,0 +1,42 @@
+name: CI/CD
+
+on:
+  push:
+    branches:
+      - "*"
+    tags:
+      - "*"
+  pull_request:
+
+jobs:
+  test:
+    strategy:
+      matrix:
+        python-version: [3.6, 3.7, 3.8, 3.9]
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@master
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v2
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Install poetry
+      uses: Gr1N/setup-poetry@v4
+    - name: Install dependencies
+      run: |
+        poetry install
+        pip list
+    - name: Lint with Flake8
+      run: |
+        poetry run flake8
+    - name: Test with pytest
+      run: |
+        poetry run coverage run -m --source=src pytest tests
+        poetry run coverage xml
+    - name: Report coverage using codecov
+      if: github.event_name == 'push' && matrix.python-version == 3.8
+      uses: codecov/codecov-action@v1
+      with:
+        file: ./coverage.xml # optional
+        flags: unittests # optional
diff --git a/.github/workflows/pypi.yaml b/.github/workflows/pypi.yaml
@@ -0,0 +1,28 @@
+name: Push to PyPI
+
+on:
+  release:
+    types: [released, prereleased]
+
+jobs:
+  publish:
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python
+      uses: actions/setup-python@v2
+      with:
+        python-version: 3.8
+    - name: Get the version
+      id: get_version
+      run: echo ::set-output name=VERSION::${GITHUB_REF#refs/tags/}
+    - name: Install poetry
+      uses: Gr1N/setup-poetry@v4
+    - name: Build
+      run: |
+        poetry version ${{ steps.get_version.outputs.VERSION }}
+    - name: Build and publish to pypi
+      uses: JRubics/[email protected]
+      with:
+        pypi_token: ${{ secrets.pypi_password_sx_did_lib }}
diff --git a/README.md b/README.md
@@ -1,2 +1,130 @@
 # ServiceX_DID
+
  ServiceX DID Library
+
+## Introduction
+
+ServiceX DID finders take a dataset name and turn them into files to be transformed. They interact with ServiceX via the Rabbit MQ message broker. As such, the integration into ServiceX and the API must be the same for all DID finders. This library abstracts away some of that interaction so that the code that interacts with ServiceX can be separated from the code that translates a dataset identifier into a list of files.
+
+The DID Finder author need write only a line of initialization code and then a small routine that translates a DID into a list of files.
+
+## Usage
+
+A very simple [demo](https://github.com/ssl-hep/ServiceX_DID_Finder_Demo) has been created to show how to make a basic DID finder for ServiceX. You can use that as a starting point if authoring one.
+
+Create an async callback method that `yield`s file info dictionaries. For example:
+
+```python
+    async def my_callback(did_name: str, info: Dict[str, Any]):
+        for i in range(0, 10):
+            yield {
+                'file_path': f"root://atlas-experiment.cern.ch/dataset1/file{i}.root",
+                'adler32': b183712731,
+                'file_size': 0,
+                'file_events': 0,
+            }
+```
+
+The arguments to the method are straight forward:
+
+* `did_name`: the name of the DID that you should look up. It has the schema stripped off (e.g. if the user sent ServiceX `rucio://dataset_name_in_rucio`, then `did_name` will be `dataset_name_in_rucio`)
+* `info` contains a dict of various info about the request that asked for this DID:
+  * `request-id` The request id that has this DID associated. For logging.
+
+Yield the results as you find them - ServiceX will actually start processing the files before your DID lookup is finished if you do this. The fields you need to pass back to the library are as follows:
+
+* `file_path`: A URI that a transformer in ServiceX can access to get at the file. Often these are either `root://` or `http://` schema URI's.
+* `adler32`: A CRC number for the file. This CRC is calculated in a special way by rucio and is not used. Leave as 0 if you do not know it.
+* `file_size`: Number of bytes of the file. Used to calculate statistics. Leave as zero if you do not know it (or it is expensive to look up).
+* `file_events`: Number of events in the file. Used to calculate statistics. Leave as zero if you do not know it (or it is expensive to look up).
+
+Once the callback is working privately, it is time to build a container. ServiceX will start the container and pass, one way or the other, a few arguments to it. Several arguments are required for the DID finder to work (like `rabbitUrl`). The library will automatically parse these during initialization.
+
+To initialize the library and start listening for and processing messages from RabbitMQ, you must initialize the library. In all cases, the call to `start_did_finder` will not return.
+
+If you do not have to process any command line arguments to configure the service, then the following is good enough:
+
+```python
+start_did_finder('my_finder', my_callback)
+```
+
+The first argument is the name of the schema. The user will use `my_finder://dataset_name` to access this DID lookup (and `my_callback` is called with `dataset_name` as `did_name`). Please make sure everything is lower case: schema in URI's are not case sensitive.
+
+The second argument is the call back.
+
+If you do need to configure your DID finder from command line arguments, then use the python `argparse` library, being sure to hook in the did finder library as follows (this code is used to start the rucio finder):
+
+```python
+    # Parse the command line arguments
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--site', dest='site', action='store',
+                        default=None,
+                        help='XCache Site)')
+    parser.add_argument('--prefix', dest='prefix', action='store',
+                        default='',
+                        help='Prefix to add to Xrootd URLs')
+    parser.add_argument('--threads', dest='threads', action='store',
+                        default=10, type=int, help="Number of threads to spawn")
+    add_did_finder_cnd_arguments(parser)
+
+    args = parser.parse_args()
+
+    site = args.site
+    prefix = args.prefix
+    threads = args.threads
+    logger.info("ServiceX DID Finder starting up: "
+                f"Threads: {threads} Site: {site} Prefix: {prefix}")
+
+    # Initialize the finder
+    did_client = DIDClient()
+    replica_client = ReplicaClient()
+    rucio_adapter = RucioAdapter(did_client, replica_client)
+
+    # Run the DID Finder
+    try:
+        logger.info('Starting rucio DID finder')
+
+        async def callback(did_name, info):
+            async for f in find_files(rucio_adapter, site, prefix, threads,
+                                      did_name, info):
+                yield f
+
+        start_did_finder('rucio',
+                         callback,
+                         parsed_args=args)
+
+    finally:
+        logger.info('Done running rucio DID finder')
+```
+
+In particular note:
+
+1. The call to `add_did_finder_cnd_arguments` to setup the arguments required by the finder library.
+2. Parsing of the arguments using the usual `parse_args` method
+3. Passing the parsed arguments to `start_did_finder`.
+
+Another pattern in the above code that one might find useful - a thread-safe way of passing global arguments into the callback. Given Python's Global Interpreter Lock, this is probably not necessary.
+
+### Proper Logging
+
+In the end, all DID finders for ServiceX will run under Kubernetes. ServiceX comes with a built in logging mechanism. If anything is to be logged it should use the log system using the python standard `logging` module, with some extra information. For example, here is how to log a message from your callback function:
+
+```python
+    import logger
+    __log = logger.getLogger(__name__)
+    async def my_callback(did_name: str, info: Dict[str, Any]):
+        __log.info(f'Looking up dataset {did_name}.',
+                     extra={'requestId': info['request-id']})
+
+        for i in range(0, 10):
+            yield {
+                'file_path': f"root://atlas-experiment.cern.ch/dataset1/file{i}.root",
+                'adler32': b183712731,
+                'file_size': 0,
+                'file_events': 0,
+            }
+```
+
+Note the parameter `request-id`: this marks the log messages with the request id that triggered this DID request. This will enable the system to track all log messages across all containers connected with this particular request id - making debugging a lot easier.
+
+The `start_did_finder` will configure the python root logger properly to dump messages with a request ID in them.
diff --git a/README.rst b/README.rst
diff --git a/pyproject.toml b/pyproject.toml
@@ -1,15 +1,20 @@
 [tool.poetry]
-name = "ServiceX_DID"
+name = "ServiceX_DID_Finder_lib"
 version = "1.0.0a1"
 description = "ServiceX DID Library Routines"
 authors = ["Gordon Watts <[email protected]>"]
 
 [tool.poetry.dependencies]
-python = "^3.8"
+python = "^3.6"
+pika = "1.1.0"
+make-it-sync = "^1.0.0"
+requests = "^2.25.0"
 
 [tool.poetry.dev-dependencies]
 pytest = "^5.2"
 flake8 = "^3.9.1"
+pytest-mock = "^3.5.1"
+coverage = "^5.5"
 
 [build-system]
 requires = ["poetry-core>=1.0.0"]

diff --git a/src/servicex_did/__init__.py b/src/servicex_did/__init__.py
diff --git a/src/servicex_did_finder_lib/__init__.py b/src/servicex_did_finder_lib/__init__.py
@@ -0,0 +1,4 @@
+__version__ = '1.0.0a1'
+
+from .communication import start_did_finder, \
+    add_did_finder_cnd_arguments  # NOQA