Add documentation for typical use cases of openpmd-pipe (#1578)
* Add documentation for use cases of openpmd-pipe

* Update docs/source/analysis/pipe.rst

* Move this documentation to cli.rst

* Revert "Update docs/source/analysis/pipe.rst"

This reverts commit 993b225.

* Revert "Add documentation for use cases of openpmd-pipe"

This reverts commit e3e4336.

* Headers --> paragraphs

---------

Co-authored-by: Axel Huebl <[email protected]>
franzpoeschel and ax3l authored Dec 22, 2023
1 parent e668e86 commit 7296948
Showing 1 changed file with 134 additions and 13 deletions.
147 changes: 134 additions & 13 deletions docs/source/utilities/cli.rst
Expand Up @@ -28,24 +28,145 @@ With some ``pip``-based python installations, you might have to run this as a mo

Redirect openPMD data from any source to any sink.

The script can be used in parallel via MPI.
Datasets will be split into chunks of equal size to be loaded and written by the single processes.
Any Python-enabled openPMD-api installation with enabled CLI tools comes with a command-line tool named ``openpmd-pipe``.
Naming and use are inspired by the `piping concept <https://en.wikipedia.org/wiki/Pipeline_(Unix)>`__ known from UNIX shells.

Possible uses include:

* Conversion of a dataset between two openPMD-based backends, such as ADIOS and HDF5.
* Decompression and compression of a dataset.
* Capture of a stream into a file.
* Template for simpler loosely-coupled post-processing scripts.

The syntax of the command line tool is printed via:

.. code-block:: bash

   openpmd-pipe --help

With some ``pip``-based Python installations, you might have to run this as a module:

.. code-block:: bash

   python3 -m openpmd_api.pipe --help

The fundamental idea is to redirect data from an openPMD data source to another openPMD data sink.
This concept becomes useful through the openPMD-api's ability to use different backends in different configurations; ``openpmd-pipe`` can hence be understood as a translation from one I/O configuration to another.

.. note::

   ``openpmd-pipe`` is (currently) optimized for streaming workflows in order to minimize the number of back-and-forth communications between writer and reader.
   All data load operations are issued in a single ``flush()`` per iteration.
   Data is loaded directly into backend-provided buffers of the writer (if supported by the writer), where again only one ``flush()`` per iteration is used to write the data to disk.
   This means that the peak memory usage will be roughly equivalent to the data size of a single iteration.

The reader Series is configured by the parameters ``--infile`` and ``--inconfig``, which are forwarded to the ``filepath`` and ``options`` parameters of the ``Series`` constructor, respectively.
The writer Series is likewise controlled by ``--outfile`` and ``--outconfig``.
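
The file names used in the examples on this page contain the iteration expansion pattern ``%T`` (or a zero-padded variant such as ``%06T``), which the openPMD-api replaces with the iteration number when writing file-based series. The following is a minimal illustrative sketch of that substitution in plain Python; the helper name ``expand_iteration_pattern`` is hypothetical and not part of the openPMD-api, and the sketch only covers the writing direction.

```python
import re

# Matches "%T" as well as zero-padded variants such as "%06T".
_ITERATION_PATTERN = re.compile(r"%(?:0(\d+))?T")


def expand_iteration_pattern(filepath, iteration):
    """Replace the iteration pattern in `filepath` by `iteration`.

    Hypothetical helper for illustration only; the openPMD-api performs
    this expansion internally.
    """

    def substitute(match):
        # group(1) holds the padding width, if one was given
        width = int(match.group(1)) if match.group(1) else 0
        return str(iteration).zfill(width)

    return _ITERATION_PATTERN.sub(substitute, filepath)


print(expand_iteration_pattern("simData_%T.h5", 100))    # simData_100.h5
print(expand_iteration_pattern("simData_%06T.bp", 100))  # simData_000100.bp
print(expand_iteration_pattern("simData.h5", 100))       # simData.h5 (no pattern)
```

A file name without the pattern is left unchanged, which corresponds to the group-based and variable-based encodings where all iterations share one file.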

Use of MPI is controlled by the ``--mpi`` and ``--no-mpi`` switches.
If left unspecified, MPI will be used automatically if the MPI size is greater than 1.
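
The three-state behavior of these switches (forced on, forced off, automatic) can be sketched with ``argparse`` as follows. This is an illustrative sketch under assumptions, not the actual implementation of ``openpmd-pipe``: the function names are hypothetical, and the MPI size is passed in as a plain integer instead of being queried from an MPI library.

```python
import argparse


def build_parser():
    """Parser with a tri-state MPI switch: True, False or None (unspecified)."""
    parser = argparse.ArgumentParser(prog="pipe-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--mpi", dest="mpi", action="store_true", default=None)
    group.add_argument("--no-mpi", dest="mpi", action="store_false", default=None)
    return parser


def decide_use_mpi(flag, mpi_size):
    """An explicit switch wins; otherwise use MPI only if more than one rank runs."""
    if flag is not None:
        return flag
    return mpi_size > 1


args = build_parser().parse_args([])      # neither switch given
print(decide_use_mpi(args.mpi, 1))        # False
print(decide_use_mpi(args.mpi, 4))        # True
```

Using ``None`` as the default keeps "not specified" distinguishable from an explicit ``--no-mpi``.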

.. note::

   Required parameters are ``--infile`` and ``--outfile``. For all other options, refer to the output of ``openpmd-pipe --help``.

When using MPI, each dataset will be sliced into roughly equally-sized hyperslabs along the dimension with the highest item count, in order to distribute the load across worker ranks.

If you are interested in further chunk distribution strategies (e.g. node-aware distribution, chunking-aware distribution) that are used/tested on development branches, feel free to contact us, e.g. on GitHub.
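
The default slicing described above can be sketched in plain Python. This is a simplified illustration, not the code used by ``openpmd-pipe``: the function name ``slice_for_rank`` is hypothetical, and the rule for distributing the remainder (the first ranks receive one extra item) is an assumption made for this example.

```python
def slice_for_rank(extent, rank, size):
    """Return (offset, chunk_extent) of the hyperslab assigned to `rank`.

    The dataset is cut along its longest dimension into `size` pieces of
    roughly equal length. Returns None if the rank receives no data.
    """
    # Pick the dimension with the highest item count (first one on ties).
    dim = max(range(len(extent)), key=lambda d: extent[d])
    base, remainder = divmod(extent[dim], size)

    # Assumption: the first `remainder` ranks take one extra item each.
    local_length = base + (1 if rank < remainder else 0)
    if local_length == 0:
        return None
    start = rank * base + min(rank, remainder)

    offset = [0] * len(extent)
    offset[dim] = start
    chunk_extent = list(extent)
    chunk_extent[dim] = local_length
    return offset, chunk_extent


for rank in range(3):
    print(rank, slice_for_rank([10, 100, 4], rank, 3))
# 0 ([0, 0, 0], [10, 34, 4])
# 1 ([0, 34, 0], [10, 33, 4])
# 2 ([0, 67, 0], [10, 33, 4])
```

Each rank would then load and store exactly one contiguous chunk per dataset, matching the one-chunk-per-rank behavior described for the tool.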

The remainder of this page discusses a select number of use cases and examples for the ``openpmd-pipe`` tool.


Conversion between backends
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Converting from ADIOS2 to HDF5:

.. code:: bash

   $ openpmd-pipe --infile simData_%T.bp --outfile simData_%T.h5

Converting from the ADIOS2 BP3 engine to the (newer) ADIOS2 BP5 engine:

.. code:: bash

   $ openpmd-pipe --infile simData_%T.bp --outfile simData_%T.bp5

   # or e.g. via inline TOML specification (also possible: JSON)
   $ openpmd-pipe --infile simData_%T.bp --outfile output_folder/simData_%T.bp \
       --outconfig 'adios2.engine.type = "bp5"'
   # the config can also be read from a file, e.g. --outconfig @cfg.toml
   # or --outconfig @cfg.json

Converting between iteration encodings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Converting to group-based iteration encoding:

.. code:: bash

   $ openpmd-pipe --infile simData_%T.h5 --outfile simData.h5

Converting to variable-based iteration encoding (not yet feature-complete):

.. code:: bash

   # e.g. specified via inline JSON
   $ openpmd-pipe --infile simData_%T.bp --outfile simData.bp \
       --outconfig '{"iteration_encoding": "variable_based"}'

Capturing a stream
^^^^^^^^^^^^^^^^^^

Since the openPMD-api also supports streaming/staging I/O transports from ADIOS2, ``openpmd-pipe`` can be used to capture a stream in order to write it to disk.
In the ADIOS2 `SST engine <https://adios2.readthedocs.io/en/latest/engines/engines.html#sst-sustainable-staging-transport>`_, a stream can have any number of readers.
This makes it possible to intercept a stream in a data processing pipeline.

.. code:: bash

   $ cat << EOF > streamParams.toml
   [adios2.engine.parameters]
   DataTransport = "fabric"
   OpenTimeoutSecs = 600
   EOF

   $ openpmd-pipe --infile streamContactFile.sst --inconfig @streamParams.toml \
       --outfile capturedStreamData_%06T.bp

   # Just loading and discarding streaming data, e.g. for performance benchmarking:
   $ openpmd-pipe --infile streamContactFile.sst --inconfig @streamParams.toml \
       --outfile null.bp --outconfig 'adios2.engine.type = "nullcore"'

Defragmenting a file
^^^^^^^^^^^^^^^^^^^^

Due to the file layout of ADIOS2, simulation codes, especially those with mesh refinement enabled, can create file output that is very strongly fragmented.
Since only one ``load_chunk()`` and one ``store_chunk()`` call is issued per MPI rank, per dataset and per iteration, the file is implicitly defragmented by the backend when passed through ``openpmd-pipe``:

.. code:: bash

   $ openpmd-pipe --infile strongly_fragmented_%T.bp --outfile defragmented_%T.bp

Post-hoc compression
^^^^^^^^^^^^^^^^^^^^

The openPMD-api can compress data directly when it is first written.
To compress data that has been written without compression enabled, ``openpmd-pipe`` can help:

.. code:: bash

   $ cat << EOF > compression_cfg.json
   {
     "adios2": {
       "dataset": {
         "operators": [
           {
             "type": "blosc",
             "parameters": {
               "clevel": 1,
               "doshuffle": "BLOSC_BITSHUFFLE"
             }
           }
         ]
       }
     }
   }
   EOF

   $ openpmd-pipe --infile not_compressed_%T.bp --outfile compressed_%T.bp \
       --outconfig @compression_cfg.json

Starting point for custom transformation and analysis
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``openpmd-pipe`` is a Python script that can serve as a basis for custom extensions, e.g. for adding, modifying, transforming or reducing data. The typical use case would be as a building block in a domain-specific data processing pipeline.
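
As a rough idea of what such an extension might look like, the sketch below reads one mesh component per iteration, scales it, and writes the result to a new series with the openPMD-api Python bindings. It is a simplified, serial illustration and not an excerpt from ``openpmd-pipe``: the function name, the default mesh ``E``/``x`` and the scaling factor are hypothetical, and attributes such as grid spacing or units are not copied. The imports sit inside the function so that the definition itself does not require the library to be installed.

```python
def scale_mesh_component(infile, outfile, mesh="E", component="x", factor=2.0):
    """Copy one mesh component from `infile` to `outfile`, scaled by `factor`.

    Hypothetical example; assumes openpmd_api and numpy are installed and
    that every iteration contains the requested mesh component.
    """
    import numpy as np
    import openpmd_api as io

    source = io.Series(infile, io.Access.read_only)
    sink = io.Series(outfile, io.Access.create)

    for in_iteration in source.read_iterations():
        in_component = in_iteration.meshes[mesh][component]
        data = in_component.load_chunk()  # request the whole dataset
        source.flush()                    # data is valid only after flushing

        result = np.ascontiguousarray(data * factor)

        out_iteration = sink.write_iterations()[in_iteration.iteration_index]
        out_component = out_iteration.meshes[mesh][component]
        out_component.reset_dataset(io.Dataset(result.dtype, result.shape))
        out_component.store_chunk(result)

        # Closing flushes the iteration and releases its resources.
        out_iteration.close()
        in_iteration.close()
```

A call such as ``scale_mesh_component("simData_%T.h5", "scaled_%T.h5")`` would then process a file-based series one iteration at a time, so memory use stays bounded by the size of a single iteration.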
