Helper documentation for hipscat conversion (#424)
delucchi-cmu authored Nov 1, 2024
1 parent 5f5f3cb commit 57aaa9d
Showing 2 changed files with 139 additions and 0 deletions.
137 changes: 137 additions & 0 deletions docs/guide/hipscat_conversion.rst
@@ -0,0 +1,137 @@
Converting from HiPSCat
===============================================================================

This page discusses how to set up a pipeline that generates a new, valid
HATS catalog from an existing HiPSCat catalog. There are several breaking
incompatibilities between these two versions of the format, and users should
migrate their catalogs to the new format before using HATS or LSDB later than v0.4.

At a minimum, you need arguments that include where to find the original files,
and where to put the output files. A minimal arguments block will look something like:

.. code-block:: python

    import hats_import.pipeline as runner
    from hats_import.hipscat_conversion.arguments import ConversionArguments

    args = ConversionArguments(
        input_catalog_path="./hipscat_catalogs/my_catalog",
        output_path="./hats_catalogs/",
        output_artifact_name="my_catalog",
    )
    runner.pipeline(args)

More details on each of these parameters are provided in the sections below.

For the curious, see the API documentation for
:py:class:`hats_import.hipscat_conversion.arguments.ConversionArguments`,
and its superclass :py:class:`hats_import.runtime_arguments.RuntimeArguments`.

Dask setup
-------------------------------------------------------------------------------

We will either use a user-provided dask ``Client``, or create a new one with
the following arguments:

``dask_tmp`` - ``str`` - directory for dask worker space. This should be local to
the execution of the pipeline, for speed of reads and writes. For much more
information, see :doc:`/catalogs/temp_files`.

``dask_n_workers`` - ``int`` - number of workers for the dask client. Defaults to 1.

``dask_threads_per_worker`` - ``int`` - number of threads per dask worker. Defaults to 1.
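
For example, a conversion run that spins up its own local dask cluster might
look like the following sketch (the scratch path and worker counts are illustrative):

.. code-block:: python

    from hats_import.hipscat_conversion.arguments import ConversionArguments

    args = ConversionArguments(
        input_catalog_path="./hipscat_catalogs/my_catalog",
        output_path="./hats_catalogs/",
        output_artifact_name="my_catalog",
        # Fast, node-local scratch space for dask workers (illustrative path).
        dask_tmp="/local/scratch/dask",
        # Modest parallelism; tune these to your machine.
        dask_n_workers=4,
        dask_threads_per_worker=1,
    )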

If you find that you need additional parameters for your dask client (e.g. you are
creating a SLURM worker pool), you can instead create your own dask client and pass
it along to the pipeline, ignoring the above arguments. This would look like:

.. code-block:: python

    from dask.distributed import Client
    from hats_import.pipeline import pipeline_with_client

    args = ConversionArguments(...)
    with Client('scheduler:port') as client:
        pipeline_with_client(args, client)

If you're running within a ``.py`` file, we recommend you use a ``main`` guard to
avoid some potential Python threading issues with dask:

.. code-block:: python

    from hats_import.pipeline import pipeline

    def conversion_pipeline():
        args = ConversionArguments(...)
        pipeline(args)

    if __name__ == '__main__':
        conversion_pipeline()

Input Catalog
-------------------------------------------------------------------------------

For this pipeline, you will need to have already transformed your catalog into
HiPSCat parquet format. Provide the path to the catalog data with the argument
``input_catalog_path``.

Progress Reporting
-------------------------------------------------------------------------------

By default, we will display some progress bars during pipeline execution. To
disable these (e.g. when you expect no output to standard out), you can set
``progress_bar=False``.

There are several stages to the pipeline execution, and you can expect progress
reporting to look like the following:

.. code-block::
    :class: no-copybutton

    Mapping : 100%|██████████| 2352/2352 [9:25:00<00:00, 14.41s/it]
    Reducing : 100%|██████████| 2385/2385 [00:43<00:00, 54.47it/s]
    Finishing: 100%|██████████| 4/4 [00:03<00:00, 1.15it/s]

For very long-running pipelines (e.g. multi-TB inputs), you can get an
email notification when the pipeline completes using the
``completion_email_address`` argument. This will send a brief email on either
pipeline success or failure.
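
A sketch combining both options (the email address is a placeholder, and the
other arguments are as in the minimal example above):

.. code-block:: python

    args = ConversionArguments(
        ...,  # input/output arguments as in the minimal example
        # Suppress progress bars, e.g. for batch jobs with no stdout.
        progress_bar=False,
        # Placeholder address; a brief note is sent on success or failure.
        completion_email_address="someone@example.com",
    )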

Output
-------------------------------------------------------------------------------

Where?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You must specify a name for the new HATS table, using ``output_artifact_name``.
It's totally fine to simply reuse the name of the original input HiPSCat catalog,
so long as the two catalogs are written to different directories.

You must specify where you want your HATS table to be written, using
``output_path``. This path should be the base directory for your catalogs, as
the full path for the HATS table will take the form of ``output_path/output_artifact_name``.

If there is already catalog data in the indicated directory, you can
force new data to be written there with the ``overwrite`` flag. It's
preferable to delete any existing contents first, however, as overwriting
in place may cause unexpected side effects.

If you're writing to cloud storage, or otherwise have some filesystem credential
dict, initialize ``output_path`` using ``universal_pathlib``'s utilities.
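
As a sketch, writing to S3 via ``universal_pathlib`` might look like the
following (the bucket name and credential keyword are placeholders; extra
keywords are passed through to the underlying filesystem):

.. code-block:: python

    from upath import UPath

    args = ConversionArguments(
        ...,  # input arguments as in the minimal example
        # Placeholder bucket; extra keywords become filesystem credentials.
        output_path=UPath("s3://my-bucket/hats_catalogs/", anon=False),
        output_artifact_name="my_catalog",
        # Force writing over existing catalog data at this location.
        overwrite=True,
    )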

In addition, you can specify directories to use for various intermediate files:

- dask worker space (``dask_tmp``)
- sharded parquet files (``tmp_dir``)

Most users will be fine simply setting ``tmp_dir`` for all intermediate
file use. For more information on these parameters, when you would use each,
and demonstrations of temporary file use, see :doc:`/catalogs/temp_files`.
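
For example, pointing both kinds of intermediate files at fast local storage
(paths are illustrative):

.. code-block:: python

    args = ConversionArguments(
        ...,  # input/output arguments as in the minimal example
        # Sharded parquet files and other intermediate pipeline state.
        tmp_dir="/local/scratch/tmp",
        # Dask worker space; can share the same fast local disk.
        dask_tmp="/local/scratch/dask",
    )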

What next?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can validate that your new HATS catalog meets both HATS/LSDB expectations
and your own expectations of the data contents. You can follow along with the
`Manual catalog verification <https://docs.lsdb.io/en/stable/tutorials/manual_verification.html>`_ guide.
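
As a quick first check, you can confirm that the converted catalog loads at all.
This is a sketch that assumes the ``hats`` library's ``read_hats`` reader; adjust
for your installed version:

.. code-block:: python

    import hats

    # Assumed reader API; loads the catalog metadata from the new HATS table.
    catalog = hats.read_hats("./hats_catalogs/my_catalog")
    print(catalog.catalog_info)
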
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -49,6 +49,7 @@ container class that you will need to instantiate and populate with relevant arg
See dataset-specific notes on arguments:

* :doc:`catalogs/arguments` (most common)
* :doc:`guide/hipscat_conversion`
* :doc:`guide/margin_cache`
* :doc:`guide/index_table`

@@ -79,6 +80,7 @@ threading issues with dask:
:caption: Catalogs

catalogs/arguments
guide/hipscat_conversion
catalogs/temp_files
catalogs/public/index

