Insert dataset dir and use general ra/dec column names. #377

Merged: 2 commits, Oct 15, 2024
21 changes: 11 additions & 10 deletions docs/guide/directory_scheme.rst
@@ -35,15 +35,16 @@ structure:
:class: no-copybutton

__ /path/to/catalogs/<catalog_name>/
|__ _common_metadata
|__ _metadata
|__ partition_info.csv
|__ properties
|__ Norder=1/
| |__ Dir=0/
| |__ Npix=0.parquet
| |__ Npix=1.parquet
|__ Norder=J/
|__ Dir=10000/
|__ Npix=K.parquet
|__ Npix=M.parquet
|__ dataset/
|__ _common_metadata
|__ _metadata
|__ Norder=1/
| |__ Dir=0/
| |__ Npix=0.parquet
| |__ Npix=1.parquet
|__ Norder=J/
|__ Dir=10000/
|__ Npix=K.parquet
|__ Npix=M.parquet
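For orientation, a minimal sketch of how a leaf file path is composed under the revised layout above. The catalog name and pixel numbers are hypothetical, and this uses plain `pathlib` rather than the library's own path helpers; per-pixel parquet files and the `_common_metadata`/`_metadata` sidecars now sit under `dataset/`, while `properties` and `partition_info.csv` remain at the catalog root.

```python
from pathlib import Path


def example_pixel_path(catalog_root: str, order: int, pixel: int) -> Path:
    """Compose a leaf file path under the new dataset/ subdirectory (illustration only)."""
    # Pixels are grouped into Dir buckets of 10,000 via integer division.
    directory = (pixel // 10_000) * 10_000
    return (
        Path(catalog_root)
        / "dataset"
        / f"Norder={order}"
        / f"Dir={directory}"
        / f"Npix={pixel}.parquet"
    )


print(example_pixel_path("/path/to/catalogs/my_catalog", 1, 44))
# /path/to/catalogs/my_catalog/dataset/Norder=1/Dir=0/Npix=44.parquet
```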
7 changes: 0 additions & 7 deletions docs/guide/pixel_math.md
@@ -192,10 +192,3 @@ Given a set of ra and dec coordinates as well as the healpixel, healpix order, m
We take a chunk of our points and generate a matrix of points with shape (size_of_chunk, number_of_boundary_points), with all of our data points repeated along the first axis. Then we generate a matrix of the same shape with the boundary points repeated as well, and put both of those matrices into a `SkyCoord` so that we can calculate the separations between all of them. We also find the distances between adjacent boundary points, so that we know the length of each boundary segment. For each data point, we find the boundary point with the minimum separation, and then find which of the two adjacent boundary points has the smaller separation from the data point. From there, we can do the above calculations to get the perpendicular bisector and compare it against the `margin_threshold`. For checking around the corners of the healpixel, where the data point may not form a neat triangle with its closest boundary points, we check whether the minimum distance is less than the margin threshold and return True if it is. This way, we keep any data points that fall within the margin threshold but would return NaN from our perpendicular bisector calculations.

To speed up our calculations, the inner loop of the calculation is compiled with the `numba` JIT compiler.

## Spatial Index

This index is defined as a 64-bit integer containing the healpix pixel of the row (at order 29).

This provides an increasing index that will not overlap
between spatially partitioned data files.
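To make the margin-check paragraph above concrete, here is a minimal, unvectorized sketch. It is not the hats implementation (which batches chunks of points into (size_of_chunk, number_of_boundary_points) matrices and compiles the inner loop with numba); it uses a flat-triangle approximation for the height over the boundary segment, and the function and argument names are made up for illustration.

```python
import astropy.units as u
import numpy as np
from astropy.coordinates import SkyCoord


def within_margin(point_ra, point_dec, boundary_ra, boundary_dec, margin_threshold_deg):
    """Return True if a point lies within margin_threshold_deg of a pixel boundary."""
    point = SkyCoord(point_ra * u.deg, point_dec * u.deg)
    boundary = SkyCoord(np.asarray(boundary_ra) * u.deg, np.asarray(boundary_dec) * u.deg)

    # Separation from the data point to every boundary point.
    separations = point.separation(boundary).deg
    closest = int(np.argmin(separations))

    # Corner case: the point is already within the threshold of its closest boundary point.
    if separations[closest] <= margin_threshold_deg:
        return True

    # Pick whichever neighboring boundary point is nearer, forming a triangle
    # (data point, closest boundary point, neighboring boundary point).
    left, right = closest - 1, (closest + 1) % len(separations)
    neighbor = left if separations[left] <= separations[right] else right

    a = separations[closest]
    b = separations[neighbor]
    c = boundary[closest].separation(boundary[neighbor]).deg  # boundary segment length

    # Height of that triangle over the boundary segment (Heron's formula, flat-sky
    # approximation): the perpendicular distance from the point to the segment.
    s = (a + b + c) / 2.0
    area = np.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0))
    return (2.0 * area / c) <= margin_threshold_deg
```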
5 changes: 3 additions & 2 deletions src/hats/catalog/dataset/table_properties.py
@@ -52,6 +52,7 @@
"hats_builder",
"hats_cols_sort",
"hats_cols_survey_id",
"hats_coordinate_epoch",
"hats_copyright",
"hats_creation_date",
"hats_creator",
@@ -85,8 +86,8 @@ class TableProperties(BaseModel):
catalog_type: CatalogType = Field(alias="dataproduct_type")
total_rows: int = Field(alias="hats_nrows")

ra_column: Optional[str] = Field(default=None, alias="hats_col_j2000_ra")
dec_column: Optional[str] = Field(default=None, alias="hats_col_j2000_dec")
ra_column: Optional[str] = Field(default=None, alias="hats_col_ra")
dec_column: Optional[str] = Field(default=None, alias="hats_col_dec")
default_columns: Optional[List[str]] = Field(default=None, alias="hats_cols_default")
"""Which columns should be read from parquet files, when user doesn't otherwise specify."""

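An illustrative use of the renamed keywords, assuming `TableProperties` can be populated directly from the `properties`-file aliases (as the pydantic field aliases above suggest); the values are taken from the test catalogs:

```python
from hats.catalog.dataset.table_properties import TableProperties

# hats_col_ra / hats_col_dec replace the former hats_col_j2000_ra / hats_col_j2000_dec keywords.
props = TableProperties(
    **{
        "obs_collection": "small_sky",
        "dataproduct_type": "object",
        "hats_nrows": 131,
        "hats_col_ra": "ra",
        "hats_col_dec": "dec",
    }
)
print(props.ra_column, props.dec_column)  # -> ra dec
```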
1 change: 0 additions & 1 deletion src/hats/io/__init__.py
@@ -7,7 +7,6 @@
write_parquet_metadata_for_batches,
)
from .paths import (
create_hive_directory_name,
get_common_metadata_pointer,
get_parquet_metadata_pointer,
get_partition_info_pointer,
12 changes: 8 additions & 4 deletions src/hats/io/parquet_metadata.py
@@ -104,8 +104,9 @@ def write_parquet_metadata(
]

catalog_path = get_upath(catalog_path)
dataset_subdir = catalog_path / "dataset"
(dataset_path, dataset) = file_io.read_parquet_dataset(
catalog_path,
dataset_subdir,
ignore_prefixes=ignore_prefixes,
exclude_invalid_files=True,
)
@@ -116,7 +117,7 @@

for single_file in dataset.files:
relative_path = single_file[len(dataset_path) + 1 :]
single_metadata = file_io.read_parquet_metadata(catalog_path / relative_path)
single_metadata = file_io.read_parquet_metadata(dataset_subdir / relative_path)

# Users must set the file path of each chunk before combining the metadata.
single_metadata.set_file_path(relative_path)
@@ -136,7 +137,8 @@
if order_by_healpix:
argsort = get_pixel_argsort(healpix_pixels)
metadata_collector = np.array(metadata_collector)[argsort]
catalog_base_dir = file_io.get_upath(output_path)
catalog_base_dir = get_upath(output_path)
file_io.make_directory(catalog_base_dir / "dataset", exist_ok=True)
metadata_file_pointer = paths.get_parquet_metadata_pointer(catalog_base_dir)
common_metadata_file_pointer = paths.get_common_metadata_pointer(catalog_base_dir)

@@ -166,9 +168,11 @@ def write_parquet_metadata_for_batches(batches: List[List[pa.RecordBatch]], outp
"""

with tempfile.TemporaryDirectory() as temp_pq_file:
temp_dataset_dir = get_upath(temp_pq_file) / "dataset"
temp_dataset_dir.mkdir()
for batch_list in batches:
temp_info_table = pa.Table.from_batches(batch_list)
pq.write_to_dataset(temp_info_table, temp_pq_file)
pq.write_to_dataset(temp_info_table, temp_dataset_dir)
return write_parquet_metadata(temp_pq_file, output_path=output_path)


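A short usage sketch of the updated behavior, with a hypothetical catalog path: leaf parquet files are now discovered under `<catalog>/dataset/`, and the combined metadata files are written into `<output>/dataset/`, which is created if it does not already exist.

```python
from hats.io.parquet_metadata import write_parquet_metadata

# Reads row-group metadata from every leaf file under
# /path/to/catalogs/small_sky/dataset/ and writes the combined
# _metadata and _common_metadata files back into that dataset/ directory.
write_parquet_metadata(
    "/path/to/catalogs/small_sky",
    output_path="/path/to/catalogs/small_sky",
)
```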
37 changes: 9 additions & 28 deletions src/hats/io/paths.py
@@ -21,6 +21,7 @@
MARGIN_DIR = "margin_Dir"
MARGIN_PIXEL = "margin_Npix"

DATASET_DIR = "dataset"
PARTITION_INFO_FILENAME = "partition_info.csv"
PARTITION_JOIN_INFO_FILENAME = "partition_join_info.csv"
PARQUET_METADATA_FILENAME = "_metadata"
@@ -39,7 +40,7 @@ def pixel_directory(
One of pixel_number or directory_number is required. The directory name will
take the HiPS standard form of::

<catalog_base_dir>/Norder=<pixel_order>/Dir=<directory number>
<catalog_base_dir>/dataset/Norder=<pixel_order>/Dir=<directory number>

Where the directory number is calculated using integer division as::

@@ -61,10 +62,9 @@
ndir = int(npix / 10_000) * 10_000
else:
raise ValueError("One of pixel_number or directory_number is required to create pixel directory")
return create_hive_directory_name(
catalog_base_dir,
[PARTITION_ORDER, PARTITION_DIR],
[norder, ndir],

return (
get_upath(catalog_base_dir) / DATASET_DIR / f"{PARTITION_ORDER}={norder}" / f"{PARTITION_DIR}={ndir}"
)


@@ -114,7 +114,7 @@ def pixel_catalog_files(
Returns (List[str]):
A list of paths to the pixels, in the same order as the input pixel list.
"""
catalog_base_dir = get_upath(catalog_base_dir)
catalog_base_dir = get_upath(catalog_base_dir) / DATASET_DIR
fs = catalog_base_dir.fs
base_path = str(catalog_base_dir)
if not base_path.endswith(fs.sep):
@@ -194,32 +194,13 @@ def pixel_catalog_file(

return (
catalog_base_dir
/ DATASET_DIR
/ f"{PARTITION_ORDER}={pixel.order}"
/ f"{PARTITION_DIR}={pixel.dir}"
/ f"{PARTITION_PIXEL}={pixel.pixel}.parquet{url_params}"
)


def create_hive_directory_name(base_dir, partition_token_names, partition_token_values):
"""Create path *pointer* for a directory with hive partitioning naming.
This will not create the directory.

The directory name will have the form of::

<catalog_base_dir>/<name_1>=<value_1>/.../<name_n>=<value_n>

Args:
catalog_base_dir (UPath): base directory of the catalog (includes catalog name)
partition_token_names (list[string]): list of partition name parts.
partition_token_values (list[string]): list of partition values that
correspond to the token name parts.
"""
partition_tokens = [
f"{name}={value}" for name, value in zip(partition_token_names, partition_token_values)
]
return get_upath(base_dir).joinpath(*partition_tokens)


def get_partition_info_pointer(catalog_base_dir: str | Path | UPath) -> UPath:
"""Get file pointer to `partition_info.csv` metadata file

@@ -239,7 +220,7 @@ def get_common_metadata_pointer(catalog_base_dir: str | Path | UPath) -> UPath:
Returns:
File Pointer to the catalog's `_common_metadata` file
"""
return get_upath(catalog_base_dir) / PARQUET_COMMON_METADATA_FILENAME
return get_upath(catalog_base_dir) / DATASET_DIR / PARQUET_COMMON_METADATA_FILENAME


def get_parquet_metadata_pointer(catalog_base_dir: str | Path | UPath) -> UPath:
@@ -250,7 +231,7 @@ def get_parquet_metadata_pointer(catalog_base_dir: str | Path | UPath) -> UPath:
Returns:
File Pointer to the catalog's `_metadata` file
"""
return get_upath(catalog_base_dir) / PARQUET_METADATA_FILENAME
return get_upath(catalog_base_dir) / DATASET_DIR / PARQUET_METADATA_FILENAME


def get_point_map_file_pointer(catalog_base_dir: str | Path | UPath) -> UPath:
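To show the effect on resolved paths, a small sketch using the helpers above. It assumes `HealpixPixel` is importable from `hats.pixel_math` (adjust the import if it lives elsewhere) and passes a `UPath` base directory; the example catalog name is hypothetical.

```python
from upath import UPath

from hats.io import paths
from hats.pixel_math import HealpixPixel  # assumed import location

base = UPath("/path/to/catalogs/small_sky_order1")

# Per-pixel files and the parquet metadata sidecars now resolve under dataset/.
print(paths.pixel_catalog_file(base, HealpixPixel(1, 44)))
# .../small_sky_order1/dataset/Norder=1/Dir=0/Npix=44.parquet
print(paths.get_parquet_metadata_pointer(base))
# .../small_sky_order1/dataset/_metadata

# partition_info.csv stays at the catalog root.
print(paths.get_partition_info_pointer(base))
# .../small_sky_order1/partition_info.csv
```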
6 changes: 3 additions & 3 deletions tests/conftest.py
@@ -125,7 +125,7 @@ def small_sky_schema() -> pa.Schema:
pa.field("Norder", pa.uint8()),
pa.field("Dir", pa.uint64()),
pa.field("Npix", pa.uint64()),
pa.field("_healpix_29", pa.uint64()),
pa.field("_healpix_29", pa.int64()),
]
)

@@ -146,7 +146,7 @@ def small_sky_source_schema() -> pa.Schema:
pa.field("Norder", pa.uint8()),
pa.field("Dir", pa.uint64()),
pa.field("Npix", pa.uint64()),
pa.field("_healpix_29", pa.uint64()),
pa.field("_healpix_29", pa.int64()),
]
)

@@ -175,7 +175,7 @@ def margin_catalog_schema() -> pa.Schema:
pa.field("Norder", pa.uint8()),
pa.field("Dir", pa.uint64()),
pa.field("Npix", pa.uint64()),
pa.field("_healpix_29", pa.uint64()),
pa.field("_healpix_29", pa.int64()),
pa.field("margin_Norder", pa.uint8()),
pa.field("margin_Dir", pa.uint64()),
pa.field("margin_Npix", pa.uint64()),
6 changes: 6 additions & 0 deletions tests/data/generate_data.ipynb
@@ -198,6 +198,12 @@
"import pandas as pd\n",
"from hats.catalog.association_catalog.partition_join_info import PartitionJoinInfo\n",
"\n",
"import os\n",
"\n",
"try:\n",
" os.mkdir(\"small_sky_to_small_sky_order1\")\n",
"except:\n",
" pass\n",
"join_pixels = pd.DataFrame.from_dict(\n",
" {\n",
" \"Norder\": [0, 0, 0, 0],\n",
4 changes: 2 additions & 2 deletions tests/data/info_only/catalog/properties
@@ -2,8 +2,8 @@
obs_collection=catalog
dataproduct_type=object
hats_nrows=10
hats_col_j2000_ra=ra
hats_col_j2000_dec=dec
hats_col_ra=ra
hats_col_dec=dec
hats_cols_default=psf mean_mag_r
hats_max_rows=1000
hats_order=0
4 changes: 2 additions & 2 deletions tests/data/info_only/dataset/properties
@@ -2,7 +2,7 @@
obs_collection=dataset
dataproduct_type=object
hats_nrows=10
hats_col_j2000_ra=ra
hats_col_j2000_dec=dec
hats_col_ra=ra
hats_col_dec=dec
hats_max_rows=1000
hats_order=0
4 changes: 2 additions & 2 deletions tests/data/info_only/margin_cache/properties
@@ -2,8 +2,8 @@
obs_collection=margin_cache
dataproduct_type=margin
hats_nrows=100
hats_col_j2000_ra=ra
hats_col_j2000_dec=dec
hats_col_ra=ra
hats_col_dec=dec
hats_primary_table_url=catalog
hats_margin_threshold=0.5
hats_max_rows=1000
Binary file removed tests/data/small_sky/Norder=0/Dir=0/Npix=11.parquet
Binary file removed tests/data/small_sky/_common_metadata
Binary file removed tests/data/small_sky/_metadata
Binary file added tests/data/small_sky/dataset/_common_metadata
Binary file added tests/data/small_sky/dataset/_metadata
10 changes: 5 additions & 5 deletions tests/data/small_sky/properties
@@ -2,13 +2,13 @@
obs_collection=small_sky
dataproduct_type=object
hats_nrows=131
hats_col_j2000_ra=ra
hats_col_j2000_dec=dec
hats_col_ra=ra
hats_col_dec=dec
hats_max_rows=1000000
hats_order=0
moc_sky_fraction=0.08333
hats_builder=hats-import v0.3.6.dev22+g6346e4d
hats_creation_date=2024-09-30T13\:32UTC
hats_estsize=98328
hats_builder=hats-import v0.3.6.dev26+g40366b4
hats_creation_date=2024-10-11T18\:18UTC
hats_estsize=49177
hats_release_date=2024-09-18
hats_version=v0.1
Binary file removed tests/data/small_sky_order1/_common_metadata
Binary file removed tests/data/small_sky_order1/_metadata
Binary file added tests/data/small_sky_order1/dataset/_metadata
8 changes: 4 additions & 4 deletions tests/data/small_sky_order1/properties
@@ -2,13 +2,13 @@
obs_collection=small_sky_order1
dataproduct_type=object
hats_nrows=131
hats_col_j2000_ra=ra
hats_col_j2000_dec=dec
hats_col_ra=ra
hats_col_dec=dec
hats_max_rows=1000000
hats_order=1
moc_sky_fraction=0.08333
hats_builder=hats-import v0.3.6.dev22+g6346e4d
hats_creation_date=2024-09-30T13\:32UTC
hats_builder=hats-import v0.3.6.dev26+g40366b4
hats_creation_date=2024-10-11T18\:18UTC
hats_estsize=48
hats_release_date=2024-09-18
hats_version=v0.1
4 changes: 2 additions & 2 deletions tests/data/small_sky_order1_id_index/properties
@@ -4,8 +4,8 @@ dataproduct_type=index
hats_nrows=131
hats_primary_table_url=small_sky
hats_index_column=id
hats_builder=hats-import v0.3.6.dev22+g6346e4d
hats_creation_date=2024-10-01T13\:05UTC
hats_builder=hats-import v0.3.6.dev26+g40366b4
hats_creation_date=2024-10-11T18\:18UTC
hats_estsize=8
hats_release_date=2024-09-18
hats_version=v0.1
Binary file removed tests/data/small_sky_order1_margin/_common_metadata
Binary file removed tests/data/small_sky_order1_margin/_metadata
10 changes: 5 additions & 5 deletions tests/data/small_sky_order1_margin/properties
@@ -2,14 +2,14 @@
obs_collection=small_sky_order1_margin
dataproduct_type=margin
hats_nrows=28
hats_col_j2000_ra=ra
hats_col_j2000_dec=dec
hats_col_ra=ra
hats_col_dec=dec
hats_primary_table_url=small_sky_order1
hats_margin_threshold=7200.0
hats_order=1
moc_sky_fraction=0.16667
hats_builder=hats-import v0.3.6.dev22+g6346e4d
hats_creation_date=2024-09-30T13\:32UTC
hats_estsize=57
hats_builder=hats-import v0.3.6.dev26+g40366b4
hats_creation_date=2024-10-11T18\:18UTC
hats_estsize=58
hats_release_date=2024-09-18
hats_version=v0.1
Binary file removed tests/data/small_sky_source/_common_metadata
10 changes: 5 additions & 5 deletions tests/data/small_sky_source/properties
@@ -2,13 +2,13 @@
obs_collection=small_sky_source
dataproduct_type=source
hats_nrows=17161
hats_col_j2000_ra=source_ra
hats_col_j2000_dec=source_dec
hats_col_ra=source_ra
hats_col_dec=source_dec
hats_max_rows=3000
hats_order=2
moc_sky_fraction=0.16667
hats_builder=hats-import v0.3.6.dev22+g6346e4d
hats_creation_date=2024-09-30T13\:33UTC
hats_estsize=99328
hats_builder=hats-import v0.3.6.dev26+g40366b4
hats_creation_date=2024-10-11T18\:19UTC
hats_estsize=50207
hats_release_date=2024-09-18
hats_version=v0.1
6 changes: 3 additions & 3 deletions tests/data/small_sky_source_object_index/properties
@@ -4,8 +4,8 @@ dataproduct_type=index
hats_nrows=148
hats_primary_table_url=small_sky_source
hats_index_column=object_id
hats_builder=hats-import v0.3.6.dev22+g6346e4d
hats_creation_date=2024-09-30T13\:44UTC
hats_estsize=10
hats_builder=hats-import v0.3.6.dev26+g40366b4
hats_creation_date=2024-10-11T18\:19UTC
hats_estsize=9
hats_release_date=2024-09-18
hats_version=v0.1