
TDSCatalog datasets property omits last element #309

Closed
albertotb opened this issue Jul 28, 2020 · 8 comments

@albertotb

I have a self-hosted OPeNDAP server, and I noticed that whenever I run the following (VAR is a subdirectory):

import siphon.catalog
cat = siphon.catalog.TDSCatalog(f'{BASE_URL}/catalog.xml')
subdir = cat.catalog_refs[VAR].follow()
print(subdir.datasets)

the last file of the subdirectory is missing. I tried with several subdirectories and always got the same result. I would be happy to try to reproduce it if you know of any public OPeNDAP server; unfortunately, I cannot share mine. In my example, I get:

['2020-07-22.nc', '2020-07-23.nc', '2020-07-24.nc', '2020-07-25.nc', '2020-07-26.nc', '2020-07-27.nc', '2020-07-28.nc', '2020-07-29.nc', '2020-07-30.nc', '2020-07-31.nc']

however, the server also contains the file 2020-08-01.nc (see the attached screenshot).
[screenshot: directory contents, showing 2020-08-01.nc as the last entry]

  • Platform: Linux
  • Versions:
    • Python 3.7.6
    • siphon 0.8.0
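
(For what it's worth, one way to confirm the discrepancy independently of siphon is to count the dataset entries in the raw XML and compare. The sketch below is illustrative: BASE_URL is a placeholder rather than the real server, and the raw count can also pick up container datasets, so treat it as approximate.)

import urllib.request
import xml.etree.ElementTree as ET

import siphon.catalog

BASE_URL = 'https://example.com/opendap/hyrax/Waves'  # placeholder, not the real server
url = f'{BASE_URL}/catalog.xml'

# Count <dataset> elements in the raw XML (standard THREDDS catalog namespace).
NS = '{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}'
with urllib.request.urlopen(url) as resp:
    root = ET.fromstring(resp.read())
raw_names = [el.get('name') for el in root.iter(NS + 'dataset')]

# Compare with what siphon exposes; when the bug triggers, siphon reports one fewer.
cat = siphon.catalog.TDSCatalog(url)
print(len(raw_names), len(cat.datasets))
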
@dopplershift
Member

Well, this is a weird one. I filled in your sample code with appropriate entries for GOES-16 satellite products from our THREDDS demonstration server:

import siphon.catalog
BASE_URL = 'https://thredds.ucar.edu/thredds/catalog/satellite/goes/east/grb/ABI/FullDisk/Channel02/'
VAR = '20200804'
cat = siphon.catalog.TDSCatalog(f'{BASE_URL}/catalog.xml')
subdir = cat.catalog_refs[VAR].follow()
print(subdir.datasets)

and I got back 30 datasets, which matched what I counted on the server (at the time of writing).

Since you can't share the server, can you at least download and share the full catalog.xml? That would help debug and give us something to work from. You should be able to attach it in a comment on this issue.

@albertotb
Author

It is indeed very weird; I was also unable to replicate it on https://data.nodc.noaa.gov/thredds/catalog.html. I have attached the two full catalog.xml files:

  • catalog.txt: the main catalog, which I open with cat = siphon.catalog.TDSCatalog(f'{BASE_URL}/catalog.xml')
  • waves_catalog.txt: the catalog reached by following one of the references; this is where the last file goes missing

Note: I renamed the files to .txt, since GitHub does not support attaching .xml files.

@dopplershift
Member

That's great. Let me dig in and see what I find...

@dopplershift
Member

OK, so I have an idea about the problem; I'm just not quite sure what the right fix is yet. The dataset elements here look like:

     <thredds:dataset name="2020-08-08.nc" ID="/opendap/hyrax/Waves/2020-08-08.nc">
         <thredds:dataSize units="bytes">11226431</thredds:dataSize>
         <thredds:date type="modified">2020-08-04T10:02:10Z</thredds:date>
         <thredds:access serviceName="dap" urlPath="/Waves/2020-08-08.nc"/>
     </thredds:dataset>
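
(As an aside, here is a minimal sketch of how this case can be recognized, assuming the standard THREDDS catalog namespace: the dataset element itself has no urlPath attribute, and the access information lives on a child element.)

import xml.etree.ElementTree as ET

NS = '{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}'
snippet = (
    '<dataset xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"'
    ' name="2020-08-08.nc" ID="/opendap/hyrax/Waves/2020-08-08.nc">'
    '<access serviceName="dap" urlPath="/Waves/2020-08-08.nc"/>'
    '</dataset>'
)
ds = ET.fromstring(snippet)
print('urlPath' in ds.attrib)              # False: no urlPath on the dataset itself
print(ds.find(NS + 'access') is not None)  # True: the access element carries it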

Because the dataset carries its access method on a child element and has no urlPath of its own, parsing goes through the previous-dataset bookkeeping in this loop:

siphon/siphon/catalog.py, lines 313 to 325 at bd972b4:

    if tag_type == 'dataset':
        current_dataset = child.attrib['name']
        self._process_dataset(child)
        if previous_dataset:
            # see if the previously processed dataset has access elements as children
            # if so, these datasets need to be processed specially when making
            # access_urls
            if self.datasets[previous_dataset].access_element_info:
                self.ds_with_access_elements_to_process.append(previous_dataset)
        previous_dataset = current_dataset

which never appends the last dataset to ds_with_access_elements_to_process, so it gets popped here:

siphon/siphon/catalog.py, lines 377 to 390 at bd972b4:

    def _process_datasets(self):
        # Need to use list (of keys) because we modify the dict while iterating
        for ds_name in list(self.datasets):
            # check to see if dataset needs to have access urls created, if not,
            # remove the dataset
            has_url_path = self.datasets[ds_name].url_path is not None
            is_ds_with_access_elements_to_process = \
                ds_name in self.ds_with_access_elements_to_process
            if has_url_path or is_ds_with_access_elements_to_process:
                self.datasets[ds_name].make_access_urls(
                    self.base_tds_url, self.services, metadata=self.metadata)
            else:
                self.datasets.pop(ds_name)
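
To make the off-by-one concrete, here is a minimal, self-contained sketch (illustrative names, not siphon's actual API) of why the final dataset never lands in ds_with_access_elements_to_process; the commented-out flush is one possible shape of a fix, not necessarily what siphon will end up doing:

def collect_access_datasets(datasets):
    """Mimic the loop above: each dataset is only examined for access
    elements on the iteration *after* it was processed."""
    needs_processing = []
    previous = None
    for name, has_access_elements in datasets:
        if previous is not None and previous[1]:
            needs_processing.append(previous[0])
        previous = (name, has_access_elements)
    # Possible fix (an assumption): flush the final dataset after the loop:
    #   if previous is not None and previous[1]:
    #       needs_processing.append(previous[0])
    return needs_processing

print(collect_access_datasets([('2020-07-31.nc', True), ('2020-08-01.nc', True)]))
# Prints ['2020-07-31.nc'] -- the last file never makes the list, so
# _process_datasets later pops it.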

I need to check with @lesserwhirls on why we do all that previous_dataset stuff.

@dopplershift
Member

See also: #753 (comment)

@jm-cook

jm-cook commented Dec 22, 2023

This problem is a show-stopper for us. Unfortunately, we will need to find a different solution.

@dopplershift
Member

@jm-cook I'll try to take a look as soon as I can, but it will likely be in 2024 since the holidays are upon us. Always happy to have community contributions if someone digs into this.

@dopplershift dopplershift added this to the 0.10 milestone Nov 11, 2024
@dopplershift
Member

Pretty sure this was fixed by #760, which will be included in the imminent 0.10 release. Feel free to re-open if you find that's not the case.
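
(Once 0.10 is out, a quick way to confirm the installed version before re-testing; this assumes siphon exposes __version__, as current releases do:)

import siphon
print(siphon.__version__)  # the fix should be present in 0.10 and later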
