
TDSCatalog datasets property omits last element #309

Closed
albertotb opened this issue Jul 28, 2020 · 8 comments

@albertotb

I have a self-hosted OPeNDAP server, and I noticed that whenever I run the following (VAR is a subdirectory):

import siphon.catalog
cat = siphon.catalog.TDSCatalog(f'{BASE_URL}/catalog.xml')
subdir = cat.catalog_refs[VAR].follow()
print(subdir.datasets)

the last file of the subdirectory is missing. I tried with several subdirectories and always got the same result. I would be happy to try to reproduce it if you know of any public OPeNDAP server; unfortunately, I cannot share mine. In my example, I get:

['2020-07-22.nc', '2020-07-23.nc', '2020-07-24.nc', '2020-07-25.nc', '2020-07-26.nc', '2020-07-27.nc', '2020-07-28.nc', '2020-07-29.nc', '2020-07-30.nc', '2020-07-31.nc']

however, the server also contains the file 2020-08-01.nc (see the attached screenshot).
[screenshot: directory contents, showing 2020-08-01.nc as the last entry]

  • Platform: Linux
  • Versions:
    • Python 3.7.6
    • siphon 0.8.0
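
(For what it's worth, one way to confirm the discrepancy independently of siphon is to count the dataset entries in the raw XML and compare. The sketch below is illustrative: BASE_URL is a placeholder rather than the real server, and the raw count can also pick up container datasets, so treat it as approximate.)

import urllib.request
import xml.etree.ElementTree as ET

import siphon.catalog

BASE_URL = 'https://example.com/opendap/hyrax/Waves'  # placeholder, not the real server
url = f'{BASE_URL}/catalog.xml'

# Count <dataset> elements in the raw XML (standard THREDDS catalog namespace).
NS = '{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}'
with urllib.request.urlopen(url) as resp:
    root = ET.fromstring(resp.read())
raw_names = [el.get('name') for el in root.iter(NS + 'dataset')]

# Compare with what siphon exposes; when the bug triggers, siphon reports one fewer.
cat = siphon.catalog.TDSCatalog(url)
print(len(raw_names), len(cat.datasets))
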
@dopplershift
Member

Well, this is a weird one. I filled in your sample code with appropriate entries for GOES-16 satellite products from our THREDDS demonstration server:

import siphon.catalog
BASE_URL = 'https://thredds.ucar.edu/thredds/catalog/satellite/goes/east/grb/ABI/FullDisk/Channel02/'
VAR = '20200804'
cat = siphon.catalog.TDSCatalog(f'{BASE_URL}/catalog.xml')
subdir = cat.catalog_refs[VAR].follow()
print(subdir.datasets)

and I got back 30 datasets, which matched what I counted on the server (at the time of writing).

Since you can't share the server, can you at least download and share the full catalog.xml? That would help debug and give us something to work from. You should be able to attach it in a comment on this issue.

@albertotb
Author

It is indeed very weird; I was also unable to replicate it on https://data.nodc.noaa.gov/thredds/catalog.html. I have attached the two full catalog.xml files:

  • catalog.txt: the main catalog, which I open with cat = siphon.catalog.TDSCatalog(f'{BASE_URL}/catalog.xml')
  • waves_catalog.txt: the catalog reached by following one of the references; this is where the last file goes missing

Note: I renamed the files to .txt, since GitHub does not support attaching .xml files.

@dopplershift
Member

That's great. Let me dig in and see what I find...

@dopplershift
Member

OK, so I have an idea about the problem; I'm just not quite sure what the right fix is yet. The dataset elements here look like:

     <thredds:dataset name="2020-08-08.nc" ID="/opendap/hyrax/Waves/2020-08-08.nc">
         <thredds:dataSize units="bytes">11226431</thredds:dataSize>
         <thredds:date type="modified">2020-08-04T10:02:10Z</thredds:date>
         <thredds:access serviceName="dap" urlPath="/Waves/2020-08-08.nc"/>
     </thredds:dataset>
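
(As an aside, here is a minimal sketch of how this case can be recognized, assuming the standard THREDDS catalog namespace: the dataset element itself has no urlPath attribute, and the access information lives on a child element.)

import xml.etree.ElementTree as ET

NS = '{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}'
snippet = (
    '<dataset xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"'
    ' name="2020-08-08.nc" ID="/opendap/hyrax/Waves/2020-08-08.nc">'
    '<access serviceName="dap" urlPath="/Waves/2020-08-08.nc"/>'
    '</dataset>'
)
ds = ET.fromstring(snippet)
print('urlPath' in ds.attrib)              # False: no urlPath on the dataset itself
print(ds.find(NS + 'access') is not None)  # True: the access element carries it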

Because the dataset carries its access method on a child element and has no urlPath of its own, parsing goes through the previous-dataset bookkeeping in this loop:

siphon/siphon/catalog.py, lines 313 to 325 at bd972b4:

    if tag_type == 'dataset':
        current_dataset = child.attrib['name']
        self._process_dataset(child)
        if previous_dataset:
            # see if the previously processed dataset has access elements as children
            # if so, these datasets need to be processed specially when making
            # access_urls
            if self.datasets[previous_dataset].access_element_info:
                self.ds_with_access_elements_to_process.append(previous_dataset)
        previous_dataset = current_dataset

which never appends the last dataset to ds_with_access_elements_to_process, so it gets popped here:

siphon/siphon/catalog.py, lines 377 to 390 at bd972b4:

    def _process_datasets(self):
        # Need to use list (of keys) because we modify the dict while iterating
        for ds_name in list(self.datasets):
            # check to see if dataset needs to have access urls created, if not,
            # remove the dataset
            has_url_path = self.datasets[ds_name].url_path is not None
            is_ds_with_access_elements_to_process = \
                ds_name in self.ds_with_access_elements_to_process
            if has_url_path or is_ds_with_access_elements_to_process:
                self.datasets[ds_name].make_access_urls(
                    self.base_tds_url, self.services, metadata=self.metadata)
            else:
                self.datasets.pop(ds_name)
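
To make the off-by-one concrete, here is a minimal, self-contained sketch (illustrative names, not siphon's actual API) of why the final dataset never lands in ds_with_access_elements_to_process; the commented-out flush is one possible shape of a fix, not necessarily what siphon will end up doing:

def collect_access_datasets(datasets):
    """Mimic the loop above: each dataset is only examined for access
    elements on the iteration *after* it was processed."""
    needs_processing = []
    previous = None
    for name, has_access_elements in datasets:
        if previous is not None and previous[1]:
            needs_processing.append(previous[0])
        previous = (name, has_access_elements)
    # Possible fix (an assumption): flush the final dataset after the loop:
    #   if previous is not None and previous[1]:
    #       needs_processing.append(previous[0])
    return needs_processing

print(collect_access_datasets([('2020-07-31.nc', True), ('2020-08-01.nc', True)]))
# Prints ['2020-07-31.nc'] -- the last file never makes the list, so
# _process_datasets later pops it.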

I need to check with @lesserwhirls on why we do all that previous_dataset stuff.

@dopplershift
Member

See also: #753 (comment)

@jm-cook

jm-cook commented Dec 22, 2023

This problem is a show-stopper for us. Unfortunately, we will need to find a different solution.

@dopplershift
Member

@jm-cook I'll try to take a look as soon as I can, but it will likely be in 2024 since the holidays are upon us. Always happy to have community contributions if someone digs into this.

@dopplershift dopplershift added this to the 0.10 milestone Nov 11, 2024
@dopplershift
Member

Pretty sure this was fixed by #760, which will be included in the imminent 0.10 release. Feel free to re-open if you find that's not the case.
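
(Once 0.10 is out, a quick way to confirm the installed version before re-testing; this assumes siphon exposes __version__, as current releases do:)

import siphon
print(siphon.__version__)  # the fix should be present in 0.10 and later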
