ops.readers.Numpy how to return the filename #5790

rachelglenn · 2025-01-17T13:50:22Z

Describe the question.

Hi, is there a way to return the filename of the loaded data, or otherwise get that information, when shuffle is turned on? Thanks in advance.

Check for duplicates

I have searched the open bugs/issues and have found no duplicates for this bug report

JanuszL · 2025-01-17T13:55:40Z

Hi @rachelglenn,

Thank you for reaching out.
Have you tried the source_info method for the sample in the output batch?

o = pipe.run()
print(o[0][0].source_info())
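
A slightly fuller sketch of the same idea, assuming a CPU-only pipeline built with `fn.readers.numpy`. The helper names `filenames_from_batch` and `shuffled_filenames` are made up for illustration, and DALI is imported inside the function so the first helper can be used on its own:

```python
def filenames_from_batch(batch):
    """Collect source_info() from every sample of one pipeline output."""
    return [batch[i].source_info() for i in range(len(batch))]


def shuffled_filenames(files, batch_size=2, seed=42):
    """Run one batch through a shuffling Numpy reader and return its file names."""
    from nvidia.dali import fn, pipeline_def  # requires DALI to be installed

    @pipeline_def(batch_size=batch_size, num_threads=2, device_id=None)
    def numpy_pipeline():
        # random_shuffle changes the read order; source_info still travels
        # with each sample, so the file name can be recovered afterwards
        return fn.readers.numpy(files=files, random_shuffle=True, seed=seed)

    pipe = numpy_pipeline()
    pipe.build()
    (data,) = pipe.run()  # one TensorList per pipeline output
    return filenames_from_batch(data)
```

Calling `shuffled_filenames(["file1.npy", "file2.npy", "file3.npy"])` should return the names of the files that ended up in the first batch, in batch order.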

rachelglenn · 2025-01-19T13:50:48Z

I build my pipeline with a graph and have tried both DALIGenericIterator and DALIRaggedIterator as the iterator, but I am not able to get the filename. I have tried following this issue:

```python
import nvidia.dali as dali
from nvidia.dali.plugin.pytorch import DALIGenericIterator


# Define the DALI pipeline
class NumpyReaderPipeline(dali.Pipeline):
    def __init__(self, batch_size, num_threads, device_id, files, seed, shuffle, shard_id, num_shards):
        super(NumpyReaderPipeline, self).__init__(batch_size, num_threads, device_id)
        self.files = files
        self.seed = seed
        self.shuffle = shuffle
        self.shard_id = shard_id
        self.num_shards = num_shards

        # Define the Numpy reader operator
        self.reader = dali.ops.readers.Numpy(
            seed=self.seed,
            files=self.files,
            device="cpu",
            read_ahead=True,
            shard_id=self.shard_id,
            pad_last_batch=True,
            num_shards=self.num_shards,
            dont_use_mmap=True,
            shuffle_after_epoch=self.shuffle,
        )

    def define_graph(self):
        # Get the data from the reader
        data = self.reader()
        # Get the source_info for filenames (runtime property)
        source_info = dali.fn.get_property(data, "source_info")
        return data, source_info


# Define input parameters
files = ["file1.npy", "file2.npy", "file3.npy"]  # Example file paths
batch_size = 2
num_threads = 2
device_id = 0
seed = 42
shuffle = True
shard_id = 0
num_shards = 1

# Create the pipeline
pipe = NumpyReaderPipeline(batch_size=batch_size, num_threads=num_threads, device_id=device_id,
                           files=files, seed=seed, shuffle=shuffle, shard_id=shard_id, num_shards=num_shards)

# Build the pipeline
pipe.build()

# Create a DALI Generic Iterator
iterator = DALIGenericIterator([pipe], output_map=["data", "source_info"], auto_reset=True)

# Run the pipeline and print the data and source info (filenames)
for data in iterator:
    images = data[0]["data"]  # Get the image data from the iterator
    source_info = data[0]["source_info"]  # Get the filenames (source info)

    print("Batch of images:", images)
    print("Source info (filenames):", source_info)
```
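
If a `source_info` output does reach Python, it is likely to arrive as a sequence of byte values rather than a string. That is an assumption about how the property is exposed, not something confirmed in this thread. A minimal decoding sketch, with `decode_source_info` as a hypothetical helper name:

```python
def decode_source_info(byte_values):
    """Turn a sequence of byte values (ints, uint8 scalars) into a file name."""
    raw = bytes(int(b) for b in byte_values)
    # Strip trailing zero bytes in case the batch was padded to a common
    # length (hypothetical; only relevant if padding was applied upstream)
    return raw.rstrip(b"\x00").decode("utf-8")
```

For example, `decode_source_info(list(b"file1.npy"))` gives back `"file1.npy"`.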

JanuszL · 2025-01-19T20:11:59Z

I'm afraid `source_info` is a method of the DALI tensor, not of the Torch tensor that the iterator returns.
Another option you can try in this case is the external source operator: for each file read, also return a unique numerical ID that can be mapped back to the file name.
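
A rough sketch of that idea, assuming a per-sample external source callback and a CPU-only pipeline. The names `epoch_order`, `ids_to_filenames`, and `build_pipeline` are hypothetical, and the shuffling here is done in Python rather than by a DALI reader:

```python
import random


def epoch_order(num_files, epoch_idx, seed=42, shuffle=True):
    """Deterministic permutation of file indices for the given epoch."""
    order = list(range(num_files))
    if shuffle:
        random.Random(seed + epoch_idx).shuffle(order)
    return order


def ids_to_filenames(ids, files):
    """Map the numerical IDs returned by the pipeline back to file names."""
    return [files[int(i)] for i in ids]


def build_pipeline(files, batch_size=2, seed=42):
    # Imported here because they require NumPy and DALI to be installed
    import numpy as np
    from nvidia.dali import fn, pipeline_def

    full_iterations = len(files) // batch_size

    def load_sample(sample_info):
        # End the epoch once all full batches have been produced
        if sample_info.iteration >= full_iterations:
            raise StopIteration
        # Recomputed per sample for simplicity; cache it for large datasets
        order = epoch_order(len(files), sample_info.epoch_idx, seed)
        idx = order[sample_info.idx_in_epoch]
        # Return the array together with the ID of the file it came from
        return np.load(files[idx]), np.array([idx], dtype=np.int64)

    @pipeline_def(batch_size=batch_size, num_threads=2, device_id=None)
    def pipe():
        data, file_id = fn.external_source(
            source=load_sample, num_outputs=2, batch=False
        )
        return data, file_id

    return pipe()
```

The `file_id` output is a plain integer tensor, so it should pass through the PyTorch iterator like any other output; something like `ids_to_filenames(batch[0]["file_id"].flatten().tolist(), files)` would then recover the names on the Python side.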

rachelglenn · 2025-01-19T23:12:20Z

Thank you for the help and the possible workaround. Do you know of any examples that use the external source operator?

JanuszL · 2025-01-20T07:30:53Z

@rachelglenn have you checked this example?

rachelglenn added the question Further information is requested label Jan 17, 2025

dali-automaton assigned JanuszL Jan 17, 2025
