ops.readers.Numpy how to return the filename #5790

Open
1 task done
rachelglenn opened this issue Jan 17, 2025 · 5 comments
Labels
question Further information is requested

Comments

@rachelglenn

Describe the question.

Hi, is there a way to return the filename of the loaded data, or some other way to get that information, when shuffle is turned on? Thanks in advance.

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
@rachelglenn rachelglenn added the question Further information is requested label Jan 17, 2025
@JanuszL
Contributor

JanuszL commented Jan 17, 2025

Hi @rachelglenn,

Thank you for reaching out.
Have you tried the source_info method for the sample in the output batch?

o = pipe.run()
print(o[0][0].source_info())
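As a rough sketch of the suggestion above, the same call can be applied to every sample in an output batch rather than only the first one. The helper name `batch_source_info` and the `basename_only` flag are made up for illustration; the only DALI call relied on is the `source_info()` method shown above, and the batch is assumed to be a DALI tensor list as returned by `pipe.run()`, not a Torch tensor.

```python
import os


def batch_source_info(tensor_list, basename_only=False):
    """Collect source_info() for every sample in one DALI output batch.

    `tensor_list` is assumed to support len() and integer indexing,
    as the batch objects returned by pipe.run() do.
    """
    infos = [tensor_list[i].source_info() for i in range(len(tensor_list))]
    if basename_only:
        # Keep only the file name, dropping the directory part.
        infos = [os.path.basename(path) for path in infos]
    return infos


# Usage (requires DALI and an already built pipeline `pipe`):
# out = pipe.run()
# print(batch_source_info(out[0]))
```

Because the order is read from the samples themselves, this should keep working when the reader shuffles.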

@rachelglenn
Author

rachelglenn commented Jan 19, 2025

I build my pipeline with a graph and have tried both DALIGenericIterator and DALIRaggedIterator as the iterator, but I am not able to get the filename. I have tried following this issue:

```python
import nvidia.dali as dali
from nvidia.dali.plugin.pytorch import DALIGenericIterator


# Define the DALI pipeline
class NumpyReaderPipeline(dali.Pipeline):
    def __init__(self, batch_size, num_threads, device_id, files, seed, shuffle, shard_id, num_shards):
        super(NumpyReaderPipeline, self).__init__(batch_size, num_threads, device_id)
        self.files = files
        self.seed = seed
        self.shuffle = shuffle
        self.shard_id = shard_id
        self.num_shards = num_shards

        # Define the Numpy reader operator
        self.reader = dali.ops.readers.Numpy(
            seed=self.seed,
            files=self.files,
            device="cpu",
            read_ahead=True,
            shard_id=self.shard_id,
            pad_last_batch=True,
            num_shards=self.num_shards,
            dont_use_mmap=True,
            shuffle_after_epoch=self.shuffle,
        )

    def define_graph(self):
        # Get the data from the reader
        data = self.reader()
        # Get the source_info for filenames (runtime property)
        source_info = dali.fn.get_property(data, "source_info")
        return data, source_info


# Define input parameters
files = ["file1.npy", "file2.npy", "file3.npy"]  # Example file paths
batch_size = 2
num_threads = 2
device_id = 0
seed = 42
shuffle = True
shard_id = 0
num_shards = 1

# Create the pipeline
pipe = NumpyReaderPipeline(batch_size=batch_size, num_threads=num_threads, device_id=device_id,
                           files=files, seed=seed, shuffle=shuffle, shard_id=shard_id, num_shards=num_shards)

# Build the pipeline
pipe.build()

# Create a DALI Generic Iterator
iterator = DALIGenericIterator([pipe], output_map=["data", "source_info"], auto_reset=True)

# Run the pipeline and print the data and source info (filenames)
for data in iterator:
    images = data[0]["data"]  # Get the image data from the iterator
    source_info = data[0]["source_info"]  # Get the filenames (source info)

    print("Batch of images:", images)
    print("Source info (filenames):", source_info)
```

@JanuszL
Contributor

JanuszL commented Jan 19, 2025

I'm afraid source_info is a method of the DALI tensor, not of the Torch tensor that the iterator returns.
As an alternative, you could use the external source operator and, for each file it reads, also return a unique numerical ID that can be mapped back to the file name.
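A minimal sketch of that idea follows. It is not taken from the DALI examples: the class `NumpyFileSource`, the function `build_pipeline`, and the per-epoch reseeding scheme are all assumptions made for illustration. The callback is written for the per-sample mode of `fn.external_source` (`batch=False`), where it receives a sample-info object with `idx_in_epoch` and `epoch_idx` fields, and it returns the loaded array together with the index of the file in the original list.

```python
import numpy as np


class NumpyFileSource:
    """Per-sample callback for fn.external_source: returns (array, file ID)."""

    def __init__(self, files, batch_size, seed=42, shuffle=True):
        self.files = list(files)
        self.seed = seed
        self.shuffle = shuffle
        # Serve only full batches so StopIteration is never raised mid-batch.
        # Assumption: dropping the trailing partial batch is acceptable.
        self.samples_per_epoch = (len(self.files) // batch_size) * batch_size

    def __call__(self, sample_info):
        if sample_info.idx_in_epoch >= self.samples_per_epoch:
            raise StopIteration
        order = np.arange(len(self.files))
        if self.shuffle:
            # Reseed per epoch: reproducible, but a different order each epoch.
            rng = np.random.default_rng(self.seed + sample_info.epoch_idx)
            rng.shuffle(order)
        file_id = int(order[sample_info.idx_in_epoch])
        data = np.load(self.files[file_id])
        # The ID is a fixed-shape tensor, so it survives conversion to Torch.
        return data, np.array([file_id], dtype=np.int64)


def build_pipeline(source, batch_size, num_threads=2, device_id=0):
    # DALI is imported lazily so the callback above can be tried without it.
    from nvidia.dali import fn, pipeline_def

    @pipeline_def(batch_size=batch_size, num_threads=num_threads, device_id=device_id)
    def pipe():
        data, file_id = fn.external_source(source=source, num_outputs=2, batch=False)
        return data, file_id

    return pipe()
```

If the pipeline is wrapped in an iterator with an output map such as `["data", "file_id"]`, each batch should carry the IDs as an ordinary tensor, and `source.files[int(i)]` maps each ID back to its path. Recomputing the permutation on every call keeps the sketch stateless; for large file lists it would be worth caching it per epoch.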

@rachelglenn
Author

Thank you for the help and the possible workaround. Do you know of any examples that use the external source operator?

@JanuszL
Contributor

JanuszL commented Jan 20, 2025

@rachelglenn have you checked this example?
