Add ImageNet example #555

Open. mkuchnik wants to merge 1 commit into main from imagenet_folder_recipe

mkuchnik
Contributor

ImageNet is a common workload for benchmarking. This PR adds a recipe for running ImageNet in PyTorch with a Croissant loader, which can be useful for both system and ML characterization.

Current WIP status is:

  • ImageFolder to Croissant prototype.
  • Functional equivalence of Croissant loader with validation loader.
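
One way to check the functional-equivalence item above is to walk both loaders in lockstep and report the first record that differs. The following is a minimal sketch and not code from this PR: `first_mismatch` and the two example lists are hypothetical stand-ins, and records are assumed to be comparable with `==` (tensors would need an element-wise comparison instead).

```python
from itertools import zip_longest
from typing import Any, Iterable, Optional

# Sentinel marking that one stream ended before the other.
_MISSING = object()


def first_mismatch(reference: Iterable[Any], candidate: Iterable[Any]) -> Optional[int]:
    """Return the index of the first differing record, or None if the streams match."""
    pairs = zip_longest(reference, candidate, fillvalue=_MISSING)
    for index, (ref, cand) in enumerate(pairs):
        if ref is _MISSING or cand is _MISSING or ref != cand:
            return index
    return None


# Hypothetical (relative path, class index) records from each loader.
reference_loader = [("val/n01440764/a.JPEG", 0), ("val/n01443537/b.JPEG", 1)]
croissant_loader = [("val/n01440764/a.JPEG", 0), ("val/n01443537/b.JPEG", 1)]

mismatch = first_mismatch(reference_loader, croissant_loader)
```

A result of `None` means both loaders produced the same records in the same order; a length difference is reported at the index where the shorter stream ends.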

TODO:

  • Add training loader with shuffling.
  • Optimize system performance (e.g., startup of the Croissant validation loader takes ~30 seconds, compared with near-instant startup for the original loader).

@mkuchnik added the enhancement (New feature or request) and WIP (work in process) labels Feb 23, 2024
@mkuchnik requested a review from a team as a code owner February 23, 2024 02:20

github-actions bot commented Feb 23, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@mkuchnik force-pushed the imagenet_folder_recipe branch 2 times, most recently from 874f0b8 to 6b8c88f, February 23, 2024 03:15
@mkuchnik force-pushed the imagenet_folder_recipe branch from 6b8c88f to 864ac56, February 23, 2024 03:23

@mkuchnik
Contributor Author

@marcenacp Thoughts on the structure so far? I'm not sure whether there is a better way to read a local directory; I essentially emulate a zip/tar archive by creating a no-op "extract" function. I plan to simplify this further to avoid the symbolic links. I am also wondering whether you have any thoughts on how to add shuffling/sharding of rows in mlcroissant, which would help with various node- and worker-level parallelism strategies.
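
To make the shuffling/sharding question concrete, here is a minimal sketch of both operations applied to a generic stream of rows. It does not use mlcroissant, and none of the names are part of that library: `shard` and `buffered_shuffle` are hypothetical helpers. Sharding is round-robin by row index, and shuffling uses a fixed-size buffer, so it is approximate rather than a full permutation.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shard(records: Iterable[T], shard_index: int, num_shards: int) -> Iterator[T]:
    """Yield every num_shards-th record, starting at position shard_index."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    return islice(records, shard_index, None, num_shards)


def buffered_shuffle(records: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size records."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for record in records:
        if len(buffer) < buffer_size:
            buffer.append(record)
            continue
        # Buffer is full: emit a random buffered record and store the new one in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = record
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


# Hypothetical usage: worker 1 of 2 reads its shard of ten rows, then shuffles it.
rows = range(10)
worker_rows = list(buffered_shuffle(shard(rows, shard_index=1, num_shards=2), buffer_size=4, seed=0))
```

Each worker (or node) would pass its own `shard_index`, so the shards are disjoint and together cover every row. A larger `buffer_size` gives a better shuffle at the cost of memory and startup latency.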

return source.exists() and source.is_dir()


def _soft_link(source: epath.Path, target: epath.Path) -> None:
Contributor

Why do you need this function if _soft_link == os.symlink?

@@ -0,0 +1,771 @@
"""ImageNet training in PyTorch.

From:
Contributor

Do we need to copy this from pytorch into mlcroissant? Could you alternatively refer the user to https://github.com/pytorch/examples/blob/2d725b6ab255e05c55e0b08925f06f171aaedc0c/imagenet/main.py?

@@ -161,6 +161,7 @@ class EncodingFormat:
"""

CSV = "text/csv"
DIR = "application/x-dir"
Contributor

What's the difference with

"distribution": [
  {
    "@type": "cr:FileSet",
    "@id": "files",
    "name": "files",
    "encodingFormat": "text/plain",
    "includes": "data/file*.txt"
  }
],

?
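
For comparison, the FileSet approach could be adapted to an ImageFolder-style layout (`<split>/<class>/<image>`) roughly as sketched below. Everything here is an assumption for illustration: the `@id`, `name`, and `includes` values, the file names, and the helpers `select_files` and `label_of` are hypothetical. Matching uses Python's `fnmatch` as a stand-in and says nothing about how mlcroissant itself resolves `includes`.

```python
from fnmatch import fnmatchcase
from typing import List

# Hypothetical FileSet describing the validation images of a local ImageNet folder.
file_set = {
    "@type": "cr:FileSet",
    "@id": "validation-images",
    "name": "validation-images",
    "encodingFormat": "image/jpeg",
    "includes": "val/*/*.JPEG",
}


def select_files(paths: List[str], pattern: str) -> List[str]:
    """Keep the relative POSIX paths that match the glob pattern (case-sensitive)."""
    return [path for path in paths if fnmatchcase(path, pattern)]


def label_of(path: str) -> str:
    """In an ImageFolder layout, the class label is the parent directory name."""
    return path.split("/")[-2]


# Made-up file listing; the directory names follow the WordNet-ID convention.
paths = [
    "val/n01440764/image_0001.JPEG",
    "val/n01443537/image_0002.JPEG",
    "val/README.txt",
]
selected = select_files(paths, file_set["includes"])
labels = [label_of(path) for path in selected]
```

Under these assumptions, the glob selects only the image files and the label falls out of the path, which is the information a dedicated directory encoding format would otherwise have to carry.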

@marcenacp
Contributor

@mkuchnik Sorry for the late review... I think it's super interesting and I left a few comments:

  • There is already a way to import files from a specific folder (data/). I linked to an example.
  • Copying a whole Python file from PyTorch may be confusing. Can you give more instructions? Maybe it's worth having a Jupyter Notebook recipe with an end-to-end example on ImageNet!
