prefetch examples to showcase performance gain #635
Comments
@skshetry I'd appreciate it if you could take a look |
Turns out, datachain/src/datachain/lib/file.py Lines 271 to 274 in 3bd22ad
For me, |
Can we enable caching by default? What is the reason for not doing so already? |
It might take up a lot of space by default? But yes, it seems reasonable to enable it by default. @dmpetrov wdyt? |
Pre-fetch is not related to caching. We need to decouple these options. Created #647. Caching by default is a separate question, but I don't see any strong reason to do it. |
We have to persist them somewhere for the lifetime of the script's run. IIUC, due to the worker processes, we cannot do it in memory. |
I think the point here is that, for pre-fetch, everything related to the cache should be an implementation detail. E.g. cache files as needed, probably only for the script lifetime (or even the UDF call?), and then drop them. People should not care about |
Right. It's OK for prefetch to store data locally (in the cache is fine), but the data has to be removed right after it is used (only for the UDF call). |
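A minimal sketch of the lifetime described above, under stated assumptions: `scoped_cache`, `fake_download`, and `process` are hypothetical names, and the "download" just writes placeholder bytes. It is not DataChain's implementation; it only illustrates files living in a temporary directory that is deleted as soon as the call that needed them returns.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def scoped_cache():
    """Yield a temporary cache directory and delete it on exit."""
    cache_dir = Path(tempfile.mkdtemp(prefix="prefetch-"))
    try:
        yield cache_dir
    finally:
        # Remove everything that was fetched, even if processing failed.
        shutil.rmtree(cache_dir, ignore_errors=True)


def fake_download(name: str, cache_dir: Path) -> Path:
    """Hypothetical stand-in for a remote fetch: writes placeholder bytes."""
    path = cache_dir / name
    path.write_bytes(b"payload for " + name.encode())
    return path


def process(names):
    """Fetch files into a scoped cache, use them, then let the cache vanish."""
    with scoped_cache() as cache_dir:
        paths = [fake_download(n, cache_dir) for n in names]
        sizes = [p.stat().st_size for p in paths]
    # The directory no longer exists here; only the computed results survive.
    return sizes, cache_dir
```

With this shape the user never sees or configures the cache; the only thing that outlives the call is the result.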
I tested on this script, which has 1800 rows, with the following:
train_loader = DataLoader(
    ds.to_pytorch(transform=transform),
    batch_size=36,
    num_workers=4,
)
As you can see, if the cache is warm, it takes just 41s for the script to run. If the cache is empty, it takes ~6 mins. The 20s difference is not very meaningful; if anything, it could just be my unstable hotel Wi-Fi. Maybe prefetching is helping. The gap here is too large: a run that could potentially complete in 41s takes 5 more minutes to download files.
Benchmark script
#! /bin/bash
trap "exit" INT

# Start from a cold cache.
rm_cache () {
    rm -rf .datachain/cache
}

# Print the arguments, then time the example with them.
run () {
    echo "$@"
    gtime -v python ./examples/get_started/torch-loader.py "$@"
}

rm_cache
run --prefetch 0
run --prefetch 36

rm_cache
run --prefetch 0 --cache
run --prefetch 0 --cache

rm_cache
run --prefetch 36 --cache
run --prefetch 36 --cache |
So, based on this data, pre-fetch does not give any perf improvement. prefetch=36 and batch_size=36 are supposed to prefetch 36*36 (1K+) items. Is this what happened? Any ideas on how to make a cleaner case? Like single-threaded, no batch, prefetch=4? |
With both set to the same value, what we have prefetched will be loaded as a single batch to PyTorch. |
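One way to picture a prefetch window is the generic sketch below. It is an illustration only, not DataChain's code: `prefetching_iter` and its `fetch` callback are hypothetical names. At most `prefetch` fetches are in flight or buffered at any time, so if the consumer groups results into batches of the same size, the look-ahead amounts to roughly one batch rather than prefetch × batch_size items.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def prefetching_iter(items, fetch, prefetch):
    """Yield fetch(item) in input order with a bounded look-ahead window.

    `prefetch <= 0` disables look-ahead and fetches each item on demand.
    """
    if prefetch <= 0:
        for item in items:
            yield fetch(item)
        return

    pending = deque()
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        for item in items:
            pending.append(pool.submit(fetch, item))
            # Once the window is full, hand the oldest result to the consumer.
            # The remaining fetches keep running while the consumer works.
            if len(pending) >= prefetch:
                yield pending.popleft().result()
        # Drain whatever is still in flight after the input is exhausted.
        while pending:
            yield pending.popleft().result()
```

A cleaner benchmark along the lines suggested above could wrap a slow, single-threaded `fetch` with a small window such as `prefetch=4` and compare against `prefetch=0`.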
Apologies for the misleading benchmarks. Turns out, prefetching was not working in the Now with #664, the above script finishes in <1m15s (so, ~30s overhead), down from almost 6mins. |
@skshetry could you please share the table with the results? |
Great news! 1m15s instead of 6m20s - that's what's needed! |
can we consider this to be done? |
I tested on the CIFAR10 dataset, which has about 60,000 images, with
This was on no-op code like the following:
for _ in data_loader:
    pass
FYI,
You can find an example guide here in this repository. This shard is pushed to remote and
Example DataChain Code
import multiprocessing

from torch.utils.data import DataLoader
from tqdm import tqdm

from datachain import DataChain

source = "gs://datachain-cifar10/"
name = "cifar10"

if __name__ == "__main__":
    try:
        ds = DataChain.from_dataset(name)
    except:  # noqa: E722
        ds = DataChain.from_storage(source).save(name)
        print("created dataset", name)
    else:
        print("using existing dataset")

    ds = ds.settings(prefetch=50)
    train_loader = DataLoader(
        ds.to_pytorch(),
        batch_size=25,
        num_workers=6,
        persistent_workers=True,
        multiprocessing_context=multiprocessing.get_context("spawn"),
    )
    with tqdm(
        train_loader, disable=True, desc="Loading dataset", leave=True, position=100
    ) as loader:
        for _ in loader:
            pass
Example MosaicML code
import os
from typing import Any, Callable

import torch
from streaming import StreamingDataset
from torch.utils.data import DataLoader
from torchvision import transforms
from tqdm import tqdm

# the location of our dataset
in_root = "./dataset"

# the location of the "remote" streaming dataset (`sds`).
# Upload `out_root` to your cloud storage provider of choice.
out_root = "./sds"
out_train = "./sds/train"
out_test = "./sds/test"

# the location to download the streaming dataset during training
local = "./local"
local_train = "./local/train"
local_test = "./local/test"

# toggle shuffling in dataloader
shuffle_train = True
shuffle_test = False

# shard size limit, in bytes
size_limit = 1 << 25

# training batch size
batch_size = 32

# training hardware parameters
device = "cuda" if torch.cuda.is_available() else "cpu"

# number of training epochs
train_epochs = 2  # increase the number of epochs for greater accuracy

# Hashing algorithm to use for dataset
hashes = ["sha1", "xxh64"]

# upload location for the dataset splits (change this if you want to upload to a different location, for example, AWS S3 bucket location)
upload_location = "gs://datachain-imagenet/sds"
upload_train_location = os.path.join(upload_location, "train")
upload_test_location = os.path.join(upload_location, "test")

remote_train = upload_train_location
remote_test = upload_test_location


class CIFAR10Dataset(StreamingDataset):
    def __init__(
        self,
        remote: str,
        local: str,
        shuffle: bool,
        batch_size: int,
        transforms: Callable,
    ) -> None:
        super().__init__(
            local=local, remote=remote, shuffle=shuffle, batch_size=batch_size
        )
        self.transforms = transforms

    def __getitem__(self, idx: int) -> Any:
        obj = super().__getitem__(idx)
        x = obj["x"]
        y = obj["y"]
        return self.transforms(x), y


transformation = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ]
)

train_dataset = CIFAR10Dataset(
    remote_train,
    local_train,
    shuffle_train,
    batch_size=batch_size,
    transforms=transformation,
)
test_dataset = CIFAR10Dataset(
    remote_test,
    local_test,
    shuffle_test,
    batch_size=batch_size,
    transforms=transformation,
)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)

for _ in tqdm(train_dataloader, desc="train"):
    pass
for _ in tqdm(test_dataloader, desc="test"):
    pass |
Description
We need examples where it's clear how prefetch helps. I tried several examples and I don't see any difference.
An example is below. Note that the library utilizes the CPU pretty well (it can reach 600% utilization on my laptop); no parallelization is needed if prefetch is good.
Results:
Code:
Version Info