Replies: 6 comments 2 replies
-
Could there be some variability in processing time due to concurrent processes on the machine you used rather than the data structure? I think that it would be useful to confirm the expected gain in processing time before considering systematizing this approach (by re-runing the test 10 times and/or doing it with more AOI). An estimation of the processing time required for transforming the GFC data would be useful too, don't you think? |
Beta Was this translation helpful? Give feedback.
-
I think it would be useful to get confirmation of that initial finding by someone else running the reprex on their machine and report their findings here. The presented test case is of 100 assets situated randomly on a single GFW tile using sequential processing. This is to isolate the effect of arranging the GTiff in tiles. As indicated in my comment, I do not expect these gains to be translated naively into a real-world use case. From my POV, next steps, after initial confirmation, could be to develop a more elaborate test protocol testing (some of) the mentioned assumptions. |
Beta Was this translation helpful? Give feedback.
-
@goergen95 I just ran it on my computer, I got this: Untiled: #> user system elapsed
#> 58.36 9.62 70.73 Tiled: #> user system elapsed
#> 52.78 6.52 62.61 |
Beta Was this translation helpful? Give feedback.
-
I adapted the example to avoid processing the rasters every time and to more easily switch between tiling and sequential vs. parallel processing. For this example data, the gains in computation speed seem to be reduced when parallel processing is enabled as I see 40s (untiled) vs. 37s (tiled) using 4 workers. Unfold code examplereprex::reprex({
remotes::install_github("mapme-initiative/mapme.biodiversity")
library(sf)
library(future)
library(mapme.biodiversity)
nworkers <- 4
tiled <- FALSE
# original block size: Block=40000x1
treecover_src <- "/vsicurl/https://storage.googleapis.com/earthenginepartners-hansen/GFC-2023-v1.11/Hansen_GFC-2023-v1.11_treecover2000_00N_070W.tif"
lossyear_src <- "/vsicurl/https://storage.googleapis.com/earthenginepartners-hansen/GFC-2023-v1.11/Hansen_GFC-2023-v1.11_lossyear_00N_070W.tif"
# gdal options and outdir
opts <- c("-co", "COMPRESS=LZW", "-ot", "Byte")
if (tiled) {
opts <- c(opts, "-co", "TILED=YES")
outdir <- tools::file_path_as_absolute("~/mapme-test/tiled/")
} else {
outdir <- tools::file_path_as_absolute("~/mapme-test/untiled/")
}
print(opts)
print(outdir)
# make untiled raster available locally
dir.create(outdir, recursive = TRUE)
dir.create(file.path(outdir, "gfw_treecover"))
dir.create(file.path(outdir, "gfw_lossyear"))
treecover_dst <- file.path(outdir, "gfw_treecover", basename(treecover_src))
if (!file.exists(treecover_dst)) {
sf::gdal_utils("translate", treecover_src, treecover_dst, options = opts)
}
lossyear_dst <- file.path(outdir, "gfw_lossyear", basename(lossyear_src))
if (!file.exists(lossyear_dst)) {
sf::gdal_utils("translate", lossyear_src, lossyear_dst, options = opts)
}
# create random assets
bb <- st_bbox(c(xmin = -69, ymin = -9, xmax = -61, ymax = -1), crs = 4326)
set.seed(42)
assets <- st_sample(bb, size = 100)
assets <- st_buffer(assets, dist = 10000)
assets <- st_as_sf(assets)
assets <- st_crop(assets, bb)
# timing
mapme_options(outdir = outdir, chunk_size = NULL)
if (nworkers > 1) {
plan(multicore, workers = nworkers)
}
get_resources(assets, get_gfw_lossyear(), get_gfw_treecover())
(system.time({
calc_indicators(assets, calc_treecover_area(min_size = 3, min_cover = 30))
}))
plan(sequential)
}) |
Beta Was this translation helpful? Give feedback.
-
Introducing ordering and overlap, I see a reduction of processing time of about ~10% comparing untiled-and-unordered vs. tiled-and-ordered with sequential processing. Ordering, however, with this example data seems to have an negligible effect. I suspect a different result when assets are spread out across multiple GFW tiles and larger numbers of assets because of more frequent cache invalidation. Unfold code examplereprex::reprex({
remotes::install_github("mapme-initiative/mapme.biodiversity")
library(sf)
library(future)
library(mapme.biodiversity)
nworkers <- 1
tiled <- TRUE
order <- TRUE
overlaps <- 0.2
# original block size: Block=40000x1
treecover_src <- "/vsicurl/https://storage.googleapis.com/earthenginepartners-hansen/GFC-2023-v1.11/Hansen_GFC-2023-v1.11_treecover2000_00N_070W.tif"
lossyear_src <- "/vsicurl/https://storage.googleapis.com/earthenginepartners-hansen/GFC-2023-v1.11/Hansen_GFC-2023-v1.11_lossyear_00N_070W.tif"
# gdal options and outdir
opts <- c("-co", "COMPRESS=LZW", "-ot", "Byte")
if (tiled) {
opts <- c(opts, "-co", "TILED=YES")
outdir <- tools::file_path_as_absolute("~/mapme-test/tiled/")
} else {
outdir <- tools::file_path_as_absolute("~/mapme-test/untiled/")
}
print(opts)
print(outdir)
# make untiled raster available locally
dir.create(outdir, recursive = TRUE)
dir.create(file.path(outdir, "gfw_treecover"))
dir.create(file.path(outdir, "gfw_lossyear"))
treecover_dst <- file.path(outdir, "gfw_treecover", basename(treecover_src))
if (!file.exists(treecover_dst)) {
sf::gdal_utils("translate", treecover_src, treecover_dst, options = opts)
}
lossyear_dst <- file.path(outdir, "gfw_lossyear", basename(lossyear_src))
if (!file.exists(lossyear_dst)) {
sf::gdal_utils("translate", lossyear_src, lossyear_dst, options = opts)
}
# create random assets
bb <- st_bbox(c(xmin = -69, ymin = -9, xmax = -61, ymax = -1), crs = 4326)
set.seed(42)
assets <- st_sample(bb, size = 100)
assets <- st_buffer(assets, dist = 10000)
assets <- st_as_sf(assets)
assets <- st_crop(assets, bb)
if (overlaps > 0) {
set.seed(43)
index <- sample(1:nrow(assets), round(nrow(assets) * overlaps))
assets <- rbind(assets, assets[index, ])
}
if (order) {
assets <- assets[order(st_coordinates(st_centroid(assets))[ ,1]), ]
}
# timing
mapme_options(outdir = outdir, chunk_size = NULL)
if (nworkers > 1) {
plan(multicore, workers = nworkers)
}
get_resources(assets, get_gfw_lossyear(), get_gfw_treecover())
(system.time({
calc_indicators(assets, calc_treecover_area(min_size = 3, min_cover = 30))
}))
plan(sequential)
})
|
Beta Was this translation helpful? Give feedback.
-
Scaling to ~1,000 assets also results in about a ~10% difference (260s vs 234s) with parallel processing enabled. Unfold code examplereprex::reprex({
remotes::install_github("mapme-initiative/mapme.biodiversity")
library(sf)
library(future)
library(mapme.biodiversity)
nworkers <- 4
nassets <- 1000
tiled <- FALSE
order <- TRUE
overlaps <- 0.2
# original block size: Block=40000x1
treecover_src <- "/vsicurl/https://storage.googleapis.com/earthenginepartners-hansen/GFC-2023-v1.11/Hansen_GFC-2023-v1.11_treecover2000_00N_070W.tif"
lossyear_src <- "/vsicurl/https://storage.googleapis.com/earthenginepartners-hansen/GFC-2023-v1.11/Hansen_GFC-2023-v1.11_lossyear_00N_070W.tif"
# gdal options and outdir
opts <- c("-co", "COMPRESS=LZW", "-ot", "Byte")
if (tiled) {
opts <- c(opts, "-co", "TILED=YES")
outdir <- tools::file_path_as_absolute("~/mapme-test/tiled/")
} else {
outdir <- tools::file_path_as_absolute("~/mapme-test/untiled/")
}
print(opts)
print(outdir)
# make untiled raster available locally
dir.create(outdir, recursive = TRUE)
dir.create(file.path(outdir, "gfw_treecover"))
dir.create(file.path(outdir, "gfw_lossyear"))
treecover_dst <- file.path(outdir, "gfw_treecover", basename(treecover_src))
if (!file.exists(treecover_dst)) {
sf::gdal_utils("translate", treecover_src, treecover_dst, options = opts)
}
lossyear_dst <- file.path(outdir, "gfw_lossyear", basename(lossyear_src))
if (!file.exists(lossyear_dst)) {
sf::gdal_utils("translate", lossyear_src, lossyear_dst, options = opts)
}
# create random assets
bb <- st_bbox(c(xmin = -69, ymin = -9, xmax = -61, ymax = -1), crs = 4326)
set.seed(42)
assets <- st_sample(bb, size = nassets)
assets <- st_buffer(assets, dist = 10000)
assets <- st_as_sf(assets)
assets <- st_crop(assets, bb)
if (overlaps > 0) {
set.seed(43)
index <- sample(1:nrow(assets), round(nrow(assets) * overlaps))
assets <- rbind(assets, assets[index, ])
}
if (order) {
assets <- assets[order(st_coordinates(st_centroid(assets))[ ,1]), ]
}
# timing
mapme_options(outdir = outdir, chunk_size = NULL)
if (nworkers > 1) {
plan(multicore, workers = nworkers)
}
get_resources(assets, get_gfw_lossyear(), get_gfw_treecover())
(system.time({
calc_indicators(assets, calc_treecover_area(min_size = 3, min_cover = 30))
}))
plan(sequential)
})
|
Beta Was this translation helpful? Give feedback.
-
Since we are now relying on GDAL for data I/O we are in the position to transform raster and vector data sets into formats/layouts optimized for our use cases. Considering the original GFW data layers, these are distributed untiled with a block size of
40000x1
pixels. This means that when we access those layers, we will have to read all rows that intersect with a given asset. The below reprex compares the runtime of computing thetreecover_area
indicator for 100 randomly generated locations either with the original block size or GDAL's default block size of256x256
pixels whenTILED=YES
.For the untiled version I see:
while for the tiled version:
which is a reduction of about ~15% simply by arranging the data differently. I assume the runtime of real-world use-cases to be a function of asset overlap, clustering, order and GDAL cache size.
Unfold code example
Beta Was this translation helpful? Give feedback.
All reactions