Acquiring and Pre-Processing the WSIM-GLDAS Dataset
Overview
In this lesson, you will acquire the data set called Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) from the NASA Socioeconomic Data and Applications Center (SEDAC) website and will also retrieve a global administrative boundary data set called geoBoundaries directly from an application programming interface (API). You will learn how to subset the data for a region of interest and save the subsetted data as a new file.
Learning Objectives
After completing this lesson, you should be able to:
Retrieve WSIM-GLDAS data from SEDAC.
Retrieve administrative boundary data using the geoBoundaries API.
Subset WSIM-GLDAS data for a region and time period of interest.
Visualize geospatial data to highlight precipitation deficit patterns.
Write a pre-processed NetCDF-formatted file to disk.
Introduction
The water cycle is the constant process of circulation of water on, above, and under the Earth’s surface (NOAA 2019). In recent decades, human activities such as greenhouse gas emissions, land use changes, dam and reservoir development, and groundwater extraction have affected the natural water cycle. The influence of these activities on the water cycle has consequential impacts on oceanic, groundwater, and land processes, influencing phenomena such as droughts and floods (Zhou 2016).

Precipitation deficits can cause drought, a prolonged period of little to no rainfall leading to a shortage of water. Droughts have impacts on the environment and humans, at times causing a chain reaction (Rodgers 2023). For example, California had a drought from 2012 to 2014. While it isn’t uncommon for California to have periods of low precipitation, low precipitation combined with sustained record-high temperatures created severe water shortages. The drought subsequently dried out rivers, which depleted populations of Chinook salmon, disrupting Native American tribes’ food supply (Bland 2014).
A raster is a type of geographic data in image format which has numerical information stored in each pixel. (Rasters are often referred to as grids because of their regularly-shaped matrix data structure.) Rasters can store many types of information and usually have dimensions that include latitude, longitude, and time. NetCDF is one format for raster data; others include GeoTIFF, ASCII, and many more. Several raster formats, like NetCDF, can store multiple raster layers, or a “raster stack,” which can be useful for storing and analyzing a series of rasters.
The Water Security (WSIM-GLDAS) Monthly Grids, v1 (1948 – 2014) dataset can be downloaded from the NASA SEDAC website (ISciences and Center For International Earth Science Information Network-CIESIN-Columbia University 2022b). The dataset abstract explains that WSIM-GLDAS “identifies and characterizes surpluses and deficits of freshwater, and the parameters determining these anomalies, at monthly intervals over the period January 1948 to December 2014.”

Downloads are organized by a combination of thematic variables (composite surplus/deficit, temperature, PETmE, runoff, soil moisture, precipitation) and integration periods (a temporal aggregation of 1, 3, 6, or 12 months). Each variable-integration combination consists of a NetCDF raster file with a time dimension that contains a raster layer for each of the 804 months between January 1948 and December 2014. Some variables also contain multiple attributes, each with their own time series. Hence, this is a large file that can take a long time to download and may cause memory issues on some systems. This is considered BIG data.
Acquiring the Data
The Water Security (WSIM-GLDAS) Monthly Grids dataset used in this lesson is hosted by NASA’s Socioeconomic Data and Applications Center (SEDAC), one of several Distributed Active Archive Centers (DAACs). SEDAC hosts a variety of data products including geospatial population data, human settlements and infrastructure, exposure and vulnerability to climate change, and satellite-based data on land use, air, and water quality. In order to download data hosted by SEDAC, you must have a free NASA EarthData account. You can create an account here: NASA EarthData.
For this lesson, we will work with the WSIM-GLDAS Composite Anomaly Twelve-Month Return Period NetCDF file. This file contains the variable “Composite Anomaly” for the integration period of twelve months. The dataset documentation describes the composite variables as key features of WSIM-GLDAS which combine “the return periods of multiple water parameters into composite indices of overall water surpluses and deficits” (ISciences and Center For International Earth Science Information Network-CIESIN-Columbia University 2022a). The composite anomaly files represent these model outputs in terms of the rarity of their return period, or how often they occur. Please go ahead and download the file directly from the SEDAC website.
First, go to the SEDAC website at https://sedac.ciesin.columbia.edu/. You can explore the website by themes, data sets, or collections. We will use the search bar at the top to search for “water security wsim”. Find and click on the Water Security (WSIM-GLDAS) Monthly Grids, v1 (1948 – 2014) data set. Take a moment to review the dataset’s Overview and Documentation pages.

When you’re ready, click on the Data Download tab. You will be asked to sign in using your NASA EarthData account.

Find the Composite Class, then click on the variable Composite Anomaly Twelve-Month Return Period.
Reading the Data
This lesson uses the stars, sf, dplyr, lubridate, and cubelyr packages. Make sure they are installed before you begin working with the code in this document. If you’d like to learn more about the functions used in this lesson, you can use the help guides on their package websites.
Once you have downloaded the file to your local computer, install and load the R packages required for this exercise. This is accomplished by defining the list of packages and assigning them to a new variable called “packages_to_check”. Next, we loop (iterate) through each of the packages in the list to see if they are already installed. If they are, we continue to the next item; if they aren’t, we install them.
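The check-and-install loop described above can be sketched as follows; the package names come from this lesson, but the loop itself is a common pattern rather than the lesson’s verbatim code:

```r
# packages used in this lesson
packages_to_check <- c("stars", "sf", "dplyr", "lubridate", "cubelyr")

# install any package that is missing, then load it
for (pkg in packages_to_check) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}
```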
time 1 793 NA NA POSIXct 1948-12-01,...,2014-12-01
Initializing R (reading) the file with the argument proxy = TRUE allows you to inspect the basic elements of the file without reading the whole file into memory. Multidimensional raster datasets can be extremely large and bring your computing environment to a halt if you have memory limitations.

Now we can use the print command to view basic information. The output tells us we have 5 attributes (deficit, deficit_cause, surplus, surplus_cause, both) and 3 dimensions. The first 2 dimensions are the spatial extents (x/y–longitude/latitude) and time is the 3rd dimension.
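As a sketch, the proxy read and inspection described above might look like this (the filename composite_12mo.nc matches the one used later in this lesson):

```r
# read only the file's structure, not the full data, into memory
wsim_gldas_anoms <- stars::read_stars("composite_12mo.nc", proxy = TRUE)

# view the attributes and dimensions
print(wsim_gldas_anoms)
```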
This means that the total number of individual raster layers in this NetCDF is 4020 (5 attributes x 804 time steps/months). Again, BIG data.
Attribute Selection
The WSIM-GLDAS data is quite large with many variables available. We can manage this large file by selecting a single variable; in this case “deficit” (drought). Read the data back in; this time with proxy = FALSE and only selecting the deficit layer.
# subsetting the variable 'deficit'
wsim_gldas_anoms <- stars::read_stars("composite_12mo.nc", sub = 'deficit', proxy = FALSE)
Time Selection
Specifying a temporal range of interest will make the file size smaller and therefore more manageable. We’ll select every year in the range 2000-2014. This can be accomplished by generating a sequence of dates for every year between December 2000 and December 2014, and then passing that list of dates to filter.
# generate a vector of dates for subsetting
keeps <- seq(lubridate::ymd("2000-12-01"),
             lubridate::ymd("2014-12-01"),
             by = "year")
# pass the dates to filter
wsim_gldas_anoms <- dplyr::filter(wsim_gldas_anoms, time %in% keeps)
time 1 15 NA NA POSIXct 2000-12-01,...,2014-12-01
Now we’re down to a single attribute (“deficit”) with 15 time steps. We can take a look at the first 6 years by passing the object through slice and then into plot.
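A minimal sketch of that slice-and-plot step (the index range 1:6 is an assumption for the first six annual time steps):

```r
# plot the first 6 annual time steps of the deficit layer
wsim_gldas_anoms |>
  dplyr::slice(index = 1:6, along = "time") |>
  plot()
```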
Although we have now reduced the data to a single attribute with a restricted time period of interest, we can take it a step further and limit the spatial extent to a country or state of interest.
Spatial Selection
We can spatially crop the raster stack with a few different methods. Options include using a bounding box in which the outer geographic coordinates (xmin, ymin, xmax, ymax) are specified, using another raster object, or using a vectorized boundary like a shapefile or GeoJSON to set the clipping extent.
Built by the community and William & Mary geoLab, the geoBoundaries Global Database of Political Administrative Boundaries is an online, open license (CC BY 4.0 / ODbL) resource of information on administrative boundaries (e.g., state, county) for every country in the world. Since 2016, this project has tracked approximately 1 million spatial units within more than 200 entities, including all UN member states.
In this example, we use a vector boundary to accomplish the geoprocessing task of clipping the data to an administrative or political unit. First, we acquire the data in GeoJSON format for the United States from the geoBoundaries API. (Note that it is also possible to download the vectorized boundaries directly from https://www.geoboundaries.org/ in lieu of using the API.)

To use the geoBoundaries API, the root URL below is modified to include a 3-letter code from the International Organization for Standardization used to identify countries (ISO3), and an administrative level for the data request. Administrative levels correspond to geographic units such as the Country (administrative level 0), the State/Province (administrative level 1), the County/District (administrative level 2), and so on.

For this example, we adjust the bolded components of the sample URL below to specify the country we want using the ISO3 Character Country Code for the United States (USA) and the desired Administrative Level of State (ADM1).
usa <- httr::GET("https://www.geoboundaries.org/api/current/gbOpen/USA/ADM1/")
In the line of code above, we used a function called httr::GET to obtain metadata from the URL. We assign the result to a new variable called “usa”. Next, we will examine the content.
usa <- httr::content(usa)
[31] "imagePreview" "simplifiedGeometryGeoJSON"
The parsed content contains 32 components. Item 29 is a direct link to the GeoJSON file (gjDownloadURL) representing the vector boundary data. Next, we will obtain those vectors and visualize the results.
usa <- sf::st_read(usa$gjDownloadURL)
Upon examination, shown in the image above, we can see that the data includes all US states and overseas territories. For this demonstration, we can simplify it to the contiguous United States. (Of course, it could also be simplified to other areas of interest simply by adapting the code below.)

We first create a list of the geographies we wish to remove and assign them to a variable called “drops”. Next, we reassign our “usa” variable to include only those geographies in the continental US and, finally, we plot the results.
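The filtering step described above might be sketched as follows; the shapeName attribute and the exact territory names are assumptions based on the geoBoundaries ADM1 schema, not the lesson’s verbatim code:

```r
# non-contiguous states and territories to drop (names are assumed)
drops <- c("Alaska", "Hawaii", "Puerto Rico", "Guam", "American Samoa",
           "Commonwealth of the Northern Mariana Islands",
           "United States Virgin Islands")

# keep only the contiguous geographies and plot the result
usa <- usa[!(usa$shapeName %in% drops), ]
plot(sf::st_geometry(usa))
```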
From here we can clip the WSIM-GLDAS raster stack by indexing it with the stored boundary of Texas.
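As a sketch (assuming the Texas boundary is pulled from the usa object by its shapeName attribute), the clip indexes the stars object with the sf geometry:

```r
# extract the Texas boundary (the attribute name is an assumption)
texas <- usa[usa$shapeName == "Texas", ]

# index the raster stack with the boundary to crop it to Texas
wsim_gldas_anoms_tex <- wsim_gldas_anoms[texas]
```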
Texas experienced a severe drought in 2011 that caused rivers to dry up and lakes to reach historic low levels (StateImpact 2014). Climate experts discovered that the drought was produced by “La Niña”, a weather pattern that causes the surface temperature of the Pacific Ocean to be cooler than normal. This, in turn, creates drier and warmer weather in the southern United States. La Niña can occur for a year or more, and returns once every few years. The drought was further exacerbated by high temperatures related to climate change in February of 2013 (NOAA 2023).

It is estimated that the drought cost farmers and ranchers about $8 billion in losses. Furthermore, the dry conditions fueled a series of wildfires across the state in early September of 2011, the most devastating of which occurred in Bastrop County, where 34,000 acres and 1,300 homes were destroyed (Roeseler 2011).
At this point, you may want to ask: does the data look plausible? That is, are the values being rendered in your map of interest? This simple check is helpful to make sure your subsetting has worked as expected. (You will want to use other methods to systematically evaluate the data.) If the results are acceptable, the subsetted dataset may be written to disk as a NetCDF file and saved for future modules.
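The write step uses stars::write_mdim; the output filename below follows the lesson’s naming:

```r
# write the pre-processed subset to disk as a NetCDF file
stars::write_mdim(wsim_gldas_anoms_tex, "wsim_gldas_tex.nc")
```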
The size of the pre-processed dataset is 1.6 MB compared to the original dataset of 1.7 GB. This is much more manageable in cloud environments, workshops, and git platforms.
If you want to share an image of the plot you created, you can save it to your computer as a .png file: open the png() device, produce the plot, and close the device with dev.off():
# Save the map plot as a PNG file
# Specify file name and dimensions
png("map_plot.png", width = 4, height = 4, units = "in", res = 300)
wsim_gldas_anoms_tex |>
  dplyr::slice(index = 15, along = "time") |>
  plot(reset = FALSE, breaks = c(0, -5, -10, -20, -30, -50))
plot(sf::st_geometry(texas),
     add = TRUE,
     lwd = 3,
     fill = NA,
     border = 'purple')
# close the png() device
dev.off()
png
  2
Once you run this code, you can find the file in the file location… This allows you to share your findings.
In this Lesson, You Learned…
Congratulations! Now you should be able to:
Navigate the SEDAC website to find and download datasets.

Access administrative boundaries from geoBoundaries using its API.

Temporally subset a NetCDF raster stack using R packages such as dplyr and lubridate.

Crop a NetCDF raster stack with a spatial boundary.

Write a subsetted dataset to disk and create an image to share results.
Lesson 2
In the next lesson, we will create more advanced visualizations and extract data of interest.
ISciences, and Center For International Earth Science Information Network-CIESIN-Columbia University. 2022a. “Documentation for the Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) Monthly Grids, Version 1.” Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC). https://doi.org/10.7927/X7FJ-JJ41.

———. 2022b. “Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) Monthly Grids, Version 1.” Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC). https://doi.org/10.7927/Z1FN-KF73.