
The rSDP package provides functions for finding, accessing, querying, wrangling, and subsetting a set of curated raster data products. This vignette shows how to perform the first two steps (finding and accessing), along with some example workflows for exploring and visualizing the data.

To reproduce the examples in this article, you will need to install the latest versions of the rSDP, terra, and leaflet packages. For the web maps, you will need the development version of the leaflet package from GitHub.

remotes::install_github("rmbl-sdp/rSDP")
remotes::install_github("rstudio/leaflet")
install.packages(c("terra","tidyterra","sf"),type="binary") ## `binary` install prevents dependency issues on Mac and Windows

Loading the libraries makes the functions available.
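For example, attaching the packages installed above (leaflet is only needed for the web maps later in the article):

```r
library(rSDP)    ## catalog access and cloud data connections
library(terra)   ## raster data handling
library(sf)      ## vector data handling
library(leaflet) ## interactive web maps
```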

Finding datasets

The RMBL Spatial Data Platform provides more than 100 spatial data products, so the first challenge is finding data that is relevant to your work. There are currently two ways to find data:

  • Searching and browsing the web-based Data Catalog on RMBL’s website.
  • Searching and filtering datasets using the built-in catalog in the rSDP package.

Searching the SDP Catalog from R

You can get a data frame with the catalog information for all datasets with

cat <- sdp_get_catalog()
head(cat[,1:4])
#>   CatalogID  Release  Type                                        Product
#> 1    R1D001 Release1 Hydro  Large Stream Flowlines (Multi-flow Direction)
#> 2    R1D002 Release1 Hydro Large Stream Flowlines (Single-flow Direction)
#> 3    R1D003 Release1 Hydro        Large Watersheds (Multi-flow Direction)
#> 4    R1D004 Release1 Hydro              Multi-direction Flow Accumulation
#> 5    R1D005 Release1 Hydro       Small Watersheds (Single Flow Direction)
#> 6    R1D006 Release1 Hydro             Single Direction Flow Accumulation

If you are working in RStudio, the best way to explore the catalog is to open it in the Viewer pane.

View(cat)

This allows you to use the built-in filtering and searching facilities of the RStudio Viewer.

If you want to filter the catalog programmatically, you can specify arguments to sdp_get_catalog() that return matching rows. For example, to return only the products related to vegetation, you can specify:

veg_cat <- sdp_get_catalog(types="Vegetation")
head(veg_cat[,1:4])
#>    CatalogID  Release       Type                       Product
#> 25    R1D025 Release1 Vegetation               Basic Landcover
#> 26    R1D026 Release1 Vegetation Leaf-on Digital Surface Model
#> 27    R1D027 Release1 Vegetation      Vegetation Canopy Height
#> 40    R2D013 Release2 Vegetation        High-res Canopy Height
#> 41    R2D014 Release2 Vegetation      Lidar Vegetation Metrics
#> 54    R3D013 Release3 Vegetation              Understory Cover

For more advanced filtering, you can use regular expressions to match particular products.

snow_cat <- cat[grepl("Snowpack Persist",cat$Product),]
head(snow_cat[,1:4])
#>     CatalogID  Release Type
#> 75     R4D001 Release4 Snow
#> 135    R4D061 Release4 Snow
#> 136    R4D062 Release4 Snow
#> 137    R4D063 Release4 Snow
#>                                                             Product
#> 75               Snowpack Persistence Day of Year Yearly Timeseries
#> 135             Snowpack Persistence Day of Year Mean (1993 - 2022)
#> 136 Snowpack Persistence Day of Year Standard Deviation (1993-2022)
#> 137              Snowpack Persistence Uncertainty Yearly Timeseries

The “CatalogID” field provides a concise, unique identifier for each data product and is the preferred way to specify an individual product. Here, the CatalogID “R4D001” refers to the annual time-series of snowpack persistence across the Upper Gunnison domain.

Getting detailed metadata

To get detailed metadata for each item (beyond the basic information that is provided in the Catalog), we can use the sdp_get_metadata() function.

snow_meta <- sdp_get_metadata("R4D001")
print(snow_meta$qgis$extent$spatial)
#> list()
#> attr(,"minx")
#> [1] "305073"
#> attr(,"miny")
#> [1] "4256064"
#> attr(,"maxz")
#> [1] "0"
#> attr(,"crs")
#> [1] "EPSG:32613"
#> attr(,"maxx")
#> [1] "388098"
#> attr(,"dimensions")
#> [1] "2"
#> attr(,"maxy")
#> [1] "4328667"
#> attr(,"minz")
#> [1] "0"
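Individual values can be pulled out of this nested list with attr(). For instance, based on the structure printed above, the CRS and the minimum easting can be retrieved like this (a minimal sketch; the attribute values are stored as character strings):

```r
extent_info <- snow_meta$qgis$extent$spatial

## The CRS is stored as an attribute of the extent element.
attr(extent_info, "crs")               ## "EPSG:32613"

## Coordinate attributes are character strings, so convert as needed.
as.numeric(attr(extent_info, "minx"))  ## 305073
```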

Visualizing the SDP domains

We currently provide SDP data products in three spatial domains. To visualize their boundaries, we can construct a web map. First, we connect to the domain-boundary datasets stored in the cloud.

GT_bound <- sf::st_read("https://rmbl-sdp.s3.us-east-2.amazonaws.com/data_products/supplemental/GT_region_vect_1m.geojson")
#> Reading layer `GT_region_vect_1m' from data source 
#>   `https://rmbl-sdp.s3.us-east-2.amazonaws.com/data_products/supplemental/GT_region_vect_1m.geojson' 
#>   using driver `GeoJSON'
#> Simple feature collection with 1 feature and 2 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 327105 ymin: 4313359 xmax: 328312 ymax: 4314793
#> Projected CRS: WGS 84 / UTM zone 13N
UER_bound <- sf::st_read("https://rmbl-sdp.s3.us-east-2.amazonaws.com/data_products/supplemental/UER_region_vect_1m.geojson")
#> Reading layer `UER_region_vect_1m' from data source 
#>   `https://rmbl-sdp.s3.us-east-2.amazonaws.com/data_products/supplemental/UER_region_vect_1m.geojson' 
#>   using driver `GeoJSON'
#> Simple feature collection with 1 feature and 2 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 315965 ymin: 4298054 xmax: 337062 ymax: 4322641
#> Projected CRS: WGS 84 / UTM zone 13N
UG_bound <- sf::st_read("https://rmbl-sdp.s3.us-east-2.amazonaws.com/data_products/supplemental/UG_region_vect_1m.geojson")
#> Reading layer `UG_region_vect_1m' from data source 
#>   `https://rmbl-sdp.s3.us-east-2.amazonaws.com/data_products/supplemental/UG_region_vect_1m.geojson' 
#>   using driver `GeoJSON'
#> Simple feature collection with 1 feature and 2 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 307895 ymin: 4258017 xmax: 385679 ymax: 4326087
#> Projected CRS: WGS 84 / UTM zone 13N
domain_bounds <- rbind(UG_bound,UER_bound,GT_bound)
domain_bounds$Domain <- c("Upper Gunnison (UG)","Upper East River (UER)","Gothic Townsite (GT)")

Once we’ve loaded the boundaries, we can use the terra::plet() function to plot the domains on a web map.

domain_sv <- terra::vect(domain_bounds)
terra::plet(domain_sv,"Domain",tiles="Esri.NatGeoWorldMap",alpha=0.6,
     main="SDP Spatial Domains",
     col=c("#6c42f5","#c95773","#59c957"))

Connecting to data in the cloud

All SDP data products are raster datasets stored on the cloud storage service Amazon S3. In many cases, the simplest way to interact with a dataset is to connect to it using the web-based virtual file system support built into the terra package.

To make this as straightforward as possible, we provide the sdp_get_raster() function to connect to these data. By default, this function creates an R object representing the dataset without downloading the entire file locally.

Single datasets

To connect to a dataset that contains only a single layer, or a few related layers stored in a single file, you only need to provide sdp_get_raster() with the CatalogID for the data. Here we load an elevation map for the Upper Gunnison (UG) domain.

UG_elev <- sdp_get_raster("R3D009")
UG_elev
#> class       : SpatRaster 
#> dimensions  : 24201, 27668, 1  (nrow, ncol, nlyr)
#> resolution  : 3, 3  (x, y)
#> extent      : 305082, 388086, 4256064, 4328667  (xmin, xmax, ymin, ymax)
#> coord. ref. : WGS 84 / UTM zone 13N (EPSG:32613) 
#> source      : UG_dem_3m_v1.tif 
#> name        : UG_dem_3m_v1

We can now see more details about the structure of the dataset, including its size (24201 x 27668 pixels), resolution (3 m), and coordinate reference system.

Importantly, we can perform many common visualizations and manipulations on the dataset without downloading it in full. For example, if we want to crop the data to an area of interest (say, the Gothic Townsite polygon we loaded earlier), we can do so while downloading only the portion of the large raster that covers the polygon of interest.

GT_elev <- crop(UG_elev,domain_sv[3,])
GT_elev
#> class       : SpatRaster 
#> dimensions  : 478, 402, 1  (nrow, ncol, nlyr)
#> resolution  : 3, 3  (x, y)
#> extent      : 327105, 328311, 4313358, 4314792  (xmin, xmax, ymin, ymax)
#> coord. ref. : WGS 84 / UTM zone 13N (EPSG:32613) 
#> source(s)   : memory
#> name        : UG_dem_3m_v1 
#> min value   :     2864.029 
#> max value   :     3174.758

You will notice that the source(s) field now says “memory”. This means that the new dataset we created no longer lives on the cloud, but is now stored in memory on our computer. If the cropped subset is too large to fit in memory, it is written to a temporary file.
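You can also check this programmatically: terra's inMemory() function reports whether a SpatRaster's cell values are held in memory rather than read from a file or URL. Based on the output printed above, we would expect:

```r
## The cropped raster was pulled into memory by crop().
terra::inMemory(GT_elev)  ## TRUE

## The full elevation raster is still read from the cloud on demand.
terra::inMemory(UG_elev)  ## FALSE
```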

We can now plot the cropped raster dataset.

plot(GT_elev)

Time-series datasets

Single-layer datasets are relatively simple in structure, but many of the SDP data products are provided as time-series. These are maps provided at daily, monthly, or annual intervals.

For these data products, you will need to pass additional arguments to sdp_get_raster() to specify which temporal subsets to return. For example, specifying the years argument for an annual dataset returns only the layers for the desired years. This code returns a raster dataset representing snowpack persistence for the years 2018 through 2020:

cat[cat$CatalogID=="R4D001",1:4]
#>    CatalogID  Release Type                                            Product
#> 75    R4D001 Release4 Snow Snowpack Persistence Day of Year Yearly Timeseries
snow_years <- sdp_get_raster("R4D001",years=2018:2020,verbose=FALSE)
snow_years
#> class       : SpatRaster 
#> dimensions  : 2689, 3075, 3  (nrow, ncol, nlyr)
#> resolution  : 27, 27  (x, y)
#> extent      : 305073, 388098, 4256064, 4328667  (xmin, xmax, ymin, ymax)
#> coord. ref. : WGS 84 / UTM zone 13N (EPSG:32613) 
#> sources     : UG_snow_persistence_year_2018_27m_v1.tif  
#>               UG_snow_persistence_year_2019_27m_v1.tif  
#>               UG_snow_persistence_year_2020_27m_v1.tif  
#> names       :      2018,     2019,      2020 
#> min values  :  27.22007,  79.9462,  48.37198 
#> max values  : 202.39381, 211.7621, 214.12854

To find out which time intervals are included for each dataset, you can examine the catalog fields MinYear, MaxYear, MinDate, and MaxDate.

cat[cat$TimeSeriesType %in% c("Yearly","Monthly","Daily"),c(1,4,8:11)]
#>     CatalogID                                                       Product
#> 75     R4D001            Snowpack Persistence Day of Year Yearly Timeseries
#> 76     R4D002                  Snowpack Onset Day of Year Yearly Timeseries
#> 77     R4D003                           Snowpack Duration Yearly Timeseries
#> 78     R4D004                   Maximum 2m Air Temperature Daily Timeseries
#> 79     R4D005                   Minimum 2m Air Temperature Daily Timeseries
#> 80     R4D006                 Average 2m Air Temperature Monthly Timeseries
#> 81     R4D007                 Maximum 2m Air Temperature Monthly Timeseries
#> 82     R4D008                 Minimum 2m Air Temperature Monthly Timeseries
#> 83     R4D009        Air Temperature Freezing Degree-days Annual Timeseries
#> 84     R4D010  Air Temperature Freezing Degree-days Early Season Timeseries
#> 85     R4D011   Air Temperature Freezing Degree-days Late Season Timeseries
#> 86     R4D012         Air Temperature Growing Degree-days Annual Timeseries
#> 87     R4D013   Air Temperature Growing Degree-days Early Season Timeseries
#> 88     R4D014    Air Temperature Growing Degree-days Late Season Timeseries
#> 89     R4D015              Snow-free Freezing Degree-days Annual Timeseries
#> 90     R4D016        Snow-free Freezing Degree-days Early Season Timeseries
#> 91     R4D017         Snow-free Freezing Degree-days Late Season Timeseries
#> 92     R4D018 Snow-free Freezing Degree-days 0-60 days Post-snow Timeseries
#> 93     R4D019               Snow-free Growing Degree-days Annual Timeseries
#> 94     R4D020         Snow-free Growing Degree-days Early Season Timeseries
#> 95     R4D021          Snow-free Growing Degree-days Late Season Timeseries
#> 96     R4D022   Snow-free Growing Degree-day 0-60 Days Post Snow Timeseries
#> 137    R4D063            Snowpack Persistence Uncertainty Yearly Timeseries
#> 138    R4D064                  Snowpack Onset Uncertainty Yearly Timeseries
#>        MinDate    MaxDate MinYear MaxYear
#> 75  1993-01-01 2022-09-30    1993    2022
#> 76  1992-10-01 2022-09-30    1993    2022
#> 77  1992-10-01 2022-09-30    1993    2022
#> 78  2001-11-01 2022-10-30    2001    2022
#> 79  2001-11-01 2022-10-30    2001    2022
#> 80  2001-11-01 2022-10-30    2001    2022
#> 81  2001-11-01 2022-10-30    2001    2022
#> 82  2001-11-01 2022-10-30    2001    2022
#> 83  2002-01-01 2022-12-31    2002    2022
#> 84  2002-01-01 2022-12-31    2002    2022
#> 85  2002-01-01 2022-12-31    2002    2022
#> 86  2002-01-01 2022-12-31    2002    2022
#> 87  2002-01-01 2022-12-31    2002    2022
#> 88  2002-01-01 2022-12-31    2002    2022
#> 89  2002-01-01 2022-12-31    2002    2022
#> 90  2002-01-01 2022-12-31    2002    2022
#> 91  2002-01-01 2022-12-31    2002    2022
#> 92  2002-01-01 2022-12-31    2002    2022
#> 93  2002-01-01 2022-12-31    2002    2022
#> 94  2002-01-01 2022-12-31    2002    2022
#> 95  2002-01-01 2022-12-31    2002    2022
#> 96  2002-01-01 2022-12-31    2002    2022
#> 137 1993-01-01 2022-09-30    1993    2022
#> 138 1992-10-01 2022-09-30    1993    2022

When should you download data locally?

Although you can perform many operations on cloud-based datasets without downloading them locally, some operations may be prohibitively slow. For example, sampling 100 random points from the snow raster we connected to above took more than 20 seconds over my home internet connection when the data were stored in the cloud:

start_time <- Sys.time()
snow_samp_cloud <- spatSample(snow_years,size=100)
Sys.time() - start_time
#> Time difference of 21.81979 secs

This is much slower than the same operation on locally stored data. To deal with situations like this, we have included data download capabilities in the sdp_get_raster() function. Specifying download_files=TRUE along with a local download_path creates a local copy of the data on your computer:

snow_years_local <- sdp_get_raster("R4D001",years=2018:2020,verbose=FALSE,
                                   download_files=TRUE,
                                   download_path="~/Downloads")
#> [1] "All files exist locally. Specify `overwrite=TRUE` to overwrite existing files."
#> [1] "Loading raster from local paths."
start_time <- Sys.time()
snow_samp_local <- spatSample(snow_years_local,size=100)
Sys.time() - start_time
#> Time difference of 0.524817 secs

So it’s almost 40 times faster to sample from the local dataset! In this example, the time saved doesn’t quite make up for the extra time to download the data, but it often will for larger operations.

Here are some types of operations that might benefit from downloading data locally:

  • Operations that use many or all of the pixels of the target dataset (e.g. resampling, reprojecting).

  • Operations that span a large proportion of the extent of the raster (e.g. sampling values at random points).

  • Repeated operations on a single source dataset (e.g. multiple raster algebra calculations on the same raster).
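As a sketch of the first case, reprojection touches every pixel, so it benefits from running on the local copy created above (the target CRS here, longitude-latitude WGS84, is just an illustrative choice):

```r
## Reprojecting reads every cell of the source raster; doing this on
## the local copy avoids many repeated network requests.
snow_longlat <- terra::project(snow_years_local, "EPSG:4326")
```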