NEON and LTER macroinverts powered by ecocomDP

Ty Tuff, ESIIL Data Scientist 2023-10-30

Introduction

In ecological synthesis projects, finding, accessing, and vetting datasets to include in an analysis often consumes an enormous amount of time. However, it is crucial that investigators understand the structure of, and the assumptions behind, each dataset included in a study. Here we present some tools available in the ecocomDP package for R that facilitate this initial data discovery, wrangling, and vetting process for datasets representing communities of organisms.

“ecocomDP” is both the name of an R package and a data model.

The ecocomDP data model is a “dataset design pattern for ecological community data to facilitate synthesis and reuse”. Because this data model is intended for use with community data, observations are expected to be measures of species abundance, biomass, or similar. However, some ecocomDP datasets report occurrences. The motivation for applying the ecocomDP data model to both NEON biodiversity data products and EDI data packages, including data from the US Long Term Ecological Research (LTER) Network and Macrosystems Biology projects, is to make these data discoverable and accessible through a single data search tool and to deliver them in a standard format. Functions in the ecocomDP package provide tools to manipulate and visualize ecocomDP-formatted data objects in R. We are also developing tools to convert ecocomDP data packages to Darwin Core Archives (DwC-A, event core) for submission to GBIF, to further facilitate NEON and LTER data discovery.

For more information on ecocomDP, see the GitHub repo here: https://github.com/EDIorg/ecocomDP.

In this tutorial, we will learn how to find and access ecological community datasets provided by the National Ecological Observatory Network (NEON) and the Environmental Data Initiative (EDI).

Find, download, and explore aquatic macroinvertebrate data from lakes

In this tutorial we will look at benthic macroinvertebrate counts from lakes across NEON and at the North Temperate Lakes (NTL) LTER site.

Load the required packages:

library(neonUtilities)
library(readr)
library(dplyr)
library(ggplot2)
library(ecocomDP)

Next, search the ecocomDP catalog for datasets that include the term “invertebrate”. If you explore the summary table, you will see benthic macroinvertebrate community data exist for NEON sites and the North Temperate Lakes (NTL) LTER site.

# search for invertebrate data packages
search_result <- search_data(text = "invertebrate")
print(search_result)

# We will work with the following two ecocomDP datasets:
# NEON aquatic macroinvertebrates package id is "neon.ecocomdp.20120.001.001"   
# LTER NTL benthic macroinvertebrates package id is "edi.290.2"

Let’s take a look at the NEON site list to identify which NEON sites are lakes.

# download NEON site list
neon_sites <- read_csv("https://www.neonscience.org/sites/default/files/NEON_Field_Site_Metadata_20230309.csv")

# let's get a list of NEON sites that are lakes
neon_lakes <- neon_sites %>% 
  filter(field_site_subtype == "Lake") %>%
  select(field_site_id) %>% 
  unlist(use.names = FALSE)

Next, download the two datasets. Note that each data package is downloaded as a list of tables. Please see the ecocomDP GitHub page for more information about the data model.

# download NEON macroinvert data from lakes
neon_data_list <- read_data(
  id = "neon.ecocomdp.20120.001.001",
  site = neon_lakes, # NEON sites
  check.size = FALSE)

# download LTER NTL macroinvert data
ntl_data_list <- read_data(id = "edi.290.2")
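Because each data package arrives as a list of tables, a quick summary of table dimensions can help with vetting. The sketch below defines a hypothetical helper, summarize_tables(), and runs it on a toy object. It assumes the downloaded object stores its data frames in an element named tables, which should be confirmed with names(neon_data_list) before relying on it. The toy object and its values are illustrative only.

```r
# Hypothetical helper (not part of ecocomDP): one row per table with its dimensions.
# Assumes `dataset` is a list whose `tables` element holds named data frames.
summarize_tables <- function(dataset) {
  tables <- dataset$tables
  data.frame(
    table = names(tables),
    n_rows = vapply(tables, nrow, integer(1)),
    n_cols = vapply(tables, ncol, integer(1)),
    row.names = NULL
  )
}

# Toy object mimicking the assumed structure (made-up values, not real data)
toy_dataset <- list(
  id = "toy.1.1",
  tables = list(
    location = data.frame(location_id = c("a", "b")),
    taxon = data.frame(
      taxon_id = c("t1", "t2", "t3"),
      taxon_rank = c("genus", "species", "family")
    ),
    observation = data.frame(
      observation_id = 1:4,
      location_id = c("a", "a", "b", "b"),
      taxon_id = c("t1", "t2", "t1", "t3"),
      value = c(3, 1, 7, 2)
    )
  )
)

table_summary <- summarize_tables(toy_dataset)
print(table_summary)
```

If the assumption about the tables element holds, the same helper could be applied as summarize_tables(neon_data_list) and summarize_tables(ntl_data_list).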

The ecocomDP package includes some plotting functions to explore the spatial, temporal, and taxonomic coverage of the datasets. First, take a look at the spatial and temporal coverage of the NEON dataset.

# Explore the spatial and temporal coverage of the NEON dataset
neon_data_list %>% plot_sample_space_time()

Rather than plotting each dataset separately, you can flatten, stack, and then plot the combined data.

# flatten the two datasets
flat_neon <- neon_data_list %>% flatten_data()
flat_ntl <- ntl_data_list %>% flatten_data()


# stack the two datasets
stacked_neon_ntl <- bind_rows(flat_neon, flat_ntl) %>%
  mutate(taxon_rank = tolower(taxon_rank)) # force lower case for all values

# plot spatial and temporal coverage of the two datasets
stacked_neon_ntl %>% plot_sample_space_time()
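When stacking, bind_rows() matches columns by name and fills any column missing from one input with NA, so it can be useful to check which columns the two flattened tables share. The following is a minimal sketch using a hypothetical helper, compare_columns(), demonstrated on toy data frames. The column names in the toy data are illustrative and are not a claim about what flatten_data() returns.

```r
# Hypothetical helper: report shared and unshared column names of two data frames
compare_columns <- function(a, b) {
  list(
    shared = intersect(names(a), names(b)),
    only_in_a = setdiff(names(a), names(b)),
    only_in_b = setdiff(names(b), names(a))
  )
}

# Toy inputs (made-up values); `depth` exists only in the first table
toy_a <- data.frame(package_id = "toy.a", taxon_rank = "Genus", value = 5, depth = 1.2)
toy_b <- data.frame(package_id = "toy.b", taxon_rank = "species", value = 2)

column_check <- compare_columns(toy_a, toy_b)
print(column_check)
```

With the real data, compare_columns(flat_neon, flat_ntl) would list the columns that end up partly NA in the stacked table.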

Next, check the taxonomic resolution reported in the two datasets. These datasets may not be compatible in a synthesis analysis if they aggregate their counts at different taxonomic resolutions, for example if one study reports counts at the order level and another reports counts at the species level.

You can plot one dataset at a time:

# Explore the taxonomic resolution in the dataset. 
# What is the most common taxonomic resolution (rank) 
# for macroinvertebrate identifications in this dataset?
neon_data_list %>% plot_taxa_rank()

Or plot the stacked datasets:

stacked_neon_ntl %>% 
  plot_taxa_rank(
    facet_var = "package_id",
    facet_scales = "free_x")

Similar taxonomic resolutions suggest these datasets may be compatible in a synthesis study.
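The comparison of taxonomic resolution can also be made numerically. The sketch below defines a hypothetical helper, rank_proportions(), that tabulates the share of rows at each taxonomic rank within each package. It is shown on a toy data frame with the package_id and taxon_rank columns used above. It counts rows rather than distinct taxa, and it is not assumed to reproduce exactly what plot_taxa_rank() displays.

```r
# Hypothetical helper: proportion of rows at each taxon rank, per package
rank_proportions <- function(flat_data) {
  counts <- table(flat_data$package_id, tolower(flat_data$taxon_rank))
  prop.table(counts, margin = 1) # each row (package) sums to 1
}

# Toy flattened data (made-up values) with mixed-case ranks
toy_flat <- data.frame(
  package_id = c("toy.a", "toy.a", "toy.a", "toy.a", "toy.b", "toy.b"),
  taxon_rank = c("genus", "genus", "Species", "family", "genus", "species")
)

rank_props <- rank_proportions(toy_flat)
print(round(rank_props, 2))
```

Applied as rank_proportions(stacked_neon_ntl), rows with similar proportions across ranks would support the visual impression from the faceted plot.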

Last update: 2023-11-16