Skip to content

Data Processing Documentation

Overview

Brief description of the data processing objectives and scope. Reminder to adhere to data ownership and usage guidelines.

ACCESS proposal

Abstract

The Macrophenology working group, supported by the Environmental Science Innovation and Inclusion Lab (ESIIL) at the Universit of Colorado Boulder, intends to explore the interaction between phenological shifts and species range shifts in response to climate change, using multiple sources of data (remote sensing, herbarium, and field observations). This work will include expanding the "ecological community data design pattern" (ecocomDP) data model and its associated tools to interact with plant trait and phenological databases to create ecocomDP+ derived datasets. Data integration and co-production will be most successful if we can bring together subject matter experts (SMEs) on key aspects of this synthesis endeavor, including scientists, Tribal community members and scholars, data providers, environmental data scientists, and cyberinfrastructure SMEs.

Derived datasets will use the extended ecocomDP+ standard and will be stored in a shared Jetstream2 Volume using columnar file formats (e.g., parquet) to facilitate the use of cloud-native tools. Workflow development will be collaborative and reproducible using open-source software and tools (e.g., R, Python, GitHub). Initial development will use an ACCESS Explore allocation for the Jetstream2 Cloud using 200k SUs (Jetstream2 vCPU hours). Compute resources will be used for (1) compiling the ecocomDP+ datasets to feed into the analysis, (2) calculating shifts in species ranges and phenologies, and (3) conducting analyses linking range shifts and phenology to other species traits and climate change. Remotely sensed datasets, especially time-series data, are characterized by their voluminous nature, often referred to as "big data." Processing and analyzing these extensive datasets requires substantial storage capacities and high-performance computing nodes. The workflow will initially be developed for a subset of target taxa and then scaled up.

PIs (need biosketches)

  • Eric Sokol
  • Sydne Record

Associated grants

  • ESIIL
  • NEON
  • others?

Data Sources

List and describe data sources used, including links to cloud-optimized sources. Highlight permissions and compliance with data ownership guidelines.

CyVerse Discovery Environment

Instructions for setting up and using the CyVerse Discovery Environment for data processing. Tips for cloud-based data access and processing.

Data Processing Steps

Using GDAL VSI

Guidance on using GDAL VSI (Virtual System Interface) for data access and processing. Example commands or scripts:

gdal_translate /vsicurl/http://example.com/data.tif output.tif

Cloud-Optimized Data

Advantages of using cloud-optimized data formats and processing data without downloading. Instructions for such processes.

Data Storage

Information on storing processed data, with guidelines for choosing between the repository and CyVerse Data Store.

Best Practices

Recommendations for efficient and responsible data processing in the cloud. Tips to ensure data integrity and reproducibility.

Challenges and Troubleshooting

Common challenges in data processing and potential solutions. Resources for troubleshooting in the CyVerse Discovery Environment.

Conclusions

Summary of the data processing phase and its outcomes. Reflect on the methods used.

References

Citations of tools, data sources, and other references used in the data processing phase.


Last update: 2024-07-19