These first two code chunks are just to download some code packages to run our analysis. To run each chunk simply click the 'play' arrow and wait for the command to run. You'll know a step is complete when you see a green checkmark to the left of the code chunk.
Look into adding API access to stream data¶
Add text chunks throughout¶
This cell installs the neonutilities package, which is essential for working with NEON (National Ecological Observatory Network) data. It allows us to easily download, stack, and process NEON data products.
pip install neonutilities
Collecting neonutilities
Downloading neonutilities-1.2.2-py3-none-any.whl.metadata (45 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/45.4 kB ? eta -:--:--
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.4/45.4 kB 2.2 MB/s eta 0:00:00
Requirement already satisfied: importlib-resources in /usr/local/lib/python3.12/dist-packages (from neonutilities) (7.1.0)
Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (from neonutilities) (2.2.2)
Collecting parameterized (from neonutilities)
Downloading parameterized-0.9.0-py2.py3-none-any.whl.metadata (18 kB)
Requirement already satisfied: pyarrow in /usr/local/lib/python3.12/dist-packages (from neonutilities) (18.1.0)
Requirement already satisfied: pyproj in /usr/local/lib/python3.12/dist-packages (from neonutilities) (3.7.2)
Requirement already satisfied: requests in /usr/local/lib/python3.12/dist-packages (from neonutilities) (2.32.4)
Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from neonutilities) (4.67.3)
Requirement already satisfied: h5py in /usr/local/lib/python3.12/dist-packages (from neonutilities) (3.16.0)
Requirement already satisfied: google_crc32c in /usr/local/lib/python3.12/dist-packages (from neonutilities) (1.8.0)
Requirement already satisfied: numpy>=1.21.2 in /usr/local/lib/python3.12/dist-packages (from h5py->neonutilities) (2.0.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas->neonutilities) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas->neonutilities) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas->neonutilities) (2026.1)
Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from pyproj->neonutilities) (2026.4.22)
Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests->neonutilities) (3.4.7)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests->neonutilities) (3.13)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests->neonutilities) (2.5.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas->neonutilities) (1.17.0)
Downloading neonutilities-1.2.2-py3-none-any.whl (109 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 109.8/109.8 kB 4.2 MB/s eta 0:00:00
Downloading parameterized-0.9.0-py2.py3-none-any.whl (20 kB)
Installing collected packages: parameterized, neonutilities
Successfully installed neonutilities-1.2.2 parameterized-0.9.0
Here we import all the necessary Python libraries for data manipulation, plotting, and interacting with the NEON data:
neonutilities as nu: For NEON data access and processing.matplotlib.pyplot as plt: For creating static, interactive, and animated visualizations.pandas as pd: For data manipulation and analysis, especially with DataFrames.numpy as np: For numerical operations, especially with arrays.datetime: For handling date and time objects.os: For interacting with the operating system, like file paths.seaborn as sns: For creating informative and attractive statistical graphics.
import neonutilities as nu
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from datetime import datetime
import os
import seaborn as sns # Import seaborn for statistical plotting
This code uses nu.zips_by_product from neonutilities to download eddy covariance data (Data Product ID DP4.00200.001) for two sites: Niwot Ridge (NIWO) and Harvard Forest (HARV). The data covers June to July 2018. The check_size=False argument is used to bypass file size checks, which can sometimes cause issues with large downloads.
nu.zips_by_product(dpid="DP4.00200.001", package="basic",
site=["NIWO","HARV"],
startdate="2018-06", enddate="2018-07",
savepath=os.getcwd(),
check_size=False)
100%|██████████| 4/4 [00:00<00:00, 6.52it/s] 100%|██████████| 4/4 [00:17<00:00, 4.48s/it]
After downloading the zipped data, nu.stack_eddy is used to extract and combine the eddy covariance files into a more usable format (a dictionary of pandas DataFrames). filepath=os.getcwd() + "/filesToStack00200" specifies the directory where the downloaded files are located, and level="dp04" indicates that we are stacking data product level 4 (eddy covariance data).
flux = nu.stack_eddy(filepath=os.getcwd() + "/filesToStack00200",
level="dp04")
Extracting data: 100%|██████████| 4/4 [00:02<00:00, 1.98it/s] Joining data variables by file: 100%|██████████| 4/4 [00:01<00:00, 2.67it/s] Stacking files by month: 100%|██████████| 2/2 [00:00<00:00, 428.21it/s]
This line simply prints the keys of the flux dictionary. Since nu.stack_eddy stacks data by site, the keys are typically the site IDs (e.g., 'HARV', 'NIWO'), along with metadata keys like 'variables', 'objDesc', etc. This helps in understanding what data is available in the flux object.
flux.keys()
dict_keys(['HARV', 'NIWO', 'variables', 'objDesc', 'issueLog', 'citation_00200_RELEASE-2026'])
This displays the first few rows (head) of the eddy covariance data for the NIWO site. It provides a quick glance at the columns and the initial data points, including timeBgn, timeEnd, and various flux measurements like data.fluxCo2.nsae.flux.
flux['NIWO'].head
pandas.core.generic.NDFrame.head
def head(n: int=5) -> Self
Return the first `n` rows. This function returns the first `n` rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it. For negative values of `n`, this function returns all rows except the last `|n|` rows, equivalent to ``df[:n]``. If n is larger than the number of rows, this function returns all rows. Parameters ---------- n : int, default 5 Number of rows to select. Returns ------- same type as caller The first `n` rows of the caller object. See Also -------- DataFrame.tail: Returns the last `n` rows. Examples -------- >>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion', ... 'monkey', 'parrot', 'shark', 'whale', 'zebra']}) >>> df animal 0 alligator 1 bee 2 falcon 3 lion 4 monkey 5 parrot 6 shark 7 whale 8 zebra Viewing the first 5 lines >>> df.head() animal 0 alligator 1 bee 2 falcon 3 lion 4 monkey Viewing the first `n` lines (three in this case) >>> df.head(3) animal 0 alligator 1 bee 2 falcon For negative values of `n` >>> df.head(-3) animal 0 alligator 1 bee 2 falcon 3 lion 4 monkey 5 parrot
These lines extract unique terms from the column names of the NIWO flux data. This is done to help in filtering the objDesc DataFrame later, to find relevant metadata descriptions for the variables present in the flux data.
termlist = [f.split('.') for f in flux['NIWO'].keys()]
term = [t for sublist in termlist for t in sublist]
This code filters the objDesc DataFrame (which contains descriptions of all NEON data product objects/variables) to show only the descriptions for the terms found in the flux data's column names. This helps in understanding what each variable in our dataset represents.
objDesc = flux['objDesc']
objDesc[objDesc['Object'].isin(term)]
| Object | Description | |
|---|---|---|
| 21 | angZaxsErth | Wind direction |
| 83 | data | Represents data fields |
| 90 | distReso | Resolution of vector of distances (units = m) |
| 91 | distXaxs90 | Distance of 90% contriubtion of footprint alig... |
| 92 | distXaxsMax | Max distance of footprint contriubtion aligned... |
| 93 | distYaxs90 | Distance of 90% contriubtion of footprint alig... |
| 94 | distZaxsAbl | Atmospheric boundary layer height used in flux... |
| 99 | distZaxsMeasDisp | Measurement height minus displacement height |
| 100 | distZaxsRgh | Roughness length used in flux footprint calcul... |
| 124 | flux | General term for flux |
| 125 | fluxCo2 | CO2 flux (can be turbulent, storage, or nsae) |
| 127 | fluxH2o | H2O flux (can be turbulent, storage, or nsae) ... |
| 129 | fluxMome | Momentum flux (can be turbulent, storage, or n... |
| 131 | fluxTemp | Temperature flux (can be turbulent, storage, o... |
| 132 | foot | Footprint |
| 199 | nsae | Net surface atmospher exchange (turbulent + st... |
| 243 | qfFinl | The final quality flag indicating if the data ... |
| 273 | qfqm | Quality flag and quality metrics, represents q... |
| 450 | stat | Statistics |
| 452 | stor | Storage |
| 473 | timeBgn | The beginning time of the aggregation period |
| 474 | timeEnd | The end time of the aggregation period |
| 476 | turb | Turbulent |
| 504 | veloFric | Friction velocity/ustar (units = m s-1) |
| 511 | veloYaxsHorSd | Standard deviation of horizontal wind speed in... |
| 514 | veloZaxsHorSd | Standard deviation of horizontal wind speed in... |
This displays the variables DataFrame from the flux dictionary. This DataFrame provides more detailed metadata for each variable, including its category, system variable name, statistical type, and units. It's crucial for correctly interpreting the data.
flux['variables']
| category | system | variable | stat | units | |
|---|---|---|---|---|---|
| 0 | data | fluxCo2 | nsae | timeBgn | NA |
| 1 | data | fluxCo2 | nsae | timeEnd | NA |
| 2 | data | fluxCo2 | nsae | flux | umolCo2 m-2 s-1 |
| 3 | data | fluxCo2 | stor | timeBgn | NA |
| 4 | data | fluxCo2 | stor | timeEnd | NA |
| ... | ... | ... | ... | ... | ... |
| 73 | qfqm | fluxTemp | turb | timeEnd | NA |
| 74 | qfqm | fluxTemp | turb | qfFinl | NA |
| 75 | qfqm | foot | turb | timeBgn | NA |
| 76 | qfqm | foot | turb | timeEnd | NA |
| 77 | qfqm | foot | turb | qfFinl | NA |
78 rows × 5 columns
This creates a basic scatter plot of CO2 flux (data.fluxCo2.nsae.flux) over time (timeBgn) for the NIWO site. The plt.ylim sets the y-axis limits to focus on the typical range of flux values. This initial plot gives a raw visualization of the CO2 exchange.
plt.scatter(flux['NIWO'].timeBgn,
flux['NIWO']['data.fluxCo2.nsae.flux'],
marker='.')
plt.ylim((-20,20))
(-20.0, 20.0)
plt.show()
This cell refines the previous scatter plot by adding plt.xlim to zoom into a specific two-day period (July 7th to July 9th, 2018). This allows for a more detailed examination of the diurnal patterns in CO2 flux, helping to visualize daily cycles of carbon uptake and release.
plt.scatter(flux['NIWO'].timeBgn,
flux['NIWO']['data.fluxCo2.nsae.flux'],
marker='.')
plt.ylim((-20,20))
plt.xlim((datetime.strptime('2018-07-07 00:00:00', '%Y-%m-%d %H:%M:%S'),
datetime.strptime('2018-07-09 00:00:00', '%Y-%m-%d %H:%M:%S')))
(np.float64(17719.0), np.float64(17721.0))
This downloads Photosynthetically Active Radiation (PAR) data (Data Product ID DP1.00024.001) for the NIWO site, covering the same June-July 2018 period. PAR is a critical environmental variable for understanding CO2 flux, as it drives photosynthesis.
pr = nu.load_by_product("DP1.00024.001", site="NIWO",
timeindex=30, package="basic",
startdate="2018-06", enddate="2018-07",
check_size=False)
100%|██████████| 2/2 [00:00<00:00, 6.57it/s] 100%|██████████| 13/13 [00:02<00:00, 6.14it/s] 100%|██████████| 3/3 [00:00<00:00, 19.26it/s]
These lines process the downloaded PAR data:
par30 = pr['PARPAR_30min']: Extracts the 30-minute averaged PAR data.par30['timeBgn'] = par30['startDateTime']: Creates atimeBgncolumn for merging with flux data.prtop = par30[par30['verticalPosition']=='040']: Filters for PAR measurements at a specific vertical position (e.g., 40 meters), which is typically representative of canopy-top radiation.
par30 = pr['PARPAR_30min']
par30['timeBgn'] = par30['startDateTime']
prtop = par30[par30['verticalPosition']=='040']
This merges the NIWO eddy covariance flux data (flux['NIWO']) with the processed PAR data (prtop) based on the timeBgn column. The how='outer' merge ensures that all time points from both datasets are included, and on='timeBgn' specifies the common column for merging. This combined DataFrame (fxpr) is crucial for analyzing the relationship between light and CO2 exchange.
fxpr = pd.merge(flux['NIWO'], prtop, how='outer', on='timeBgn')
This code generates a scatter plot of CO2 flux versus PAR for the NIWO site. Key features:
plt.scatter(): Creates the scatter plot.plt.ylim((-20,20)): Sets the y-axis limits for CO2 flux.plt.title(),plt.xlabel(),plt.ylabel(): Adds descriptive labels and a title.plt.grid(): Adds a grid for readability.plt.axhline(0, color='gray', linestyle='--'): Draws a horizontal line at y=0, which is important for distinguishing between net carbon uptake (negative flux) and release (positive flux).
plt.scatter(fxpr.PARMean,
fxpr['data.fluxCo2.nsae.flux'],
marker='.', alpha=0.5)
plt.ylim((-20,20))
plt.title('CO2 Flux vs. Photosynthetically Active Radiation (PAR) at NIWO')
plt.xlabel('PAR Mean (µmol m⁻² s⁻¹)')
plt.ylabel('CO2 Flux (µmol CO2 m⁻² s⁻¹)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.axhline(0, color='gray', linestyle='--') # Add a line at y=0 for sink/source reference
plt.show()
This cell attempts to load data product level 3 (DP03) eddy covariance data using nu.stack_eddy. This data product typically includes higher-frequency raw data or different processing levels compared to DP04. This command previously raised a ValueError related to inhomogeneous array shapes during stacking, indicating an issue with combining the specific DP03 files. This product is not used in the subsequent analysis of DP04 (flux) and DP01 (PAR).
prof = nu.stack_eddy(filepath=os.getcwd() + "/filesToStack00200",
level="dp03")
Extracting data: 100%|██████████| 4/4 [00:02<00:00, 1.58it/s] Joining data variables by file: 100%|██████████| 4/4 [00:00<00:00, 10.51it/s]
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) /tmp/ipykernel_2201/3060609489.py in <cell line: 0>() ----> 1 prof = nu.stack_eddy(filepath=os.getcwd() + "/filesToStack00200", 2 level="dp03") /usr/local/lib/python3.12/dist-packages/neonutilities/stack_eddy.py in stack_eddy(filepath, level, var, avg, metadata, runLocal) 654 655 # which attributes should be extracted? --> 656 vNames = np.unique([list(verMergDict[key].columns) for key in verMergDict.keys()]) 657 if metadata: 658 if any('rtioMoleDryCo2Vali' in vN for vN in vNames): /usr/local/lib/python3.12/dist-packages/numpy/lib/_arraysetops_impl.py in unique(ar, return_index, return_inverse, return_counts, axis, equal_nan) 285 286 """ --> 287 ar = np.asanyarray(ar) 288 if axis is None: 289 ret = _unique1d(ar, return_index, return_inverse, return_counts, ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (4,) + inhomogeneous part.
This is a critical setup cell for interacting with the Gemini API:
import google.generativeai as genai: Imports the Google Generative AI SDK.from google.colab import userdata: Importsuserdatafor secure API key management in Colab.GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY'): Retrieves the API key stored in Colab secrets (namedGOOGLE_API_KEY).genai.configure(api_key=GOOGLE_API_KEY): Configures the Generative AI SDK with your API key.model = genai.GenerativeModel("gemini-2.5-flash"): Initializes the Gemini model to be used for generating responses.def build_neon_context(...): Defines a helper function to create a comprehensive context string about the NEON data. This context includes site information, data product details, variable metadata, sample data rows, and descriptive statistics. It's designed to provide the AI with sufficient background to answer ecological questions accurately.NEON_CONTEXT = build_neon_context(flux, objDesc, site="NIWO"): Calls thebuild_neon_contextfunction to create the context for the NIWO site, storing it in theNEON_CONTEXTvariable.
import google.generativeai as genai
from google.colab import userdata
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel("gemini-2.5-flash")
def build_neon_context(flux_data, objDesc_df, site="NIWO"):
"""Build a rich context string from loaded NEON data."""
site_flux_df = flux_data[site]
# Get the actual columns present
cols = site_flux_df.columns.tolist()
# Pull relevant metadata rows from objDesc
term_parts = [f.split('.') for f in cols]
terms = list(set([t for sublist in term_parts for t in sublist]))
relevant_meta = objDesc_df[objDesc_df['Object'].isin(terms)]
meta_str = relevant_meta[['Object','Description']].to_string(index=False)
# Sample of actual data
sample = site_flux_df[['timeBgn','data.fluxCo2.nsae.flux']].dropna().head(10).to_string(index=False)
# Basic stats
stats = site_flux_df['data.fluxCo2.nsae.flux'].describe().to_string()
context = f"""
You are an expert in ecosystem carbon flux science and the NEON (National Ecological Observatory Network)
eddy covariance data system. You are helping undergraduate students analyze NEON flux data.
SITE: {site} (Niwot Ridge, Colorado — subalpine forest, ~3050m elevation)
DATA PRODUCT: DP4.00200.001 — Bundled eddy covariance data
TIME PERIOD: June–July 2018
RELEVANT VARIABLE METADATA (from NEON objDesc):
{meta_str}
SAMPLE DATA ROWS (timeBgn, CO2 flux in µmol/m²/s):
{sample}
DESCRIPTIVE STATISTICS FOR CO2 FLUX:
{stats}
The key flux variable is `data.fluxCo2.nsae.flux`: net surface-atmosphere exchange of CO2
in µmol CO2 m⁻² s⁻¹. Negative values = carbon uptake by ecosystem. Positive = carbon release.
"""
return context
# Run once after loading data for NIWO
NEON_CONTEXT = build_neon_context(flux, objDesc, site="NIWO")
This line builds the HARV_CONTEXT using the build_neon_context function, similar to how NEON_CONTEXT was created for NIWO. This prepares a dedicated context for the Harvard Forest site, which is essential for asking AI questions specific to HARV data and for comparative analysis.
# Run once after loading data for HARV
HARV_CONTEXT = build_neon_context(flux, objDesc, site="HARV")
This downloads Photosynthetically Active Radiation (PAR) data (Data Product ID DP1.00024.001) specifically for the HARV site, covering June to July 2018. This is analogous to the PAR data download for NIWO, providing light intensity measurements for Harvard Forest.
# Load PAR data for HARV
pr_harv = nu.load_by_product("DP1.00024.001", site="HARV",
timeindex=30, package="basic",
startdate="2018-06", enddate="2018-07",
check_size=False)
100%|██████████| 2/2 [00:00<00:00, 7.17it/s] 100%|██████████| 15/15 [00:01<00:00, 9.87it/s] 100%|██████████| 2/2 [00:00<00:00, 4.36it/s]
These lines process the downloaded PAR data for HARV, mirroring the steps taken for NIWO:
par30_harv = pr_harv['PARPAR_30min']: Extracts the 30-minute averaged PAR data for HARV.par30_harv['timeBgn'] = par30_harv['startDateTime']: Creates atimeBgncolumn for merging.prtop_harv = par30_harv[par30_harv['verticalPosition']=='040']: Filters for PAR measurements at the 40-meter vertical position.
par30_harv = pr_harv['PARPAR_30min']
par30_harv['timeBgn'] = par30_harv['startDateTime']
prtop_harv = par30_harv[par30_harv['verticalPosition']=='040']
This merges the HARV eddy covariance flux data (flux['HARV']) with the processed HARV PAR data (prtop_harv) on the timeBgn column. The resulting fxpr_harv DataFrame combines the CO2 flux and PAR measurements for the Harvard Forest site.
fxpr_harv = pd.merge(flux['HARV'], prtop_harv, how='outer', on='timeBgn')
This code generates a scatter plot of CO2 flux versus PAR for the HARV site, identical in structure to the NIWO plot. It allows for direct visual comparison of the light response curves and carbon dynamics between the two NEON sites. The title, labels, grid, and a zero-line are included for clarity.
plt.scatter(fxpr_harv.PARMean,
fxpr_harv['data.fluxCo2.nsae.flux'],
marker='.', alpha=0.5)
plt.ylim((-20,20))
plt.title('CO2 Flux vs. Photosynthetically Active Radiation (PAR) at HARV')
plt.xlabel('PAR Mean (µmol m⁻² s⁻¹)')
plt.ylabel('CO2 Flux (µmol CO2 m⁻² s⁻¹)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.axhline(0, color='gray', linestyle='--') # Add a line at y=0 for sink/source reference
plt.show()
This cell calculates descriptive statistics for the PAR and CO2 flux data at the HARV site and then uses the explain_plot function (defined later) to generate an AI-driven explanation of the scatter plot's patterns and ecological implications. This helps in understanding the HARV site's carbon balance in detail.
This plot shows the relationship between PAR and CO2 flux for the HARV site, allowing for visual comparison with the NIWO site's carbon dynamics.
par_flux_harv_summary = fxpr_harv[['PARMean','data.fluxCo2.nsae.flux']].describe().to_string()
print(explain_plot(
"scatter plot of PAR (photosynthetically active radiation) vs CO2 flux",
par_flux_harv_summary,
context=HARV_CONTEXT
))
Alright everyone, great work getting this plot generated! This scatter plot of Photosynthetically Active Radiation (PAR) against CO2 flux is fundamental to understanding how an ecosystem "breathes" carbon. Let's break down what we're seeing here for the HARV subalpine forest during June and July of 2018.
Remember, a negative CO2 flux means the ecosystem is taking up carbon from the atmosphere, and a positive flux means it's releasing carbon.
### 1. What pattern(s) are visible and why they make ecological sense
The plot clearly shows a classic **light response curve** of ecosystem CO2 exchange, often resembling a hyperbolic or parabolic shape.
* **At low PAR (near 0 µmol m⁻² s⁻¹),** which corresponds to nighttime or heavily overcast conditions, the CO2 flux is predominantly **positive**. For example, the sample data rows show values like 7.06, 7.23, 9.87 µmol m⁻² s⁻¹ during the early morning hours (likely local night time). This makes perfect ecological sense:
* Without light, plants cannot photosynthesize, so there's no CO2 uptake.
* However, both plants (autotrophic respiration) and soil microbes/animals (heterotrophic respiration) continue to respire, releasing CO2 into the atmosphere. This net release is what we measure as a positive flux.
* **As PAR increases,** the CO2 flux generally becomes more **negative**, indicating increasing net CO2 uptake by the ecosystem. This is due to **photosynthesis** kicking in and accelerating with more light. The plants in the subalpine forest are actively using the light energy to fix atmospheric CO2 into organic compounds.
* **At very high PAR values (e.g., >1000 µmol m⁻² s⁻¹),** the rate of CO2 uptake appears to **level off or saturate**. The most negative values, like the observed minimum of -253.8 µmol m⁻² s⁻¹, occur under high light conditions. This saturation point indicates that the photosynthetic machinery of the ecosystem (chlorophyll, enzymes) is working at or near its maximum capacity. Even if more light is available, other factors (like CO2 concentration, temperature, water availability, or enzymatic reaction rates) become limiting.
### 2. What is driving the diurnal or seasonal variation if present
* **Diurnal Variation (Daily Cycle):** The most obvious driver of the patterns seen in this plot is the **daily cycle of solar radiation (PAR)**.
* **During the day:** As the sun rises, PAR increases, driving photosynthesis and leading to progressively stronger net CO2 uptake. This is the ecosystem's "daytime activity" of carbon sequestration.
* **During the night:** As PAR drops to zero, photosynthesis stops, and respiration dominates, causing a net release of CO2. This creates the daily "breathing" rhythm of the forest. The scatter plot aggregates all these daily cycles across the two months.
* **Seasonal Variation (June-July 2018):** While the plot itself doesn't explicitly show "seasonal" trends (it's a snapshot of a single period), the *magnitude* of the fluxes is strongly influenced by the season.
* **June-July is peak growing season** for a high-elevation subalpine forest like HARV (~3050m). During this period, trees and understory vegetation are fully leafed out, temperatures are relatively warm, and moisture is generally sufficient from snowmelt and summer precipitation. This leads to **maximal photosynthetic capacity**, explaining why we see such strong CO2 uptake, with a maximum observed uptake reaching -253.8 µmol m⁻² s⁻¹. If we were looking at data from late autumn or winter, we would expect much lower or even no net uptake, with respiration dominating due to dormancy and snow cover.
### 3. One thing that looks unusual or worth investigating further
One aspect that stands out as worth investigating is the presence of **significant positive CO2 fluxes (release) at moderate-to-high PAR values.** While the general trend shows increasing uptake with increasing PAR, there are data points where PAR is, for example, 500-1000 µmol m⁻² s⁻¹, yet the CO2 flux is positive (e.g., +20 to +40 µmol m⁻² s⁻¹ or even higher, up to the max of 89.08 µmol m⁻² s⁻¹).
* **Why is this unusual?** Under conditions of substantial light, we generally expect photosynthesis to dominate and lead to net CO2 uptake. Net CO2 release at high PAR implies that ecosystem respiration is either extremely high, or photosynthesis is severely suppressed, or both.
* **Potential avenues for investigation:**
1. **Data Quality:** The first step would be to examine the `qfFinl` (final quality flag) for these specific time periods. Are these flagged as unreliable measurements (qfFinl = 1)?
2. **Environmental Stress:** Were there unusual environmental conditions during these periods? For instance, extreme heat (leading to heat stress and increased respiration or stomatal closure), severe drought (limiting water for photosynthesis), or pest/disease outbreaks could cause plants to reduce photosynthesis or even experience stress-induced carbon loss.
3. **Footprint Dynamics:** What was the `angZaxsErth` (wind direction) and `distXaxs90`/`distXaxsMax` (footprint extent) during these times? Could the eddy covariance sensor be sampling a specific area within its footprint that has a disproportionately high respiration rate (e.g., a patch of decomposing organic matter, a tree fall gap with high soil respiration, or an area with senescing vegetation) or a non-vegetated surface?
4. **Turbulence Conditions:** Check `veloFric` (friction velocity). Low friction velocity can indicate poor turbulent mixing, which can lead to unreliable flux measurements.
### 4. What this pattern tells us about this ecosystem's carbon balance
This pattern strongly suggests that, for the **June-July 2018 period, the HARV subalpine forest was a net carbon sink.**
* The overall **mean CO2 flux for this period is -6.44 µmol m⁻² s⁻¹**. This negative average unequivocally indicates that the ecosystem was, on balance, taking more CO2 out of the atmosphere than it was releasing.
* Visually, while there are periods of CO2 release (respiration at night/low PAR), the sheer magnitude and frequency of the strong negative fluxes (uptake) at higher PAR values dominate the overall carbon exchange. The maximum uptake of -253.8 µmol m⁻² s⁻¹ is a very significant rate of carbon sequestration.
* This means that during its peak growing season, this high-elevation subalpine forest is effectively contributing to **carbon sequestration**, storing carbon in its biomass and soils. This aligns with the understanding that many healthy forest ecosystems act as important carbon sinks, particularly during their active growth phases.
This prepares the data for a side-by-side boxplot comparison:
niwo_fluxandharv_flux: Extracts the CO2 flux data for each site and adds a 'Site' column to identify the origin of the data.combined_flux = pd.concat([niwo_flux, harv_flux], ignore_index=True): Concatenates the two site-specific DataFrames into a singlecombined_fluxDataFrame.ignore_index=Trueis crucial here to prevent duplicate indices, which caused a previous error.combined_flux.rename(columns={'data.fluxCo2.nsae.flux': 'CO2_Flux'}, inplace=True): Renames the flux column for easier plotting.
Now that we have both plots and their interpretations, what are your observations? How does the carbon exchange at HARV compare to NIWO, and what further comparisons or analyses would you like to perform?
# Prepare data for side-by-side boxplot
niwo_flux = fxpr[['data.fluxCo2.nsae.flux']].copy()
niwo_flux['Site'] = 'NIWO'
harv_flux = fxpr_harv[['data.fluxCo2.nsae.flux']].copy()
harv_flux['Site'] = 'HARV'
combined_flux = pd.concat([niwo_flux, harv_flux], ignore_index=True)
combined_flux.rename(columns={'data.fluxCo2.nsae.flux': 'CO2_Flux'}, inplace=True)
This code generates a side-by-side boxplot comparing the distribution of CO2 flux values between the NIWO and HARV sites. Key elements:
plt.figure(): Sets the figure size.sns.boxplot(x='Site', y='CO2_Flux', data=combined_flux, palette='viridis'): Creates the boxplot using the combined data, with 'Site' on the x-axis and 'CO2_Flux' on the y-axis.plt.ylim((-20, 20)): Ensures consistent y-axis limits for comparison.plt.title(),plt.xlabel(),plt.ylabel(): Adds descriptive labels and title.plt.grid(),plt.axhline(0, color='gray', linestyle='--'): Adds a grid and a zero-line for reference.
plt.figure(figsize=(10, 6))
sns.boxplot(x='Site', y='CO2_Flux', data=combined_flux, palette='viridis')
plt.ylim((-20, 20)) # Keep consistent y-axis limits for comparison
plt.title('CO2 Flux Comparison: NIWO vs. HARV (June-July 2018)')
plt.xlabel('NEON Site')
plt.ylabel('CO2 Flux (µmol CO2 m⁻² s⁻¹)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.axhline(0, color='gray', linestyle='--') # Add a line at y=0 for sink/source reference
plt.show()
/tmp/ipykernel_2201/3872336773.py:2: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.boxplot(x='Site', y='CO2_Flux', data=combined_flux, palette='viridis')
This cell lists all available Gemini models that support the generateContent method. This is useful for checking which models are accessible and their exact names, which can be important for initializing the GenerativeModel correctly.
This boxplot visually compares the distribution of CO2 flux values between the NIWO and HARV sites during June-July 2018. The central line in each box represents the median, the box edges show the interquartile range (25th to 75th percentile), and the 'whiskers' extend to data points within 1.5 times the IQR. Points beyond the whiskers are outliers.
for m in genai.list_models():
if 'generateContent' in m.supported_generation_methods:
print(m.name)
models/gemini-2.5-flash models/gemini-2.5-pro models/gemini-2.0-flash models/gemini-2.0-flash-001 models/gemini-2.0-flash-lite-001 models/gemini-2.0-flash-lite models/gemini-2.5-flash-preview-tts models/gemini-2.5-pro-preview-tts models/gemma-4-26b-a4b-it models/gemma-4-31b-it models/gemini-flash-latest models/gemini-flash-lite-latest models/gemini-pro-latest models/gemini-2.5-flash-lite models/gemini-2.5-flash-image models/gemini-3-pro-preview models/gemini-3-flash-preview models/gemini-3.1-pro-preview models/gemini-3.1-pro-preview-customtools models/gemini-3.1-flash-lite-preview models/gemini-3.1-flash-lite models/gemini-3-pro-image-preview models/nano-banana-pro-preview models/gemini-3.1-flash-image-preview models/lyria-3-clip-preview models/lyria-3-pro-preview models/gemini-3.1-flash-tts-preview models/gemini-robotics-er-1.5-preview models/gemini-robotics-er-1.6-preview models/gemini-2.5-computer-use-preview-10-2025 models/deep-research-max-preview-04-2026 models/deep-research-preview-04-2026 models/deep-research-pro-preview-12-2025
This defines the ask_about_variable function. This function takes a variable_name and a context (defaulting to NEON_CONTEXT) and crafts a prompt for the Gemini model. The model is asked to explain:
- What the variable measures physically.
- Its units and typical range.
- Why an ecologist cares about it.
- One thing a student should watch out for. The function then returns the AI's generated explanation.
def ask_about_variable(variable_name, context=NEON_CONTEXT):
prompt = f"""{context}
A student is asking about this specific variable in the dataset: `{variable_name}`
Please explain:
1. What this variable measures physically
2. Its units and typical range at this site
3. Why an ecologist would care about it
4. One thing a student should watch out for when interpreting it
Keep your answer concrete — reference the actual data statistics above where relevant.
"""
response = model.generate_content(prompt)
return response.text
This defines the explain_plot function. It takes a plot_description, data_summary, and context and uses the Gemini model to provide an ecological interpretation of a given plot. It asks the AI to explain:
- Visible patterns and their ecological sense.
- Drivers of variation.
- Anything unusual or worth investigating.
- What the pattern indicates about the ecosystem's carbon balance. An example call is included that uses this function to explain the NIWO PAR vs. CO2 flux scatter plot.
def explain_plot(plot_description, data_summary, context=NEON_CONTEXT):
prompt = f"""{context}
A student just made this plot:
PLOT DESCRIPTION: {plot_description}
DATA SHOWN IN PLOT:
{data_summary}
Please explain:
1. What pattern(s) are visible and why they make ecological sense
2. What is driving the diurnal or seasonal variation if present
3. One thing that looks unusual or worth investigating further
4. What this pattern tells us about this ecosystem's carbon balance
Be specific — use the actual values and site context above.
"""
response = model.generate_content(prompt)
return response.text
# Example call after making the PAR vs flux scatter:
par_flux_summary = fxpr[['PARMean','data.fluxCo2.nsae.flux']].describe().to_string()
explain_plot(
"scatter plot of PAR (photosynthetically active radiation) vs CO2 flux",
par_flux_summary
)
'Okay, this is a great starting point for analyzing the carbon dynamics of the Niwot Ridge subalpine forest! Let\'s break down what your plot likely shows and what it means.\n\nBased on the descriptive statistics for PAR and CO2 flux during June-July 2018, here\'s what we can infer about your scatter plot of PAR vs. CO2 flux:\n\n### 1. What pattern(s) are visible and why they make ecological sense\n\nThe plot likely shows a clear, non-linear relationship:\n\n* **Positive CO2 Flux at Low PAR (Nighttime Respiration):** When PAR is near zero (nighttime or very cloudy conditions), the CO2 flux values are generally positive. The 75th percentile of CO2 flux is 0.986 µmol/m²/s, and the median is -0.362 µmol/m²/s, suggesting that a significant portion of the nighttime values are positive. This represents **ecosystem respiration** (Re) – the CO2 released by all living organisms (plants, microbes, animals) in the ecosystem when there\'s no light for photosynthesis. This makes perfect ecological sense, as all living things respire to fuel their metabolic processes, regardless of light availability.\n\n* **Decreasing (More Negative) CO2 Flux with Increasing PAR (Daytime Photosynthesis):** As PAR increases from zero, the CO2 flux values become progressively more negative. This indicates increasing **net carbon uptake** by the ecosystem. The minimum CO2 flux recorded is -14.629 µmol/m²/s, which represents a very strong uptake. This pattern is driven by **photosynthesis**, where plants use light energy (PAR) to convert atmospheric CO2 into organic carbon. As more light becomes available, plants can photosynthesize faster, leading to greater CO2 drawdown from the atmosphere.\n\n* **Light Saturation (or Approach to Saturation):** At very high PAR values (e.g., above ~1000-1500 µmol/m²/s, given the max PAR of 2391.54), the rate of increase in CO2 uptake (the slope of the curve) would likely begin to flatten out. This indicates that photosynthesis is becoming **light-saturated**, meaning other factors (like CO2 concentration, temperature, water availability, or enzyme capacity) are limiting the rate of photosynthesis, not just light. This is a common physiological response in plants.\n\nIn summary, the pattern reflects the fundamental processes of photosynthesis and respiration, showing how light availability dictates the instantaneous carbon exchange between the subalpine forest and the atmosphere.\n\n### 2. What is driving the diurnal or seasonal variation if present\n\n* **Diurnal Variation (Daily Cycle):** The most prominent driver of the patterns seen in your plot is the **diurnal cycle of solar radiation (PAR)**.\n * **Daytime:** As the sun rises, PAR increases, triggering and then accelerating photosynthesis, leading to net CO2 uptake (more negative fluxes). This creates the "bow shape" or hyperbolic curve in your plot.\n * **Nighttime:** As the sun sets and PAR drops to zero, photosynthesis ceases, and only respiration occurs, resulting in net CO2 release (positive fluxes).\n * **Temperature:** While not directly shown in this plot, temperature also plays a significant diurnal role. Respiration rates generally increase with temperature, so nighttime respiration might be higher on warmer nights and lower on colder nights, adding to the scatter around the zero-PAR point.\n\n* **Seasonal Variation (June-July context):** Your data is from June-July, which represents the **peak growing season** for a subalpine forest at ~3050m elevation.\n * **High PAR and Favorable Conditions:** During these months, day lengths are long, solar radiation is high (max PAR up to 2391 µmol/m²/s), and temperatures are typically warm enough for active growth. There\'s usually sufficient moisture from snowmelt or summer precipitation. These optimal conditions drive the strong photosynthetic uptake observed (min flux of -14.629 µmol/m²/s) and contribute to the overall net carbon sink for this period.\n * **Maximal Productivity:** This period captures the ecosystem\'s maximal carbon uptake potential for the year.\n\n### 3. One thing that looks unusual or worth investigating further\n\nThe most unusual and concerning observation is the **maximum CO2 flux value of 310.114642 µmol/m²/s**.\n\n* **Why it\'s unusual:** This is an extremely high positive CO2 flux. Typical maximum ecosystem respiration rates for a forest, even a highly productive one in summer, are usually in the range of 5-15 µmol/m²/s, perhaps occasionally up to 20-30 µmol/m²/s during very warm, wet periods or immediately after disturbance. A value over 300 µmol/m²/s is several orders of magnitude higher than expected for a natural, undisturbed forest.\n\n* **Potential reasons to investigate:**\n * **Instrument Malfunction/Spike:** This could be an instrumental error, a sensor spike, or electronic noise.\n * **Quality Flags:** You should definitely check the `qfFinl` (final quality flag) and `qfqm` (quality flag and quality metrics) variables for these high values. NEON\'s quality control algorithms are designed to flag or remove such outliers. It\'s likely these points are flagged as `1` (fail).\n * **Data Processing Error:** A unit conversion error or scaling mistake during data processing could lead to such an inflated value.\n * **Advection/Drainage Flows:** In complex mountainous terrain like Niwot Ridge, non-turbulent transport processes (e.g., cold air drainage carrying CO2 from upslope) can sometimes lead to very large, erroneous "fluxes" measured by eddy covariance, especially at night or during stable atmospheric conditions. However, even these phenomena usually don\'t produce values *this* high.\n * **Missing `qfFinl` column:** It\'s important to remember that NEON data typically comes with `qfFinl` and `qfqm` flags. These stats don\'t explicitly say if these flags were used to filter the `data.fluxCo2.nsae.flux` data yet. It\'s critical to filter the data using `qfFinl == 0` (pass) before performing detailed analysis. If you haven\'t done so, this extremely high value might disappear.\n\nAnother minor, but worth noting, point is the **minimum PAR value of -0.08**. PAR (Photosynthetically Active Radiation) cannot physically be negative. This is likely a very small sensor offset or calibration error near the sensor\'s detection limit. While negligible for your analysis, it\'s a good reminder that even high-quality data can have minor imperfections.\n\n### 4. What this pattern tells us about this ecosystem\'s carbon balance\n\nThis pattern tells us that during June-July 2018, the Niwot Ridge subalpine forest was, on average, a **net carbon sink**.\n\n* **Net Sink Status:** The mean CO2 flux of **-1.547 µmol/m²/s** is negative, meaning that over the entire period, the ecosystem was taking up more CO2 from the atmosphere than it was releasing. This indicates that the forest was actively growing and accumulating biomass.\n* **Peak Productivity:** The combination of strong maximum uptake rates (-14.629 µmol/m²/s) and the overall negative mean flux highlights that June and July are a period of high productivity for this subalpine ecosystem. This is when the forest makes its greatest contribution to reducing atmospheric CO2 concentrations on a daily basis.\n* **Dynamic Balance:** The plot visually demonstrates the dynamic balance between photosynthesis (carbon uptake) during the day and respiration (carbon release) that occurs constantly. During these summer months, photosynthesis clearly dominates over respiration, leading to a net gain of carbon for the ecosystem.'
This defines the suggest_extensions function. This function takes a description of what_student_has_done and a context and prompts the Gemini model to suggest three concrete next analyses. For each suggestion, the AI is asked to provide a title, explain the ecological question it answers, provide the exact Python/pandas code, and explain expected results. This helps guide further data exploration.
def suggest_extensions(what_student_has_done, context=NEON_CONTEXT):
prompt = f"""{context}
A student has completed this analysis so far:
{what_student_has_done}
Suggest 3 concrete next analyses they could do with this dataset. For each:
- Give it a short title
- Explain the ecological question it answers
- Write the exact Python/pandas code to do it (using the variable names from the metadata above)
- Explain what result to expect and what it would mean
Prioritize ideas that are achievable with the variables already loaded and
that reveal something ecologically meaningful about carbon exchange at a subalpine site.
"""
response = model.generate_content(prompt)
return response.text
This cell calls the ask_about_variable function for the data.fluxCo2.nsae.flux variable, using the NEON_CONTEXT. It then prints the AI's detailed explanation of this key CO2 flux variable for the NIWO site.
print(ask_about_variable('data.fluxCo2.nsae.flux'))
Hello everyone! It's great you're diving into NEON's eddy covariance data – it's incredibly powerful for understanding ecosystem carbon cycling. Let's break down that specific variable: `data.fluxCo2.nsae.flux`.
---
Here's an explanation of `data.fluxCo2.nsae.flux` for the NIWO site in June–July 2018:
1. **What this variable measures physically:**
This variable measures the **Net Surface-Atmosphere Exchange (NSAE) of Carbon Dioxide (CO2)**. In simpler terms, it's the net movement of CO2 gas between the entire subalpine forest ecosystem (soil, plants, woody debris) and the atmosphere directly above it. The "NSAE" part is key: it means it accounts for both the rapid turbulent movement of CO2 and any changes in CO2 stored within the air column *below* the measurement height (known as "storage flux").
* **Negative values** (like -14.63 µmol CO2 m⁻² s⁻¹) indicate that the ecosystem is **taking up CO2** from the atmosphere. This primarily happens during the day due to photosynthesis by plants.
* **Positive values** (like 310.11 µmol CO2 m⁻² s⁻¹) indicate that the ecosystem is **releasing CO2** to the atmosphere. This primarily occurs due to ecosystem respiration (by plants and microbes) and sometimes during periods of low photosynthesis.
2. **Its units and typical range at this site (June–July 2018):**
* **Units:** The units are **micromoles of CO2 per square meter per second (µmol CO2 m⁻² s⁻¹)**. This quantifies how many molecules of CO2 are moving across a specific area of land over a given time.
* **Typical Range:** Based on the descriptive statistics for June–July 2018 at NIWO:
* The **minimum** observed flux was **-14.63 µmol CO2 m⁻² s⁻¹**, indicating strong net CO2 uptake by the forest during that half-hour period.
* The **maximum** observed flux was **310.11 µmol CO2 m⁻² s⁻¹**, indicating a very large net release of CO2. This maximum value is significantly higher than the 75th percentile (0.99 µmol CO2 m⁻² s⁻¹), suggesting it might be an outlier or represent a very specific, intense event.
* The **mean** flux was **-1.55 µmol CO2 m⁻² s⁻¹**, and the **median** was **-0.36 µmol CO2 m⁻² s⁻¹**. Both the mean and median being negative suggest that, on average, the NIWO subalpine forest was a **net carbon sink** (absorbing more CO2 than it released) during this early summer period in 2018.
3. **Why an ecologist would care about it:**
Ecologists care deeply about this variable because it's a **direct, ecosystem-scale measurement of carbon exchange**, which is fundamental to understanding:
* **Ecosystem Metabolism:** It reveals the balance between photosynthesis (carbon uptake) and respiration (carbon release) for the entire forest. This helps determine the "health" and productivity of the ecosystem.
* **Carbon Sequestration/Source Strength:** It directly quantifies whether an ecosystem is acting as a net carbon sink (removing CO2 from the atmosphere) or a net carbon source (contributing CO2 to the atmosphere). This is critical for assessing the role of different ecosystems in global carbon cycles and climate change.
* **Responses to Environmental Change:** By tracking these fluxes over time, ecologists can observe how the forest responds to changes in temperature, precipitation, drought, growing season length, and other environmental factors.
4. **One thing a student should watch out for when interpreting it:**
**Data Quality and Outliers:** The maximum value of 310.11 µmol CO2 m⁻² s⁻¹ stands out dramatically compared to the typical range and the 75th percentile of 0.99. Eddy covariance data can sometimes contain spurious high values due to instrument malfunction, adverse weather conditions (e.g., rain on the sensor), or periods of very low turbulence.
* **Always check the quality flags!** NEON provides specific quality flags like `qfFinl` (final quality flag) and `qfqm` (quality flag and quality metrics). Before doing any quantitative analysis, filter your data to include only records where `qfFinl` indicates a "pass" (typically 0). This ensures you are working with reliable data and can help remove these potentially problematic extreme values. Failing to do so can significantly skew your results and interpretations.