Distributed Spatial-Temporal Runtime
ScienceClaw now includes a scaffold for bounded distributed analysis. The goal is an AI-native environmental workflow runtime, not an autonomous container swarm.
Architecture
ScienceClaw UI / Agent
|
v
Task Config YAML
|
v
Local Worker or Kubernetes Job
|
v
Stream STAC / COG / Zarr Data
|
v
Write Reports, Figures, Tables, Logs
|
v
Output Viewer / Workspace UI
|
v
Human Review
The sub-agent is primarily an execution boundary. It receives explicit inputs, runs a bounded task, writes durable outputs, and exits.
Local Execution
Local execution is the primary test path. It uses the same worker logic as Kubernetes and writes to the same output structure.
SCIENCECLAW_WORKER_OFFLINE=1 ./scripts/run_worker_local.sh examples/spatiotemporal/tasks/example_stac_preview.yaml --offline
python3 scripts/build_output_index.py --data-root ./data
The runner prefers Docker and falls back to direct Python mode for debugging. Direct mode is useful for smoke tests but Docker mode better resembles a Kubernetes Job.
Kubernetes Execution
Kubernetes is optional. The scaffold under deploy/kubernetes/ includes:
- namespace pattern
- persistent volume claim pattern
- service account
- narrow Role and RoleBinding
- example analysis Job
- example worker Pod spec
- resource requests and limits
- job timeout and backoff settings
- ConfigMap task injection
- Secret mounting pattern
Render a bounded manifest without applying it:
python3 runtime/job-launcher/render_job_manifest.py \
--task examples/spatiotemporal/tasks/example_stac_preview.yaml \
--job-id example-stac-preview \
--image scienceclaw-spatiotemporal-worker:local
The submit helper is print-only by default. Use --apply only after human review and only with a configured namespace and cluster context.
Task Configs
Task configs are YAML files. The initial example lives at examples/spatiotemporal/tasks/example_stac_preview.yaml.
Required fields:
task_nameoutput_dirinputsanalysisoutputs
Worker jobs should not accept arbitrary shell commands. Output directories should live under /data/outputs/jobs/<job-id>/.
Output Structure
Each worker job writes:
task.yamlstatus.jsonlogs.txtmetadata.jsonreport.mdreport.htmlfigures/tables/maps/
The output indexer scans /data/outputs/jobs and writes /data/outputs/index.html.
python3 scripts/build_output_index.py --data-root ./data
Start the browser workspace UI:
docker compose up workspace-ui
Open http://127.0.0.1:8888 and inspect /data/outputs/index.html, job folders, reports, figures, CSV tables, JSON metadata, logs, notebooks, and maps.
Stream-First Data Access
The runtime is oriented around cloud-native environmental data access:
- STAC catalog search before pixel reads
- COG window reads through HTTP range requests
- Zarr stores opened lazily with xarray
- object storage access through
fsspec,s3fs, andgcsfs - derived outputs persisted locally, not large source datasets
Installed core support includes GDAL, PROJ, GEOS, libspatialindex, rasterio, rioxarray, xarray, dask, zarr, geopandas, shapely, pyproj, fsspec, s3fs, gcsfs, aiohttp, requests, numpy, pandas, matplotlib, pyarrow, duckdb, pystac-client, odc-stac, stackstac, and folium.
Optional heavier packages to consider per deployment include leafmap, rio-cogeo, cogeo-mosaic, planetary-computer, hvplot, holoviews, and datashader.
Safety
Do not:
- allow unbounded job spawning
- grant cluster-admin permissions
- bake credentials into images
- write secrets to logs
- let worker jobs recursively launch other jobs
- require Kubernetes for normal local use
- download giant source datasets in examples
Do:
- use bounded templates
- isolate namespaces
- use explicit task configs
- write outputs to known directories
- preserve logs and metadata
- keep artifacts browser-visible
- require human review before publication or downstream action