Seamless Arrays: A Full Stack, Cloud-Native Architecture for Fast, Scalable Data Access

presentation
posted on 2024-12-16, 04:55 authored by Joseph Hamman, Ryan Abernathey, Deepak Cherian
<p dir="ltr">Just about everyone agrees on what the ideal Earth system data service would provide: near-instantaneous access to an arbitrary slice of any dataset, across any spatiotemporal dimensions, for an unlimited number of concurrent users. This capability would underpin nearly all downstream workloads, from data visualization to large-scale statistics and analysis to AI/ML model training and inference.</p><p><br></p><p dir="ltr">In reality, most systems that data providers use today fall well short of this ideal. They generally belong to one of two categories:</p><ul><li><b>File-based data distribution</b> - For most agency data providers like NASA, NOAA, and USGS, the file or “granule” is the fundamental atom of data distribution. Datasets are split into many NetCDF or TIFF files and distributed via FTP servers or, more recently, cloud storage buckets. While cloud-native formats can make such access more efficient, the choice of how to split a dataset into files or chunks fundamentally constrains the efficiency of different access patterns. In particular, extracting long timeseries records is extremely slow and expensive for nearly all common file-based datasets in which each file corresponds to a single time slice.</li><li><b>API-based data access</b> - API standards such as OPeNDAP and OGC EDR allow users to request an arbitrary subset of a dataset. The underlying storage details are hidden from the end user, and the responsibility for extracting the subset is pushed to the server side. However, in many cases, this simply shifts the computational burden to the data provider without providing significant speedups for end users.</li></ul><p dir="ltr">At Earthmover, we’ve become obsessed with a vision we call “Seamless Arrays”: the ability to easily query a data cube across both spatial and temporal dimensions with consistent low latency. 
This talk will showcase our prototype implementation of Seamless Arrays, which is inspired by cloud-native database systems like Snowflake and consists of two primary components:</p><ul><li>A novel Zarr-based schema for storing data in cloud object storage, optimized to support diverse query patterns</li><li>An API layer providing access to the data via OPeNDAP and OGC EDR, capable of scaling from zero to thousands of nodes</li></ul><p dir="ltr">We’ll demonstrate the performance characteristics of this system in terms of latency and throughput and also explore the costs of operation.</p>
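<p dir="ltr">To make concrete the point that the chunking choice constrains access patterns, here is a minimal sketch that counts how many chunks a query must read under different layouts. The grid shape, the three chunk layouts, and the helper name <code>chunks_touched</code> are hypothetical and chosen only for illustration; they are not the schema presented in the talk.</p>

```python
def chunks_touched(start, stop, chunk_shape):
    """Count the chunks that intersect the half-open index box [start, stop)."""
    count = 1
    for lo, hi, size in zip(start, stop, chunk_shape):
        first = lo // size          # index of the first chunk along this axis
        last = (hi - 1) // size     # index of the last chunk along this axis
        count *= last - first + 1
    return count


# Hypothetical daily global grid with dimensions (time, lat, lon).
SHAPE = (3650, 720, 1440)

# Hypothetical chunk layouts (assumptions, not from the talk).
LAYOUTS = {
    "one time slice per chunk": (1, 720, 1440),
    "full time series per chunk": (3650, 10, 10),
    "balanced": (365, 90, 90),
}

# Two access patterns expressed as (start, stop) index boxes.
TIMESERIES = ((0, 100, 200), (3650, 101, 201))  # every time step at one grid point
SNAPSHOT = ((500, 0, 0), (501, 720, 1440))      # one full map at a single time

if __name__ == "__main__":
    for name, chunk_shape in LAYOUTS.items():
        n_series = chunks_touched(*TIMESERIES, chunk_shape)
        n_map = chunks_touched(*SNAPSHOT, chunk_shape)
        print(f"{name}: timeseries reads {n_series} chunks, map reads {n_map} chunks")
```

<p dir="ltr">With these assumed numbers, the per-time-slice layout needs 3650 reads for a single-point timeseries but only one for a map, while the timeseries-oriented layout reverses the trade-off. No single layout in this sketch serves both queries cheaply.</p>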
