Lucenia + STAC: Navigating Massive Spatial Data

Dalton Boggs

The SpatioTemporal Asset Catalog (STAC) specification improves on the Open Geospatial Consortium’s Catalog Service for the Web (CSW), an earlier standard for organizing and cataloging geospatial data. It provides a consistent and flexible way to index, search, and retrieve spatiotemporal assets such as satellite imagery, aerial photographs, and other geospatial datasets. STAC defines a structure for metadata, making it possible to describe a wide variety of geospatial resources in a standardized way. This metadata typically includes the location, time, and properties of each asset, as well as the means to access and process the data. Using STAC helps disparate geospatial systems talk to each other without much overhead.
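
To make that concrete, a STAC Item is essentially a GeoJSON Feature with a handful of required fields plus extension properties. Below is an illustrative sketch, written as a Python dict and trimmed to the fields this tutorial queries later; real items carry more metadata, and the exact field shapes in your catalog may differ.

# Illustrative STAC Item (trimmed); field shapes simplified to match
# the queries used later in this tutorial.
stac_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-001",
    "bbox": [177.495, -72.363, 177.495, -72.363],
    "geometry": {"type": "Point", "coordinates": [177.495, -72.363]},
    "properties": {
        "datetime": "2024-06-01T00:00:00Z",        # acquisition time
        "eo:bands": ["nir", "red"],                # spectral bands present
        "eo:cloudcover": 12,                       # percent cloud cover
        "proj:centroid": {"lat": -72.363, "lon": 177.495},
    },
    "assets": {
        "image": {"href": "s3://your-bucket/example-scene-001.tif"}
    },
    "links": [],
}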

There is one big problem, though: the size of the datasets. The very thing that makes STAC approachable – it’s simply JSON – starts to hamstring attempts to query the data at scale. Looping through giant JSON objects and arrays is slow, which is where the Lucenia search platform can help.

Integrating STAC with Lucenia’s “do more with less” design significantly enhances the capability to search through large, distributed geospatial datasets while reducing storage cost. Lucenia provides powerful full-text, numeric, and spatial search capabilities using a simple API. When paired with STAC, Lucenia enables users to perform fast and scalable searches across a catalog of spatiotemporal assets using common attributes like time range, location (bounding box or complex shapes), and other metadata. This integration helps geospatial experts and data scientists quickly discover new insights, reducing the time spent manually sifting through and analyzing massive datasets. The combination of STAC’s rich metadata and Lucenia’s efficient search engine makes geospatial data more discoverable and usable for a wide range of applications.

Keeping with Lucenia’s theme of simplicity, this blog post marks the second in a series of tutorials to demonstrate how easy it is to leverage Lucenia for high-performance, complex spatial search and analytics on massive datasets.

Prerequisites

A common barrier to getting STAC data out of object storage is setting up the data flows and storage around it, which is exactly what this guide walks through. The STAC API will use the node set up in the previous tutorial, Getting Started with Lucenia in Five Easy Steps.

Once the node is running, there are a few prerequisites for this tutorial. Python needs to be installed, and general familiarity with Python will be helpful. Docker is used throughout, though the repository’s README has instructions for running outside of Docker.

Pull down the Lucenia tutorial repository with the following composite command, then copy your trial license certificate into the node configuration:


git clone git@github.com:lucenia/lucenia-tutorials && cd lucenia-tutorials/2_stac-demo && source .env

cp ~/Downloads/trial.crt node/config

STAC + Lucenia for Vegetation Health Analysis

Now let’s get started.

Step 1: Launch the Application

Simply use Docker Compose to spin up the application. The API will only start if there’s a Lucenia node running at the correct endpoint, so refer back to the previous tutorial if the node isn’t up yet.


docker compose up 

At the end of the logs, if the setup worked, there should be a message reading “Application startup complete.”
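
As an optional sanity check (just a sketch), you can confirm the API container is answering on port 8082 from Python before moving on:

import requests

# Fetch the Swagger page served by the demo API; expect HTTP 200 once
# "Application startup complete" appears in the logs.
resp = requests.get("http://localhost:8082/api.html", timeout=5)
print(resp.status_code)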

Step 2: Inserting Sample STAC Data

Now that your app is running, let’s add some sample STAC catalog data to the Lucenia node.

Create and activate a virtual environment:

  • Create a Python Virtual Environment
    • Navigate to the `2_stac_catalog` directory.
    • Create a virtual environment using the following command:

python3 -m venv venv
  • Activate the Virtual Environment
    • On macOS and Linux:

source venv/bin/activate
    • On Windows:

.\venv\Scripts\activate
  • Install Dependencies:
    • With the virtual environment activated, install the required dependencies:

pip install -r requirements.txt
  • Make sure the virtual environment is still active and run the data loader:

python3 data_loader.py \
--base-url http://localhost:8082 \
--username admin \
--password admin

This command will load some sample data into your Lucenia node, so you can start playing around with it.

The FastAPI server exposes a Swagger UI at localhost:8082/api.html. From there you can run various operations, a few of which we’ll explore now.

Step 3: Querying Near Infrared (NIR) Data for Vegetation Health

Enter the following query into the /search POST endpoint in the Swagger UI.


{
  "terms" : {
    "eo:bands" : [ "nir", "red" ],
    "minimumshouldmatch" : 1
  },
  "geodistance" : {
    "distance" : "10km",
    "proj:centroid" : {
      "lat" : -72.36312264511113,
      "lon" : 177.49505051948472
    }
  }
}

This query lets us explore near-infrared (NIR) and red band data within 10 km of a given centroid. Because healthy vegetation reflects strongly in the NIR band while chlorophyll absorbs red light, comparing the two bands reveals the level of photosynthetic activity in plants, a key indicator of plant health.
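
Entering the query in Swagger is convenient for exploration, but the same search can be scripted. Here is a minimal sketch using Python’s requests library; it assumes the /search endpoint is served at the API root and accepts (or ignores) HTTP basic auth with the same admin/admin credentials passed to data_loader.py, so adjust if your deployment differs.

import requests

# The Step 3 query: NIR/red imagery within 10 km of the given centroid.
query = {
    "terms": {
        "eo:bands": ["nir", "red"],
        "minimumshouldmatch": 1,
    },
    "geodistance": {
        "distance": "10km",
        "proj:centroid": {
            "lat": -72.36312264511113,
            "lon": 177.49505051948472,
        },
    },
}

resp = requests.post(
    "http://localhost:8082/search",   # same base URL used by data_loader.py
    json=query,
    auth=("admin", "admin"),          # assumption: basic auth matches the loader
    timeout=30,
)
resp.raise_for_status()
print(resp.json())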

This application can be especially useful in precision agriculture. For instance, a user could apply this query to identify areas of a field showing signs of stress or reduced growth. By combining NIR data with geospatial analytics, agronomists could identify potential problems such as nutrient deficiencies, water stress, or disease, enabling early interventions. This not only optimizes crop yields but also minimizes the use of resources like water and fertilizer by targeting specific areas of concern.
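
One common way to do that combination is the Normalized Difference Vegetation Index (NDVI), (NIR - Red) / (NIR + Red). The snippet below is purely illustrative and not part of the tutorial code; it assumes you have already extracted NIR and red reflectance values from the returned assets.

import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Compute NDVI: values near +1 suggest dense, healthy vegetation;
    values near 0 or below suggest bare soil, water, or stressed plants."""
    nir = nir.astype("float64")
    red = red.astype("float64")
    return (nir - red) / np.maximum(nir + red, 1e-9)  # guard against divide-by-zero

# Toy reflectance values for illustration only.
print(ndvi(np.array([0.45, 0.30]), np.array([0.05, 0.25])))  # ~[0.8, 0.09]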

Step 4: Filtering by Cloud Cover using Lucenia Analytics

In addition to filtering by bands and proximity, the query can be extended with a histogram aggregation to analyze cloud cover in the dataset. By adding a bucket aggregation on the eo:cloudcover field, the results can be grouped into different cloud cover intervals, say every 25%. This would look something like:


{
  "aggs" : {
    "cloud_cover_histogram" : {
      "histogram" : {
        "field" : "eo:cloudcover",
        "interval" : 25
      }
    }
  }
}

This aggregation groups the results into cloud-cover buckets, making it easy to spot imagery with lower cloud cover and choose images that balance quality and availability.
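
Scripting this step looks much the same as Step 3: append the aggs block to the query body and read the buckets out of the response. The sketch below assumes the /search endpoint forwards aggregations to Lucenia and returns them under an OpenSearch-style aggregations key; if the demo API shapes its responses differently, adjust the parsing accordingly.

import requests

# Step 3 query plus the cloud cover histogram from Step 4.
body = {
    "terms": {"eo:bands": ["nir", "red"], "minimumshouldmatch": 1},
    "geodistance": {
        "distance": "10km",
        "proj:centroid": {"lat": -72.36312264511113, "lon": 177.49505051948472},
    },
    "aggs": {
        "cloud_cover_histogram": {
            "histogram": {"field": "eo:cloudcover", "interval": 25}
        }
    },
}

resp = requests.post("http://localhost:8082/search", json=body,
                     auth=("admin", "admin"), timeout=30)
resp.raise_for_status()

# Assumption: buckets arrive in an OpenSearch-style "aggregations" block.
for bucket in resp.json()["aggregations"]["cloud_cover_histogram"]["buckets"]:
    low = bucket["key"]
    print(f"{low:.0f}-{low + 25:.0f}% cloud cover: {bucket['doc_count']} items")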

Wrapping Up and Next Steps

By combining STAC with Lucenia, users can quickly search and sort through huge amounts of geospatial data to get answers to questions that might take significantly longer given other forms of analysis. This type of querying enables geospatial experts to find exactly what they need from vast catalogs, making data more actionable.

Looking further ahead, it is not out of the realm of possibility to leverage LLMs and AI to convert natural language into structured search queries. Given a prompt such as, “Give me satellite imagery containing near-infrared bands around this [soybean field], ranked by cloud cover,” translated automatically into queries like the ones above, the barrier to turning heaps of data into actionable information would be vastly lowered.