Getting Started with Lucenia in Five Easy Steps

Nick Knize

September 6, 2024

At Lucenia, our mission is to make advanced search capabilities accessible to everyone. Lucenia 0.1.0 embodies this philosophy by bringing PhD-level science and analytics into everyday hybrid search use cases, all while ensuring enterprise-grade Role-Based and Attribute-Based Access Control (RBAC and ABAC) security by default. If you missed our exclusive sneak peek, be sure to check it out. This post marks the first in a series of tutorial blog posts that will guide you from getting started with Lucenia to mastering some of its most sophisticated features.

Prerequisites

Before diving in, ensure you have satisfied the following prerequisites:

  1. Java Installation: You need Java 19 or later, but we strongly recommend using Java 21 or above to take full advantage of search optimizations, including Project Panama and SIMD.
  2. Docker Installation: While not a hard requirement, Docker is used in this post to run Lucenia. If you don’t have Docker installed, you can find the installation instructions here. Other ways to install and run Lucenia can be found in our installation guide on the Lucenia Documentation.
  3. Lucenia License: Obtain a Lucenia license by registering at https://cloud.lucenia.io. This license gives you full access to all features for 30 days. After that, you can either purchase a yearly license or stay tuned for the launch of our free developer license, which allows for a minimally scalable production cluster at no cost. Watch the video below for a quick walkthrough on how to register and obtain your Lucenia license.
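If you would like to confirm the Java and Docker prerequisites from the command line, the following is a minimal sketch. The helper names `java_major_version` and `check_prerequisites` are hypothetical and are not part of Lucenia or the tutorials repository; the version thresholds come from the list above.

```shell
# Extract the major version from a `java -version` line.
# Handles modern ("21.0.2") and legacy ("1.8.0_392") version strings.
java_major_version() {
  # $1: a line such as: openjdk version "21.0.2" 2024-01-16
  version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  case "$version" in
    "")  return 1 ;;                                  # no version string found
    1.*) printf '%s\n' "$version" | cut -d. -f2 ;;    # legacy scheme: 1.8 -> 8
    *)   printf '%s\n' "$version" | cut -d. -f1 | sed 's/[^0-9].*//' ;;  # 21.0.2 -> 21
  esac
}

# Report whether the Java and Docker prerequisites look satisfied.
check_prerequisites() {
  if ! command -v java >/dev/null 2>&1; then
    echo "java not found on PATH"
    return 1
  fi
  # `java -version` writes to stderr, so redirect it before parsing.
  major=$(java_major_version "$(java -version 2>&1 | grep 'version "' | head -n 1)") || return 1
  if [ "$major" -lt 19 ]; then
    echo "Java $major is too old; 19 or later is required"
    return 1
  fi
  if [ "$major" -lt 21 ]; then
    echo "Java $major works, but 21 or later is recommended"
  fi
  # Docker is optional, so only print a hint when it is missing.
  command -v docker >/dev/null 2>&1 || echo "docker not found; see the installation guide for other options"
  return 0
}
```

Running `check_prerequisites` prints a hint for anything that looks out of place and returns a non-zero status only when Java is missing or older than 19.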

Step 1 – Spinning Up Lucenia

In keeping with the theme of simplicity, getting Lucenia up and running takes a single command. First download your Lucenia license from your registration email to your `Downloads` folder, then follow the steps below:

A. Clone the Lucenia Tutorials Repository: Clone the `lucenia-tutorials` GitHub repository, which includes all necessary materials for this and future tutorials. We recommend bookmarking this repository, as it will grow over time to include the free training and tutorial materials needed to run the examples. After cloning the repository, change into the `1_getting-started` directory and set the required environment variables, which include information like the admin password and bulk indexing sizes. This can be done in one composite command as follows:

git clone git@github.com:lucenia/lucenia-tutorials && cd lucenia-tutorials/1_getting-started && source env.sh

Warning: Some systems may not refresh the file cache fast enough for composite commands to work. In that case, run each command individually.
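Run individually, the composite command breaks down into the three steps shown in the comments below. The sketch also adds a small check that the environment was actually loaded. `require_env` is a hypothetical helper, and `LUCENIA_INITIAL_ADMIN_PASSWORD` is the only variable name this post confirms; the other variables set by `env.sh` are not listed here.

```shell
# The composite command, one step at a time:
#   git clone git@github.com:lucenia/lucenia-tutorials
#   cd lucenia-tutorials/1_getting-started
#   source env.sh

# Return success only if every named variable is set to a non-empty value.
require_env() {
  for name in "$@"; do
    # Indirect lookup via eval; names come from the caller, not from user input.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}
```

After sourcing `env.sh`, calling `require_env LUCENIA_INITIAL_ADMIN_PASSWORD` should succeed silently. If it reports a missing variable, the `source` step most likely did not run in your current shell.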

B. Copy the License File: Next, copy the downloaded Lucenia license into the node config directory:


cp ~/Downloads/trial.crt node/config

Warning: If you forget this step, Docker will likely create a `trial.crt/` directory when you try to start the node. In that case, remove the directory (`rm -rf trial.crt/`) and recopy the license file.
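To catch this situation before starting the node, a quick check along the following lines may help. `check_license` is a hypothetical helper, and its default path simply follows the `cp` destination used above.

```shell
# Succeed only when the license exists as a regular, non-empty file.
check_license() {
  license_path=${1:-node/config/trial.crt}
  if [ -d "$license_path" ]; then
    # Docker created a directory in place of the missing file.
    echo "$license_path is a directory; remove it and recopy the license" >&2
    return 1
  fi
  if [ ! -s "$license_path" ]; then
    echo "$license_path is missing or empty" >&2
    return 1
  fi
}
```

Running `check_license` from the `1_getting-started` directory returns success when the license is in place, so it can be chained as `check_license && docker compose up`.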

C. Launch Lucenia: Finally, spin up the Lucenia node with Docker Compose using the following command:


docker compose up

That’s it! Just like that, Lucenia is up and running! This command will pull the 0.1.0 image from the release container registry and launch a single production-ready node with a small 512 MB heap. If you need to modify the node settings, you can edit the `docker-compose.yml` file as needed. For more information, refer to the Lucenia documentation. Verify the node is running with:


curl "https://localhost:9200" \
-u "admin:${LUCENIA_INITIAL_ADMIN_PASSWORD}" \
--insecure
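The node can take a few moments to accept requests after `docker compose up`, so the first check may fail if it runs too early. The sketch below retries the same request until it succeeds. `retry_until_ok` and `probe_node` are hypothetical helpers, and the attempt count and delay are arbitrary choices.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: retry_until_ok <max_attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Probe the node with the same request used in the verification step.
# --fail makes curl exit non-zero on HTTP error responses.
probe_node() {
  curl --silent --fail --insecure \
    -u "admin:${LUCENIA_INITIAL_ADMIN_PASSWORD}" \
    "https://localhost:9200"
}
```

For example, `retry_until_ok 30 2 probe_node && echo "node is up"` polls every two seconds for up to a minute.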


For secure socket layer (SSL) network transport setup, refer to the “Generating self-signed certificates” section in Lucenia’s security documentation.

Step 2 – Creating the Index

Lucenia is schema-flexible, meaning you can start indexing without predefined schemas. Simply upload your document and an index will be created with a “best effort” schema determined by introspecting the first document provided. However, for advanced field types like geospatial points, it’s advisable to use a schema, also known as a “mapping”. In this tutorial, we’ll use a predefined mapping from the `mappings.json` file.

To create the index, execute:


curl -XPUT "https://localhost:9200/nyc_taxis" \
-u "admin:${LUCENIA_INITIAL_ADMIN_PASSWORD}" \
--insecure \
--header 'Content-Type: application/json' \
--data-binary "@mappings.json"

You’ll receive an acknowledgment response from Lucenia. Verify the mappings with:


curl "https://localhost:9200/nyc_taxis/_mapping" \
-u "admin:${LUCENIA_INITIAL_ADMIN_PASSWORD}" \
--insecure

For more details on field types, visit the Lucenia documentation field types section.
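To give a feel for what a mapping looks like, here is a small illustrative sketch. It is not the contents of the tutorial’s `mappings.json`: it covers only the two fields queried later in this post, and the field types, the date format, and the overall layout are assumptions based on the OpenSearch-style mapping syntax. `write_sample_mapping` is a hypothetical helper.

```shell
# Write an illustrative mapping to the given file.
# Assumption: OpenSearch-style syntax with a date field and a geo_point field.
write_sample_mapping() {
  cat > "$1" <<'EOF'
{
  "mappings": {
    "properties": {
      "dropoff_datetime": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
      "dropoff_location": { "type": "geo_point" }
    }
  }
}
EOF
}
```

Declaring the location field explicitly is the point of the exercise: a coordinate pair introspected from the first document would not necessarily be recognized as a geospatial point.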

Step 3 – Bulk Indexing Data

Now that your cluster is running and the index is created, it’s time to bulk load data. We’ll use the `nyc_taxis` dataset, a collection of taxi rides in and around Manhattan, New York. This dataset is commonly used for benchmarking performance since it contains 1 million documents with time series, text, dates, and geo-location types.

The dataset is compressed as `bulk-data.json.bz2` in the `1_getting-started` folder. We’ve provided a script, `index_data.sh`, to simplify the bulk upload process by indexing the documents in chunks of 100K documents at a time. This can be achieved by executing the following:


sh index_data.sh bulk-data.json.bz2

For more details on how the bulk API works, and how to tune it for your use case, please review the Lucenia Bulk API documentation.  
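The chunking idea can be sketched as follows. This is not the contents of `index_data.sh`, which this post does not show. It assumes the data file is newline-delimited JSON in which every document line is preceded by an action line, so chunks must contain an even number of lines. `split_bulk_stream` and the output directory name are hypothetical.

```shell
# Split newline-delimited JSON from stdin into files of at most
# <lines_per_chunk> lines, written as <out_dir>/chunk_aa, chunk_ab, ...
# Assumption: lines come in action/document pairs, so the size must be even
# (for example 200000 lines for 100K documents).
split_bulk_stream() {
  lines_per_chunk=$1
  out_dir=$2
  if [ $((lines_per_chunk % 2)) -ne 0 ]; then
    echo "lines_per_chunk must be even" >&2
    return 1
  fi
  mkdir -p "$out_dir"
  split -l "$lines_per_chunk" - "$out_dir/chunk_"
}
```

Under those assumptions, `bzip2 -dc bulk-data.json.bz2 | split_bulk_stream 200000 ./chunks` would produce files that can each be sent in a separate bulk request with `curl --data-binary @<file>`. Smaller chunks keep each request body manageable at the cost of more round trips.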

During the indexing process, the upload progress and Lucenia response can be monitored by following the `bulk_post.log` file using the command:


tail -f ./scratch/bulk_post.log

Once the upload is complete, verify the data was successfully indexed with:


curl "https://localhost:9200/nyc_taxis/_count?pretty" \
-u "admin:${LUCENIA_INITIAL_ADMIN_PASSWORD}" \
--insecure

You should see a response that includes `"count" : 1000000`, indicating that all 1 million documents have been indexed.
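If you want to check the count in a script rather than by eye, a minimal sketch is shown below. `extract_count` and `verify_count` are hypothetical helpers that rely only on the `"count"` field mentioned above, and they use `sed` so no JSON tooling is required.

```shell
# Read a count response on stdin and print the value of its "count" field.
extract_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed only if the response on stdin reports exactly the expected count.
verify_count() {
  actual=$(extract_count)
  [ "$actual" = "$1" ]
}
```

Piping the output of the `_count` request above into `verify_count 1000000` returns success only when every document made it into the index.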

Step 4 – Query the Data

With 1 million documents indexed, let’s run a time-series query to find all taxi rides between March 3 and March 8, 2016. Execute the following command:


curl -XGET "https://localhost:9200/nyc_taxis/_search?pretty" \
-u "admin:${LUCENIA_INITIAL_ADMIN_PASSWORD}" \
--insecure \
--header 'Content-Type: application/json' \
-d '{
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2016-03-03 00:00:00",
        "lte": "2016-03-08 00:00:00"
      }
    }
  }
}'

This uses a range query on a date field, with the minimum and maximum values specified in the Java date/time string format `yyyy-MM-dd HH:mm:ss`. For more information on the set of queries Lucenia supports, visit the query language section of the Lucenia documentation.
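To try other time windows without hand-editing JSON, the request body can be generated. The sketch below checks that each bound has the same shape as the timestamps in the query above (four-digit year, colon-separated time) before building the body. `is_query_datetime` and `range_query` are hypothetical helpers.

```shell
# Succeed if the argument has the shape "yyyy-MM-dd HH:mm:ss".
# Shape check only: it does not reject impossible dates such as month 13.
is_query_datetime() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}$'
}

# Print a range-query body for the given field and inclusive bounds.
# Usage: range_query <field> <gte> <lte>
range_query() {
  is_query_datetime "$2" && is_query_datetime "$3" || return 1
  printf '{"query":{"range":{"%s":{"gte":"%s","lte":"%s"}}}}\n' "$1" "$2" "$3"
}
```

The generated body can replace the inline JSON in the command above, for example `-d "$(range_query dropoff_datetime '2016-03-03 00:00:00' '2016-03-08 00:00:00')"`.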

Step 5 – Query the Data and Perform Analytics

Search is more powerful when combined with analytics, because a simple query often does not surface enough insight into the data. Alongside search, Lucenia offers robust analytic capabilities that are compatible with other search engines like OpenSearch, which makes migration from slower solutions seamless.

To demonstrate, let’s perform a `geo_distance` aggregation that groups taxi rides by their distance from the center of Manhattan. Execute the following command:


curl -XGET "https://localhost:9200/nyc_taxis/_search?pretty" \
-u "admin:${LUCENIA_INITIAL_ADMIN_PASSWORD}" \
--insecure \
--header 'Content-Type: application/json' \
-d '{
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2016-03-03 00:00:00",
        "lte": "2016-03-08 00:00:00"
      }
    }
  },
  "aggs": {
    "manhattan_rings": {
      "geo_distance": {
        "field": "dropoff_location",
        "origin": "POINT (-73.971321 40.776676)",
        "unit": "mi",
        "ranges": [
          {
            "to": 1
          },
          {
            "from": 1,
            "to": 3
          },
          {
            "from": 3,
            "to": 5
          },
          {
            "from": 5,
            "to": 15
          },
          {
            "from": 15
          }
        ]
      }
    }
  }
}'

The response will include the original search hits along with the newly aggregated data. In this case, separate buckets show the distribution of taxi rides across the five distance ranges. For more on Lucenia’s analytics, we highly encourage you to check out the aggregation documentation and experiment with the data provided in this tutorial.
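One easy experiment is to change the ring boundaries. The ranges in the aggregation above are contiguous: each ring starts where the previous one ends, with an open-ended ring at either extreme. The sketch below generates such a `ranges` array from a list of ascending breakpoints. `distance_ranges` is a hypothetical helper.

```shell
# Print a geo_distance "ranges" array built from ascending breakpoints.
# Breakpoints 1 3 5 15 yield five rings: <1, 1-3, 3-5, 5-15, and 15+.
distance_ranges() {
  out='['
  prev=''
  for edge in "$@"; do
    if [ -z "$prev" ]; then
      out="${out}{\"to\":${edge}}"                      # innermost ring
    else
      out="${out},{\"from\":${prev},\"to\":${edge}}"    # ring between two edges
    fi
    prev=$edge
  done
  # Close with an open-ended ring past the last breakpoint.
  if [ -n "$prev" ]; then
    out="${out},{\"from\":${prev}}"
  fi
  printf '%s]\n' "$out"
}
```

Calling `distance_ranges 1 3 5 15` reproduces the five ranges used in this step, and a call such as `distance_ranges 2 4 8` gives a coarser breakdown to paste into the `ranges` field.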

Conclusion and Next Steps

Congratulations! You’ve just scratched the surface of Lucenia’s capabilities. Lucenia offers a wealth of features with performance improvements of over 17% compared to alternatives like OpenSearch, all while reducing index sizes and cutting resource usage. Our “do more with less” approach ensures that you can search on your terms, with flexibility and cost savings during these financially challenging times.

Stay tuned for more blog posts in the coming weeks, including demonstrations on using Lucenia for Geo STAC catalogs, Hybrid Vector Search, Generative AI use cases, and deploying in a self-hosted Kubernetes environment. For more information, including customized pricing, partnerships, and early adopter discounts, contact us at sales@lucenia.io. To request features or join the open-source development community, join us on Slack or our Lucenia forum.

Happy Searching!