Storage to Search: Make Your Cloud Documents Searchable with Lucenia

Nick Knize

Unlock the full potential of your cloud-stored documents with Lucenia, the cutting-edge search platform built to handle complex data with ease. This two-part series will show you how to turn your cloud object storage into a powerful, searchable knowledge base. Whether it’s PDFs, Word documents, or other file types, Lucenia’s cloud-accessible Document Ingest will help you extract, index, and search through massive datasets like a pro.

In this first part, we’ll take you step-by-step through deploying Lucenia, installing the Document Ingest Plugin, and setting up a seamless pipeline to parse and index your documents in the cloud or on-premises. By the end of this tutorial, you’ll have a fully operational system capable of ingesting complex document formats and delivering fast, accurate search results.

But that’s just the beginning! Along the way, we’ll showcase Lucenia’s capabilities, like phrase matching and highlighting, that let you pinpoint critical content in seconds. In Part 2, we’ll dive deeper into advanced processing, search techniques, and trend analysis, revealing insights hidden in your data. Ready to elevate your enterprise search game? Let’s get started!

Prerequisites:

  1. Java: Lucenia 0.2+, with Lucene 10 support, now requires JDK 21 or higher.
  2. Docker: Ensure Docker is installed. Follow Docker Installation Instructions.
  3. NodeJS: Ensure NodeJS is installed. Follow NodeJS installation instructions.
  4. Lucenia License: Obtain a trial license at Lucenia Cloud or via AWS Marketplace.

Step 1: Start Lucenia 0.2

a. Spin up Lucenia and source environment variables

git clone git@github.com:lucenia/lucenia-tutorials && cd lucenia-tutorials/5_document-ingest && source env.sh

b. Copy your Lucenia license to all nodes

cp ~/Downloads/trial.crt node1/config && \
cp ~/Downloads/trial.crt node2/config && \
cp ~/Downloads/trial.crt node3/config && \
cp ~/Downloads/trial.crt ingest1/config && \
cp ~/Downloads/trial.crt ingest2/config

c. Launch Lucenia 0.2

docker compose up

Step 1b: Install the Ingest Attachment Plugin 

The key component for making PDF documents searchable is the Ingest Attachment Processor Plugin. This plugin must first be installed on every node in the cluster. In this tutorial, we used docker compose to set up a five-node cluster:

Three nodes hold the index data and are cluster_manager-eligible, and two nodes serve as ingest-only nodes. This configuration ensures that throughput on the indexing nodes is not impacted by the document parsing and processing step on the ingest nodes.

a. Install the ingest-attachment plugin on all nodes

Installing a Lucenia plugin takes two simple steps: install the plugin on each node and restart the cluster. This can be done in one chained command for all of the nodes as follows:

docker exec -it lucenia-node1 bash -c "lucenia-plugin install --batch http://artifacts.lucenia.io/ingest-attachment/ingest-attachment-0.2.1.zip" && \
docker exec -it lucenia-node2 bash -c "lucenia-plugin install --batch http://artifacts.lucenia.io/ingest-attachment/ingest-attachment-0.2.1.zip" && \
docker exec -it lucenia-node3 bash -c "lucenia-plugin install --batch http://artifacts.lucenia.io/ingest-attachment/ingest-attachment-0.2.1.zip" && \
docker exec -it lucenia-ingest1 bash -c "lucenia-plugin install --batch http://artifacts.lucenia.io/ingest-attachment/ingest-attachment-0.2.1.zip" && \
docker exec -it lucenia-ingest2 bash -c "lucenia-plugin install --batch http://artifacts.lucenia.io/ingest-attachment/ingest-attachment-0.2.1.zip" && \
docker restart lucenia-node1 lucenia-node2 lucenia-node3 lucenia-ingest1 lucenia-ingest2

b. Verify Installation

To verify that the plugin is installed on all nodes, query the /_nodes/ingest endpoint and look for the “attachment” processor in the ingest section of each node.

curl "https://localhost:9201/_nodes/ingest?pretty" \
-u "admin:${LUCENIA_INITIAL_ADMIN_PASSWORD}" \
--insecure
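
Scanning that JSON by eye gets tedious with five nodes. As a rough sketch, the check can be automated in Node.js. It assumes the response follows the shape nodes.<id>.ingest.processors[].type; the function name nodesMissingProcessor and the sample response are illustrative and not part of the tutorial repository.

```javascript
// Returns the names of nodes whose ingest section does not list the given processor type.
// Assumption: the /_nodes/ingest response has the shape
//   { nodes: { "<id>": { name, ingest: { processors: [{ type }] } } } }
function nodesMissingProcessor(nodesResponse, processorType) {
  return Object.values(nodesResponse.nodes ?? {})
    .filter((node) => {
      const processors = node.ingest?.processors ?? [];
      return !processors.some((p) => p.type === processorType);
    })
    .map((node) => node.name);
}

// Hypothetical, trimmed-down response for illustration only.
const sampleResponse = {
  nodes: {
    a1: { name: "lucenia-node1", ingest: { processors: [{ type: "attachment" }, { type: "script" }] } },
    b2: { name: "lucenia-ingest1", ingest: { processors: [{ type: "script" }] } },
  },
};

console.log(nodesMissingProcessor(sampleResponse, "attachment")); // [ 'lucenia-ingest1' ]
```

An empty array means every node reports the processor; any names in the result point at nodes where the install or restart needs another look.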

Now, let's parse the documents and index the data.

Step 2: Create the Index and Document Ingest Processor

Similar to previous blogs, we will first create the index and the processor with two separate commands.

a. Create the legal_documents index with the proper mappings

To index our parsed document data with a schema that supports useful queries, we’ll create an index using the field mappings defined in mappings.json:

curl -XPUT "https://localhost:9201/legal_documents" \
-u "admin:${LUCENIA_INITIAL_ADMIN_PASSWORD}" \
--insecure \
--header 'Content-Type: application/json' \
--data-binary "@mappings.json"
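
The contents of mappings.json ship with the tutorial repository and are not reproduced in this post. Purely as a guess at what a minimal mapping could look like, the sketch below covers only the two fields that the queries later in this tutorial rely on, document.content and document.date. Every field type here is an assumption, and the real file may define more fields or different settings.

```javascript
// Hypothetical mapping sketch; the real mappings.json in the tutorial repository may differ.
// Only the field names document.content and document.date come from the queries in this tutorial.
const mappings = {
  mappings: {
    properties: {
      document: {
        properties: {
          content: { type: "text" }, // extracted text, analyzed for match and match_phrase queries
          date: { type: "date" },    // document date taken from file metadata (type is an assumption)
        },
      },
    },
  },
};

// Print JSON in the form a mappings file would take.
console.log(JSON.stringify(mappings, null, 2));
```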

b. Create the Document Ingest Processor

The Attachment Ingest Processor enables you to extract content from various file types (e.g., PDF, Word documents) and index the text content in Lucenia. Here, we create a processor called document_ingest; it uses the Apache Tika library to parse the PDF document and index it into the legal_documents index:

curl -XPUT "https://localhost:9201/_ingest/pipeline/document_ingest" \
-u "admin:${LUCENIA_INITIAL_ADMIN_PASSWORD}" \
--insecure -H 'Content-Type: application/json' -d @- <<EOF
{
  "description": "Index PDF Documents from Object Storage",
  "processors": [
    {
      "attachment": {
        "field": "base64_data",
        "target_field": "document",
        "indexed_chars": -1
      }
    },
    {
      "script": {
        "source": "if (ctx.document != null && ctx.document.content != null) { ctx.document.content = ctx.document.content.replace('\\n', ' ').replace('\\r', ' '); }"
      }
    },
    {
      "remove": {
        "field": "base64_data"
      }
    }
  ]
}
EOF

The following list explains each part of the processor definition:

  • field: The field name containing the base64-encoded content (e.g., base64_data).
  • target_field: The field where the extracted content and metadata are stored (here, document, so the extracted text ends up in document.content).
  • indexed_chars: Setting to -1 indexes the entire content of the document.
  • The script processor normalizes the extracted content by replacing newline and carriage-return characters with spaces, removing formatting artifacts that might hinder searching or readability.
  • The remove processor deletes the original base64 field to reduce storage size.
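
A pipeline like this can usually be dry-run before any real documents are sent. The sketch below builds a request body for a _simulate call against the document_ingest pipeline. The simulate API exists in OpenSearch and Elasticsearch; its availability in Lucenia is an assumption here, and buildSimulateBody is a hypothetical helper.

```javascript
// Builds a body for POST /_ingest/pipeline/document_ingest/_simulate.
// Assumption: Lucenia exposes the same simulate API as OpenSearch.
function buildSimulateBody(rawBytes) {
  return {
    docs: [
      {
        _source: {
          // The attachment processor reads base64-encoded bytes from this field.
          base64_data: Buffer.from(rawBytes).toString("base64"),
        },
      },
    ],
  };
}

// Plain text stands in for a PDF here to keep the example small.
const body = buildSimulateBody("Claim 1.\nA method for indexing.");
console.log(JSON.stringify(body));
```

If the endpoint is available, the printed JSON can be posted with the same curl flags used above, and the response should show the document after all three processors have run.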

Step 3: Parse and Index Legal Document Data

In this step, we’ll configure the environment to make our legal documents searchable. Alongside starting a five-node Lucenia cluster, the provided docker-compose.yml file also launches a MinIO object storage server. MinIO simulates popular cloud object stores like AWS S3, Google Cloud Storage, and Azure Blob Storage. This storage server will host the legal-docs bucket, where we’ll upload our sample documents for indexing and search.

a. Start the Object Store Monitoring Subsystem

This tutorial includes a Node.js script named s3IngestAttachment.js to monitor and index uploaded PDF files. This script continuously watches the bucket for new PDF uploads and automatically sends them to the Lucenia cluster for parsing and indexing. To get started:

  1. Open a terminal in the tutorial’s project directory:
    ./lucenia-tutorials/5_document-ingest/demo
  2. Run the following commands to install dependencies and start the script:
    npm install && node s3IngestAttachment.js

Upon successful startup, you’ll see the following output on the console:

Bucket "legal-docs" does not exist. Creating it...
Bucket "legal-docs" created successfully.
Monitoring bucket "legal-docs" for new files...
No new files to process.
No new files to process.
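
The source of s3IngestAttachment.js is not reproduced in this post. As a minimal sketch of the polling idea only, a watcher can remember which object keys it has already handled and pick out the unseen PDFs on each pass. The helper findNewPdfKeys and the sample keys are hypothetical.

```javascript
// Returns PDF keys that have not been processed yet and records them as seen.
function findNewPdfKeys(listedKeys, seenKeys) {
  const fresh = listedKeys.filter(
    (key) => key.toLowerCase().endsWith(".pdf") && !seenKeys.has(key)
  );
  fresh.forEach((key) => seenKeys.add(key));
  return fresh;
}

const seen = new Set();
// First pass: one PDF and one non-PDF object are listed in the bucket.
const firstPass = findNewPdfKeys(["Patent_US8331611.pdf", "notes.txt"], seen);
// Second pass: nothing changed, so nothing new is reported.
const secondPass = findNewPdfKeys(["Patent_US8331611.pdf", "notes.txt"], seen);

console.log(firstPass.length ? firstPass : "No new files to process.");
console.log(secondPass.length ? secondPass : "No new files to process.");
```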

b. Upload the Legal Documents

With the monitoring script running, it’s time to upload the sample documents provided in this tutorial. Follow these steps:

  1. Open a browser and navigate to the MinIO web interface for the legal-docs bucket:
    http://localhost:9001/browser/legal-docs
  2. Log in using the default credentials:
    Username: minioadmin
    Password: minioadmin
  3. Once logged in, you’ll see the legal-docs bucket created by the monitoring script. Click Upload -> Upload File.
  4. In the file dialog, navigate to the data directory provided in this tutorial. This folder contains three public domain patent documents authored by Nicholas Knize, PhD, the founder of Lucenia and OpenSearch. These patents cover topics such as High-Dimensional Indexing, Retrieval-Augmented Reality, and Serverless Architecture. It also includes two government documents: the Inflation Reduction Act and the Homeland Security Improvement Act. Select one or more files and click Open to upload them individually or as a collection.

Once uploaded, the Node.js script will detect the new files, parse their content, and index them as individual documents in the Lucenia cluster. The script’s console output will confirm successful indexing:

Processing new file: Patent_WO2011071688A2.pdf
Successfully indexed file: Patent_WO2011071688A2.pdf
Processing new file: Patent_US20170083567A1.pdf
Successfully indexed file: Patent_US20170083567A1.pdf
Processing new file: Patent_US8331611.pdf
Successfully indexed file: Patent_US8331611.pdf
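
Conceptually, each detected file has to reach the cluster as a base64 string in the base64_data field, routed through the document_ingest pipeline. A hedged sketch of such a request follows. The pipeline query parameter is the usual way to select an ingest pipeline in OpenSearch-compatible APIs; buildIndexRequest, the use of the file name as the document ID, and the request layout are assumptions rather than the script's actual code.

```javascript
// Describes an index request that routes one file through the document_ingest pipeline.
// Assumptions: the object key doubles as the document ID, and the cluster listens on
// https://localhost:9201 as in the curl examples.
function buildIndexRequest(fileName, fileBytes) {
  const id = encodeURIComponent(fileName);
  return {
    method: "PUT",
    url: `https://localhost:9201/legal_documents/_doc/${id}?pipeline=document_ingest`,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ base64_data: Buffer.from(fileBytes).toString("base64") }),
  };
}

// "%PDF-1.7" is only a placeholder for real file contents read from the bucket.
const request = buildIndexRequest("Patent_US8331611.pdf", Buffer.from("%PDF-1.7"));
console.log(request.method, request.url);
```

Using the object key as the ID means that re-uploading a file overwrites its earlier version instead of creating a duplicate, which is one reasonable design choice among several.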

Congratulations! The documents are now fully searchable in your Lucenia cluster. Let’s verify with a simple _count.

curl "https://localhost:9201/legal_documents/_count?pretty" \
-u "admin:${LUCENIA_INITIAL_ADMIN_PASSWORD}" \
--insecure

If you uploaded all five sample files, the result should look like this:

{
  "count" : 5,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
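
For scripted setups, the same check can be expressed in a few lines of Node.js. This is a small sketch only; countLooksComplete is a hypothetical helper, and the object is the response shown above pasted in by hand.

```javascript
// Returns true when the count matches and no shard reported a failure.
function countLooksComplete(countResponse, expectedCount) {
  return countResponse.count === expectedCount && countResponse._shards.failed === 0;
}

// The _count response from this step, written out as an object.
const countResponse = { count: 5, _shards: { total: 1, successful: 1, skipped: 0, failed: 0 } };
console.log(countLooksComplete(countResponse, 5)); // true
```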

The data is now parsed, analyzed, and indexed! Now, let’s explore what we have.

Step 4: Search and Explore the Legal Document Data

a. Finding relevant documents by key terms

The first query below demonstrates how to efficiently search for legal documents containing specific terms and retrieve only the information you need.

curl "https://localhost:9201/legal_documents/_search?pretty" \
-u "admin:${LUCENIA_INITIAL_ADMIN_PASSWORD}" \
-H 'Content-Type: application/json' --insecure \
-d'
{ 
  "_source": "document.date",
  "query": {
    "bool": {
      "should": [
        { "match": { "document.content": "Knize" } },
        { "match": { "document.content": "dimension" } }
      ],
      "minimum_should_match": 1
    }
  }
}'

In this example, we search the document.content field for occurrences of the terms "Knize" or "dimension", ensuring that any document matching at least one of these terms is included in the results.

Using a bool.should clause, the query is flexible enough to capture documents with either term, making it ideal for cases where multiple keywords are relevant. The results are further streamlined by limiting the returned data to just the document.date field. This focused approach enables you to quickly pinpoint when these documents were created without sifting through unnecessary details.

This query is a common example of how to optimize search results for targeted analysis, especially in large datasets where efficiency matters.
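
When the list of keywords changes often, the request body can be generated instead of hand-edited. A small sketch in Node.js follows; buildAnyTermQuery is a hypothetical helper and not part of Lucenia or the tutorial repository.

```javascript
// Builds a bool query that matches documents containing at least one of the terms
// and limits the returned _source to the requested field(s).
function buildAnyTermQuery(field, terms, sourceFields) {
  return {
    _source: sourceFields,
    query: {
      bool: {
        should: terms.map((term) => ({ match: { [field]: term } })),
        minimum_should_match: 1,
      },
    },
  };
}

const query = buildAnyTermQuery("document.content", ["Knize", "dimension"], "document.date");
console.log(JSON.stringify(query, null, 2));
```

The printed JSON has the same structure as the body passed to curl above, so it can be sent with the -d flag unchanged.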

b. Finding documents by relevant phrases

This next query quickly surfaces the most relevant sections of patent documents by highlighting where key phrases occur, making it easy to identify critical content at a glance. It also supports advanced analysis by enabling pattern or trend comparisons across multiple patents, streamlining the review process for researchers and patent examiners.

When analyzing patent datasets, phrases like "method for", "system and method", or "device comprising" often mark critical sections of a document, such as:

  • The scope of the invention.
  • Technical details of the system or process.
  • Key claims and embodiments.

However, manually sifting through large volumes of text to find these patterns is inefficient. This is where Lucenia's phrase-matching capabilities shine.

curl -XPOST "https://localhost:9201/legal_documents/_search?pretty" \
-u "admin:${LUCENIA_INITIAL_ADMIN_PASSWORD}" \
--insecure -H "Content-Type: application/json" \
-d '{
  "_source": false,
  "query": {
    "bool": {
      "should": [
        { "match_phrase": { "document.content": "method for" } },
        { 
          "match_phrase": {
            "document.content": "system and method" 
          } 
        },
        { 
          "match_phrase": { 
            "document.content": "device comprising" 
          } 
        },
        { "match_phrase": { "document.content": "process of" } },
        { 
          "match_phrase": { 
            "document.content": "apparatus for" 
          } 
        }
      ],
      "minimum_should_match": 1
    }
  },
  "highlight": {
    "fields": {
      "document.content": {}
    },
    "pre_tags": ["<em>"],
    "post_tags": ["</em>"]
  },
  "size": 5
}'

To identify documents containing these meaningful phrases, we construct a phrase query that searches the document.content field for exact matches. Additionally, we use Lucenia's highlighting feature to display matching phrases directly in the search results for easy interpretation.

Key Phrases

For this analysis, we focus on commonly used phrases in patent documents:

  • "method for"
  • "system and method"
  • "device comprising"
  • "process of"
  • "apparatus for"
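
Because every phrase becomes one match_phrase clause, the search body can be derived directly from this list. The sketch below does that in Node.js; buildPhraseHighlightQuery is a hypothetical helper, and the highlight tags and size mirror the curl example.

```javascript
// Builds a highlight-only search body that matches any of the given phrases.
function buildPhraseHighlightQuery(field, phrases, size = 5) {
  return {
    _source: false,
    query: {
      bool: {
        should: phrases.map((phrase) => ({ match_phrase: { [field]: phrase } })),
        minimum_should_match: 1,
      },
    },
    highlight: { fields: { [field]: {} }, pre_tags: ["<em>"], post_tags: ["</em>"] },
    size,
  };
}

const phrases = ["method for", "system and method", "device comprising", "process of", "apparatus for"];
const phraseQuery = buildPhraseHighlightQuery("document.content", phrases);
console.log(JSON.stringify(phraseQuery));
```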

Query Features

  • Multi-Phrase Matching: The query is configured to match any of the specified phrases, using the bool.should clause.
  • Highlights Only: We limit the response to include only highlighted matches, excluding the full document content to keep the output concise.

By excluding the full document content ("_source": false), the query focuses solely on the relevant snippets. This is particularly useful when dealing with lengthy patent documents, as it allows users to pinpoint important phrases at a glance without wading through unrelated content.
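
On the client side, a highlight-only response still needs a little unpacking. As a sketch, assuming hits follow the usual shape of an _id plus a highlight object keyed by field name, the snippets can be collected per document as shown here. The helper collectHighlights and the sample response are made up for illustration.

```javascript
// Collects highlight snippets per document from a search response.
// Assumption: each hit looks like { _id, highlight: { "<field>": [snippets] } }.
function collectHighlights(searchResponse, field) {
  return (searchResponse.hits?.hits ?? []).map((hit) => ({
    id: hit._id,
    snippets: hit.highlight?.[field] ?? [],
  }));
}

// Hypothetical response fragment; IDs and snippet text are invented.
const sampleSearchResponse = {
  hits: {
    hits: [
      { _id: "doc-1", highlight: { "document.content": ["A <em>method</em> <em>for</em> indexing"] } },
      { _id: "doc-2" },
    ],
  },
};

for (const { id, snippets } of collectHighlights(sampleSearchResponse, "document.content")) {
  console.log(id, snippets.length);
}
```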

Your Cloud-Stored Data, Unlocked

With Lucenia’s powerful document search capabilities, you’ve laid the groundwork for turning your local or cloud-stored documents into a lightning-fast, searchable repository of knowledge. By deploying Lucenia, parsing complex document formats, and seamlessly indexing your data, you’ve built the foundation for high-performance search at scale.

But this is just the start. In Part 2, we’ll take it to the next level, diving into advanced query techniques while focusing on speed, parallelism, and cost efficiency. You’ll learn how to harness Lucenia’s full potential to optimize search performance without the skyrocketing costs and vendor lock-in associated with traditional cloud providers. Get ready to unlock the next chapter in scalable, cost-effective search!

Lucenia is now available on AWS Marketplace for streamlined access and deployment. Experience the power of first-to-market Lucene 10 technology for document analysis today!