Generative AI and the Curse of Dimensionality: Lucenia’s Vector Compression

Nick Knize

In AI and Machine Learning applications, the “dimensionality” of a data set refers to the number of features, often called input variables, that represent some real-world phenomenon (e.g., physical objects, conceptual meanings). These features are typically the columns in a data table: the more columns, the more features available to describe the phenomenon. The “samples”, or rows in the table, represent the specific objects or phenomena in the real world. These samples typically serve as a “training” set used to produce a new data set that captures the contextual meaning and/or relationships between the objects in the original training data. The resulting data set is referred to as a set of vector embeddings. These vector embeddings serve as numerical representations in a multi-dimensional space and are pivotal in capturing the semantic essence of textual information and real-world relationships. Each vector, typically high-dimensional, encapsulates the features that enable a Generative AI processor to comprehend and process language intelligently. In modern Generative AI applications (e.g., OpenAI), vector embeddings can range from 256 to 3072 dimensions, and the dimensionality often grows with each new model version.

In real-world data science applications, statistical models generally assume there are more samples than features. In many real-world scenarios this is not the case, and violating this assumption often leads to incorrect and/or misleading results. Consider, for example, a dataset containing information about customers on an e-commerce platform. If the dataset has only a few dimensions (e.g., age, income, and purchase history), it may be relatively easy to identify patterns and build a predictive model. Now imagine expanding the dataset to include a large number of additional dimensions, such as the number of times a customer visited the website, the time spent on each page, and various other behavioral features. As the dimensionality increases, the amount of data required to effectively model the relationships between these features grows exponentially and, in turn, so do the storage needs and associated costs. This scenario is referred to as the “curse of dimensionality”: as the number of features or dimensions in a dataset increases, the volume of the feature space grows exponentially. This can lead to problems such as sparsity of data points, increased computational complexity, and difficulty in identifying meaningful patterns.
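As a rough, back-of-the-envelope sketch of that exponential growth (an illustrative calculation, not a property of any particular dataset): if each of d features is discretized into k value ranges, the feature space contains k^d cells, so keeping the same average coverage per cell requires a sample count that also grows exponentially with d.

\[
N_{\text{cells}} = k^{d}, \qquad N_{\text{samples}} \approx m \cdot k^{d} \quad \text{for an average of } m \text{ samples per cell.}
\]

For example, with k = 10 ranges per feature, d = 3 features produce 10^3 = 1,000 cells, while d = 30 features produce 10^30 cells, far more than any realistic number of samples can cover.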

In today’s blog post, we discuss and demonstrate one (of several) techniques Lucenia uses to minimize your vector storage demand while retaining the quality of information necessary to gain meaningful and reliable results from vector embeddings.

Reducing Storage with Lucenia’s Principal Component Analysis (PCA) Analytic

Addressing the curse of dimensionality described above often involves techniques such as feature selection, dimensionality reduction, or collecting more data to improve the density of information in the feature space. At Lucenia, we recognize the cost and complexity of changing feature selection and retraining models, as well as the drawback of increasing the density of information (which adds data, the opposite of what we want when reducing storage demand). Instead, we focus on dimensionality reduction techniques as a means of reducing redundancy in the data set. Dimensionality reduction techniques are commonly used to shrink the dimensional space while retaining as much of the needed information as possible. Reducing the dimensional space not only reduces the financial cost and space demand of storing the embeddings, it also improves search performance during analysis and recall.

One technique in particular used in Lucenia is Principal Component Analysis (PCA). PCA holds a prominent position in numerous applications, including facial recognition, image compression, financial equity analysis, and human neurological stimulus analysis, and serves as a foundational component for Generative AI applications. It identifies principal components (eigenvectors) that capture the most variance in the data. Determining the number of dimensions to retain is crucial to PCA, and one approach we offer through Lucenia’s API is to choose a threshold for the cumulative explained variance of the data, such as 0.95 (95%), the cumulative proportion of the total variance explained by the retained components. In layman’s terms, this is the smallest number of components needed to accurately represent the phenomenon in the data.
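For reference, here is a sketch of the general selection rule behind that threshold (the standard PCA formulation, not a description of Lucenia’s internal implementation): if the covariance matrix of the embeddings has eigenvalues \(\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d\), the analytic keeps the smallest number of components k whose cumulative explained variance reaches the threshold.

\[
k = \min\left\{ m : \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \ge 0.95 \right\}
\]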

Let’s explore how this is done.  

Creating the Vector Index and Indexing Embeddings in Lucenia

Before indexing the vector embeddings, it’s crucial to define the index and establish the field mappings. The mapping below defines a field named “vector_embedding” with a type of “vector” and a dimensionality of 1536.
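A minimal sketch of such a create-index request, assuming a “vector” field type with a “dimension” parameter (the exact parameter names may differ in your Lucenia version):


curl -s -H "Content-Type: application/json" -XPUT \
'http://host:9200/embeddings' --data-binary '{
  "mappings": {
    "properties": {
      "vector_embedding": {
        "type": "vector",
        "dimension": 1536
      }
    }
  }
}'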

Once the index is created, the embeddings can be ingested using the `_bulk` API. This is done with the following command:


curl -s -H "Content-Type: application/json" -XPOST \
'http://host:9200/_bulk' --data-binary '@/path/to/embeddings.vec'

The content of the embeddings file should use the following format:


{ "index" : { "_index" : "embeddings", "_id" : "1" } } 
{ "vector_embedding" : [-0.033611562, -0.012532363, -0.005336334, -0.247194294.....  
{ "index" : { "_index" : "embeddings", "_id" : "2" } } 
{ "vector_embedding" :  [-0.02424533, -0.004234323, -0.016432434, -0.454924092...  ...

Example – Reducing Dimensionality of Vector Embeddings

To reduce dimensionality, run the following analytic:


{
  "aggs": {
    "dimension_reduction": {
      "pca": {
        "field": "vector_embedding",
        "quality": "90%"
      }
    }
  }
}
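
To execute the analytic, the aggregation body above can be submitted to the index through the search API. A minimal sketch, assuming the standard search endpoint and a "size" of 0 so that no documents are returned alongside the aggregation result:


curl -s -H "Content-Type: application/json" -XPOST \
'http://host:9200/embeddings/_search' --data-binary '{
  "size": 0,
  "aggs": {
    "dimension_reduction": {
      "pca": {
        "field": "vector_embedding",
        "quality": "90%"
      }
    }
  }
}'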

In this example, the “pca” analytic specifies “vector_embedding” as the “field” and sets the “quality” threshold to retain 90% of the information. The result indicates which dimensions of the original embedding to keep:


"dimension_reduction" : {
       "dimensions" : [0 : 15, 19, 21, 92 : 152, 172, 195, 210 : 225, 227, 234, 238, 240, 242, ...]
}

These dimensions retain 90% of the information in the embeddings, and the results can subsequently be used by a resampling processor to create a new index with reduced dimensionality. Continue to follow Lucenia for the latest developments as we advance our vector processing techniques, with a focus on optimizing infrastructure while ensuring unparalleled results.

Lucenia and associated marks are trademarks, logos or registered trademarks of Lucenia, Inc. All other company and product names are trademarks, logos, or registered trademarks of their respective owners.