Reducing Cloud Storage for Generative AI: Lucenia’s Approach to Vector Search

Nick Knize

May 16, 2024

As the hype around Generative AI and Retrieval Augmented Generation (RAG) grows, the number of vector dimensions a database product supports is often used as a measuring stick to compare solutions. Consequently, most search vendors prioritize increasing vector dimensions over more critical factors like signal quality or recall performance. This focus drives up costs for end users, as managing and storing high-dimensional data in the cloud becomes increasingly expensive. Vector compression techniques offer a promising solution to this problem, but they are only one piece of the larger puzzle in the quest for efficient cloud data consolidation.

This blog post explores vector compression techniques, tracing their evolution from Lucenia’s geospatial encoding solution for Apache Lucene. While a good first step, vector compression alone cannot significantly reduce cloud storage costs for high-dimensional data. The post concludes by showing how Lucenia’s solution combines vector compression with other strategies, such as dimensionality reduction, to give organizations and users a “better together” approach that achieves greater savings while maintaining the quality of results expected from their search investments.

Vector Dimension Quantization

The concept of dimensional quantization and its application in Apache Lucene traces its roots back to 2015 with the implementation of geo-point quantization. This technique was introduced to tackle the issue of exploding index storage space for 2-dimensional geo data at scale. The original contribution involved transforming decimal-degree latitude and longitude values from double-precision space (64 bits) to sortable integer space (32 bits). This lossy compression technique reduced geospatial index storage volume by 50% while retaining centimeter accuracy in real-world coordinates.
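For intuition, here is a minimal Python sketch of the idea behind geo-point quantization: scaling a decimal-degree coordinate into 32-bit sortable integer space and back. The scaling constants mirror the general approach and are illustrative only; they are not Lucene’s exact encoding routines.

```python
# Illustrative sketch of 64-bit double -> 32-bit sortable integer quantization
# for latitude/longitude (not Lucene's exact implementation).

LAT_SCALE = (1 << 32) / 180.0   # map [-90, 90] onto the signed 32-bit range
LON_SCALE = (1 << 32) / 360.0   # map [-180, 180] onto the signed 32-bit range
INT_MAX = 2**31 - 1

def encode_latitude(lat_deg: float) -> int:
    """Quantize a latitude in decimal degrees to a sortable 32-bit integer."""
    return min(int(lat_deg * LAT_SCALE), INT_MAX)   # clamp so +90.0 still fits

def encode_longitude(lon_deg: float) -> int:
    """Quantize a longitude in decimal degrees to a sortable 32-bit integer."""
    return min(int(lon_deg * LON_SCALE), INT_MAX)   # clamp so +180.0 still fits

def decode_latitude(enc: int) -> float:
    return enc / LAT_SCALE

def decode_longitude(enc: int) -> float:
    return enc / LON_SCALE

lat, lon = 38.8977, -77.0365   # original 64-bit doubles
enc_lat, enc_lon = encode_latitude(lat), encode_longitude(lon)
# Each coordinate now fits in 32 bits (half the storage), and the round-trip
# error is bounded by one quantization step of 180 / 2^32 degrees, roughly
# 5 mm of latitude -- well within centimeter accuracy.
print(enc_lat, enc_lon, decode_latitude(enc_lat), decode_longitude(enc_lon))
```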

Fast forward nine years to today, and the problem has magnified with new vector data sets that can exceed 8,000 dimensions. This explosion in dimensionality calls for advanced techniques like scalar quantization and binary quantization to manage storage costs effectively.

Technique 1: Scalar Quantization

Scalar quantization is a method of compressing high-dimensional vectors by reducing the precision of each component. As described in related blog posts on how Lucene handles it, scalar quantization transforms floating-point values into lower-precision integers. For example, converting 32-bit floating-point numbers to 8-bit integers (INT8) can achieve significant storage savings.

Example: Scalar Quantization in Apache Lucene

In Lucene, scalar quantization involves the following steps:

  1. Normalization: Transform the vector components into a standard range, e.g., [MIN, MAX].
  2. Quantization: Map the normalized values to a smaller set of discrete levels, e.g., 256 levels for 8-bit quantization.
  3. Storage: Store the quantized values as sortable integers, reducing the overall storage requirement.

For geospatial data, the normalization range is well defined: the minimum and maximum longitude and latitude values in decimal degrees ([-180.0, 180.0] and [-90.0, 90.0], respectively). For vector data, however, this range is dynamic; it must be computed on a per-segment basis and then recomputed when segments merge.
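To make these steps concrete, below is a minimal NumPy sketch of per-segment scalar quantization to 8-bit integers. It is a simplified illustration of the general technique, not Lucene’s internal implementation.

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray):
    """Quantize float32 vectors to uint8 using a per-segment min/max range.

    Returns (uint8 codes, min_val, scale) so values can be approximately
    reconstructed, and so the range can be recomputed on segment merge.
    """
    # 1. Normalization: derive the dynamic range from this segment's vectors.
    min_val = float(vectors.min())
    max_val = float(vectors.max())
    scale = (max_val - min_val) / 255.0  # 256 discrete levels for 8 bits

    # 2. Quantization: map each component onto one of the 256 levels.
    levels = np.round((vectors - min_val) / scale)

    # 3. Storage: keep only the 8-bit codes (plus min_val and scale as metadata).
    return levels.astype(np.uint8), min_val, scale

def dequantize(codes: np.ndarray, min_val: float, scale: float) -> np.ndarray:
    """Approximately reconstruct the original float values."""
    return codes.astype(np.float32) * scale + min_val

rng = np.random.default_rng(0)
segment = rng.standard_normal((1000, 1024)).astype(np.float32)
codes, mn, sc = scalar_quantize(segment)
print(codes.nbytes / segment.nbytes)   # 0.25: 4 bytes per component down to 1
print(np.abs(dequantize(codes, mn, sc) - segment).max())  # small reconstruction error
```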

This method has been shown to provide storage savings of up to 86% on moderate dimension sizes (e.g., 1024), and has been demonstrated across various implementations in many widely marketed vector databases.

Technique 2: Binary Quantization

Binary quantization, on the other hand, represents vectors using binary codes. This technique leverages the a priori assumption that high-dimensional vectors often contain redundant information, which can be effectively captured using binary encoding schemes.

Example: Binary Quantization in DataStax

Consider a vector space where each vector component is either highly correlated or contains a lot of noise. Binary quantization can:

  1. Cluster: Group similar vectors together and represent them with a binary code.
  2. Encode: Assign a binary code to each cluster representative.
  3. Compress: Store the binary codes instead of the original high-dimensional vectors.

This approach can lead to substantial storage savings while maintaining a high level of accuracy in vector operations.
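As a toy illustration of the cluster-and-encode idea outlined above, the following NumPy sketch groups vectors with a bare-bones k-means and stores each vector as the compact binary code of its nearest centroid. It is a generic sketch of the technique, not any particular vendor’s implementation.

```python
import numpy as np

def binary_quantize(vectors: np.ndarray, num_clusters: int = 256, iters: int = 10):
    """Cluster vectors, then replace each one with the binary code of its centroid.

    With 256 clusters each vector compresses to a single 8-bit code; the
    centroid table (the codebook) is the only full-precision data retained.
    """
    rng = np.random.default_rng(0)
    # 1. Cluster: a minimal k-means over the vector set.
    centroids = vectors[rng.choice(len(vectors), num_clusters, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for k in range(num_clusters):
            members = vectors[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)

    # 2. Encode: assign each vector the binary code (index) of its nearest centroid.
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    codes = dists.argmin(axis=1).astype(np.uint8)

    # 3. Compress: store the codes plus the shared codebook instead of the vectors.
    return codes, centroids

rng = np.random.default_rng(1)
vectors = rng.standard_normal((1000, 64)).astype(np.float32)
codes, codebook = binary_quantize(vectors)
# 64 float32 components (256 bytes) per vector shrink to a single byte,
# plus the amortized cost of the shared codebook.
print(codes.nbytes, codebook.nbytes)
```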

The Nature of Information Redundancy in High-Dimensional Space

High-dimensional vectors, such as those generated by large language models (LLMs), often contain a significant amount of redundant information. This redundancy arises because many dimensions capture similar or correlated information. For instance, in LLM embeddings, different dimensions might represent similar semantic features.

Real-World Example: LLM Vector Embeddings

In the context of LLMs, embeddings might have thousands of dimensions, many of which do not contribute uniquely to the representation. This redundancy can lead to inefficiencies and even problems such as hallucinations in AI models due to overfitting.

Problems with Redundant High-Dimensional Vectors

Redundant information in high-dimensional vectors can cause several issues:

  • Storage Inefficiency: Unnecessary storage costs due to redundant data.
  • Overfitting: Models trained on redundant data might perform well on training data but poorly on unseen data, leading to hallucinations and incorrect inferences.
  • Computational Overhead: Increased computational resources required for processing and querying high-dimensional data.

Lucenia’s Two-Part Solution: Quantization and Dimensionality Reduction

To address these challenges, Lucenia’s vector search capability offers a two-part solution that combines the compression benefits of quantization with the information redundancy removal provided by dimensionality reduction.

Part 1: Quantization

As highlighted earlier, quantization techniques like scalar and binary quantization can significantly reduce the cloud storage demands of high-dimensional vectors. These methods compress the data by reducing precision and leveraging redundancy, and can cut storage demands by upwards of 86%. Lucenia’s vector search offering leverages Apache Lucene’s implementation to provide the compression benefits of scalar quantization on a per-dimension basis.

Part 2: Dimensionality Reduction

Dimensionality reduction techniques, such as Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), further enhance storage efficiency by reducing the number of dimensions while retaining the most informative ones. Lucenia’s vector search offering provides efficient dimensionality reduction by combining the mathematical techniques of PCA with programming languages that leverage modern CPU parallel processing advancements through Single Instruction Multiple Data (SIMD) techniques.
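For readers unfamiliar with PCA, the NumPy sketch below shows the basic idea: project vectors onto their top-k principal components so that most of the variance, and therefore most of the information, survives in far fewer dimensions. This is a generic, unoptimized illustration; Lucenia’s SIMD-accelerated implementation is not shown here.

```python
import numpy as np

def pca_reduce(vectors: np.ndarray, k: int):
    """Project vectors onto their top-k principal components.

    Returns the reduced vectors plus the mean and projection matrix, so that
    incoming query vectors can be projected into the same reduced space.
    """
    mean = vectors.mean(axis=0)
    centered = vectors - mean
    # SVD of the centered data; rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k].T                                    # (num_dims, k)
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())   # variance retained
    return centered @ components, components, mean, explained

rng = np.random.default_rng(0)
# Synthetic "embeddings" with heavy redundancy: 1024 dims driven by 64 latent factors.
latent = rng.standard_normal((5000, 64))
mixing = rng.standard_normal((64, 1024))
vectors = (latent @ mixing).astype(np.float32)

reduced, components, mean, explained = pca_reduce(vectors, k=64)
print(reduced.shape)        # (5000, 64): a 16x reduction in stored components
print(round(explained, 3))  # ~1.0: nearly all of the variance is retained
```

At query time, the same mean and projection matrix are applied to incoming query vectors ((q - mean) @ components) so similarity is computed in the reduced space, and the reduced vectors can then be scalar quantized for additional savings.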

Results: Storage Benchmarks

The following benchmarks compare the quantization-only approach to vector storage reduction provided by current vector databases (e.g., Pinecone, DataStax, Elastic) with the two-step quantization and dimensionality reduction technique provided by Lucenia’s vector search offering.

Graph 1: Cloud Storage Savings using only Scalar Quantization (e.g., Elastic)

Graph 1 illustrates the cloud storage savings achieved by applying scalar quantization alone to 500 thousand vectors with 1024 dimensions. Expected savings are reasonable, coming in at around 86%.

Graph 2: Cloud Storage Savings with Lucenia’s Vector Search

Graph 2 illustrates cloud storage savings using Lucenia’s vector search capabilities. This provides the combined effect of quantization and dimensionality reduction, achieving up to 99.3% storage savings while retaining 90% of the information quality.

Conclusion

While dimensional quantization is an excellent starting point for reducing cloud storage costs, it is not a standalone solution. The redundancy in high-dimensional vectors presents an opportunity for even greater savings through dimensionality reduction. Lucenia’s comprehensive vector search approach, which combines the compression benefits of quantization with the redundancy-reduction benefits of dimensionality reduction, not only slashes storage costs but also maintains the quality of information, addressing issues such as overfitting and the inefficiencies that lead to hallucinations in vector recall. For more details on how to use Lucenia’s capabilities and how they can benefit your cloud storage needs, refer to our earlier blog post on Generative AI and the Curse of Dimensionality: Lucenia’s Vector Compression. By embracing these advanced techniques, organizations can manage their data more efficiently, ensuring scalability and cost-effectiveness in the ever-growing landscape of high-dimensional vector data.