Maximize Cloud Efficiency: Consolidating the Modern Data Stack

Wes Richardet

As cloud storage costs have decreased, copying and storing data in multiple locations has become increasingly common. This reduction in storage cost has fueled a proliferation of databases and data warehouses. However, we are now undermining those savings by adding redundant storage to handle more complex workloads, and the cost of maintaining these databases has also increased, driving a consolidation of the modern data stack. Cloud architecture has shifted from large, centralized databases to distributed systems, with many businesses transitioning from shared databases to streaming systems. These systems move data to local disks, enhancing fault tolerance in distributed systems. Tools like Apache Kafka have made it easier to move data between systems and distribute it across multiple read-optimized views. Streaming architectures, made possible by inexpensive storage, have enhanced the adaptability and scalability of the modern data stack.
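
As a small illustration of this pattern, the sketch below appends an event to a durable Kafka log that downstream systems can consume to build their own read-optimized views. It assumes the Apache Kafka Java client is on the classpath; the broker address, topic name, and payload are placeholders for illustration only.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each event is appended to a durable log; downstream consumers can replay it
            // to build key-value stores, columnar files, or search indexes as needed.
            producer.send(new ProducerRecord<>("orders", "order-42", "{\"status\":\"CREATED\"}"));
        }
    }
}
```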

It is now common practice to use an Online Transaction Processing (OLTP) database like Postgres for transactional data and transfer that data to a read-optimized view in a high-performance key-value store like RocksDB or a columnar file format such as Parquet for analytics. Building event-driven systems on a durable log enables users to replay or backfill a new data store better suited to emerging architectural needs. When a Postgres database fails to scale, organizations have several options: migrate to a database that speaks the Postgres wire protocol, such as CockroachDB, or use a tool like Debezium for Change Data Capture (CDC) into a search cluster like Lucenia or an Online Analytical Processing (OLAP) datastore like BigQuery.
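
A hedged sketch of that CDC path might look like the following: a consumer reads Debezium change events from Kafka and would forward them to a downstream search index. The topic name follows Debezium's typical <server>.<schema>.<table> convention, and the broker address, consumer group, and indexing step are placeholders rather than a specific deployment.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CdcToSearchIndexer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "cdc-search-indexer");        // placeholder consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Debezium publishes one topic per captured table, typically <server>.<schema>.<table>.
            consumer.subscribe(List.of("dbserver1.public.orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each value is a JSON change event (before/after images plus metadata);
                    // here it would be transformed and indexed into the downstream search cluster.
                    System.out.printf("key=%s change=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```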

As some businesses realize that the cost of storage is becoming unmanageable, there is a growing need for consolidation, both in the amount of software needed and in the amount of data copied across the architecture. In this blog post, we dive into various cloud architectures and discuss Lucenia’s innovative approach to cloud consolidation, which helps budget-conscious enterprises trim excess services and eliminate data duplication in their cloud search deployments.

Cloud Architectural Patterns

Initial software rearchitecting efforts aimed to transition on-premises hardware to cloud-based infrastructure. The approach involved attaching disks to each virtual machine and replicating data to a secondary location to support additional use cases. However, this practice doubled the data footprint early on and caused data consistency issues within the software. Consequently, alternative systems were explored and adopted to address the challenge of data movement.

Command Query Responsibility Segregation (CQRS) is a proven and widely adopted approach to keeping queries on data fast. The core idea behind CQRS is to separate data mutations from data queries to ensure optimal performance. An apt analogy: the number of people watching a football game (readers) should not impede the playing of the game (writers). In some instances, businesses may need to prioritize transactional workloads on the input side of their systems while consolidating analytical workloads on the output side. These shifts gave rise to big data processing systems like Apache Storm (Lambda architecture) and Apache Samza (Kappa architecture).
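
A minimal, in-memory sketch of the CQRS idea is shown below; the event type, handler names, and projection are illustrative rather than prescriptive, and in a production system the projection would typically run in a stream processor or be backed by a durable log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal CQRS sketch: commands append to a log stand-in,
// while queries are served from a separate read-optimized view.
public class CqrsSketch {
    record OrderPlaced(String orderId, double amount) {}

    // Write side: accepts commands and appends events; it never serves queries.
    static class CommandHandler {
        private final List<OrderPlaced> eventLog;
        CommandHandler(List<OrderPlaced> eventLog) { this.eventLog = eventLog; }
        void placeOrder(String orderId, double amount) {
            eventLog.add(new OrderPlaced(orderId, amount));
        }
    }

    // Read side: projects events into a view optimized for lookups.
    static class OrderTotalsView {
        private final Map<String, Double> totals = new HashMap<>();
        void apply(OrderPlaced event) {
            totals.merge(event.orderId(), event.amount(), Double::sum);
        }
        double totalFor(String orderId) { return totals.getOrDefault(orderId, 0.0); }
    }

    public static void main(String[] args) {
        List<OrderPlaced> log = new ArrayList<>();
        CommandHandler commands = new CommandHandler(log);
        OrderTotalsView queries = new OrderTotalsView();

        commands.placeOrder("order-42", 19.99);   // writer path
        log.forEach(queries::apply);              // projection step
        System.out.println(queries.totalFor("order-42")); // reader path
    }
}
```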

For analytical tasks, specialized solutions for temporal, geospatial, or search data can deliver better results. While some vendors may pitch a one-size-fits-all platform for all data, companies often prefer to piece together their own data stack. Many large enterprises have started building storage solutions or using “data lake” platforms like Snowflake, Databricks, or Presto. However, the storage cost can become a significant issue, leading many companies to look for ways to consolidate their data stack.

Cloud Consolidation

The data landscape is evolving towards standardized open data formats and services that facilitate seamless integration across applications and use cases. The trend emphasizes the adoption of standard on-disk representations in data processing tools and systems. Consequently, several open-source projects have expanded their scope to support this trend. For instance, the DataFusion plugin for Spark has enabled Spark to read and transform Parquet data in its workflows. Similarly, Confluent’s Tableflow simplifies materializing a Kafka topic as an Apache Iceberg table backed by Parquet. Discussions within the Apache Kafka community have also explored using Parquet as its segment storage format. RocksDB, meanwhile, serves as a widely adopted storage engine for local state storage in Kafka Streams and Apache Flink.
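
Because these on-disk formats are open, any tool with a compatible reader can consume the same files. The sketch below, assuming the parquet-avro and Hadoop client libraries are available and a local file named events.parquet exists (both are placeholders), iterates over a Parquet file as Avro records.

```java
import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ParquetScan {
    public static void main(String[] args) throws IOException {
        // "events.parquet" is a placeholder path; any Parquet file works.
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path("events.parquet")).build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                // Each record is decoded from Parquet's columnar layout into an Avro view.
                System.out.println(record);
            }
        }
    }
}
```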

The industry is seeing a discernible shift towards established formats like Parquet, Avro, and Iceberg, akin to the broad uptake of products like S3 for object storage and Postgres for databases. These trends point to a data environment that is steadily becoming more standardized, optimized, and capable of managing large datasets with confidence.

Many companies are building on these protocols and backing them with cost-effective, scalable storage. We anticipate this will catalyze innovation around the stateless S3 HTTP REST API and improve the efficiency of reading and writing data in object storage.
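
For reference, a round trip against the S3 API with the AWS SDK for Java v2 looks roughly like this; the bucket name, object key, and payload are placeholders, and credentials are assumed to come from the default provider chain.

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ObjectStorageRoundTrip {
    public static void main(String[] args) {
        // Region and credentials are resolved from the default provider chain.
        try (S3Client s3 = S3Client.create()) {
            String bucket = "example-bucket";          // placeholder bucket name
            String key = "analytics/events.parquet";   // placeholder object key

            // Write: a single stateless HTTP PUT.
            s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                         RequestBody.fromBytes(new byte[] {1, 2, 3}));

            // Read: a single stateless HTTP GET.
            byte[] data = s3.getObjectAsBytes(
                GetObjectRequest.builder().bucket(bucket).key(key).build()).asByteArray();
            System.out.println("read " + data.length + " bytes");
        }
    }
}
```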

Leave Your Data Where It Is with Lucenia

At Lucenia, we are committed to revolutionizing data evolution and supporting software in hybrid architectures. The default operating procedure should shift from copying data to enable every use case to reading data where it is and evolving it only for proven, needed use cases. There should be more options than flat files and pipelines that begin with data ingestion. RocksDB, Lucene segments, and Parquet are a few technologies that fit different use cases and allow data to be searched without copying it into a proprietary or custom format. Lucenia enables organizations to search data where it lives, in a variety of formats, through a common search API.

We aim to let a search client effortlessly find data in standard formats residing in object storage and evolve it into a searchable index. Our first priority is supporting Parquet as a queryable format, with Avro and Iceberg to follow. We are determined to make your data queryable where it currently resides and then, optionally, move it to a more optimized format if required. Gone are the days of moving data from one system to another just to query it. At Lucenia, we are confident that we can help you evolve your data and make it easy to query, all while keeping it secure and accessible.
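
To make the goal concrete, here is a hypothetical sketch of what “query it where it lives” could look like from a client’s point of view, assuming a search-style REST endpoint over an index backed by Parquet files in object storage. The endpoint, index name, and query shape are illustrative assumptions, not the documented Lucenia API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchInPlaceClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and index name; the actual API may differ.
        String endpoint = "http://localhost:9200/parquet-events/_search";
        String query = """
            {"query": {"match": {"status": "CREATED"}}}
            """;

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(endpoint))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(query))
            .build();

        // The response would contain hits resolved from data that still lives in object storage.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```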