
Introduction
In Part I of our Log Management Cost Trap series, I explored the challenges of designing, running and managing a centralised log management solution, with a focus on data ingestion. In Part II, I focus on data storage. Part III will discuss search and will be covered in a follow-up post.
In this post, I discuss different storage types and how their characteristics can fulfil some of the requirements of log management solutions. I will also look at how data is organized within these systems, and examine the role of file formats in enabling efficient ingestion, storage, and retrieval.
Storage types
When evaluating storage options, the type of storage medium is one of the first things to consider. Different storage types, such as file systems and blob storage, come with distinct characteristics.
Disks and file systems
File systems typically operate at a lower level of abstraction and often require explicit management of storage capacity, throughput, and IOPS (Input/Output Operations Per Second). Fortunately, managed services like AWS EFS (Elastic File System) and FSx simplify some of this management. For instance, EFS supports automatic scaling of storage and throughput capacity.
One major advantage of using a file system is the ability to append data to existing files. Appending is especially relevant in scenarios like log management, where data is immutable and continuously streamed.
At Bronto, we leverage file systems for data aggregation, and in particular their ability to append to files. Because aggregation is performed over a few hours before the data is transferred to blob storage, it does not require storing extremely large amounts of data, which makes it a cost-effective option. This aggregation phase is also beneficial because it prevents storing many small files on blob storage, which is known to cause performance issues during the search phase.
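As a rough illustration of this pattern, the sketch below appends incoming log batches to a local file and ships the aggregated file to blob storage once the window closes. The file path, bucket and the boto3 upload step are illustrative assumptions, not Bronto's actual implementation.

```python
# Minimal sketch of file-based aggregation: append incoming log batches to a
# local file, then ship the aggregated file to blob storage once the window
# closes. Paths and the upload step are illustrative only.
import boto3

AGGREGATE_PATH = "/mnt/efs/aggregate-2024-01-01T00.log"  # hypothetical path

def append_batch(lines: list[str]) -> None:
    # File systems support cheap appends, so each batch is simply added
    # to the end of the current aggregation file.
    with open(AGGREGATE_PATH, "a", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in lines)

def ship_to_blob_storage(bucket: str, key: str) -> None:
    # Once the aggregation window (e.g. a few hours) closes, upload the
    # large immutable file to blob storage in a single operation.
    s3 = boto3.client("s3")
    s3.upload_file(AGGREGATE_PATH, bucket, key)
```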
Blob storage
Blob storage is a popular choice for data analytics workloads due to its scalability and cost-effectiveness. Unlike file systems, blob storage typically does not support appending to existing objects. Instead, files must be rewritten entirely when modified.
The pricing model for blob storage differs significantly from that of file systems on remote disks: it includes costs for storage as well as per-request API operations (e.g. writing or retrieving data). Overall, blob storage tends to be more cost-efficient than remote disks.
Blob storage can also support extremely high throughput, making services like AWS S3 ideal for data-intensive workloads. S3 allows massively parallel processing of datasets, enabling high-speed data ingestion and computation. For these reasons, it is a very popular storage choice for data analytics services such as AWS EMR and AWS Athena.
However, blob storage has limitations. It is not well-suited for workloads that require frequent data appends or aggregations. In such cases, alternative techniques like compaction are often used (e.g. in Datadog Husky and ClickHouse), where many small objects are written over time and later aggregated into larger ones. This helps address the performance bottlenecks that can arise when processing numerous small files in parallel.
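The sketch below shows the general idea of compaction against S3: many small objects are rewritten as one larger object, since objects cannot be appended to in place. Bucket and key names are hypothetical, and real systems handle streaming, atomicity and cleanup far more carefully.

```python
# A minimal compaction sketch: merge many small S3 objects into one larger
# object, then remove the originals. Names are illustrative only.
import boto3

def compact(bucket: str, small_keys: list[str], compacted_key: str) -> None:
    s3 = boto3.client("s3")
    # Concatenate the contents of the small objects in memory (fine for a
    # sketch; real implementations stream or use multipart uploads).
    merged = b"".join(
        s3.get_object(Bucket=bucket, Key=key)["Body"].read() for key in small_keys
    )
    # Write a single larger object, then delete the small ones.
    s3.put_object(Bucket=bucket, Key=compacted_key, Body=merged)
    for key in small_keys:
        s3.delete_object(Bucket=bucket, Key=key)
```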
Bronto combines different types of storage depending on the part of the system they serve. For instance, blob storage is used for long-term, large immutable files, while file storage is used for data aggregation over short periods of time. This balance takes into consideration both performance and cost at scale.
File formats and data organization
File formats are often designed for fast data retrieval. However, file format alone is not the only factor to consider. How data is physically organized in storage plays a critical role as well. Below, we outline several techniques for formatting and organizing data on storage solutions, which can be utilized by data analytics systems and, in particular, log management tools.
Compression
File compression is a crucial aspect of any log management solution operating at scale. The primary benefit is reduced storage space, leading to lower storage costs. At large volumes, this can translate into substantial savings for the overall system.
However, pursuing maximum compression is not always ideal. Achieving higher compression ratios requires more CPU, memory, and time, leading to higher compute costs.
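The trade-off is easy to observe with a small experiment. The sketch below compresses a synthetic log sample at different levels using zlib from the Python standard library (production pipelines more commonly use codecs such as gzip or zstd); higher levels shrink the data further but take more CPU time.

```python
# Illustrating the compression trade-off: higher levels give better ratios
# but cost more CPU time. The sample data is synthetic.
import time
import zlib

sample = b'{"level":"INFO","service":"api","msg":"request handled"}\n' * 100_000

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(sample, level)
    elapsed = time.perf_counter() - start
    ratio = len(sample) / len(compressed)
    print(f"level={level} ratio={ratio:.1f}x time={elapsed * 1000:.1f}ms")
```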
Row-based vs column-based formats
In row-oriented databases, all rows of a table are stored sequentially on disk. In contrast, column-oriented databases store the values of each column sequentially.
Generally, row-oriented formats are more suitable for unstructured data with write-intensive workloads. However, with the rise of structured logging and log shippers that can annotate data with attributes, columnar formats have become increasingly relevant for storing log data.
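To make the contrast concrete, the sketch below writes the same structured log records in a row-oriented layout (newline-delimited JSON) and in a columnar one (Parquet, via the pyarrow library). The file names and records are illustrative.

```python
# Row-oriented vs column-oriented layouts for structured log records.
import json
import pyarrow as pa
import pyarrow.parquet as pq

records = [
    {"timestamp": "2024-01-01T00:00:00Z", "level": "INFO", "service": "api", "msg": "ok"},
    {"timestamp": "2024-01-01T00:00:01Z", "level": "ERROR", "service": "api", "msg": "boom"},
]

# Row-oriented: each record is written contiguously, which makes appending
# individual events cheap.
with open("logs.ndjson", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Column-oriented: all values of a column are stored together, which helps
# queries that only touch a few attributes (e.g. filtering on "level").
table = pa.Table.from_pylist(records)
pq.write_table(table, "logs.parquet")
```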
Partitioning

Partitioning is a common technique used to optimize data access by dividing large datasets into smaller segments. The goal is to avoid scanning the entire dataset during queries by skipping irrelevant partitions. This is particularly effective when you can define a logical criterion for segmenting the data.
For instance, in a billing system where charges are calculated monthly, it makes sense to partition data by month. If bills are computed at the end of each month, then only data stored under the current partition needs to be processed, rather than all the data ever ingested. This dramatically reduces the volume of data that needs to be read, especially when data is retained over longer periods like months or years.
Reducing the amount of data to be searched via partitioning lowers search cost and improves search performance.
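A minimal sketch of the billing example: events are written under a month partition, so the monthly job only reads the current partition. The directory layout (month=YYYY-MM) is an illustrative convention, not a prescribed one.

```python
# Time-based partitioning: write each event under its month partition so a
# monthly query only scans that partition.
import json
from pathlib import Path

def partition_path(root: str, timestamp: str) -> Path:
    # "2024-01-15T10:30:00Z" -> <root>/month=2024-01/
    month = timestamp[:7]
    return Path(root) / f"month={month}"

def write_event(root: str, event: dict) -> None:
    directory = partition_path(root, event["timestamp"])
    directory.mkdir(parents=True, exist_ok=True)
    with open(directory / "events.ndjson", "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

# A billing job for January 2024 then reads only <root>/month=2024-01/
# rather than the whole dataset.
```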
Indexing
Another powerful technique for improving query performance over large datasets is indexing. Indexes act much like those in a book: instead of reading the whole book to find the occurrences of a word, you refer to the index to find the locations (i.e. the pages, in the case of a book) where the word occurs.
Inverted indexes allow for highly efficient searches, especially when the searched data is uncommon. Inverted indexes can, however, become quite large, in some cases as large as the initial dataset. Even though this extra data can deliver great performance for some types of searches, it may increase storage costs significantly.
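A toy inverted index makes the idea concrete: each token maps to the set of files in which it occurs, so a search only needs to open the files listed under the queried token.

```python
# A toy inverted index: token -> set of file identifiers containing it.
from collections import defaultdict

index: dict[str, set[str]] = defaultdict(set)

def index_file(file_id: str, lines: list[str]) -> None:
    for line in lines:
        for token in line.lower().split():
            index[token].add(file_id)

def candidate_files(token: str) -> set[str]:
    # Only these files need to be opened and scanned for the token.
    return index.get(token.lower(), set())

index_file("logs-001", ["user login failed", "disk nearly full"])
index_file("logs-002", ["request completed"])
print(candidate_files("failed"))  # {'logs-001'}
```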

Predicate Pushdown
One powerful optimization technique used in data analytics is predicate pushdown. Where large datasets are split into many files, predicate pushdown enables filters to be evaluated using metadata or summary information associated with each file, avoiding downloading and inspecting the whole content of certain files. Some file formats, like Parquet, support this by including column statistics such as minimum and maximum values in each data block. By inspecting only a small portion of a file (or a separate metadata file), it is possible to determine whether a given filter condition, or predicate, could possibly match any data in that file. If the predicate is guaranteed to be false for a file, that file can be skipped entirely.
At scale, predicate pushdown can dramatically improve performance and cost, by reducing both data transfer and compute resources when working with large, distributed datasets spread across many files.
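The sketch below shows one way this can look with Parquet column statistics, read via the pyarrow library: row groups whose min/max range cannot contain the queried value are skipped without reading their data pages. The file and column names are illustrative.

```python
# Predicate pushdown using Parquet row-group statistics (min/max values).
import pyarrow.parquet as pq

def matching_row_groups(path: str, column: str, value) -> list[int]:
    pf = pq.ParquetFile(path)
    column_index = pf.schema_arrow.get_field_index(column)
    matches = []
    for i in range(pf.metadata.num_row_groups):
        stats = pf.metadata.row_group(i).column(column_index).statistics
        # Without statistics we must assume the row group may match.
        if stats is None or not stats.has_min_max or stats.min <= value <= stats.max:
            matches.append(i)
    return matches

# Only the surviving row groups are then actually read and filtered.
print(matching_row_groups("logs.parquet", "status_code", 500))
```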
Bloom filters
Another important concept in data analytics is the use of Bloom filters, a technique that, like data partitioning, helps avoid processing irrelevant data.
A Bloom filter is a probabilistic data structure used to test whether an element is possibly present or definitely not present in a dataset. When queried, a Bloom filter will either return a definitive "no" or a "maybe." This allows systems to efficiently eliminate portions of data that do not contain the required values, without the need to scan the entire dataset.
Compared to inverted indexes, Bloom filters are smaller and more lightweight, but they do not provide exact locations of the data. Instead, they are useful for ruling out files that definitely do not contain the desired values, allowing systems to skip those files during query execution.
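A toy Bloom filter illustrates the "no / maybe" behaviour: files whose filter answers "definitely not present" for a queried value can be skipped. The sizes and hash choices below are illustrative, not tuned.

```python
# A toy Bloom filter: answers "definitely not present" or "maybe present".
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, value: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely not present"; True means "maybe present".
        return all(self.bits & (1 << pos) for pos in self._positions(value))

bf = BloomFilter()
bf.add("user-42")
print(bf.might_contain("user-42"))   # True (maybe present)
print(bf.might_contain("user-999"))  # very likely False (definitely absent)
```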
Dictionary encoding
Dictionary encoding is a technique used to optimise storage and search performance when key-value pairs are involved. It is most relevant for keys with low-cardinality values, e.g. country names, colours, etc. A reference to the value is stored in the column rather than the actual value, which reduces the amount of data required to store the column. It can also improve performance at search time: if the data is filtered on values of a given key and none of those values appear in the file's dictionary, then the whole column associated with that key can be skipped for that file.
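A minimal illustration of the idea: low-cardinality column values are replaced by small integer references into a per-file dictionary, and a filter on a value absent from the dictionary lets the file be skipped entirely. The values below are made up for the example.

```python
# Dictionary encoding: store integer references into a small dictionary
# instead of repeating the full values.
values = ["ireland", "france", "ireland", "ireland", "germany", "france"]

dictionary = sorted(set(values))               # ['france', 'germany', 'ireland']
codes = [dictionary.index(v) for v in values]  # [2, 0, 2, 2, 1, 0]

def file_may_match(queried_value: str) -> bool:
    # If the value is not in the file's dictionary, no row in this file can
    # match, so the column (and file) can be skipped at search time.
    return queried_value in dictionary

print(file_may_match("spain"))   # False: skip this file
print(file_may_match("france"))  # True: scan the encoded column
```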
In summary, developing a storage strategy for a large-scale log management system is a complex task that demands deep expertise and a clear understanding of data ingestion and access patterns. At scale, fine-tuning the system to optimize the balance between performance and cost is essential to ensure its effectiveness and affordability for users.
Conclusion
Storage plays a pivotal role in log management solutions. Its performance characteristics impact ingestion and search throughput. Its cost per unit of storage also greatly impacts how cost-efficient the solution is. Bronto combines different types of storage depending on the part of the system they serve. For instance, blob storage is used for long-term, large immutable files, while file storage is used for data aggregation over short periods of time.
Finally, beyond the storage solution itself, the compression techniques and file formats used to store data play a huge role in providing performant and cost-effective search capabilities. For this reason, Bronto borrows concepts from databases and data analytics engines, such as partitioning, Bloom filters, predicate pushdown and dictionary encoding, in order to achieve high search performance. In the third and last part of this series, I will focus on the approaches and economics of search engines for log management systems, and will detail how Bronto leverages these techniques with AWS Lambda in order to provide a fast and cost-effective way to process large amounts of data stored in AWS S3.