The Log Management Cost Trap: Part III - Search

Benoit Gaudin
Head of Infrastructure
July 3, 2025

In Part I (Ingestion) and Part II (Storage) of our Log Management Cost Trap series, I explored the challenges of designing, running and managing a centralised log management solution, with a focus on data ingestion and storage respectively. In Part III, I will focus on search.

Challenges of Searching Log Data

When dealing with log data, the requirements often differ from those of traditional or big data analytics. As discussed in Part II, there are two key requirements to consider: fast access to recent data and analysis over large datasets.

First, we need the ability to search and analyze data that has been produced very recently. This is especially important in troubleshooting scenarios: when a system outage occurs, rapid visibility into what caused the issue is critical. To support this, log data must be searchable almost immediately after it is generated, which imposes limitations on batch processing. Specifically, batch windows must be short. Note that the constraint is on the time span being batched, not the amount of data: batching over short time windows naturally tends to produce smaller batches, and even a batch containing very little data must be made available for search quickly.
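
To make this concrete, here is a minimal sketch of such a batching policy in Python (the class name and thresholds are illustrative, not taken from any particular product): a batch is flushed when it is either large enough or old enough, whichever comes first, so even a nearly empty batch ships once it ages out.

```python
import time

class LogBatcher:
    """Flush a batch when it is either big enough or old enough.

    The age-based trigger is what keeps recent data searchable:
    even a tiny batch is flushed once it reaches max_age_seconds.
    (Thresholds here are illustrative.)
    """

    def __init__(self, flush_fn, max_bytes=8 * 1024 * 1024, max_age_seconds=2.0):
        self.flush_fn = flush_fn          # e.g. writes the batch to object storage
        self.max_bytes = max_bytes
        self.max_age_seconds = max_age_seconds
        self.buffer = []
        self.buffer_bytes = 0
        self.oldest_event_at = None

    def append(self, event: bytes):
        if self.oldest_event_at is None:
            self.oldest_event_at = time.monotonic()
        self.buffer.append(event)
        self.buffer_bytes += len(event)
        self._maybe_flush()

    def _maybe_flush(self):
        too_big = self.buffer_bytes >= self.max_bytes
        too_old = (self.oldest_event_at is not None and
                   time.monotonic() - self.oldest_event_at >= self.max_age_seconds)
        if too_big or too_old:
            self.flush_fn(self.buffer)
            self.buffer, self.buffer_bytes, self.oldest_event_at = [], 0, None
```

A production implementation would also need a background timer to flush an aging batch when no new events arrive to trigger the check.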

The second use case involves analyzing large volumes of data, often over extended periods. A common example is examining web or CDN access logs to identify patterns in API usage, for instance to spot trends such as slowly degrading performance. In these cases, data freshness is not critical; what matters is the ability to efficiently scan and analyze large datasets spanning weeks or months.

These two distinct use cases, one prioritizing real-time access and the other emphasizing large-scale historical analysis, come with competing requirements. Making log data available quickly to ensure freshness often involves processing small batches, which can lead to the creation of many small files. This, in turn, can significantly degrade performance when running queries across long time ranges, a challenge known as the “small file problem”.

To effectively support both real-time troubleshooting and long-term analytical use cases, a log management solution must strike a balance. It needs to ensure that newly ingested data is searchable right away, while also storing that data in a format that supports efficient querying over long time ranges.
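
A common way to strike that balance is background compaction: write small files eagerly so fresh data is searchable immediately, then periodically merge them into larger files that are efficient to scan. The sketch below illustrates the idea over local files; the names and size thresholds are illustrative, and a real system would operate on object storage with atomic commits and concurrency control.

```python
from pathlib import Path

SMALL_FILE_LIMIT = 4 * 1024 * 1024   # files below this are compaction candidates (illustrative)
TARGET_SIZE = 128 * 1024 * 1024      # aim for merged files around this size (illustrative)

def compact(partition_dir: Path) -> None:
    """Merge many small log files in a partition into fewer large ones."""
    small = sorted(p for p in partition_dir.glob("*.log")
                   if p.stat().st_size < SMALL_FILE_LIMIT)
    batch, batch_size, out_index = [], 0, 0
    for path in small:
        batch.append(path)
        batch_size += path.stat().st_size
        if batch_size >= TARGET_SIZE:
            _merge(batch, partition_dir / f"compacted-{out_index}.log")
            batch, batch_size, out_index = [], 0, out_index + 1
    if len(batch) > 1:                # merge whatever remains as well
        _merge(batch, partition_dir / f"compacted-{out_index}.log")

def _merge(paths, out_path: Path) -> None:
    with out_path.open("wb") as out:
        for p in paths:
            out.write(p.read_bytes())
    for p in paths:                   # delete sources only after the merge succeeded
        p.unlink()
```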

Performant and Cost-Effective Search

In Part II of this series, I discussed the importance of formatting and storing data in ways that enable efficient searches. Key techniques include indexing, Bloom filtering, and data partitioning.

Indexing and Bloom filtering are especially valuable when searching for data that appears infrequently within a large time range. For instance, one might be looking for a specific trace ID across several terabytes of log data. As explained in “Why is Bronto so fast at searching logs”, well-designed indexing and Bloom filtering can significantly reduce the volume of data that needs to be scanned. By using these techniques, you can narrow the dataset down to a much smaller subset, one that is more likely to contain the trace ID, making the search faster and more cost efficient.
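
As a rough illustration of the pruning idea (not Bronto's actual implementation), the sketch below attaches a Bloom filter to each stored file and consults it before scanning: files whose filter definitely does not contain the trace ID are skipped entirely, at the cost of occasionally scanning a false-positive file.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, value: str):
        # Derive num_hashes independent bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

def files_to_scan(files_with_filters, trace_id: str):
    """Skip any file whose Bloom filter rules the trace ID out."""
    return [f for f, bloom in files_with_filters if bloom.might_contain(trace_id)]
```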

In some cases, queries are not about finding a “needle in a haystack”; instead, they require analyzing every log entry. This is common in data analytics use cases. For example, when analyzing web access or CDN logs, a user might want to know the maximum response time per endpoint over the past few months, to understand how maximum response times have evolved over time. In this scenario, there is no single rare value to isolate: every log entry contains a response time, so every entry must be examined by the query. In such cases, techniques like indexing, Bloom filtering, or data partitioning offer little to no benefit. Even predicate pushdown does not help here, because there are no filters narrowing the dataset; every entry carries a value the query needs.
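
Expressed as code, such a query touches every record, so no index can prune the scan (the record shape below is illustrative):

```python
from collections import defaultdict

def max_response_time_per_endpoint(log_entries):
    """Every entry carries a response time, so every entry must be read."""
    maxima = defaultdict(float)
    for entry in log_entries:        # full scan: nothing can be skipped
        endpoint = entry["endpoint"]
        maxima[endpoint] = max(maxima[endpoint], entry["response_time_ms"])
    return dict(maxima)

logs = [
    {"endpoint": "/api/orders", "response_time_ms": 120.0},
    {"endpoint": "/api/users",  "response_time_ms": 45.0},
    {"endpoint": "/api/orders", "response_time_ms": 310.0},
]
print(max_response_time_per_endpoint(logs))
# -> {'/api/orders': 310.0, '/api/users': 45.0}
```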

It is also usually not possible to rely on pre-aggregated summaries (e.g. overall maximum values) unless, as in the example above, users anticipated that a per-endpoint breakdown of that metric would be of interest. Of course, if these types of queries were known ahead of time, pre-computation or dashboarding could be leveraged. However, in general-purpose log management systems, it is not feasible to predict all the ways users may want to analyze their data. As a result, these systems must support queries that require scanning every log event.

In such situations, the only viable solution is brute-force compute: leveraging massive parallelism and high-performance processing to deliver results quickly, even when the query demands a full dataset scan. Keeping search latency low therefore requires the ability to spin up large amounts of compute in parallel. With very spiky search patterns, this may mean provisioning enough server capacity to absorb bursts. A tradeoff must be found: enough burst capacity to deliver good search performance, without over-provisioning the underlying infrastructure, so that the cost of search stays within acceptable limits.

To support these demanding, large-scale queries while striking a good balance between performance and cost, Bronto relies on AWS Lambda functions. Lambda enables high concurrency, allowing large volumes of data to be processed in parallel, particularly when accessing data stored in Amazon S3. Since AWS Lambda is a serverless technology, there is no need to provision or manage infrastructure in advance. Instead, compute resources can burst on demand, which makes it a cost-effective solution for processing large datasets over short time windows.
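
The fan-out pattern itself is simple. Below is a simplified sketch using the AWS SDK for Python, assuming a hypothetical worker function named scan-log-chunk that reads one S3 object, applies the query, and returns a partial result; the merge step shown matches the max-per-endpoint example from earlier.

```python
import json
from concurrent.futures import ThreadPoolExecutor

import boto3  # AWS SDK for Python

lambda_client = boto3.client("lambda")

def scan_object(s3_key: str, query: dict) -> dict:
    """Synchronously invoke one worker Lambda to scan a single S3 object."""
    response = lambda_client.invoke(
        FunctionName="scan-log-chunk",   # hypothetical worker function
        InvocationType="RequestResponse",
        Payload=json.dumps({"s3_key": s3_key, "query": query}),
    )
    return json.loads(response["Payload"].read())

def merge(partials):
    """Combine per-object maxima into global maxima."""
    out = {}
    for partial in partials:
        for endpoint, value in partial.items():
            out[endpoint] = max(out.get(endpoint, float("-inf")), value)
    return out

def fan_out_search(s3_keys, query, max_workers=256):
    """Fan out one invocation per object, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(lambda key: scan_object(key, query), s3_keys)
    return merge(partials)
```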

The cost efficiency of this approach comes from Lambda's pricing model: you pay only for the compute time you actually use, and functions that are not running cost nothing. So even when running many functions concurrently, if execution time is short, the overall cost remains low.
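
As a back-of-the-envelope illustration, using AWS's published us-east-1 list prices at the time of writing (verify current pricing for your region):

```python
# Lambda pricing, us-east-1, x86 (list prices at the time of writing).
GB_SECOND_PRICE = 0.0000166667    # USD per GB-second of compute
REQUEST_PRICE = 0.20 / 1_000_000  # USD per invocation

def lambda_query_cost(invocations: int, memory_gb: float, seconds_each: float) -> float:
    compute = invocations * memory_gb * seconds_each * GB_SECOND_PRICE
    requests = invocations * REQUEST_PRICE
    return compute + requests

# 1,000 workers x 2 GB x 3 s each scans a large dataset for roughly ten cents:
print(f"${lambda_query_cost(1_000, 2.0, 3.0):.4f}")  # -> $0.1002
```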

However, this model is ideal primarily for bursty workloads, i.e. those that are unpredictable and intermittent. When the volume of data being searched consistently exceeds a certain threshold, it becomes more efficient to use other compute options, such as Amazon EC2, which are better suited to sustained workloads where resource usage is more predictable.
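
A rough break-even sketch makes the threshold visible. Assuming the same Lambda price as above and an illustrative on-demand instance at $0.17 per hour (actual instance pricing varies by type and region), sustained concurrency does not have to climb very high before the always-on instance wins:

```python
LAMBDA_GB_SECOND = 0.0000166667   # USD per GB-second (as above)
EC2_HOURLY = 0.17                 # illustrative on-demand price for a mid-size instance

def lambda_hourly(concurrency: int, memory_gb: float = 2.0) -> float:
    """Cost of running N Lambda workers of memory_gb continuously for an hour."""
    return concurrency * memory_gb * 3600 * LAMBDA_GB_SECOND

for n in (1, 2, 4):
    print(n, round(lambda_hourly(n), 2))
# 1 worker  -> $0.12/h: cheaper than the instance, if the load is intermittent
# 2 workers -> $0.24/h: already above the instance price when sustained
# 4 workers -> $0.48/h
```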

In summary, brute force can be an effective way to process log data in some situations, but it is rarely an efficient one in terms of resource consumption and cost. Techniques such as indexing, Bloom filtering, and data partitioning help search data more efficiently in many cases. Where these techniques are unsuitable, AWS Lambda is an excellent tool for handling unpredictable, high-intensity bursts of compute, while services like EC2 are better suited to providing a cost-effective compute baseline.

High Cardinality

Log data often contains fields with high-cardinality values; typical examples are client IP addresses and trace IDs. High cardinality is challenging for certain types of queries, e.g. counting the number of unique IP addresses in log data. Querying such data can lead to slow performance, high memory consumption, and a poor user experience when interacting with the results.

One approach to handling high-cardinality data in logging solutions is to set a limit on the number of values the system can handle. This is obviously not very satisfactory, as users are then unable to get value out of their data. A better approach is to compute exact results up to a certain cardinality and approximate results once cardinality grows too high. Probabilistic data structures and their associated algorithms, such as HyperLogLog, Count-Min Sketch, Cuckoo filters, and Top-K, can be leveraged to compute approximate results. Such an approach becomes necessary in order to provide users with insights while keeping resource consumption, and therefore cost, at an acceptable level.
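
To illustrate the exact-then-approximate strategy, here is a simplified hybrid counter that keeps an exact set up to a threshold and falls back to a HyperLogLog estimate beyond it. The class name, threshold, and precision are illustrative, and a production implementation would use faster hashing and additional bias correction, but the structure is the same.

```python
import hashlib
import math

class HybridCounter:
    """Count distinct values exactly up to a limit, approximately beyond it."""

    def __init__(self, exact_limit: int = 10_000, precision: int = 12):
        self.exact_limit = exact_limit
        self.p = precision
        self.m = 1 << precision          # number of HyperLogLog registers
        self.exact = set()               # dropped once it grows past exact_limit
        self.registers = bytearray(self.m)

    def add(self, value: str) -> None:
        h = int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")
        register = h >> (64 - self.p)    # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[register] = max(self.registers[register], rank)
        if self.exact is not None:
            self.exact.add(value)
            if len(self.exact) > self.exact_limit:
                self.exact = None        # cardinality too high: rely on HLL only

    def count(self) -> int:
        if self.exact is not None:
            return len(self.exact)       # still exact
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:         # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)

counter = HybridCounter(exact_limit=1_000)
for i in range(100_000):
    counter.add(f"ip-{i}")               # 100,000 distinct synthetic values
print(counter.count())                   # approximate, typically within a few percent
```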

Conclusion

This concludes our three-part series on The Log Management Cost Trap, in which I examined the many challenges that must be overcome to design a cost-efficient log management solution at scale. I broke those challenges down into three categories: ingestion, storage and search, and emphasised how design decisions made in one category often impact performance and cost in the others, making it difficult to navigate towards an optimal solution. Bronto has 150+ years of combined experience in log management at scale and pours that experience into a platform that provides its users with a cost-efficient, best-in-breed solution, ready to tackle logging in the AI era.

Learn about Bronto and logging best practices