In this post, I’m going to explore the challenges of designing, running, and managing a centralised log management solution. Centralised logging systems collect log data from sources ranging from CDNs to applications running in highly dynamic environments such as Kubernetes. For systems with low log data volumes, self-hosting open-source solutions or utilising SaaS free plans are often excellent starting points. However, as data volume inevitably grows, the complexity and cost of these solutions often make them no longer viable.
This post is relevant if your logging costs have risen to a level where you are hesitant to send more data to your logging solution, or are excluding certain sources because of the high costs they would incur. At this point, you might be considering investing resources in reducing your logging costs with existing solutions, typically by reducing the amount of data these solutions handle: shortening data retention, archiving data, and so on. Alternatively, you might be considering developing your own logging system for better cost control.
For centralised log management systems, the sheer volume of data and its unstructured nature are typically the biggest factors influencing cost and complexity. I break down these challenges into three key areas: data ingestion, data storage, and data search.
These challenges are closely related, and design decisions taken in one area impact design decisions in the others. Trade-offs and innovation are required to design a system that is both performant and cost-effective. This post focuses on data ingestion, while storage and search will be tackled in follow-up posts.
Ingestion is the part of the system that receives data and processes it to make it searchable. Due to the large volume of data that log management solutions need to handle, there are many similarities between these systems and data analytics engines such as Hadoop or Spark. However, log management solutions have some specific requirements that distinguish them from general data analytics tools. One of the key differences is the need for data to be searchable in real time, or with minimal delay. This freshness requirement is typically much stronger in log management than in traditional data analytics platforms.
That is because log management systems support a variety of critical use cases, one of which is troubleshooting. In the event of a production issue that requires urgent investigation, users need to access the most recent log data as quickly as possible. They cannot afford to wait for data to be batched. In these situations, logs from the last few minutes must be searchable immediately.
On the other hand, there are other use cases that do not require such immediate freshness. For instance, a company might want to analyse which versions of web browsers have been accessing its system in order to focus efforts on supporting the most relevant versions while deprecating support for the older ones. This type of analysis requires older data, which must be available for search rather than archived. Logs from the past few months are typically relevant to answer these questions, and as a result, freshness is less important.
Because log management systems focus on timestamped events and need to support both real-time troubleshooting as well as analytical queries over large datasets, they cannot rely solely on off-the-shelf data analytics platforms. The data ingested must be searchable quickly, which imposes specific requirements on the ingestion pipeline.
Additionally, log management solutions must prioritise reliability while also ensuring logs are available for search quickly.
Upon receiving data, the system typically acknowledges its reception and ensures that it is securely handled. Mechanisms such as data buffering must be in place to gracefully handle temporary issues, for example when downstream processing slows down or fails. Apache Kafka is an effective and commonly used solution for data buffering at scale, and is often integrated into log management solutions (e.g. ELK, Datadog, Honeycomb). However, efficiently managing a Kafka cluster comes with its own challenges and requires some expertise. Even with cloud offerings such as AWS MSK, the management overhead can be substantial and costly when dealing with large amounts of data.
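To make the buffering idea concrete, here is a minimal sketch of how an ingestion endpoint might hand incoming log events to Kafka before any processing, using the kafka-python client. The broker address, topic name, and event shape are illustrative assumptions, not a description of any particular vendor's pipeline.

```python
# Minimal sketch: buffering incoming log events into Kafka before processing.
# Broker address and topic name ("log-events") are illustrative assumptions.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",       # only treat an event as accepted once the broker has persisted it
    retries=5,        # retry transient broker errors instead of dropping data
    linger_ms=20,     # small batching window: a little latency for better throughput
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def ingest(event: dict) -> None:
    """Accept a log event and hand it to the buffer; downstream consumers
    index or partition it asynchronously."""
    producer.send("log-events", value=event)

ingest({"ts": "2024-06-01T12:00:00Z", "service": "checkout", "message": "payment accepted"})
producer.flush()  # block until buffered events are acknowledged
```

With this layout, a slow or temporarily unavailable processing stage only delays consumption from the topic; the ingestion endpoint can keep acknowledging producers.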
When ingesting log data, it is important to decide how the data will be organised in the backend, as this will directly affect how the data is searched. Different backend options are available, and the choice you make will influence the level of configuration or work required.
One popular option for managing data is to use a backend based on indexing, such as Elasticsearch or OpenSearch. This approach offers good search performance, as indexes point directly to the relevant data. However, it typically requires extracting key-value pairs from logs, e.g. using Logstash within the ELK stack, in order to build an index, and the size of the index itself may be significant.
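As a rough illustration of that extraction step, the sketch below pulls key-value pairs out of a raw log line with a regular expression and indexes the resulting document into Elasticsearch. The index name, field names, and log format are assumptions made for the example; a production pipeline would typically use Logstash or an ingest pipeline rather than ad-hoc code like this.

```python
# Illustrative sketch: extract key-value pairs from a raw log line and index them
# so the backend can answer field-level queries. Index and field names are made up.
import re
from elasticsearch import Elasticsearch  # pip install elasticsearch (8.x client assumed)

es = Elasticsearch("http://localhost:9200")

raw = '2024-06-01T12:00:00Z level=error service=checkout msg="payment declined"'

# Keep the leading timestamp, then pull out key=value pairs (quoted or bare values).
doc = {"@timestamp": raw.split(" ", 1)[0]}
doc.update({k: v.strip('"') for k, v in re.findall(r'(\w+)=("[^"]*"|\S+)', raw)})

# Daily indexes keep individual index sizes manageable and make retention easier to enforce.
es.index(index="logs-2024.06.01", document=doc)
```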
Another approach is to partition the data. In this case, no index is involved. Instead, the data is organised in such a way that large portions of it can be skipped at query time.
Log management solutions generally partition data by time ranges, as log data consists of timestamped events. These solutions often require specifying a time range within which the data should be searched, making it critical for the system to search only within that specified range to optimise both performance and cost.
Other solutions take partitioning one step further and leverage attributes other than time for this purpose. Examples of such systems are Grafana Loki and AWS Athena. In the case of Athena, data is stored on AWS S3. To avoid searching the entire dataset, the data is organised into separate prefixes on S3, with each prefix representing a different partition.
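The sketch below illustrates the idea with boto3: log batches are written under time-based S3 prefixes, and a query for a given time range only needs to touch the prefixes that overlap that range. The bucket name and key layout are assumptions for illustration, not how Athena or Loki actually lay out their data.

```python
# Illustrative sketch of time-based partitioning on S3: each hourly batch lands under
# its own prefix, so a time-bounded query can skip everything outside the range.
# Bucket name and key layout are made up for the example.
from datetime import datetime, timedelta, timezone
import boto3  # pip install boto3

s3 = boto3.client("s3")
BUCKET = "example-log-archive"

def write_batch(batch: bytes, ts: datetime) -> None:
    # Partition key encoded in the object key, e.g. dt=2024-06-01/hour=13/...
    key = f"dt={ts:%Y-%m-%d}/hour={ts:%H}/{ts.timestamp():.0f}.log"
    s3.put_object(Bucket=BUCKET, Key=key, Body=batch)

def prefixes_for_range(start: datetime, end: datetime):
    """Only these prefixes need to be listed and scanned to answer the query."""
    current = start.replace(minute=0, second=0, microsecond=0)
    while current <= end:
        yield f"dt={current:%Y-%m-%d}/hour={current:%H}/"
        current += timedelta(hours=1)

now = datetime.now(timezone.utc)
write_batch(b'{"msg": "example"}\n', now)
print(list(prefixes_for_range(now - timedelta(hours=2), now)))
```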
Relying on indexing or partitioning alone can lead to increased ingestion or search costs: building indexes is expensive, and partitioning may not narrow down the data set to be searched efficiently. Datadog's Husky takes a hybrid approach, and at Bronto we believe this is the best pattern to adopt, as it provides various levers to optimise for performance and cost.
As mentioned already, a log management solution must meet two key requirements. First, fresh data must be searchable. Second, large volumes of data must be searchable within a reasonable time frame. These requirements have implications for how data is ingested.
Since fresh data needs to be searchable, it must be made available to the search engine quickly, ideally within seconds. This means the data should be handed to the search engine in small increments. However, this approach can lead to an issue known as the "small files problem" in analytics workloads. Such workloads often involve large datasets processed with large amounts of compute power: many computational units run in parallel, and the data is accessed over a network. When the data is split into many small files, performance can degrade significantly, as each of those files has to be fetched over the network. To address this, compaction and append-only approaches can be used.
Compaction is a technique used by solutions such as Datadog's Husky and ClickHouse: data is first stored in small units and then consolidated into larger ones over time. Since small objects or files only hold recent data, this approach remains suitable when searching larger datasets.
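A minimal sketch of the idea, assuming recent data lands as many small local files (the paths and threshold are illustrative): a background job periodically merges them into one larger file so that later searches touch fewer, bigger objects.

```python
# Minimal compaction sketch: merge many small recent segments into one larger file
# so that queries over older data read fewer, bigger objects. Paths are illustrative.
from pathlib import Path
import time

SMALL_DIR = Path("segments/small")
COMPACTED_DIR = Path("segments/compacted")

def compact(min_files: int = 16) -> None:
    small_files = sorted(SMALL_DIR.glob("*.log"))
    if len(small_files) < min_files:
        return  # not enough small segments yet to be worth merging

    COMPACTED_DIR.mkdir(parents=True, exist_ok=True)
    target = COMPACTED_DIR / f"{int(time.time())}.log"
    with target.open("wb") as out:
        for f in small_files:
            out.write(f.read_bytes())

    # Only remove the small segments once the merged file has been written.
    for f in small_files:
        f.unlink()
```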
Append-only involves incrementally adding chunks of data to a larger unit of data. This is easily achieved when storing data in a file system, where data can be appended to files. However, it becomes more problematic with object stores such as AWS S3, where appending data is not possible. Instead, the entire object needs to be rewritten, impacting performance and ingestion cost. Despite this challenge, object stores are cost-efficient as a long-term storage solution and also better suited for highly parallel search access.
At Bronto, we implemented a two-tier storage solution where data is first appended to files, making it available to the search engine, and then uploaded to an object store once it reaches a suitable size. This approach avoids having to perform any compaction while still making fresh data available for search.
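A simplified sketch of that pattern follows; the roll-over threshold, file paths, and bucket name are illustrative assumptions rather than our actual implementation. Incoming chunks are appended to a local file that the search engine can read immediately, and the file is shipped to S3 and rotated once it is large enough.

```python
# Simplified sketch of a two-tier approach: append chunks to a local file that is
# immediately searchable, then ship it to an object store once it is large enough.
# Threshold, paths, and bucket name are illustrative assumptions.
import os
import time
import boto3  # pip install boto3

BUCKET = "example-log-archive"
ACTIVE_FILE = "active-segment.log"
ROLL_SIZE = 64 * 1024 * 1024  # 64 MiB: big enough to avoid the small-files problem

s3 = boto3.client("s3")

def append_chunk(chunk: bytes) -> None:
    # Tier 1: append locally so fresh data is searchable within seconds.
    with open(ACTIVE_FILE, "ab") as f:
        f.write(chunk)

    # Tier 2: once the segment is large enough, upload it and start a new one.
    if os.path.getsize(ACTIVE_FILE) >= ROLL_SIZE:
        key = f"segments/{int(time.time())}.log"
        s3.upload_file(ACTIVE_FILE, BUCKET, key)
        os.remove(ACTIVE_FILE)
```

Because segments only move to the object store once they have reached a suitable size, no separate compaction job is needed, while recent data remains readable straight from the local file.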
Log management solutions are designed to handle vast amounts of unstructured data, a task that introduces significant cost and complexity. These solutions must address diverse, often conflicting use cases: providing real-time data for troubleshooting while also managing large datasets for analytics queries. At scale, choosing how to ingest, store, and query data requires careful attention to the trade-offs that impact reliability, performance, cost, and system complexity.
In this post, I explored the key concepts involved in ingesting log data at scale, highlighting the expertise and effort needed to design, implement, and maintain a centralised logging solution. The challenges that companies building their own logging solution face are usually significant. Bronto provides a turnkey solution that is cost-effective. This removes the data coverage issues we often encounter with customers attempting to reduce their logging costs with existing offerings, as well as the cost of implementing and maintaining a home-grown solution.
Subsequent posts will delve into other crucial aspects of these solutions, including storage and search.