This post explores the use of large language models (LLMs) for analyzing log data. To do so, we reproduced part of the An Assessment of ChatGPT on Log Data benchmark, originally conducted in 2023 by Intel researchers Priyanka Mudgal and Rita Wouhaybi.
While that initial benchmark used ChatGPT (GPT-3.5-turbo), our study instead evaluates the AWS Nova Micro model. Our goal is to assess whether more recent models that are smaller and cheaper can match, or even exceed, the performance of GPT-3.5-turbo from a few years ago. The economic aspect of this benchmark is particularly interesting, as Nova Micro’s cost per input token is 14 times lower than that of GPT-3.5-turbo from two years ago.
In the following sections, we will:
- Describe the original benchmark and the datasets it used.
- Present the results of re-running the benchmark with AWS Nova Micro.
- Reflect on the datasets and demonstrate how LLMs can effectively handle other types of log data as well.
Benchmark Setup
The initial benchmark is described in An Assessment of ChatGPT on Log Data. This paper evaluates the performance of ChatGPT (specifically GPT-3.5-turbo) in analyzing log data across a range of tasks. The authors sought to answer ten research questions grouped into four categories:
- Log Parsing & Analytics – Can ChatGPT parse logs and identify errors, root causes, security events, and anomalies? Can it also determine frequently used APIs/services?
- Prediction – Can it predict future log events based on past logs?
- Summarization – Can ChatGPT summarize single and multiple log messages?
- General Capabilities – Can it handle bulk log data, and what message lengths can it process?
Experiments were conducted using datasets from the Loghub dataset collection, which includes 2,000 labeled log messages from various systems (Windows, Linux, mobile, distributed, etc.). The team used a static version of GPT-3.5-turbo (as of March 2023) to ensure consistency and evaluated results manually due to the lack of standard benchmarks for some tasks.
Our experiment was very similar: we reused the methodology described in the An Assessment of ChatGPT on Log Data paper and relied on the same 19 datasets from the Loghub collection. As with the original benchmark:
- Results were evaluated manually by a human.
- The same prompts were used.
Our experiment differs from the original benchmark in the following ways:
- We evaluated responses provided by the AWS Nova Micro model rather than GPT-3.5-turbo.
- We focused on the first three categories mentioned above, i.e. Log Parsing & Analytics, Prediction, and Summarization, and the 7 questions related to these categories. The fourth category, General Capabilities, relates to the amount of input data the model can handle. LLMs have significantly improved on that front: the context window of GPT-3.5-turbo is 16,385 tokens, while AWS Nova Micro's is 128,000 tokens.
- We used the same number of log entries as in the original benchmark, except that where that benchmark considered several values for a question (e.g. 5, 10, then 50), we only considered the maximum value (i.e. 50), as we only focus on cases where the model has the most context available (see the sketch after this list).
- The log entries used may not be exactly the same as in the original benchmark, since the subset of the data used for that benchmark is not published in the paper. Whenever possible, we included data samples with relevant entries (e.g. representing labelled issues).
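For reference, the sketch below shows how one such evaluation call can be made against Nova Micro through the AWS Bedrock Converse API. The prompt wording, the model ID, and the local file name are illustrative assumptions for this post, not the exact artifacts of the original benchmark.

```python
# Minimal sketch of one benchmark-style call, assuming boto3 and access to AWS Bedrock.
# The model ID, prompt wording, and log file name are illustrative assumptions.
import boto3
from itertools import islice

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Take the first 50 entries of a Loghub-style sample (hypothetical local file name).
with open("Linux_2k.log") as f:
    log_entries = [line.rstrip() for line in islice(f, 50)]

prompt = (
    "You are analyzing system log data.\n"
    "Identify any errors in the following log entries and explain their likely root cause.\n\n"
    + "\n".join(log_entries)
)

response = bedrock.converse(
    modelId="amazon.nova-micro-v1:0",  # assumed Nova Micro model ID
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
)

print(response["output"]["message"]["content"][0]["text"])
```

The model's free-text answer is then compared manually against the labelled entries in the dataset, as in the original benchmark.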
In other words, we used the same approach and similar datasets as in the original benchmark, but focused on the following questions:
Evaluation of AWS Nova Micro's Performance
In this section, we present the results we achieved running our benchmark. The table below breaks the results down for each of the 7 questions considered over each of the 19 datasets of the Loghub collection:
In a nutshell, our evaluation confirms the results obtained in An Assessment of ChatGPT on Log Data: like GPT-3.5-turbo, Nova Micro performs well at parsing and summarizing log data. However, other types of analysis and prediction remain challenging for LLMs.
Detecting malicious users, URLs, IPs, and connection statuses appears to be very successful. However, it is difficult to conclude that this is actually the case, since the datasets did not make it possible to identify malicious entries. Whether the model would be able to detect such entries, should they be present in the data, is inconclusive and warrants further investigation.
However, the data indicates that the model does not produce false positives in this case. This is valuable information in itself, considering that it does produce false positives when trying to identify anomalies, for instance.
Finally, it is important to note that AWS Nova Micro's cost per token is 14 times lower than that of GPT-3.5-turbo two years ago.
This benchmark demonstrates that it is now possible to parse and summarize log data in a much more cost-effective way.
Reflection on Datasets
Having access to a curated collection of datasets, like those provided by Loghub, is invaluable. Without such datasets, it would be impossible to perform meaningful comparisons between benchmarks.
At Bronto, we frequently work with various types of log data that are commonly found in real-world environments. These include, for example, CDN logs or web access logs, as well as audit logs, such as AWS CloudTrail. These types of logs are widely used, and large language models (LLMs) typically have a solid understanding of their structure and content.
Take CDN logs as an example. Regardless of the specific CDN technology used, the content of these logs is generally consistent. They typically include information like client IP addresses, HTTP methods, request URLs, domain names, response codes, etc. These logs also often come with some structure, e.g. in JSON format. Structured log data has the huge advantage of associating field names with the values in each entry. These names represent very useful context for the models to rely on.
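To make this concrete, the snippet below builds the kind of synthetic structured CDN log entry we have in mind. The field names (including reqPath, referenced later in this post) are illustrative rather than taken from any specific CDN vendor.

```python
# A synthetic structured CDN log entry, with illustrative field names
# (reqPath is the endpoint field referenced later in this post).
import json

entry = {
    "timestamp": "2024-05-14T09:13:22Z",
    "clientIp": "203.0.113.42",
    "method": "GET",
    "reqPath": "/api/v1/orders",
    "host": "cdn.example.com",
    "status": 503,
    "bytesSent": 512,
    "cacheStatus": "MISS",
    "userAgent": "Mozilla/5.0",
}

# Structured entries are serialized as JSON lines before being placed in the prompt,
# so the field names travel with every value and give the model extra context.
print(json.dumps(entry))
```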
When we use the prompts from questions 2 and 3 of the benchmark, along with the AWS Nova Micro model and synthetic structured CDN log data based on real examples, we observe that the model provides significantly better answers. For instance, for question 2, the model perfectly identifies errors in this type of log data, leveraging the status code field present in the data. This is true even though the status code field never refers to “errors” or any word semantically related to them. Instead, the model is able to associate errors with status codes of 400 and above. The model is also able to describe the root cause associated with these errors, e.g. Client-Side Errors (400), Resource Not Found (404), Internal Server Errors (500), Service Unavailable (503), etc.
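As a sanity check on those answers, the error categories the model names can be reproduced deterministically from the status field alone. The rough mapping below is our own illustration of that check, not part of the original benchmark.

```python
# Rough ground-truth mapping from HTTP status codes to the error categories
# reported by the model; our own illustration, not part of the original benchmark.
def error_category(status: int) -> str | None:
    if status == 404:
        return "Resource Not Found"
    if status == 503:
        return "Service Unavailable"
    if 400 <= status < 500:
        return "Client-Side Error"
    if 500 <= status < 600:
        return "Internal Server Error"
    return None  # 2xx/3xx entries are not errors

statuses = [200, 404, 503, 500, 301, 403]
print([(s, error_category(s)) for s in statuses])
```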
Question 3 is also very well suited to structured CDN or web access logs. In this case, the model identifies API endpoints without any problem, by first identifying the field whose values represent API endpoints (reqPath in the dataset that we used).
However, counting appears to be very challenging for LLMs. For question 3 again, where a count of the most common APIs must be provided, we observe that the provided count is often inaccurate, regardless of the dataset used.
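Counts like these are better computed outside the model. The small sketch below shows how we would derive ground-truth endpoint counts (again assuming JSON-lines data with the illustrative reqPath field), which also makes it easy to spot when the model's figures drift.

```python
# Deterministic ground-truth counts of the most common API endpoints,
# assuming JSON-lines CDN log data with the illustrative reqPath field.
import json
from collections import Counter

def top_endpoints(path: str, n: int = 5) -> list[tuple[str, int]]:
    counts = Counter()
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            counts[entry["reqPath"]] += 1
    return counts.most_common(n)

# Example: compare these exact counts against the figures the model reports.
print(top_endpoints("cdn_logs.jsonl"))  # hypothetical file name
```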
In summary, the type of dataset has a major impact on the model’s ability to generate accurate and relevant responses to the questions posed. In our experience, some of the datasets in Loghub are not very common, and it is unclear whether LLMs have a good understanding of the data they contain. For instance, when asking Nova Micro to provide sample logs for HPC, HealthApp, BGL, and Proxifier, the data it provides does not resemble what is found in the Loghub datasets.
Finally, even though a significant portion of log data is still unstructured, a large amount is now structured. None of the Loghub datasets contain structured data. As shown above, the fields provided by structured logs represent context that can really improve the performance of models.
Conclusion
This post presents a benchmark evaluating LLMs on log data, based on the benchmark performed in 2023 and described in An Assessment of ChatGPT on Log Data. The results we obtain are very similar to those of the original benchmark, with the notable difference that the cost per token of the model used in our benchmark is 14 times lower. This is an important point, since log data is notoriously voluminous, which translates into a large number of tokens when it is provided as input to an LLM.
Finally, when reflecting on the datasets of these benchmarks, it appears that they are particularly challenging for LLMs, as they only contain unstructured data. These datasets also do not appear to be representative of what Bronto users send to their log management solution: web access logs, audit logs, and application logs are far more commonly sent by Bronto users. And even though a significant amount of log data is still unstructured today, more and more of it is structured. Based on our research, we believe that LLMs have the potential to improve production logging systems, particularly when used to analyze the type of common, structured data widely seen in the real world.