This post explores the use of large language models (LLMs) for analyzing log data. To do so, we reproduced part of the An Assessment of ChatGPT on Log Data benchmark, originally conducted in 2023 by Intel researchers Priyanka Mudgal and Rita Wouhaybi.
While that initial benchmark used ChatGPT (GPT-3.5-turbo), our study instead evaluates the AWS Nova Micro model. Our goal is to assess whether more recent models that are smaller and cheaper can match, or even exceed, the performance of GPT-3.5-turbo from a few years ago. The economic aspect of this benchmark is particularly interesting, as Nova Micro’s cost per input token is 14 times lower than that of GPT-3.5-turbo from two years ago.
In the following sections, we will:
- Describe the original benchmark and the datasets it used.
- Present the results of re-running the benchmark with AWS Nova Micro.
- Reflect on the datasets and demonstrate how LLMs can effectively handle other types of log data as well.
Benchmark Setup
The initial benchmark is described in An Assessment of ChatGPT on Log Data. This paper evaluates the performance of ChatGPT (specifically GPT-3.5-turbo) in analyzing log data across a range of tasks. The authors sought to answer ten research questions grouped into four categories:
- Log Parsing & Analytics – Can ChatGPT parse logs and identify errors, root causes, security events, and anomalies? Can it also determine frequently used APIs/services?
- Prediction – Can it predict future log events based on past logs?
- Summarization – Can ChatGPT summarize single and multiple log messages?
- General Capabilities – Can it handle bulk log data, and what message lengths can it process?
Experiments were conducted using datasets from the Loghub dataset collection, which includes 2,000 labeled log messages from various systems (Windows, Linux, mobile, distributed, etc.). The team used a static version of GPT-3.5-turbo (as of March 2023) to ensure consistency and evaluated results manually due to the lack of standard benchmarks for some tasks.
Our experiment was very similar: we reused the methodology described in the An Assessment of ChatGPT on Log Data paper and relied on the same 19 datasets from the Loghub collection. As with the original benchmark:
- Results were evaluated manually by a human.
- The same prompts were used.
Our experiment differs from the original benchmark in the following ways:
- We evaluated responses provided by the AWS Nova Micro model rather than GPT-3.5-turbo.
- We focused on the first three categories mentioned above, i.e. Log Parsing & Analytics, Prediction, and Summarization, and the 7 questions related to these categories. The fourth category, General Capabilities, relates to the amount of input data the model can handle. LLMs have significantly improved on that front: the context window of GPT-3.5-turbo is 16,385 tokens, while AWS Nova Micro's is 128,000 tokens.
- We used the same number of log entries as in the original benchmark, except that where that benchmark considered several values for a question (e.g. 5, 10, then 50), we only considered the maximum value (i.e. 50), as we only focus on cases where the model has the most context available (see the sketch after this list).
- The log entries used may not be exactly the same as in the original benchmark, since the subset of the data used for that benchmark is not published in the paper. Whenever possible, we included data samples with relevant entries (e.g. representing labelled issues).
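For reference, the sketch below shows how one such evaluation call can be made against Nova Micro through the AWS Bedrock Converse API. The prompt wording, the model ID, and the local file name are illustrative assumptions for this post, not the exact artifacts of the original benchmark.

```python
# Minimal sketch of one benchmark-style call, assuming boto3 and access to AWS Bedrock.
# The model ID, prompt wording, and log file name are illustrative assumptions.
import boto3
from itertools import islice

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Take the first 50 entries of a Loghub-style sample (hypothetical local file name).
with open("Linux_2k.log") as f:
    log_entries = [line.rstrip() for line in islice(f, 50)]

prompt = (
    "You are analyzing system log data.\n"
    "Identify any errors in the following log entries and explain their likely root cause.\n\n"
    + "\n".join(log_entries)
)

response = bedrock.converse(
    modelId="amazon.nova-micro-v1:0",  # assumed Nova Micro model ID
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
)

print(response["output"]["message"]["content"][0]["text"])
```

The model's free-text answer is then compared manually against the labelled entries in the dataset, as in the original benchmark.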
In other words, we used the same approach and similar datasets as in the original benchmark, but focused on the following questions:
Evaluation of AWS Nova Micro's Performance
In this section, we present the results we achieved running our benchmark. The table below breaks the results down for each of the 7 questions considered over each of the 19 datasets of the Loghub collection:
In a nutshell, our evaluation confirms the results obtained in An Assessment of ChatGPT on Log Data: like GPT-3.5-turbo, Nova Micro performs well at parsing and summarizing log data. However, other types of analysis and prediction remain challenging for LLMs.
Detecting malicious users, URLs, IPs, and connection statuses appears to be very successful. However, it is difficult to conclude that this is actually the case, since the datasets did not make it possible to identify malicious entries. Whether the model would be able to detect such entries, should they be present in the data, is inconclusive and warrants further investigation.
However, the data indicates that the model does not produce false positives in this case. This is valuable information in itself, considering that it does produce false positives when trying to identify anomalies, for instance.
Finally, it is important to note that AWS Nova Micro's cost per token is 14 times lower than that of GPT-3.5-turbo two years ago.
This benchmark demonstrates that it is now possible to parse and summarize log data in a much more cost-effective way.
Reflection on Datasets
Having access to a curated collection of datasets, like those provided by Loghub, is invaluable. Without such datasets, it would be impossible to perform meaningful comparisons between benchmarks.
At Bronto, we frequently work with various types of log data that are commonly found in real-world environments. These include, for example, CDN logs or web access logs, as well as audit logs, such as AWS CloudTrail. These types of logs are widely used, and large language models (LLMs) typically have a solid understanding of their structure and content.
Take CDN logs as an example. Regardless of the specific CDN technology used, the content of these logs is generally consistent. They typically include information like client IP addresses, HTTP methods, request URLs, domain names, response codes, etc. These logs also often come with some structure, e.g. in JSON format. Structured log data has the huge advantage of associating field names with the values in each entry. These names represent very useful context for the models to rely on.
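To make this concrete, the snippet below builds the kind of synthetic structured CDN log entry we have in mind. The field names (including reqPath, referenced later in this post) are illustrative rather than taken from any specific CDN vendor.

```python
# A synthetic structured CDN log entry, with illustrative field names
# (reqPath is the endpoint field referenced later in this post).
import json

entry = {
    "timestamp": "2024-05-14T09:13:22Z",
    "clientIp": "203.0.113.42",
    "method": "GET",
    "reqPath": "/api/v1/orders",
    "host": "cdn.example.com",
    "status": 503,
    "bytesSent": 512,
    "cacheStatus": "MISS",
    "userAgent": "Mozilla/5.0",
}

# Structured entries are serialized as JSON lines before being placed in the prompt,
# so the field names travel with every value and give the model extra context.
print(json.dumps(entry))
```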
When we use the prompts from questions 2 and 3 of the benchmark, along with the AWS Nova Micro model and synthetic structured CDN log data based on real examples, we observe that the model provides significantly better answers. For instance, for question 2, the model perfectly identifies errors in this type of log data, leveraging the status code field present in the data. This is true even though the status code field never refers to “errors” or any word semantically related to them. Instead, the model is able to associate errors with status codes of 400 and above. The model is also able to describe the root cause associated with these errors, e.g. Client-Side Errors (400), Resource Not Found (404), Internal Server Errors (500), Service Unavailable (503), etc.
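As a sanity check on those answers, the error categories the model names can be reproduced deterministically from the status field alone. The rough mapping below is our own illustration of that check, not part of the original benchmark.

```python
# Rough ground-truth mapping from HTTP status codes to the error categories
# reported by the model; our own illustration, not part of the original benchmark.
def error_category(status: int) -> str | None:
    if status == 404:
        return "Resource Not Found"
    if status == 503:
        return "Service Unavailable"
    if 400 <= status < 500:
        return "Client-Side Error"
    if 500 <= status < 600:
        return "Internal Server Error"
    return None  # 2xx/3xx entries are not errors

statuses = [200, 404, 503, 500, 301, 403]
print([(s, error_category(s)) for s in statuses])
```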
Question 3 is also very well suited to structured CDN or web access logs. In this case, the model identifies API endpoints without any problem, by first identifying the field whose values represent API endpoints (reqPath in the dataset that we used).
However, counting appears to be very challenging for LLMs. For question 3 again, where a count of the most common APIs must be provided, we observe that the provided count is often inaccurate, regardless of the dataset used.
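Counts like these are better computed outside the model. The small sketch below shows how we would derive ground-truth endpoint counts (again assuming JSON-lines data with the illustrative reqPath field), which also makes it easy to spot when the model's figures drift.

```python
# Deterministic ground-truth counts of the most common API endpoints,
# assuming JSON-lines CDN log data with the illustrative reqPath field.
import json
from collections import Counter

def top_endpoints(path: str, n: int = 5) -> list[tuple[str, int]]:
    counts = Counter()
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            counts[entry["reqPath"]] += 1
    return counts.most_common(n)

# Example: compare these exact counts against the figures the model reports.
print(top_endpoints("cdn_logs.jsonl"))  # hypothetical file name
```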
In summary, the type of dataset has a major impact on the model’s ability to generate accurate and relevant responses to the questions posed. In our experience, some of the datasets in Loghub are not very common, and it is unclear whether LLMs have a good understanding of the data they contain. For instance, when asking Nova Micro to provide sample logs for HPC, HealthApp, BGL, and Proxifier, the data it provides does not resemble what is found in the Loghub datasets.
Finally, even though a significant portion of log data is still unstructured, a large amount is now structured. None of the Loghub datasets contain structured data. As shown above, the fields provided by structured logs represent context that can really improve the performance of models.
Conclusion
This post presents a benchmark evaluating LLMs on log data, based on the benchmark performed in 2023 and described in An Assessment of ChatGPT on Log Data. The results we obtain are very similar to those of the original benchmark, with the notable difference that the cost per token of the model used in our benchmark is 14 times lower. This is an important point, since log data is notoriously voluminous, which translates into a large number of tokens when it is provided as input to an LLM.
Finally, when reflecting on the datasets of these benchmarks, it appears that they are particularly challenging for LLMs, as they only contain unstructured data. These datasets also do not appear to be representative of what Bronto users send to their log management solution: web access logs, audit logs, and application logs are far more commonly sent by Bronto users. And even though a significant amount of log data is still unstructured today, more and more of it is structured. Based on our research, we believe that LLMs have the potential to improve production logging systems, particularly when used to analyze the type of common, structured data widely seen in the real world.