Select Page
Streaming deduplication in syslog-ng

Streaming deduplication in syslog-ng

Log volumes are growing 25% year over year, which means they are doubling every three years. Considering that SIEMs and other log processing tools are licensed based on volume, tools and mechanisms to make log storage and processing more efficient are very much sought for.

A typical solution to this problem is the use of a dedicated log management layer, or as it is called these days: a dedicated observability pipeline. Regardless of how you name the solution in place, there are two separate gains of using these systems:

  1. you can make data more valuable by fixing up data problems or enriching data,
  2. you get to choose where the data gets stored (in the SIEM or elsewhere), thus potentially decreasing the volume of data sent to the SIEM.

As you look at the data ingested into the SIEM, you will recognize that not all of that data is displayed in dashboards or used for detecting threats. Nevertheless, organizations still collect and store this data as best practice, because a forensics investigation could potentially use this data, should an incident be discovered later.

While I believe that all data can be made valuable with enough effort, let me zoom in on the volume question.

Simple log deduplication

With something like syslog-ng, you can obviously route specific applications or severity levels somewhere else (like a set of files or an S3 bucket), simply by using filters. In addition to routing non-essential data to a separate log archive, you can also reduce redundancy between messages and combine groups of multi-line logs into single events. Or, you can transform a huge XML-based event into a neater, smaller structure.

Even with all of this in place, you may still get runaway applications sending messages in a tight loop in huge quantities, repeating the same message over and over. The original syslogd had support for suppressing such repeated messages, and syslog-ng has even improved this feature. Here’s a sample message and its suppression that follows it, as produced by syslog-ng:

Jan 23 19:23:10 bzorp sshd[3561]: Failed password for admin from 10.110.2.151 port 9807 ssh2
Jan 23 19:23:20 bzorp sshd: Last message 'Failed password for ' repeated 2 times, suppressed by syslog-ng on bzorp

 

syslog-ng improves the original syslogd functionality by keeping the $HOST / $PROGRAM values intact to make it easier to correlate the repetitions and the original message.

Let me point out that suppression like this does decrease the volume, but at the same time it also loses information. With the example above, you are losing the timestamp of the two subsequent login failure attempts, which might prove useful in a forensics investigation or when training an AI model that uses failed logins as an input.

This kind of suppression is also pretty limited: sometimes the message is not completely the same: events may differ in ways that are not material to your analytics tools, while the representation as a log message would be different. In these cases, the above suppression would not work.

Flexible streaming log deduplication

syslog-ng is a Swiss Army Knife for logs, so obviously there is a more flexible solution in its arsenal: syslog-ng can perform something I call “streaming correlation” using its grouping-by() parser (available since version 3.8.1 from 2016). A grouping-by() parser is very similar to the “GROUP BY” construct in SQL databases, but instead of tables of data, you can apply it to a stream of events. This is usually used to transform a series of events into a combined one, but this can also be used to deduplicate the log stream while ignoring unimportant changes to the message, as discussed in this GitHub thread.

Here is an example with an iptables message parsed by our iptables-parser() which has ${PROTO}, ${SRC}, ${DST} and ${DPT} fields extracted by the time it gets into this processing element:

parser p_dedup {
    grouping-by(
        key("${.iptables.PROTO}/${.iptables.SRC}/${.iptables.DST}/${.iptables.DPT}")
        aggregate(
            value("MESSAGE" "${MESSAGE} REPEAT=$(- $(context-length) 1)")
        )
        timeout(10)
        inject-mode(aggregate-only));
};

This configuration instructs syslog-ng to follow the log stream and “group” all messages that have the same key within a 10 second window. The key contains only proto/srcip/dstip/dstport values and omits srcport which can be considered unimportant when looking at a sequence of connections.

Once the 10 second elapses, syslog-ng reports a single event with the $MESSAGE part changed, so that it includes the number of messages that were considered the same. Do note that you can construct the “aggregate” message quite flexibly. You can

  • change any existing name-value pairs or even add new ones.
  • have repetitions in a dedicated field so it does not change $MESSAGE itself.
  • do aggregations for various fields across the group (using the $(sum) or $(average) template functions for example)

Using grouping-by() while collecting data is a lot more performant that storing the entire data set and then doing the same query from the database. It reduces the amount of data to be ingested and the CPU time required to come up with the same aggregation at search time.

One caveat is that you should probably store the raw data stream into a separate archive and only perform these kind of reductions en-route to your SIEM/analytics/dashboarding system, so that you can access to the unchanged, raw data for forensics investigations or the training of AI models.

In case you would like to play with streaming deduplication and syslog-ng, here’s a complete syslog-ng configuration that I’ve prepared while writing this blog post. If you send an iptables message to TCP port 2000, it would perform deduplication with a 10 second window.

@version: 4.0
@include "scl.conf"

parser p_dedup {
  grouping-by(
    key("${.iptables.PROTO}/${.iptables.SRC}/${.iptables.DST}/${.iptables.DPT}")
    aggregate(
      value("MESSAGE" "${MESSAGE} REPEAT=$(- $(context-length) 1)")
    )
    timeout(10)
    inject-mode(aggregate-only));
};

log {
  source { tcp(port(2000)); };
  parser { iptables-parser(); };
  parser(p_dedup);
  destination { file("deduplicated.log"); };
}

Just start syslog-ng with the config above in the foreground (-d tells syslog-ng to run in debug mode, which you can omit):

$ /usr/sbin/syslog-ng -F -d -f <path/to/config/file

Then post a message to port 2000 using netcat (repeat this potentially a number of times):

$ echo '<5> https: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=127.0.0.1 DST=127.0.0.1 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=63370 DF PROTO=TCP SPT=46006 DPT=443 WINDOW=65495 RES=0x00 SYN URGP=0' | nc -q0 localhost 2000

And you will get this output in deduplicated.log for 6 repetitions of the same message:

Jan 24 10:22:07 localhost https: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=127.0.0.1 DST=127.0.0.1 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=63370 DF PROTO=TCP SPT=46006 DPT=443 WINDOW=65495 RES=0x00 SYN URGP=0 REPEAT=6