Select Page
Streaming deduplication in syslog-ng

Streaming deduplication in syslog-ng

Log volumes are growing 25% year over year, which means they are doubling every three years. Considering that SIEMs and other log processing tools are licensed based on volume, tools and mechanisms to make log storage and processing more efficient are very much sought for.

A typical solution to this problem is the use of a dedicated log management layer, or as it is called these days: a dedicated observability pipeline. Regardless of how you name the solution in place, there are two separate gains of using these systems:

  1. you can make data more valuable by fixing up data problems or enriching data,
  2. you get to choose where the data gets stored (in the SIEM or elsewhere), thus potentially decreasing the volume of data sent to the SIEM.

As you look at the data ingested into the SIEM, you will recognize that not all of that data is displayed in dashboards or used for detecting threats. Nevertheless, organizations still collect and store this data as best practice, because a forensics investigation could potentially use this data, should an incident be discovered later.

While I believe that all data can be made valuable with enough effort, let me zoom in on the volume question.

Simple log deduplication

With something like syslog-ng, you can obviously route specific applications or severity levels somewhere else (like a set of files or an S3 bucket), simply by using filters. In addition to routing non-essential data to a separate log archive, you can also reduce redundancy between messages and combine groups of multi-line logs into single events. Or, you can transform a huge XML-based event into a neater, smaller structure.

Even with all of this in place, you may still get runaway applications sending messages in a tight loop in huge quantities, repeating the same message over and over. The original syslogd had support for suppressing such repeated messages, and syslog-ng has even improved this feature. Here’s a sample message and its suppression that follows it, as produced by syslog-ng:

Jan 23 19:23:10 bzorp sshd[3561]: Failed password for admin from port 9807 ssh2
Jan 23 19:23:20 bzorp sshd: Last message 'Failed password for ' repeated 2 times, suppressed by syslog-ng on bzorp


syslog-ng improves the original syslogd functionality by keeping the $HOST / $PROGRAM values intact to make it easier to correlate the repetitions and the original message.

Let me point out that suppression like this does decrease the volume, but at the same time it also loses information. With the example above, you are losing the timestamp of the two subsequent login failure attempts, which might prove useful in a forensics investigation or when training an AI model that uses failed logins as an input.

This kind of suppression is also pretty limited: sometimes the message is not completely the same: events may differ in ways that are not material to your analytics tools, while the representation as a log message would be different. In these cases, the above suppression would not work.

Flexible streaming log deduplication

syslog-ng is a Swiss Army Knife for logs, so obviously there is a more flexible solution in its arsenal: syslog-ng can perform something I call “streaming correlation” using its grouping-by() parser (available since version 3.8.1 from 2016). A grouping-by() parser is very similar to the “GROUP BY” construct in SQL databases, but instead of tables of data, you can apply it to a stream of events. This is usually used to transform a series of events into a combined one, but this can also be used to deduplicate the log stream while ignoring unimportant changes to the message, as discussed in this GitHub thread.

Here is an example with an iptables message parsed by our iptables-parser() which has ${PROTO}, ${SRC}, ${DST} and ${DPT} fields extracted by the time it gets into this processing element:

parser p_dedup {
            value("MESSAGE" "${MESSAGE} REPEAT=$(- $(context-length) 1)")

This configuration instructs syslog-ng to follow the log stream and “group” all messages that have the same key within a 10 second window. The key contains only proto/srcip/dstip/dstport values and omits srcport which can be considered unimportant when looking at a sequence of connections.

Once the 10 second elapses, syslog-ng reports a single event with the $MESSAGE part changed, so that it includes the number of messages that were considered the same. Do note that you can construct the “aggregate” message quite flexibly. You can

  • change any existing name-value pairs or even add new ones.
  • have repetitions in a dedicated field so it does not change $MESSAGE itself.
  • do aggregations for various fields across the group (using the $(sum) or $(average) template functions for example)

Using grouping-by() while collecting data is a lot more performant that storing the entire data set and then doing the same query from the database. It reduces the amount of data to be ingested and the CPU time required to come up with the same aggregation at search time.

One caveat is that you should probably store the raw data stream into a separate archive and only perform these kind of reductions en-route to your SIEM/analytics/dashboarding system, so that you can access to the unchanged, raw data for forensics investigations or the training of AI models.

In case you would like to play with streaming deduplication and syslog-ng, here’s a complete syslog-ng configuration that I’ve prepared while writing this blog post. If you send an iptables message to TCP port 2000, it would perform deduplication with a 10 second window.

@version: 4.0
@include "scl.conf"

parser p_dedup {
      value("MESSAGE" "${MESSAGE} REPEAT=$(- $(context-length) 1)")

log {
  source { tcp(port(2000)); };
  parser { iptables-parser(); };
  destination { file("deduplicated.log"); };

Just start syslog-ng with the config above in the foreground (-d tells syslog-ng to run in debug mode, which you can omit):

$ /usr/sbin/syslog-ng -F -d -f <path/to/config/file

Then post a message to port 2000 using netcat (repeat this potentially a number of times):

$ echo '<5> https: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC= DST= LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=63370 DF PROTO=TCP SPT=46006 DPT=443 WINDOW=65495 RES=0x00 SYN URGP=0' | nc -q0 localhost 2000

And you will get this output in deduplicated.log for 6 repetitions of the same message:

Jan 24 10:22:07 localhost https: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC= DST= LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=63370 DF PROTO=TCP SPT=46006 DPT=443 WINDOW=65495 RES=0x00 SYN URGP=0 REPEAT=6


syslog-ng 4 improves Python support

syslog-ng 4 improves Python support

It’s been a while since I personally acted as the release manager for a syslog-ng release, the last such release was 3.3.1 back in October 2011. v3.3 was an important milestone, as that was the version that introduced threaded mode and came with a completely revamped core architecture to scale up properly on computers that had multiple CPUs or cores. I released syslog-ng 4.0.1 a couple of weeks ago which brings with it the support for runtime typing, which is a significant conceptual improvement.

Apart from typing, which I have discussed at length already, the release sports important additions and improvements in syslog-ng’s support for Python, which I would like to zoom into a bit in this post.

In case you are not aware, syslog-ng has allowed you to write source and destination drivers, parsers and template functions in Python for a while now. See this post on writing a source in general and this one for writing an HTTP source.

There was one caveat in using Python though: while it was easy to extend an existing configuration and relatively easy to deploy these in a specific environment, syslog-ng lacked the infrastructure to merge such components into syslog-ng itself and expose this functionality as if it was implemented natively. For instance, to use the Python based HTTP source described in the blog post I mentioned above, you needed to write something like this to use the Python based http source:

source s_http {
      options("port", "8081")

As you can see, this syntax is pretty foreign, at least if you compare this to a native driver that would look like this:

source s_http {

A lot simpler, right? Apart from configuration syntax, there was another shortcoming though: Python code usually relies on 3rd party libraries, usually distributed using PyPI and installed using pip. Up to 4.0.0, one needed to take care about these dependencies manually. The http source example above needs you to install the “python3-twisted” package using dnf/apt-get or pip manually and only then would you be able to use it.

These short-comings are all addressed in the 4.0.0 release, so that:

  • 3rd party libraries are automatically managed once you install syslog-ng.
  • you can use native configuration syntax,
  • we can ship Python code as a part of syslog-ng,

Let’s break these down one-by-one.

Managing 3rd party Python dependencies

From now on, syslog-ng automatically creates and populates a Python virtualenv to host such 3rd party dependencies. This virtualenv is located in ${localstatedir}/venv, which expands to /var/lib/syslog-ng/venv normally. The virtualenv is created by a script named syslog-ng-update-virtualenv, which is automatically run at package installation time.

The list of packages that syslog-ng will install into this virtualenv is described by /usr/lib/syslog-ng/python/requirements.txt.

If you want to make further libraries available (for instance because your local configuration needs it), you can simply use pip to install them:

$ /var/lib/syslog-ng/python-venv/bin/pip install <pypi package>

syslog-ng will automatically activate this virtualenv at startup, no need to explicitly activate it before launching syslog-ng.

Using this mechanism, system installed Python packages will not interfere with packages that you need because of a syslog-ng related functionality.

Native configuration syntax for Python based plugins using blocks.

There are two ways of hiding the implementation complexities of a Python based component, in your configuration file:

  • using blocks to wrap the python() low level syntax, described just below
  • using Python based config generators, described in the next section

Blocks have been around for a while, they basically allow you to take a relatively complex configuration snippet and turn it into a more abstract component that can easily be reused. For instance, to allow using this syntax:

source s_http {

and turn it into a python() based source driver, you just need the following block:

block source http(port(8081)) {
          options("port", "`port`") );

The content of the block will be substituted into the configuration, whenever the name of the block is encountered. Parameters in the form of param(value) will be substituted using backticks.

In simple cases, using blocks provides just enough flexibility to hide an implementation detail (e.g. that we used Python as the implementation language) and also hides redundant configuration code.

Blocks are very similar to macros as used in other languages. This term was unfortunately already taken in the syslog-ng context, that’s why it has been named differently.

Blocks are defined in syslog-ng include files, these include files you can store as an “scl” subdirectory of the Python module.

Native configuration syntax for Python based plugins using configuration generators.

Sometimes, blocks are insufficient to properly wrap our desired functionality. Sometimes you need conditionals, in other cases you want to use a more complex mechanism or a template language to generate part of the configuration. That you can do using configuration generators.

Configuration generators have also been around for a while, but until now they were only available using external shell scripts (using the confgen module), or restricted to be used from C, syslog-ng’s base language. The changes in 4.0 allow you to write generators in Python.

Here’s an example:

@version: 4.0
python {

from syslogng import register_config_generator
def generate_foobar(args):
    return "tcp(port(2000))"
# this registers a plugin in the "source" context named "foobar"
# which would invoke the generate_foobar() function when a foobar() source
# reference is encountered.
register_config_generator("source", "foobar", generate_foobar)

log {
    # we are actually calling the generate_foobar() function in this
    # source, passing all parameters as values in the "args" dictionary
    source { foobar(this(is) a(value)); };
    destination { file("logfile"); };

syslog-ng will automatically invoke your generate_foobar() function whenever it finds a “foobar” source driver and then takes the return value for that function and substitutes back to where it was found. Parameters are passed around in the args parameter.

Shipping Python code with syslog-ng

Until now, Python was more of an “extended” configuration language, but with the features described above, it can actually become a language to write native-looking and native-behaving plugins for syslog-ng, therefore it becomes important for us to ship these.

To submit a Python implemented functionality to syslog-ng, just open a PR that places the new Python code into the modules/python-modules/syslogng/modules subdirectory. This will get installed as a part of our syslog-ng-python package. If you have 3rd party dependencies, just include them in the and requirements.txt files.

If you need an example how to use the new Python based facilities, just look at the implementation of our kubernetes() source.