Streaming deduplication in syslog-ng

Log volumes are growing 25% year over year, which means they double every three years. Considering that SIEMs and other log processing tools are licensed based on volume, tools and mechanisms that make log storage and processing more efficient are very much sought after.

A typical solution to this problem is the use of a dedicated log management layer, or as it is called these days: a dedicated observability pipeline. Regardless of what you call the solution, there are two separate gains from using these systems:

  1. you can make data more valuable by fixing up data problems or enriching data,
  2. you get to choose where the data gets stored (in the SIEM or elsewhere), thus potentially decreasing the volume of data sent to the SIEM.

As you look at the data ingested into the SIEM, you will recognize that not all of that data is displayed in dashboards or used for detecting threats. Nevertheless, organizations still collect and store this data as best practice, because a forensics investigation could potentially use this data, should an incident be discovered later.

While I believe that all data can be made valuable with enough effort, let me zoom in on the volume question.

Simple log deduplication

With something like syslog-ng, you can obviously route specific applications or severity levels somewhere else (like a set of files or an S3 bucket), simply by using filters. In addition to routing non-essential data to a separate log archive, you can also reduce redundancy between messages and combine groups of multi-line logs into single events. Or, you can transform a huge XML-based event into a neater, smaller structure.

Even with all of this in place, you may still get runaway applications sending messages in a tight loop, repeating the same message over and over in huge quantities. The original syslogd had support for suppressing such repeated messages, and syslog-ng has even improved this feature. Here is a sample message and the suppression message that follows it, as produced by syslog-ng:

Jan 23 19:23:10 bzorp sshd[3561]: Failed password for admin from 10.110.2.151 port 9807 ssh2
Jan 23 19:23:20 bzorp sshd: Last message 'Failed password for ' repeated 2 times, suppressed by syslog-ng on bzorp

 

syslog-ng improves the original syslogd functionality by keeping the $HOST / $PROGRAM values intact, to make it easier to correlate the repetitions with the original message.
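
For reference, here is a minimal sketch of how this suppression can be enabled on a destination; the suppress() option takes the suppression window in seconds (the file path and the 30 second window are just examples):

destination d_messages {
    # collapse identical consecutive messages arriving within 30 seconds
    # into a single "Last message ... repeated n times" entry
    file("/var/log/messages" suppress(30));
};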

Let me point out that suppression like this does decrease the volume, but at the same time it also loses information. With the example above, you are losing the timestamp of the two subsequent login failure attempts, which might prove useful in a forensics investigation or when training an AI model that uses failed logins as an input.

This kind of suppression is also pretty limited: sometimes the messages are not completely identical. Events may differ in ways that are not material to your analytics tools, yet their representation as log messages still differs, and in these cases the suppression above would not work.

Flexible streaming log deduplication

syslog-ng is a Swiss Army Knife for logs, so obviously there is a more flexible solution in its arsenal: syslog-ng can perform something I call “streaming correlation” using its grouping-by() parser (available since version 3.8.1 from 2016). A grouping-by() parser is very similar to the “GROUP BY” construct in SQL databases, but instead of tables of data, you can apply it to a stream of events. It is usually used to transform a series of events into a combined one, but it can also be used to deduplicate the log stream while ignoring unimportant changes to the message, as discussed in this GitHub thread.

Here is an example with an iptables message parsed by our iptables-parser(), which has already extracted the ${.iptables.PROTO}, ${.iptables.SRC}, ${.iptables.DST} and ${.iptables.DPT} fields by the time the message reaches this processing element:

parser p_dedup {
    grouping-by(
        key("${.iptables.PROTO}/${.iptables.SRC}/${.iptables.DST}/${.iptables.DPT}")
        aggregate(
            value("MESSAGE" "${MESSAGE} REPEAT=$(- $(context-length) 1)")
        )
        timeout(10)
        inject-mode(aggregate-only));
};

This configuration instructs syslog-ng to follow the log stream and “group” all messages that have the same key within a 10 second window. The key contains only the proto/srcip/dstip/dstport values and omits srcport, which can be considered unimportant when looking at a sequence of connections.

Once the 10 seconds elapse, syslog-ng reports a single event with the $MESSAGE part changed so that it includes the number of messages that were considered the same. Do note that you can construct the “aggregate” message quite flexibly. You can

  • change any existing name-value pairs or even add new ones,
  • store the repetition count in a dedicated field so that it does not change $MESSAGE itself (see the sketch after this list),
  • do aggregations of various fields across the group (using the $(sum) or $(average) template functions, for example).
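
For example, a variant of the grouping-by() configuration above can keep $MESSAGE untouched and record the repetition count in a dedicated name-value pair instead; a sketch (the .dedup.repeats field name is just an illustration):

parser p_dedup_field {
    grouping-by(
        key("${.iptables.PROTO}/${.iptables.SRC}/${.iptables.DST}/${.iptables.DPT}")
        aggregate(
            # leave $MESSAGE as-is, store the number of suppressed
            # repetitions in a separate name-value pair
            value(".dedup.repeats" "$(- $(context-length) 1)")
        )
        timeout(10)
        inject-mode(aggregate-only));
};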

Using grouping-by() while collecting data is a lot more performant than storing the entire data set and then running the same query in the database. It reduces both the amount of data to be ingested and the CPU time required to come up with the same aggregation at search time.

One caveat is that you should probably store the raw data stream in a separate archive and only perform these kinds of reductions en route to your SIEM/analytics/dashboarding system, so that you retain access to the unchanged, raw data for forensics investigations or the training of AI models.

In case you would like to play with streaming deduplication and syslog-ng, here is a complete syslog-ng configuration that I prepared while writing this blog post. If you send iptables messages to TCP port 2000, it performs deduplication with a 10 second window.

@version: 4.0
@include "scl.conf"

parser p_dedup {
  grouping-by(
    key("${.iptables.PROTO}/${.iptables.SRC}/${.iptables.DST}/${.iptables.DPT}")
    aggregate(
      value("MESSAGE" "${MESSAGE} REPEAT=$(- $(context-length) 1)")
    )
    timeout(10)
    inject-mode(aggregate-only));
};

log {
  source { tcp(port(2000)); };
  parser { iptables-parser(); };
  parser(p_dedup);
  destination { file("deduplicated.log"); };
}

Just start syslog-ng with the config above in the foreground (-F keeps it in the foreground, -d enables debug mode and can be omitted):

$ /usr/sbin/syslog-ng -F -d -f <path/to/config/file>

Then post a message to port 2000 using netcat (repeat it a few times):

$ echo '<5> https: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=127.0.0.1 DST=127.0.0.1 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=63370 DF PROTO=TCP SPT=46006 DPT=443 WINDOW=65495 RES=0x00 SYN URGP=0' | nc -q0 localhost 2000

And you will get this output in deduplicated.log for 6 repetitions of the same message:

Jan 24 10:22:07 localhost https: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=127.0.0.1 DST=127.0.0.1 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=63370 DF PROTO=TCP SPT=46006 DPT=443 WINDOW=65495 RES=0x00 SYN URGP=0 REPEAT=6

 

syslog-ng 4 improves Python support

It’s been a while since I personally acted as the release manager for a syslog-ng release; the last such release was 3.3.1, back in October 2011. v3.3 was an important milestone, as it was the version that introduced threaded mode and came with a completely revamped core architecture to scale up properly on computers with multiple CPUs or cores. I released syslog-ng 4.0.1 a couple of weeks ago, which brings support for runtime typing, a significant conceptual improvement.

Apart from typing, which I have discussed at length already, the release sports important additions and improvements in syslog-ng’s support for Python, which I would like to zoom in on a bit in this post.

In case you are not aware, syslog-ng has allowed you to write source and destination drivers, parsers and template functions in Python for a while now. See this post on writing a source in general and this one for writing an HTTP source.

There was one caveat in using Python though: while it was easy to extend an existing configuration and relatively easy to deploy such extensions in a specific environment, syslog-ng lacked the infrastructure to merge these components into syslog-ng itself and expose their functionality as if it were implemented natively. For instance, to use the Python based HTTP source described in the blog post mentioned above, you needed to write something like this:

source s_http {
    python(
      class("httpsource_v2.HTTPSource")
      options("port", "8081")
    );
};

As you can see, this syntax is pretty foreign, at least if you compare it to a native driver, which would look like this:

source s_http {
    http(port(8081));
};

A lot simpler, right? Apart from configuration syntax, there was another shortcoming: Python code usually relies on 3rd party libraries, typically distributed via PyPI and installed using pip. Up to 4.0.0, one needed to take care of these dependencies manually. The http source example above requires you to install the “python3-twisted” package using dnf/apt-get or pip, and only then would you be able to use it.

These shortcomings are all addressed in the 4.0.0 release, so that:

  • 3rd party libraries are automatically managed once you install syslog-ng,
  • you can use native configuration syntax,
  • we can ship Python code as part of syslog-ng.

Let’s break these down one-by-one.

Managing 3rd party Python dependencies

From now on, syslog-ng automatically creates and populates a Python virtualenv to host such 3rd party dependencies. This virtualenv is located in ${localstatedir}/python-venv, which normally expands to /var/lib/syslog-ng/python-venv. The virtualenv is created by a script named syslog-ng-update-virtualenv, which is automatically run at package installation time.

The list of packages that syslog-ng will install into this virtualenv is described by /usr/lib/syslog-ng/python/requirements.txt.

If you want to make further libraries available (for instance because your local configuration needs them), you can simply use pip to install them:

$ /var/lib/syslog-ng/python-venv/bin/pip install <pypi package>

syslog-ng will automatically activate this virtualenv at startup; there is no need to explicitly activate it before launching syslog-ng.

Using this mechanism, system-installed Python packages will not interfere with the packages that you need for syslog-ng related functionality.

Native configuration syntax for Python based plugins using blocks

There are two ways of hiding the implementation complexities of a Python based component in your configuration file:

  • using blocks to wrap the python() low level syntax, described just below
  • using Python based config generators, described in the next section

Blocks have been around for a while; they basically allow you to take a relatively complex configuration snippet and turn it into a more abstract component that can easily be reused. For instance, to allow this syntax:

source s_http {
    http(port(8081)); 
};

and turn it into a python() based source driver, you just need the following block:

block source http(port(8081)) {
   python(class("httpsource_v2.HTTPSource")
          options("port", "`port`") );
}

The content of the block is substituted into the configuration wherever the name of the block is encountered. Parameters given in the form param(value) are substituted wherever the block body references them with backticks (`param`).
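
With the block above in place, the native-looking syntax just works, and the port parameter can be overridden per use; a quick sketch:

# uses the default port (8081) defined in the block
source s_http_default { http(); };

# overrides the default; `port` inside the block expands to 9090
source s_http_alt { http(port(9090)); };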

In simple cases, using blocks provides just enough flexibility to hide an implementation detail (e.g. that we used Python as the implementation language) and also hides redundant configuration code.

Blocks are very similar to macros in other languages; that term was unfortunately already taken in the syslog-ng context, which is why the feature is named differently.

Blocks are defined in syslog-ng include files, which you can store in an “scl” subdirectory of the Python module.

Native configuration syntax for Python based plugins using configuration generators

Sometimes, blocks are insufficient to properly wrap the desired functionality: you need conditionals, or you want a more complex mechanism or a template language to generate part of the configuration. You can do that with configuration generators.

Configuration generators have also been around for a while, but until now they were only available via external shell scripts (using the confgen module) or were restricted to C, syslog-ng’s base language. The changes in 4.0 allow you to write generators in Python.

Here’s an example:

@version: 4.0
python {

from syslogng import register_config_generator
def generate_foobar(args):
    print(args)
    return "tcp(port(2000))"
#
# this registers a plugin in the "source" context named "foobar"
# which would invoke the generate_foobar() function when a foobar() source
# reference is encountered.
#
register_config_generator("source", "foobar", generate_foobar)

};
log {
    # we are actually calling the generate_foobar() function in this
    # source, passing all parameters as values in the "args" dictionary
    source { foobar(this(is) a(value)); };
    destination { file("logfile"); };
};


syslog-ng automatically invokes your generate_foobar() function whenever it finds a “foobar” source reference, takes the function’s return value and substitutes it back to where the reference was found. Parameters are passed in the args dictionary.
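
Since the generator is plain Python, it can also branch on the parameters it receives, which covers the conditional use-cases mentioned earlier. Here is a minimal sketch; the mysource name, the mode/port/path options and the assumption that args arrives as a dictionary of option names and string values are illustrative only:

@version: 4.0
python {

from syslogng import register_config_generator

def generate_mysource(args):
    # hypothetical generator: pick the source driver based on a "mode" option
    if args.get("mode") == "network":
        return "tcp(port(" + args.get("port", "2000") + "))"
    return 'file("' + args.get("path", "/var/log/example.log") + '")'

register_config_generator("source", "mysource", generate_mysource)

};

log {
    # expands to tcp(port(2001)) via the generator above
    source { mysource(mode(network) port(2001)); };
    destination { file("generated.log"); };
};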

Shipping Python code with syslog-ng

Until now, Python was more of an “extended” configuration language, but with the features described above it can actually become a language to write native-looking and native-behaving plugins for syslog-ng. Therefore it becomes important for us to be able to ship such plugins as part of syslog-ng itself.

To submit Python implemented functionality to syslog-ng, just open a PR that places the new Python code into the modules/python-modules/syslogng/modules subdirectory. This gets installed as part of our syslog-ng-python package. If you have 3rd party dependencies, just include them in the setup.py and requirements.txt files.

If you need an example of how to use the new Python based facilities, just look at the implementation of our kubernetes() source.
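
Once installed, such a Python-backed driver is used with the same native syntax as any other driver; for instance, a minimal sketch of the kubernetes() source with all options left at their defaults:

source s_k8s {
    # reads container logs and attaches pod related metadata;
    # the driver itself lives under modules/python-modules
    kubernetes();
};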

Rounding up syslog-ng 4 and a practical introduction to typing

syslog-ng 4 is right around the corner, and the work on the topics I listed in this blog post is nearing completion. Instead of a pile of breaking changes, we chose to improve syslog-ng in an evolutionary manner: providing fine grained compatibility with older versions along the way, so that syslog-ng 4 remains a drop-in replacement for any earlier release from the past 15 years.

What evolution means in practice in this context is that features/changes are merged into 3.x releases as they become ready, but they are all hidden behind a feature flag: they come disabled by default, and to enable them, one needs to use `@version: 4.0` at the top of one’s configuration file. This process is detailed in Peter Czanik’s blog post with a couple of real-world examples.

The first set of changes went into 3.36.1, released in March; more followed in 3.37.1 (and a related blog post), released in June; and 3.38.1 is due any day now, with most changes already accumulated on the “master” branch (nightly snapshots).

Hopefully, this is going to be the last 3.x release before 4.x is cut, but this also depends on feedback and issues that we might encounter in this cycle.

This release focused primarily on getting 4.0 ready, and as such it concentrated mostly on finishing up the typing feature. If you have read the linked blog post, you might already be aware that we intend to associate runtime type information with any name-value pair we encounter, so that we can 1) allow type aware comparisons in routing decisions, and 2) reproduce the original types when sending a log message to consumers. I also expect this to be an important building block as we implement more features from our long term objectives (observability, app awareness, user friendliness).

Problems that type aware comparisons attempt to solve

Probably the most important change in 3.38.1 is the introduction of type aware comparisons. Traditionally, syslog-ng has had two kinds of comparison operators, just like shell scripts with the “test” builtin command:

  1. one to compare strings (eq, ne, gt, lt)
  2. one to compare numbers (==, !=, >, <)

Here’s an example:

log {
  source { file("/var/log/apache2/access.log"); };
  parser { apache-accesslog-parser(); };
  if ("${.apache.response}" >= "500") {
    file("/logs/apache-errors");
  };
};

This example shows how to route HTTP 500 and above responses to a separate file. If you look at the if {} statement, you can see that we compare the HTTP response code to 500 and try to capture anything that is 500 or higher. But there is a potential trap here: ${.apache.response} is a string that contains a number, and “500” in syslog-ng 3.x is also a string.

How this comparison is performed depends on the operator that we use: numeric (<, >, ==, !=) or string (lt, gt, eq, ne) focused.

If you look at the example again, you can see that we used the “numeric” operators, which means that the configuration above does the right thing: it converts both “${.apache.response}” and “500” to numbers and compares them numerically.

But then, let’s see this example:

log {
  source { file("/var/log/apache2/access.log"); };
  parser { apache-accesslog-parser(); };
  if ("${.apache.httpversion}" == "1.0") {
    file("/logs/http-10-logs");
  };
};

Again, we are trying to compare a name-value pair against a literal string, this time checking for equality. Although version numbers are not strictly numeric, this one happens to be, and we are again using “==” as the operator.

But this example will do something that is pretty unexpected:

  • we used the numeric operators in the example, so syslog-ng converts both “${.apache.httpversion}” and “1.0” to numbers
  • but the numeric operators only support INTEGERS; floating point numbers are not supported (syslog-ng actually uses the atoi(3) function for this conversion)
  • atoi() only picks up the digits before the “.” in the version number and converts those to an integer, which means “1.0” becomes 1 and “1.1” becomes 1 too!
  • so the comparison above evaluates to TRUE for any value that starts with a 1 followed by a non-digit character,
  • which means that all HTTP versions from 1.0 up to 1.9 end up in the file we designated to hold HTTP/1.0 traffic only.

Not really the expected behaviour. But it becomes worse.

log {
  source { file("/var/log/apache2/access.log"); };
  parser { apache-accesslog-parser(); };
  if ("${.apache.request}" == "/wp-admin/login.php") {
    file("/logs/wordpress-logins");
  };
};

This time, we are trying to filter our data based on a string comparison, and we erroneously used the numeric operator. This is what happens:

  • Neither “${.apache.request}” nor “/wp-admin/login.php” is numeric; they don’t even start with digits.
  • Both values are converted to 0.
  • Zero equals zero, so the filter expression above is always TRUE.

There are other similar cases, the ugliest being the comparison of a name-value pair to an empty string with the numeric operators; the results are completely unexpected.

Type aware comparisons come to the rescue

I have seen numerous cases where someone got the operator wrong when trying to compare/match something in syslog-ng. I feel the issue was never really a user error; rather, we did a poor job of providing a user friendly syntax, and thus we pushed too much responsibility onto those attempting to make use of these features.

But solving these kinds of design mistakes is never easy. Some of our users have figured the current behaviour out already, and we do not want to break their configurations. At the same time, we want to make this easier and more intuitive for new users and new use-cases.

The solution we implemented was to make the numeric operators (==, !=, <, >) do the right thing: based on the types of their arguments, syslog-ng can in most cases infer what the right thing to do is. We took some inspiration from JavaScript (which operates in a similarly string-heavy environment) and implemented more intuitive rules for our previously numeric-only comparisons.

Let’s see our previous examples:

@version: 4.0
log {  
  source { file("/var/log/apache2/access.log"); };
  parser { apache-accesslog-parser(); };
  if ("${.apache.response}" >= 500) {
    file("/logs/apache-errors"); 
  };
};

If you compare this to my previous example, I have removed the quotes from around “500”. In syslog-ng 3.x, the quotes were mandatory; in 4.x, they are not. If you do not use quotes, the literal 500 becomes a numeric literal, and comparing a string to a number compares them numerically (just like JavaScript). We could even improve the Apache parser to make ${.apache.response} a number as it parses its input, but for the comparison to do the right thing, it is enough that one side is numeric.

Next example:

@version: 4.0
log {
  source { file("/var/log/apache2/access.log"); };
  parser { apache-accesslog-parser(); };
  if ("${.apache.httpversion}" == "1.0") {
    file("/logs/http-10-logs");
  };
};

I haven’t changed anything in this example: both “${.apache.httpversion}” and “1.0” are strings, and they are compared as strings. So this time, only HTTP/1.0 is routed to our logfile; 1.1 or even 1.9 is filtered out, as expected. We could use floating point based comparisons if we wanted to, by removing the quotes (just like in the previous example) or by using an explicit type cast:

if ("${.apache.httpversion}" == 1.0)

OR
if (double("${.apache.httpversion}") == "1.0")

A type cast can be applied anywhere we previously used template strings; it applies the given type to the result of the template expansion.
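
For instance, the same cast syntax works in rewrite rules and in filter expressions alike; a sketch using the cast style already shown above (int32 is one of the available cast names, also mentioned later in this post):

# associate an integer type with the extracted response code
rewrite r_typed_response {
    set(int32("${.apache.response}") value(".apache.response"));
};

# or cast at comparison time, inside a filter expression
filter f_server_errors {
    int32("${.apache.response}") >= 500;
};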

And here’s the third example:

@version: 4.0
log {
  source { file("/var/log/apache2/access.log"); };
  parser { apache-accesslog-parser(); };
  if ("${.apache.request}" == "/wp-admin/login.php") { 
    file("/logs/wordpress-logins"); 
  }; 
};

Again, no changes are necessary: both sides are strings, so they are compared as strings. There is no need to use the “eq” operator; one set of operators, plus the occasional explicit type cast, covers all use-cases. For compatibility reasons, the old “string” operators (eq, ne, lt, gt) remain available, but I hope we can forget about them eventually.

Other typing related changes

This section briefly lists the various components that we needed to adapt to typing. These changes happened since 3.36.1 was released but were not explicitly announced in those versions. Let me know if you are interested in any of these topics in more detail; there are probably a couple of blog posts’ worth of content here:

  • type aware comparisons in filter expressions: as detailed above, the previously numeric operators become type aware and the exact comparison performed will be based on types associated with the values that we compare.
  • json-parser() and $(format-json): JSON support is massively improved with the introduction of types. For one, type information is retained across input parsing->transformation->output formatting. JSON lists (arrays) are now supported and are converted to syslog-ng lists, so they can be manipulated using the $(list-*) template functions (see the sketch after this list). There are other important improvements in how we support JSON.
  • set(), groupset(): in any case where we allow the use of templates, support for type-casting was added and the type information is properly promoted.
  • db-parser() type support: db-parser() gets support for type casts: <value> assignments within db-parser() rules can associate types with values using the type-casting syntax, e.g. <value name="foobar">int32($PID)</value>. The “int32” cast associates $foobar with an integer type. db-parser()’s internal parsers (e.g. @NUMBER@) also associate type information with a name-value pair automatically.
  • add-contextual-data() type support: any new name-value pair that is populated using add-contextual-data() will propagate type information, similarly to db-parser().
  • map-value-pairs() type support: propagate type information
  • SQL type support: the sql() driver gained support for types, so that columns with specific types will be stored as those types.
  • template type support: templates can now be cast explicitly to a specific type, and they also propagate type information from macros/template functions and values in the template string
  • value-pairs type support: value-pairs form the backbone of specifying a set of name-value pairs and associated transformations to generate JSON or a key-value pair format. It also gained support for types: the existing type-hinting feature that was already part of value-pairs was adapted and expanded to other parts of syslog-ng.
  • on-disk serialized formats (e.g. disk buffer/logstore): we remain compatible with messages serialized by an earlier version of syslog-ng, and the format we chose remains compatible for “downgrades” as well: even if a new version of syslog-ng serialized a message, an old syslog-ng and its associated tools will be able to read it (sans type information, of course)
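
As a quick illustration of the JSON list support mentioned above, here is a minimal sketch that parses a JSON payload and extracts the first element of an array with a list template function (an input of the form {"list": [5,6,7,8]} is assumed):

@version: 4.0
log {
    source { tcp(port(2000) flags(no-parse)); };
    parser { json-parser(prefix('.json.')); };
    # ${.json.list} holds a syslog-ng list; $(list-head) returns its first element
    destination {
        file("/tmp/first-element.log" template("first=$(list-head ${.json.list})\n"));
    };
};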

syslog-ng 3.37 released

syslog-ng 3.37 has just been released, with packages becoming available on various platforms this week. You can find the detailed release notes on the GitHub releases page; however, I felt this would be a good opportunity to revisit my draft on the syslog-ng long term objectives and how this release builds in that direction.

The Edge: deployment and CI/CD

Being better at the edge means that we need to improve support for use-cases where syslog-ng is deployed directly on the node/server, or close to such nodes or servers. One way to deploy syslog-ng is to use a .deb or .rpm package, but more and more often syslog-ng is used in a container. Our production docker image is based on Debian; creating this image has been a partially manual process, with all the issues that this entails.

With the merge of PRs #4014 and #4003, Attila Szakács automated the entire workflow in a beautiful set of GitHub Actions scripts, so that:

  • Official source and binary packages (for CentOS, Debian, Fedora and Ubuntu) are built automatically, once a syslog-ng release is tagged
  • The production docker image is built and pushed automatically, once the required binary packages are successfully built.

While we have pretty good automated unit and functional tests, we did not test the installation packages themselves. Until now. András Mitzky implemented smoke tests for the packages themselves, covering install, upgrade and start-stop scenarios.

The Edge: Kubernetes

Increasingly, the edge runs on an orchestrated, container based infrastructure, such as Kubernetes. Using syslog-ng in these systems was possible, but required manual integration. With the merge of PR #4015, this works more out of the box; expect another blog post on this in the coming days.

Application awareness

syslog is used as a logging infrastructure serving a wide variety of applications. For these applications, logging is unfortunately not a primary concern; the consequence is that they often produce invalid or incorrect data. To handle these applications well, we need to cater for these issues.

For instance, certain Aruba products use a timestamp like this:

2022-03-10 08:04:08,449

Looking at this, the problem might not even be apparent: it uses a comma to separate seconds from the fractions part.

You might argue that this is not an important problem at all: who needs fractions anyway?

There are two issues with this:

  1. Fractions might be important to some (e.g. for ordering with thousands of logs per second).
  2. It breaks the parsing of the message itself (as the timestamp is embedded in a larger message), causing message related metadata to be extracted incorrectly (e.g. which device the message should be attributed to). This means that your dashboard in a SIEM may miss vital information.

And this is not the only such case; see this pull request for another, similar example.

This is exactly why application awareness is important, fixing these cases means that your log data becomes more usable as a whole.

Usually it is not programming the solution that is difficult here; rather, the difficulty lies in learning that the problem exists in the first place. If you have a similar parsing problem, please let us know by opening a GitHub issue. The past few such problems were submitted to us by the Splunk Connect for Syslog team; thanks for their efforts. By the way, sc4s is great if you want to feed syslog to Splunk, and it uses syslog-ng internally.

On a similar note, we have improved the cisco-parser(), which extracts fields from Cisco gear, and added a parser for MariaDB audit logs. Both of these parsers are part of our app-parser() framework.
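
If you want to try these adapters, a minimal sketch is to run incoming messages through app-parser(), which applies whichever registered application adapter recognizes the message (the port and topic values below are assumptions; check the scl sources for the exact topics in use):

log {
    source { network(port(514)); };
    # tries the registered application adapters (cisco, mariadb-audit, ...)
    # and applies the first one that matches
    parser { app-parser(topic(syslog)); };
    destination { file("/var/log/parsed.log"); };
};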

Others

There are a few other features I find interesting; here is just a short summary:

  • Type support is nearing completion. We added support for types in template expressions, groupset() & map-value-pairs().
  • We improved syslog-ng’s own trace messages: we added the unique message ID (e.g. $RCPTID) as a tag in all message related trace messages, so that you can correlate trace messages to a specific message. We also included type information as a part of the type support effort.
  • We improved handling of list/array like data in this pull request.
  • We extended our set of TLS options by adding support for sigalgs & client-sigalgs.

 

syslog-ng on the long term: a draft on strategic directions

I made a promise a few posts ago that I would use this blog both to collect feedback and to provide information about potential next steps for syslog-ng. In the same post, I also promised that you, the syslog-ng community, would have a chance to steer these directions. Please read on to find out how to do that.

In the past few weeks I performed a round of discussions/interviews with syslog-ng users. I also spent time looking at other products and analyst reports on the market. Based on all this information I’ve come up with a list of potential strategic directions for syslog-ng to tackle. Focusing on these and prioritizing features that fall into one of these directions ensures that syslog-ng indeed moves ahead.

When I performed similar goal setting exercises in my previous CTO role at Balabit, our team followed a process like this:

  1. brainstorming on potential directions,
  2. drafting up a cleaned up conclusion document,
  3. validating that the document is a good summary of the discussion and
  4. validating with customers that the directions are indeed a good summary of what they need.

syslog-ng is an Open Source project, so I wanted to involve the community somehow. Organizing a brainstorming session sounds difficult online (do you know good solutions for this?), so instead I wanted to create an opportunity to discuss my thoughts with the broader community in a way that leads to a useful conclusion. This is the primary intent behind this post.

Once you read the directions below, please think about whether you agree with my choice of directions! Are these indeed the most important things? Have I missed something? Do you have something in mind that should be integrated somehow? Which of the directions do you consider the most important?

Please give your feedback via this form https://forms.gle/xJ2heSHeVb7ZHUHH9, write a comment on the blog, or drop me an email. Thanks.

1. The Edge

syslog-ng has traditionally been used as a tool for log aggregation, i.e. working on the server side. That is why its CPU and memory usage has always been in focus. Being able to consume a million (sometimes millions!) messages a second is important for server use-cases; however, I think that in exchange for this focus, syslog-ng has neglected the other side of the spectrum: the Edge.

The Edge is where log messages are produced by infrastructure and applications and then sent away to a centralized logging system.

syslog-ng tackles the original “syslogd-like” deployment scenarios on the Edge, but lacks the features/documentation that make it easy to deploy in a more modern setting, e.g. as part of a Kubernetes cluster or as part of a cloud-native application.

Apart from the deployment questions, I also consider the Edge important for improving data quality and thus the usefulness of collected log data. I see that in a lot of cases today, log data is collected without associated meta-information, and without that meta-information it becomes very difficult to understand the originating context of the log data, limiting the ability to extract insights and understanding from logs.

These are the kind of features that fall into this bucket, in no particular order:

  • Transport that is transparently carrying metadata as well as log data, plus multi-line messages (this is probably achieved by EWMM already)
  • Kubernetes (container logs, pod related meta information, official image)
  • Document GCP/AWS/Azure deployments, log data enrichment
  • non-Linux support (Windows and other UNIXes)
  • Fetch logs from Software as a Service products
  • etc

2. Cloud Native

The cloud is not just a means to deploy our existing applications to a rented infrastructure. It is a set of engineering practices that make developing applications faster and more reliable. Applications are deployed as a set of microservices, each running in its own container, potentially distributed across a cluster of compute nodes. Components of the application are managed via some kind of container orchestration system, such as Kubernetes.

Being friendly to these new environments is important, as new applications are increasingly using this paradigm.

Features in this category:

  • Container images for production
    • as a logging side-car to collect app logs and transfer them to the centralized logging function or
    • as an application specific, local logging repository (e.g. app specific server)
  • HTTP ingestion API
    • these apps tend to communicate using HTTP, so it is more native to use that even for log ingestion
    • maybe provide compatibility with other aggregation solutions (Elastic, Splunk, etc)
  • Object Storage support
  • Stateless & persistent queueing (kafka?)
  • etc

3. Observability

The term observability has its roots in control theory; however, it is increasingly applied to the operation of IT systems. Being observable in this context means that the IT system provides an in-depth view into its inner behaviour, making it simpler to troubleshoot problems or increase performance. Observability today often implies three distinct types of data: metrics, traces and logs.

I first encountered this term in relation to Prometheus, an Open Source package that collects and organizes application specific metrics in a manner that easily adapts to cloud native, elastic workloads. Traditional monitoring tools (such as Zabbix or Nagios) require top-down, manual configuration, while Prometheus reversed this concept and pushed the responsibility to application authors: applications should expose their important metrics so that application monitoring works “out of the box”. This idea quickly gained momentum, as manually configuring monitoring tools to track automatically scaled application components is pretty much impossible.

Although observability originally comes from the application monitoring space, its basic ideas can be extended to cover traces and logs as well.

Features in this category:

  • Being observable: provide a prometheus exporter so that we can become observable out-of-the-box
  • Interoperate with Observability platforms
    • Loki destination
    • Support for OpenTelemetry (source and destination)
    • convert logs to/from metrics and traces

4. Application awareness

syslog has been a great invention: it has served us for the last 40-45 years and its importance continues into the future. Operating systems, network devices, IoT devices, applications, containers and container orchestration systems can all push their log data to syslog. For some of those, using syslog is the only option.

In a way syslog is the common denominator of all log producing IT systems out there and as such it has become the shared infrastructure to carry logs in a lot of environments.

In my opinion, the success of syslog stems from the simplicity of using it: just send a datagram to port 514 and you are done. However this simplicity is also its biggest limitation: it is under-specified. There have been attempts at standardization (RFC3164 and RFC5424) but these serve more as “conventions” than standards.

The consequence is that incompatible message formats limit the usefulness of log data, once collected in a central repository. I regularly see issues such as:

  • unparseable and partial timestamps
  • missing or incorrect timezone information
  • missing information about the application’s name (e.g. $PROGRAM) or hostname
  • incorrectly framed multi-line messages
  • key=value data that is in a format downstream systems are unable to parse

Sometimes it is not the individual log entry that is the problem, but rather an overly verbose logging format that becomes difficult to work with once you start using it for dashboards/queries:

  • The Linux audit system produces very verbose, multi-line logs about a single OS operation
  • Mail systems emit multiple log entries for a single email transaction, sometimes a separate log entry for each attachment.
  • etc

syslog-ng has always been good at applying various heuristics to properly extract information even from incorrectly formatted syslog messages; however, there are extreme cases where applications omit crucial information or use a syntax so far from the spec that even syslog-ng is unable to parse the data correctly.

Application awareness in this context means the ability to fix up syslog parsing with knowledge of the application that produced the message. It is difficult to craft heuristics that work with all incorrect formats; however, once we identify the application, we can correctly determine what the log message was intended to look like. Fixing these issues before the message hits a consumer (e.g. a SIEM) helps a lot in actually using the data we store.

Being application aware also implies that log routing decisions can become policy aware. “Forward me all the security logs” is a common request from any security department; however, actually doing this is not simple: what should count as “security”? Being application aware makes it possible to classify based on applications instead of individual log messages.

Features in this category:

  • classifying incoming logs per application (e.g. app-parser() and its associated application adapters)
  • fix incoming logs and make them formatted in a way that becomes easier to handle by downstream consumers (timestamps, multi-line messages, etc.)
  • translate incoming logs into a format that a downstream system best understands

5. User friendliness

syslog-ng is a domain specific language for log management. Its performance is a crucial characteristic, but the complexity of the operations performed by syslog-ng, still within the log management layer, has grown tremendously. Making syslog-ng easier to understand, and errors and problems easier to diagnose, is important in order to deal with this complexity. Having first class documentation is also important for succeeding in any of the directions described above.

So, although it is not functionality by itself, I consider user friendliness a top priority for syslog-ng.

Features in this category:

  • syntax improvements can go a long way toward getting a feature adopted. syslog-ng has always been able to do conditional routing of log messages, but the if()/elif()/else syntax went a long way in getting that feature adopted (see the sketch after this list). There are other potential improvements to the syntax that could make reading/writing syslog-ng configurations easier.
  • configuration diagnostics: better location reporting in error messages, warnings, etc.
  • interactive debuggability: as syslog-ng is applied to more complex problems, the related configuration becomes more complex too. Today, you have to launch syslog-ng in the foreground, inject a message and try to follow its operation using the builtin trace messages. Interactive debugging would go a long way in making the writing and testing of these configurations easier.
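
To illustrate the point about if()/elif()/else from the first item above, here is a minimal sketch of conditional routing using that syntax (the paths and filters are just examples):

log {
    source { system(); };
    if (program("sshd")) {
        destination { file("/var/log/routed/ssh.log"); };
    } elif (facility(auth)) {
        destination { file("/var/log/routed/auth.log"); };
    } else {
        destination { file("/var/log/routed/other.log"); };
    };
};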

Those are roughly the directions I have in mind for the future of syslog-ng. If you disagree or have some comments, please provide feedback via the form at: https://forms.gle/xJ2heSHeVb7ZHUHH9

syslog-ng 4 theme: typing

As explained in my previous post, we do have some features already in mind for syslog-ng 4, even though the work on creating a long term set of objectives for the syslog-ng project is not finished yet. One of the themes that I already have working code for is typing.

syslog-ng traditionally assumes that log data, even if it comes in a structured form (like RFC5424 structured data or JSON), is primarily textual in nature. For this reason, name-value pairs in syslog-ng are text values, just like the log message as a whole. The need for typing has come up before though, most notably in cases where we send data to a consumer that supports typing, such as:

  • Elastic, like other similar consumers, uses JSON, and attributes can have non-text types
  • SQL columns have types
  • Riemann metrics can have types

Typing also has an impact on log routing decisions. In a lot of cases, textual comparisons or regexp matches are fine; however, sometimes your routing condition depends on a value being larger or smaller than a numeric value. For example:

log {
   if ("${.apache.bytes}" > "10000") {
      # do something
   }
};

In this case, doing the comparison as text is clearly incorrect: if ${.apache.bytes} were “5”, the condition above would pass, as the string “5” is larger than the string “10000”, which is clearly not the case if we compare them numerically. To allow both numeric and textual comparisons, syslog-ng has two sets of operators: the usual “<”, “==” and “>” do numeric comparisons, while “lt”, “eq” and “gt” do string comparisons. But it is pretty easy to mix those up; even I make that mistake sometimes.

To address both problems, type support is being added to syslog-ng. The change by itself is pretty simple:

  • we add a “type” value associated with each name-value pair of the log message,
  • the value itself continues to be stored internally in their current, text based format,
  • whenever we need type information in a type aware context (e.g. when we format JSON or send a Riemann event), we use this type information
  • whenever we just need the name-value pair as before, in textual context, we would just continue to use the existing string based value

The consequences:

  • type aware consumers (like JSON, Elastic, Riemann, MongoDB, etc.) use type information automatically, with no need for explicit type hints
  • we can implement type aware comparisons, so that syslog-ng does the right comparison, based on types (e.g. like JavaScript).

As always, this is probably easier to understand with examples.

Type aware JSON parsing/reproduction

@version: 4.0
log {
  source { tcp(port(2000) flags(no-parse)); };
  parser { json-parser(prefix('.json.')); };
  destination { file("/tmp/json.out" template("$(format-json .json.* --shift-levels 2)\n")); };
};

This configuration expects JSON payloads, one per line, on TCP port 2000. It parses the JSON and then reformats it using $(format-json). Let’s run this configuration:

$ /sbin/syslog-ng -Fedvtf /etc/syslog-ng/syslog-ng-typing-demo.conf

Let’s send a JSON payload to this syslog-ng instance:

$ echo '{"text": "string", "number": 5, "bool": true, "thisisnull": null, "list": [5,6,7,8]}' | nc -q0 localhost 2000

syslog-ng reports the parsing process in its debug/trace log levels:

[2022-03-03T08:40:56.408225] json-parser message processing started; input='{"text": "string", "number": 5, "bool": true, "thisisnull": null, "list": [5,6,7,8]}', prefix='.json.', marker='(null)', msg='0x7ffff00141c0'
[2022-03-03T08:40:56.408461] Setting value; name='.json.text', value='string', msg='0x7ffff00141c0'
[2022-03-03T08:40:56.408500] Setting value; name='.json.number', value='5', msg='0x7ffff00141c0'
[2022-03-03T08:40:56.408524] Setting value; name='.json.bool', value='true', msg='0x7ffff00141c0'
[2022-03-03T08:40:56.408545] Setting value; name='.json.thisisnull', value='', msg='0x7ffff00141c0'
[2022-03-03T08:40:56.408592] Setting value; name='.json.list', value='5,6,7,8', msg='0x7ffff00141c0'

Note the individual name-value pairs being set as they are extracted from the JSON input. This is then reproduced on the output side:

{
  "thisisnull": null,
  "text": "string",
  "number": 5,
  "list": [
    "5",
    "6",
    "7",
    "8"
  ],
  "bool": true
}

Please note that “number” is numeric and “list” contains a JSON list. One limitation that is still visible here is that list elements are not typed and are always strings when reproduced using $(format-json), because list elements are not name-value pairs.

Associate type information with name-value pairs

It is not just JSON parsing that can set types for name-value pairs; rewrite rules and db-parser() can also set them. In rewrite rules, set() can now take a type hint, and that type hint gets associated with the value as its type:

#this makes $PID numeric
rewrite { set(int("$PID") value("PID")); };

Also, db-parser() sets type information depending on which parser was used to extract the specific field. For instance, @NUMBER@ extracts an integer.

Type information returned by macros, templates and template functions

Template functions will be able to return a type, depending on the function they perform. For instance, list handling functions like $(list-slice) return a list, while numerical functions like $(+) return numbers. Likewise, some macros are also being annotated with their types.

Template expressions as a whole also become typed: whenever we use a “simple” template expression (e.g. one with just a single ‘$’ reference, like “$PID”), the type of the template is inferred automatically and that type is propagated. If the inferred type is not correct, you can always use type hints to “cast” the template expression to some other type.

When does it become available? When I can try it?

Since the typing behaviour has the potential of changing the output in certain ways (e.g. producing a numeric value where a string was used before), we are not turning this feature on automatically. As long as we are in the 3.x release train, it stays disabled, even as parts of it are merged. You can evaluate the feature by setting your config version (the @version line at the top of the config file) to 4.0, as shown in the example config above.

Then, as we release 4.0, the typing feature will be enabled by default for any configuration that uses @version: 4.0.

Most of the feature is already implemented, but not yet merged to mainline; there are open PRs on GitHub. 3.36 is expected to contain the first batch (e.g. the JSON parser pieces), but not the complete change. I expect the changes to land in mainline in one or two more release cycles, i.e. by the end of April or the end of June.

Stay tuned!