syslog-ng on the long term: a draft on strategic directions
I made a promise some posts ago that I would use this blog both for collecting feedback and to provide information about potential next steps ahead of syslog-ng. In the same post, I also promised that you, the syslog-ng community, would have a chance to steer these directions. Please read on to find out how to do that.
In the past few weeks I performed a round of discussions/interviews with syslog-ng users. I also spent time looking at other products and analyst reports on the market. Based on all this information I’ve come up with a list of potential strategic directions for syslog-ng to tackle. Focusing on these and prioritizing features that fall into one of these directions ensures that syslog-ng indeed moves ahead.
When I performed similar goal setting exercises in my previous CTO role at Balabit, our team made something similar:
- brainstorming on potential directions,
- drafting up a cleaned up conclusion document,
- validating that the document is a good summary of the discussion and
- validating with customers that it is indeed a good summary of what they need.
syslog-ng is an Open Source project, so I wanted to involve the community. Organizing a brainstorming session online sounds difficult (do you know good solutions for this?), so instead I wanted to create an opportunity to discuss my thoughts with the broader community in a way that leads to a useful conclusion. This is the primary intent behind this post.
Once you have read the directions below, please consider whether you agree with my choice! Are these indeed the most important things? Have I missed something? Do you have something in mind that should be integrated somehow? Which of the directions do you consider the most important?
Please give your feedback via this form https://forms.gle/xJ2heSHeVb7ZHUHH9, write a comment on the blog or drop me an email. Thanks.
1. The Edge
syslog-ng has traditionally been used as a tool for log aggregation, i.e. working on the server side. That’s why its CPU and memory usage has always been in focus. Being able to consume a million (sometimes millions!) of messages a second is important for server use-cases. However, I think that in exchange for this focus, syslog-ng has neglected the other side of the spectrum: the Edge.
The Edge is where log messages are produced by infrastructure and applications and then sent away to a centralized logging system.
syslog-ng tackles the original “syslogd-like” deployment scenarios on the Edge, but lacks features and documentation that would make it easy to deploy in a more modern setting, e.g. as a part of a Kubernetes cluster or as a part of a cloud-native application.
Apart from the deployment questions, I also consider the Edge important for improving data quality, and thus the usefulness of collected log data. I see that in a lot of cases today, log data is collected without associated meta-information. Without that meta-information it becomes very difficult to understand the originating context of the log data, which limits our ability to extract insights and understanding from logs.
These are the kind of features that fall into this bucket, in no particular order:
- Transport that is transparently carrying metadata as well as log data, plus multi-line messages (this is probably achieved by EWMM already)
- Kubernetes (container logs, pod related meta information, official image)
- Document GCP/AWS/Azure deployments, log data enrichment
- non-Linux support (Windows and other UNIXes)
- Fetch logs from Software as a Service products
2. Cloud Native
The cloud is not just a means to deploy our existing applications to a rented infrastructure. It is a set of engineering practices that make developing applications faster and more reliable. Applications are deployed as a set of microservices, each running in its own container, potentially distributed across a cluster of compute nodes. The components of these applications are managed via some kind of container orchestration system, such as Kubernetes.
Being friendly to these new environments is important, as new applications are increasingly using this paradigm.
Features in this category:
- Container images for production
  - as a logging side-car to collect app logs and transfer them to the centralized logging function or
  - as an application specific, local logging repository (e.g. app specific server)
- HTTP ingestion API
  - these apps tend to communicate using HTTP, so it is more native to use that even for log ingestion
  - maybe provide compatibility with other aggregation solutions (Elastic, Splunk, etc)
- Object Storage support
- Stateless & persistent queueing (kafka?)
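To make the HTTP ingestion idea a bit more concrete, here is a minimal sketch in Python (not syslog-ng code) of an endpoint that accepts a batch of log records over POST. The `/ingest` path and the one-JSON-object-per-line body format are assumptions chosen for illustration; they are not an existing syslog-ng API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_batch(body: bytes) -> list:
    """Split a newline-delimited JSON body into individual log records."""
    records = []
    for line in body.decode("utf-8").splitlines():
        line = line.strip()
        if line:  # tolerate blank lines between records
            records.append(json.loads(line))
    return records


class IngestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/ingest":  # hypothetical endpoint path
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            records = parse_batch(self.rfile.read(length))
        except ValueError:  # covers both bad UTF-8 and bad JSON
            self.send_error(400, "body must be newline-delimited JSON")
            return
        for record in records:
            print(record)  # stand-in for handing the record to the log pipeline
        self.send_response(204)
        self.end_headers()


# To try it locally:
#   HTTPServer(("127.0.0.1", 8080), IngestHandler).serve_forever()
```

A client would then POST one JSON object per line to `http://127.0.0.1:8080/ingest`. Compatibility with an existing ingestion protocol would mean matching that product's paths and body format instead of the made-up ones used here.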
3. Observability
The term observability has its roots in control theory, however it is increasingly applied to the operations of IT systems. Being observable in this context means that the IT system provides an in-depth view into its inner behaviours, making it simpler to troubleshoot problems or increase performance. Observability today often implies three distinct types of data: metrics, traces and logs.
I originally met this term in relation to Prometheus, an Open Source package that collects and organizes application specific metrics in a manner that easily adapts to cloud native, elastic workloads. Traditional monitoring tools (such as Zabbix or Nagios) require a top-down, manual configuration, while Prometheus reversed this concept and pushed this responsibility to application authors. Applications should expose their important metrics so that application monitoring works “out-of-the-box”. This idea quickly gained momentum, as manually configuring monitoring tools to adapt to automatically scaled application components is pretty much impossible.
Although observability originally comes from the application monitoring space, its basic ideas can be extended to cover traces and logs as well.
Features in this category:
- Being observable: provide a prometheus exporter so that we can become observable out-of-the-box
- Interoperate with Observability platforms
  - Loki destination
  - Support for OpenTelemetry (source and destination)
- convert logs to metrics/traces and vice versa
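As an illustration of what “exposing metrics” means in practice, the sketch below renders a set of counters in the Prometheus text exposition format, which is what a scraper reads from a metrics endpoint. It is written in Python for brevity; the metric name `logs_processed_total` and the `destination` label are made up for the example and are not actual syslog-ng statistics.

```python
def render_metrics(counters: dict) -> str:
    """Render counters in the Prometheus text exposition format.

    `counters` maps a metric name to a dict that maps a tuple of
    (label, value) pairs to the current counter value.
    """
    lines = []
    for name in sorted(counters):
        lines.append(f"# TYPE {name} counter")
        for labels, value in sorted(counters[name].items()):
            # Note: real label values need escaping of backslashes, quotes and
            # newlines; this sketch assumes plain alphanumeric values.
            label_text = ",".join(f'{key}="{val}"' for key, val in labels)
            lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counters, one sample per destination
example = {
    "logs_processed_total": {
        (("destination", "loki"),): 1200,
        (("destination", "file"),): 3400,
    },
}
print(render_metrics(example), end="")
```

Serving this text over HTTP is all an exporter needs to do; the monitoring side discovers and scrapes it without per-application configuration.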
4. Application awareness
syslog has been a great invention: it has served us for the last 40-45 years and its importance continues into the future. Operating systems, network devices, IoT, applications, containers, container orchestration systems can all push their log data to syslog. For some of those, using syslog is the only option.
In a way syslog is the common denominator of all log producing IT systems out there and as such it has become the shared infrastructure to carry logs in a lot of environments.
In my opinion, the success of syslog stems from the simplicity of using it: just send a datagram to port 514 and you are done. However, this simplicity is also its biggest limitation: it is under-specified. There have been attempts at standardization (RFC 3164 and RFC 5424) but these serve more as “conventions” than standards.
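To show just how little is involved, here is a sketch in Python that builds an RFC 3164 style message and sends it as a single UDP datagram. The host name, program name and message text are placeholders, and the server address is an assumption (a collector listening on localhost).

```python
import socket
from datetime import datetime

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def format_bsd_syslog(facility: int, severity: int, host: str, program: str,
                      message: str, when: datetime) -> bytes:
    """Build an RFC 3164 style line: <PRI>TIMESTAMP HOST TAG: MSG."""
    pri = facility * 8 + severity
    # The timestamp pads single-digit days with a space and carries
    # neither a year nor a timezone.
    timestamp = f"{MONTHS[when.month - 1]} {when.day:2d} {when:%H:%M:%S}"
    return f"<{pri}>{timestamp} {host} {program}: {message}".encode("utf-8")


def send_syslog(packet: bytes, server: str = "127.0.0.1", port: int = 514) -> None:
    """Fire-and-forget: one UDP datagram, no acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (server, port))


# facility 4 (auth) and severity 6 (info) give a priority of 38
packet = format_bsd_syslog(4, 6, "web01", "sshd", "session opened",
                           datetime(2021, 3, 5, 9, 7, 2))
print(packet)
```

Nothing in this exchange forces the sender to get the fields right, which is exactly where the trouble starts.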
The consequence is that incompatible message formats limit the usefulness of log data, once collected in a central repository. I regularly see issues such as:
- unparseable and partial timestamps
- missing or incorrect timezone information
- missing information about the application’s name (e.g. $PROGRAM) or hostname
- incorrectly framed multi-line messages
- key=value data that is in a format downstream systems are unable to parse
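The timestamp problems are easy to demonstrate. An RFC 3164 style timestamp has no year and no timezone, so the receiver has to assume both. The Python sketch below makes those assumptions explicit parameters (the function and parameter names are mine) and shows that the same text maps to two different instants depending on the guess.

```python
from datetime import datetime, timedelta, timezone


def parse_bsd_timestamp(text: str, assumed_year: int,
                        assumed_tz: timezone) -> datetime:
    """Parse 'Mmm dd hh:mm:ss', filling in the year and timezone it lacks."""
    # Prefix the assumed year so the parser never has to pick one itself.
    parsed = datetime.strptime(f"{assumed_year} {text}", "%Y %b %d %H:%M:%S")
    return parsed.replace(tzinfo=assumed_tz)


stamp = "Mar  5 09:07:02"
as_utc = parse_bsd_timestamp(stamp, 2021, timezone.utc)
as_utc_plus_2 = parse_bsd_timestamp(stamp, 2021, timezone(timedelta(hours=2)))

# Same text, two different points in time
print(as_utc - as_utc_plus_2)
```

If the sender and the receiver sit in different timezones and nobody records which one applies, every event from that sender is shifted by the difference.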
Sometimes it’s not the individual log entry that is the problem, but rather an overly verbose logging format that becomes difficult to work with once you start using it for dashboards/queries:
- The Linux audit system produces very verbose, multi-line logs about a single OS operation
- Mail systems emit multiple log entries for a single email transaction, sometimes a separate log entry for each attachment.
syslog-ng has always been good at using heuristics to properly extract information even from incorrectly formatted syslog messages. However, there are extreme cases where applications omit crucial information or use a syntax so far away from the spec that even syslog-ng is unable to parse the data correctly.
Application awareness in this context means the ability to fix up syslog parsing using knowledge of the application that produced the message. It is difficult to craft heuristics that work with all incorrect formats, however once we start by identifying the application, we can correctly determine what the log message was intended to look like. Fixing these issues before the message hits a consumer (e.g. SIEM) helps a lot in actually using the data we store.
Being application aware also implies that log routing decisions can become policy aware. “Forward me all the security logs” is a common request from any security department. However, actually doing this is not simple: what should count as “security”? Being application aware means that it becomes possible to classify based on applications instead of individual log messages.
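The idea can be sketched as a list of per-application adapters, each pairing a signature that identifies the application with a fix-up function. The example is in Python rather than syslog-ng configuration, and the “billing” application and its `BILL|timestamp|text` format are entirely hypothetical.

```python
import re

# Hypothetical application: "billing" logs "BILL|<ISO timestamp>|<text>" and
# never fills in the syslog program name.
BILLING = re.compile(r"^BILL\|(?P<timestamp>[^|]+)\|(?P<text>.*)$")


def fix_billing(record: dict, match) -> dict:
    """Restore the fields the application left out of the syslog header."""
    fixed = dict(record)
    fixed["program"] = "billing"
    fixed["timestamp"] = match.group("timestamp")
    fixed["message"] = match.group("text")
    return fixed


# Each adapter pairs a signature that identifies the application with its fix-up
ADAPTERS = [(BILLING, fix_billing)]


def apply_adapters(record: dict) -> dict:
    for signature, fix in ADAPTERS:
        match = signature.match(record["message"])
        if match:
            return fix(record, match)
    return record  # unknown application: pass through untouched


raw = {"program": "", "message": "BILL|2021-03-05T09:07:02+01:00|invoice 42 created"}
print(apply_adapters(raw))
```

The generic parser stays simple, and the knowledge about each misbehaving application lives in one small, testable place.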
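A rough sketch of what classification per application could look like, again in Python: a policy table assigns classes to applications, and each destination subscribes to the classes it wants. The table contents and the destination names are assumptions for illustration only.

```python
# Hypothetical policy table: which applications belong to which class
APP_CLASSES = {
    "sshd": {"security"},
    "sudo": {"security"},
    "nginx": {"web"},
}


def destinations_for(record: dict, subscriptions: dict) -> list:
    """Return the destinations whose subscribed classes match the record's application."""
    classes = APP_CLASSES.get(record.get("program", ""), set())
    return sorted(dest for dest, wanted in subscriptions.items()
                  if classes & wanted)


subscriptions = {"siem": {"security"}, "web-dashboard": {"web"}}
print(destinations_for({"program": "sshd", "message": "session opened"}, subscriptions))
```

The security department's request then becomes a one-line subscription, and the debate about what counts as “security” is settled once, in the policy table.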
Features in this category:
- classifying incoming logs per application (e.g. app-parser() and its associated application adapters)
- fix incoming logs and make them formatted in a way that becomes easier to handle by downstream consumers (timestamps, multi-line messages, etc.)
- translate incoming logs into a format that a downstream system best understands
5. User friendliness
syslog-ng is a domain specific language for log management. Its performance is a crucial characteristic, but the complexity of the operations performed by syslog-ng, still within the log management layer, has grown tremendously. Making syslog-ng easier to understand, and its errors and problems easier to diagnose, is important in order to deal with this complexity. Having first class documentation is also important for it to succeed in any of the directions described above.
So although it is not functionality in itself, I consider User friendliness a top priority for syslog-ng.
Features in this category:
- syntax improvements can go a long way toward getting a feature adopted. syslog-ng has always been able to do conditional routing of log messages, however if()/elif()/else went a long way in getting it adopted. There are other potential improvements in the syntax that could make reading/writing syslog-ng configurations easier.
- configuration diagnostics: better location reporting in error messages, warnings, etc.
- interactive debuggability: as syslog-ng is applied to more complex problems, the related configuration becomes more complex too. Today, you have to launch syslog-ng in the foreground, inject a message and try to follow its operations using the builtin trace messages. Interactive debugging would go a long way toward making such configurations easier to write and test.
Those are roughly the directions I have in mind for the future of syslog-ng. If you disagree or have some comments, please provide feedback via the form at: https://forms.gle/xJ2heSHeVb7ZHUHH9